
Transcript of Online System for Grid Resource Monitoring


Online System for Grid Resource Monitoring and Machine Learning-Based Prediction

Liang Hu, Xi-Long Che, Member, IEEE, and Si-Qing Zheng, Senior Member, IEEE

Abstract—Resource allocation and job scheduling are the core functions of grid computing. These functions are based on adequate information of available resources. Timely acquisition of resource status information is of great importance in ensuring the overall performance of grid computing. This work aims at building a distributed system for grid resource monitoring and prediction. In this paper, we present the design and evaluation of a system architecture for grid resource monitoring and prediction. We discuss the key issues for system implementation, including machine learning-based methodologies for modeling and optimization of resource prediction models. Evaluations are performed on a prototype system. Our experimental results indicate that the efficiency and accuracy of our system meet the demands of an online system for grid resource monitoring and prediction.

Index Terms—Grid resource, monitoring and prediction, neural network, support vector machine, genetic algorithm, particle swarm optimization.


1 INTRODUCTION

Grid computing removes the limitations that exist in traditional shared computing environments and has become a leading trend in distributed computing systems. It aggregates heterogeneous resources distributed across the Internet, regardless of differences between resources such as platform, hardware, software, architecture, language, and geographical location. Such resources, which include computing, storage, data, and communication bandwidth resources, among others, are combined dynamically to form high-performance computing capability for solving problems in large-scale applications. Dynamically sharing resources gives rise to resource contention. One of the challenging problems is deciding the destination nodes where the tasks of a grid application are to be executed. From the perspective of system architecture, resource allocation and job scheduling are the most crucial functions of grid computing. These functions are based on adequate information of available resources. Thus, timely acquiring resource status information is of great importance in ensuring the overall performance of grid computing [1].

There are mainly two mechanisms for acquiring information about grid resources: grid resource monitoring and grid resource prediction. Grid resource state monitoring is concerned with the running state, distribution, load, and malfunction of resources in a grid system by means of monitoring strategies. Grid resource state prediction focuses on the variation trend and running track of resources in a grid system by means of modeling and analyzing historical monitoring data. Historical information generated by monitoring and future variation generated by prediction are combined to feed a grid system for analyzing performance, eliminating bottlenecks, diagnosing faults, and maintaining dynamic load balancing, and thus to help grid users obtain desired computing results by efficiently utilizing system resources in terms of minimized cost, maximized performance, or trade-offs between cost and performance. To reduce overhead, the goal of designing a grid resource monitoring and prediction system is to achieve seamless fusion between grid technologies and efficient resource monitoring and prediction strategies.

Resource monitoring is a basic function in most computing systems. Along with grid development, monitoring tools have been evolving to support grid computing, such as those developed in the PAPI project [2], Iperf project [3], Hawkeye project [4], and Ganglia project [5]. In addition, some projects have designed a distributed monitoring module of their own, such as the Grid Monitoring Architecture (GMA) project [6] and the Autopilot project [7]. The monitoring techniques employed by such projects are partly compatible with the grid environment and, thus, fit for achieving grid resource monitoring.

Resource monitoring alone, however, can only support instantaneous resource information acquisition. It cannot generalize the dynamic variation of resources. Resource state prediction is necessary to fill this gap. Typical previous prediction systems, such as NWS [8] and RPS [9], can provide both monitoring and prediction functions. However, dynamic features of grid resources were not taken into consideration in these design frameworks. Some efforts, such as the Collectors of Resource Information (CORI) project [10] and Dinda's research [11], were devoted to integrating a prediction tool into a system as an add-on component, but the integration of component systems was realized by building a message passing interface. Grid middlewares, such as the ATOP-Grid (Adaptive Time/Space Sharing through Over Partitioning) project [12] and the Grid Harvest Service (GHS) project [13], were developed to include a prediction component. Nevertheless, these projects are usually restricted to certain applications. In summary, previous approaches have the limitation of being unable to


. L. Hu and X.-L. Che are with the College of Computer Science and Technology, Jilin University, No. 2699, QianJin Street, Changchun 130012, China. E-mail: {hul, chexilong}@jlu.edu.cn.

. S.-Q. Zheng is with the Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083. E-mail: [email protected].

Manuscript received 16 May 2010; revised 4 Jan. 2011; accepted 15 Feb. 2011; published online 17 Mar. 2011. Recommended for acceptance by K. Li. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-2010-05-0295. Digital Object Identifier no. 10.1109/TPDS.2011.108.


achieve seamless fusion of various components and overall simplification of system structure using a universal scheme.

This paper reports our effort aimed at building a distributed system for grid resource monitoring and prediction. We first outline the main design principles of our system. Then, we present our overall system architecture design, which seamlessly integrates various cooperating components to achieve high performance. We discuss the key issues in machine learning-based prediction, and justify our decisions by comparative studies through extensive simulations. We present a new optimization algorithm called Parallel Hybrid Particle Swarm Optimization (PH-PSO), and show its effectiveness. We then discuss the implementation and performance evaluation of a prototype system.

The rest of this paper is organized as follows: Section 2 gives the problem statement of grid resource monitoring and prediction. Section 3 provides the overall system architecture based on the defined design principles. Section 4 discusses the key issues for building prediction components. Section 5 gives a description of the proposed optimization algorithm. Section 6 explains the prototype system and evaluates its performance and overhead. Section 7 closes the paper with conclusions as well as indications of future work.

2 PROBLEM STATEMENT

Suppose that a computing grid consists of $n$ nodes. Without loss of generality, assume that each node $i$ has $k$ resource elements $r_{i,j}$, $1 \le j \le k$, $1 \le i \le n$, which could be host load, bandwidth/latency to a certain destination, available memory, etc. The state of $r_{i,j}$ at time $t$ is denoted by $s_{i,j}(t)$. The state $GS(t)$ of the entire grid is represented by a matrix as follows:

$$GS(t) = \begin{bmatrix} s_{1,1}(t) & \cdots & s_{1,k}(t) \\ \cdots & \cdots & \cdots \\ s_{n,1}(t) & \cdots & s_{n,k}(t) \end{bmatrix}. \qquad (1)$$

The monitoring and prediction of grid resources are realized by monitoring and predicting the state of each concerned resource element in the matrix $GS(t)$. A program that generates resource performance data $s_{i,j}(t)$ with timestamps is called a resource sensor. A program that predicts resource performance data $s_{i,j}(t)$ with timestamps is called a prediction model.

Let $S^- = (s(1), s(2), \ldots, s(t))$ represent the history set generated by a resource sensor, and $S^+ = (s(t+1), s(t+2), \ldots)$ represent the future set generated by a prediction model. Then, any mapping from $S^-$ to $S^+$ is a prediction function, and grid resource state prediction is a kind of regression procedure [14].

What should be emphasized in our research is that we focus on multi-step-ahead rather than one-step-ahead prediction, as in

$$f : s(t+q) = f(s(t), s(t-1), s(t-2), \ldots, s(t-m+1)), \qquad (2)$$

where $m$ is the number of input features of the prediction model and $q$ is the prediction step. The prediction pattern is schematically shown in Fig. 1. A historical data set is divided into three parts: training, validation, and test sets. The training set is used to build the prediction model, which is optimized using the validation set and evaluated using the test set. The model takes historical data as input and generates predictions of future variation. Our research goal is to design a distributed system that seamlessly integrates various cooperating components to achieve high performance in grid resource monitoring and prediction.

3 SYSTEM ARCHITECTURE DESIGN

In this section, we present the architecture of our grid resource monitoring and prediction system. We first introduce the principles used in our design. Then, we illustrate the service distribution and work flow of our system in detail.

3.1 System Design Principles

A computing grid is a complex distributed system. Embedded in the grid, its resource monitoring and prediction system is also a distributed system that dynamically processes grid resource state information. In what follows, we enumerate several features of such a system. Attaining each of these features serves as a design principle for our system architecture.

Responsiveness and robustness. Since grid resource states vary dynamically, the information monitored or predicted has to be updated in a timely manner to guarantee online reflection of resource conditions. In our system, resource sensors and prediction models are executed periodically to generate up-to-date information for users. Function independence and a starlike distribution are introduced in our system, so monitoring and prediction components can continue to work even if some nodes are down.

Modularity and extensibility. Modularity and extensibility are closely related to each other. To achieve tight cohesion as well as loose coupling, the components embedded in our system have independent functions and can be integrated into most grid computing environments as an independent subsystem. Besides, information generated or passed is represented in XML format in order to support new resource types and interact with other components.

Efficiency. Executing jobs is the fundamental function of a grid system, so embedded monitoring or prediction components should minimize overhead to guarantee the grid's normal service. We deploy resource sensors on computing nodes since this is unavoidable; they run and sleep dynamically to reduce overhead. Other components are deployed outside the computing nodes to avoid extra overhead.

Transparency. Grid users do not need to traverse all the nodes or possess grid expertise to get information. We design a uniform and friendly interface component for accessing the information monitored or predicted.

3.2 Service Distribution

Considering the heterogeneous and dynamic characteristics of a computing grid, the resource monitoring and prediction


Fig. 1. Prediction pattern.


system has a distributed service structure. Based on the intended system features, we propose to build the whole system from two subsystems: a resource monitoring subsystem inside the computing environment, and a resource state prediction subsystem outside the computing environment. Most computing grid systems maintain a service container for hosting grid jobs; such a container should be reused for seamless fusion between a grid environment and our system. Therefore, we design a series of supporting services: monitoring service, prediction service, evaluation service, and information service. These services are deployed on the service containers of distributed nodes, and all the functions are realized through dynamic collaboration among them. Besides, the resource information is managed using a hierarchical structure. Fig. 2 presents the overall structure designed. In what follows, we explain each service in detail.

Monitoring service. The monitoring service is deployed on each computing resource node. It manages resource sensors and generates resource monitoring data. Following a monitoring request customized by a grid user, the monitoring service enables or disables a given resource sensor dynamically.

Prediction service and evaluation service. In order to ensure the responsiveness and robustness of the prediction subsystem, a symmetrical starlike structure is adopted. A prediction service and an evaluation service are deployed on each prediction node. Corresponding to a prediction request customized by a grid user, one prediction service takes charge of the whole prediction procedure and manages resource prediction models, and then all the evaluation services work as the prediction service's assistants for evaluating the accuracy and efficiency of candidate models.

Information service. The information service is deployed on the information node; it interacts with grid users and handles the storage, query, and publication of resource state information. Two types of mechanism are defined for information acquisition: local register and group register. The local register collects information in a timely manner from resource sensors to the monitoring service and from prediction models to the prediction service, while the group register collects information in a timely manner from both services and aggregates it for storage or publication. In order to provide a friendly interface to grid users, a web server is set up on the information node for customizing requests and publishing information. Therefore, a grid user needs nothing but a browser.

In our system, information generated or passed is represented in XML format in order to support new resource types and interact with other components. Fig. 3 shows monitoring information generated by a resource sensor, as an excerpt of a sample XML document.
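Since Fig. 3 itself is not reproduced in this transcript, the snippet below merely illustrates, in Java, the kind of XML record a resource sensor might emit; the element names and values are assumptions, not the paper's actual schema.

```java
// Hypothetical illustration of a monitoring record serialized as XML.
// The element names are assumptions; the paper's actual schema is shown in Fig. 3.
public class MonitoringRecordXml {

    public static String toXml(String node, String resourceType, long timestamp, double value) {
        return String.format(
            "<monitoring>%n" +
            "  <node>%s</node>%n" +
            "  <resource>%s</resource>%n" +
            "  <timestamp>%d</timestamp>%n" +
            "  <value>%f</value>%n" +
            "</monitoring>", node, resourceType, timestamp, value);
    }

    public static void main(String[] args) {
        System.out.println(toXml("node01.grid.example", "host_load", System.currentTimeMillis(), 0.42));
    }
}
```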

3.3 System Work Flow

Monitoring is the precondition of prediction; thus, we enclose the monitoring work flow within the prediction work


Fig. 2. Overall system structure designed.

Fig. 3. A sample XML document.


flow for a more compact description. The sequence diagram of the system work flow is illustrated in Fig. 4. In what follows, we provide a detailed description.

1. The grid user logs on to the information node and customizes three terms before sending a monitoring/prediction request: which node, which resource type, and how long the prediction will last. A prediction request is then created accordingly and sent to the information service.

2. The information service launches the monitoring work flow by sending a monitoring request to the monitoring service; a resource sensor is activated as requested, and timely updated monitoring data are then sent back to the information service through the local and group registers; historical records are stored in the database for use in prediction.

3. The information service chooses a prediction service and sends a customized prediction request. The chosen prediction service then takes charge of the whole prediction procedure.

4. The prediction service acquires historical monitoring data from the information service, and builds a set of candidate prediction models of different types and parameters. It combines each candidate model with historical data as an evaluation subtask.

5. The prediction service sends subtasks to evaluation services for feedback, and then fixes the model with the best performance based on a comparison of evaluation results. If an evaluation service times out, the prediction service redirects the subtask to another one (see the sketch after this list).

6. The prediction service feeds the fixed prediction model with timely updated monitoring data for prediction, and then the timely updated prediction data are sent to the information service through the local and group registers; historical prediction records are stored in the database for checking prediction error.

7. The grid user gets the resource information monitored or predicted from a browser. Upon request, or if the prediction error exceeds a certain threshold, the prediction service reloads the latest historical data and repeats step 4 for model optimization.

8. The information service terminates the monitoring or prediction procedure when the customized time is used up.
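As an illustration of step 5, the following sketch shows one way a prediction service could dispatch an evaluation subtask and redirect it to another evaluation service on timeout. The class and method names (EvaluationSubtask, EvaluationClient.evaluate) are hypothetical; the paper does not specify the actual service interfaces.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of step 5: dispatch an evaluation subtask and redirect it
// to another evaluation service when the current one times out.
public class SubtaskDispatcher {

    // Placeholder for a candidate model bundled with its historical data.
    public static class EvaluationSubtask { }

    // Placeholder client for a remote evaluation service.
    public interface EvaluationClient {
        double evaluate(EvaluationSubtask task) throws Exception; // returns a fitness or error value
    }

    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Try each evaluation service in turn; on timeout, redirect the subtask to the next one.
    public double evaluateWithRedirect(EvaluationSubtask task,
                                       List<EvaluationClient> services,
                                       long timeoutSeconds) throws Exception {
        for (EvaluationClient service : services) {
            Callable<Double> call = () -> service.evaluate(task);
            Future<Double> result = pool.submit(call);
            try {
                return result.get(timeoutSeconds, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                result.cancel(true); // this service timed out; redirect the subtask
            }
        }
        throw new IllegalStateException("all evaluation services timed out");
    }
}
```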

4 MACHINE LEARNING-BASED PREDICTION

In this section, we discuss key issues in realizing resource prediction using machine learning strategies. First, we propose a universal procedure for building and optimizing a prediction model. Then, we conduct comparative studies


Fig. 4. Sequence diagram of monitoring and prediction work flow.


and make decisions on selecting appropriate strategies for building prediction components.

4.1 Universal Procedure

This research focuses on q-step-ahead prediction; its equation is defined previously, and we augment it as

$$f : s(t+q) = f(s(t), s(t-1), s(t-2), \ldots, s(t-m+1)) \;\Rightarrow\; f : y = f(x_1, x_2, \ldots, x_{m-1}, x_m)$$
$$\text{s.t.}\quad y = s(t+q), \quad x_i = s(t-i+1), \quad i = 1, \ldots, m. \qquad (3)$$

This equation leads directly to a basic prediction model, as shown in Fig. 5.

The historical time series $S = (s(1), s(2), \ldots, s(t))$ is a spot set; it cannot feed such a model directly. Therefore, we transform it into a sample set based on overlapped segmentation of the time series. Table 1 shows the transformation results.
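The overlapped segmentation can be sketched as follows. This is a minimal illustration assuming a lag window of m points and a prediction horizon of q steps; the class and method names are ours, not the paper's.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: turn a time series s(1..t) into (features, target) samples
// by overlapped segmentation, for q-step-ahead prediction with m input features.
public class SampleSetBuilder {

    public static class Sample {
        public final double[] x; // s(t), s(t-1), ..., s(t-m+1)
        public final double y;   // s(t+q)
        Sample(double[] x, double y) { this.x = x; this.y = y; }
    }

    public static List<Sample> build(double[] series, int m, int q) {
        List<Sample> samples = new ArrayList<>();
        // The last usable "present" index must still leave q future points for the target.
        for (int t = m - 1; t + q < series.length; t++) {
            double[] x = new double[m];
            for (int i = 0; i < m; i++) {
                x[i] = series[t - i]; // x_i = s(t - i + 1) in the paper's 1-based notation
            }
            samples.add(new Sample(x, series[t + q]));
        }
        return samples;
    }
}
```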

Resource state prediction should work and evolve in a self-learning way; therefore, machine learning strategies are applicable to achieve automodeling and autooptimization of prediction models. We propose a universal procedure for this, with its structure given in Fig. 6; a minimal code sketch of the loop follows the numbered steps below.

1. Fix a machine learning algorithm for the prediction model, and set its default hyperparameters. Separate the sample set into three parts: training set, validation set, and test set.

2. Feed the learning algorithm with one sample of the training set at a time, and repeat until all samples are used. For some algorithms, the training procedure runs only once; for others, iterations are needed.

3. Feed the trained model with all samples of the validation set, and record the errors between the true data and the predicted ones.

4. Fix an optimization algorithm, which evolves the hyperparameters of the prediction model for better fitness (performance).

5. When a termination condition is met, the optimized prediction model is obtained; it is then tested using the test set.
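The sketch below expresses this loop in code, assuming generic Learner and Optimizer interfaces of our own invention; in the paper the concrete components are the SVR models and the PSO-based optimizer described later.

```java
// Sketch of the universal modeling/optimization procedure of Fig. 6.
// Learner, Optimizer, and Model are hypothetical interfaces used only for illustration.
public class UniversalProcedure {

    public interface Model { double predict(double[] x); }

    public interface Learner { Model train(double[][] x, double[] y, double[] hyperparameters); }

    public interface Optimizer {
        double[] next(double[] hyperparameters, double validationError); // propose new hyperparameters
        boolean terminated();
    }

    // xTrain/yTrain and xVal/yVal hold the training and validation samples from the segmentation step.
    public static Model run(Learner learner, Optimizer optimizer, double[] initialHp,
                            double[][] xTrain, double[] yTrain,
                            double[][] xVal, double[] yVal) {
        double[] hp = initialHp;
        Model best = null;
        double bestError = Double.MAX_VALUE;
        while (!optimizer.terminated()) {
            Model candidate = learner.train(xTrain, yTrain, hp);   // step 2: training
            double error = mae(candidate, xVal, yVal);             // step 3: validation error
            if (error < bestError) { bestError = error; best = candidate; }
            hp = optimizer.next(hp, error);                        // step 4: evolve hyperparameters
        }
        return best;                                               // step 5: evaluate this model on the test set
    }

    static double mae(Model m, double[][] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < y.length; i++) sum += Math.abs(y[i] - m.predict(x[i]));
        return sum / y.length;
    }
}
```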

4.2 Experimental Setups

Considering that a computing grid is loosely distributed over the Internet, the host load of a computing node on the Internet and the bandwidth between two nodes across the Internet are the most representative resource information that needs to be monitored and predicted. Moreover, we prefer using public data rather than historical data generated by ourselves, for the purpose of giving comparable and reproducible results.

For the available bandwidth data set, we believe that the "iepm-bw.bnl.gov.iperf2" set [24] can reflect the true variation between two nodes across the Internet. For the host load data set, we choose "mystere10000.dat" [25], a trace of a workstation node, because a workstation is the most typical computing node.

After transformation of the original data as in Table 1, the latest 200 samples were sequentially chosen to form the experiment data set. The data set was then divided into training, validation, and test sets, in a proportion of 100:50:50. Summary statistics for the data sets are listed in Table 2.

The experiments ran on a single Intel Pentium IV 3.0 GHz CPU under the Fedora Core Linux 9.0 system; all the algorithms are coded in Java. We recorded the training CPU time to measure efficiency, and used the mean absolute error (MAE) to measure accuracy, as in (4), where $l$ counts the number of samples, and $s(t+q)$ and $s^*(t+q)$ denote the true value and predicted value, respectively:

$$\mathrm{MAE} = \frac{1}{l} \sum_{i=1}^{l} \left| s(t+q) - s^*(t+q) \right|. \qquad (4)$$


Fig. 5. Basic prediction model.

TABLE 1
Sample Set for a Prediction Model

Fig. 6. Universal procedure for resource prediction.

TABLE 2
Statistics of Data Sets


4.3 Comparative Study on Modeling Methods

Artificial neural network (ANN) and support vector machine (SVM) are two typical machine learning strategies in the category of regression computation. These two methods can be employed for modeling resource state prediction. ANN is a powerful tool for self-learning, and it can generalize the characteristics of resource variations by proper training. ANN is inherently a distributed architecture with high robustness. It is suitable for multi-information fusion, and competent for quantitative and qualitative analysis. ANNs have been used in resource state prediction in the past. It was indicated in [15] that ANN prediction outperforms the NWS methods [8]. However, ANN's learning process is quite complex and inefficient for modeling. Furthermore, the choices of model structures and parameters lack a standard theory, so ANN usually suffers from overfitting or underfitting with ill-chosen parameters.

As a promising solution to nonlinear regression problems, SVM [16] has recently been winning popularity due to its remarkable characteristics such as good generalization performance, absence of local minima, and sparse solution representation. SVM is based on the structural risk minimization (SRM) principle, which tries to control model complexity as well as the upper bound of generalization risk. In contrast, traditional regression techniques, including neural networks, are based on the empirical risk minimization (ERM) principle, which tries to minimize the training error only. Therefore, SVM is expected to achieve better performance than traditional methods. Prem and Raghavan [17] have explored the possibility of applying SVM to forecast resource measures. They indicated that the SVM-based forecasts outperform the NWS methods, including autoregressive and mean/median-based methods.

This study aims at comparing the efficiency and accuracy of different models for multi-step-ahead prediction of grid resources by simulations. The modeling methods considered are variations of ANN, including the back propagation neural network (BPNN) [20], radial basis function neural network (RBFNN) [21], and generalized regression neural network (GRNN) [22], which hybridizes RBFNN and BPNN, plus variations of SVM, including Epsilon-support vector regression (ESVR) [16] and Nu-support vector regression (NSVR) [23]. The model parameters are initialized with commonly used values, as given in Tables 3 and 4. Since we are predicting up to 5 steps ahead, the input feature number should be bigger than 5.

The MAE results of the different models are shown in Figs. 7a and 7b. From both figures, we can find that GRNN achieves better accuracy than BPNN and RBFNN, while NSVR and ESVR achieve the best performance for all q values considered. As the prediction step q increases, the prediction errors of GRNN, NSVR, and ESVR do not exceed the tolerance interval, with bandwidth MAE below 40 Mbps and host load MAE below 0.12. This means that these three methods are suitable for both one-step-ahead and multi-step-ahead resource state prediction.

A remarkable characteristic of SVR is its sparse solution representation: a model with fewer support vectors is better at achieving the same accuracy. We can see from Figs. 7c and 7d that the comparison results between the two SVR variants are data set dependent. In this case, the two SVR methods achieve similar accuracy and complexity.


Fig. 7. Comparative results on modeling methods.

TABLE 3
Parameters for ANNs

TABLE 4
Parameters for SVRs


The training CPU times of the models under consideration are compared in Figs. 7e and 7f. From each subfigure, we can see that the training time does not show a remarkable trend as the step q increases. SVRs cost less time than ANNs, namely within 120 ms on both data sets.

Based on the comparative results of accuracy and efficiency, SVR is selected by our system as the prediction strategy for modeling resource variations.

4.4 Comparative Study on Optimization Methods

Genetic algorithm (GA) and particle swarm optimization (PSO) are two typical machine learning strategies in the category of evolutionary computation. These two methods can be employed to optimize the prediction model, in the expectation of achieving higher performance. GA was proposed by John Holland and his students in 1975 [18], inspired by the theory of natural selection and evolution. GA uses a set of chromosomes to represent solutions. The chromosomes from one population are taken and used to form a new population, which is called the offspring. The chromosomes with better fitness have more chances for reproduction, and consequently, the new population will be better than the old one.

PSO was proposed by Kennedy and Eberhart [19], inspired by the social behavior of natural systems, such as bird flocking or fish schooling. The system initializes a population of random particles and searches a multidimensional solution space for optima by updating particle generations. Each particle moves based on the direction of the local best solution discovered by itself and the global best solution shared by the swarm.

This study aims at comparing the optimization performance of GA and PSO by simulation. We concentrate on hyperparameter selection using the host load data set. Parameters are initialized with commonly used values: the acceleration constants $c_1$ and $c_2$ are selected according to [19], the inertia weight $w$ decreases linearly with time as proposed in [26], and SVR's hyperparameters $C$, $\varepsilon$, $\gamma$ are changed exponentially during optimization [27]. The initialized parameters of GA and PSO and the optimized hyperparameters of the SVRs are given in Tables 5 and 6. If the input feature number is too small, we cannot tell the difference in optimizing time between the two, so we set it to 10.

MAE was used to measure the accuracy of the optimized model, and the optimizing time was recorded to measure its efficiency. From Figs. 8a and 8b we can find that PSO achieves lower error than GA for most of the q values considered, and has less optimizing time. Based on these comparative results, PSO is selected by our system as the optimization strategy for prediction models.

5 PROPOSED OPTIMIZATION ALGORITHM

According to the previous comparisons, SVR is selected as the automodeling strategy, and PSO is selected as the autooptimization strategy. Generally, a prediction model


TABLE 5
Parameters for GA-SVR

TABLE 6
Parameters for PSO-SVR

Fig. 8. Comparative results on optimization methods.


relies directly on the choice of the model's hyperparameters. In addition, irrelevant input features in resource samples will also spoil the accuracy and efficiency of the model. Moreover, hyperparameter selection and feature selection correlate with each other. Besides, the prediction subsystem has a starlike distributed structure; such a topology should be utilized to accelerate the modeling and optimizing procedure.

In this section, we define a combined criterion for fitness evaluation, and propose a Parallel Hybrid Particle Swarm Optimization algorithm for the resource prediction subsystem. PH-PSO takes both hyperparameter selection and feature selection into consideration and, thus, is expected to enhance the accuracy and efficiency of the subsystem.

5.1 Optimization Problem Definition

Considering hyperparameter selection and feature selection jointly, we code this combinational optimization problem with a hybrid vector $PR$, which consists of real numbers and binary numbers. The real numbers $p_1, p_2, \ldots$ represent the hyperparameters of the model, and the binary numbers $bf_1, bf_2, \ldots$ represent the choice of sample features. The value 1 or 0 of $bf_s$ indicates whether the corresponding feature in the samples is selected, and $m$ is the full input feature number of the samples. The optimization can be defined as

$$\max \; Fit(PR)$$
$$\text{s.t.}\quad
\begin{cases}
PR = \{p_1, p_2, \ldots, bf_1, bf_2, \ldots, bf_m\}, \\
p_1, p_2, \ldots \in \mathbb{R}, \\
bf_s \in \{0, 1\},\; s = 1, 2, \ldots, m.
\end{cases} \qquad (5)$$
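A minimal sketch of how the hybrid vector PR could be represented in code is given below; the field names are ours, and the concrete real-valued hyperparameters (e.g., C and gamma for NSVR) follow the paper's later choice.

```java
// Hypothetical encoding of the hybrid vector PR: real-valued hyperparameters
// plus a binary feature-selection mask of length m.
public class HybridParticlePosition {
    public double[] hyperparameters; // p1, p2, ... (e.g., log2 C and log2 gamma for NSVR)
    public boolean[] featureMask;    // bf1 ... bfm, true = feature selected

    public HybridParticlePosition(int numHyperparameters, int m) {
        hyperparameters = new double[numHyperparameters];
        featureMask = new boolean[m];
    }

    // Number of input features actually fed to the prediction model.
    public int selectedFeatureCount() {
        int count = 0;
        for (boolean selected : featureMask) if (selected) count++;
        return count;
    }
}
```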

The definition of the fitness function $Fit(\cdot)$ is crucial in that it determines what the algorithm should optimize. Accuracy and efficiency are both of concern in evaluating the fitness of the prediction model. In other words, a model is better (has larger fitness) only if it has lower prediction error as well as less training time, which leads to a relationship of inverse proportion in both factors. Moreover, when the training time is acceptable, accuracy takes priority. Accordingly, we define the fitness function as

$$Fitness = \frac{h}{MSE_t \cdot \ln T_t}, \qquad
MSE_t = \frac{1}{l} \sum_{i=1}^{l} \left( s(t+q) - s^*(t+q) \right)^2, \qquad (6)$$

where $MSE_t$ is the training mean squared error of 5-fold cross validation [28], $h$ is a constant controlling the bound of the fitness, and $T_t$ denotes the model's training time; $l$ counts the number of samples, and $s(t+q)$ and $s^*(t+q)$ denote the true value and predicted value, respectively.
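For illustration, (6) can be coded directly as below; the constant h and the error/time measurements are whatever the evaluation service supplies, and the time unit is assumed to be chosen so that ln T_t stays positive.

```java
// Direct sketch of the fitness function in (6): larger fitness means
// lower cross-validation MSE and shorter training time.
public class FitnessFunction {
    private final double h; // constant controlling the bound of the fitness

    public FitnessFunction(double h) { this.h = h; }

    // Assumes trainingTime > 1 in its unit, so that Math.log(trainingTime) > 0.
    public double fitness(double[] trueValues, double[] predictedValues, double trainingTime) {
        double mse = 0.0;
        for (int i = 0; i < trueValues.length; i++) {
            double diff = trueValues[i] - predictedValues[i];
            mse += diff * diff;
        }
        mse /= trueValues.length;
        return h / (mse * Math.log(trainingTime));
    }
}
```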

5.2 Parallel Hybrid Particle Swarm Optimization

There are mainly two types of PSO, distinguished by different updating rules for calculating the positions and velocities of particles: the continuous version [19], [26] and the binary version [29]. Hyperparameter selection is a kind of continuous optimization problem, and feature selection is a kind of binary optimization problem. In view of our optimization problem definition, this study proposes a parallel optimization algorithm that hybridizes continuous PSO and binary PSO, namely PH-PSO.

The algorithm is initialized with a population of random particles and searches a multidimensional solution space for optima by updating particle generations. Each particle moves based on the direction of the local best solution discovered by itself and the global best solution discovered by the swarm. Each particle calculates its own velocity and updates its position in each iteration until the termination condition is met. Suppose there are $P$ particles in a $D$-dimensional search space.

1. $A_{P \times D}$ denotes the position matrix of all particles, $p = 1, 2, \ldots, P$, $d = 1, 2, \ldots, D$; row vector $a_p$ in $A$ denotes the position of the $p$th particle, recorded as $a_p = \{a_{p1}, a_{p2}, \ldots, a_{pD}\}$;

2. $V_{P \times D}$ denotes the velocity matrix of all particles; row vector $v_p$ in $V$ denotes the velocity of the $p$th particle, recorded as $v_p = \{v_{p1}, v_{p2}, \ldots, v_{pD}\}$;

3. $LB_{P \times D}$ denotes the local best positions of all particles; row vector $lb_p$ in $LB$ denotes the local best position of the $p$th particle, recorded as $lb_p = \{lb_{p1}, lb_{p2}, \ldots, lb_{pD}\}$;

4. Row vector $gb = \{gb_1, gb_2, \ldots, gb_D\}$ denotes the global best position shared by all particles.

The particle is represented by the hybrid vector $PR$, and during each iteration the real and binary parts of $PR$ are updated jointly using different rules, namely (7) and (8), respectively. $rdm(0,1)$, $rdm_1(0,1)$, and $rdm_2(0,1)$ are random numbers uniformly distributed in $[0, 1]$; $t$ denotes the iteration step. The inertia weight $w$ plays the role of balancing global search and local search; it can be a positive constant or even a positive linear/nonlinear function of time. The acceleration constants $c_1$ and $c_2$ represent the personal learning factor and social learning factor, respectively. $Sg(\cdot)$ is a sigmoid limiting transformation.

$$v_{pd}(t+1) = w \cdot v_{pd}(t) + c_1 \cdot rdm_1(0,1) \cdot (lb_{pd}(t) - a_{pd}(t)) + c_2 \cdot rdm_2(0,1) \cdot (gb_{pd}(t) - a_{pd}(t)),$$
$$a_{pd}(t+1) = a_{pd}(t) + v_{pd}(t+1); \qquad (7)$$

$$v_{pd}(t+1) = w \cdot v_{pd}(t) + c_1 \cdot rdm_1(0,1) \cdot (lb_{pd}(t) - a_{pd}(t)) + c_2 \cdot rdm_2(0,1) \cdot (gb_{pd}(t) - a_{pd}(t)),$$
$$a_{pd}(t+1) = \begin{cases} 1, & \text{if } rdm(0,1) < Sg(v_{pd}(t+1)), \\ 0, & \text{otherwise}, \end{cases}
\qquad Sg(v) = \frac{1}{1 + e^{-v}}. \qquad (8)$$
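A sketch of one hybrid update step combining (7) and (8) is given below; the split of dimensions into real-valued and binary parts follows the PR encoding above, and bounds handling and velocity clamping are omitted for brevity.

```java
import java.util.Random;

// Sketch of one PH-PSO update step: continuous rule (7) for the real-valued
// hyperparameter dimensions, binary rule (8) for the feature-selection dimensions.
public class HybridParticleUpdate {
    private final Random rnd = new Random();

    public void update(double[] position, double[] velocity,
                       double[] localBest, double[] globalBest,
                       int numRealDims, double w, double c1, double c2) {
        for (int d = 0; d < position.length; d++) {
            velocity[d] = w * velocity[d]
                    + c1 * rnd.nextDouble() * (localBest[d] - position[d])
                    + c2 * rnd.nextDouble() * (globalBest[d] - position[d]);
            if (d < numRealDims) {
                // continuous PSO, (7): real-valued hyperparameter dimension
                position[d] = position[d] + velocity[d];
            } else {
                // binary PSO, (8): feature-selection dimension takes value 0 or 1
                double sg = 1.0 / (1.0 + Math.exp(-velocity[d]));
                position[d] = (rnd.nextDouble() < sg) ? 1.0 : 0.0;
            }
        }
    }
}
```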

The flowchart of PH-PSO is shown in Fig. 9, with the major steps explained as follows (a sketch of the parallel evaluation part follows the list):

1. Initialize the system: set parameters for the PSO system, including population size, iteration number, and dimensional search intervals; set parameters for the particles, such as the inertia weight, personal learning factor, and social learning factor; randomly generate a position and velocity for each particle, which is coded using the hybrid vector PR.

2. Preprocessing in parallel: prepare the sample set with the corresponding features as well as the candidate model with the corresponding hyperparameters according to the particle representation.

3. Fitness evaluation in parallel: use the validation set to evaluate the candidate model, and then calculate the fitness of the particle according to (6).

4. Update the local best/global best: if a particle's fitness is better than its local best fitness, update the corresponding local best fitness and local best position; if a particle's fitness is better than the global best fitness, update the global best fitness and global best position.

5. Termination judgment: if the termination condition is met, then go to step 7.

6. Update the velocity and position of each particle, and then go to step 2 for the next iteration.

7. Finish: output the global best position, and prepare the sample set with the selected features and the prediction model with the selected hyperparameters according to the representation of the global best position.
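Steps 2 and 3 are performed in parallel by the evaluation services; the sketch below imitates that with a local thread pool, since the actual service-invocation interface is not given in this transcript. ParticleEvaluator is a hypothetical callback that performs preprocessing, training, and validation for one particle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the parallel part of PH-PSO (steps 2-3): evaluate the fitness of all
// particles concurrently, one evaluation task per particle, mirroring the paper's
// one-evaluation-service-per-particle setup. ParticleEvaluator is hypothetical.
public class ParallelFitnessEvaluation {

    public interface ParticleEvaluator {
        double evaluate(double[] particlePosition) throws Exception; // returns fitness per (6)
    }

    public static double[] evaluateAll(List<double[]> positions, ParticleEvaluator evaluator)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(positions.size());
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (double[] position : positions) {
                Callable<Double> call = () -> evaluator.evaluate(position);
                futures.add(pool.submit(call));
            }
            double[] fitness = new double[positions.size()];
            for (int p = 0; p < futures.size(); p++) {
                fitness[p] = futures.get(p).get();
            }
            return fitness;
        } finally {
            pool.shutdown();
        }
    }
}
```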

6 EVALUATION ON PROTOTYPE SYSTEM

The prototype system's nodes run under the Fedora Core Linux 9.0 system and are connected by a 100 Mbps LAN; each node is equipped with a single Intel Pentium IV 3.0 GHz CPU. The Globus environment is built based on Globus Toolkit 4.2.0 [30], and the Libsvm toolkit [31] is employed to solve the QP problem of the SVR algorithm. Supporting services are coded in Java and deployed in the Globus container [32].

The whole system falls into two subsystems; the two are evaluated individually because of their different evaluation criteria. For the monitoring subsystem, we made a comparative study of overhead between our system and the Condor Hawkeye system [4], which is a well-known monitoring system that can work with the Globus Toolkit. For the prediction subsystem, we compared the accuracy and efficiency of different prediction models.

6.1 Evaluation on Monitoring Subsystem

Following our design, the monitoring subsystem is built by coding the monitoring service and resource sensors. Table 7 lists the resource sensors implemented in our system and the techniques used in their code. In the table, O indicates that the sensor executes certain operations (e.g., I/O operations) and calculates the running performance of the resource as monitoring data, such as latency and bandwidth; V means that the sensor gets information by parsing the /proc virtual file system.
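As an illustration of a "V"-type sensor, the sketch below reads the 1-minute load average by parsing /proc/loadavg; the class name and sampling details are ours, not the paper's.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Illustrative "V"-type resource sensor: obtains the host load by parsing
// the /proc virtual file system (here, the 1-minute load average).
public class HostLoadSensor {

    public double readOneMinuteLoad() throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("/proc/loadavg"))) {
            String line = reader.readLine();          // e.g., "0.42 0.38 0.35 1/123 4567"
            return Double.parseDouble(line.split("\\s+")[0]);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.printf("host load (1 min): %.2f%n", new HostLoadSensor().readOneMinuteLoad());
    }
}
```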

Low overhead is the primary design purpose of the monitoring subsystem, which means that monitoring activities should not bring obvious influence on computing nodes. Condor version 7.2.5 [33] and Hawkeye version 1.0.0 [4] are used in our experiments. For both our monitoring subsystem and the Condor Hawkeye system, we recorded the CPU and memory usage with the monitoring up and down so as to evaluate their overhead. The sampling frequency is set to once per minute, and the recording process lasted for an hour. The performance data are given in Fig. 10.

Table 8 lists the statistics of the performance data. Both monitoring systems occupy similar CPU usage, namely 11 percent for ours and 12 percent for Hawkeye. The memory used by our monitoring subsystem is about 21 MB, while the memory used by Condor and Hawkeye is about 50 MB; part of that is consumed by Condor itself, since Hawkeye needs Condor to achieve monitoring. It can be inferred that our monitoring subsystem does not bring obvious influence on computing nodes. Hence, the design of our monitoring subsystem is acceptable.

6.2 Evaluation on Prediction Subsystem

The ESVR model has three hyperparameters, that is, $C$, $\varepsilon$, and $\gamma$ [16]. However, the NSVR model is able to select $\varepsilon$ by itself [23], so only $C$ and $\gamma$ are considered hyperparameters, which


Fig. 9. Flowchart of the PH-PSO Algorithm.

TABLE 7
Resource Sensors and Monitoring Techniques Used

Fig. 10. Comparative results on monitoring overhead (memory usage in MB over one hour: monitoring off, subsystem on, Hawkeye on).


means that the order of complexity for model optimization decreases from $O(n^3)$ to $O(n^2)$. Therefore, NSVR is chosen to build the resource state prediction model in the prototype system.
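Since the prototype builds its NSVR models with the LIBSVM toolkit [31], the sketch below shows how a Nu-SVR model could be trained and queried through LIBSVM's Java API; the hyperparameter values are placeholders for the ones PH-PSO would select, not values from the paper.

```java
import libsvm.*;

// Minimal sketch: train a Nu-SVR model with LIBSVM's Java API and predict one value.
// c and gamma are placeholders for the hyperparameters that PH-PSO would select.
public class NsvrExample {

    public static svm_model train(double[][] features, double[] targets, double c, double gamma) {
        svm_problem problem = new svm_problem();
        problem.l = targets.length;
        problem.y = targets;
        problem.x = new svm_node[targets.length][];
        for (int i = 0; i < targets.length; i++) {
            problem.x[i] = toNodes(features[i]);
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.NU_SVR;
        param.kernel_type = svm_parameter.RBF;
        param.C = c;
        param.gamma = gamma;
        param.nu = 0.5;          // LIBSVM's default nu
        param.cache_size = 100;  // kernel cache in MB
        param.eps = 1e-3;        // stopping tolerance
        return svm.svm_train(problem, param);
    }

    public static double predict(svm_model model, double[] features) {
        return svm.svm_predict(model, toNodes(features));
    }

    private static svm_node[] toNodes(double[] features) {
        svm_node[] nodes = new svm_node[features.length];
        for (int j = 0; j < features.length; j++) {
            nodes[j] = new svm_node();
            nodes[j].index = j + 1; // LIBSVM feature indices are 1-based
            nodes[j].value = features[j];
        }
        return nodes;
    }
}
```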

PH-PSO is used to optimize the prediction model in the prototype system. The parameters of PH-PSO are initialized with commonly used values: the acceleration constants $c_1$ and $c_2$ are selected according to [19], the inertia weight $w$ decreases linearly with time as proposed in [26], and the hyperparameters $C$, $\gamma$ are changed exponentially during optimization [27], as listed in Table 9.

One prediction service controls the overall prediction procedure, and evaluation services are used for evaluating model fitness in parallel. The number of evaluation services used for fitness evaluation is equal to the number of particles in the PH-PSO algorithm. All the tests are implemented through the dynamic collaboration of system services.

High accuracy and efficiency are the primary design goals of the prediction subsystem. We present the prediction and optimization results for the bandwidth and host load data sets. In q-step-ahead prediction, $q = 1, 2, 3, 4, 5$ are considered. In model optimization, we implemented four different strategies: feature selection with hyperparameter selection on NSVR (FH), feature selection without hyperparameter selection on NSVR (F0), hyperparameter selection without feature selection on NSVR (0H), and parameters given directly on SVR (00) without any optimization mechanism, as in [17]. The test data sets used are the same as in Section 4. We recorded the parallel CPU time for optimizing models, and used MAE to measure prediction accuracy.

The MAE results of the different models are shown in Figs. 11a and 11c. The MAE of bandwidth stays below 17.9 Mbps, and there is no remarkable difference between one-step-ahead and multi-step-ahead prediction. The MAE of host load stays below 0.08, with one-step-ahead prediction slightly better than multi-step-ahead prediction. As the prediction step q increases, there is no obvious ascending trend in MAE on either data set. This implies that our modeling method is suitable for both one-step-ahead and multi-step-ahead resource state prediction. Furthermore, comparing the four strategies, the introduction of an optimizing mechanism helps to enhance the accuracy of the prediction model. This is especially true for the combinational optimization FH, since it achieves lower errors for most of the q values considered.

We can see from Figs. 11a and 11b that the optimized NSVR models have higher accuracy than the SVR model without an optimizing strategy. The NSVRs' SV numbers are over 50, whereas the SVR's SV number is less than 10. Similar phenomena can be observed in Figs. 11c and 11d, except that the SVR's SV number is around 25. Clearly, there is a trade-off between model accuracy and solution sparseness: a model with more SVs is more complicated as well as more capable in characterization.

The parallel CPU times of combinational/individual optimization are compared in Figs. 12a and 12b. From each subfigure, we can see that the optimization time does not show a remarkable trend as the step q increases. 0H costs more time than FH and F0. This indicates that the model's training time can be obviously reduced by feature selection rather than hyperparameter selection. It is clear that the optimizing time of the combinational optimization FH is rather short by means of parallelization, namely within 3 seconds on both data sets.


TABLE 8
Statistical Results of Overhead

TABLE 9
PH-PSO Parameter Initialization

Fig. 11. Comparative results on prediction performance.


The global best fitness of the combinational optimization FH during each iteration was recorded to evaluate the convergence performance. Landscape comparisons among different q values are shown in Figs. 12c and 12d. On the host load data set, a trend is obvious: the global best fitness decreases clearly as the prediction step q increases. Such a trend is not found on the bandwidth data set, which implies that the bandwidth variation has more noise than the host load. These subfigures also imply that the combinational optimization converges within a reasonable number of iterations for most of the q values considered.

7 CONCLUSIONS AND FUTURE WORKS

We proposed a distributed resource monitoring and prediction architecture that seamlessly combines grid technologies, resource monitoring, and machine learning-based resource state prediction. This system consists of a set of distributed services that accomplish all required resource monitoring, data gathering, and resource state prediction functions.

We defined a universal procedure for the modeling and optimization of resource state prediction. In building a multi-step-ahead prediction model, ANNs and SVRs were compared with respect to both efficiency and accuracy criteria. Comparative simulations indicate that Epsilon-Support Vector Regression and Nu-Support Vector Regression achieve better performance than the Back Propagation Neural Network, Radial Basis Function Neural Network, and Generalized Regression Neural Network. In the expectation of achieving higher performance, we compared the Genetic Algorithm and Particle Swarm Optimization for the prediction models' hyperparameter selection. Comparative simulations indicate that PSO achieves lower error and costs less optimizing time than GA.

In the prototype system, we implemented a series of sensors that cover most resource measures. The overhead evaluation shows that the monitoring subsystem does not bring obvious influence on computing nodes. A Parallel Hybrid Particle Swarm Optimization algorithm, which combines discrete PSO and continuous PSO, was proposed for the purpose of combinational optimization of the prediction model. Evaluation results indicate that the combination of PH-PSO and NSVR meets the accuracy and efficiency demands of an online system.

The results of this paper will contribute to the building and advancing of computing grid infrastructure. We plan to carry our research further in the following aspects: monitoring and prediction of grid tasks, classification and evaluation of grid resources, classification and evaluation of grid tasks, etc. We believe that machine learning strategies are applicable tools for modeling and optimization, and that they will play an increasingly important role by virtue of their potential in distributed computing environments.

ACKNOWLEDGMENTS

This research work is funded by National Natural Science Foundation of China under Grant No. 61073009, 60873235, and 60473099, by Science-Technology Development Key Project of Jilin Province of China under Grant No. 20080318, and by Program of New Century Excellent Talents in University of China under Grant No. NCET-06-0300.

REFERENCES

[1] L.F. Bittencourt and E.R.M. Madeira, "A Performance-Oriented Adaptive Scheduler for Dependent Tasks on Grids," Concurrency and Computation: Practice and Experience, vol. 20, no. 9, pp. 1029-1049, June 2008.

[2] F. Wolf and B. Mohr, "Hardware-Counter Based Automatic Performance Analysis of Parallel Programs," Proc. Conf. Parallel Computing (ParCo '03), pp. 753-760, Sept. 2003.

[3] J. Dugan et al., "Iperf Project," http://iperf.sourceforge.net/, Mar. 2008.

[4] M. Livny et al., "Condor Hawkeye Project," Univ. of Wisconsin-Madison, http://www.cs.wisc.edu/condor/hawkeye/, Sept. 2009.

[5] M.L. Massie, B.N. Chun, and D.E. Culler, "The Ganglia Distributed Monitoring System: Design, Implementation, and Experience," Parallel Computing, vol. 30, no. 7, pp. 817-840, July 2004.

[6] A. Waheed et al., "An Infrastructure for Monitoring and Management in Computational Grids," Proc. Fifth Int'l Workshop Languages, Compilers, and Run-Time Systems for Scalable Computers, vol. 1915, pp. 235-245, Mar. 2000.

[7] J.S. Vetter and D.A. Reed, "Real-Time Performance Monitoring, Adaptive Control, and Interactive Steering of Computational Grids," Int'l J. High Performance Computing Applications, vol. 14, no. 4, pp. 357-366, 2000.

[8] D.M. Swany and R. Wolski, "Multivariate Resource Performance Forecasting in the Network Weather Service," Proc. ACM/IEEE Conf. Supercomputing, pp. 1-10, Nov. 2002.

[9] P.A. Dinda and D.R. O'Hallaron, "Host Load Prediction Using Linear Models," Cluster Computing, vol. 3, no. 4, pp. 265-280, 2000.

[10] E. Caron, A. Chis, F. Desprez, and A. Su, "Design of Plug-in Schedulers for a GRIDRPC Environment," Future Generation Computer Systems, vol. 24, no. 1, pp. 46-57, 2008.

[11] P.A. Dinda, "Design, Implementation, and Performance of an Extensible Toolkit for Resource Prediction in Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 17, no. 2, pp. 160-173, Feb. 2006.

[12] A.C. Sodan, G. Gupta, L. Han, L. Liu, and B. Lafreniere, "Time and Space Adaptation for Computational Grids with the ATOP-Grid Middleware," Future Generation Computer Systems, vol. 24, no. 6, pp. 561-581, 2008.


Fig. 12. Comparative results on optimization performance.


[13] M. Wu and X.H. Sun, "Grid Harvest Service: A Performance System of Grid Computing," J. Parallel and Distributed Computing, vol. 66, no. 10, pp. 1322-1337, 2006.

[14] L.T. Lee, D.F. Tao, and C. Tsao, "An Adaptive Scheme for Predicting the Usage of Grid Resources," Computers and Electrical Eng., vol. 33, no. 1, pp. 1-11, 2007.

[15] A. Eswaradass, X.H. Sun, and M. Wu, "A Neural Network Based Predictive Mechanism for Available Bandwidth," Proc. 19th IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS '05), p. 33a, 2005.

[16] V.N. Vapnik, The Nature of Statistical Learning Theory, second ed. Springer-Verlag, 1999.

[17] H. Prem and N.R.S. Raghavan, "A Support Vector Machine Based Approach for Forecasting of Network Weather Services," J. Grid Computing, vol. 4, no. 1, pp. 89-114, 2006.

[18] J.H. Holland, Adaptation in Natural and Artificial Systems. MIT Press, 1975.

[19] J. Kennedy and R.C. Eberhart, "Particle Swarm Optimization," Proc. IEEE Int'l Conf. Neural Networks, pp. 1942-1948, 1995.

[20] L. Fausett, Fundamentals of Neural Networks. Prentice-Hall, 1994.

[21] S. Haykin, Neural Networks: A Comprehensive Foundation. Macmillan Publishing, 1994.

[22] D. Patterson, Artificial Neural Networks. Prentice-Hall, 1996.

[23] A.J. Smola and B. Scholkopf, "A Tutorial on Support Vector Regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.

[24] Bandwidth Data Set, http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/combinedata/, Feb. 2009.

[25] Host Load Data Set, http://people.cs.uchicago.edu/~lyang/Load/, Feb. 2009.

[26] Y. Shi and R.C. Eberhart, "A Modified Particle Swarm Optimizer," Proc. IEEE Int'l Conf. Evolutionary Computation, pp. 69-73, 2000.

[27] C.W. Hsu, C.C. Chang, and C.J. Lin, "A Practical Guide to Support Vector Classification," Dept. of Computer Science and Information Eng., Nat'l Taiwan Univ., http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, 2003.

[28] M.W. Browne, "Cross-Validation Methods," J. Math. Psychology, vol. 44, no. 1, pp. 108-132, 2000.

[29] J. Kennedy and R.C. Eberhart, "A Discrete Binary Version of the Particle Swarm Optimization," Proc. IEEE Int'l Conf. Neural Networks, pp. 4104-4108, 1997.

[30] Globus Alliance, "Globus Project," http://www.globus.org/toolkit/downloads/4.2.0/, 2008.

[31] C.C. Chang and C.J. Lin, "LIBSVM: A Library for Support Vector Machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm/, May 2008.

[32] S. Borja, "The Globus Toolkit 4 Programmer's Tutorial," http://gdp.globus.org/gt4-tutorial/, Nov. 2005.

[33] M. Livny et al., "Condor Project," Univ. of Wisconsin-Madison, http://www.cs.wisc.edu/condor/, Sept. 2009.

Liang Hu received the MS and PhD degrees in computer science from Jilin University, in 1993 and 1999, respectively. Currently, he is a professor and doctoral supervisor at the College of Computer Science and Technology, Jilin University, China. His research areas are network security and distributed computing, including related theories, models, and algorithms of PKI/IBE, IDS/IPS, and Grid Computing. He is a member of the China Computer Federation.

Xi-Long Che received the MS and PhD degrees in computer science from Jilin University, in 2006 and 2009, respectively. Currently, he is a lecturer at the College of Computer Science and Technology, Jilin University, China. His current research areas are machine learning and parallel computing, including related theories, models, and algorithms of ANN, SVC/SVR, GA/ACO/PSO, and their combinations with Parallel Computing. He is a member of the IEEE. He is the corresponding author of this paper.

Si-Qing Zheng received the PhD degree in electrical and computer engineering from the University of California, Santa Barbara, in 1987. He is currently a professor of computer science, computer engineering, and telecommunications engineering. He served as the head of the Computer Engineering Program and Telecommunications Engineering Program, and associate head of the Computer Science Department and Electrical Engineering Department, all at the University of Texas at Dallas. His research interests include algorithms, computer architectures, networks, parallel and distributed processing, performance evaluation, circuits and systems, hardware/software codesign, real-time and embedded systems, and telecommunications. He has published extensively in these areas. He is a senior member of the IEEE.
