CheR: Cheating Resilience in the Cloud via Smart Resource ...

36
CheR: Cheating Resilience in the Cloud via Smart Resource Allocation Roberto Di Pietro, Flavio Lombardi, Fabio Martinelli, Daniele Sgandurra Dept. of Mathematics and Physics - University Roma Tre, Italy Institute for Informatics and Telematics - CNR, Italy [email protected] FPS 2013 22/10/2013

Transcript of CheR: Cheating Resilience in the Cloud via Smart Resource ...

Page 1: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR: Cheating Resilience in the Cloud via SmartResource Allocation

Roberto Di Pietro, Flavio Lombardi, Fabio Martinelli,Daniele Sgandurra

Dept. of Mathematics and Physics - University Roma Tre, ItalyInstitute for Informatics and Telematics - CNR, Italy

[email protected]

FPS 201322/10/2013

Page 2: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Outline

BackgroundCloud ComputingUse Case

CheRModeling The ProblemCurrent Prototype

ResultsTestsConclusion and Future Works

Page 3: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Background

Part I

Background

Page 4: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Background

Background

Computational outsourcing• is needed:

fast complex computations are required;very large datasets are to be processed;

• is successful since cloud computing offers:cheap computing/storage resources;scalable computing resources.

Page 5: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Background

Motivation

However, computational outsourcing does NOT guarantee:

• correctness of computations;

• availability of computations;

• timeliness of results.

In particular, cheating:

• allows computing nodes to save resources;

• if undetected, leads to erroneous results.

Page 6: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Background

Use Case

• A sysadmin needs to rent some cloud resources to perform a paralleltask over a large data set.

• Two main possible scenarios exist:

1. the computation needs to end reliably within a given timeframeand the sysadmin is willing to spend as little as possible;

2. the sysadmin has a fixed maximum budget and she has toreliably compute the function by minimizing the required time.

Page 7: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Background

Use Case

Figure : Use Case: Node N2 is a Cheater

Page 8: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Background

Possible Solution?

• Use only “trusted” nodes:

no parallelization;not cost-effective.

• The main goal here is to guarantee timeliness, cost-effectiveness andcorrectness of the computed results.

Page 9: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

Part II

CheR

Page 10: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

Generalizing the Problem (1)

• There are n cloud nodes that host computation.

• At most k nodes are rational adversaries.

• The remaining (n − k) cloud nodes are non-malicious andwell-behaving nodes

• We model the adversary as a static cheater:

it always fakes its computations with a given probability.

Page 11: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

Generalizing the Problem (2)

• Cloud nodes are required to compute a parallel function f over aninput vector X of length m.

• The output is a vector of length m where the j-th element is f (j).

• We assume that there are n cloud nodes where each node ni cancompute a subset (possibly overlapping) of the input vector.

• Each node ni has an associated unitary cost per operation ci and thetime to perform an unitary operation is ti .

Page 12: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

Generalizing the Problem (3)

• The percentage of cheaters (CheaterRate) is kn .

• The system administrator has to send multiple (possibly overlapping)subsets of X to the nodes to be confident that the output is correct.

• The confidence threshold is DetConf .

• The goal of the system administrator is to:

minimize the total cost of the operations,given a maximum time of computation Tmax ,all the results to be correct with an error less than DetConf .

Page 13: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

Assignment Matrix

• To better model the scenario we use an assignment matrix Mn×m:

where Mi,j means that the node ni receives the element xj tobe computed on the function f .

• Indexes of the rows are coupled with the nodes, where 1 ≤ i ≤ n;

• Indexes of the columns are associated with the elements of thevector, where 1 ≤ j ≤ m.

Page 14: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

An Example

VM1 VM2 VM3 VM4 VM5

c1 c2 c3 c4 c5t1 t2 t3 t4 t5

Table : Cost and Time Vectors

x1 x2 x3 x4 x5 x6 x7VM1 1 0 1 0 0 1 0VM2 0 1 0 1 1 0 1VM3 1 0 0 1 0 0 1VM4 0 1 0 1 1 0 1VM5 0 1 1 0 1 1 0

Table : Matrix Assignment of the Example

Page 15: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

Modeling The Problem

The total cost Ci of the operations performed by the i-th node:

Ci =m∑j=1

ci ·Mi,j = ci ·m∑j=1

Mi,j (1)

The total time Ti of the i-th node is:

Ti =m∑j=1

ti ·Mi,j = ti ·m∑j=1

Mi,j (2)

We can formulate the problem as a Linear Programming Model:

Minimizen∑

i=1

m∑j=1

ci ·Mi,j (3)

s.t.:

m∑j=1

ti ·Mi,j ≤ Tmax ∀i (4)

Page 16: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

Modeling The Problem

• Each input element can also be processed by a cheater node.

• To this end, we introduce the following equation:

n∑i=1

Mi,j ≥ Repl(j) ∀j (5)

where Repl(j) is the number of elements that has to be replicatedfor each input element j according to the confidence level DetConf .

• The chances of wrong results due to cheater nodes are lower.

• The number of requested replicas is computed using thehypergeometric distribution by considering that at least half of thereplicated elements are given to, and processed by, cheater nodes.

Page 17: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

Number of Replicas

• The hypergeometric distribution describes the probability of ksuccesses in n draws without replacement from a finite population ofsize N containing exactly K successes.

• Given the hypergeometric distribution H(n, h, r), where n is thenumber of nodes and h is the number of cheaters.

• Find the smallest r , i.e. the number of replicated elements for eachinput (i.e., Repl(j)), for which the following equation holds:

max∑i= r

2

P(i) =

(hi

)(n−hr−i

)(nr

) < DetConfLocal (6)

Page 18: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

Modeling The Problem

• Finally, we have to consider the binary condition variables that areused to decide which of the input elements are given to which nodes:

Binary Mi,j ∀i , j (7)

• Once this LP model is solved, if a solution exists, an optimalassignment of input elements to cloud nodes is returned thatsatisfies all the system administrator-imposed requirements.

Page 19: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

Modeling The Problem

That can be written as:

Minimizen∑

i=1

m∑j=1

ci ·Mi,j

Subject tom∑j=1

Mij ≤ Max(i) , 1 ≤ i ≤ n

n∑i=1

Mij ≥ Repl(j) , 1 ≤ j ≤ m

Binary Mi,j , 1 ≤ i ≤ n, 1 ≤ j ≤ m

Table : LP Model

Page 20: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

CheR Implementation

• CheR receives as input the cloud scenario and creates a linearprogramming problem and returns an optimal solution.

• Implemented in Java.

• GLPK used as problem solver.

• It includes also a simulator/validator.

Page 21: CheR: Cheating Resilience in the Cloud via Smart Resource ...

CheR

CheR Implementation

The CheR meta-program takes as input the following parameters:

• the number of nodes n;

• the number of input elements m;

• the confidence level DetConf ;

• the features of available nodes:

time ti required to process a single element,cost per unit ci .

• the time constraints for the termination of the reliable distributedcomputation (Tmax):

this means that every node ni can process at most Max(i)elements.

CheR firstly computes the number of replicas and then it outputs the LPmodel that is later fed to the LP solver.

Page 22: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Part III

Results

Page 23: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Testbed

• To model an Amazon-like cloud, 5 node typologies were tested:

ranging from Medium to XXXLarge according to real(normalized) cost and performance.

• The number of nodes n is set to 100.

20 nodes for each of the 5 typologies.

• 5 static cheaters P, i.e. 5% of the total nodes

• 1,000 input elements m are considered for each round.

Page 24: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Tested Scenarios

• We have depicted four scenarios, where Tmax is set so that nodescan process a different (decreasing) number of elements.

• These four scenarios depict four different alternatives:

1. extreme cost savings;2. moderate balance requirements;3. tight time constraints and cost containment;4. with extremely tight time constraints.

Page 25: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Matrix Assignment 1

Figure : Bitmap representing an actual assignment of elements (x-axis) tonodes (y -axis) for scenario E1 targeted at cost savings.

Page 26: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Matrix Assignment 1

Figure : Bitmap representing an actual assignment of elements (x-axis) tonodes (y -axis) for scenario E2 targeted at moderate cost savings.

Page 27: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Matrix Assignment 3

Figure : Bitmap representing an actual assignment of elements (x-axis) tonodes (y -axis) for scenario E3 targeted at moderate time constraints.

Page 28: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Matrix Assignment 4

Figure : Bitmap representing an actual assignment of elements (x-axis) tonodes (y -axis) for scenario E4 targeted at extremely tight time constraints.

Page 29: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Performance

Both the execution time and memory requirements are O(n ×m)

Figure : Execution Time of the Prototype

Page 30: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Validation Tests

• Several simulation tests that exploit the assignment matrix of E4:

to discover whether there was any wrong result that would beerroneously considered as correct by the collector.

• For each test, each experiment was repeated several times byrandomly choosing the set of k cheaters.

• In the end, for each test, we counted:

1. the number of wrong results;2. the number of tests that contain at least one wrong results.

Page 31: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Validation Tests

• Select a cheating probability for each static cheater P:

i.e. how often a cheater returns a fake result.

• As a consequence, if a cheater returns wrong results more often, itwill be easier to detect it.

• Conversely, if a smaller number of wrong results is returned, then itcan be detected with more difficulty.

• The central collector can discover a cheater by calculating adivergence index for each node:

keeping track of how many times such node has given responsesthat were different from other ones on the same input(percentage of results in which the cheater was in minority).

Page 32: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Validation Tests

P FNT FNE (Avg) FNTC FNEC (Avg)1 0.236% 0.006% (6.1) 0.00375% 0.0025% (2.445)0,9 0.223% 0.0045% (4.5) 0.022% 0.0012% (1.17)0.8 0.228% 0.0038% (3.8) 0.0075% 3.393E-4% (0.34)0.75 0.226% 0.0036% (3.45) 0.0047% 2.144E-4% (0.21)0.66 0.208% 0.0025% (2.5) 0.0011% 5.14E-5% (0.05)0.6 0.2% 0.0021% (2.1) 6.0E-4% 1.267E-5% (0.0127)0.5 0.19% 0.0015% (1.5) 0% 0% (0)0.4 0.17% 9.436E-4% (0.94) 0% 0% (0)0.33 0.16% 6.573E-4% (0.66) 0% 0% (0)0.3 0.16% 5.358E-4% (0.54) 0% 0% (0)0.25 0.14% 3.77E-4% (0.38) 0% 0% (0)0.2 0.12% 2.5E-4% (0.25) 0% 0% (0)0.1 0.05% 6.14E-5% (0.06) 0% 0% (0)

Table : Results of the Validation tests

P cheating probability of static cheaters: how often a cheater cheats;

FNT % of false negative tests: ratio of failed tests without centralized control (at least one wrong results in an experiment);

FNE % of false negative elements: ratio of wrong results without centralized control (in all the experiments);

FNTC % of false negative tests (centralized scenario): ratio of failed tests with centralized control (at least one wrong results in anexperiment).

FNEC % of false negative elements (centralized scenario): ratio of wrong results with centralized control (in all the experiments).

Page 33: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Discussion

• As the maximum execution time is compressed, the workload,including the replicated computations, seamlessly shifts towards thebottom lines, representing the more costly but faster nodes.

• For large sizes of the problem matchmaking this is nontrivial and assuch an automated approach is needed.

• The assignment matrix strives to pair elements to less-costly nodesas soon as the time constraints have been satisfied.

Page 34: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Conclusion

• CheR is a model for reliable execution of workload over a largenumber of heterogeneous computing nodes:

some node can cheat according to some identified models.

• CheR helps sysadmin that are assessing how to distributecomputation over a cloud, to reason about the possible scenarios,available approaches, and their convenience and feasibility.

• CheR provides probabilistic assurance that the result of thedistributed computations is not affected by misbehaving nodes.

• Experimental evidence based on a real-world cloud provider such asAmazon shows the viability of CheR.

Page 35: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Future Work

• Develop resilience approaches against smart cheaters.

• Consider trust of the cheaters:

past behaviors.

• Dynamically change the assignment matrix at each round.

• Further improvements: paper accepted at “The 10th IEEEInternational Conference on Autonomic and Trusted Computing”:

AntiCheetah: an Autonomic Multi-round Approach for ReliableComputing.

Page 36: CheR: Cheating Resilience in the Cloud via Smart Resource ...

Results

Questions?

E-mail:

[email protected]

[email protected]