CheR: Cheating Resilience in the Cloud via Smart Resource Allocation
Roberto Di Pietro, Flavio Lombardi, Fabio Martinelli, Daniele Sgandurra
Dept. of Mathematics and Physics, University Roma Tre, Italy; Institute for Informatics and Telematics, CNR, Italy
FPS 2013, 22/10/2013
Outline
• Background: Cloud Computing; Use Case
• CheR: Modeling the Problem; Current Prototype
• Results: Tests; Conclusion and Future Work
Part I
Background
Computational outsourcing
• is needed when:
  fast complex computations are required;
  very large datasets are to be processed.
• is successful since cloud computing offers:
  cheap computing/storage resources;
  scalable computing resources.
Motivation
However, computational outsourcing does NOT guarantee:
• correctness of computations;
• availability of computations;
• timeliness of results.
In particular, cheating:
• allows computing nodes to save resources;
• if undetected, leads to erroneous results.
Use Case
• A sysadmin needs to rent some cloud resources to perform a parallel task over a large data set.
• Two main possible scenarios exist:
  1. the computation needs to end reliably within a given timeframe, and the sysadmin is willing to spend as little as possible;
  2. the sysadmin has a fixed maximum budget, and she has to reliably compute the function by minimizing the required time.
Use Case
Figure: Use Case: Node N2 is a Cheater
Possible Solution?
• Use only “trusted” nodes:
  no parallelization;
  not cost-effective.
• The main goal here is to guarantee timeliness, cost-effectiveness and correctness of the computed results.
Part II
CheR
Generalizing the Problem (1)
• There are n cloud nodes that host computation.
• At most k nodes are rational adversaries.
• The remaining (n − k) cloud nodes are non-malicious and well-behaving nodes.
• We model the adversary as a static cheater:
  it fakes its computations with a fixed, given probability.
Generalizing the Problem (2)
• Cloud nodes are required to compute a parallel function f over an input vector X of length m.
• The output is a vector of length m whose j-th element is f(x_j).
• We assume that there are n cloud nodes, where each node n_i can compute a subset (possibly overlapping) of the input vector.
• Each node n_i has an associated unitary cost per operation c_i, and the time to perform a unitary operation is t_i.
Generalizing the Problem (3)
• The percentage of cheaters (CheaterRate) is k/n.
• The system administrator has to send multiple (possibly overlapping) subsets of X to the nodes to be confident that the output is correct.
• The confidence threshold is DetConf.
• The goal of the system administrator is to:
  minimize the total cost of the operations,
  given a maximum computation time Tmax,
  with all the results correct with an error probability less than DetConf.
Assignment Matrix
• To better model the scenario we use an assignment matrix M_{n×m}:
  where M_i,j = 1 means that node n_i receives the element x_j to be computed with the function f.
• Row indexes are coupled with the nodes, where 1 ≤ i ≤ n;
• Column indexes are associated with the elements of the input vector, where 1 ≤ j ≤ m.
An Example
       VM1  VM2  VM3  VM4  VM5
cost   c_1  c_2  c_3  c_4  c_5
time   t_1  t_2  t_3  t_4  t_5

Table: Cost and Time Vectors

       x1  x2  x3  x4  x5  x6  x7
VM1     1   0   1   0   0   1   0
VM2     0   1   0   1   1   0   1
VM3     1   0   0   1   0   0   1
VM4     0   1   0   1   1   0   1
VM5     0   1   1   0   1   1   0

Table: Assignment Matrix of the Example
Modeling The Problem
The total cost C_i of the operations performed by the i-th node is:

  C_i = ∑_{j=1}^{m} c_i · M_{i,j} = c_i · ∑_{j=1}^{m} M_{i,j}    (1)

The total time T_i of the i-th node is:

  T_i = ∑_{j=1}^{m} t_i · M_{i,j} = t_i · ∑_{j=1}^{m} M_{i,j}    (2)

We can formulate the problem as a Linear Programming model:

  Minimize ∑_{i=1}^{n} ∑_{j=1}^{m} c_i · M_{i,j}    (3)

  s.t.:

  ∑_{j=1}^{m} t_i · M_{i,j} ≤ Tmax    ∀i    (4)
Modeling The Problem
• Each input element can also be processed by a cheater node.
• To this end, we introduce the following constraint:

  ∑_{i=1}^{n} M_{i,j} ≥ Repl(j)    ∀j    (5)

  where Repl(j) is the number of replicas required for each input element j, according to the confidence level DetConf.
• Replication lowers the chances of wrong results due to cheater nodes.
• The number of requested replicas is computed using the hypergeometric distribution, by bounding the probability that at least half of the replicas of an element are given to, and processed by, cheater nodes.
Number of Replicas
• The hypergeometric distribution describes the probability of k successes in n draws, without replacement, from a finite population of size N containing exactly K successes.
• Consider the hypergeometric distribution H(n, h, r), where n is the number of nodes, h is the number of cheaters, and r is the number of draws.
• Find the smallest r, i.e. the number of replicas for each input element (i.e., Repl(j)), for which the following holds:

  ∑_{i=⌈r/2⌉}^{r} P(i) < DetConfLocal,    where P(i) = C(h, i) · C(n−h, r−i) / C(n, r)    (6)
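This search for the smallest r can be sketched in a few lines of Python using only the standard library (a re-derivation for illustration, not the authors' code; it assumes majority voting over replicas, so a result is wrong only when at least half of the replicas land on cheaters):

```python
from math import comb

def min_replicas(n, h, det_conf):
    """Smallest number of replicas r such that the probability that
    at least half of the r replicas land on cheaters (a hypergeometric
    draw of r nodes out of n, h of them cheaters) is below det_conf."""
    for r in range(1, n + 1):
        total = comb(n, r)
        # P(at least ceil(r/2) of the r drawn nodes are cheaters)
        p_bad = sum(comb(h, i) * comb(n - h, r - i)
                    for i in range((r + 1) // 2, min(r, h) + 1)) / total
        if p_bad < det_conf:
            return r
    return None  # no feasible replication level

# The paper's testbed: 100 nodes, 5 of them static cheaters
print(min_replicas(100, 5, 0.01))  # → 3
```

Note the parity effect: an even r wastes a replica on possible ties (exactly half cheaters already counts as a failure), so odd replication levels tend to be the ones selected.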
Modeling The Problem
• Finally, we have to consider that binary decision variables are used to decide which of the input elements are given to which nodes:

  Binary M_{i,j}    ∀i, j    (7)

• Once this LP model is solved, if a solution exists, an optimal assignment of input elements to cloud nodes is returned that satisfies all the requirements imposed by the system administrator.
Modeling The Problem
That can be written as:

  Minimize ∑_{i=1}^{n} ∑_{j=1}^{m} c_i · M_{i,j}

  Subject to:
    ∑_{j=1}^{m} M_{i,j} ≤ Max(i),    1 ≤ i ≤ n
    ∑_{i=1}^{n} M_{i,j} ≥ Repl(j),   1 ≤ j ≤ m
    Binary M_{i,j},    1 ≤ i ≤ n, 1 ≤ j ≤ m

Table: LP Model
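For intuition on what the solver does, the binary program above can be solved exhaustively on a toy instance. This is only an illustrative Python sketch (the instance, names, and brute-force strategy are ours; the actual prototype feeds the model to GLPK):

```python
from itertools import product

def solve_assignment(costs, max_elems, repl):
    """Exhaustively minimize sum_i c_i * sum_j M_ij subject to the
    row caps Max(i) and the column replication levels Repl(j).
    Only viable for tiny n, m; a real deployment uses an LP solver."""
    n, m = len(costs), len(repl)
    best_cost, best_matrix = None, None
    for bits in product((0, 1), repeat=n * m):
        matrix = [bits[i * m:(i + 1) * m] for i in range(n)]
        if any(sum(row) > cap for row, cap in zip(matrix, max_elems)):
            continue  # violates the time-derived cap Max(i)
        if any(sum(matrix[i][j] for i in range(n)) < repl[j] for j in range(m)):
            continue  # violates the replication constraint Repl(j)
        cost = sum(c * sum(row) for c, row in zip(costs, matrix))
        if best_cost is None or cost < best_cost:
            best_cost, best_matrix = cost, matrix
    return best_cost, best_matrix

# 3 nodes with unit costs 1, 2, 3; 3 elements; 2 replicas per element
cost, M = solve_assignment(costs=[1, 2, 3], max_elems=[2, 2, 3], repl=[2, 2, 2])
print(cost)  # → 12
```

As expected, the optimum saturates the cheap nodes first (two elements each on the cost-1 and cost-2 nodes) and spills the remaining replicas onto the most expensive node.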
CheR Implementation
• CheR receives as input the cloud scenario, creates a linear programming problem, and returns an optimal solution.
• Implemented in Java.
• GLPK is used as the problem solver.
• It also includes a simulator/validator.
CheR Implementation
The CheR meta-program takes as input the following parameters:
• the number of nodes n;
• the number of input elements m;
• the confidence level DetConf ;
• the features of available nodes:
  the time t_i required to process a single element,
  the cost per unit c_i;
• the time constraint for the termination of the reliable distributed computation (Tmax):
  this means that every node n_i can process at most Max(i) elements.
CheR first computes the number of replicas and then outputs the LP model, which is later fed to the LP solver.
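The Max(i) bound follows directly from Tmax: a node with unit time t_i can complete at most ⌊Tmax / t_i⌋ elements before the deadline. A one-line sketch (function and variable names are illustrative, not from the prototype):

```python
def max_elements(t_max, times):
    """Max(i) = floor(Tmax / t_i): how many unit operations node i
    can complete within the deadline Tmax."""
    return [int(t_max // t) for t in times]

print(max_elements(10.0, [1.0, 2.0, 3.0, 4.0, 7.0]))  # → [10, 5, 3, 2, 1]
```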
Part III
Results
Testbed
• To model an Amazon-like cloud, 5 node types were tested:
  ranging from Medium to XXXLarge, with real (normalized) costs and performance.
• The number of nodes n is set to 100:
  20 nodes for each of the 5 types.
• 5 nodes are static cheaters with cheating probability P, i.e. 5% of the total nodes.
• m = 1,000 input elements are considered for each round.
Tested Scenarios
• We devised four scenarios, where Tmax is set so that nodes can process a different (decreasing) number of elements.
• These four scenarios depict four different alternatives:
  1. extreme cost savings;
  2. moderate balance requirements;
  3. tight time constraints and cost containment;
  4. extremely tight time constraints.
Matrix Assignment 1
Figure: Bitmap representing an actual assignment of elements (x-axis) to nodes (y-axis) for scenario E1, targeted at cost savings.
Matrix Assignment 2
Figure: Bitmap representing an actual assignment of elements (x-axis) to nodes (y-axis) for scenario E2, targeted at moderate cost savings.
Matrix Assignment 3
Figure: Bitmap representing an actual assignment of elements (x-axis) to nodes (y-axis) for scenario E3, targeted at moderate time constraints.
Matrix Assignment 4
Figure: Bitmap representing an actual assignment of elements (x-axis) to nodes (y-axis) for scenario E4, targeted at extremely tight time constraints.
Performance
Both the execution time and memory requirements are O(n × m).

Figure: Execution Time of the Prototype
Validation Tests
• We ran several simulation tests that exploit the assignment matrix of E4:
  to discover whether any wrong result would be erroneously considered as correct by the collector.
• For each test, each experiment was repeated several times by randomly choosing the set of k cheaters.
• In the end, for each test, we counted:
  1. the number of wrong results;
  2. the number of tests that contain at least one wrong result.
Validation Tests
• We select a cheating probability P for each static cheater:
  i.e. how often a cheater returns a fake result.
• As a consequence, if a cheater returns wrong results more often, it is easier to detect.
• Conversely, if a smaller number of wrong results is returned, the cheater is harder to detect.
• The central collector can discover a cheater by calculating a divergence index for each node:
  keeping track of how many times the node has given responses that differ from the other nodes' responses on the same input (the percentage of results for which the node was in the minority).
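Such a divergence index could be computed as follows (a minimal sketch; the result layout and names are assumed, not taken from the prototype):

```python
from collections import Counter

def divergence_index(results):
    """results: {element_id: {node_id: returned_value}}.
    Returns, per node, the fraction of its answers that were in the
    minority, i.e. differed from the majority answer for that element."""
    in_minority, answered = Counter(), Counter()
    for answers in results.values():
        majority = Counter(answers.values()).most_common(1)[0][0]
        for node, value in answers.items():
            answered[node] += 1
            if value != majority:
                in_minority[node] += 1
    return {node: in_minority[node] / answered[node] for node in answered}

results = {
    "x1": {"N1": 4, "N2": 7, "N3": 4},  # N2 disagrees with the majority
    "x2": {"N1": 9, "N2": 9, "N3": 9},
}
print(divergence_index(results)["N2"])  # → 0.5
```

A node with a persistently high index is flagged as a likely cheater; honest nodes stay near zero as long as replicas are rarely cheater-dominated.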
Validation Tests
P      FNT      FNE (Avg)           FNTC       FNEC (Avg)
1      0.236%   0.006% (6.1)        0.00375%   0.0025% (2.445)
0.9    0.223%   0.0045% (4.5)       0.022%     0.0012% (1.17)
0.8    0.228%   0.0038% (3.8)       0.0075%    3.393E-4% (0.34)
0.75   0.226%   0.0036% (3.45)      0.0047%    2.144E-4% (0.21)
0.66   0.208%   0.0025% (2.5)       0.0011%    5.14E-5% (0.05)
0.6    0.2%     0.0021% (2.1)       6.0E-4%    1.267E-5% (0.0127)
0.5    0.19%    0.0015% (1.5)       0%         0% (0)
0.4    0.17%    9.436E-4% (0.94)    0%         0% (0)
0.33   0.16%    6.573E-4% (0.66)    0%         0% (0)
0.3    0.16%    5.358E-4% (0.54)    0%         0% (0)
0.25   0.14%    3.77E-4% (0.38)     0%         0% (0)
0.2    0.12%    2.5E-4% (0.25)      0%         0% (0)
0.1    0.05%    6.14E-5% (0.06)     0%         0% (0)

Table: Results of the Validation Tests
P: cheating probability of the static cheaters, i.e. how often a cheater cheats;
FNT: % of false negative tests, i.e. the ratio of failed tests without centralized control (at least one wrong result in an experiment);
FNE: % of false negative elements, i.e. the ratio of wrong results without centralized control (over all the experiments);
FNTC: % of false negative tests in the centralized scenario, i.e. the ratio of failed tests with centralized control (at least one wrong result in an experiment);
FNEC: % of false negative elements in the centralized scenario, i.e. the ratio of wrong results with centralized control (over all the experiments).
Discussion
• As the maximum execution time is compressed, the workload, including the replicated computations, seamlessly shifts towards the bottom rows, which represent the more costly but faster nodes.
• For large problem sizes this matchmaking is nontrivial, and as such an automated approach is needed.
• The assignment matrix strives to pair elements with less costly nodes, as long as the time constraints are satisfied.
Conclusion
• CheR is a model for the reliable execution of a workload over a large number of heterogeneous computing nodes:
  some nodes can cheat, according to identified cheating models.
• CheR helps sysadmins who are assessing how to distribute a computation over a cloud to reason about the possible scenarios, the available approaches, and their convenience and feasibility.
• CheR provides probabilistic assurance that the result of the distributed computations is not affected by misbehaving nodes.
• Experimental evidence based on a real-world cloud provider such as Amazon shows the viability of CheR.
Future Work
• Develop resilience approaches against smart cheaters.
• Consider trust of the cheaters:
past behaviors.
• Dynamically change the assignment matrix at each round.
• Further improvements: paper accepted at “The 10th IEEE International Conference on Autonomic and Trusted Computing”:
  AntiCheetah: an Autonomic Multi-round Approach for Reliable Computing.