Page 1: Smart Redundancy for Distributed Computation

Smart Redundancy for Distributed Computation

George Edwards, Blue Cell Software, LLC
Yuriy Brun, University of Washington
Jae young Bang, University of Southern California
Nenad Medvidovic, University of Southern California

Page 2: Smart Redundancy for Distributed Computation

Distributed Computation Architectures (DCAs)

• Solve large computational problems and/or process large data sets
• Provide a platform and API for applications
• Transparently parallelize computation across a pool of computers
• Examples:
  – Clouds
  – Grids
  – Volunteer computing

Page 3: Smart Redundancy for Distributed Computation

DCA Applications

• Highly parallelizable problems
  – Find the 10¹⁰⁰th digit of π
  – Factor 2²⁰¹¹ − 1
• Driven by:
  – Basic research
  – Pharmaceutical applications
  – Web analytics
  – …

Page 4: Smart Redundancy for Distributed Computation

Volunteer Computing

• Attempts to leverage the more than 1 billion (mostly idle) machines on the Internet
  – Volunteers install a client
  – When idle, the client requests work from a server and sends back results
• Aids projects that have limited funding but large public appeal

Page 5: Smart Redundancy for Distributed Computation

Dealing with Faults

• Context:
  – Volunteers may fail or maliciously return false results
  – Volunteers are not accountable
  – Malicious volunteers may collude
  – Well-formed but incorrect results are hard to detect
  – The reliability of volunteers is difficult to estimate
• Solution:
  – Redundancy and voting

Page 6: Smart Redundancy for Distributed Computation

System Model

• A task server subdivides computations into tasks
• The task server replicates each task into multiple identical jobs
• The task server assigns each job to a node in the node pool
• Nodes perform work, send results, and rejoin the pool
• New volunteer nodes may join the pool while other nodes may leave

Page 7: Smart Redundancy for Distributed Computation

k-vote Traditional Redundancy (TR)

• Performs k independent executions of each task
• Takes a vote on the correctness of the result
• Requires expending a factor of k resources or suffering a factor of k slowdown in performance

Example

• k = 19
• r = 0.7
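The reliability of a k-vote can be estimated with a short calculation. The sketch below is illustrative and not taken from the paper: it assumes k is odd, that each job is correct independently with probability r, and that all incorrect results agree with one another (the worst case, for example under collusion). The function name `tr_reliability` is hypothetical.

```python
from math import comb


def tr_reliability(k: int, r: float) -> float:
    """Probability that a majority of k independent jobs is correct.

    Assumptions (not from the slides): k is odd, jobs are independent,
    each is correct with probability r, and all wrong results agree.
    """
    majority = k // 2 + 1  # votes needed to win, e.g. 10 of 19
    return sum(
        comb(k, i) * r**i * (1 - r) ** (k - i)
        for i in range(majority, k + 1)
    )


if __name__ == "__main__":
    # Example from the slide: k = 19, r = 0.7
    print(f"TR reliability: {tr_reliability(19, 0.7):.4f}")
```

Under these assumptions the cost is always exactly k jobs per task, regardless of how the votes fall.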

Page 8: Smart Redundancy for Distributed Computation

Insights

• Redundant computations need not be simultaneous
• DCAs can dynamically adjust the level of redundancy based on run-time information
• k-vote traditional redundancy wastes computations

Example

• 19 independent computations (k = 19)
• 70% node reliability (r = 0.7)
• (0.7)¹⁰ ≈ 2.8% of the time, the first 10 of them will return the correct result
• In that case, the last 9 results are irrelevant
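To get a feel for how much work traditional redundancy wastes, one can compute when the vote is effectively decided. The sketch below assumes results are examined one at a time, that every result is either correct or wrong (so all wrong results agree), and that nodes are independent with reliability r. The function names are hypothetical.

```python
from math import comb


def decision_point_distribution(k: int, r: float) -> dict[int, float]:
    """Map n -> probability that the k-vote is decided by exactly the n-th result.

    The vote is decided once either side holds a majority (m = k // 2 + 1).
    Assumes binary, independent results and an odd k.
    """
    m = k // 2 + 1
    return {
        n: comb(n - 1, m - 1)
        * (r**m * (1 - r) ** (n - m) + (1 - r) ** m * r ** (n - m))
        for n in range(m, k + 1)
    }


def expected_wasted_jobs(k: int, r: float) -> float:
    """Expected number of results that arrive after the vote is already decided."""
    dist = decision_point_distribution(k, r)
    return sum((k - n) * p for n, p in dist.items())


if __name__ == "__main__":
    dist = decision_point_distribution(19, 0.7)
    # dist[10] covers a unanimous first 10 (correct or wrong);
    # the correct-only share is 0.7**10, the figure quoted in the example.
    print(f"decided after 10 results: {dist[10]:.4f}")
    print(f"expected wasted jobs:     {expected_wasted_jobs(19, 0.7):.2f}")
```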

Page 9: Smart Redundancy for Distributed Computation

k-vote Progressive Redundancy (PR)

• Distributes jobs in waves

• In each wave, distributes the minimum jobs needed to produce a consensus (assuming all agree)

• Repeats until a consensus is reached

Example

• k = 19
• r = 0.7
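The wave structure can be sketched as a small simulation. This is an illustration under stated assumptions, not the authors' implementation: results are binary (correct or wrong, so all wrong results agree), nodes are independent with reliability r, and a consensus means (k + 1) / 2 matching results. The function name and the `rng` parameter are hypothetical.

```python
import random


def progressive_redundancy(k: int, r: float, rng: random.Random) -> tuple[bool, int]:
    """Simulate one task under k-vote progressive redundancy.

    Returns (consensus_is_correct, jobs_used).
    """
    needed = k // 2 + 1  # matching results required for a consensus
    correct = wrong = 0
    while max(correct, wrong) < needed:
        # Minimum jobs that would reach consensus if every one of them
        # agreed with the current leading result.
        wave = needed - max(correct, wrong)
        for _ in range(wave):
            if rng.random() < r:
                correct += 1
            else:
                wrong += 1
    return correct > wrong, correct + wrong


if __name__ == "__main__":
    rng = random.Random(0)
    runs = [progressive_redundancy(19, 0.7, rng) for _ in range(10_000)]
    avg_jobs = sum(jobs for _, jobs in runs) / len(runs)
    reliability = sum(ok for ok, _ in runs) / len(runs)
    print(f"average jobs per task: {avg_jobs:.2f} (TR always uses 19)")
    print(f"observed reliability:  {reliability:.4f}")
```

In this sketch a task never uses more than k jobs, because the losing side always holds fewer than (k + 1) / 2 votes when the winner reaches the threshold.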

Page 10: Smart Redundancy for Distributed Computation

Insights

• The confidence level associated with a result can be computed

• k-vote progressive redundancy produces results with varying confidence

Example

• k = 19, r = 0.7

• If the vote is 10-0, confidence level ≈ 99.98%

• If the vote is 10-9, confidence level = 70%
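One way to compute such a confidence level is a Bayesian estimate. The sketch below assumes only two possible results, no prior preference between them, and independent nodes that each return the correct result with probability r. These assumptions and the name `vote_confidence` are illustrative and not taken from the slides.

```python
def vote_confidence(votes_for: int, votes_against: int, r: float) -> float:
    """Probability that the majority result is correct, given the vote split.

    Assumes binary results, a uniform prior over the two candidates,
    and independent nodes with reliability r.
    """
    a, b = votes_for, votes_against
    # Likelihood of this split if the majority result is correct ...
    if_majority_right = r**a * (1 - r) ** b
    # ... and if the minority result is the correct one.
    if_majority_wrong = r**b * (1 - r) ** a
    return if_majority_right / (if_majority_right + if_majority_wrong)


if __name__ == "__main__":
    print(f"10-0 vote: {vote_confidence(10, 0, 0.7):.4%}")
    print(f"10-9 vote: {vote_confidence(10, 9, 0.7):.4%}")
```

With r = 0.7 this gives roughly 99.98% for a 10-0 vote and 70% for a 10-9 vote, matching the example above. Under these assumptions the expression simplifies to a function of the margin (votes for minus votes against) alone.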

Page 11: Smart Redundancy for Distributed Computation

Iterative Redundancy (IR)

• Distributes jobs in waves

• In each wave, distributes the minimum jobs required to achieve a desired confidence level

• Repeats until desired confidence level is reached

Example

• d = 4
• r = 0.7
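A minimal simulation of this loop follows. It rests on assumptions the slides do not spell out: results are binary, nodes are independent with reliability r, and the desired confidence level corresponds to a vote margin of d (majority votes minus minority votes). The function name and `rng` parameter are hypothetical.

```python
import random


def iterative_redundancy(d: int, r: float, rng: random.Random) -> tuple[bool, int]:
    """Simulate one task under iterative redundancy with target margin d.

    Returns (accepted_result_is_correct, jobs_used).
    """
    correct = wrong = 0
    while abs(correct - wrong) < d:
        # Minimum jobs that would reach margin d if all of them
        # agreed with the current leading result.
        wave = d - abs(correct - wrong)
        for _ in range(wave):
            if rng.random() < r:
                correct += 1
            else:
                wrong += 1
    return correct > wrong, correct + wrong


if __name__ == "__main__":
    rng = random.Random(0)
    runs = [iterative_redundancy(4, 0.7, rng) for _ in range(10_000)]
    avg_jobs = sum(jobs for _, jobs in runs) / len(runs)
    reliability = sum(ok for ok, _ in runs) / len(runs)
    print(f"average jobs per task: {avg_jobs:.2f}")
    print(f"observed reliability:  {reliability:.4f}")
```

Unlike the k-vote schemes, this sketch has no fixed upper bound on jobs per task: it keeps issuing waves until the margin reaches d, so unlucky tasks receive more redundancy and easy ones stop after d jobs.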

Page 12: Smart Redundancy for Distributed Computation

Algorithm Comparison

• System reliability approaches 1 exponentially for TR, PR, and IR
• IR produces the same reliability at a lower cost
  – Or, equivalently, higher reliability at the same cost
• IR is optimal with respect to cost
  – Guaranteed to use the minimum computation needed to achieve desired system reliability

[Figure: two plots; axis labels "Cost Factor" and "System Reliability"]

Page 13: Smart Redundancy for Distributed Computation

Algorithm Comparison

• PR and IR perform best when the reliability of the node pool is high

[Figure: plot; axis labels "Node Reliability" and "Ratio Improvement Over Traditional Recovery"]

Page 14: Smart Redundancy for Distributed Computation

Adaptive Behavior

• IR maintains a constant system reliability as node reliability fluctuates
  – Injects redundancy where it is needed
    • “Unlucky” situations
  – Removes redundancy where it is unnecessary

[Figure: three plots over "Time": "Node Reliability", "Cost Factor", and "System Reliability"]

Page 15: Smart Redundancy for Distributed Computation

Node Reliability Estimation

• Incorrectly estimating node reliability does not affect the performance of IR

[Figure: plot; axis labels "Cost Factor" and "System Reliability"]

Page 16: Smart Redundancy for Distributed Computation

Conclusions

• Iterative redundancy automatically replicates computation with optimal efficiency

• Iterative redundancy can be used when:
  – A computation can be broken down into independent tasks
  – Computation is performed by a pool of independent processing resources
  – Task deployment decisions can be made at runtime
  – The reliability of resources in the pool is unknown

Page 17: Smart Redundancy for Distributed Computation

For More Information

To appear in ICDCS 2011:

Smart Redundancy for Distributed Computation
by Yuriy Brun, George Edwards, Jae young Bang, and Nenad Medvidovic

http://www.cs.washington.edu/homes/brun/pubs/pubs/Brun11icdcs.pdf