Heading Off Correlated Failures through Independence-as-a ... · (Cloud1) Data Source2 (Cloud2)...

Post on 20-Jul-2020

7 views 0 download

Transcript of Heading Off Correlated Failures through Independence-as-a ... · (Cloud1) Data Source2 (Cloud2)...

Heading Off Correlated Failures through Independence-as-a-Service

Ennan Zhai1

Ruichuan Chen2, David Isaac Wolinsky1, Bryan Ford1

1 Yale University 2 Bell Labs/Alcatel-Lucent

• Cloud services ensure reliability by redundancy:- Amazon S3 replicates data on multiple racks- iCloud rents EC2 and Azure redundantly

Background

• Cloud services ensure reliability by redundancy:- Amazon S3 replicates data on multiple racks- iCloud rents EC2 and Azure redundantly

Background

• Cloud services ensure reliability by redundancy:- Amazon S3 replicates data on multiple racks- iCloud rents EC2 and Azure redundantly

Background

• Cloud services ensure reliability by redundancy:- Amazon S3 replicates data on multiple racks- iCloud rents EC2 and Azure redundantly

Background

Unexpected common dependencies!

Service Outage Losses

Data Center Outages Generate Big LossesDowntime in a data center can cost an average of $505,500 per incident, according to a Ponemon Institute study.

Analytics Slideshow: 2010 Data Center

Operational Trends Report

Service Outage Losses

Data Center Outages Generate Big LossesDowntime in a data center can cost an average of $505,500 per incident, according to a Ponemon Institute study.

Analytics Slideshow: 2010 Data Center

Operational Trends Report

What is correlated failure?

Rack1

Switch1

Rack2

Switch2

Rack3

Switch3

What is correlated failure?

Rack1

Switch1

Rack2

Switch2

Rack3

Switch3

Primary Backup Backup

What is correlated failure?

Agg Switch

Rack1

Switch1

Rack2

Switch2

Rack3

Switch3

Primary Backup Backup

What is correlated failure?

Agg Switch

Rack1

Switch1

Rack2

Switch2

Rack3

Switch3

Primary Backup Backup

What is correlated failure?

Agg Switch

Rack1

Switch1

Rack2

Switch2

Rack3

Switch3

Primary Backup Backup

What is correlated failure?

Realistic Example

Summary of the October 22, 2012 AWS Service Event in the US-East Region

We’d like to share more about the service event that occurred on Monday, October 22nd in the US-East Region. We have now completed the analysis of the events that affected AWS customers, and we want to describe what happened, our understanding of how customers were affected, and what we are doing to prevent a similar issue from occurring in the future.

The Primary Event and the Impact to Amazon Elastic Block Store (EBS) and Amazon Elastic Compute Cloud (EC2)

Realistic Example

Correlated failures resulting from EBSdue to bugs in one EBS server

Elastic Compute Cloud (EC2)

Elastic Block Store (EBS)

Realistic Example

... ...

Elastic Block Store (EBS)

Realistic Example

... ...... ...

VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4

Elastic Block Store (EBS)

Realistic Example

... ...... ...

VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4

EBS Server2EBS Server1

Realistic Example

... ...... ...

VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4

EBS Server2EBS Server1

Realistic Example

... ...... ...

VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4

EBS Server2EBS Server1

Realistic Example

... ...... ...

VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4

EBS Server2EBS Server1

Realistic Example

Even Worse

Video App

Cloud Provider A Cloud Provider B

Video App

Cloud Provider A Cloud Provider B

Third-party infrastructure components

Video App

Cloud Provider A Cloud Provider B

Third-party infrastructure components

ISP Router BISP Router A ISP Router C

Video App

Cloud Provider A Cloud Provider B

Third-party infrastructure components

Power Source

ISP Router BISP Router A ISP Router C

Video App

Cloud Provider A Cloud Provider B

Third-party infrastructure components

Power Source

ISP Router BISP Router A ISP Router C

Video App

Cloud Provider A Cloud Provider B

Third-party infrastructure components

Power Source

ISP Router BISP Router A ISP Router C

Video App

Cloud Provider A Cloud Provider B

Third-party infrastructure components

Power Source

ISP Router BISP Router A ISP Router C

Cloud providers do not usually share

information about their dependencies

• Cloud providers allocate or tolerate failures via: - diagnosis systems;- fault-tolerant systems.

Existing Efforts

• Solving the problem after outage occurs.

• We want to prevent the problem before the outage occurs.

Existing Efforts

• Cloud providers allocate or tolerate failures via: - diagnosis systems;- fault-tolerant systems.

• Solving the problem after outage occurs.

• Prevent correlated failures before outage occurs.

Existing Efforts

• Cloud providers allocate or tolerate failures via: - diagnosis systems;- fault-tolerant systems.

• Solving the problem after outage occurs.

• Prevent correlated failures before outage occurs.

Existing Efforts

Independence-as-a-Service(INDaaS)

• Cloud providers allocate or tolerate failures via: - diagnosis systems;- fault-tolerant systems.

INDaaS

INDaaS Workflow

Service Provider, Alice

A Given Redundancy Configuration

INDaaS

INDaaS Workflow

Two-Way Redundancy Configuration

Service Provider, Alice

DependencyData Source1

DependencyData Source2

INDaaS

INDaaS Workflow

Service Provider, Alice

DependencyData Source1

DependencyData Source2

Two-Way Redundancy Configuration

Independence of this two-way redundancy ?

INDaaS

Step1: Specification Submission

DependencyData Source1

DependencyData Source2

INDaaS Workflow

Service Provider, Alice

INDaaS

Step1

DependencyData Source1

DependencyData Source2

Step2: Dependency data collection

Step2: Dependency data collection

INDaaS Workflow

Service Provider, Alice

INDaaS

Step1

Step2 Step2

DependencyData Source1

DependencyData Source2

Step3: Independence Evaluation

INDaaS Workflow

Service Provider, Alice

INDaaS

Step1

Relative Independence is 0.3

Step2 Step2

Step3

DependencyData Source1

DependencyData Source2

INDaaS Workflow

Service Provider, Alice

INDaaS

Step1

Step2 Step2

Multiple Cloud Providers

Step3

DependencyData Source1

DependencyData Source2

Service Provider, Alice

INDaaS

Step1

Step2 Step2Unwilling to share the

dependency dataUnwilling to share the

dependency data

Step3

Multiple Cloud Providers

✘ ✘

Data Source1(Cloud1)

Data Source2(Cloud2)

Service Provider, Alice

INDaaS

Step1

Step2 Step2

Private Independence Evaluation

Data Source1(Cloud1)

Data Source2(Cloud2)

Step3:Private auditing

Service Provider, Alice

INDaaS

Step1

Step2 Step2

Step3Step4 St

ep4

Service Provider, Alice

Private Independence Evaluation

Data Source1(Cloud1)

Data Source2(Cloud2)

INDaaS

Step1

Step2 Step2

Step3

Service Provider, Alice

Private Independence Evaluation

Only know my information

Only know my information

Data Source1(Cloud1)

Data Source2(Cloud2)

Step4 Step

4

INDaaS

Step1

Step2 Step2

Step3

Only the relative independence

Service Provider, Alice

Private Independence Evaluation

Data Source1(Cloud1)

Data Source2(Cloud2)

Step4 Step

4

INDaaS

Step1

Step2 Step2

Step3

Step5

Service Provider, Alice

Private Independence Evaluation

Data Source1(Cloud1)

Data Source2(Cloud2)

Step4 Step

4

Technical Challenges

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #1: Dependency collections- Solution: Reusing existing tools

Technical Challenges

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #2: Dependency representation- Solution: Fault graphs

• #1: Dependency collections- Solution: Reusing existing tools

Technical Challenges

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #2: Dependency representation- Solution: Fault graphs

• #1: Dependency collections- Solution: Reusing existing tools

Technical Challenges

• #3: Efficient auditing- Solution: Failure sampling algorithm

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

Technical Challenges

• #4: Private independence audit- Solution: Private Jaccard similarity

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #2: Dependency representation- Solution: Fault graphs

INDaaS

Step1 Step5

Step3

• #4: Private independence audit- Solution: Private Jaccard similarity

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

RoadMap

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #2: Dependency representation- Solution: Fault graphs

INDaaS

Step1 Step5

Step3

• #4: Private independence audit- Solution: Private Jaccard similarity

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

RoadMap

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #2: Dependency representation- Solution: Fault graphs

Dependency Data Collections

Type Dependency Expression

Network <src=”S” dst=”D” route=”x,y,z”/>

Hardware <hw=”H” type=”T” dep=”x”/>

Software <pgm=”S” hw=”H” dep=”x,y,z”/>

Our defined format

• Reuse existing data collection tools: - Convert the outputs to uniform format. - Three types of format: NET, HW and SW.

Please see our paper for more details

• #2: Dependency representation- Solution: Fault graphs

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

RoadMap

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #4: Private independence audit- Solution: Private Jaccard similarity

• #2: Dependency representation- Solution: Fault graphs

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

RoadMap

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #4: Private independence audit- Solution: Private Jaccard similarity

Example Redundancy

Example Redundancy

SW

HW

NET

Example Redundancy

Building Fault Graph Top-to-Bottom

SW

HW

NET

Redundancy configuration fails

Step1: Root Node

Server 2 failsServer 1 fails

Redundancy configuration fails

Step2: Server Nodes

AND gate: all the sublayer nodes fail, the upper layer node fails

Server 2 failsServer 1 fails

Step2: Server NodesRedundancy configuration fails

Net fails

+" +"

HW fails SW fails SW fails

Server 2 failsServer 1 fails

Net fails HW fails

Redundancy configuration fails

Step3: Dependency Nodes

OR gate: one of the sublayer nodes fails, the upper layer node fails

Net fails

+" +"

HW fails SW fails SW fails

Server 2 failsServer 1 fails

Net fails HW fails

Step3: Dependency NodesRedundancy configuration fails

Net fails

+" +"

HW fails

Disk2CPU2

+"

SW fails SW fails

Server 2 failsServer 1 fails

Disk1CPU1

+"

Net fails HW fails

Step4: Hardware DependencyRedundancy configuration fails

Net fails

+" +"

ToR1 Core2Core1

HW fails

+" +"

Disk2CPU2

+"

Path1 Path2

SW fails SW fails

Server 2 failsServer 1 fails

Disk1CPU1

+"

Net fails HW fails

Step5: Network DependencyRedundancy configuration fails

Net fails

+" +"

ToR1 Core2Core1

HW fails

libc6 libccllibsvnl

+" +"

Disk2CPU2

+"

Path1 Path2

+"

Riak

+"

Query

+"

SW fails SW fails

Server 2 failsServer 1 fails

Disk1CPU1

+"

Net fails HW fails

Step6: Software DependencyRedundancy configuration fails

• #2: Dependency representation- Solution: Fault graphs

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

RoadMap

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #4: Private independence audit- Solution: Private Jaccard similarity

• #2: Dependency representation- Solution: Fault graphs

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

RoadMap

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #4: Private independence audit- Solution: Private Jaccard similarity

• Two algorithms balancing cost and accuracy: - Minimal fault set algorithm- Failure sampling algorithm

Efficient Auditing

• Two algorithms balancing cost and accuracy: - Minimal fault set algorithm- Failure sampling algorithm

Efficient Auditing

Minimal Fault Set Algorithm

• Traditional algorithm in safety engineering- Exponential complexity (NP-hard)

• We are the first to apply it in Cloud area:- Analyzing a fat tree with 30,528 with ~40 hours

• We propose efficient failure sampling algorithm.

Minimal Fault Set Algorithm

• Traditional algorithm in safety engineer- Exponential complexity (NP-hard)

• We are the first to apply it in Cloud area:- Analyzing a fat tree with 30,528 with ~40 hours

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

Failure Sampling Algorithm

1 or 0 1 or 0 1 or 0

Failure Sampling Algorithm

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

1 or 0 ?

1 or 0 1 or 0 1 or 0

Failure Sampling Algorithm

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

Fault Sets

1 or 0 1 or 0 1 or 0

Failure Sampling Algorithm

1 or 0 ?Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

The 1st Sampling Round

Fault Sets

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

1 or 0 1 or 0 1 or 0

The 1st Sampling Round

Fault Sets

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

✔✘ ✘

The 1st Sampling Round

Fault Sets

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

✔✘ ✘

The 1st Sampling Round

Fault Sets

✘ ✘

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

✔✘ ✘Fault Sets

{Server1’s HW, Server2’s HW}

The 1st Sampling Round

✘ ✘

Fault Sets

{Server1’s HW, Server2’s HW}

The 2nd Sampling Round

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

1 or 0 1 or 0 1 or 0

Fault Sets

{Server1’s HW, Server2’s HW}

The 2nd Sampling Round

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

✔ ✘✔Fault Sets

{Server1’s HW, Server2’s HW}

The 2nd Sampling Round

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

✔ ✘✔

Fault Sets

{Server1’s HW, Server2’s HW}

The 2nd Sampling Round

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

✔ ✘

Fault Sets

{Server1’s HW, Server2’s HW}

The 3rd Sampling Round

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

✔✘✔Fault Sets

{Server1’s HW, Server2’s HW}

The 3rd Sampling Round

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

✔✘✔Fault Sets

{Server1’s HW, Server2’s HW}

The 3rd Sampling Round

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

✘ ✘

✔✘✔Fault Sets

{Server1’s HW, Server2’s HW}

{Switch1}

The 3rd Sampling Round

Redundancy configuration fails

+" +"

Switch1 fails

Server 2 failsServer 1 fails

Server2’s HW failsServer1’s HW fails

✘ ✘

Fault Sets{Server1’s HW, Server2’s HW}

{Switch1}

{Switch1, Server2’s HW}

{Switch1}

{Switch1, Server2’s HW}

... ...

After Many (e.g., 107) Rounds

Fault Sets{Server1’s HW, Server2’s HW}

{Switch1}

{Switch1, Server2’s HW}

{Switch1}

{Switch1, Server2’s HW}

... ...

Size-Based Ranking

Fault Sets{Switch1}

{Switch1}

{Switch1, Server2’s HW}

{Switch1, Server2’s HW}

{Server1’s HW, Server2’s HW}

... ...

Size-Based Ranking

• Multiple equations for option: - summation of sizes- weighted average of sizes

Independence Evaluation

Fault Sets{Switch1}

{Switch1}

{Switch1, Server2’s HW}

{Switch1, Server2’s HW}

{Server1’s HW, Server2’s HW}

... ...

• #2: Dependency representation- Solution: Fault graphs

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

RoadMap

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #4: Private independence audit- Solution: Private Jaccard similarity

• #2: Dependency representation- Solution: Fault graphs

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

RoadMap

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #4: Private independence audit- Solution: Private Jaccard similarity

• #2: Dependency representation- Solution: Fault graphs

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

RoadMap

INDaaS

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

Data Source1(Cloud1)

Data Source2(Cloud2)

• #4: Private independence audit- Solution: Private Jaccard similarity

Cloud A Cloud B Cloud C

INDaaS AgentService Provider

Cloud A Cloud B Cloud C

INDaaS Agent

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent

Trusted Third-Party

ISP A Power B ISP B Power CPower A

Service Provider

Service Provider

Cloud A Cloud B Cloud C

Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent

Trusted Third-Party

Cloud providers are reluctant to share this information!

ISP A Power B ISP B Power CPower A

Cloud A Cloud B Cloud C

Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent

Secure Multiparty Computation (SMPC)

SMPC

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent

Secure Multiparty Computation (SMPC)

SMPCSMPC is hard to scale![Xiao et al. CCSW’13]

ISP A Power B ISP B Power CPower A

Cloud A Cloud B Cloud C

Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

INDaaS Agent

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

Evaluating independence by the dataset similarity between clouds

INDaaS Agent

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

INDaaS Agent

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS AgentService Provider

INDaaS Agent

Cloud A Cloud B Cloud C

App Provider Auditor

ISP APower APower B

ISP BPower APower B

ISP BPower C

Using Jaccard similarity to evaluate the independence of each redundancy configuration.

ISP A Power B ISP B Power CPower A

INDaaS Agent

Cloud A Cloud B Cloud C

App Provider

ISP APower APower B

ISP BPower APower B

ISP BPower C

S1 S2 ... ... Sn

S1 S2 ... ... Sn

J(S1, S2, ..., Sn) = | |

| |

ISP A Power B ISP B Power CPower A

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS AgentService Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

=2=4

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

=2=4

J = 2/4

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

INDaaS Agent

Deployment SimCloud A&B 0.5

ISP A Power B ISP B Power CPower A

=2=4

J = 2/4

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

INDaaS Agent

Deployment SimCloud A&B 0.5Cloud B&C 0.25

ISP A Power B ISP B Power CPower A

=1=4

J = 1/4

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

Deployment SimCloud A&B 0.5Cloud B&C 0.25Cloud A&C 0

=0 =5 J = 0/5,,

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

Deployment SimCloud A&B 0.5Cloud B&C 0.25Cloud A&C 0

=0 =5 ,,

0  means  fully  independentService Provider

J = 0/5

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

Deployment SimCloud A&B 0.5Cloud B&C 0.25Cloud A&C 0

=0 =5 J = 0/5,,

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

Deployment SimCloud A&C 0Cloud B&C 0.25Cloud A&B 0.5

=0 =5 J = 0/5,,

Service Provider

Cloud A Cloud B Cloud C

INDaaS Agent

Deployment SimCloud A&C 0Cloud B&C 0.25Cloud A&B 0.5

ISP A Power B ISP B Power CPower A

Service Provider

S1 S2 ... ... Sn

S1 S2 ... ... Sn

J(S1, S2, ..., Sn) = | |

| |

P-SOP [Vaidya et al. JCS05]

• We apply Private Set Operation Protocol (P-SOP): - Private set intersection cardinality.- Private set union cardinality.

P-SOP [Vaidya et al. JCS05]

• Allow k parties to compute both intersection and union cardinalities without learning other information.

11

3

10

1

5

20

3

7

3

P-SOP [Vaidya et al. JCS05]

• Allow k parties to compute both intersection and union cardinalities without learning other information.

11

3

10

1

5

20

3

7

3

P-SOP

P-SOP [Vaidya et al. JCS05]

• Allow k parties to compute both intersection and union cardinalities without learning other information.

11

3

10

1

5

20

3

7

3

P-SOP

=1, =7=1, =7

=1, =7

P-SOP [Vaidya et al. JCS05]

• Allow k parties to compute both intersection and union cardinalities without learning other information.

11

3

10

1

5

20

3

7

3

Protocol

But I do not know which elements are overlapping/union.

P-SOP

But I do not know which elements are overlapping/union.

But I do not know which elements are overlapping/union.

P-SOP [Vaidya et al. JCS05]

• Allow k parties to compute both intersection and union cardinalities without learning other information.

Each party maintains a commutative encryption key

Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

37

15

203Kd

Kf

Ke

113

10

P-SOP [Vaidya et al. JCS05]

Each party maintains a commutative encryption key

Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

37

15

203Kd

Kf

Ke

113

10

P-SOP [Vaidya et al. JCS05]

37

15

203Kd

Kf

Ke

113

10

P-SOP [Vaidya et al. JCS05]

Each party maintains a commutative encryption key

Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

37

15

203Kd

Kf

Ke

113

10

Ef(Ee(Ed(3)))Ef(Ee(Ed(10)))Ef(Ee(Ed(11)))

Ee(Ed(Ef(3)))Ee(Ed(Ef(7)))

Ed(Ef(Ee(3)))Ed(Ef(Ee(5)))Ed(Ef(Ee(1)))

Ed(Ef(Ee(20)))

P-SOP [Vaidya et al. JCS05]

Each party maintains a commutative encryption key

Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

Kd

Kf

Ke

P-SOP [Vaidya et al. JCS05]

Each party maintains a commutative encryption key

Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

113

10

Ef(Ee(Ed(3)))Ef(Ee(Ed(10)))Ef(Ee(Ed(11)))

37

Ee(Ed(Ef(3)))Ee(Ed(Ef(7)))

15

203

Ed(Ef(Ee(3)))Ed(Ef(Ee(5)))Ed(Ef(Ee(1)))

Ed(Ef(Ee(20)))

Kd

Kf

Ke

P-SOP [Vaidya et al. JCS05]

Each party maintains a commutative encryption key

Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

113

10

Ef(Ee(Ed(3)))Ef(Ee(Ed(10)))Ef(Ee(Ed(11)))

37

Ee(Ed(Ef(3)))Ee(Ed(Ef(7)))

15

203

Ed(Ef(Ee(3)))Ed(Ef(Ee(5)))Ed(Ef(Ee(1)))

Ed(Ef(Ee(20)))

Kd

Kf

Ke

P-SOP [Vaidya et al. JCS05]

Each party maintains a commutative encryption key

Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

113

10

Ef(Ee(Ed(3)))Ef(Ee(Ed(10)))Ef(Ee(Ed(11)))

37

Ee(Ed(Ef(3)))Ee(Ed(Ef(7)))

15

203

Ed(Ef(Ee(3)))Ed(Ef(Ee(5)))Ed(Ef(Ee(1)))

Ed(Ef(Ee(20)))

Ef(Ee(Ed(3)))

Kd

Kf

Ke

P-SOP [Vaidya et al. JCS05]

Each party maintains a commutative encryption key

Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

11Ef(Ee(Ed(3)))Ef(Ee(Ed(10)))Ef(Ee(Ed(11)))

3Ee(Ed(Ef(3)))

Ee(Ed(Ef(7)))

1Ed(Ef(Ee(3)))

Ed(Ef(Ee(5)))Ed(Ef(Ee(1)))

Ed(Ef(Ee(20)))

Ef(Ee(Ed(3)))

Kd

Kf

Ke

P-SOP [Vaidya et al. JCS05]

Each party maintains a commutative encryption key

Commutative encryption holds: Ex(Ey(m)) = Ey(Ex(m))

Ef(Ee(Ed(10)))Ef(Ee(Ed(11)))Ee(Ed(Ef(7)))Ed(Ef(Ee(5)))Ed(Ef(Ee(1)))

Ed(Ef(Ee(20)))

Ef(Ee(Ed(3)))

7

Private Independence Evaluation

Cloud A Cloud B Cloud C

Select two clouds for redundancy: A&B? B&C? or A&C? INDaaS Agent

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

INDaaS Agent

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

INDaaS Agent

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS AgentService Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

P-SOP

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

=2

ISP A Power B ISP B Power CPower A

INDaaS AgentService Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

=2=4

ISP A Power B ISP B Power CPower A

INDaaS AgentService Provider

Cloud A Cloud B Cloud C

ISP BPower C

=2=4

J = 2/4

ISP A Power B ISP B Power CPower A

INDaaS Agent

ISP APower APower B

ISP BPower APower B

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

=2=4

J = 2/4

ISP A Power B ISP B Power CPower A

INDaaS Agent

Deployment SimCloud A&B 0.5

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

P-SOP

Deployment SimCloud A&B 0.5

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP A Power B ISP B Power CPower A

INDaaS Agent

ISP BPower APower B

ISP BPower C

=1=4

J = 1/4

Deployment SimCloud A&B 0.5

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP A Power B ISP B Power CPower A

INDaaS Agent

ISP BPower APower B

ISP BPower C

=1=4

J = 1/4

Deployment SimCloud A&B 0.5Cloud B&C 0.25

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

P-SOP

Deployment SimCloud A&B 0.5Cloud B&C 0.25

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

=0 =5 J = 0/5,,

Deployment SimCloud A&B 0.5Cloud B&C 0.25

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

=0 =5 J = 0/5,,

Deployment SimCloud A&B 0.5Cloud B&C 0.25Cloud A&C 0

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

=0 =5 J = 0/5,,

Deployment SimCloud A&B 0.5Cloud B&C 0.25Cloud A&C 0

Service Provider

Cloud A Cloud B Cloud C

ISP APower APower B

ISP BPower C

ISP A Power B ISP B Power CPower A

INDaaS Agent

=0 =5 J = 0/5,,

Deployment SimCloud A&C 0Cloud B&C 0.25Cloud A&B 0.5

Service Provider

Cloud A Cloud B Cloud C

INDaaS Agent

Deployment SimCloud A&C 0Cloud B&C 0.25Cloud A&B 0.5

ISP A Power B ISP B Power CPower A

Service Provider

Cloud A Cloud B Cloud C

INDaaS AgentDeployment SimCloud A&C 0Cloud B&C 0.25Cloud A&B 0.5

ISP A Power B ISP B Power CPower A

Service Provider

• #2: Dependency representation- Solution: Fault graphs

• #3: Efficient auditing- Solution: Failure sampling algorithm

• #1: Dependency collections- Solution: Reusing existing tools

RoadMap

INDaaS Agent

Step1 Step5

Step3

Step2

Step

4

Step2

Step4

DependencyData Source1

DependencyData Source2

• #4: Private independence audit- Solution: Private Jaccard similarity

Evaluation

Evaluation

• Three realistic case studies.• Tradeoff between auditing algorithms• Overhead of P-SOP protocol

Evaluation

• Three realistic case studies.• Tradeoff between auditing algorithms• Overhead of P-SOP protocol

Three Case Studies

• Common network dependency

• Common hardware dependency

• Common software dependency

Please see our paper for more details

Evaluation

• Three realistic case studies.• Tradeoff between auditing algorithms• Overhead of P-SOP protocol

• We evaluate efficiency/accuracy tradeoff.• We generate topology based on fat tree model.• We also bu

Tradeoff Evaluation

Topology A Topology B Topology C

# of Core Routers 64 144 576

# of Agg Switches 128 288 1,152

# of ToR Switches 128 288 1,152

# of Servers 1,024 3,456 27,648

Total # of devices 1,344 4,176 30,528

Tradeoff Evaluation

Tradeoff Evaluation

Topology A Topology B Topology C

# of Core Routers 64 144 576

# of Agg Switches 128 288 1,152

# of ToR Switches 128 288 1,152

# of Servers 1,024 3,456 27,648

Total # of devices 1,344 4,176 30,528

Topology C: 30,528 Devices

30

40

50

60

70

80

90

100

1 2 4 8 16 32 64 128 256 512 2048% c

ritic

al fa

ult s

ets

dete

cted

Computational time (minutes)

Minimal Fault Set AlgorithmFailure Sampling Algorithm

Failure Sampling Algorithm (106)

Failure Sampling Algorithm (104)

Minimal Fault Set Algorithm

Topology C: 30,528 Devices

30

40

50

60

70

80

90

100

1 2 4 8 16 32 64 128 256 512 2048% c

ritic

al fa

ult s

ets

dete

cted

Computational time (minutes)

Minimal Fault Set AlgorithmFailure Sampling Algorithm

Failure Sampling Algorithm (106)

Failure Sampling Algorithm (104)

Minimal Fault Set Algorithm

Evaluation

• Three realistic case studies.• Tradeoff between auditing algorithms• Overhead of P-SOP protocol

What we evaluate?

• Kissner and Song (KS) protocol for comparison.• Bandwidth overhead of P-SOP and KS.• Computational overhead of P-SOP and KS.

Bandwidth Overhead

0

50

100

150

200

1000 10000 100000

Tota

l tra

ffic

sent

(MB)

Number of elements in each provider’s dataset

KS (4)P-SOP (4)

KS (3)P-SOP (3)

KS (2)P-SOP (2)

Bandwidth Overhead

0

50

100

150

200

1000 10000 100000

Tota

l tra

ffic

sent

(MB)

Number of elements in each provider’s dataset

KS (4)P-SOP (4)

KS (3)P-SOP (3)

KS (2)P-SOP (2)

Bandwidth Overhead

0

50

100

150

200

1000 10000 100000

Tota

l tra

ffic

sent

(MB)

Number of elements in each provider’s dataset

KS (4)P-SOP (4)

KS (3)P-SOP (3)

KS (2)P-SOP (2)

~80MB

Computational Overhead

1

10

100

1000

10000

100000

1e+06

1000 10000 100000

Com

puta

tiona

l tim

e (s

econ

ds)

Number of elements in each provider’s dataset

KS (4)KS (3)KS (2)

P-SOP (4)P-SOP (3)P-SOP (2)

~103 sec

Conclusions

• INDaaS is a first step towards reliable clouds: - Dependency collections- Dependency representations- Efficient auditing- Private independence auditing

• We evaluated INDaas with three realistic case studies and large-scale simulations.

Thanks, questions?

• INDaaS: Heading off correlated failures in clouds

• Find out more at:- http://dedis.cs.yale.edu/cloud/

• We will be at the poster session tonight.