Cloud API Issues: an Empirical Study and Impact


Description

Quality of Software Architectures (QoSA) 2013 talk slides, June 18th, 2013. Full paper at http://www.nicta.com.au/pub?doc=6785

Transcript of Cloud API Issues: an Empirical Study and Impact

Page 1: Cloud API Issues: an Empirical Study and Impact


Cloud API Issues: an Empirical Study and Impact

Qinghua Lu, Liming Zhu, Len Bass, Xiwei Xu, Zhanwen Li, Hiroshi Wada

Software Systems Research Group, NICTA

QoSA13, Vancouver. Slides at: http://www.slideshare.net/LimingZhu/

Page 2: Cloud API Issues: an Empirical Study and Impact


Motivation

• Cloud applications fail due to operational issues
  – Gartner reports: 80% of outages are caused by operations
    • People/process: replication/failover, auto-scaling, upgrade…
  – Lessons from our own cloud DR product: Yuruware.com
  – The DevOps movement

• Operational causes of failures
  – Infrastructure and processes, but most things are done through infrastructure APIs

• Highly dependable cloud applications require
  – Architecting not just for the software but also for its operation (through APIs)
  – Architecting for indirect control (through APIs)
  – A better understanding of cloud API issues
    • Reliability, performance, the nature of failures and faults

Page 3: Cloud API Issues: an Empirical Study and Impact


Main Contributions

• Empirical study of cloud infrastructure API issues
  – 922 failure/fault cases from the Amazon EC2 forums (2010 to 2012)
    • Around the five most-used API calls
    • Fault analysis supplemented by other sources
  – Classified the API failures and faults (causes of failures)
    • Using the classic dependable-computing taxonomy (Avizienis, 2004)
    • Failures: content, late timing, halt, erratic
    • Faults: development, physical, interaction

• Impact analysis through an initial proposal for tolerating cloud API failures/faults
  – Suggestions for tolerating content failures
  – 11 patterns for tolerating timing failures

Page 4: Cloud API Issues: an Empirical Study and Impact


Some Empirical Findings

• The majority (60%) of API failure cases involve stuck or unresponsive API calls

• 19% of the cases concern the output of API calls
  – Error messages; missing/wrong/unexpected content

• 12% of the cases are about slow-responding API calls

• 9% of the cases are about API calls that
  – were pending for a certain time and then returned to the original state without properly informing the caller
  – were reported as successful at first but failed later

Page 5: Cloud API Issues: an Empirical Study and Impact


Methodology

• Amazon: EC2 forums and outage reports
• Netflix: technical blogs and GitHub OSS projects
• Yuruware.com: a disaster recovery product that relies heavily on cloud infrastructure APIs

Page 6: Cloud API Issues: an Empirical Study and Impact


Data Collected from Amazon EC2 Forum

Searched keywords and number of returned records:

API of interest        Records, inception to 2012    Records, 2010 to 2012
describe instance      283                           150
start instance         227                           204
stop instance          349                           348
detach volume          235                           203
associate elastic IP   264                           204
Total                  1358                          1109

Case type, number and percentage of the cases found from 2010 to 2012:

Case type          Case number   Percentage of all cases
API failures       922           83%
Enquiries          125           11%
API enhancements   62            6%

Page 7: Cloud API Issues: an Empirical Study and Impact


Classification of API Failures


Fault -> Error -> Failure

Failure: deviation from correct service (externally visible)

Error: internal erroneous state

Fault: the adjudged or hypothesized cause of a failure

[13] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, 2004.
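As a concrete illustration, the study's two classification axes can be written down as a small data model. A minimal Python sketch; the class names and the example case are ours, not the study's tooling:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FailureType(Enum):
    """Failure modes used in the study, per Avizienis et al. (2004)."""
    CONTENT = "content"           # wrong/missing output or error message
    LATE_TIMING = "late timing"   # correct result, but later than expected
    HALT = "halt"                 # external state becomes constant
    ERRATIC = "erratic"           # unpredictable: reverts or fails later

class FaultType(Enum):
    """Fault classes (causes of failures) used in the study."""
    DEVELOPMENT = "development"   # software bugs
    PHYSICAL = "physical"         # underlying host/hardware problems
    INTERACTION = "interaction"   # e.g. misconfiguration

@dataclass
class ApiCase:
    """One forum case: an observed failure and its adjudged or
    hypothesized cause (None when the root cause is unknown)."""
    api_call: str
    failure: FailureType
    fault: Optional[FaultType]

# Example: a 'stop instance' call that gets stuck, root cause unknown.
case = ApiCase("stop instance", FailureType.HALT, None)
```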

Page 8: Cloud API Issues: an Empirical Study and Impact


Classification of API Failures

• Content failures (19%)
  – With error messages; missing/wrong/unexpected content
    • In 61% of the cases, users understood the causes/solutions from the error message
    • In 39% of the cases, users could not pinpoint the causes from the error message

A failed call where the error message is unclear.

Posted on Jan 10, 2012 5:42 AM
Symptom: When a user tried to start an instance, the operation failed with an unclear error message.
Error message: State Transition Reason - Server.InternalError: Internal error on launch
Root cause: Unknown.
Solution: AWS engineers advised detaching the EBS volume from the instance and attaching it to another running instance.

A failed call where the error message is clear.

Posted on Jun 14, 2012 9:57 PM
Symptom: API calls failed with a "Request limit exceeded" error message.
Error message: Client.RequestLimitExceeded: Request limit exceeded
Root cause: API calls exceeded the request limit.
Solution: N/A. There is no official information on the limit, the time span over which it is calculated, or a suggested wait time.
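For the second example, where the limit itself is unpublished, a common consumer-side defence is retrying with exponential backoff and jitter. A minimal sketch using boto3 (a current AWS SDK; the attempt count and delay bounds are illustrative assumptions, since AWS gives no official numbers):

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")

def describe_with_backoff(instance_id, max_attempts=6):
    """Retry a describe call when the request limit is exceeded.

    The attempt count and delay bounds are illustrative guesses:
    with no official limit, the backoff must be tuned empirically
    per account and region.
    """
    for attempt in range(max_attempts):
        try:
            return ec2.describe_instances(InstanceIds=[instance_id])
        except ClientError as e:
            if e.response["Error"]["Code"] != "RequestLimitExceeded":
                raise  # a different content failure: surface it
            # Exponential backoff with full jitter.
            time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError("request limit still exceeded after retries")
```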

Page 9: Cloud API Issues: an Empirical Study and Impact


Classification of API Failures

• Late timing failures (12%)
  – The arrival time of the delivered information deviates from the expected time, but the information does eventually arrive

A late timing failure example.

Posted on Aug 27, 2012 11:57 AM

Symptom: It took 16 minutes for an instance to stop.

Root cause: n/a.

Solution: The AWS engineer advised the user to try “force stop” twice if this happens again.
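The engineer's advice suggests a life-cycle-driven mitigation: put a deadline on the stop operation and escalate to a forced stop when it passes. A minimal boto3 sketch; the 5-minute deadline and 15-second polling interval are illustrative assumptions:

```python
import time

import boto3

ec2 = boto3.client("ec2")

def stop_with_deadline(instance_id, deadline_s=300, poll_s=15):
    """Stop an instance, escalating to a forced stop when the normal
    stop overruns a deadline. The deadline and polling interval are
    illustrative; the forum case above waited 16 minutes before
    intervening."""
    ec2.stop_instances(InstanceIds=[instance_id])
    deadline = time.time() + deadline_s
    while time.time() < deadline:
        state = ec2.describe_instances(InstanceIds=[instance_id])[
            "Reservations"][0]["Instances"][0]["State"]["Name"]
        if state == "stopped":
            return  # completed within the deadline
        time.sleep(poll_s)
    # Late timing failure: escalate, as the AWS engineer advised.
    ec2.stop_instances(InstanceIds=[instance_id], Force=True)
```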

Page 10: Cloud API Issues: an Empirical Study and Impact


Classification of API Failures

• Halt failures (60%)
  – The external state becomes constant
  – The most frequent failure type!

A general halt failure example.

Posted on Jun 27, 2012 12:04 AM

Symptom: A user reported that an instance was stuck at stopping and that “force stop” would not help.

Root cause: n/a.

Solution: The AWS engineer stopped the instance for the user on the AWS side (with some side effect).

A silent failure example.

Posted on Oct 23, 2012 7:45 AM
Symptom: An instance was not accessible, and the user could not stop/start it or create a snapshot.
Root cause: AWS outage.
Solution: The AWS engineer advised that the user must launch a replacement instance from a pre-existing backup (EBS AMI). Attempts to stop an inaccessible instance will likely leave it stuck in the stopping state. Customers without a known good backup must wait for the issue to be resolved before connectivity to their instance is restored.
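Because a halt failure shows up as an external state that stops changing, a consumer-side watchdog can detect it by bounding how long a transitional state may persist. A minimal sketch; poll_state is an assumed caller-supplied function and the 10-minute timeout is an illustrative guess:

```python
import time

def detect_halt(poll_state, terminal_state, timeout_s=600, poll_s=20):
    """Watchdog for halt failures: flag a resource whose external
    state stays constant in a transitional value (e.g. 'stopping')
    past a timeout.

    poll_state is assumed to return the current state string; the
    timeout should be tuned to the operation being watched.
    """
    deadline = time.time() + timeout_s
    state = poll_state()
    while time.time() < deadline:
        state = poll_state()
        if state == terminal_state:
            return  # the transition completed normally
        time.sleep(poll_s)
    # The state never reached the terminal value: report a halt
    # failure so recovery (e.g. replacing from backup) can start.
    raise TimeoutError(f"halt failure: stuck in state {state!r}")
```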

Page 11: Cloud API Issues: an Empirical Study and Impact


Classification of API Failures

• Erratic failures (35%)
  – The delivered service is unpredictable. Two subtypes:
    • the call is pending for a certain time and then returns to the original state
    • the call executes successfully at first but eventually fails

Two erratic failure examples.

Posted on Feb 1, 2012 8:15 AM

Symptom: A user associated an elastic IP with an instance and could SSH into the instance with the elastic IP. After a few minutes, the elastic IP was silently disassociated from the instance.

Root cause: An issue with the underlying host.

Solution: The AWS engineer advised that the quickest fix was to stop and then start the instance to relocate to a different host.

Posted on Jan 14, 2011 1:43 PM

Symptom: A user tried to start an instance several times; each time the status showed pending and then went back to stopped.

Root cause: n/a.

Solution: The AWS engineer returned the user’s EBS volume to the available state and believed this would resolve the user’s problem.
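Since an erratic call can report success and then silently revert (as in the elastic IP case above), one consumer-side mitigation is to verify the post-condition twice: immediately, and again after a delay. A minimal sketch; operation and check are assumed caller-supplied callables, and the recheck window and attempt count are illustrative:

```python
import time

def run_with_recheck(operation, check, recheck_after_s=120, attempts=3):
    """Guard against erratic failures: verify an operation's
    post-condition immediately and again after a delay, since a
    call may report success and then silently revert.

    operation and check are caller-supplied, e.g. 'associate the
    elastic IP' and 'is the IP still associated?'.
    """
    for _ in range(attempts):
        operation()
        if not check():
            continue  # failed outright; try again
        time.sleep(recheck_after_s)
        if check():
            return True  # success held after the delay
        # Reported success but reverted afterwards: an erratic
        # failure of the second subtype, so retry the operation.
    return False
```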

Page 12: Cloud API Issues: an Empirical Study and Impact


Classification of Faults (Causes of Failures)

• Development faults: software bugs
  – User workarounds exist but may break once the bug is fixed

• Physical faults
  – Stopping/starting an instance to move it to a new physical machine, but the stopping itself is problematic
  – Future work: classification using virtual-resource characteristics

• Interaction faults
  – Misconfiguration faults account for 30%
    • Both accidental and purposeful misconfiguration
  – Purposeful misconfiguration stems from lack of knowledge (subjective uncertainty vs. stochastic uncertainty)
  – Configuration and operation impact on availability [1, 2]

1. X. Xu, Q. Lu, L. Zhu, et al., "Availability Analysis of In-Cloud Applications," in ISARCS 2013 (11:30 tomorrow).

2. Q. Lu, X. Xu, L. Zhu, L. Bass, et al., "Incorporating Uncertainty into In-Cloud Application Deployment Decisions for Availability," in IEEE CLOUD 2013.

Page 13: Cloud API Issues: an Empirical Study and Impact


Tolerating API Failures/Faults

• Perspective
  – Cloud-consumer and application oriented
  – Limited visibility: e.g. the root cause may not be known
  – Indirect control: e.g. solutions also go through APIs

• Different failures/faults require different approaches
  – Dependent on the failure/fault classification
  – Suggestions, patterns and ad-hoc use of failure/fault characteristics:
    • Content failures: alternative sources for content, defensive programming… (see the sketch after this list)
    • Late timing failures: driven by the API call life cycle
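As an example of the defensive-programming suggestion above, a consumer can validate the content of a "describe" result before trusting it. A minimal sketch against a boto3-style response shape; returning None as the fallback is our illustrative choice, not a prescription from the paper:

```python
import boto3

# The documented EC2 instance states.
VALID_STATES = {"pending", "running", "stopping", "stopped",
                "shutting-down", "terminated"}

def get_instance_state(ec2, instance_id):
    """Defensively read an instance state from a 'describe' call.

    Content failures mean the response can carry missing or
    unexpected fields, so validate before trusting it; on bad
    content the caller can consult an alternative source, e.g. its
    own record of the expected state.
    """
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    reservations = resp.get("Reservations", [])
    if not reservations or not reservations[0].get("Instances"):
        return None  # missing content
    state = reservations[0]["Instances"][0].get("State", {}).get("Name")
    return state if state in VALID_STATES else None  # unexpected content

# Usage: state = get_instance_state(boto3.client("ec2"), "i-...")
```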

Page 14: Cloud API Issues: an Empirical Study and Impact


API Call Life Cycle Driven Patterns

[Figure: the API call life cycle annotated with the tolerance patterns discussed on the next slide]

Page 15: Cloud API Issues: an Empirical Study and Impact


Pattern Examples

• Faster forced fail/complete
  – force-fail-r or force-fail-s
    • Netflix Hystrix: fail fast based on the 95th-99th percentile delay
  – force-complete-r
    • Yuruware: ignore some “describe” API calls

• Hedged requests or more sophisticated retry
  – continue-request
    • Common: send the same request to two places and cancel the slower one (sketched after this list)
  – reallocate or reallocate-s
    • Yuruware: attach the to-be-moved volume to a different mover instance after early mover failures

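The "send the same request to two places and cancel the slower one" tactic can be sketched with standard Python concurrency. This is our illustration, not Netflix's or Yuruware's implementation, and it assumes the duplicated request is idempotent:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def hedged_call(request_fn, endpoints):
    """Hedged request: issue the same call against every endpoint
    and keep the first response, discarding the slower duplicates.

    request_fn and endpoints are assumed caller-supplied; the call
    must be idempotent, since it may execute more than once.
    """
    pool = ThreadPoolExecutor(max_workers=len(endpoints))
    try:
        futures = [pool.submit(request_fn, ep) for ep in endpoints]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # only stops duplicates that have not started
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)  # don't block on a slow duplicate
```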

Page 16: Cloud API Issues: an Empirical Study and Impact


Conclusion and Future Work

• Empirical study of cloud infrastructure API issues
  – Analysed and classified 922 failures/faults from the Amazon EC2 forums
    • Informs better architecting for operations (i.e. the operator as a stakeholder)
  – Future work (completed):
    • Expanded to more cases from other sources (2,087 issues)
    • Proposed a new scheme for classifying faults

• Tolerating cloud API failures/faults
  – Patterns for tolerating different types of API failures/faults
  – Future work (ongoing):
    • More actionable mechanisms/patterns and their implementation
    • Use the characteristics of faults and failures for smarter recovery and error diagnosis during operation

• What we need: more real-world operation logs and collaborators

{Liming.Zhu, Len.Bass}@nicta.com.au
Slides available at http://www.slideshare.net/LimingZhu/