Performance Debugging in Data Centers: Doing More with Less
Prashant Shenoy, UMass Amherst
Joint work with Emmanuel Cecchet, Maitreya Natu, Vaishali Sadaphal and Harrick Vin
Data Centers Today
• Large number of computing, communication, and storage systems
• Wide range of applications and services
• Rapidly increasing scale and complexity
• Limited understanding and control over the operations
Equity Trade Plant
Portion of the data center operated by an investment bank for processing trading orders. Nodes represent application processes; edges indicate the flow of requests.
• Receives and processes
  – 4-6 million equity orders (trade requests)
  – 10-100 million market updates (news, stock-tick updates, etc.)
• IT infrastructure for processing orders and updates consists of thousands of application components running on hundreds of servers
Performance Debugging in Data Centers
• Low end-to-end latency for processing each request is a critical business requirement
• Increase in latency can be due to
  – Dynamic changes in workload
  – Slowing down of a processing node due to hardware or software errors
• Performance debugging involves detecting and localizing performance faults
• Longer localization time leads to greater business impact
Performance Debugging in Data Centers
• Four key steps
  – Build a model of normal operations of a system
  – Place probes to monitor the operational system
  – Detect performance faults in near-real-time
  – Localize faults by combining the knowledge derived from the model and monitored data
The effectiveness of these steps depends on the number and type of data collection probes available in the system.
However, system administrators are reluctant to introduce probes into a production environment, especially if the probes are intrusive (and can modify the system's behavior).
Basic Practical Requirement
• Minimize the amount of instrumentation to gather real-time operational statistics
• Minimize the intrusiveness of the data gathering methods
Much of the prior research ignores this requirement and demands:
• Significant instrumentation (e.g., requiring probes to be placed at each process/server)
• Significant intrusiveness (e.g., requiring each request to carry a request-ID to track request flows)
Characterizing State-of-the-art
• For automated performance debugging to become practical and effective, one needs to develop techniques that are more effective with less instrumentation and intrusiveness
• We raise several issues and challenges in designing these techniques
Instrumentation Vs. Intrusiveness
• The extent of instrumentation and the amount of intrusiveness complement each other
  – E.g., collection of request-component dependency
• High instrumentation, low intrusiveness
  – Each node monitors request arrival events
• Low instrumentation, high intrusiveness
  – Each request stores information about the components it passes through
Observation 3: It is possible to trade off the level of instrumentation against the level of intrusiveness needed for a technique
Production systems place significant restrictions on which nodes can be instrumented, as well as on the level of intrusiveness permitted
Is it possible to achieve effective performance debugging using low instrumentation and low intrusiveness?
Doing More With Less: An Example
A Production Data Center: Characteristics and Constraints
• 469 nodes
  – Each node represents an application component that processes trading orders and forwards them to a downstream node
• 2,072 links
• 39,567 unique paths
• SLO: end-to-end latency for processing each equity trade should not exceed 7-10 ms
• Environment imposes severe restrictions on the permitted instrumentation and intrusiveness
  – No instrumentation of intermediate nodes purely for performance debugging
  – SLA compliance is monitored at exit nodes by time-stamping request entry and exit
• Available information
  – Per-hop graph
  – SLO compliance information at the monitors at exit nodes
• No additional information is available
Problem Definition
• Given:
  – System graph depicting application component interactions
  – Instrumentation at the entry and exit nodes that timestamps requests
• Determine:
  – The root cause of SLO violations when one or more exit nodes observe such violations
Straw-man Approaches
• Signature-based localization
• Online signature matching via graph coloring
Signature-Based Localization
• Node signature:
  – Set of all monitors that are reachable from the node
  – K-bit string where each bit represents the reachability of a monitor
• In the presence of a failure, some monitors will observe the SLO violation, thus creating a violation signature
• The fault localization task is to determine the node that could have generated the violation signature
[Figure: system graph annotated with per-node signature bit strings; query exit points perform SLA validation]
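The signature construction described above can be sketched in a few lines. The toy graph, monitor names, and bit ordering below are invented for illustration; the talk does not prescribe an implementation:

```python
# Hypothetical sketch: per-node signatures as bit strings over the set of
# exit monitors reachable from each node.

# Toy system graph: node -> downstream nodes; "m0".."m2" are exit monitors.
GRAPH = {
    "src": ["a", "b"],
    "a": ["m0", "m1"],
    "b": ["m1", "m2"],
    "m0": [], "m1": [], "m2": [],
}
MONITORS = ["m0", "m1", "m2"]  # one bit per monitored exit node

def reachable_monitors(node, graph=GRAPH):
    """Return the set of exit monitors reachable from `node` (iterative DFS)."""
    seen, stack, hits = set(), [node], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if n in MONITORS:
            hits.add(n)
        stack.extend(graph.get(n, []))
    return hits

def signature(node):
    """K-bit string: bit i is 1 iff monitor i is reachable from `node`."""
    hits = reachable_monitors(node)
    return "".join("1" if m in hits else "0" for m in MONITORS)

def localize(violation_sig, graph=GRAPH):
    """A failing node produces a violation signature equal to its own
    signature; localization returns all nodes whose signature matches."""
    return [n for n in graph if signature(n) == violation_sig]

print(signature("a"))   # monitors m0 and m1 reachable -> "110"
print(localize("110"))  # -> ["a"]
```

Note that two nodes reaching the same monitor set share a signature and cannot be distinguished, which is exactly the collision problem the slides quantify next.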
Signature-Based Localization
Applying signature-based localization to the equity trade plant system:
• Monitors on 112 exit nodes generated 112-bit signatures
• Generated 137 unique signatures for 357 non-exit nodes (38%)
• Generated 71 unique signatures for 121 source nodes (58%)
Online signature matching
• Graph coloring technique:
  SLA violation → Mark suspect nodes → Clear suspect nodes that lead to a valid request execution → Root cause of SLA violation
Opportunities and Challenges
Deriving a System Model
• Objective:
  – Real production systems are too large and complex to manually derive a system model
  – Need for automatic generation and maintenance of the model
• Challenges:
  – Need for reasonably low instrumentation and intrusiveness
  – Several low-cost mechanisms can be considered here
    • Network packet sniffing to derive the component communication pattern
    • Examining application logs, to derive the component communication pattern and request flows
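The log-examination option can be sketched briefly. The log format below is invented purely for illustration; real application logs would need their own parser:

```python
# Hypothetical sketch: deriving the per-hop component graph by examining
# application logs, one of the low-intrusiveness mechanisms listed above.
import re
from collections import defaultdict

# Invented log format: "<timestamp> <source-component> -> <dest-component>"
LOG = """\
10:00:01 gateway -> pricing
10:00:01 pricing -> risk
10:00:02 gateway -> pricing
10:00:02 risk -> settlement
"""

LINE = re.compile(r"^\S+ (\S+) -> (\S+)$")

def build_graph(log_text):
    """Per-hop graph: component -> set of downstream components."""
    graph = defaultdict(set)
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m:
            src, dst = m.groups()
            graph[src].add(dst)
    return dict(graph)

print(build_graph(LOG))
# -> {'gateway': {'pricing'}, 'pricing': {'risk'}, 'risk': {'settlement'}}
```

The same aggregation would apply to packet-sniffing output, with (source IP:port, destination IP:port) pairs standing in for the logged component names.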
Monitor Placement
• Objective:
  – Place monitors at suitable locations to measure end-to-end performance metrics
• Challenges:
  – Deployment of monitors involves instrumentation overhead
    • Need to minimize the number of monitors
  – Tradeoff between the number of monitors and the accuracy of fault detection and localization
    • A smaller number of monitors increases the chances of signature collisions
  – Structure of the graph affects the distribution of signatures across nodes
    • In the ideal case, n unique signatures can be generated using log(n) monitors
[Figure: example graphs with clusters of nodes sharing the same signature]
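The log(n) figure follows from a simple counting argument: k monitors yield k-bit signatures, hence at most 2^k distinct values, so distinguishing n nodes needs at least ceil(log2(n)) monitors. A minimal sketch of the bound, applied to the node counts from the equity trade plant:

```python
# Counting argument behind the log(n) monitor bound (ideal case only;
# real graph structure usually forces many more monitors).
import math

def min_monitors(n_nodes):
    """Lower bound on monitors needed for n distinct node signatures."""
    return math.ceil(math.log2(n_nodes))

print(min_monitors(357))  # 357 non-exit nodes -> at least 9 monitors
print(min_monitors(121))  # 121 source nodes  -> at least 7 monitors
```

The equity trade plant uses 112 monitors yet still leaves 62% of non-exit nodes sharing signatures, showing how far a real topology can sit from this ideal.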
Real-Time Failure Detection
• Objective:
  – Quick and accurate detection of the presence of failures based on observations at the monitor nodes
• Challenges:
  – Differentiate between the effects of workload change and failure
  – Deal with scenarios where a node failure affects only a few of the requests passing through the node
  – Transient failures
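One hypothetical heuristic for the workload-vs-failure challenge above (my own illustration, not a technique from the talk): a latency rise at a single monitor points toward a fault on its paths, while a correlated rise across all monitors is more consistent with a system-wide workload change:

```python
# Invented heuristic: compare each monitor's current latency to a baseline.
def classify(baseline, current, factor=2.0):
    """baseline/current: dict of monitor -> observed latency (ms).
    Flags monitors whose latency exceeds `factor` times their baseline."""
    slow = [m for m in current if current[m] > factor * baseline[m]]
    if not slow:
        return "normal", []
    if len(slow) == len(current):
        return "workload-change", slow   # everything slowed down together
    return "suspected-node-fault", slow  # isolated slowdown

base = {"m0": 3.0, "m1": 4.0, "m2": 3.5}
print(classify(base, {"m0": 3.1, "m1": 9.5, "m2": 3.4}))
# -> ('suspected-node-fault', ['m1'])
```

This simple thresholding would still miss the partial-failure and transient cases listed above, which is precisely why detection remains a challenge.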
Fault Localization
• Objective:
  – Identify the root cause of the problem after detecting a failure at one or more monitor nodes (SLO violation signature)
• Challenges:
  – Presence of multiple failures leads to a composite signature
  – Edges from the failed node to the monitors are traversed in a non-uniform manner, leading to a partial signature
  – Transient failures
  – Inherent non-determinism in real systems (e.g., presence of load balancers)
Conclusions
• Detecting and localizing performance faults in data centers has become a pressing need and a challenge
• Performance debugging can become practical and effective only if it requires low levels of instrumentation and intrusiveness
• We proposed straw-man approaches for performance debugging and presented issues and challenges in building practical and effective solutions
Instrumentation and Intrusiveness
Instrumentation for Failure Detection
• End-to-end latency: difference of the timestamps of arrival and departure of requests
  – High instrumentation intrusiveness
• Throughput: number of requests departing the system within a defined interval
  – Low instrumentation intrusiveness
Observation 1: The instrumentation intrusiveness is a direct function of the performance metric of interest
Instrumentation for Fault Localization
• Simple solution: measure performance metrics and resource utilization at all servers
  – High instrumentation
  – High overhead (monitoring and data management)
• Sophisticated solutions: collect operational semantics of the system (e.g., request-component dependencies)
  – Low instrumentation (not every node needs to be instrumented)
  – High intrusiveness (modifications at the system, middleware, or application level)
• Collecting different system information requires different levels of intrusiveness
  – Per-hop graph indicating component interactions: simple network sniffing
  – Derivation of the flow of requests: application-aware monitoring (e.g., by insertion of a transaction-id in the requests)
Characterizing State-of-the-art
Observation 2: Most techniques require high instrumentation, high intrusiveness, or both