Pivot Tracing: Dynamic Causal Monitoring for...

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

Student: Hunter Ingle


Page 1: Pivot Tracing: Dynamic Causal Monitoring for …myweb.astate.edu/dhkim/seminar/hunter_pivot_tracing.pdf•Pivot Tracing is the first monitoring system of its kind i.e. combining dynamic


Original Paper

Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca. “Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems.” In ACM Transactions on Computer Systems, Vol. 35, Issue 4, pp. 11:1-11:28. December 2018.


Distributed Systems

• Several machines (nodes) working together to perform a certain task

• Great for large-scale data processing and parallel processing
• However, some problems and issues exist
  § Data encryption and transmission
  § Fault detection
    o Monitoring and troubleshooting functionality


Main Problem

• Monitoring and troubleshooting distributed systems is both hard and time-consuming
  § Hardware and software failures
  § Misconfigurations across the system
  § Unrealistic expectations
• Current tools:
  § Logs, counters, and metrics
• Limitations:
  § What gets recorded is fixed at deployment (a priori)
    o May not always contain the necessary information
  § Captured per component or per machine
    o Difficult to correlate events between them


Solution: Pivot Tracing

• Combines dynamic instrumentation with causal tracing
• Lets users define metrics at any one point of a system
• Selects, filters, and groups those metrics by events at other points
• Works across the component and machine boundaries mentioned before


Four Contributions

• “Happened-before join”
• Query optimization for the join at runtime by combining dynamic instrumentation and causal tracing
• Prototype implementation of Pivot Tracing
  § Applied to the Hadoop distributed system framework
• Evaluation based on diagnosing problems at runtime


Overview

• Requirements in a system:
  § Dynamic code injection
  § Causal metadata propagation
• Based on tracepoints, where Pivot Tracing (PT) can insert instrumentation
  § Tracepoints provide instructions on where and how to change the system


Overview Cont.

• Queries (1) are issued against the distributed system (DS), modeled as a dataset (2), and use a vocabulary defined by tracepoints

• Queries are compiled into advice (3), an instruction set that processes queries

• Advice is mapped to code that PT injects at tracepoints (4). Each time execution reaches a tracepoint, its advice is called as well.


Overview Cont.

• The happened-before join relies on advice at one tracepoint passing information along the execution path to advice at other tracepoints. This uses causal metadata propagation in a container known as baggage (5).

• Advice can also emit tuples (6), which are aggregated and sent to the client via a message bus (7) and (8).
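
The aggregate-then-send step can be sketched roughly as follows. This is an illustration under stated assumptions, not the prototype's code: the `Agent` class, `emit`, `flush`, and `collect` are hypothetical names, a `queue.Queue` stands in for the message bus, and the byte counts are made-up toy values.

```python
import queue
from collections import Counter


class Agent:
    """Hypothetical per-process agent that pre-aggregates emitted tuples."""

    def __init__(self, bus):
        self.bus = bus            # stand-in for the message bus
        self.partial = Counter()  # local partial sums, keyed by group

    def emit(self, group_key, value):
        # Called by advice: fold the tuple into the local aggregate
        # instead of sending every tuple over the network.
        self.partial[group_key] += value

    def flush(self):
        # Publish the partial aggregate, then reset it.
        if self.partial:
            self.bus.put(dict(self.partial))
            self.partial.clear()


def collect(bus):
    """Client side: merge all partial aggregates currently on the bus."""
    total = Counter()
    while True:
        try:
            total.update(bus.get_nowait())  # Counter.update adds counts
        except queue.Empty:
            return dict(total)


bus = queue.Queue()
agent1, agent2 = Agent(bus), Agent(bus)
agent1.emit("client-A", 100)
agent1.emit("client-A", 50)
agent1.emit("client-B", 10)
agent2.emit("client-A", 25)
agent1.flush()
agent2.flush()
totals = collect(bus)
```

Aggregating locally means each agent sends one small message per reporting interval rather than one message per event, which is one plausible reason for doing the aggregation before the bus.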


Design

• Tracepoints
  § Act as entry points to the system for PT queries
  § Based on some event
    o Requests, I/O operation completion, etc.
  § Only reference the locations of entry points, so they are not defined or limited by a priori modifications
  § Only compiled and installed at runtime, whenever a query is sent
  § When a request reaches the tracepoint, the instrumentation at that point is called, exporting some variables needed for a tuple
    o Host, timestamp, process ID, process name, and tracepoint definition
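
A minimal sketch of the tracepoint idea, assuming a plain in-process registry in place of real dynamic code injection. The names `install`, `uninstall`, `tracepoint`, and the tracepoint name `DataNode.readBlock` are hypothetical and chosen only for illustration; the process name is a placeholder string.

```python
import os
import socket
import time
from collections import defaultdict

# Hypothetical registry: tracepoint name -> advice callables installed at runtime.
_ADVICE = defaultdict(list)


def install(tracepoint_name, advice):
    """Attach advice to a tracepoint while the program is running."""
    _ADVICE[tracepoint_name].append(advice)


def uninstall(tracepoint_name, advice):
    """Detach previously installed advice."""
    _ADVICE[tracepoint_name].remove(advice)


def tracepoint(name, **exported):
    """Call installed advice with default plus tracepoint-specific variables."""
    advice_list = _ADVICE.get(name)
    if not advice_list:
        return  # nothing installed, so nothing is recorded
    tuple_ = {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "pid": os.getpid(),
        "procname": "demo-process",  # placeholder value (assumption)
        "tracepoint": name,
    }
    tuple_.update(exported)
    for advice in list(advice_list):
        advice(tuple_)


def read_block(num_bytes):
    """Toy application method containing one tracepoint."""
    tracepoint("DataNode.readBlock", delta=num_bytes)
    return num_bytes


seen = []
record = seen.append

read_block(10)                          # no advice installed yet
install("DataNode.readBlock", record)   # a query arrives at runtime
read_block(20)                          # this call is observed
uninstall("DataNode.readBlock", record)
read_block(30)                          # no longer observed
```

Only the middle call produces a tuple, which mirrors the point that nothing is collected until a query installs advice.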


Tracepoint Example


Design – Happened-Before Joins

• Allows tuples from different PT queries to be joined based on the “happened-before” relation.
  § If a and b are events in the same process and a happens first, then a -> b (and if b happens first, then b -> a)
    o The same holds if event a was the cause of event b
  § Based on Lamport’s study of event ordering in distributed systems (see references slide)
• The join is based on queries and their effects on events
• The join of two queries (Q1 ⋈→ Q2) yields the pairs of tuples t1 and t2, with t1 from Q1 and t2 from Q2, where t1 happened before t2 (t1 -> t2)

• Provides insight into the relationships between events being monitored in a system
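
The join semantics can be illustrated with a small sketch. It assumes each tuple carries a vector clock (`vc`), which is a swapped-in textbook technique for deciding happened-before and not the mechanism Pivot Tracing itself uses. The event names, process names, and the helpers `happened_before` and `hb_join` are hypothetical.

```python
from itertools import product


def happened_before(vc_a, vc_b):
    """True if vector clock vc_a is strictly earlier than vc_b."""
    keys = set(vc_a) | set(vc_b)
    no_later = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    strictly = any(vc_a.get(k, 0) < vc_b.get(k, 0) for k in keys)
    return no_later and strictly


def hb_join(q1, q2):
    """All pairs (t1, t2) with t1 from q1, t2 from q2, and t1 -> t2."""
    return [
        (t1, t2)
        for t1, t2 in product(q1, q2)
        if happened_before(t1["vc"], t2["vc"])
    ]


# Two independent requests: client A reads from dn1, client B reads from dn2.
q1 = [
    {"name": "clientA", "vc": {"c1": 1}},
    {"name": "clientB", "vc": {"c2": 1}},
]
q2 = [
    {"name": "readA", "vc": {"c1": 1, "dn1": 1}},
    {"name": "readB", "vc": {"c2": 1, "dn2": 1}},
]

pairs = [(t1["name"], t2["name"]) for t1, t2 in hb_join(q1, q2)]
```

Each client event joins only with the read it caused; the two cross-request pairs are concurrent and are dropped.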


Design – Advice

• Intermediate representation of PT queries
• Determines the operations to be performed at tracepoints and provides monitoring code to be installed at those points
• Based on an advice API that can


Design – Advice Cont.


Optimization - Baggage

• Optimizes happened-before joins during request execution
• Per-request container for tuples
  § Propagated with a request across thread, application, and machine boundaries
  § Follows the request's execution path, so it also observes the happened-before events of the request


Attached to Hadoop

• Previously mentioned design was attached to multiple aspects of the Hadoop framework
  § HBase: non-relational database that runs atop HDFS
• Extended Hadoop functionality and protocols to allow baggage and tracepoints
  § Tracepoints were implemented in
    o DataNode’s DataTransferProtocol
    o NameNode’s ClientProtocol


Evaluation

• Experiment led to finding a bug in HDFS that caused uneven load distribution for replicated data
  § Cluster of 8 DataNodes and 1 NameNode on HDFS
  § Used 96 stress test clients, which showed high load levels on two hosts and almost zero load on the rest with a replication factor of 3
  § Replica Selection Bug (HDFS-6268) has since been fixed in a subsequent Hadoop release


Evaluation Cont.


Evaluation Cont.

• Based on experiments with the PT-instrumented Hadoop framework, results show:
  § It is dynamic and extensible
  § It is scalable with low overhead
  § It allows cross-boundary analysis
  § It uses event causality for diagnosis of errors
  § It provides analysis even with minimal tracepoints
• However, it is not meant to replace all functionality of logs, such as:
  § Security auditing
  § Forensics
  § Debugging


Conclusion

• Pivot Tracing is the first monitoring system of its kind
  § i.e., combining dynamic instrumentation and causal tracing
  § Provides a happened-before join to boost the efficiency of both
• Low overhead for cross-boundary analysis
  § Effective method for error diagnosis within distributed systems


References

• Leslie Lamport. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (1978), 558–565.
