Distributed systems-radiology

Modern Radiology for Distributed Systems Dietrich Featherston @d2fn Thursday, October 11, 12

description

Each of us operates distributed systems. Some of us run traditional infrastructure with database, web, and load-balancing tiers. Others require more bespoke infrastructure that may incorporate non-traditional storage (such as Riak). Wherever each of us falls on this spectrum, the network closely describes the behavior of our applications. It is also the only place we can look to understand the emergent behavior of applications working in concert. In this talk, we take a radiological view of network-derived imagery and discuss what it can tell us about our systems as a whole.

Transcript of Distributed systems-radiology

Page 1: Distributed systems-radiology

Modern Radiology for Distributed Systems

Dietrich Featherston @d2fn

Thursday, October 11, 12

Page 2: Distributed systems-radiology

This is a talk about monitoring

Page 3: Distributed systems-radiology

But not just any kind of monitoring

Non-invasive monitoring

Page 4: Distributed systems-radiology

non-invasive monitoring

measures taken to describe the state of a system with minimal changes to the system being monitored

Page 5: Distributed systems-radiology

Insight

Invasiveness

Radiographic Imagery

Page 6: Distributed systems-radiology

preventative care

measures taken to prevent diseases or injuries rather than curing them or treating their symptoms

Page 7: Distributed systems-radiology

Non-invasive monitoring techniques focus primarily on host-based metrics

Why is this a problem?

Page 8: Distributed systems-radiology

Because applications are distributed

Page 9: Distributed systems-radiology

Information emitted about nodes in the network: n

Information emitted about edges in the network: n²

Network size
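As a rough illustration of the scaling on this slide, the sketch below counts potential sources of information for a network of n hosts. The function names are hypothetical, and it assumes every host may talk to every other host, so the number of directed edges is n(n-1), which grows like n².

```python
def node_sources(n: int) -> int:
    """One stream of host-based metrics per node."""
    return n


def edge_sources(n: int) -> int:
    """One potential directed conversation per ordered pair of distinct nodes."""
    return n * (n - 1)


# Print how quickly the edge count outpaces the node count.
for n in (10, 100, 1000):
    print(n, node_sources(n), edge_sources(n))
```

Under these assumptions, the number of edges overtakes the number of nodes as soon as the network has more than two hosts, and the gap widens with every host added.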

Page 10: Distributed systems-radiology

We analyze cell structure because we can’t envision the whole organism

We react to disease and injury because we lack preventative care

Page 11: Distributed systems-radiology

We lack preventative care for applications because our non-invasive monitoring techniques are growing less and less meaningful

Page 12: Distributed systems-radiology

Radiology is useful in illuminating non-invasive monitoring of distributed systems

Page 13: Distributed systems-radiology

Page 14: Distributed systems-radiology

Page 15: Distributed systems-radiology

Page 16: Distributed systems-radiology

Context is everything

Page 17: Distributed systems-radiology

How do we use context?

Page 18: Distributed systems-radiology

Context

Your Big Dumb Data

!!!

Page 19: Distributed systems-radiology

Human brain

+med school

Radiographic Imagery

Diagnoses

Page 20: Distributed systems-radiology

Signal Processing

VLA Output

E.T.

Page 21: Distributed systems-radiology

Network Data

Application Behavior

Application Topology

Signal Processing

Expert Brain

Page 22: Distributed systems-radiology

dimensions (11)
- epoch seconds
- epoch minutes
- epoch hours
- node id
- source ip
- source port
- dest ip
- dest port
- interface
- country
- network/asn

measurements (8)
- egress packets
- egress octets
- ingress packets
- ingress octets
- retransmits
- errors
- app-rtt
- handshake-rtt
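One way to read the split between dimensions and measurements is that dimensions are what you group by and measurements are what you sum. The sketch below is a minimal roll-up under that assumption. The dictionary layout, the field names, and the `roll_up` helper are all hypothetical and are not taken from the talk.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple

# Hypothetical sample layout: each dict carries a few of the slide's
# dimensions (epoch_minute, source_ip, dest_ip) and measurements
# (egress_octets, retransmits). The field names are assumptions.
Sample = Dict[str, Any]


def roll_up(samples: Iterable[Sample], dims: Tuple[str, ...], measure: str) -> Dict[tuple, int]:
    """Sum one measurement over each distinct combination of the chosen dimensions."""
    totals: Dict[tuple, int] = defaultdict(int)
    for sample in samples:
        key = tuple(sample[d] for d in dims)
        totals[key] += sample[measure]
    return dict(totals)


samples = [
    {"epoch_minute": 0, "source_ip": "10.0.0.1", "dest_ip": "10.0.0.2", "egress_octets": 500, "retransmits": 1},
    {"epoch_minute": 0, "source_ip": "10.0.0.1", "dest_ip": "10.0.0.3", "egress_octets": 300, "retransmits": 0},
    {"epoch_minute": 1, "source_ip": "10.0.0.1", "dest_ip": "10.0.0.2", "egress_octets": 200, "retransmits": 4},
]

# Same data, two views: traffic per minute, and retransmits per network edge.
by_minute = roll_up(samples, ("epoch_minute",), "egress_octets")
by_edge = roll_up(samples, ("source_ip", "dest_ip"), "retransmits")
```

Grouping by source and destination IP gives a per-edge view of the network, while grouping by a time dimension gives a time series.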

Page 23: Distributed systems-radiology

Case Study #1

GC-Death of a distributed JVM application

Page 24: Distributed systems-radiology

Page 25: Distributed systems-radiology

Case Study #2

Symptoms:
- Latent Riak handoff
- Cluster throughput bottoming out

Page 26: Distributed systems-radiology

Page 27: Distributed systems-radiology

busy_dist_port

Page 28: Distributed systems-radiology

+zdbbl 8192

Page 29: Distributed systems-radiology

Page 30: Distributed systems-radiology

Case Study #3

Bringing a dead Riak node back online

Page 31: Distributed systems-radiology

Page 32: Distributed systems-radiology

Page 33: Distributed systems-radiology

Page 34: Distributed systems-radiology

Case Study #4

Retransmits 10% of total network throughput

Page 35: Distributed systems-radiology

Page 36: Distributed systems-radiology

    var put: HttpPut = null
    try {
      // ... put data
    } catch {
      case e: Exception =>
        // ... handle exception
    } finally {
      if (put != null) {
        put.abort()
      }
    }

Page 39: Distributed systems-radiology

    public void abort() {
        ClientConnectionRequest localRequest;
        ConnectionReleaseTrigger localTrigger;

        this.abortLock.lock();
        try {
            if (this.aborted) {
                return;
            }
            this.aborted = true;

            localRequest = connRequest;
            localTrigger = releaseTrigger;
        } finally {
            this.abortLock.unlock();
        }

        // Trigger the callbacks outside of the lock, to prevent
        // deadlocks in the scenario where the callbacks have
        // their own locks that may be used while calling
        // setReleaseTrigger or setConnectionRequest.
        if (localRequest != null) {
            localRequest.abortRequest();
        }
        if (localTrigger != null) {
            try {
                localTrigger.abortConnection();
            } catch (IOException ex) {
                // ignore
            }
        }
    }
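The listing on this slide flips a flag while holding a lock and only then fires its callbacks, after the lock is released. The sketch below mirrors that shape in Python as an illustration of the pattern only. The `Abortable` class and its `on_abort` callback are hypothetical stand-ins, not the API of the library shown on the slide.

```python
import threading
from typing import Callable, Optional


class Abortable:
    """Hypothetical analogue of the abort() listing; not a real library class."""

    def __init__(self, on_abort: Optional[Callable[[], None]] = None) -> None:
        self._lock = threading.Lock()
        self._aborted = False
        self._on_abort = on_abort

    def abort(self) -> None:
        # Flip the flag and copy the callback while holding the lock.
        with self._lock:
            if self._aborted:
                return
            self._aborted = True
            callback = self._on_abort
        # Run the callback after releasing the lock, so a callback that
        # takes its own locks cannot deadlock against this one.
        if callback is not None:
            try:
                callback()
            except OSError:
                pass  # mirrors the listing, which ignores IOException


calls = []
request = Abortable(on_abort=lambda: calls.append("connection aborted"))
request.abort()
request.abort()  # the second call returns early and fires nothing
```

Calling `abort()` twice fires the callback once, because the second call sees the flag already set and returns.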

Page 40: Distributed systems-radiology

Page 41: Distributed systems-radiology

augmented intelligence precedes artificial intelligence

Page 44: Distributed systems-radiology

1895

Wilhelm Röntgen discovers X-rays. First medical use of X-rays in human imaging takes place one month later.

1905

First English text on chest radiography

1920

Society of Radiographers formed

Page 45: Distributed systems-radiology

Recognition of radiology as a formal medical discipline was a cultural problem, not a technology problem

http://www.bshr.org.uk/page13.html

Page 46: Distributed systems-radiology

If you want to talk to me about the query language used to ask questions of the network data we collect at Boundary, talk to me afterward or hit me up on Twitter.

@d2fn

github.com/dietrichf

Page 47: Distributed systems-radiology

Find 45 minutes of total traffic seen on meters 1, 2, 226, & 301 starting 18 hours ago, broken down by peer IP; retain the top 10 by the ratio of retransmits to packets.

    get volume_1s_meter_ip [
      meter in {1, 2, 226, 301};
      epochMillis from -18h for 45m;
    ]
    categorize
      sum(ingress) as ingress,
      sum(egress) as egress,
      sum(ingressPackets + egressPackets) as packets,
      sum(retransmits) as retransmits,
      mean(appRttUsec/1000) as appRttMs
    by epochMillis, ip
    retain top 10 per epochMillis on retransmits/packets
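To make the three stages of the query concrete (filter, categorize, retain), here is a minimal in-memory sketch based on the English description on this slide. It is not how the query language is implemented. The row layout, the `retain_top` function, and the explicit `start`/`end` bounds (standing in for the relative "-18h for 45m" window) are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical row layout: (epoch_millis, meter, ip, packets, retransmits).
Row = Tuple[int, int, str, int, int]


def retain_top(rows: Iterable[Row], meters: Set[int], start: int, end: int,
               k: int) -> Dict[int, List[Tuple[str, float]]]:
    """Per timestamp, keep the k peer IPs with the highest retransmits/packets ratio."""
    sums: Dict[Tuple[int, str], List[int]] = defaultdict(lambda: [0, 0])
    for ts, meter, ip, packets, retransmits in rows:
        # Filter stage: the meter set and the time window.
        if meter in meters and start <= ts < end:
            sums[(ts, ip)][0] += packets
            sums[(ts, ip)][1] += retransmits

    # Categorize stage: one ratio per (timestamp, ip) group.
    per_ts: Dict[int, List[Tuple[str, float]]] = defaultdict(list)
    for (ts, ip), (packets, retransmits) in sums.items():
        if packets:  # skip empty groups to avoid dividing by zero
            per_ts[ts].append((ip, retransmits / packets))

    # Retain stage: top k per timestamp, highest ratio first.
    return {ts: sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
            for ts, pairs in per_ts.items()}


rows = [
    (1000, 1, "10.0.0.2", 100, 10),
    (1000, 2, "10.0.0.2", 100, 0),
    (1000, 1, "10.0.0.3", 50, 1),
    (1000, 9, "10.0.0.4", 10, 9),  # meter 9 is not in the requested set
    (2000, 226, "10.0.0.3", 40, 4),
]
top = retain_top(rows, {1, 2, 226, 301}, start=0, end=3000, k=1)
```

In this toy data the peer on meter 9 has the worst ratio but never appears in the result, because the meter filter runs before any ranking.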
