
1

Measuring and monitoring Microsoft’s enterprise network
Richard Mortier (mort), Rebecca Isaacs, Laurent Massoulié, Peter Key

2

We monitored our network…
…and this is how…
…and this is what we saw…

• How did we monitor it?
• What did we see?

3

Microsoft CorpNet @ MSR Cambridge

[Network diagram: CORPNET regions (North America, Latin America, Asia Pacific, EMEA), routing areas (Area 0, area1, area2, area3), and the MSRC site connecting via eBGP]

4

Capture setup

• MSRC site organized using IP subnets
  – Roughly one per wing plus one for the datacenter
  – Datacenter is by far the most active

• Captured using VLAN spanning
  – 1:1 mapping between (Ethernet) VLAN and IP subnet
  – Mapped all VLANs to one port (NS trace)…
  – …except datacenter, mapped to a second port (DC trace)

• Also took a capture at one VLAN’s Ethernet switch
  – Allowed us to estimate the amount of traffic not captured (sketched below)
  – >99% of traffic is routed (i.e. goes ‘off-VLAN’)
  – Missed printer, some subnet broadcast, some SMB
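A minimal sketch of that coverage estimate, assuming the per-VLAN switch capture has already been reduced to hypothetical (src_ip, dst_ip, length) records; the subnet and records below are made up for illustration, not taken from the traces.

```python
# Sketch: estimate what fraction of one VLAN's traffic is routed off-VLAN,
# from packet records captured at that VLAN's Ethernet switch.
# Assumes a hypothetical iterable of (src_ip, dst_ip, length) tuples;
# the real traces were raw packet captures, not this simplified format.
from ipaddress import ip_address, ip_network

def off_vlan_fraction(packets, vlan_subnet):
    """Return the byte fraction of traffic with at least one endpoint
    outside the VLAN's IP subnet (i.e. traffic that is routed)."""
    subnet = ip_network(vlan_subnet)
    routed = total = 0
    for src, dst, length in packets:
        total += length
        if ip_address(src) not in subnet or ip_address(dst) not in subnet:
            routed += length
    return routed / total if total else 0.0

# Example with made-up records on a made-up subnet:
records = [
    ("10.1.2.5", "10.1.2.9", 1500),    # stays on-VLAN
    ("10.1.2.5", "10.7.0.20", 1500),   # routed off-VLAN
    ("157.58.0.1", "10.1.2.9", 600),   # routed off-VLAN
]
print(off_vlan_fraction(records, "10.1.2.0/24"))  # -> 0.5833...
```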

6

Packet processing

1. Assigned packets to application
   – Used port numbers, RPC GUID, signature byte strings, server name (sketched below)

2. Assigned applications to category
   – ~40 applications → ~10 categories

3. Generated packet and flow records
   – Reduce disk IO, increase performance
   – Still took ~10 days per complete run

4. Python scripts processed records
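A minimal sketch of step 1, assuming classification is driven by a port table plus payload signatures. The port-to-application table and the HTTP signature check below are illustrative stand-ins, not the actual ~40-application rule set (which also used RPC GUIDs and server names).

```python
# Sketch of step 1 of the pipeline: map a packet's transport ports to an
# application label, falling back to a payload signature check.
PORT_TO_APP = {
    53: "DNS",
    80: "HTTP",
    88: "Kerberos",
    445: "SMB",
    3389: "RemoteDesktop",
}

def classify_packet(src_port, dst_port, payload=b""):
    # A well-known port on either side wins; otherwise try payload
    # signature byte strings (only one illustrative rule here).
    for port in (dst_port, src_port):
        if port in PORT_TO_APP:
            return PORT_TO_APP[port]
    if payload.startswith(b"GET ") or payload.startswith(b"POST "):
        return "HTTP"
    return "Unknown"

print(classify_packet(52731, 445))                        # -> SMB
print(classify_packet(52731, 8080, b"GET /index.html"))   # -> HTTP
```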

7

Problems with this setup

• Duplication
  – No DC switch: some hosts directly connected to router
  – See their packets twice (on the way in and out)
  → Deduplicate both traces; careful selection from NS trace

• IPSec transport mode deployment
  – Packet encapsulated in shim header plus trailer
  – IP protocol moved into trailer and header rewritten
  → Wrote custom capture tools to unpick the encapsulation

• Flow detection
  – Network flow ≠ transport flow ≠ application flow
  → Used IP 5-tuple and timeout = 90 seconds (sketched below)
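A minimal sketch of that flow-detection rule (IP 5-tuple plus a 90-second timeout), assuming packets have already been reduced to hypothetical (timestamp, src, dst, proto, sport, dport, length) records; direction canonicalization and the IPSec unwrapping are left out.

```python
# Group packets by IP 5-tuple; start a new flow record whenever the gap
# since the tuple's previous packet exceeds the timeout.
FLOW_TIMEOUT = 90.0  # seconds, as used for the traces

def packets_to_flows(packets):
    """Yield (five_tuple, start, end, bytes, pkts) flow records."""
    active = {}  # five_tuple -> [start, last_seen, bytes, pkts]
    for ts, src, dst, proto, sport, dport, length in sorted(packets):
        key = (src, dst, proto, sport, dport)
        state = active.get(key)
        if state is None or ts - state[1] > FLOW_TIMEOUT:
            if state is not None:
                yield (key, state[0], state[1], state[2], state[3])
            state = [ts, ts, 0, 0]
            active[key] = state
        state[1] = ts
        state[2] += length
        state[3] += 1
    for key, state in active.items():
        yield (key, state[0], state[1], state[2], state[3])

# Usage with made-up packets: a >90 s gap splits the same 5-tuple in two.
pkts = [
    (0.0,   "10.1.2.5", "10.3.0.9", 6, 51000, 445, 1500),
    (10.0,  "10.1.2.5", "10.3.0.9", 6, 51000, 445, 1500),
    (200.0, "10.1.2.5", "10.3.0.9", 6, 51000, 445, 400),
]
for flow in packets_to_flows(pkts):
    print(flow)
```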

9

Trace characteristics

Date                       25 Aug 2005 – 21 Sep 2005
Duration                   622 hours
Size on disk               5.35 TB
Snaplen                    152 bytes
% IPSec packets            84%
# hosts seen               28,495
# bytes (onsite:offsite)   11.4 TB   9.8:1.6 TB (86%:14%)
# pkts (onsite:offsite)    12.8 bn   9.7:3.1 bn (76%:24%)
# flows (onsite:offsite)   66.9 mn   38.8:28.1 mn (58%:42%)

10

Traffic classification

Category        Constituent applications
Backup          Backup
Directory       Active Directory, DNS, NetBIOS Name
Email           Exchange, SMTP, IMAP, POP
File            SMB, NetBIOS Session, NetBIOS Datagram, print
Management      SMS, MOM, ICMP, IGMP, Radius, BGP, Kerberos, IPSec key exchange, DHCP, NTP
Messenger       Messenger
RemoteDesktop   Remote desktop protocol
RPC             RPC Endpoint mapper service
SourceDepot     Source Depot (source control)
Web             HTTP, Proxy
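For step 2 of the packet-processing pipeline, the table above can be encoded directly as a lookup. The dictionary below is a sketch of that mapping, with application names copied from the table; the fallback category is an assumption.

```python
# Application -> category mapping, transcribed from the table above.
APP_TO_CATEGORY = {
    "Backup": "Backup",
    "Active Directory": "Directory", "DNS": "Directory", "NetBIOS Name": "Directory",
    "Exchange": "Email", "SMTP": "Email", "IMAP": "Email", "POP": "Email",
    "SMB": "File", "NetBIOS Session": "File", "NetBIOS Datagram": "File", "print": "File",
    "SMS": "Management", "MOM": "Management", "ICMP": "Management", "IGMP": "Management",
    "Radius": "Management", "BGP": "Management", "Kerberos": "Management",
    "IPSec key exchange": "Management", "DHCP": "Management", "NTP": "Management",
    "Messenger": "Messenger",
    "Remote desktop protocol": "RemoteDesktop",
    "RPC Endpoint mapper service": "RPC",
    "Source Depot": "SourceDepot",
    "HTTP": "Web", "Proxy": "Web",
}

def category_of(app):
    # "Other" is an assumed fallback for applications outside the table.
    return APP_TO_CATEGORY.get(app, "Other")

print(category_of("SMB"))  # -> File
```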

11

Protocol distribution

13

[Scatterplot annotations: # flows ≈ # src ports suggests client behaviour; flows using few src ports suggest server behaviour; neither pattern suggests peer-to-peer]
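A minimal sketch of the heuristic those annotations describe, comparing a host's flow count with its number of distinct source ports; the 0.8 and 0.1 thresholds are illustrative guesses, not values from the study.

```python
# Rough host-role heuristic: clients tend to use one ephemeral source port
# per flow, servers concentrate many flows on a few well-known ports,
# and peer-to-peer traffic shows neither pattern clearly.
def classify_role(num_flows, num_src_ports):
    if num_flows == 0:
        return "idle"
    ratio = num_src_ports / num_flows
    if ratio > 0.8:       # roughly one source port per flow
        return "client-like"
    if ratio < 0.1:       # many flows share a few source ports
        return "server-like"
    return "peer-to-peer-like"

print(classify_role(1000, 950))  # -> client-like
print(classify_role(1000, 3))    # -> server-like
print(classify_role(1000, 400))  # -> peer-to-peer-like
```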

15

Traffic dynamics

• Headlines: seasonal, highly volatile

• Examine through
  – Autocorrelations (sketched below)
  – Variation per-application per-hour
  – Variation per-application per-host
  – Variation in heavy-hitter set
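A minimal sketch of the first analysis listed above, the autocorrelation (correlogram) of an hourly traffic series; the series here is synthetic, with a daily cycle standing in for per-application byte counts.

```python
# Sample autocorrelation of an hourly traffic-volume series.
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation r(k) for lags 1..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return [np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)]

# Synthetic 4-week hourly series with a 24-hour cycle plus noise.
rng = np.random.default_rng(1)
hours = np.arange(24 * 28)
traffic = 100 + 40 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 10, len(hours))

r = autocorrelation(traffic, 48)
print(round(r[23], 2))  # lag 24 h: strong positive correlation (daily seasonality)
```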

16

Correlograms: onsite traffic

17

Correlograms: offsite traffic

18

Variation per-application per-hour

• Exponential decay
• Light-tailed
• Onsite (left)
• Offsite (down)

19

Variation per-application per-host

• Linear decay
• Heavy-tailed
• Heavy hitters
• Onsite (left)
• Offsite (down)

20

Implications for modelling

• Timeseries modelling is hard
  – Tried ARMA, ARIMA models, but per-application only
  – Exponentiation leads to large errors in forecasting (sketched below)

• Client/server distinction unclear
  – Tried PCA, “projection pursuit method”
  – Neither found anything

• PCA discovered singleton clusters in rank order...
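A minimal sketch of the per-application time-series attempt, assuming statsmodels and a synthetic hourly byte-count series; it fits an ARIMA model on the log scale and notes where exponentiating the forecast amplifies errors. This is an illustration of the issue, not the models actually fitted in the study.

```python
# Fit an ARIMA model to log-transformed hourly volumes, then forecast.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
hours = np.arange(24 * 28)
# Synthetic hourly byte counts with a daily cycle, on an exponential scale.
volume = np.exp(12 + 0.5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.3, len(hours)))

log_v = np.log(volume)
fit = ARIMA(log_v, order=(1, 0, 1)).fit()
log_forecast = fit.forecast(steps=24)

# Exponentiation: an additive error e on the log scale becomes a
# multiplicative factor exp(e) on the byte scale, so modest log-scale
# misses turn into large absolute errors on heavy-traffic hours.
byte_forecast = np.exp(log_forecast)
print(byte_forecast[:3])
```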

21

Implications for endsystem measurement

• Heavy hitter tracking is a useful approach for network monitoring

• Must be dynamic, since the heavy-hitter set varies (sketched below)
  – between applications and
  – over time per application

• …but is it possible to define a baseline against which to detect (volume) anomalies?
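A minimal sketch of dynamic heavy-hitter tracking: recompute the per-application top-k hosts over tumbling time windows, so the tracked set can change between applications and over time. The window length, k, and flow-record format are illustrative assumptions, not parameters from the study.

```python
# Per-window top-k hosts by bytes, kept separately for each application.
from collections import Counter, defaultdict

WINDOW = 3600.0   # one-hour windows (illustrative)
TOP_K = 10        # illustrative

def heavy_hitters(flow_records):
    """flow_records: iterable of (timestamp, application, host, bytes).
    Yields (window_start, application, [(host, bytes), ...]) per window."""
    window_start = None
    counts = defaultdict(Counter)  # application -> Counter of bytes per host
    for ts, app, host, nbytes in sorted(flow_records):
        if window_start is None:
            window_start = ts
        while ts >= window_start + WINDOW:
            for a, c in counts.items():
                yield (window_start, a, c.most_common(TOP_K))
            counts.clear()
            window_start += WINDOW
        counts[app][host] += nbytes
    for a, c in counts.items():
        yield (window_start, a, c.most_common(TOP_K))
```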

22

Questions?