1 Measuring and monitoring Microsoft’s enterprise network Richard Mortier (mort), Rebecca Isaacs,...
Date posted: 21-Dec-2015
1
Measuring and monitoring Microsoft’s enterprise network
Richard Mortier (mort), Rebecca Isaacs, Laurent Massoulié, Peter Key
2
We monitored our network…
…and this is how…
…and this is what we saw…
• How did we monitor it?
• What did we see?
3
Microsoft CorpNet @ MSR Cambridge
[Network diagram. Labels: Latin America, North America, Asia Pacific, EMEA; Area 0, area1, area2, area3; CORPNET; eBGP; MSRC]
4
Capture setup
• MSRC site organized using IP subnets
  – Roughly one per wing plus one for datacenter
  – Datacenter is by far the most active
• Captured using VLAN spanning
  – 1:1 mapping between (Ethernet) VLAN and IP subnet
  – Mapped all VLANs to one port (NS trace)…
  – …except datacenter, mapped to second port (DC trace)
• Also took a capture at one VLAN’s Ethernet switch
  – Allowed us to estimate amount of traffic not captured
  – >99% traffic is routed (i.e. goes ‘off-VLAN’)
  – Missed printer, some subnet broadcast, some SMB
6
Packet processing
1. Assigned packets to application
  – Used port numbers, RPC GUID, signature byte strings, server name
2. Assigned applications to category
  – ~40 applications → ~10 categories
3. Generated packet and flow records
  – Reduce disk IO, increase performance
  – Still took ~10 days per complete run
4. Python scripts processed records
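The first two steps can be sketched as a pair of table lookups. This is only a minimal illustration: it uses well-known port numbers alone, whereas the talk says RPC GUIDs, signature byte strings, and server names were also used. The table contents, the function names, and the "Unknown" fallback are assumptions made for the example.

```python
from typing import Optional

# Hypothetical port-to-application table (well-known ports only).
# RPC GUIDs, byte signatures, and server names are not modelled here.
PORT_TO_APP = {
    25: "SMTP",
    53: "DNS",
    80: "HTTP",
    88: "Kerberos",
    110: "POP",
    123: "NTP",
    143: "IMAP",
    445: "SMB",
    3389: "Remote desktop protocol",
}

# Application-to-category mapping, using category names from the talk.
APP_TO_CATEGORY = {
    "SMTP": "Email",
    "IMAP": "Email",
    "POP": "Email",
    "DNS": "Directory",
    "HTTP": "Web",
    "SMB": "File",
    "Remote desktop protocol": "RemoteDesktop",
    "Kerberos": "Management",
    "NTP": "Management",
}


def classify_app(src_port: int, dst_port: int) -> Optional[str]:
    """Return the application for a packet, trying the destination port first."""
    for port in (dst_port, src_port):
        app = PORT_TO_APP.get(port)
        if app is not None:
            return app
    return None


def classify_category(src_port: int, dst_port: int) -> str:
    """Map a packet to a category; 'Unknown' is a fallback added for this sketch."""
    app = classify_app(src_port, dst_port)
    if app is None:
        return "Unknown"
    return APP_TO_CATEGORY.get(app, "Unknown")
```

For example, `classify_category(50000, 445)` yields `"File"`, while a packet with two unlisted ports falls through to `"Unknown"`.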
7
Problems with this setup
• Duplication
  – No DC switch: some hosts directly connected to router
  – See their packets twice (on the way in and out)
  → Deduplicate both traces; careful selection from NS trace
• IPSec transport mode deployment
  – Packet encapsulated in shim header plus trailer
  – IP protocol moved into trailer and header rewritten
  → Wrote custom capture tools to unpick encapsulation
• Flow detection
  – Network flow ≠ transport flow ≠ application flow
  → Used IP 5-tuple and timeout = 90 seconds
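The flow-detection rule (IP 5-tuple plus a 90-second timeout) might look roughly like the sketch below. It assumes packets arrive in timestamp order and treats each direction as its own flow; the record layout and the names `Flow` and `build_flows` are hypothetical, not taken from the authors' tools.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

FLOW_TIMEOUT = 90.0  # seconds, as stated on the slide

# (src_ip, dst_ip, protocol, src_port, dst_port)
FiveTuple = Tuple[str, str, int, int, int]


@dataclass
class Flow:
    key: FiveTuple
    start: float
    end: float
    packets: int
    byte_count: int


def build_flows(packets: Iterable[Tuple[float, FiveTuple, int]]) -> List[Flow]:
    """Group (timestamp, five_tuple, length) records into flows.

    Assumes records are sorted by timestamp. A packet whose 5-tuple was
    last seen more than FLOW_TIMEOUT seconds ago starts a new flow.
    """
    active: Dict[FiveTuple, Flow] = {}
    finished: List[Flow] = []
    for ts, key, length in packets:
        flow = active.get(key)
        if flow is not None and ts - flow.end > FLOW_TIMEOUT:
            finished.append(flow)  # idle too long: close the old flow
            flow = None
        if flow is None:
            active[key] = Flow(key, ts, ts, 1, length)
        else:
            flow.end = ts
            flow.packets += 1
            flow.byte_count += length
    finished.extend(active.values())  # flush flows still open at end of trace
    return finished
```

With packets at 0 s, 50 s, and 200 s on the same 5-tuple, the first two land in one flow and the third starts a second, because the 150-second gap exceeds the timeout.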
9
Trace characteristics
Date                      25 Aug 2005 – 21 Sep 2005
Duration                  622 hours
Size on disk              5.35 TB
Snaplen                   152 bytes
% IPSec packets           84%
# hosts seen              28,495
# bytes (onsite:offsite)  11.4 TB    9.8:1.6 TB (86%:14%)
# pkts (onsite:offsite)   12.8 bn    9.7:3.1 bn (76%:24%)
# flows (onsite:offsite)  66.9 mn    38.8:28.1 mn (58%:42%)
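The onsite:offsite percentages follow directly from the totals on this slide. The short check below recomputes them; the dictionary layout and the helper name `split_percent` are illustrative only.

```python
# Totals from the trace-characteristics slide, as (onsite, offsite).
totals = {
    "bytes (TB)": (9.8, 1.6),
    "packets (bn)": (9.7, 3.1),
    "flows (mn)": (38.8, 28.1),
}


def split_percent(onsite: float, offsite: float) -> tuple:
    """Return the onsite and offsite shares as whole percentages."""
    total = onsite + offsite
    return round(100 * onsite / total), round(100 * offsite / total)


shares = {name: split_percent(*pair) for name, pair in totals.items()}
```

The shares reproduce the slide's figures. The offsite share rises from bytes (14%) to packets (24%) to flows (42%), so offsite traffic is made up of smaller packets and smaller flows than onsite traffic.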
10
Traffic classification
Category        Constituent applications
Backup          Backup
Directory       Active Directory, DNS, NetBIOS Name
Email           Exchange, SMTP, IMAP, POP
File            SMB, NetBIOS Session, NetBIOS Datagram, print
Management      SMS, MOM, ICMP, IGMP, Radius, BGP, Kerberos, IPSec key exchange, DHCP, NTP
Messenger       Messenger
RemoteDesktop   Remote desktop protocol
RPC             RPC Endpoint mapper service
SourceDepot     Source Depot (CVS source control)
Web             HTTP, Proxy
13
[Figure annotations]
• # flows ~ # src ports suggests client behaviour
• Flows using few src ports suggests server behaviour
• Neither client nor server suggests peer-to-peer
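The heuristic in these annotations compares each host's flow count with the number of distinct source ports it uses. One way to sketch it is below. The two ratio thresholds are hypothetical, since the talk gives no numeric cut-offs, and the function name and input format are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def classify_hosts(flows: Iterable[Tuple[str, int]],
                   client_ratio: float = 0.8,
                   server_ratio: float = 0.2) -> Dict[str, str]:
    """Label each source host from (src_ip, src_port) flow records.

    client_ratio and server_ratio are hypothetical thresholds on
    distinct_src_ports / flows; they are not values from the talk.
    """
    flow_count: Dict[str, int] = defaultdict(int)
    ports: Dict[str, Set[int]] = defaultdict(set)
    for src_ip, src_port in flows:
        flow_count[src_ip] += 1
        ports[src_ip].add(src_port)

    labels: Dict[str, str] = {}
    for host, n in flow_count.items():
        ratio = len(ports[host]) / n
        if ratio >= client_ratio:
            labels[host] = "client"         # roughly one src port per flow
        elif ratio <= server_ratio:
            labels[host] = "server"         # many flows from few src ports
        else:
            labels[host] = "peer-to-peer?"  # fits neither pattern
    return labels
```

Hosts with very few flows will trivially look like clients under this rule, so a minimum flow count would be a sensible extra filter in practice.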
15
Traffic dynamics
• Headlines: seasonal, highly volatile
• Examine through
  – Autocorrelations
  – Variation per-application per-hour
  – Variation per-application per-host
  – Variation in heavy-hitter set
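Seasonality of the kind mentioned here usually shows up as peaks in the sample autocorrelation at the seasonal lag. A minimal, dependency-free sketch follows, assuming traffic has been binned into an hourly volume series; the binning and the function name are assumptions, not details from the talk.

```python
from typing import List, Sequence


def autocorrelation(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocorrelation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    variance = sum(c * c for c in centred)
    if variance == 0:
        raise ValueError("series is constant; autocorrelation undefined")
    return [
        sum(centred[t] * centred[t + lag] for t in range(n - lag)) / variance
        for lag in range(max_lag + 1)
    ]
```

For an hourly series with a daily cycle, the value at lag 24 would be strongly positive and the value at lag 12 strongly negative.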
18
Variation per-application per-hour
• Exponential decay
• Light-tailed
• Onsite (left)
• Offsite (down)
19
Variation per-application per-host
• Linear decay
• Heavy-tailed
• Heavy hitters
• Onsite (left)
• Offsite (down)
20
Implications for modelling
• Timeseries modelling is hard
  – Tried ARMA, ARIMA models but per-application only
  – Exponentiation leads to large errors in forecasting
• Client/server distinction unclear
  – Tried PCA, “projection pursuit method”
  – Neither found anything
• PCA discovered singleton clusters in rank order...
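One plausible reading of the exponentiation point assumes the ARMA/ARIMA models were fitted to log-transformed traffic volumes. The slide does not state this, so the derivation below is a sketch under that assumption, with symbols introduced here for illustration.

```latex
% Assumption: the model is fitted to y_t = \log x_t, where x_t is the
% traffic volume in interval t and \varepsilon is the forecast error in y.
\begin{align*}
  \hat{y}_{t+h} &= y_{t+h} + \varepsilon_{t+h}
    && \text{(additive error on the log scale)} \\
  \hat{x}_{t+h} &= e^{\hat{y}_{t+h}} = x_{t+h}\, e^{\varepsilon_{t+h}}
    && \text{(multiplicative error after exponentiation)} \\
  \frac{\hat{x}_{t+h} - x_{t+h}}{x_{t+h}} &= e^{\varepsilon_{t+h}} - 1
\end{align*}
```

Under this reading, a log-scale error of 0.1 becomes a relative error of about 10.5%, while a log-scale error of 1 becomes roughly 172%, so modest errors in the transformed series can translate into large errors in forecast volume.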
21
Implications for endsystem measurement
• Heavy hitter tracking a useful approach for network monitoring
• Must be dynamic since heavy hitter set varies
  – between applications and
  – over time per-application
• …but is it possible to define a baseline against which to detect (volume) anomalies?
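To illustrate why the heavy-hitter set has to be tracked dynamically, the sketch below recomputes the set for each hour and measures how much consecutive sets overlap. The 80% coverage threshold, the use of Jaccard similarity, and all function names are assumptions made for the example, not the authors' method.

```python
from typing import Dict, List, Set


def heavy_hitters(volume_by_host: Dict[str, float],
                  coverage: float = 0.8) -> Set[str]:
    """Smallest set of top hosts accounting for `coverage` of total volume.

    The 80% default is a hypothetical choice, not a value from the talk.
    """
    total = sum(volume_by_host.values())
    chosen: Set[str] = set()
    running = 0.0
    ranked = sorted(volume_by_host.items(), key=lambda kv: kv[1], reverse=True)
    for host, volume in ranked:
        if running >= coverage * total:
            break
        chosen.add(host)
        running += volume
    return chosen


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def set_stability(hourly: List[Dict[str, float]],
                  coverage: float = 0.8) -> List[float]:
    """Jaccard similarity between heavy-hitter sets of consecutive hours."""
    sets = [heavy_hitters(h, coverage) for h in hourly]
    return [jaccard(prev, cur) for prev, cur in zip(sets, sets[1:])]
```

Running `set_stability` separately per application would show both kinds of variation named above: low similarity values indicate a set that changes over time, and comparing sets across applications shows how they differ between applications.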