Effective Datacenter Troubleshooting Methodologies: A...


Page 1: Effective Datacenter Troubleshooting Methodologies: A …d2zmdbbm9feqrf.cloudfront.net/2014/usa/pdf/BRKDCT-2408.pdf · Effective Datacenter Troubleshooting Methodologies: A Case Study
Page 2:

Effective Datacenter Troubleshooting Methodologies: A Case Study Review

BRKDCT-2408

Jane Gao Customer Support Engineer

Jerred Horsman Systems Engineer

Page 3:

© 2014 Cisco and/or its affiliates. All rights reserved. BRKDCT-2408 Cisco Public

Agenda

• Data Center Solution Overview

• Troubleshooting Basics

• Case Studies

• The Dos and the Don'ts

Page 4:

Data Center Solution Overview

Page 5:

Cisco Unified Data Center


Unified Fabric: Highly Scalable, Secure Network Fabric

• Deliver architectural flexibility

• Provide consistent networking across physical, virtual, and cloud environments

Unified Computing: Integrated, Smart Computing Infrastructure

• Unify computing, networking, storage access, and virtualization resources

• Simplify management and enhance flexibility

Unified Management: Automated Resource Management

• Simplify and automate IT provisioning

• Deliver physical and virtual resources on demand

Page 6:

84% of the respondents in this study, from senior level to rank-and-file, say they would rather walk barefoot over hot coals than have their data center go down.

Ponemon Institute, September 2013

Page 7:

When a Data Center Goes Down…


Bad stuff like this happens:

– Loss of revenue and service

– Loss of business continuance

– Service disruption

– Lower customer satisfaction

Quickly identify the problem area and get to a solution in the minimum time!

Page 8:

Troubleshooting Basics - Methodology

Page 9:

How We Troubleshoot

Understanding the problem

• Knowledge based: 7 * 9 = 63

• Strategy based: 24 * 41 = 984

Page 10:

The Skill Pyramid

[Figure: skill pyramid relating Knowledge and Strategy to Problems, ranging from low complexity to high complexity]

Page 11:

The TAC Secret Ingredients

• Understanding the problem

• Troubleshoot

– Break down the issue

– Apply knowledge

– Identify possible causes

– Test the most probable cause

• Confirm the root cause

Page 12:

A Sample Problem


There are packet drops on the network

Affected servers

Page 13:

Understanding the Problem

The 5 Ws

• Who is experiencing the problem

• Why is it important

• What are the effects

• When did the problem start

• Where does the problem occur

The H

• How did the problem start, what has changed


Situation assessment

Problem definition

Page 14:

Understanding the Problem -- Describe


• What is not the problem is often as important as what is the problem

• Ask the questions

When, where, what, to what extent

Affected servers

Working servers

Page 15:

Troubleshoot

Break down the problem - simplify

– Network Topology

– Technology

Apply Knowledge & Experience

– How things should have worked

– What are the changes

Identify the possible causes

– Changes (known vs. unknown)

– Rule out

Test the most probable cause

– Explain the symptoms (Is and Is Not)

– Satisfy the conditions (What, When, Where, Extent)

– Use the most approachable test effectively

First break down: simplify the topology

Page 16:

Verifying the Root Cause

Test against the conditions:

• Does the probable cause match the problem description?

• Does the probable cause satisfy all of the conditions?

Test against the cause:

• Eliminate the probable cause: does the problem get eliminated?

• Reproduce the same condition: does the problem get reproduced?
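
The "test against the conditions" step can be sketched in a few lines of Python. This is only an illustration and not part of the original session: the `Hypothesis` class, its fields, and the sample observations are all hypothetical. A candidate cause survives only if it predicts the symptom wherever the problem is seen and predicts no symptom wherever it is not.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One observation = one place where we checked for the symptom (hypothetical data).
Observation = Dict[str, object]


@dataclass
class Hypothesis:
    """A candidate cause plus a rule predicting where the symptom appears."""
    name: str
    predicts_problem: Callable[[Observation], bool]


def satisfies_conditions(h: Hypothesis, observations: List[Observation]) -> bool:
    """True only if the hypothesis explains both the IS and the IS NOT cases."""
    return all(h.predicts_problem(o) == o["has_problem"] for o in observations)


# Hypothetical observations: only traffic crossing the interconnect is slow.
observations = [
    {"path": "local", "vlan": 5, "has_problem": False},
    {"path": "interconnect", "vlan": 5, "has_problem": True},
    {"path": "interconnect", "vlan": 6, "has_problem": True},
    {"path": "local", "vlan": 6, "has_problem": False},
]

candidates = [
    Hypothesis("fault specific to VLAN 5", lambda o: o["vlan"] == 5),
    Hypothesis("fault on the interconnect", lambda o: o["path"] == "interconnect"),
]

# Keep only the causes that match every observation.
surviving = [h.name for h in candidates if satisfies_conditions(h, observations)]
print(surviving)
```

A cause that passes this check is still only probable; the "test against the cause" step (eliminate it, then reproduce it) is what confirms it.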

Page 17:

The TAC Secret Ingredients

• Understanding the problem

• Troubleshoot

– Break down the issue

– Apply knowledge

– Identify possible causes

– Test probable cause

• Confirm the root cause

Multiple servers on various VLANs are experiencing slowness during file transfers

• Network topology

• L2 vs. L3

• Confirm the symptoms

• Software processing

• L2 instability

• Unicast flooding

• Faulty hardware

• Ping / Traceroute

• Working vs. non-working

• Tools: Ethanalyzer, SPAN, etc.

• Forwarding path of the traffic

• L2 vs. L3

• Difference between sites

Page 18:

Troubleshooting Basics – The Tools

Page 19:

NX-OS Tools

• Granular show commands and CLI filtering

• Granular show tech-support

• Logging Capabilities

• GOLD (Generic OnLine Diagnostics)

• OBFL (On-Board Failure Logging)

• Debugs (with filters & redirection) and Debug Plugins

• Ethanalyzer (built-in “CPU sniffer”)

• ELAM

• EEM (Embedded Event Manager)

• SPAN

• Programmability


Info Collection

Hardware

Troubleshoot

Page 20:

Granular Show Commands and CLI Filtering

• Improved IOS-like CLI

– Feature specific show commands

– ‘show run’, ‘show run <feature>’ and ‘show run all’

– ‘show’ commands can be executed from exec or config mode

– Output piping ‘show xxx | ?’

• Well structured ‘show’ commands

– ‘show system internal’

– ‘show hardware internal’

– ‘show <feature> internal’

• Useful commands

– ‘hex’ / ‘dec’

– ‘diff’

– ‘show cli history [unformatted]’

Page 21:

Granular Show Tech-support

• Capture show tech

– ‘show tech detail’

– ‘tac-pac’

– ‘show tech <feature>’

– ‘show tech all binary’ (6.2.x feature)

• Need-to-knows

– Collect show tech as soon as possible

– Redirect the outputs to files using ‘>’

– Appending to files with ‘>>’

– Capture feature show tech in addition to show tech

• Commands

– ‘show tech

Page 22:

Logging Capabilities

• Persistent logging (Nexus 7000)

• Constant logging – event history

• Accounting log

• Commands:

– ‘show file logflash://sup-active/log/messages’

– ‘logging level <feature> <level>’

– ‘show log logfile’ vs. ‘show log nvram’

– ‘show accounting log’

– ‘show system internal <feature> event-history’

– ‘show <feature> internal event-history’

Page 23:

OBFL (On-Board Failure Logging)

• Persistent logging

– 32MB onboard flash

– Logs various events, for example:

• Reset reason

• Statistics history

• Kernel trace

• Others

• Command

– ‘show logging onboard mod <x>’

Page 24:

GOLD (Generic OnLine Diagnostics)

• A diagnostic framework that runs while the system is operational

– Corrective actions are taken through Embedded Event Manager (EEM) policies

– Tests run on both Supervisors and line cards

• Test types

– Bootup

– Health Monitoring

– On-demand

– Scheduled

• Commands

– ‘show diagnostic content’

– ‘show diagnostic result’

– ‘show diagnostic ?’

Page 25:

Debugs

• When event-history is not sufficient

– Redirect debug output to a file: ‘debug logfile <file>’

– Use debug-filter

• Debug-filter

– More granular debugs

– Can apply multiple filters simultaneously

• Commands

‘debug-filter pktmgr interface e1/1’

‘debug-filter pktmgr dest-mac 0100.5e00.000D’

‘show debug-filter all’

Page 26:

SPAN (Switched Port ANalyzer)

• Tool that captures traffic from a source and directs a copy to a destination interface

– Source: Ethernet port, port channel, inband interface to CPU, VLANs, Fabric port, HIF

– Destination: Ethernet port, port channel

• Need-to-knows

– Identify the capturing points

– Understand the traffic flow(s) being captured

– Be aware of the limitations

– Very useful for data plane issues, packet drops, intermittent problems

• Other Variation

– ERSPAN (Encapsulated Remote Switched Port Analyzer)

Page 27:

EEM (Embedded Event Manager)

• A subsystem to automate tasks and customize the device behavior

– Event

– Notification

– Action

• Many built-in system policies: ‘show event manager system-policy’

• Event notification action

• Helpful in data gathering when the occurrence of the issue is unpredictable

Page 28:

Ethanalyzer

• Built-in sniffer for CPU bound traffic

– ‘capture-filter’ vs. ‘display-filter’

– ‘decode-internal’

– Other options

• Ethanalyzer does not

– Capture data plane traffic forwarded in hardware

– Support interface specific capture

• Ethanalyzer guides

– http://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116136-trouble-ethanalyzer-nexus7000-00.html

– http://www.cisco.com/c/en/us/support/docs/switches/nexus-5000-series-switches/116201-technote-ethanalyzer-00.html

Page 29:

ELAM (Embedded Logic Analyzer Module)

• A tool to capture a packet and determine its forwarding path within the switch

– Powerful and flexible triggering capability

– Module specific

– Available on Nexus 7000 and Nexus 6000

• Need-to-knows

– L2-4 data plane forwarding issues

– Consistent problem

– Not a replacement for capture utilities like Ethanalyzer or SPAN

• ELAM guides

– http://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116648-technote-product-00.html

– http://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116647-technote-product-00.html

Page 30:

Programmability


• Adds control blocks in the CLI execution

• Python

– cli(), clid(), clip()

– Interactive mode

– Noninteractive mode

• TCL

– Tcl8.5, NXOS 5.1(1)

– ‘tclsh bootflash:example.tcl’

• Search for “python API” on cisco.com
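
A minimal sketch of the non-interactive Python approach, assuming the on-box `cli()` helper takes a command string and returns its output as text. The import path and return type differ between platforms and releases, so check the Python API guide for yours. `collect_outputs`, `fake_cli`, and the command list are hypothetical names used only for illustration; the runner is passed in as a parameter so the same logic can be tried off-box with a stand-in.

```python
from typing import Callable, Dict, Iterable


def collect_outputs(run: Callable[[str], str], commands: Iterable[str]) -> Dict[str, str]:
    """Run each command through `run` and return {command: output}.

    On a switch, `run` would be the platform's cli() helper (an assumption;
    verify its import path and return type for your NX-OS release).
    """
    results = {}
    for command in commands:
        try:
            results[command] = run(command)
        except Exception as exc:  # keep collecting even if one command fails
            results[command] = "ERROR: {}".format(exc)
    return results


def fake_cli(command: str) -> str:
    """Off-box stand-in for cli(); returns canned text for illustration only."""
    if command == "show bad-command":
        raise ValueError("invalid command")
    return "output of '{}'".format(command)


outputs = collect_outputs(fake_cli, ["show version", "show bad-command"])
for command, text in outputs.items():
    print(command, "->", text)
```

Collecting into a dictionary keeps each output tied to the command that produced it, which helps when the same set of commands is gathered repeatedly while chasing an intermittent problem.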

Page 31:

Case Studies

Page 32:

vPC

[Figure: vPC loop avoidance logic, showing broadcast (ARP) or unicast traffic, the vPC port-channels (Po 50 on each peer), Po 1, and port 1/5]

• A port-channel can be created from two discrete boxes to a single device

• STP blocked ports are eliminated; all vPC ports are in forwarding state

• Devices verify they are alive via the peer keep-alive interconnect

• The two Nexus devices present the same LACP System-ID to accomplish this

• Devices sync control plane MAC addresses and ARP via the peer-link connection between them

• Loops are prevented: a frame that arrives over the peer-link is not forwarded out a vPC port
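
The loop-avoidance rule can be written as a small predicate. This is a simplified model for illustration, not the hardware implementation; the function name and parameters are hypothetical. The `peer_vpc_leg_up` parameter reflects the commonly documented exception that the check is relaxed for a vPC whose member ports on the peer switch are down.

```python
def may_forward(ingress_is_peer_link: bool,
                egress_is_vpc_member: bool,
                peer_vpc_leg_up: bool = True) -> bool:
    """Simplified model of the vPC loop-avoidance check.

    A frame that arrived over the peer-link is not sent out a vPC member
    port, because the peer switch has already delivered it on its own leg
    of the same vPC. The check is relaxed when the peer's leg is down.
    """
    if ingress_is_peer_link and egress_is_vpc_member and peer_vpc_leg_up:
        return False
    return True


# Walk through the four interesting cases.
print(may_forward(True, True))                          # peer-link -> vPC port
print(may_forward(True, False))                         # peer-link -> orphan port
print(may_forward(False, True))                         # access port -> vPC port
print(may_forward(True, True, peer_vpc_leg_up=False))   # peer's vPC leg is down
```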

Page 33:

OTV


• OTV provides layer 2 connectivity over a layer 3 core

• MAC addresses are learned and communicated between the OTV edge devices

• STP and ARP are suppressed on the OTV overlay

• OTV requires a dedicated VDC

• OTV requires unicast or multicast reachability between all OTV edge devices

• OTV-enabled devices at the same site require a shared broadcast domain to form a site adjacency

• OTV uses IS-IS as the control plane protocol

Page 34:

Case Study: File Transfer Performance

A large entertainment company has moved their workloads to a secondary data center for capacity reasons. The server administrators are reporting slow application performance on most HTTP and MySQL calls.

Page 35:

Troubleshooting

Understand the Problem / Knowledge and Experience / Break Down the Issue

• When is it happening?

• Where is it happening?

• Where is it not happening?

• What protocols are behaving slowly?

• Is there any known congestion?

• What is the traffic flow?

• Have the links been checked for drops?

• Which direction is the transfer?

• What is the src/dst IP?

• Is this issue also occurring locally?

• What interfaces are involved?

• Is this issue occurring only when the interconnect is involved?

• Which devices are not in the diagram?

Page 36:

[Figure: two-site topology. Primary DC and Backup DC each have an N5K-1/N5K-2 vPC pair, N7K-1-CORE/N7K-2-CORE, and N7K-1-OTV/N7K-2-OTV, and are interconnected by OTV. Host labels: 10.0.0.50/24, 10.0.0.51/24, 10.0.0.45/24, 10.0.0.52/24]

Page 37:

Network Level Understanding

• What protocols are behaving slowly?

– SMB/FTP/MySQL

• Is there any congestion in the network?

– No

• What is the traffic flow to the DR site?

– Through primary over OTV interconnect

• Have the links been checked for drops?

– Yes

• Which direction is the bulk transfer?

– Primary site to DR site

• What is the src/dst IP?

– 10.0.0.50 -> 10.0.0.51

[Figure: the two-site topology from Page 36]

Page 38:

Primary DC Packet Flow

[Figure: Primary DC portion of the topology (N5K-1/N5K-2 vPC pair, N7K-1-CORE/N7K-2-CORE, N7K-1-OTV/N7K-2-OTV, host labels 10.0.0.50/24 and 10.0.0.51/24) with the packet path marked as steps 1 to 3]

Page 39:

OTV Packet Flow

[Figure: N7K-1-OTV and N7K-2-OTV at each site, connected across the OTV overlay]

Page 40:

Backup DC Packet Flow

[Figure: Backup DC portion of the topology (N5K-1/N5K-2 vPC pair, N7K-1-CORE/N7K-2-CORE, N7K-1-OTV/N7K-2-OTV, host labels 10.0.0.45/24 and 10.0.0.52/24)]

Page 41:

[Figure: the two-site topology from Page 36 with the end-to-end packet path marked as steps 1 to 5]

Page 42:

[Figure: the two-site topology from Page 36]

Page 43:

[Figure: the two-site topology from Page 36, with ping output shown beside the hosts]

C:\Users\Jerred Horsman>ping -t 10.0.0.51
Pinging 10.0.0.51 with 32 bytes of data:
Reply from 10.0.0.51: bytes=32 time<1ms TTL=255
Reply from 10.0.0.51: bytes=32 time<1ms TTL=255
Reply from 10.0.0.51: bytes=32 time<1ms TTL=255

C:\Users\Jerred Horsman>ping -t 10.0.0.51
Pinging 10.0.0.51 with 32 bytes of data:
Reply from 10.0.0.51: bytes=32 time=2ms TTL=255
Reply from 10.0.0.51: bytes=32 time=3ms TTL=255
Reply from 10.0.0.51: bytes=32 time=2ms TTL=255

C:\Users\Jerred Horsman>ping -t 10.0.0.52
Pinging 10.0.0.52 with 32 bytes of data:
Reply from 10.0.0.52: bytes=32 time<1ms TTL=255
Reply from 10.0.0.52: bytes=32 time<1ms TTL=255
Reply from 10.0.0.52: bytes=32 time<1ms TTL=255

Page 44:

[Figure: the two-site topology from Page 36, with file copy results shown beside the hosts]

[email protected]$ scp [email protected]:/test ./
1 files (704.5 MiB) copies in 9 seconds (70.6 MiB/s )

[email protected]$ scp [email protected]:/test ./
1 files (704.5 MiB) copies in 404 seconds (175.3 KiB/s )

[email protected]$ scp [email protected]:/test ./
1 files (704.5 MiB) copies in 404 seconds (80.3 MiB/s )
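
As a rough cross-check, the average rate can be recomputed from the file size and duration alone. The sketch below is illustrative and the function name is hypothetical. The rates printed in the transcript do not all agree with the size and duration on the same line, so the comparison of durations (9 seconds versus 404 seconds for the same file) is the more dependable signal.

```python
def throughput_mib_per_s(size_mib: float, seconds: float) -> float:
    """Average transfer rate in MiB/s for size_mib transferred in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mib / seconds


SIZE_MIB = 704.5  # file size reported in the transcript

fast = throughput_mib_per_s(SIZE_MIB, 9)     # the 9-second transfer
slow = throughput_mib_per_s(SIZE_MIB, 404)   # the 404-second transfer

print("9 s:   {:.1f} MiB/s".format(fast))
print("404 s: {:.2f} MiB/s".format(slow))
print("slowdown: about {:.0f}x".format(fast / slow))
```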

Page 45:

[Figure: the two-site topology from Page 36, with SPAN configuration shown for N5K-1]

N5K-1(config)# monitor session 1
N5K-1(config-monitor)# source interface eth3/2
N5K-1(config-monitor)# destination interface eth3/1
N5K-1(config-monitor)# no shut
N5K-1(config-monitor)# interface eth3/1
N5K-1(config-if)# switchport monitor

N5K-1(config)# monitor session 1
N5K-1(config-monitor)# source interface eth3/2
N5K-1(config-monitor)# destination interface eth3/1
N5K-1(config-monitor)# no shut
N5K-1(config-monitor)# interface eth3/1
N5K-1(config-if)# switchport monitor

Page 46:

[Figure: the two-site topology from Page 36]

Page 47:

[Figure: the two-site topology from Page 36, with MAC table, port-channel hashing, and OTV route output shown beside the devices]

VLAN      MAC Address       Type     age       Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+--------------
* 5       bbbb.bbbb.bbbb    dynamic  -         F      F    Po1

VLAN      MAC Address       Type     age       Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------
* 5       bbbb.bbbb.bbbb    dynamic  -         F      F    Eth2/3

N5K-2# show port-channel load-balance forwarding-path interface port-channel 1 src-ip 10.0.0.50 dst-ip 10.0.0.51
Missing params will be substituted by 0's.
Load-balance Algorithm on switch: source-dest-ip
crc8_hash: 1 Outgoing port id Ethernet1/32

N7K-2-OTV# show otv route
VLAN MAC-Address    Metric Uptime Owner   Next-Hops
51   bbbb.bbbb.bbbb 42     5d36h  overlay N7K-2-OTV

Page 48:

[Figure: the two-site topology from Page 36, with Ethanalyzer captures shown beside the devices]

N7K-2# ethanalyzer local interface inband capture-filter "host 10.0.0.51" limit-captured-frames 100
2014-03-09 12:11:34.459123 10.0.0.50 -> 10.0.0.51 TCP Datagram
2014-03-09 12:11:34.459123 10.0.0.50 -> 10.0.0.51 TCP Datagram

N7K-2# ethanalyzer local interface inband capture-filter "host 10.0.0.51" limit-captured-frames 100
2014-03-09 12:11:34.459123 10.0.0.50 -> 10.0.0.51 IP Fragment
2014-03-09 12:11:34.459123 10.0.0.50 -> 10.0.0.51 IP Fragment

Page 49:

[Figure: the two-site topology from Page 36, with control-plane policing and overlay interface output shown beside the devices]

N7K-2# show policy-map interface control-plane

class-map class-default (match-any)

police cir 100 kbps bc 250 ms

conform action: transmit

violate action: drop

module 1:

conformed 10508444956 bytes,

violated 9205212314 bytes
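
To put the two counters in proportion, the share of policed bytes can be computed directly from the output. This is a minimal sketch: the text is copied from the output above, the function name is hypothetical, and the regular expressions assume the counter lines keep the `conformed N bytes` / `violated N bytes` form shown here.

```python
import re

COPP_OUTPUT = """\
class-map class-default (match-any)
  police cir 100 kbps bc 250 ms
  conform action: transmit
  violate action: drop
  module 1:
    conformed 10508444956 bytes,
    violated 9205212314 bytes
"""


def violated_fraction(text: str) -> float:
    """Return violated / (conformed + violated) bytes from policer counters."""
    conformed = sum(int(n) for n in re.findall(r"conformed\s+(\d+)\s+bytes", text))
    violated = sum(int(n) for n in re.findall(r"violated\s+(\d+)\s+bytes", text))
    total = conformed + violated
    return violated / total if total else 0.0


fraction = violated_fraction(COPP_OUTPUT)
print("{:.1%} of class-default bytes exceeded the policer".format(fraction))
```

Nearly half of the bytes hitting this class were dropped, which points at traffic reaching the CPU that would normally be forwarded in hardware.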

N7K-2# show interface overlay0

Overlay0 is up

MTU 1400 bytes, BW 1000000 Kbit

Encapsulation OTV

Last link flapped 19:24:17
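
The 1400-byte overlay MTU ties back to the "IP Fragment" entries in the Ethanalyzer capture. The sketch below shows only the arithmetic of IPv4 fragmentation, assuming the hosts send 1500-byte IPv4 packets with a 20-byte header (the transcript does not state the host MTU). The function name is hypothetical, and how a given platform handles oversized packets on an overlay is outside what this models.

```python
def ipv4_fragment_sizes(packet_len: int, mtu: int, header_len: int = 20) -> list:
    """Return the total length of each IPv4 fragment for one packet.

    Every fragment except the last must carry a payload that is a
    multiple of 8 bytes, because the fragment offset is counted in
    8-byte units.
    """
    if packet_len <= mtu:
        return [packet_len]  # fits as-is, no fragmentation needed
    payload = packet_len - header_len
    per_fragment = ((mtu - header_len) // 8) * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any payload")
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(chunk + header_len)  # each fragment repeats the IP header
        payload -= chunk
    return sizes


print(ipv4_fragment_sizes(1500, 1400))  # assumed full-size host packet
print(ipv4_fragment_sizes(1400, 1400))  # packet that already fits
```

Under these assumptions every full-size packet becomes two packets, so a bulk transfer is affected far more than a 32-byte ping, which matches the pattern of normal ping replies alongside a slow file copy.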

Page 50:

The Dos and the Don'ts

Page 51:

The Dos – Troubleshooting

• Understand how things should work

• Identify the broken scenario

• Use solid troubleshooting techniques, start with basics

• Capture valuable information

• Bring all parties to the table

• Ask the right questions

Page 52:

The Dos – Operational

• Stay calm

• Know your network

• Backup

• Documentation

• Network Management

Page 53:

The Don'ts – Troubleshooting

• Jump to conclusions

• Take drastic measures

– 'let's bounce the datacenter'

– 'we are reloading the switches one at a time'

• Lump all issues together

Page 54:

The Don'ts – Operational

• Make multiple changes at once

• Combine the status update and the technical call in one

Page 55:

The TAC secret ingredients

• Understanding the problem

• Troubleshoot

– Break down the issue

– Apply knowledge

– Identify possible causes

– Test the most probable cause

• Confirm the root cause

Page 56:

Q&A
