System Crashes Partner Training

Understanding system crashes and collecting appropriate data Nitin Vig - JTAC (ERX)

Transcript of System Crashes Partner Training

Page 1: System Crashes Partner Training

Understanding system crashes and collecting appropriate data

Nitin Vig - JTAC (ERX)

Page 2: System Crashes Partner Training

Target Audience

This training is intended for engineers who support the E-series platform in the field.

Basic familiarity with the product and good network troubleshooting skills are expected.

Page 3: System Crashes Partner Training

Agenda

Why does an E-series router crash?

What happens after a crash?

What can I do once it crashes?

What information does JTAC need?

Page 4: System Crashes Partner Training

Why does the router crash?

– Crashes can happen on the SRP or the line module.
– They can be caused by a software or a hardware problem.
– Crashes can be a good thing.

Page 5: System Crashes Partner Training

Why does the router crash?

A crash happens when:

– We do not know what to do:
  – The router received a packet that the software does not know how to handle.
  – A misbehaving piece of hardware causes undesirable operation.

– Someone does not listen to us:
  – An application is in a deadlock and does not release resources.
  – A line module does not respond to the SRP.

Page 6: System Crashes Partner Training

Types of crashes

– Software panics
  – The software is not designed to handle a specific condition.

– Processor exceptions
  – The processor on the SRP or LM hits a violation while processing data.

– Detector crashes
  – A detection and recovery mechanism implemented to address forwarding fault conditions.

Page 7: System Crashes Partner Training

Software panics (example)

time of reset: THU NOV 01 2007 00:57:26 CDT
run state: primary
image type: application
location: slot (6)
build date: 0x46392b64 THU MAY 03 2007 00:23:00 UTC
reset type: panic
task: cliActor
file: osSemaphore.cc
line: 153
arg: 38516744
last errno: 0x3d0001
pc: 0x1c480f74: fatalPanic__Fv +0x8
lr: 0x1c532eec: take__11OsSemaphore +0x1c0
<output truncated>

– SRP crash seen when clearing an improperly terminated SSH session using the ‘clear line vty’ command from another SSH session.

– The fix involved changes to the SSH application behavior when clearing a VTY session.

– KB 31413

Page 8: System Crashes Partner Training

Software panics (example)

time of reset: Thu Aug 2 01:17:18 2007
run state: unknown (0)
image type: application
image id: 0x464c3086
build date: 0x464c3086 Thu May 17 2007 10:37:58 GMT
location: internal slot (1), processor 0, boardId 0x33, boardRev 0x3
reset type: panic
file: dhcpDemux.cc
line: 221
task: scheduler
last errno: 0x30065
pc: 0x3a1f5c -> fatalPanic(void) offset: 0x8
lr: 0x709c4 -> DhcpDemux::receivePacket(Uid, Uid, OsBuffer &, unsigned char)
<output truncated>

– Crash noticed on line modules running the DHCP Relay proxy application after an SRP failover.

– The DHCP application on the LM received a DHCP packet before it was ready.

– The fix involved discarding DHCP packets until the DHCP application is ready.

– KB 29531

Page 9: System Crashes Partner Training

Processor Exceptions (example)

time of reset: Tue Mar 19 18:35:45 200
location: slot 9 (a), processor 0, boardId 0x19, boardRev 0x0
image type: application
build date: 0x3c7608b1 (Fri Feb 22 09:00:33 2002)
reset type: processor exception 0x200 (machine check)
task: icc
pc: 0x37f2ccc -> memPartAlignedAlloc offset: 0x108
lr: 0x37f2c9c -> memPartAlignedAlloc offset: 0xd8
dar: 0x00000000 cr: 0x24000080 xer: 0x00000000 fpcsr: 0x0369beec
srr1: 0x0010b030 dsisr: 0x00000000 ctr: 0x00000000
<output truncated>

– SRP crash due to an L2 cache memory parity error.

– The SRP CPU encounters a parity error when reading data from the L2 data cache.

– No software fix is available for the problem. Historically the crash never recurs on the same SRP.

– KB 2443

Page 10: System Crashes Partner Training

Processor Exceptions (example)

time of reset: Thu Aug 23 20:48:09 2007
run state: primary
image type: application
image id: 0x462e5408
build date: 0x462e5408 Tue Apr 24 2007 19:01:28 GMT
location: internal slot (7), processor 0, boardId 0x3e, boardRev 0
reset type: processor exception 0x300 (data access: protection violation (read attempt))
task: IpSubscriberMana
pc: 0x4d0b8724 -> OsPooledObject::operator delete(void *) offset: 0x74
lr: 0x490518f4 -> IpSubscriberMgr::SubscriberEntry::~SubscriberEntry(void) offset: 0x24
<output truncated>

– SRP crash due to a data access violation.

– The IPSM application wrote data to an incorrect location. The SRP CPU does not expect that data in that location, so it crashes while trying to read from it.

– The IPSM application was fixed to ensure that it does not write to the invalid location.

– KB 29909

Page 11: System Crashes Partner Training

Detector crashes

– Internal mechanism developed by Juniper to detect and recover from forwarding faults.

– The objective is to minimize the forwarding impact on the router.

– The system initially tries to recover from the fault without any external impact. If that is not possible, a crash is performed.

– It also records information about these faults for troubleshooting purposes.

– KB 16800: Enhanced PFTE support on E-series

Page 12: System Crashes Partner Training

Detector crashes

– 2 basic mechanisms:
  – Run by the SRP
    – Commonly known as “PIMTE” (Ping/ICC Monitoring Threshold Exceeded) or “PFTE” (Ping Failure Threshold Exceeded)
    – Frequently polls the line modules to check their health (aka “ping”)
    – Thresholds are defined for applications interacting between the SRP and LM
    – If thresholds are exceeded, the SRP decides what action should be taken.
    – Additional information is written to a file with extension “.tsa”
    – TSA file generation does not always mean there was a crash!
    – These crashes have a generic crash signature.
    – The crash can occur on the standby SRP or a line module
    – Reboot.hty, coredump and TSA file (if present) are required in each case to analyse the root cause.

Page 13: System Crashes Partner Training

Detector crashes: PIMTE (example)

time of reset: Fri Nov 10 00:13:00 2006
run state: unknown (0)
image type: application
image id: 0x4550dd8b
build date: 0x4550dd8b Tue Nov 07 2006 19:24:59 GMT
location: internal slot (4), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ping/ICC monitoring threshold exceeded"
file: ontrolNetwork.cc
line: 775
task: scheduler
last errno: 0x110001
pc: 0x9e4228 -> fatalPanic(void) offset: 0x8
lr: 0x122088 -> Hw2Ic1ControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &, CbusReplyDoneAction *&) offset: 0x1308
<output truncated>

time of reset: Tue Aug 29 11:21:41 2006
run state: standby
image type: application
image id: 0x44ed842e
build date: 0x44ed842e Thu Aug 24 2006 10:49:18 Eastern Standard Time
location: internal slot (9), processor 0, boardId 0x3e, boardRev 0
reset type: unknown software error signature (0xfadead4), msg "Ping/ICC monitoring threshold exceeded"
file: ontrolNetwork.cc
line: 752
task: cbusSlave
last errno: 0x380003
pc: 0x4cc98f10 -> fatalPanic(void) offset: 0x8
lr: 0x4ce26e7c -> Hw1SrpSlaveControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &) offset: 0x1548
<output truncated>

Page 14: System Crashes Partner Training

Detector crashes: PFTE (example)

time of reset: Mon Apr 24 12:44:33 2006
run state: unknown (0)
image type: boot
image id: 0x4390bdc4
build date: 0x4390bdc4 Fri Dec 02 2005 21:33:56 GMT
location: internal slot (5), processor 0, boardId 0x33, boardRev 0x3
reset type: panic, msg "ping failure threshold exceeded"
file: ontrolNetwork.cc
line: 1182
task: scheduler
last errno: 0
pc: 0x19235d4 -> fatalPanic(void) offset: 0x8
lr: 0x1999d90 -> Hw1Ic1ControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &, CbusReplyDoneAction *&) offset: 0xbf8
<output truncated>

Page 15: System Crashes Partner Training

Detector crashes

– 2 basic mechanisms:
  – Run by the line module
    – Commonly known as “ic1Detector” crashes
    – Various components on the line module inform the IC (line module CPU) about any forwarding faults.
    – Based on the severity of the problem, the line module decides what action should be taken.
    – The IC (line module CPU) initially attempts to recover the particular component.
    – If it cannot be recovered, or if the problem recurs, a crash is forced.
    – The “ic1Detector” crashes are seen only on line modules.
    – They have a generic crash signature.
    – Reboot.hty and coredump are required in each case to analyse the root cause.

Page 16: System Crashes Partner Training

Detector crashes: ic1Detector (example)

time of reset: Mon Aug 28 14:55:44 2006
run state: unknown (0)
image type: application
image id: 0x44a294a3
build date: 0x44a294a3 Wed Jun 28 2006 14:39:31 GMT
location: internal slot (5), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ic1Detector::requestRecovery executed forced IC crash"
file: ic1Detector.cc
line: 718
task: scheduler
last errno: 0x110001
pc: 0x9b528c -> fatalPanic(void) offset: 0x8
lr: 0x1286544 -> Ic1Detector::requestRecovery(Ic1Detector *, unsigned long, int, unsigned short, bool, bool, char const *, bool) offset: 0xa1c
<output truncated>

time of reset: Tue Jan 22 11:32:39 2008
run state: unknown (0)
image type: application
image id: 0x46d571bf
build date: 0x46d571bf Wed Aug 29 2007 13:16:47 GMT
location: internal slot (15), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ic1Detector::requestRecovery executed forced IC crash"
file: ic1Detector.cc
line: 738
task: scheduler
last errno: 0x110001
pc: 0xa6b740 -> fatalPanic(void) offset: 0x8
lr: 0x147dc24 -> Ic1Detector::requestRecovery(Ic1Detector *, unsigned long, int, unsigned short, bool, bool, char const *, bool) offset: 0xa54
<output truncated>
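The three crash types covered so far can be told apart from the "reset type" field alone. The following is a minimal sketch of that triage, assuming only the message strings that appear in the example records in this training. The function name classify_crash and the category labels are illustrative, and real records may carry reset types not covered here.

```python
def classify_crash(reset_type: str) -> str:
    """Roughly classify a crash from the 'reset type' value of a crash record.

    The marker strings are taken from the example records in this training;
    the function name and category labels are illustrative assumptions.
    """
    text = reset_type.lower()

    # Detector crashes are reported as panics (or unknown signatures) with a
    # generic message, so check the message markers before the prefix.
    detector_markers = (
        "ping/icc monitoring threshold exceeded",  # PIMTE
        "ping failure threshold exceeded",         # PFTE
        "ic1detector",                             # line module detector
    )
    if any(marker in text for marker in detector_markers):
        return "detector"
    if text.startswith("processor exception"):
        return "processor exception"
    if text.startswith("panic"):
        return "software panic"
    return "other"


print(classify_crash("processor exception 0x200 (machine check)"))
print(classify_crash('panic, msg "ping failure threshold exceeded"'))
```

A "detector" result only identifies the mechanism that forced the crash. Because the signature is generic, the root cause still has to come from the coredump and any TSA file.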

Page 17: System Crashes Partner Training

Agenda

Why does an E-series router crash?

What happens after a crash?

What can I do once it crashes?

What information does JTAC need?

Page 18: System Crashes Partner Training

What happens after a crash?

– After an SRP or line module crashes, it goes through the boot process.

– During this time it writes entries to the reboot.hty file and generates a coredump (if enabled).

– What is the reboot.hty file? What is a coredump?

Page 19: System Crashes Partner Training

What is the reboot.hty file?

– Reboot.hty is a file maintained on the SRP’s flash which keeps a history of the reboots that happen on the router.

– These include regular reboots performed by the user and unexpected crashes that may happen on the router.

– Each SRP maintains its own copy of the reboot.hty file. This file is NOT synchronized.

– Each SRP keeps a record of its own reboots.

– Additionally, when a line module reboots, its entries are written to the primary SRP’s reboot.hty.

Page 20: System Crashes Partner Training

What is the reboot.hty file?

How do I view the contents of the reboot.hty file?

– Primary SRP
  – Use the ‘show reboot-history’ command
– Standby SRP
  – Make a copy of the standby SRP’s reboot.hty file on the primary SRP’s flash:
    copy standby:reboot.hty <filename>.hty
  – Use the ‘show reboot-history <filename>.hty’ command

How do I copy the reboot.hty file to an FTP server?

– Primary SRP:
    copy reboot.hty <FTPserver>:<path>/<filename>.hty
– Standby SRP:
    copy standby:reboot.hty <FTPserver>:<path>/<filename>.hty

Page 21: System Crashes Partner Training

What is the reboot.hty file?

What is the format of a crash record in the reboot.hty file?

time of reset: Thu Aug 23 20:48:09 2007
run state: primary
image type: application
image id: 0x462e5408
build date: 0x462e5408 Tue Apr 24 2007 19:01:28 GMT
location: internal slot (7), processor 0, boardId 0x3e, boardRev 0
reset type: processor exception 0x300 (data access: protection violation (read attempt))
task: IpSubscriberMana
pc: 0x4d0b8724 -> OsPooledObject::operator delete(void *) offset: 0x74
lr: 0x490518f4 -> IpSubscriberMgr::SubscriberEntry::~SubscriberEntry(void) offset: 0x24
<output truncated>

– time of reset: Identifies the time the reset took place.

– run state: Relevant only for the SRP. Identifies whether the SRP was in ‘primary’ or ‘standby’ state when the reset occurred. Set to ‘unknown’ for line modules.

– image type: Identifies the type of image the SRP or line module was running when it reloaded. This can be the boot, diag or application image.

– image id: Internal ID used by Juniper to identify the release.

– build date: Identifies the date when the release on this router was built.
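In the sample records in this training, the hexadecimal value in the "build date" field (and the matching "image id") decodes to the printed date when read as seconds since the Unix epoch. This is an observation from the examples, not a documented format, so treat the sketch below as a cross-check only. The function name decode_build_date is illustrative.

```python
from datetime import datetime, timezone


def decode_build_date(hex_value: str) -> datetime:
    """Interpret a build-date hex value as Unix epoch seconds (UTC).

    Assumption: the hex value is an epoch timestamp. This holds for the
    example records in this training but is not a documented guarantee.
    """
    seconds = int(hex_value, 16)  # int() accepts the leading "0x" with base 16
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Sample record: build date: 0x462e5408 Tue Apr 24 2007 19:01:28 GMT
print(decode_build_date("0x462e5408").isoformat())
# prints 2007-04-24T19:01:28+00:00
```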

Page 22: System Crashes Partner Training

What is the reboot.hty file?

What is the format of a crash record in the reboot.hty file?

– location:
  – Identifies the slot in which the SRP or line module was present when it reloaded.
  – The physical slot in which the SRP or line module resides differs from the “internal slot” reported in the reboot.hty file.
  – Mapping of the internal slot to the physical slot:
    – ERX:

      Physical Slot    Internal Slot
      0                0
      1                1
      2                2
      3                3
      4                4
      5                5
      6                7
      7                9
      8                10
      9                11
      10               12
      11               13
      12               14
      13               15

    – E320: KB 26791: E320 Internal Slot Numbering
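When reading many crash records, translating the internal slot back to the physical slot by hand is error-prone. Below is a minimal lookup sketch for the ERX mapping above. The table values are transcribed from this slide, the names are illustrative, and the E320 is deliberately not covered because its numbering is described in KB 26791 rather than here.

```python
# ERX physical slot -> internal slot, transcribed from the table on this slide.
ERX_PHYSICAL_TO_INTERNAL = {
    0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5,
    6: 7, 7: 9, 8: 10, 9: 11, 10: 12, 11: 13, 12: 14, 13: 15,
}

# Inverted view for reading reboot.hty records, which report the internal slot.
ERX_INTERNAL_TO_PHYSICAL = {
    internal: physical for physical, internal in ERX_PHYSICAL_TO_INTERNAL.items()
}


def erx_physical_slot(internal_slot: int) -> int:
    """Return the ERX physical slot for an internal slot from a crash record."""
    try:
        return ERX_INTERNAL_TO_PHYSICAL[internal_slot]
    except KeyError:
        raise ValueError(f"no ERX physical slot maps to internal slot {internal_slot}")


# "location: internal slot (7)" from the sample record is physical slot 6.
print(erx_physical_slot(7))
```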

Page 23: System Crashes Partner Training

What is the reboot.hty file?

What is the format of a crash record in the reboot.hty file?

– reset type: Identifies the type of reset. There can be quite a few different reset types, such as ‘processor exception’, ‘panic’ and ‘user reboot’.

– task: Identifies the task that was running on the SRP or line module when it was reset. On the line module this will always be set to ‘scheduler’, as that is the only task that runs on the line module.
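Because each field of a crash record is a "name: value" line, a record can be split into fields with very little code. The sketch below assumes the record has been saved as text with one field per line, as in the examples in this training. The function name parse_crash_record is illustrative and is not part of any Juniper tool.

```python
def parse_crash_record(text: str) -> dict[str, str]:
    """Split a crash record into a field-name -> value dictionary.

    Assumes one "name: value" field per line. Lines without a colon, such
    as "<output truncated>", are skipped. A wrapped continuation line that
    happens to contain a colon would be misread, so check the input first.
    """
    record = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as the pc/lr symbols
        # contain further colons ("::", "offset:").
        name, separator, value = line.partition(":")
        if not separator:
            continue
        record[name.strip()] = value.strip()
    return record


sample = """time of reset: Thu Aug 23 20:48:09 2007
run state: primary
reset type: processor exception 0x300 (data access: protection violation (read attempt))
task: IpSubscriberMana
pc: 0x4d0b8724 -> OsPooledObject::operator delete(void *) offset: 0x74
<output truncated>"""

fields = parse_crash_record(sample)
print(fields["run state"])
print(fields["task"])
```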

Page 24: System Crashes Partner Training

What is the core dump file?

– A core dump is a snapshot of the memory at the time the crash occurred.

– It is an important tool for JTAC and engineering teams to identify the root cause of a crash.

– The size of the core dump varies based on the amount of memory on the SRP or line module.

– The core dump is generated during the boot process after the crash.

Page 25: System Crashes Partner Training

What is the core dump file?

How do I check if a core dump was generated?
– Coredumps are stored on the SRP’s flash
– Use the ‘dir’ output to check

Can core dumps be disabled?
– Yes.
– Disabling line card coredumps:
    E120(config)#exception dump srp-only
– Disabling SRP coredumps:
    E120(config)#exception dump except-srp

How do I check if core dumps are enabled?
– Use the ‘show exception dump’ command

How do I copy a core dump to an FTP server?
    copy <filename>.dmp <FTPserver>:<path>/<filename>.dmp

Page 26: System Crashes Partner Training

What is the core dump file?

Why do customers sometimes disable core dumps?
– A coredump takes a few minutes to complete
– This adds to the boot time of the SRP/line module

Should I be disabling them?
– It is not recommended to disable core dumps
– In certain cases, when repetitive crashes are seen on a router, this may be considered in consultation with JTAC.

Why does a crash sometimes not generate a core dump?
– Some of the possible causes are:
  – Not enough space on flash to store the core dump
  – Core dumps were disabled
  – Power glitches
  – Conditions where the SRP/line module could not take a dump of the memory

Page 27: System Crashes Partner Training

Agenda

Why does an E-series router crash?

What happens after a crash?

What can I do once it crashes?

What information does JTAC need?

Page 28: System Crashes Partner Training

What can I do once it crashes?

Ensure services are restored

– A brief checklist:
  – Are the SRP and line modules in online state?
  – Did all the routing protocols converge?
  – Have the subscribers started reconnecting?
  – Are the traffic levels restored?
  – Is the CPU utilization normal after some time?

Page 29: System Crashes Partner Training

What can I do once it crashes?

Assess the impact of the crash

– A brief checklist:
  – What crashed?
  – Is the SRP/line module stable after the crash?
  – Did any customer applications suffer an impact?
  – How many subscribers were impacted? For how long?
  – For an SRP crash:
    – Was it the primary or standby SRP?
    – Was high availability enabled?
    – Did the standby SRP take over?
  – For a line module crash:
    – Was the line module in a redundancy group?
    – If yes, did the redundant line module take over?
    – Was it subscriber-facing or core-facing?

Page 30: System Crashes Partner Training

What can I do once it crashes?

Research the cause of the crash

– Ask yourself (or your customer):
  – Any recent changes to configuration?
  – Any recent changes to the load on the router?
  – Any recent changes in the network?

– Search the knowledge base
  – All defects found at a customer site have a knowledge base article associated with them.
  – Use the knowledge base effectively
  – If you find a match, always double-check with JTAC

– Contact JTAC
  – It is recommended to contact JTAC whenever there is a crash on the router

Page 31: System Crashes Partner Training

Tips on searching the KB

Some tips on searching the knowledge base:

– The crash record in the reboot.hty is a good place to start
– Look for information in the stack trace which seems unique
  – Filenames, line numbers, reset type
– Search for these in the knowledge base
– Remember:
  – Some crashes have very generic crash records (e.g. detector crashes)
    – In such cases, a match in the KB does NOT necessarily mean you are hitting the same problem
  – Some crash records may match closely but not exactly
    – In some cases this may be the same problem showing up in a different form
    – In some cases, it may be an entirely different issue
  – Read the problem description and solution fields carefully
    – Sometimes these are good pointers to confirm whether you are hitting the same issue
– When in doubt, consult JTAC

http://www.juniper.net/kb
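
The tip about picking unique details out of a stack trace can be sketched in Python as below. This is only an illustration under stated assumptions: the slides do not show the layout of a crash record, so the sketch assumes that source filenames end in a C/C++-style extension and that a line number follows as ":<number>" or " line <number>". The function name kb_search_terms is hypothetical, and the reset type is not extracted because its format is not given here.

```python
import re

# Assumption (not from the slides): filenames in a stack trace end in a
# C/C++-style extension, optionally followed by ":<line>" or " line <line>".
# Adjust the pattern to whatever the real crash record looks like.
FILE_LINE = re.compile(
    r"\b([\w.-]+\.(?:cpp|cc|c|hpp|h))\b(?:(?::|\s+line\s+)(\d+))?",
    re.IGNORECASE,
)

def kb_search_terms(crash_record):
    """Return unique 'filename [line]' strings, in order of first appearance."""
    terms = []
    for match in FILE_LINE.finditer(crash_record):
        name, line = match.group(1), match.group(2)
        term = f"{name} {line}" if line else name
        if term not in terms:
            terms.append(term)
    return terms

if __name__ == "__main__":
    # Made-up text for demonstration only; this is NOT real reboot.hty output.
    sample = "failed at exampleFile.cc:123\ncalled from otherFile.cpp line 45"
    for term in kb_search_terms(sample):
        print(term)
```

Each returned string is a candidate search phrase for the knowledge base. A hit on one of them is a starting point, not confirmation; the cautions in the list above still apply.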

Page 32: System Crashes Partner Training

Agenda

Why does an E-series router crash?

What happens after a crash?

What can I do once it crashes?

What information does JTAC need?

Page 33: System Crashes Partner Training

What information does JTAC need?

Files to be collected

– Collect the following files from the router:
  – reboot.hty
    – It is usually a good idea to collect the file from both primary and standby SRPs
  – Core dumps (if any)
  – Files with extension “.tsa” (if any)
  – system.log file
  – Copy of the router configuration in CNF and SCR format
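
Before opening a case, the file list above can be checked against what has actually been copied off the router. The Python sketch below is one way to do that and is not a JTAC tool. It assumes the configuration copies carry ".cnf" and ".scr" extensions, and it leaves core dumps out because the slide gives no filename pattern for them. The function name check_collected is hypothetical.

```python
from fnmatch import fnmatch

# Patterns come from the slide. "*.cnf" and "*.scr" are assumptions about
# how the CNF and SCR configuration copies are named.
REQUIRED = {
    "reboot history": "reboot.hty",
    "system log": "system.log",
    "CNF configuration": "*.cnf",
    "SCR configuration": "*.scr",
}
# The slide marks these as "if any", so their absence is not an error.
OPTIONAL = {"tsa files": "*.tsa"}

def check_collected(filenames):
    """Return (missing required items, optional items that were found)."""
    names = [name.lower() for name in filenames]
    missing = [label for label, pattern in REQUIRED.items()
               if not any(fnmatch(name, pattern) for name in names)]
    found_optional = [label for label, pattern in OPTIONAL.items()
                      if any(fnmatch(name, pattern) for name in names)]
    return missing, found_optional

if __name__ == "__main__":
    # Hypothetical filenames, for demonstration only.
    missing, optional = check_collected(["reboot.hty", "system.log", "backup.cnf"])
    print("Missing:", missing)
    print("Optional found:", optional)
```

Remember that reboot.hty is best collected from both the primary and the standby SRP; the sketch only checks that at least one copy is present.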

Page 34: System Crashes Partner Training

What information does JTAC need?

Outputs

- Collect the following outputs:
    sh version
    sh hardware
    sh env all
    dir
    sh log data nv-file
    sh log data severity debug
    sh redundancy
- Depending upon the problem, JTAC may require other outputs.
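
If the outputs are captured into a terminal session log, a short script can confirm that none of the commands above was skipped. The Python sketch below assumes the CLI prompt ends in "#" or ">" and that each command appears on its own line exactly as typed on the slide; both are assumptions, and the function name missing_outputs is hypothetical.

```python
import re

# Commands exactly as listed on the slide.
COMMANDS = [
    "sh version",
    "sh hardware",
    "sh env all",
    "dir",
    "sh log data nv-file",
    "sh log data severity debug",
    "sh redundancy",
]

def missing_outputs(session_text):
    """Return the slide's commands that never appear in the session log."""
    entered = set()
    for line in session_text.splitlines():
        # Assumption: the prompt ends in '#' or '>', so the text after the
        # last such character is what was typed on that line.
        entered.add(re.split(r"[#>]", line)[-1].strip())
    return [cmd for cmd in COMMANDS if cmd not in entered]

if __name__ == "__main__":
    # Hypothetical capture, for demonstration only.
    capture = "host1#sh version\nsome output\nhost1#dir\n"
    for cmd in missing_outputs(capture):
        print("Not yet collected:", cmd)
```

Abbreviated or differently spelled commands (for example "show version") would be reported as missing, so the list should be adjusted to match how the commands were really entered.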

Page 35: System Crashes Partner Training

What information does JTAC need?

Other information

– The following information is very useful to JTAC when troubleshooting crash cases:
  - Services deployed on the router
  - Logical diagram of the network
  - Information about devices connected to the router
  - Number of subscribers connected to the router
  - The amount of traffic (in Mbps) on the router
  - Information about any changes to the router configuration or deployment scenario
  - Information about any changes in the external network

Page 36: System Crashes Partner Training

What information does JTAC need?

When all else fails

– Some crashes are elusive
  – Crashes are seen on customer routers, but the core dumps do not provide enough information
  – And the crashes are not reproducible in JTAC labs worldwide
– In such cases there may be a need to collect additional data from the customer’s router
– Some of the techniques used in the past:
  • Installing a debug image on the customer router
  • Enabling memory debugging
  • Enabling assertions
– This is required in special cases only, and JTAC will provide all necessary information in such cases.

Page 37: System Crashes Partner Training

Summary

Crashes can be a good thing

Assess the severity of the crash and its impact on the network

Work closely with JTAC to analyze the root cause

A good understanding of E-series behavior helps build customer confidence

Page 38: System Crashes Partner Training

Questions?

Page 39: System Crashes Partner Training

Copyright © 2006 Juniper Networks, Inc. Proprietary and Confidential www.juniper.net 39