© 2011 IBM Corporation Toolkit for Event Analysis and Logging Education Dec 2011.


Toolkit for Event Analysis and Logging

Education

Dec 2011


Contents

Overview

Locations

Commands

Alerts and Connectors

Debug

References


Overview

Common HPC Event Analysis Framework
– Combines the best aspects of, and lessons learned from, BlueGene ELA and Federation ELA
– Addresses new p7 IH requirements

Common Event Repository
– First release: CNM, Service Focal Point (HMC), PNSD, LL, GPFS (coming soon)

Analysis of Events to create Alerts
– Rules-based engine
– Flexible alert delivery, for example RMC and e-mail

Real-time Analysis and Historic Analysis
– Real-time analysis to be proactive and react immediately to events
– Historic analysis allows for deeper debug on-site and off-site

Robust framework to prevent loss of alerts and events
– Handles event flooding
– Checkpoint/Shutdown/Restart

Open Source (pyteal.sourceforge.net)
– Uses ODBC
– Python, C/C++, and Perl

RAS Strategy

[Diagram: RAS strategy cycle with TEAL at the center.
Stages: Detect, Centralize, Analyze, Alert, Correct, and a Find / Resolve / Refine loop.
Detect: monitors, observation. Centralize: database, event adapters. Analyze: generic analysis, custom analysis, rules. Alert: generic filters and listeners, custom (query, e-mail, RMC).
Other box labels: auto-recovery, custom, recommended actions, manual analysis ("Shouldn't be manual?"), data collection (as enabled), data mining queries, historical analysis, get data, debug, analyze behavior, release new rules, fix framework, maintenance package escape.
Grayed-out boxes are future possibilities.]

TEAL Concepts

[Diagram labels: Connector (for example CNM), Event Log (table in xCAT DB), semaphore, Monitor, Event, Analyzer, Alert, Filters, Listeners, Alert Log (table in xCAT DB). teal.conf is shown against several of the pipeline components.]

P7-IH Usage

Output is to an alert database
– Monitored by the administrator and operators
– Various methods of monitoring will be described
– Commands are used to query the database

Primary users are the administrator and operator

Runs on the EMS
– Commands are issued via the EMS command line
– SSRs may run commands under engineering direction

Event database may be collected to work on new analysis algorithms or bugs

P7-IH Implementation

[Diagram labels: CNM, SFP, GPFS, LL, and PNSD around TEAL; Event Log (table in cluster DB); Alert Log (table in cluster DB); HMC(s); Systems; Admin, Operator. Arrow labels: "SFP to TEAL", "Analyzed Events", "Network Events to SFP", "Store Events", "Customer Notify (e-mail, RMC, query)".]

Locations

Points to a specific event location

Can be physical, logical, or a mixture of both

Is hierarchical in nature
– Simple: one type of item per level
– Complex: multiple types of items per level

Operations
– Scoping
– Validation
– Casting (platform specific)

XML-based description
– /opt/teal/data/ibm/teal/xml/percs_location.xml
– Can use it to remind yourself of the location formats

Location Code Examples

Simple
– Hierarchy innate in description
– Example: <node>-<program>-<pid>
    comp01-firefox-1234
    comp01-vncserver-4567

Complex
– Compact ID
– Optional instance values
– Example: H:FR008-CG03-SN000-DR0
    [Hierarchy diagram, top to bottom: FR, CG, SN, DR, HB, then LL / OM / HF, then LR / LD / RM]
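The scoping operation on a complex location can be illustrated with a short Python sketch. This is not TEAL's own implementation: it assumes, from the example above, that a location is a type letter, a colon, and dash-separated components, each an uppercase level ID followed by digits. The names `split_location` and `scope_to` are hypothetical, and the sketch scopes by level ID (such as `HB`) rather than by the scope names that the TEAL commands accept.

```python
import re

# Assumption: each component is an uppercase level ID followed by digits (FR008, DR0).
_COMPONENT = re.compile(r"^([A-Z]+)([0-9]+)$")


def split_location(location):
    """Split 'H:FR008-CG03' into ('H', [('FR', '008'), ('CG', '03')])."""
    loc_type, sep, path = location.partition(":")
    if not sep or not loc_type:
        raise ValueError(f"missing location type prefix: {location!r}")
    levels = []
    for component in path.split("-"):
        match = _COMPONENT.match(component)
        if match is None:
            raise ValueError(f"unrecognized component {component!r} in {location!r}")
        levels.append((match.group(1), match.group(2)))
    return loc_type, levels


def scope_to(location, level_id):
    """Truncate a location at the first component whose level ID is level_id."""
    loc_type, levels = split_location(location)
    kept = []
    for ident, instance in levels:
        kept.append(ident + instance)
        if ident == level_id:
            return loc_type + ":" + "-".join(kept)
    raise ValueError(f"level {level_id!r} not present in {location!r}")


if __name__ == "__main__":
    print(scope_to("H:FR008-CG03-SN000-DR0-HB1-OM27-LR22", "HB"))
```

Scoping two link locations to the same level in this way gives equal strings when both links sit under the same hub or drawer, which is the kind of grouping a scoped query relies on.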

P7-IH Locations

Application
– A:c250mgrs20-pvt.ppd.pok.ibm.com##teal.py##28327
– Expect this from PNSD and GPFS (applications in general)

Job
– J:z25c4s9.ppd.pok.ibm.com.1.3
– Expect this from LoadLeveler

Hardware (aka logical hardware)
– H:FR008-CG03-SN000-DR0-HB1-OM27-LR22
– Expect this from ISNM

pSeries (aka service/physical)
– P:U9125.F2C.0286C66
– Expect this from SFP
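The four location types can be told apart by their one-letter prefix. The following Python sketch shows one way a script might classify a location string taken from TEAL output. The mapping comes from the slide; the function name `describe_location` and the idea of splitting application locations on `##` are assumptions made for illustration.

```python
# Type letter -> (kind, component the slide says to expect it from).
LOCATION_TYPES = {
    "A": ("Application", "PNSD and GPFS"),
    "J": ("Job", "LoadLeveler"),
    "H": ("Hardware (logical)", "ISNM"),
    "P": ("pSeries (service/physical)", "SFP"),
}


def describe_location(location):
    """Return a dict describing a 'type:value' location string."""
    loc_type, sep, value = location.partition(":")
    if not sep or loc_type not in LOCATION_TYPES:
        raise ValueError(f"unknown location type in {location!r}")
    kind, source = LOCATION_TYPES[loc_type]
    result = {"type": loc_type, "kind": kind, "expected_source": source, "value": value}
    if loc_type == "A":
        # Assumption: application locations separate their parts with '##'.
        result["parts"] = value.split("##")
    return result


if __name__ == "__main__":
    info = describe_location("A:c250mgrs20-pvt.ppd.pok.ibm.com##teal.py##28327")
    print(info["kind"], info["parts"])
```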


Commands

TEAL EMS Command Line (/opt/teal/bin)

teal          runtime and historic modes
tllsevent     list events
tlrmevent     prune events from the event log
tllsalert     list alerts
tlchalert     change the state of an alert
tlrmalert     prune alerts from the alert log
tllschkpt     list checkpoints
tltab (sbin)  database table maintenance

Managing Alerts

Closing Alerts
1. tllsalert
2. tlchalert --id 1543 --state close

Querying Alerts
• tllsalert -q "creation_time>2010-12-30 creation_time<2011-02-01"
• tllsalert -q "event_loc=P" -f text
• tllsalert -q "event_loc=H:FR007-CG03-SN016-DR0-HB0 event_scope=hub"
• tllsalert --with-assoc -f text

Output options: csv, json, text, "brief"

Removing Alerts
• tlrmalert --older-than 2011-01-01-12:00:00

Can only remove alerts that are
• closed
• not a duplicate

Can take a long time

Managing Events

Listing events
• tllsevent
• tllsevent -q "src_loc=H:FR007-CG03-SN016-DR0-HB0 src_scope=hub"
• tllsevent -e
• tllsevent -q "time_logged=2011-04"

Removing Events
• tlrmevent --older-than 2011-01-01-12:00:00

Only removes events not associated with:
• an alert
• a checkpoint

Cleaning Out the DB

1. Close (by resolving) any active alerts (tlchalert)

2. Remove all closed alerts (tlrmalert --older-than)

3. Remove all events not associated with an alert (tlrmevent --older-than)
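The three clean-up steps can be scripted. The Python sketch below only builds the command lines, in the order given, using the options shown on the earlier slides (`--id`, `--state close`, `--older-than`); it does not run them. The function name `cleanup_commands` and its parameters are hypothetical, and the alert IDs to close are expected to come from a prior `tllsalert` listing.

```python
import shlex


def cleanup_commands(active_alert_ids, older_than, bin_dir="/opt/teal/bin"):
    """Return argv lists for: close alerts, prune alerts, prune events."""
    commands = [
        [f"{bin_dir}/tlchalert", "--id", str(alert_id), "--state", "close"]
        for alert_id in active_alert_ids
    ]
    # Alerts must be pruned before events, because events that are still
    # associated with an alert cannot be removed.
    commands.append([f"{bin_dir}/tlrmalert", "--older-than", older_than])
    commands.append([f"{bin_dir}/tlrmevent", "--older-than", older_than])
    return commands


if __name__ == "__main__":
    for argv in cleanup_commands([1543], "2011-01-01-12:00:00"):
        print(shlex.join(argv))
```

To execute the steps rather than print them, each list could be passed to `subprocess.run(argv, check=True)` on the EMS.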

Checkpoints

tllschkpt

CnmEventAnalyzer     R  35301
PNSDEventAnalyzer    R  None
LLEventAnalyzer      S  None
SFPEventAnalyzer     R  None
monitor_event_queue  R  35301
MAX_event_rec_id        3530

Callouts:
– Analyzer rows: state when the analyzer last checkpointed
– monitor_event_queue: last event processed by the monitor
– MAX_event_rec_id: maximum rec_id in the event log

tllschkpt -f text shows additional data
– monitor_event_queue is the last recovery type and start rec_id
– GEAR-based analyzers contain pool checkpoint information
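A script that watches checkpoints could read the default listing into a dictionary. The Python sketch below assumes the whitespace-separated layout of the sample on this slide (name, optional state letter, rec_id or `None`); the real output format may differ, and the function name `parse_checkpoints` is hypothetical.

```python
SAMPLE = """\
CnmEventAnalyzer R 35301
PNSDEventAnalyzer R None
LLEventAnalyzer S None
SFPEventAnalyzer R None
monitor_event_queue R 35301
MAX_event_rec_id 3530
"""


def parse_checkpoints(text):
    """Map each checkpoint name to a (state, rec_id) tuple.

    state is None for rows without a state letter; rec_id is None
    when the listing shows 'None'.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 3:
            name, state, raw_id = fields
        elif len(fields) == 2:
            name, raw_id = fields
            state = None
        else:
            raise ValueError(f"unexpected checkpoint line: {line!r}")
        rec_id = None if raw_id == "None" else int(raw_id)
        result[name] = (state, rec_id)
    return result


if __name__ == "__main__":
    for name, (state, rec_id) in parse_checkpoints(SAMPLE).items():
        print(name, state, rec_id)
```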

Historic Analysis - Reanalyzing

User can set up a query for the criteria of interest

Filters and listeners that the configuration file enables for historic mode or for all modes are executed

Choice of committing or not committing (default) the generated alerts

To capture all alerts produced, use a file or print listener that does not specify any filters

Time occurred or time logged can be used for analysis

teal --historic --query="src_comp=CNM time_occurred>2011-02-01-10:00:00"

TEAL historic and tlls* Options

Field                 Operators     Value
rec_id                =,<,>,<=,>=   A single value or a comma-separated list of ids
event_id              =             A single value or a comma-separated list of ids
time_occurred         =,<,>,<=,>=   A single value in the format yyyy-mm-dd hh:mm:ss
time_logged           =,<,>,<=,>=   A single value in the format yyyy-mm-dd hh:mm:ss
src_comp              =             A single value or a comma-separated list of values
src_loc_type:src_loc  =             The location is optional; if it is omitted, all events with the same location type are included
src_scope             =             Level to scope all source locations to. Only valid if the source location type is specified
rpt_comp              =             A single value or a comma-separated list of values
rpt_loc_type:rpt_loc  =             The location is optional; if it is omitted, all events with the same location type are included
rpt_scope             =             Level to scope all reporting locations to. Only valid if the reporting location type is specified
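The operator restrictions in this table can be checked before a query string is handed to `teal --historic` or a `tlls*` command. The Python sketch below encodes the table and joins terms with spaces, as the command examples on the earlier slides do. The helper name `build_query` is hypothetical, and the keys `src_loc` and `rpt_loc` follow those command examples rather than the table's `type:location` notation. Values are passed through unchanged, so the caller supplies dates in whatever form the command accepts.

```python
_COMPARE = {"=", "<", ">", "<=", ">="}
_EQUALS = {"="}

# Operators allowed per event field, per the table above.
ALLOWED_OPS = {
    "rec_id": _COMPARE,
    "event_id": _EQUALS,
    "time_occurred": _COMPARE,
    "time_logged": _COMPARE,
    "src_comp": _EQUALS,
    "src_loc": _EQUALS,
    "src_scope": _EQUALS,
    "rpt_comp": _EQUALS,
    "rpt_loc": _EQUALS,
    "rpt_scope": _EQUALS,
}


def build_query(terms):
    """Build a query string from (field, operator, value) tuples."""
    parts = []
    for field, op, value in terms:
        allowed = ALLOWED_OPS.get(field)
        if allowed is None:
            raise ValueError(f"unknown field: {field!r}")
        if op not in allowed:
            raise ValueError(f"operator {op!r} not allowed for {field!r}")
        parts.append(f"{field}{op}{value}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query([
        ("src_comp", "=", "CNM"),
        ("time_occurred", ">", "2011-02-01-10:00:00"),
    ]))
```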

Sample output – csv and json

csv – good for reading into spreadsheets, or program parsing

rec_id,event_id,time_occurred,time_logged,src_comp,src_loc,src_loc_type,rpt_comp,rpt_loc,rpt_loc_type,event_cnt,elapsed_time
91455,BD700041,2011-02-09 15:06:19,2011-02-09 15:06:19,CNM,BB03-FR007-SN000-DR0-HB0-LD00,H,CNM,"TRMD",A,,

json – good for program parsing

{"src_comp": "CNM", "rpt_loc_type": "A", "event_id": "BD700041", "src_loc_type": "H", "time_occurred": "2011-02-09 15:06:19", "rec_id": 91455, "event_cnt": null, "rpt_loc": "TRMD", "elapsed_time": null, "rpt_comp": "CNM", "time_logged": "2011-02-09 15:06:19", "src_loc": "BB03-FR007-SN000-DR0-HB0-LD00"}
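Both formats load directly with the Python standard library. The sketch below parses the two samples from this slide; the constant and function names are hypothetical, and it assumes the json output is one object per event, as in the sample.

```python
import csv
import io
import json

CSV_TEXT = '''rec_id,event_id,time_occurred,time_logged,src_comp,src_loc,src_loc_type,rpt_comp,rpt_loc,rpt_loc_type,event_cnt,elapsed_time
91455,BD700041,2011-02-09 15:06:19,2011-02-09 15:06:19,CNM,BB03-FR007-SN000-DR0-HB0-LD00,H,CNM,"TRMD",A,,
'''

JSON_TEXT = (
    '{"src_comp": "CNM", "rpt_loc_type": "A", "event_id": "BD700041", '
    '"src_loc_type": "H", "time_occurred": "2011-02-09 15:06:19", '
    '"rec_id": 91455, "event_cnt": null, "rpt_loc": "TRMD", '
    '"elapsed_time": null, "rpt_comp": "CNM", '
    '"time_logged": "2011-02-09 15:06:19", '
    '"src_loc": "BB03-FR007-SN000-DR0-HB0-LD00"}'
)


def load_csv_events(text):
    """Return one dict per csv row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json_event(text):
    """Return the event as a dict; JSON null becomes None."""
    return json.loads(text)


if __name__ == "__main__":
    row = load_csv_events(CSV_TEXT)[0]
    event = load_json_event(JSON_TEXT)
    print(row["event_id"], row["src_loc"])
    print(event["rec_id"], event["event_cnt"])
```

Note the difference in typing: csv yields every field as a string (empty for missing values), while json keeps `rec_id` as a number and missing values as null.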

Alerts and Connectors

CNM and TEAL EMS

[Diagram labels: ISNM/CNM, NM, FSP (three), Init, Network Events, SFP, and the TEAL pipeline: Event, Monitor, Analyzer (with Rules), Filter, Listener, Alert.]

Network Hardware Events

Events reported by the HFI, ISR or Optical Module:

HFI Events
– HFI Down: reported for completeness of network status

Link Events
– Link types are HFI-to-ISR links, Llocal (intra-drawer), Lremote (intra-SN), and D-link (inter-SN)
– Port Down/Port Up
– Threshold events: CRC, dropped flit, flit retry
– Correctable/uncorrectable errors on port-level routing structures
– Packet flow events, e.g. credit overflow, sender hang informational

Optical Module Events
– Module-level events affect a single D port or two LR ports
– Channel-level events affect a single D port. May affect one or two LR ports depending on which channels are affected.
– Some OM events are thresholded by LNMC

Frame Events

Reported directly to CNM by frame (BPA) firmware

ISNM uses these events for analysis only
– BPA creates any serviceable events for the problems it detects; i.e., it suppresses network events caused by frame events

Sample frame events that may affect the ISR network:
– CEC power dropped due to MCM Over Temperature
– CEC DCCA errors
– High ambient temperature

[Diagram labels: BPA, FSP (four), CNM]

Example CNM Alert

>[c250mgrs52]>/opt/teal/bin/tllsalert -f text -q "alert_id=BD700025"
rec_id : 9673
alert_id : BD700025
creation_time : 2011-08-16 15:15:11.146044
severity : E
urgency : S
event_loc : FR052-CG03-SN000-DR0-HB1-OM12-LD12
event_loc_type : H
fru_loc : None
recommendation : There is a problem with a D-Link.
    Record the alert ID.
    Record the location in the alert message.
    Contact IBM Service.
    Log on to the Management Server.
    To isolate to the proper FRU, run Link Diags and perform the actions that it recommends.
    If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
reason : D-link down between frame FR052 cage CG03 (superNode SN000 drawer DR0) hub HB1 port LD12 and frame FR052 cage CG06 (superNode SN003 drawer DR0) hub HB1 port LD15 (D Link Port Down)
src_name : CnmEventAnalyzer
state : 1
raw_data : {"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },{ HFI_CAB,Symbolic Procedure,U78A9.001.20C1000-P1-T17-T6,,, },{ CBLCONT,Symbolic Procedure,U78A9.001.311B001-P1-T16-T5,,, },{ 52Y3020,FRU,U78A9.001.20C1000-P1-R2,YA193P203586,ABC123,TRMD },{ 52Y3020,FRU,U78A9.001.311B001-P1-R2,YA193P399669,ABC123,TRMD }","nbr_loc":"FR052-CG06-SN003-DR0-HB1-OM15-LD15","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B506"}

CNM FRU list format in alerts

raw_data : {"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },{ HFI_CAB,Symbolic Procedure,U78A9.001.20C1000-P1-T17-T6,,, },{ CBLCONT,Symbolic Procedure,U78A9.001.311B001-P1-T16-T5,,, },{ 52Y3020,FRU,U78A9.001.20C1000-P1-R2,YA193P203586,ABC123,TRMD },{ 52Y3020,FRU,U78A9.001.311B001-P1-R2,YA193P399669,ABC123,TRMD }","nbr_loc":"FR052-CG06-SN003-DR0-HB1-OM15-LD15","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B506"}

• Multiple FRUs with each one contained in braces
• Part number, FRU type, FRU location, part serial number, ECID, CCIN

Part Number  FRU type             FRU location                 Part Serial Number  ECID    CCIN
HFI_DDG      Isolation Procedure
HFI_CAB      Symbolic Procedure   U78A9.001.20C1000-P1-T17-T6
CBLCONT      Symbolic Procedure   U78A9.001.311B001-P1-T16-T5
52Y3020      FRU                  U78A9.001.20C1000-P1-R2      YA193P203586        ABC123  TRMD
52Y3020      FRU                  U78A9.001.311B001-P1-R2      YA193P399669        ABC123  TRMD
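A script can recover these columns from a CNM alert's raw_data. The Python sketch below assumes what the sample shows: raw_data is JSON, `fru_list` is a single string, each FRU sits in its own pair of braces, and each holds six comma-separated fields in the column order of the table. The names `Fru` and `parse_cnm_fru_list` are hypothetical, and the sample is shortened to two FRUs.

```python
import json
import re
from collections import namedtuple

Fru = namedtuple("Fru", "part_number fru_type location serial_number ecid ccin")

# One brace-delimited group per FRU.
_BRACED = re.compile(r"\{([^{}]*)\}")

RAW_DATA = (
    '{"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },'
    '{ 52Y3020,FRU,U78A9.001.20C1000-P1-R2,YA193P203586,ABC123,TRMD }",'
    '"nbr_typ":"H"}'
)


def parse_cnm_fru_list(fru_list):
    """Split a CNM fru_list string into Fru tuples (empty fields stay '')."""
    frus = []
    for body in _BRACED.findall(fru_list):
        fields = [field.strip() for field in body.split(",")]
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {body!r}")
        frus.append(Fru(*fields))
    return frus


if __name__ == "__main__":
    raw = json.loads(RAW_DATA)
    for fru in parse_cnm_fru_list(raw["fru_list"]):
        print(fru.part_number, fru.fru_type, fru.location or "-")
```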

Example CNM Compound Alert

>[c250mgrs52]>/opt/teal/bin/tllsalert -f text -q "alert_id=BDFF0060" -w
rec_id : 13304
alert_id : BDFF0060
creation_time : 2011-08-26 19:02:53.971854
severity : W
urgency : O
event_loc : FR052-CG04-SN001-DR0
event_loc_type : H
fru_loc : None
recommendation : A large number of HFI network links attached to a drawer are down without an accompanying power event.
    Contact IBM Service and report the alert ID.
    If a drawer lost power, then this is a secondary effect.
reason : Drawer level event occurred on frame FR052 cage CG04 (superNode SN001 drawer DR0). (Suspicious Drawer)
src_name : CnmEventAnalyzer
state : 1
raw_data : {"fru_list":"{ HFI_IDR,Isolation Procedure,,,, }","nbr_loc":"FR052-CG04-SN001-DR0-HB7-OM09-LD09","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B5D6"}
Condition Alerts: []
Condition Events: [32873,32874,32875,32876,32877,32878,32879,32880,32881,32882,32883,32884]
Duplicate Alerts: []
Suppression Alerts: []
Suppression Events: []

Example CNM Alert with suppression

>[c250mgrs52]>/opt/teal/bin/tllsalert -f text -q "alert_id=BD700022" -w
rec_id : 8507
alert_id : BD700022
creation_time : 2011-08-11 14:39:00.244292
severity : E
urgency : S
event_loc : FR052-CG10-SN007-DR0-HB3-OM09-LD09
event_loc_type : H
fru_loc : None
recommendation : There is a problem with a D-Link.
    Record the alert ID and call IBM Service.
    Log on to the Management Server.
    To isolate to the proper FRU, run Link Diags and perform the actions that it recommends.
    If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
reason : D Link Port Lane Width Change between frame FR052 cage CG10 (superNode SN007 drawer DR0) hub HB3 port LD09 and frame FR052 cage CG09 (superNode SN006 drawer DR0) hub HB3 port LD08 (D Link Port Lane Width Change)
src_name : CnmEventAnalyzer
state : 1
raw_data : {"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },{ HFI_CAB,Symbolic Procedure,U78A9.001.30CK001-P1-T14-T1,,, },{ CBLCONT,Symbolic Procedure,U78A9.001.312N005-P1-T14-T2,,, },{ 52Y3020,FRU,U78A9.001.30CK001-P1-R5,YA193P400322,ABC123,TRMD },{ 52Y3020,FRU,U78A9.001.312N005-P1-R5,YA193N035309,ABC123,TRMD }","nbr_loc":"FR052-CG09-SN006-DR0-HB3-OM08-LD08","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B5F6"}
Condition Alerts: []
Condition Events: [26388]
Duplicate Alerts: [8511]
Suppression Alerts: []
Suppression Events: [26389,26390]

Example CNM Event

>[c250mgrs52]>/opt/teal/bin/tllsevent -f text -q "event_id=BD700025" -e
rec_id : 22877
event_id : BD700025 - D Link Port Down
time_occurred : 2011-08-01 14:52:14
time_logged : 2011-08-01 14:52:14.369687
src_comp : CNM
src_loc : FR052-CG07-SN004-DR0-HB0-OM14-LD14
src_loc_type : H
rpt_comp : CNM
rpt_loc : c250mgrs52##cnmd
rpt_loc_type : A
event_cnt : None
elapsed_time : None
ext.eed_loc_info : c250mgrs52:/var/opt/isnm/cnm/log
ext.encl_mtms : 9125-F2C/028B596
ext.global_counter : None
ext.isnm_raw_data : REG_BEGIN ISR_GLOBAL_COUNTER_REGISTER = 0x000005347ecda480 ISR_ID_REGISTER = 0x004800d01c000000 ISR_D14D15_FIR = 0x4000000000000000 D_PORT_14_SEND_NEIGHBOR_ID = 0x000800d01ee00000 OLL_LLD14_LINK_STATUS = 0xc1d6000100000000 REG_END
ext.local_om1 : U78A9.001.30CM002-P1-R2-R1,52Y3020,YA193P407777,ABC122,TRMD
ext.local_om2 :
ext.local_planar : U78A9.001.30CM002-P1,74Y0601,YH10HA0BH002,ABC122,2E00
ext.local_port : U78A9.001.30CM002-P1-T17-T7
ext.local_torrent : U78A9.001.30CM002-P1-R2,52Y3020,YA193P407777,ABC123,TRMD
ext.nbr_om1 : U78A9.001.30CK001-P1-R2-R4,52Y3020,YA193P399201,ABC123,TRMD
ext.nbr_om2 :
ext.nbr_planar : U78A9.001.30CK001-P1,74Y0601,YH10HA0BJ003,ABC123,2E00
ext.nbr_port : U78A9.001.30CK001-P1-T15-T8
ext.nbr_torrent : U78A9.001.30CK001-P1-R2,52Y3020,YA193P399201,ABC123,TRMD
ext.neighbor_loc : H: FR052-CG04-SN006-DR0-HB0-OM11-LD11
ext.pwr_ctrl_mtms : 78AC-100BC50052
ext.recovery_file_path : /var/opt/isnm/cnm/log

SFP Connector

Uses RMC and xCAT monitoring support

Retrieves batches of events from HMC

[c250mgrs14][/]> nodels hmc
c250hmc05_a

[c250mgrs14][/]> lscondresp
Displaying condition with response information:
Condition                  Response              Node          State
"AllServiceableEvents_HB"  "TealLogSfpEvent_HB"  "c250mgrs14"  "Active"

[Diagram labels: FSP, HMC, TEAL]

SFP Event

rec_id : 8490
event_id : B1812A80
time_occurred : 2011-04-20 09:57:41
time_logged : 2011-04-20 09:58:46.187401
src_comp : SFP
src_loc : U9125.F2C.P7IH165
src_loc_type : P
rpt_comp : 7042CR5/KQZAAAT
rpt_loc : c250hmc05.ppd.pok.ibm.com##AllServiceableEvents_B
rpt_loc_type : A
event_cnt : None
elapsed_time : None
ext.call_home : N
ext.description : Platform firmware (0x81) reported an error.
ext.fru_list : [['FSPSP04', 'ACT04219I Isolate procedure', '', '', '', ''], ['45D7208', 'ACT04216I FRU', 'U78A9.001.1122233-P1-R5', 'YH30HA022005', '', '2A3A'], ['FSPSP06', 'ACT04219I Isolate procedure', '', '', '', '']]
ext.prob_num : 320
ext.sfp_raw_data : {'FRURecentlyReplaced': ['No', 'No', 'No'], 'FRULogicControllingCECMachineSerialNumber': ['P7IH165', 'P7IH165', 'P7IH165'], 'HSCBiosName': 'KQZAAAT', 'CreatedTimeStamp': '04/20/2011 06:16:49', 'CECMachineModel': 'F2C', 'FDAdditionalMachine': ['9125-F2C-P7IH165'], 'EventType': 'open', 'SystemRefCode': 'B1812A80', 'CreatorID': 'E', 'FRUEnclosureMachineSerialNumber': ['P7IH165', 'P7IH165', 'P7IH165'], 'FRUEnclosureMachineTypeModel': ['9125-F2C', '9125-F2C', '9125-F2C'], 'DuplicateCount': '0', 'EventSeverity': '32', 'CECMachineType': '9125', 'SubsystemID': '129', 'FRULogicControllingCECMachineTypeModel': ['9125-F2C', '9125-F2C', '9125-F2C'], 'CalledHome': 'No', 'FRUReplacementPriority': ['80', '50', '25'], 'CECMachineSerialNumber': 'P7IH165', 'LastReportedTimeStamp': '04/20/2011 06:16:49', 'HSCBiosId': '7042CR5', 'PlatformLogID': '1346333000'}

SFP Alert

rec_id : 8040
alert_id : 14020079
creation_time : 2011-05-17 12:58:58.661058
severity : E
urgency : N
event_loc : U9458.100.BPCF007
event_loc_type : P
fru_loc : None
recommendation :
reason : Power/Cooling subsystem & control (0x60) reported an error.
src_name : SFPEventAnalyzer
state : 1
raw_data : {'FRU List': [['IQYRISC', 'ACT04219I Isolate procedure', '', '', '', ''], ['PU_BOOK', 'ACT04216I FRU', 'U78A9.001.1122233', '', '', '']], 'SFP': 'c250hmc05.ppd.pok.ibm.com', 'Problem Number': 601}

SFP FRU list format in alerts

raw_data : {'FRU List': [['IQYRISC', 'ACT04219I Isolate procedure', '', '', '', ''], ['PU_BOOK', 'ACT04216I FRU', 'U78A9.001.1122233', '', '', '']], 'SFP': 'c250hmc05.ppd.pok.ibm.com', 'Problem Number': 601}

• Multiple FRUs with each one contained in brackets
• Part number, FRU type, FRU location, part serial number, ECID, CCIN

Part Number  FRU type           FRU location       Part Serial Number  ECID  CCIN
IQYRISC      Isolate Procedure
PU_BOOK      FRU                U78A9.001.1122233
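Unlike the CNM alerts, the SFP raw_data in this sample uses single-quoted strings, so it reads as a Python literal rather than as JSON. Under that assumption, `ast.literal_eval` from the standard library can load it safely (it evaluates literals only, not arbitrary code). The function name `sfp_frus` and the column names are hypothetical labels taken from the table above.

```python
import ast

COLUMNS = ("part_number", "fru_type", "location", "serial_number", "ecid", "ccin")

RAW_DATA = (
    "{'FRU List': [['IQYRISC', 'ACT04219I Isolate procedure', '', '', '', ''], "
    "['PU_BOOK', 'ACT04216I FRU', 'U78A9.001.1122233', '', '', '']], "
    "'SFP': 'c250hmc05.ppd.pok.ibm.com', 'Problem Number': 601}"
)


def sfp_frus(raw_data):
    """Return the 'FRU List' entries as dicts keyed by column name."""
    data = ast.literal_eval(raw_data)
    if not isinstance(data, dict):
        raise ValueError("raw_data is not a dictionary literal")
    frus = []
    for entry in data.get("FRU List", []):
        if len(entry) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields: {entry!r}")
        frus.append(dict(zip(COLUMNS, entry)))
    return frus


if __name__ == "__main__":
    for fru in sfp_frus(RAW_DATA):
        print(fru["part_number"], "|", fru["fru_type"], "|", fru["location"])
```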

LoadLeveler Connector

New support for LoadLeveler 5.1

DB table polling via TEAL connector daemon

LoadLeveler must be configured to use the DB

[Diagram labels: teal_ll, TLL_Raslog, teal]

[root@c250mgrs20-pvt log]# service teal_ll status
[ OK ]
loadleveler.py (pid 17583) is running...

[c250mgrs14][/]> lssrc -s teal_ll
Subsystem  Group  PID      Status
teal_ll           5701830  active

LoadLeveler Alert

===================================================
rec_id : 9
alert_id : LL001000
creation_time : 2011-05-19 13:26:34.559391
severity : E
urgency : N
event_loc : z25c4s12.ppd.pok.ibm.com
event_loc_type : A
fru_loc : None
recommendation : Call next level of support
reason : LoadL_schedd on machine z25c4s12.ppd.pok.ibm.com is down.
src_name : LLEventAnalyzer
state : 1
raw_data :

LL alert_id:
LL0010xx = Daemon Down
LL0020xx = job failures
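The two alert ID families can be told apart by their first six characters. The small Python sketch below applies that rule when triaging LoadLeveler alerts; the function name `ll_alert_category` is hypothetical, and only the two families named on this slide are covered.

```python
# First six characters of the alert_id -> family named on the slide.
LL_CATEGORIES = {
    "LL0010": "Daemon Down",
    "LL0020": "job failures",
}


def ll_alert_category(alert_id):
    """Return the LoadLeveler alert family, or None if it is not listed."""
    return LL_CATEGORIES.get(alert_id[:6])


if __name__ == "__main__":
    print(ll_alert_category("LL001000"))
```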

PNSD Connector

Multi-tiered configuration through service nodes using RMC and xCAT monitoring support

Uses pnsd_stat command to get statistics

May cause jitter on compute nodes so may not be enabled in all cases

xcatmn2:~ # lscondresp
Displaying condition with response information:
Condition                Response              Node       State
"TealAnyNodePnsdStat_H"  "TealLogPnsdEvent_H"  "xcatmn2"  "Active"

[Diagram labels: Compute, Svc Node, TEAL]

PNSD Alert

===================================================
rec_id : 12
alert_id : PNSD0001
creation_time : 2011-01-26 23:03:40
severity : E
urgency : N
event_loc : compute37##TealPnsdStat
event_loc_type : A
fru_loc : None
recommendation : Call next level of support
reason : Packet retransmit threshold has been exceeded on node compute37
src_name : PNSDEventAnalyzer
state : 1
raw_data : 0.046

PNSD alert_id:
PNSD0001 = Retransmit threshold exceeded


Installation

Packaging

Multi-platform
– AIX: installp
– Linux: RPM

Base
– Pipeline
– Base services
    • Logging
    • DB access
    • Configuration
    • Locations
– Rules engine
– Common filters/listeners
– Command line
– xCAT extensions

Component
– Connector Library/Program
– Rules
– Alert/Event Metadata
– Extension Data Format
– User specific Filters/Listeners
– Configuration file

[Diagram labels: TEAL Base, with ISNM, GPFS, PNSD, LL, Service Focal Point, ...]

Configuration Files

Stanza-based

Used during startup (/etc/teal)

Separate files per package (teal.conf => base framework features)

Configures processing pipeline

Additional parameters for specialized function

Enabled in different modes

[alert_listener.RmcAlertListener]
class = ibm.teal.listener.rmc_alert_listener.RmcAlertListener
enabled = false

[alert_listener.FileAlertListener]
class = ibm.teal.listener.file_alert_listener.FileAlertListener
enabled = historic
filters = DuplicateAlertFilter
format = text
file = /var/log/teal/cluster_alert.log
mode = write
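Because the files are stanza-based, Python's standard `configparser` can read the sample stanzas to answer a question such as which alert listeners are enabled in a given mode. This is a sketch of inspecting a configuration file, not of how TEAL itself loads it. The function name `listeners_for_mode` is hypothetical, and treating `all` as a value that enables every mode is an assumption; the slides show only `false`, `historic`, and `realtime`.

```python
import configparser

CONF_TEXT = """\
[alert_listener.RmcAlertListener]
class = ibm.teal.listener.rmc_alert_listener.RmcAlertListener
enabled = false

[alert_listener.FileAlertListener]
class = ibm.teal.listener.file_alert_listener.FileAlertListener
enabled = historic
filters = DuplicateAlertFilter
format = text
file = /var/log/teal/cluster_alert.log
mode = write
"""


def listeners_for_mode(conf_text, mode):
    """Return names of alert_listener stanzas enabled for the given mode."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    names = []
    for section in parser.sections():
        if not section.startswith("alert_listener."):
            continue
        enabled = parser.get(section, "enabled", fallback="false").strip().lower()
        # Assumption: 'all' enables a listener in every mode.
        if enabled in (mode, "all"):
            names.append(section.split(".", 1)[1])
    return names


if __name__ == "__main__":
    print(listeners_for_mode(CONF_TEXT, "historic"))
```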

Adding an e-mail listener

Add the definition where TEAL will pick it up:
– Add to the base configuration file (/etc/teal/teal.conf)
– Add in a file in the configuration directory (/etc/teal/my.conf)
– For temporary use: copy conf file(s) to your own directory, modify them, and use them during historic analysis (more often for writing out alerts)

[alert_listener.SmtpAlertListener]
class = ibm.teal.listener.smtp_alert_listener.SmtpAlertListener
enabled = realtime
filters = DuplicateAlertFilter
server = ems1234.cluster.net
[email protected], [email protected] [email protected]

Directory Structure

[Diagram: directory tree under /, with callouts.
Directories: /etc/teal, /usr/lib, /opt/teal, /bin, /data, /ibm
Callouts: Start-up configuration (default); Libraries; Code; Component rules & metadata, Location, Extended data def]

When Things Go Wrong

/var/log/teal has TEAL logs (default)

On AIX look at the console (alog -t console -o)

Note the following (these are important fields with their TEAL and SFP equivalents):
– TEAL alert_id, SFP refcode
– TEAL src_loc, SFP reporting MTMS
– TEAL reason, SFP problem description
– FRU list in TEAL and SFP

Specific alert data or range (text format)
– /opt/teal/bin/tllsalert -f text -q "[query to narrow down]"
– -f json or -f csv can be handier for grepping out certain records
– -d to show duplicates

Specific event data or range (text, with extended and raw data)
– /opt/teal/bin/tllsevent -f text -e -r -q "[query to narrow down]"
– -f json or -f csv can be handier for grepping out certain records
– -x to show which alerts it is associated with

When Things Go Wrong (continued)

Data dump
– /opt/teal/sbin/tltab -d -p <path to dump file>
– Restore:
    • /opt/teal/sbin/tltab -c    # Drop and recreate the tables
    • /opt/teal/sbin/tltab -r -p <path to returned file>    # Restore the tables with the user data

See TEAL on sourceforge (pyteal.sourceforge.net)

Look at the service pack for known issues, hints/tips, etc.
– http://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+High+Performance+Computing+Clusters+Service+Packs

References

TEAL Sourceforge Project - http://pyteal.sourceforge.net
– Command reference
– Install/Configuration Instructions
– Design Overview & other goodies
– Mailing List
– Problem Tickets

xCAT HPC Software Installation
– http://sourceforge.net/apps/mediawiki/xcat/index.php?title=IBM_HPC_Stack_in_an_xCAT_Cluster
– LoadLeveler
– GPFS
– RSCT/RMC

Cluster Guide
– https://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+HPC+Clustering+with+Power+775+-+Cluster+Guide

Cluster Service Pack readme
– https://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+High+Performance+Computing+Clusters+Service+Packs