Tech Talk: PureData System for Analytics
How to handle disk problems

Rafael Lopez Perez, IBM Staff Software Engineer
Arie Vurtzel, IBM Hardware Engineer


Webinar Replays are available On-Demand @ http://ibm.biz/dwwebinars

Welcome to today’s Tech Talk

Call Logistics

• There are 2 options to listen to the webinar:
  • Call In via Phone: Dial the phone number shown on your screen. When prompted, use your telephone keypad to enter the access code and the Attendee ID shown on your screen.
  • Call Using Computer: Choose this option to connect to audio using VoIP.
• Questions will be addressed at the end of the webinar:
  • ALL questions should be posted to the Q&A Panel. You can post questions at any time during the meeting.
  • You can find the Q&A panel either on the top or on the sidebar on the right-hand side of your screen.
  • Use the chat panel to post questions to the Host.

©2018 IBM Corporation

Topics that we will present:

Introduction
Basic disk errors
Disk automatic failures
Disk monitoring
Nzraidcheck and nzmicrodiskrepair
Rare disk issues
Q&A
Reference material
Next TechTalk

Disk issues

Basic disk errors
We will talk about the most common disk errors that we see in the sysmgr logs and in the SMART statistics.

Disk automatic failures
Automatic disk failures related to those disk errors. We will show how sysmgr tracks SCSI errors and acts on them.

Disk monitoring
What you can do to keep an eye on the disks.

Nzraidcheck and nzmicrodiskrepair
How we repair data inconsistencies in data slices.

Rare disk issues
– Disk failed and NPS restart due to SPU reboot.
– Disk regen failing due to problems in the target disk and SPU reboot.
– Disk regen propagating disk issues.


Basic disk errors

Examples

2013-11-21 22:54:05.038793 EST Info: SCSI error reported from [disk hwid=1041 sn="9QJ7VLAX000090382WHJ" SPA=1 Parent=1023 Position=12 ParentEnclPosition=2 Spu=1043] : sector = 238075458, error = medium error (0x3), asc = no additional sense information (0x11, 0x0)

2013-11-21 23:32:11.820128 EST Info: SCSI error reported from [disk hwid=1041 sn="9QJ7VLAX000090382WHJ" SPA=1 Parent=1023 Position=12 ParentEnclPosition=2 Spu=1043] : sector = 169208130, error = recovered error (0x1), asc = no additional sense information (0x18, 0x5)

22:30:25.537039 EST Info: SCSI error reported from [disk hwid=1132 sn="9QJ79KY100009030U7BT" SPA=4 Parent=1121 Position=5 ParentEnclPosition=4] : sector = 16843330, error = recovered error (0x1), asc = recovered data - recommend reassignment (0x18, 0x5) response_code = 0x70

2013-11-22 01:53:47.871438 EST Info: SCSI error reported from [disk hwid=1041 sn="9QJ7VLAX000090382WHJ" SPA=1 Parent=1023 Position=12 ParentEnclPosition=2 Spu=1043] : sector = 756605860, error = hardware error (0x4), asc = no additional sense information (0x32, 0x0)

2013-04-08 11:14:50.664154 CES Info: SCSI error reported from [disk hwid=1085 sn="9QJ7X0QA000090380XYK" SPA=1 Parent=1075 Position=2 ParentEnclPosition=4 Spu=1097] : sector = 165236034, error = medium error (0x3), asc = unrecovered read error (0x11, 0x0) response_code = 0x70
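The SCSI error lines above follow a regular shape, so they can be tallied per disk with a short script. The following is a minimal Python sketch, not an IBM tool: the regular expression is inferred only from the sample lines on this slide, and the function name `summarize_scsi_errors` is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample sysmgr lines on this slide (an assumption,
# not a documented log grammar). It captures the disk hwid, the sector and
# the sense-key text such as "medium error".
SCSI_RE = re.compile(
    r"SCSI error reported from \[disk\s*hwid=(?P<hwid>\d+).*?\]\s*:\s*"
    r"sector = (?P<sector>\d+), error = (?P<error>[a-z ]+?) \((?P<key>0x[0-9a-f]+)\)"
)

def summarize_scsi_errors(lines):
    """Count SCSI errors per (disk hwid, sense-key text)."""
    counts = Counter()
    for line in lines:
        match = SCSI_RE.search(line)
        if match:
            counts[(int(match.group("hwid")), match.group("error"))] += 1
    return counts

# Sample lines copied from the slide.
sample = [
    '2013-11-21 22:54:05.038793 EST Info: SCSI error reported from [disk hwid=1041 sn="9QJ7VLAX000090382WHJ" SPA=1 Parent=1023 Position=12 ParentEnclPosition=2 Spu=1043] : sector = 238075458, error = medium error (0x3), asc = no additional sense information (0x11, 0x0)',
    '2013-11-21 23:32:11.820128 EST Info: SCSI error reported from [disk hwid=1041 sn="9QJ7VLAX000090382WHJ" SPA=1 Parent=1023 Position=12 ParentEnclPosition=2 Spu=1043] : sector = 169208130, error = recovered error (0x1), asc = no additional sense information (0x18, 0x5)',
    '2013-11-22 01:53:47.871438 EST Info: SCSI error reported from [disk hwid=1041 sn="9QJ7VLAX000090382WHJ" SPA=1 Parent=1023 Position=12 ParentEnclPosition=2 Spu=1043] : sector = 756605860, error = hardware error (0x4), asc = no additional sense information (0x32, 0x0)',
    '2013-04-08 11:14:50.664154 CES Info: SCSI error reported from [disk hwid=1085 sn="9QJ7X0QA000090380XYK" SPA=1 Parent=1075 Position=2 ParentEnclPosition=4 Spu=1097] : sector = 165236034, error = medium error (0x3), asc = unrecovered read error (0x11, 0x0) response_code = 0x70',
]

for (hwid, error), count in sorted(summarize_scsi_errors(sample).items()):
    print(hwid, error, count)
```

A tally like this only helps to spot which disks report the most errors; the decision to fail a disk remains with sysmgr and nzhealthcheck.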


Disk automatic failures: Reasons

The decision to fail a disk automatically is related to the nzhealthcheck rules DM011, DM012, DM013, DM014 and DM015, which are implemented in sysmgr. The decision is based on disk parameters that are collected and stored in the database catalog.

Examples:

2013-04-03 21:31:09.441522 UTC Info: Scsi Errors: disk: [disk hwid=1015 sn="9QJ78TEZ00009030EA4P" SPA=1 Parent=1004 Position=5 ParentEnclPosition=1 Spu=1003]
2013-04-03 21:31:24.403811 UTC Warning: NZ-01550: disk [disk hwid=1015 sn="9QJ78TEZ00009030EA4P" SPA=1 Parent=1004 Position=5 ParentEnclPosition=1 Spu=1003] failover (cause = 'Scsi Errors')

2014-12-05 13:24:04.428335 CST Warning: NZ-01529: disk [disk hwid=1117 sn="9QJ60E3Y00009017UE9L" SPA=2 Parent=1106 Position=3 ParentEnclPosition=1 Spu=1198] failover (cause = 'Grown Defect Limit Reached')

2014-09-08 09:17:46.143811 CDT Warning: NZ-01587: disk [disk hwid=2009 sn="9WK5KP370000C2124QC2" SPA=11 Parent=1996 Position=5 ParentEnclPosition=1 Spu=2285] a SCSI Predictive Failure has been detected on disk 2009
2014-09-08 09:17:58.506626 CDT Warning: NZ-01550: disk [disk hwid=2009 sn="9WK5KP370000C2124QC2" SPA=11 Parent=1996 Position=5 ParentEnclPosition=1 Spu=2285] failover (cause = 'Scsi PFE Error')

2013-03-27 17:25:23.560513 EDT Warning: NZ-01550: disk [disk hwid=1074 sn="9QJ6QLK300009025PFBD" SPA=1 Parent=1061 Position=7 ParentEnclPosition=4 Spu=1080] failover (cause = 'HW error')

Warning: NZ-01550: disk [disk hwid=1041 sn="9QJ641GZ00009019M9Z0" SPA=1 Parent=1023 Position=12 ParentEnclPosition=2 Spu=1056] failover (cause = 'Problem with Disk Raid')
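To see at a glance why disks were failed, the failover lines can be grouped by their cause string. This is a small Python sketch under the assumption that every failover line looks like the examples on this slide; the function name `failover_causes` is hypothetical and the pattern is not an official log format.

```python
import re
from collections import defaultdict

# Inferred from the example lines: an NZ- message code, the disk hwid and a
# quoted cause at the end of the line. Treat this as an assumption.
FAILOVER_RE = re.compile(
    r"NZ-(?P<code>\d+): disk \[disk hwid=(?P<hwid>\d+).*?\] failover "
    r"\(cause = '(?P<cause>[^']+)'\)"
)

def failover_causes(lines):
    """Map each failover cause to the list of disk hwids failed for it."""
    causes = defaultdict(list)
    for line in lines:
        match = FAILOVER_RE.search(line)
        if match:
            causes[match.group("cause")].append(int(match.group("hwid")))
    return dict(causes)

# Sample lines copied from the slide; the NZ-01587 line is a warning that
# precedes the failover and is deliberately not counted as one.
sample = [
    '2013-04-03 21:31:24.403811 UTC Warning: NZ-01550: disk [disk hwid=1015 sn="9QJ78TEZ00009030EA4P" SPA=1 Parent=1004 Position=5 ParentEnclPosition=1 Spu=1003] failover (cause = \'Scsi Errors\')',
    '2014-12-05 13:24:04.428335 CST Warning: NZ-01529: disk [disk hwid=1117 sn="9QJ60E3Y00009017UE9L" SPA=2 Parent=1106 Position=3 ParentEnclPosition=1 Spu=1198] failover (cause = \'Grown Defect Limit Reached\')',
    '2014-09-08 09:17:46.143811 CDT Warning: NZ-01587: disk [disk hwid=2009 sn="9WK5KP370000C2124QC2" SPA=11 Parent=1996 Position=5 ParentEnclPosition=1 Spu=2285] a SCSI Predictive Failure has been detected on disk 2009',
    '2014-09-08 09:17:58.506626 CDT Warning: NZ-01550: disk [disk hwid=2009 sn="9WK5KP370000C2124QC2" SPA=11 Parent=1996 Position=5 ParentEnclPosition=1 Spu=2285] failover (cause = \'Scsi PFE Error\')',
]

for cause, hwids in failover_causes(sample).items():
    print(cause, hwids)
```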


Disk automatic failures: SMART data

When nzhealthcheck or nzhw reports that a disk has failed, we can find the HWID taken from the catalog, as well as the SPA it belongs to, the enclosure and the slot. Running a command like:

nzpush -s Spa/Spu disk smart --encl y --slot z

produces the SMART report for the specific disk that nzhealthcheck reported.


SMART stands for Self-Monitoring, Analysis, and Reporting Technology. Its intention is to predict when a healthy disk becomes more likely to fail because of issues found during the normal life of the disk. The SMART result produced on Netezza contains 12 sections:

Basic Info: model, firmware, size
Write: information related to the write process
Read: information related to the read process
Verify: information related to the verify process
Non Medium: errors related to electronic problems rather than the media
Temperature: informational detail
Self Test: information about two kinds of test, the short one and the extended one
Background Scans: usually done at idle times
Power On Hours: how many hours the disk has been working
Grown list: the number of defective sectors found
Predictive Failure: whether a failure is predicted on this drive
Date of Manufacturing: week (1-52) and year


nzpush -s 1/1 disk smart --encl 1 --slot 1

spu0101: DSK: Make : ST600MM0026
spu0101: DSK: Model : IBM-ESXS
spu0101: DSK: F/W Rev. : E56F
spu0101: DSK: S/N : S0M1P8W10000B426HS4W
spu0101: DSK: Size : 600 GB, (1172123567 sectors)
spu0101: DSK: Transport Protocol: SAS
spu0101: DSK: Disk Location : encl1Slot01

spu0101: ---------------Write---------------
spu0101: Errors corrected with possible delays = 0
spu0101: Total re-writes re-reads = 0
spu0101: Total corrected errors = 0
spu0101: Total times correction algorithm processed = 0
spu0101: Total bytes processed = 27116592824992
spu0101: Total uncorrected errors = 0


spu0101: ---------------Read---------------
spu0101: Errors corrected without possible delays = 3148236189
spu0101: Errors corrected with possible delays = 0
spu0101: Total re-writes re-reads = 0
spu0101: Total corrected errors = 3148236189
spu0101: Total times correction algorithm processed = 0
spu0101: Total bytes processed = 76875706935864
spu0101: Total uncorrected errors = 0

spu0101: ---------------Verify---------------
spu0101: Errors corrected without possible delays = 203097368
spu0101: Errors corrected with possible delays = 0
spu0101: Total re-writes re-reads = 0
spu0101: Total corrected errors = 203097368
spu0101: Total times correction algorithm processed = 0
spu0101: Total bytes processed = 58908639920
spu0101: Total uncorrected errors = 0


spu0101: ---------Non Medium-------------
spu0101: Non-medium error count = 1543
spu0101: ---------Temperature-------------
spu0101: Current temperature = 29
spu0101: Reference temperature = 65
spu0101: ---------Self Test-------------
spu0101: Self-test Extended Duration = 62 minutes
spu0101: Self-test Short Duration < 3 minutes
spu0101: ---------Background Scans ------------
spu0101: Background Scanning Status = Background scanning is enabled and the device is waiting for Background Medium Interval timer experation
spu0101: Number of background scans performed = 451
spu0101: Background medium scan progress = 0%
spu0101: ---------Power On Hours-------------
spu0101: Power On Hours = 32007
spu0101: ---------Grown list -------------
spu0101: Grown defect list = 0
spu0101: ---------Predictive Failure -------------
spu0101: Predictive failure = None
spu0101: ---------Date of Manufacturing---------
spu0101: Calendar Year = 2014
spu0101: Calendar Week = 2
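The SMART report is easier to compare across disks once it is grouped by section. The following Python sketch parses output shaped like the report shown here into a dictionary and flags a few counters. The choice of counters to flag (uncorrected errors, grown defects, predictive failure) is an illustrative assumption based on the failure causes named in this talk, not the rule set that sysmgr applies, and both function names are hypothetical.

```python
import re

PREFIX_RE = re.compile(r"^spu\d+:\s*")                   # "spu0101: " prefix added by nzpush
SECTION_RE = re.compile(r"^-+\s*([A-Za-z ]+?)\s*-+$")    # "-----Read-----" style headers

def parse_smart(text):
    """Group the 'key = value' lines of a SMART report by section name."""
    sections = {}
    current = "Basic Info"
    for raw in text.splitlines():
        line = PREFIX_RE.sub("", raw.strip())
        header = SECTION_RE.match(line)
        if header:
            current = header.group(1)
        elif "=" in line:
            key, value = line.split("=", 1)
            sections.setdefault(current, {})[key.strip()] = value.strip()
    return sections

def nonzero_failure_counters(sections):
    """Return (section, counter, value) for counters that are not at rest."""
    findings = []
    for name in ("Write", "Read", "Verify"):
        count = int(sections.get(name, {}).get("Total uncorrected errors", "0"))
        if count:
            findings.append((name, "Total uncorrected errors", count))
    grown = int(sections.get("Grown list", {}).get("Grown defect list", "0"))
    if grown:
        findings.append(("Grown list", "Grown defect list", grown))
    predicted = sections.get("Predictive Failure", {}).get("Predictive failure", "None")
    if predicted != "None":
        findings.append(("Predictive Failure", "Predictive failure", predicted))
    return findings

# Excerpt of the report shown on the slides.
report = """
spu0101: ---------------Read---------------
spu0101: Total corrected errors = 3148236189
spu0101: Total uncorrected errors = 0
spu0101: ---------Grown list -------------
spu0101: Grown defect list = 0
spu0101: ---------Predictive Failure -------------
spu0101: Predictive failure = None
"""

sections = parse_smart(report)
print(nonzero_failure_counters(sections))
```

For the healthy disk in this example the list of findings is empty, even though the corrected-error counters are large.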


Disk monitoring

Nzhealthcheck rules report disk issues before the disks are failed, and sometimes suggest failing them manually.

Rule : DM011
Issue Detected : Multiple SCSI log page 0x15 events occuring daily
Severity : Low
Components : disk5[spa1.encl4](HWID: 1094) (from catalog) - 967 defects occured on 2015-07-08.
Expert's Advice : No action is required to be performed. This information is valuable only when there is a noticeable performance degradation on the system. This issue may be a hint to investigate performance of the disks reporting these errors.
___________________________________________________

Rule : DM013
Issue Detected : SCSI rewrite-in-place errors
Severity : Low
Components : disk9[spa2.encl4](HWID: 1140) (from catalog) - suspected disk - suspected disk
Expert's Advice : No action is required to be performed. This information is valuable only when there is a noticeable performance degradation on the system. This issue may be a hint to investigate performance of the disks reporting these errors.
___________________________________________________

Rule : DM012
Issue Detected : Multiple SCSI Log page 0x15 events on disk head
Severity : Medium
Components : disk5[spa1.encl4](HWID: 1094) (from catalog) - suspected disk
Expert's Advice : Critical number of SCSI log page 0x15 events has occurred on reported disks. The reported disks are in service and should be failed.
___________________________________________________

Rule : DM015
Issue Detected : Multiple SCSI Log page 0x15 events on disk
Severity : Medium
Components : Disk reported 2048 defects during 2016-10-23..2016-10-29 period disk6[spa2.encl1](HWID: 1074) (from catalog) - Disk reported 2048 defects during 2016-10-29..2016-10-29 period
Expert's Advice : Critical number of SCSI log page 0x15 events has occurred on reported disks. The reported disks are in service and should be failed.


Disk monitoring cont.

Nzhealthcheck rules report disk issues before the disks are failed.

Rule : DM025
Issue Detected : Unrecoverable Read Errors(URE) reported by background scan in the last day
Severity : Medium
Components : disk1[spa2.encl2](HWID: 1090) (from catalog) - UREs count: 1 last URE error has been reported on: 2016-06-28 19:11:28
Expert's Advice : This rule is only intended for Netezza Support. URE on a disk might influence data slice regeneration after disk failure (data loss might occur in the worst case). Before performing disk failure, verify that there are no errors reported from the micro repair of paired disk in system manager logs. If such errors are reported, perform nzraidcheck.
___________________________________________________

Rule : DM040
Issue Detected : Both data slice disks have reported issues
Severity : High
Components :
dslice69 (from nps_nzds) - Disk [HWID: 1089] reported issues: DM011, DM012, DM013, DM015 and Disk [HWID: 1110] reported issues: DM011, DM012, DM013, DM015
dslice70 (from nps_nzds) - Disk [HWID: 1089] reported issues: DM011, DM012, DM013, DM015 and Disk [HWID: 1110] reported issues: DM011, DM012, DM013, DM015
Expert's Advice : For reported dataslices, disks containing both data partitions have issues identified through other rules. It significantly increases data loss probability and requires immediate review. For IBM Support it is advised to run nzraidcheck before failing any of the drives.


Disk monitoring

Another option is to run queries that show the error details of the disks, but it is preferable to use nzhealthcheck.

nzsql -c "select DiskID, SPUID, Grown_Defects, DiskModel, DiskSerial from _vt_scsi_defects where grown_defects > 0 order by grown_defects desc"

 DISKID | SPUID | GROWN_DEFECTS |  DISKMODEL  |      DISKSERIAL
--------+-------+---------------+-------------+----------------------
   1085 |  1008 |            23 | ST600MM0026 | S0M1NMNL0000B424G05M
   1130 |  1008 |             1 | ST600MM0026 | S0M1NMBZ0000B4259E0G

nzsql -c "select * from _t_SCSI_PFE;"

PFE_DISKHWID | PFE_DISKSERIAL | PFE_DISKMODEL | PFE_DISKMFG | PFE_REPORTINGHWID | PFE_REPORTINGHWSERIAL | PFE_REPORTINGHWTYPE | PFE_ERRSTRING | PFE_ASC | PFE_ASCQ | PFE_FRU | PFE_TIMESTAMP
--------------+----------------+---------------+-------------+-------------------+-----------------------+---------------------+---------------+---------+----------+---------+---------------
(0 rows)

nzsql -c "select * from _t_scsi_errors;"

SCSI_ERRID | SCSI_HWID | SCSI_HWTYPE | SCSI_SERNUM | SCSI_MODEL | SCSI_MCFG | SCSI_DISKSZ | SCSI_HWSTATE | SCSI_HWROLE | SCSI_SECTOR | SCSI_SENSEKEY | SCSI_ASC | SCSI_ASCQ | SCSI_TIMESTAMP | SCSI_ADDITIONAL_INFO | SCSI_CDB
------------+-----------+-------------+-------------+------------+-----------+-------------+--------------+-------------+-------------+---------------+----------+-----------+----------------+----------------------+----------
(0 rows)


Nzraidcheck

This is a tool that we use to check the consistency of data between the primary disk and its mirror.

It runs in the background, scans all tables in all databases and compares the primary and mirror copies.

- Any difference is logged into tables that are created by the nzraidcheck tool.
- We can use these tables to identify proactively which tables have mismatched data between primary and mirror.

In cases where we see many disk SCSI issues, or we suspect that there might be data corruption, we run the tool before failing and removing any disk from the system.

The tool runs for a few hours; depending on the system size and disk utilization, it can take up to 24 hours.

Once it completes, a report is available in the table V_NZRAIDCHECK_ERROR_DETAIL_LATEST.

The important columns are:

rd_failed_pri | rd_failed_mir | CAN REPAIR
--------------+---------------+-----------
T             | T             | F
F             | T             | T
T             | F             | T
F             | F             | T

T|T = Worst-case scenario: we will need to drop the affected data because we cannot read it from either disk (primary or mirror).
F|T = nzmicrodiskrepair can be run.
T|F = nzmicrodiskrepair can be run.
F|F = nzmicrodiskrepair can be run.
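The table above reduces to a single rule: a sector is repairable unless the read failed on both the primary and the mirror. A minimal Python sketch of that rule follows; the function name `can_repair` is hypothetical and the column values are modeled as booleans for illustration.

```python
def can_repair(rd_failed_pri: bool, rd_failed_mir: bool) -> bool:
    """Repair is impossible only when the read failed on both copies."""
    return not (rd_failed_pri and rd_failed_mir)

# Reproduce the four rows of the table, using T/F as on the slide.
for pri in (True, False):
    for mir in (True, False):
        row = ["T" if flag else "F" for flag in (pri, mir, can_repair(pri, mir))]
        print(" | ".join(row))
```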


Nzmicrodiskrepair

This is the tool that helps to fix data inconsistencies between the primary and mirror disks.

It needs a freshly updated V_NZRAIDCHECK_ERROR_DETAIL_LATEST table.

It is fast, taking just a few seconds, because it focuses only on the sectors that reported errors in the table.

It will try to repair the sectors that can be read from either disk of the pair.

These are the steps:

[nz@netezza ~]$ nzsystem pause
Are you sure you want to pause the system (y|n)? [n] y
[nz@netezza ~]$ nzsystem set -arg system.spuDiskMultipath=false
Are you sure you want to change the system configuration (y|n)? [n] y
[nz@netezza ~]$ nzsystem resume
[nz@netezza ~]$ /nz/kit/share/tools/storage/nzmicrodiskrepair -mode RepairOnly

[nz@netezza ~]$ nzsystem pause
Are you sure you want to pause the system (y|n)? [n] y
[nz@netezza ~]$ nzsystem set -arg system.spuDiskMultipath=true
Are you sure you want to change the system configuration (y|n)? [n] y
[nz@netezza ~]$ nzsystem resume

The best approach is to run a new nzraidcheck afterwards to verify that the errors are gone.
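The sequence above has a symmetric shape: multipath is switched off, the repair runs, and multipath is switched back on. The Python sketch below only builds that command list as a dry-run checklist and executes nothing. The command strings are copied from the steps above; the helper names `set_multipath` and `repair_plan` are hypothetical.

```python
# Command string copied from the steps above.
REPAIR_TOOL = "/nz/kit/share/tools/storage/nzmicrodiskrepair -mode RepairOnly"

def set_multipath(enabled):
    """Commands that change system.spuDiskMultipath; the system is paused around the change."""
    value = "true" if enabled else "false"
    return [
        "nzsystem pause",
        f"nzsystem set -arg system.spuDiskMultipath={value}",
        "nzsystem resume",
    ]

def repair_plan():
    """Dry-run checklist: disable multipath, repair, then re-enable multipath."""
    return set_multipath(False) + [REPAIR_TOOL] + set_multipath(True)

for step, command in enumerate(repair_plan(), start=1):
    print(step, command)
```

Keeping the plan as data makes it easy to confirm that multipath is re-enabled at the end before anyone types the commands on a live system.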


When we cannot repair

In some situations the primary and the mirror disk both have problems reading the same sector, so we cannot guarantee the data that is stored there.

If the errors cannot be repaired, it must be decided whether the affected tables can be restored from a backup.

If not, an alternative is to empty the affected pages with nzsqa to make the rest of the data usable. Of course, this deletes the affected data in the process.

nzsqa emptyPage -spuhwid <spuhwidN> -dev <num> -LBA <num>

#1: The owner SPU HWID of the affected dataslice
#2: The data partition number that represents the affected dataslice
#3: The location of the bad sector
#4: The object ID of the affected table

Parameters for the nzsqa command (Dev and LBA) should be taken from the nz_db_tables_rowcount output.

Example

2014-10-07 12:12:19.922253 PDT Warning: Disk/FPGA error has been detected on the disk - DISK_FPGA_ERROR [hwId=2209, hwType=disk, diskHwId=2209, spuHwId=4115, location=Logical Name:'spa18.diskEncl4.disk11' Physical Location:'9th rack, 8th disk enclosure, disk in Row 3/Column 3', errType=3, errCode=116, oper=0, retries=0, dataPartition=5, lba=35209219, tableId=949541, dataSliceId=825, block=35000809, fpgaEngineId=6, fpgaBoardSerial=1237F58250064, devSerial=Y012UF31K04X, diskModel=ST1000NM0001, diskMfg=IBM-ESXS, diskSerial=Z1N3A9VL000093205TW7, eventSource=system, errString=Disk/FPGA error encountered, reasonCode=1018]
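The event above carries several of the identifiers discussed on this slide (SPU HWID, data partition, LBA, table ID). The Python sketch below pulls the simple key=value pairs out of such an event. It is only a reading aid: the slides say that Dev and LBA for nzsqa should come from the nz_db_tables_rowcount output, so the sketch deliberately does not build an nzsqa command, and the function name `parse_event_fields` is hypothetical.

```python
import re

def parse_event_fields(line):
    """Return the simple key=value pairs found inside the [...] part of the event."""
    body = line[line.index("[") + 1 : line.rindex("]")]
    # Values end at a comma or a quote; quoted location text is not split further.
    return dict(re.findall(r"(\w+)=([^,'\]]+)", body))

# Event copied from the example above.
EVENT = (
    "2014-10-07 12:12:19.922253 PDT Warning: Disk/FPGA error has been detected on the disk - "
    "DISK_FPGA_ERROR [hwId=2209, hwType=disk, diskHwId=2209, spuHwId=4115, "
    "location=Logical Name:'spa18.diskEncl4.disk11' Physical Location:'9th rack, 8th disk "
    "enclosure, disk in Row 3/Column 3', errType=3, errCode=116, oper=0, retries=0, "
    "dataPartition=5, lba=35209219, tableId=949541, dataSliceId=825, block=35000809, "
    "fpgaEngineId=6, fpgaBoardSerial=1237F58250064, devSerial=Y012UF31K04X, "
    "diskModel=ST1000NM0001, diskMfg=IBM-ESXS, diskSerial=Z1N3A9VL000093205TW7, "
    "eventSource=system, errString=Disk/FPGA error encountered, reasonCode=1018]"
)

fields = parse_event_fields(EVENT)
for key in ("spuHwId", "dataPartition", "lba", "tableId", "dataSliceId"):
    print(key, fields[key])
```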


Rare disk issues: Disk failed and NPS restart due to SPU reboot

The disk is failed, but the SPU is not able to remove it, so the SPU has to be restarted. That leads to an NPS change of state.

2015-10-24 11:46:17.098516 PHT (25416) Warning: NZ-01529: disk [disk hwid=1179 sn="S0M4053Z0000K517BWKL" SPA=1 Parent=1137 Position=22 ParentEnclPosition=6] failover (cause = 'Raid Error')
2015-10-24 11:46:17.151492 PHT (25416) Warning: DataSlices 119 and 120 has only one copy of the data remaining
2015-10-24 11:46:17.151586 PHT (25416) Info: NZ-01526: the role of disk [disk hwid=1179 sn="S0M4053Z0000K517BWKL" SPA=1 Parent=1137 Position=22 ParentEnclPosition=6] changed from 'active' to 'failing'
2015-10-24 11:46:17.159342 PHT (25416) Info: started gathering service data task for hwid: 1179
2015-10-24 11:46:17.159364 PHT (25416) Info: finished gathering service data for hwid: 1179 - no device manager
2015-10-24 11:46:17.159375 PHT (25416) Info: started logs collecting task
2015-10-24 11:46:17.161582 PHT (25416) Info: Changing DeviceMapping for failOverDisk: failed Active/Assigned
2015-10-24 11:46:17.174237 PHT (25416) Info: New DeviceMapping version 9
2015-10-24 11:46:25.407015 PHT (25416) Info: Unsetting failover flag for [disk hwid=1179 sn="S0M4053Z0000K517BWKL" SPA=1 Parent=1137 Position=22 ParentEnclPosition=6]
2015-10-24 11:48:05.893818 PHT (25416) Info: Finished collecting logs with result: 1
2015-10-24 11:48:12.084900 PHT (25416) Info: Got devMappingAck for [spu hwid=1009 sn="Y012UF4AF0YV" SPA=1 Parent=1002 Position=5 spuName= spu0105] devMapVer 9
2015-10-24 11:48:12.085043 PHT (25416) Info: Error removing lunS0M4053Z0000K517BWKL from md.Unable to fail lun S0M4053Z0000K517BWKL
2015-10-24 11:48:12.085179 PHT (25416) Info: started logs collecting task
2015-10-24 11:48:12.106074 PHT (25416) Warning: NZ-01554: [spu hwid=1009 sn="Y012UF4AF0YV" SPA=1 Parent=1002 Position=5 spuName= spu0105] restarted because of failed regen or failed formatting of a device
2015-10-24 11:48:12.108467 PHT (25416) Info: Storing the spu log in /nz/kit.7.2.0.4-P1/log/spucores/spulog0105.20151024_114812.devmapReply.gz
2015-10-24 11:48:13.678237 PHT (25416) Info: NZ-01500: system state change from 'Online' to 'Pausing Now'


Rare disk issues: Disk regen failing due to problems in the target disk and SPU reboot

The disk is failed and removed from the catalog, but the SPU is not able to start the data regeneration on the new disk, and the SPU needs to be restarted.

2018-03-08 09:37:06.897849 SAS (22495) Info: Setting failover flag for [disk hwid=3060 sn="0BGKTUTD" SPA=2 Parent=3034 Position=6 ParentEnclPosition=4] Reason - Problem with Disk Raid
...
2018-03-08 09:42:49.613462 SAS (22495) Info: Regening lun [ Lun id=0BGKK6WD size=600127266304 lunGroupId=126 Assigned Spu=1044] to [ Lun id=0BGKTWJD size=600127266304 lunGroupId=0 Assigned Spu=1044]
...
2018-03-08 09:44:53.382231 SAS (22495) Info: Got devMappingAck for [spu hwid=1044 sn="Y014UF63R0CM" SPA=2 Parent=1010 Position=1 spuName= spu0201 DesignatedSpu] devMapVer 90
2018-03-08 09:44:53.382808 SAS (22495) Error: Error setting up regen's for dataslices 329,330 to disk [disk hwid=3064 sn="0BGKTWJD" SPA=2 Parent=3034 Position=10 ParentEnclPosition=4]. Unable to initiate regen on 0BGKTWJD
2018-03-08 09:44:53.382904 SAS (22495) Info: Regen target disk [disk hwid=3064 sn="0BGKTWJD" SPA=2 Parent=3034 Position=10 ParentEnclPosition=4] - Invisible
2018-03-08 09:44:53.399156 SAS (22495) Info: NZ-01526: the role of disk [disk hwid=3064 sn="0BGKTWJD" SPA=2 Parent=3034 Position=10 ParentEnclPosition=4] changed from 'assigning' to 'spare'
2018-03-08 09:44:53.402484 SAS (22495) Info: Changing DeviceMapping for undoRegenDependency: x->Spare
2018-03-08 09:44:53.408312 SAS (22495) Info: New DeviceMapping version 91
2018-03-08 09:44:53.428396 SAS (22495) Info: New Topology version 15
2018-03-08 09:45:00.598279 SAS (22495) Warning: dataslices 329,330 has only one copy of the data remaining
2018-03-08 09:45:00.598449 SAS (22495) Info: started logs collecting task
2018-03-08 09:45:00.603127 SAS (22495) Warning: NZ-01554: [spu hwid=1044 sn="Y014UF63R0CM" SPA=2 Parent=1010 Position=1 spuName= spu0201 DesignatedSpu] restarted because of failed regen or failed formatting of a device


Rare disk issues: Disk regen propagating disk issues

After an initial disk failure, the data regeneration completes, but soon afterwards the new disk is failed too.

This is a very rare case where a few problem sectors on the regeneration source disk are propagated to the newly assigned disk.

2018-02-10 21:17:10.763295 UTC (11331) Info: NZ-01526: the role of disk [disk hwid=1260 sn="KSGEWZER" SPA=1 Parent=1225 Position=15 ParentEnclPosition=5] changed from 'failing' to 'failed'
2018-02-10 21:17:10.775734 UTC (11331) Info: NZ-01526: the role of disk [disk hwid=1652 sn="S0M7WDQ70000K718E0K8" SPA=1 Parent=1225 Position=10 ParentEnclPosition=5] changed from 'spare' to 'assigning'

The issue is that the mirror disk of that data slice has some SCSI errors too.

2018-02-10 22:51:07.144846 UTC (11331) Info: SCSI error reported from [disk hwid=1129 sn="KNX7LJKR" SPA=1 Parent=1090 Position=19 ParentEnclPosition=2] : sector = 1171946655, error = medium error (0x3), asc = unrecovered read error (0x11, 0x0) response_code = 0x70
2018-02-10 22:51:10.209861 UTC (11331) Info: SCSI error reported from [disk hwid=1129 sn="KNX7LJKR" SPA=1 Parent=1090 Position=19 ParentEnclPosition=2] : sector = 1171946911, error = medium error (0x3), asc = unrecovered read error (0x11,

The regen completes, but the errors are propagated to the newly regenerated disk. Later on we see SCSI errors on the new disk and sysmgr fails it.

2018-02-12 16:56:47.067391 UTC (2814) Info: SCSI error reported from [disk hwid=1652 sn="S0M7WDQ70000K718E0K8" SPA=1 Parent=1225 Position=10 ParentEnclPosition=5] : sector = 409582271, error = medium error (0x3), asc = unrecovered read error (0x11, 0x0) response_code= 0x70
2018-02-12 16:56:47.075300 UTC (2814) Info: SCSI error reported from [disk hwid=1652 sn="S0M7WDQ70000K718E0K8" SPA=1 Parent=1225 Position=10 ParentEnclPosition=5] : sector = 409582399, error = medium error (0x3), asc = unrecovered read error (0x11, 0x0) response_code= 0x70

2018-02-13 14:20:00.132318 UTC (32347) Info: Marking lun [ Lun id=S0M7WDQ70000K718E0K8 size=600127266304 lunGroupId=87 Assigned Spu=1594] as failed
2018-02-13 14:20:00.132351 UTC (32347) Info: Setting failover flag for [disk hwid=1652 sn="S0M7WDQ70000K718E0K8" SPA=1 Parent=1225 Position=10 ParentEnclPosition=5] Reason -Problem with Disk Raid


Rare disk issues: Disk regen propagating disk issues cont.

To recover from that situation we need to zero the affected sectors on the disk. If we are lucky, we do not need to delete any table data.

Example

2018-02-19 12:31:34.292481 UTC (14109) Info: SCSI error reported from [disk hwid=1351 sn="KNX919BR" SPA=1 Parent=1315 Position=16 ParentEnclPosition=7] : sector = 409582308, error = medium error (0x3), asc = unrecovered read error (0x11, 0x0) response_code = 0x70
2018-02-19 12:31:34.302802 UTC (14109) Info: SCSI error reported from [disk hwid=1351 sn="KNX919BR" SPA=1 Parent=1315 Position=16 ParentEnclPosition=7] : sector = 409582308, error = medium error (0x3), asc = unrecovered read error (0x11, 0x0) response_code = 0x70
2018-02-19 12:31:34.309736 UTC (14109) Info: SCSI error reported from [disk hwid=1120 sn="KSGKXK3N" SPA=1 Parent=1090 Position=10 ParentEnclPosition=2] : sector = 1171946820, error = medium error (0x3), asc = unrecovered read error (0x11, 0x0) response_code = 0x70
2018-02-19 12:32:11.918170 UTC (14109) Info: Spupartition 'spa1.spu1.dpart29' is degraded and the faulty lun wasn't found

select * from _vt_disk_alloc where hwid=1588 and device=29 and START_SECTOR <= (409582308 - 63) and END_SECTOR >= (409582308 - 63)

select * from _vt_disk_alloc where hwid=1588 and device=29 and START_SECTOR <= (1171946820 - 762364575) and END_SECTOR >= (1171946820 - 762364575)

The offsets subtracted in the queries are the start sectors of the disk partitions:

/dev/sdfv1 63 409625369 204812653+ 83 Linux
/dev/sdfv2 409625370 762364574 176369602+ 83 Linux
/dev/sdfv3 762364575 1171989944 204812685 83 Linux

If the select returns no rows, no table is using those sectors, so IBM Support can zero them. If any table is affected, we need to drop it or clear the page as we saw before.
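The subtraction in the two queries converts a sector number reported for the whole disk into a sector number relative to the start of the partition that contains it. The Python sketch below reproduces that arithmetic with the partition boundaries listed above. It assumes, as the slide does, that both disks share this partition layout; the function name `to_partition_sector` is hypothetical.

```python
# Partition boundaries copied from the slide: (name, first sector, last sector).
PARTITIONS = [
    ("/dev/sdfv1", 63, 409625369),
    ("/dev/sdfv2", 409625370, 762364574),
    ("/dev/sdfv3", 762364575, 1171989944),
]

def to_partition_sector(disk_sector):
    """Return (partition name, sector offset within that partition)."""
    for name, first, last in PARTITIONS:
        if first <= disk_sector <= last:
            return name, disk_sector - first
    raise ValueError(f"sector {disk_sector} is outside every known partition")

# The two failing sectors from the SCSI errors in the example.
for sector in (409582308, 1171946820):
    print(sector, to_partition_sector(sector))
```

Both sectors resolve to the same offset, 409582245, one in the first partition and one in the third, which is consistent with the same location being unreadable on both copies of the data.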


References

Understanding hard drive media defects white paper - Servers
https://www.ibm.com/support/home/docdisplay?lndocid=mcgn-3n3k7b

S.M.A.R.T. Explained
https://en.wikipedia.org/wiki/S.M.A.R.T.

Nzraidcheck
• http://www-01.ibm.com/support/docview.wss?uid=swg21675248
• https://www.ibm.com/developerworks/community/blogs/9c8f1300-9ac0-4de5-80e6-0708f8e0260d/entry/50_PureData_Nuggets_5_Collecting_Nzraidcheck_report?lang=en

Nzmicrodiskrepair
http://www-01.ibm.com/support/docview.wss?uid=swg22002772

nzsqa emptyPage
http://www.ibm.com/support/docview.wss?uid=swg21973858


Questions?

Type your question in the Q&A panel on your screen.


PRODUCT LINKS:

IBM Fix Central: https://www.ibm.com/support/fixcentral/

PureData System for Analytics Support Page: http://ibm.biz/pda_support

IBM Knowledge Center (PureData): http://ibm.biz/pda_knowledgecenter

For more information

Web Site: IBM PureData System for Analytics
http://www.ibm.com/software/data/puredata/analytics/system/

Blogs/Articles: IBM Big Data & Analytics Hub
http://www.ibmbigdatahub.com

Community: Upcoming & On-Demand Webinars
http://ibm.biz/dwwebinars

Data Warehouse Community
http://ibm.biz/dwcommunity
Make sure to JOIN the community to get the latest updates and join in on the conversation! [select "Log In" in the top right-hand corner of the screen to register and JOIN]


© International Business Machines Corporation 2017
International Business Machines Corporation, New Orchard Road, Armonk, NY 10504
IBM, the IBM logo, PureSystems, PureFlex, PureApplication, PureData and ibm.com are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. A current list of IBM trademarks is available on the Web at www.ibm.com/legal/copytrade.shtml
All rights reserved.