Troubleshooting Cases - support.huaweicloud.com


Kunpeng BoostKit for SDS

Troubleshooting Cases

Issue 03

Date 2021-03-23

HUAWEI TECHNOLOGIES CO., LTD.


Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved.

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.


Contents

1 Failed to Restart the OSD Process in a Ceph Cluster
2 Uneven Distribution of PGs
3 COSBench Failed to Read Large Files
4 COSBench Test Stops Unexpectedly
5 High-Concurrency Test Failed
6 RGW Failed to Start
7 Only Some RGWs Are Displayed in the Cluster
8 Server Restarts During the Ceph Performance Test
9 fio Connection Failed
10 Failed to Perform fio Test
11 Failed to Load the fio Engine libaio
12 High CPU Usage of the osq_lock Function
13 Failed to Create an OSD
14 Ceph MON Exception
A Change History


1 Failed to Restart the OSD Process in a Ceph Cluster

Symptom

1. When the OSD nodes in a Ceph cluster are restarted after the read/write performance of the cluster is tested, the test tool reports the error message shown in the following figure:

2. Check the status of the Ceph cluster. Some OSD nodes are down, as shown in the following figure.


Procedure

Step 1 View the Ceph log. It is found that memory fails to be allocated to Ceph. An exception may occur when the OSD process attempts to obtain memory.

Step 2 Run the following command. The value of osd_memory_target is not the officially released default value (4 GB).

Step 3 Add osd_memory_target = 4294967296 to the ceph.conf file to limit the memory allocated to each OSD to 4 GB.

Step 4 Push the modified file to the other nodes.

    ceph-deploy --overwrite-conf admin ceph1 ceph2 ceph3 client1 client2 client3

Step 5 Restart the cluster.

    systemctl restart ceph.target

----End
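The value 4294967296 in Step 3 is 4 GB expressed in bytes (4 x 1024 x 1024 x 1024). A minimal sketch of that conversion follows; the helper names gib_to_bytes and osd_memory_target_line are hypothetical and are not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helper: convert a size in GiB to bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Hypothetical helper: print the line to add to ceph.conf for a given
# per-OSD memory limit in GiB.
osd_memory_target_line() {
    echo "osd_memory_target = $(gib_to_bytes "$1")"
}

osd_memory_target_line 4   # prints: osd_memory_target = 4294967296
```

After the restart, the effective value can usually be read back on the node that hosts an OSD with ceph daemon osd.<id> config get osd_memory_target, assuming the admin socket of that OSD is reachable.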


2 Uneven Distribution of PGs

Symptom

When the drives are heavily loaded during the I/O test, the load of some drives reaches 100%, while that of others is less than 80%. The overall drive load is unbalanced. Running the ceph pg dump command to query the PG allocation shows that the PGs are not evenly distributed in the Ceph cluster.

Procedure

The number of PGs on each OSD must be the same or close to each other. Otherwise, some OSDs may be overloaded and become bottlenecks. Use the balancer plugin to optimize PG distribution.

Step 1 Check the PG distribution.

    ceph balancer eval
    ceph pg dump

NOTE

Use either of the preceding commands.

Step 2 Enable automatic balancing for Ceph PGs.

    ceph balancer mode upmap
    ceph balancer on

Ceph adjusts PG distribution every 60 seconds.

Step 3 Repeat Step 1 occasionally. If the PG distribution no longer changes, the PG distribution is optimal.

----End
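To judge how close the per-OSD PG counts are, it can help to reduce them to a minimum, a maximum, and the difference between the two. The sketch below assumes the counts have already been extracted as "<osd_id> <pg_count>" pairs (for example from the PGS column of ceph osd df); the helper name pg_spread and the sample numbers are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "<osd_id> <pg_count>" pairs on stdin and print
# the smallest count, the largest count, and their difference.
pg_spread() {
    awk 'NF >= 2 {
             v = $2 + 0
             if (n == 0 || v < min) min = v
             if (n == 0 || v > max) max = v
             n++
         }
         END { if (n > 0) printf "min=%d max=%d spread=%d\n", min, max, max - min }'
}

# Sample data, not measured on a real cluster.
pg_spread <<'EOF'
0 92
1 105
2 88
3 101
EOF
```

A spread that shrinks between two runs of Step 1 suggests the balancer is still moving PGs; a spread that stays constant matches the stop condition in Step 3.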


3 COSBench Failed to Read Large Files

Symptom

During the 256 KB get test, COSBench fails to read files.

Procedure

Step 1 View the COSBench log (/path/to/cosbench/archive/workload/workload.log). The following error is found:

    Uploading large file fails with ResetException: Failed to reset the request input stream

The default size of the file to be read by COSBench is 128 KB.

Step 2 Add the following to the Java command line in the /path/to/cosbench/cosbench-start.sh script and set the parameter value as required:

    -Dcom.amazonaws.sdk.s3.defaultStreamBufferSize=<YOUR_MAX_PUT_SIZE>

----End
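As an illustration of Step 2, the sketch below builds the JVM option for a given maximum object size. It assumes that the property value is expressed in bytes, so a 256 KB object corresponds to 262144; the helper name stream_buffer_flag is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build the JVM option from a maximum object size in KB.
# Assumption: the property expects a value in bytes.
stream_buffer_flag() {
    printf '%s\n' "-Dcom.amazonaws.sdk.s3.defaultStreamBufferSize=$(( $1 * 1024 ))"
}

stream_buffer_flag 256   # prints: -Dcom.amazonaws.sdk.s3.defaultStreamBufferSize=262144
```

The printed option is what would be pasted into the Java command line of cosbench-start.sh in place of the <YOUR_MAX_PUT_SIZE> placeholder.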


4 COSBench Test Stops Unexpectedly

Symptom

The test terminates unexpectedly because the COSBench data integrity verification fails.

Procedure

Step 1 View the COSBench log (/path/to/cosbench/archive/workload/workload.log). The following error is found:

The test stops because COSBench fails to verify the MD5 data.

Step 2 Add the following to the Java command line in the /path/to/cosbench/cosbench-start.sh script to disable MD5 validation:

    -Dcom.amazonaws.services.s3.disableGetObjectMD5Validation=true

Step 3 Restart all COSBench processes.

----End


5 High-Concurrency Test Failed

Symptom

When the number of concurrent access requests of an RGW is greater than 512, the COSBench test stops unexpectedly.

Procedure

Step 1 View the COSBench log (/path/to/cosbench/archive/workload/workload.log). The following error is found:

    HTTP Request Time Out

Step 2 View the RGW log (/var/log/ceph/<rgw>.log). The following error is found:

Step 3 Query the default number of RGW threads.

    radosgw-admin --show-config | grep thread


The default number of RGW threads is 512. When the number of concurrent requests exceeds 512, the RGW cannot process client requests, resulting in the failure of all tests.

Step 4 Run the following command on any Ceph node to increase the number of RGW threads:

    sed -i 's/rgw_frontends.*/& num_threads=1024/g' ceph.conf

Step 5 Restart the RGW process.

    systemctl restart ceph-radosgw.target

----End
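The sed expression in Step 4 appends num_threads=1024 to every rgw_frontends line each time it runs, so running it twice leaves the option on the line twice. The sketch below is a guarded variant that skips lines already containing num_threads; the helper name add_num_threads, the section name, and the civetweb sample line are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: append num_threads=<n> to each rgw_frontends line read
# from stdin, unless that line already sets num_threads.
add_num_threads() {
    sed "/num_threads=/!s/rgw_frontends.*/& num_threads=$1/"
}

# Sample configuration fragment (made up), filtered through the helper.
printf '%s\n' '[client.rgw.ceph1]' 'rgw_frontends = civetweb port=7480' | add_num_threads 1024
```

The helper works as a filter rather than editing ceph.conf in place, so the result can be reviewed before it replaces the real file.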


6 RGW Failed to Start

Symptom

The RGW fails to start due to duplicate RGW ports.

Procedure

Step 1 Check the RGW process.

    ps -ef | grep rgw

1. Check the RGW process of the system. There is only one RGW process.

2. Check the system configuration. There are eight RGW services configured in the system.

3. Manually start ceph-rgw.rgw* and ceph-rgw.ceph-zip3. rgw.ceph-zip3 is not started.

4. Check the corresponding RGW ports. It is found that port 7480 is enabled, while port 7482 is not.


5. Check the configuration file. In the configuration file, port 7480 corresponds to the ceph-zip* RGW process, and port 7482 corresponds to the RGW2 process. The process startup failure is caused by a port conflict.

Step 2 Restart the RGW process.

1. Stop the RGW2 process.

    systemctl stop [email protected]

2. Start the ceph-zip3 and RGW2 processes. The problem is solved.

    systemctl start [email protected] start [email protected]

----End
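A port conflict of the kind described above can be spotted by listing the port values that occur on more than one rgw_frontends line. The following is a minimal sketch under the assumption that each RGW instance has one rgw_frontends line containing port=<number>; the helper name duplicate_rgw_ports and the section names in the sample are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print each port that appears on more than one
# rgw_frontends line of the configuration read from stdin.
duplicate_rgw_ports() {
    sed -n 's/.*rgw_frontends.*port=\([0-9][0-9]*\).*/\1/p' | sort | uniq -d
}

# Sample configuration (made up): two instances share port 7480.
duplicate_rgw_ports <<'EOF'
[client.rgw.a]
rgw_frontends = civetweb port=7480
[client.rgw.b]
rgw_frontends = civetweb port=7480
[client.rgw.c]
rgw_frontends = civetweb port=7482
EOF
```

Feeding the real ceph.conf to the helper on stdin should print nothing when every RGW instance has its own port.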


7 Only Some RGWs Are Displayed in the Cluster

Symptom

Only some RGWs are displayed in the cluster due to duplicate RGW names.


Procedure

Step 1 Modify the zone configuration.

1. Check the zone.

    radosgw-admin zone list

2. Check the placement information.

    radosgw-admin zone placement list


3. Modify the compression option of the zone.

    radosgw-admin zone placement modify --rgw-zone=default --compression=zlib --placement-id=default-placement

Step 2 Restart all RGW services in the cluster for the configuration to take effect.

    for i in {1..7};do service [email protected]$i restart;done

----End


8 Server Restarts During the Ceph Performance Test

Symptom

When the TaiShan 200 server is configured with the onboard NIC and 1822 NIC, the server restarts during the Ceph performance test.

Procedure

Step 1 It is found that the onboard NIC driver is faulty and the kernel parameters need to be configured.

Step 2 Upgrade the onboard NIC driver and add the irqpoll parameter to the kernel boot item.

----End


9 fio Connection Failed

Symptom

During the fio test, the remote client cannot be connected, as shown in the following:

    fio: connect: Connection refused
    fio: failed to connect to 192.168.3.132:8765

Procedure

Step 1 Check the fio service on the remote client. It is found that the fio service is not started.

Step 2 Start the fio service on the remote client.

    fio --server

----End


10 Failed to Perform fio Test

Symptom

During the fio test, the management node displays a command error, and the remote client displays a message indicating that the server or client version does not match, as shown in the following:

    fio: bad server cmd version 78
    fio: server bad crc on command (got 0, wanted 4b0a)
    fio: bad server cmd version 78
    fio: server bad crc on command (got 0, wanted 27cd)
    fio: client/server version mismatch (66 != 78)

Procedure

The fio version on the management node is different from that on the client. Install fio of the same version on the management node and all fio clients.
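Before reinstalling, it may be useful to confirm which hosts disagree. The sketch below assumes the version strings have been collected by running fio --version on each host and written as "<host> <version>" pairs; the helper name same_fio_version, the host names, and the version strings are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: read "<host> <fio_version>" pairs on stdin and return
# success only if all hosts report the same version.
same_fio_version() {
    awk 'NF >= 2 && !seen[$2]++ { n++ } END { exit (n > 1) }'
}

# Sample data (made up): one client runs a different fio version.
if same_fio_version <<'EOF'
ceph1 fio-3.7
client1 fio-3.7
client2 fio-3.19
EOF
then
    echo "versions match"
else
    echo "version mismatch"
fi
```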


11 Failed to Load the fio Engine libaio

Symptom

During the fio test, a message is displayed indicating that the libaio engine fails to be loaded.

Possible Causes

The fio version installed on the client does not support libaio.

Procedure

Install libaio-devel, and then recompile and install fio. The procedure is as follows:

    yum -y install libaio-devel
    cd /path/to/fio/
    ./configure
    make
    make install
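After the rebuild, one way to confirm that the engine is available is to look for it in the engine list that fio prints with fio --enghelp. The sketch below is a small filter for that list; the helper name has_io_engine is hypothetical, and the sample input only imitates the shape of the real output.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the engine named in $1 appears as the first
# word of any line read from stdin.
has_io_engine() {
    awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Sample input (made up) standing in for the output of "fio --enghelp".
if printf 'Available IO engines:\n\tcpuio\n\tlibaio\n\tsync\n' | has_io_engine libaio
then
    echo "libaio available"
else
    echo "libaio missing"
fi
```

On a real client the call would be fio --enghelp | has_io_engine libaio.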


12 High CPU Usage of the osq_lock Function

Symptom

When the libaio engine is used to perform the fio test, the perf top command output indicates that the CPU usage of the osq_lock function in the kernel space exceeds 40%.

Procedure

Replace libaio with the RBD engine of fio.
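Switching engines mainly means changing the job file. The sketch below prints a minimal job file that uses ioengine=rbd; the helper name rbd_job_file, the pool and image names, and the workload settings (rw, bs, iodepth) are assumptions chosen for illustration, and the fio build is assumed to include RBD support.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal fio job file for the rbd engine.
# $1 = pool name, $2 = RBD image name (both assumed to exist already).
rbd_job_file() {
    cat <<EOF
[global]
ioengine=rbd
clientname=admin
pool=$1
rbdname=$2
rw=randwrite
bs=4k

[rbd-randwrite]
iodepth=32
EOF
}

# Example with made-up pool and image names.
rbd_job_file rbd fio_test
```

The output can be redirected to a file and passed to fio as the job file.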


13 Failed to Create an OSD

Symptom

The OSD fails to be created. The error is as follows:

    [ceph4][ERROR ] RuntimeError: command returned non-zero exit status: 1
    [ceph_deploy.osd][ERROR ] Failed to execute command: /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore --data /dev/nvme0n1p1
    [ceph_deploy][ERROR ] GenericError: Failed to create 1 OSD

Possible Causes

The LVM logical volume on which the OSD depends fails to be created. The Ceph logical volume is not found when the logical volume information is checked using the lvs command. However, the Ceph logical volume can be found by running the lsblk command. This indicates that the DM mapping of the Ceph logical volume has not been cleared.

Procedure

Clear the DM mapping of the logical volume.

    dmsetup info -C
    dmsetup remove [dm_map_name]
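When many mappings exist, it can help to narrow the list to the ones left behind by Ceph before removing anything. The sketch below filters a list of mapping names, such as the output of dmsetup ls, for names that start with ceph--; that prefix is an assumption about how ceph-volume names its volume groups, and the helper name ceph_dm_names and the sample names are made up.

```shell
#!/bin/sh
# Hypothetical helper: print the first column of every input line whose
# mapping name starts with "ceph--" (assumed ceph-volume naming).
ceph_dm_names() {
    awk '$1 ~ /^ceph--/ { print $1 }'
}

# Sample input (made up) standing in for the output of "dmsetup ls".
printf '%s\n' \
    'ceph--1a2b-osd--block--3c4d (253:2)' \
    'centos-root (253:0)' \
    'centos-swap (253:1)' | ceph_dm_names
```

Each printed name should be reviewed before it is passed to dmsetup remove, since removing a mapping that is still in use by a running OSD would disrupt that OSD.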


14 Ceph MON Exception

Symptom

When the ceph -s command is run, slow ops are reported for the Ceph MON process. The error information is as follows:

    HEALTH_WARN 376 slow ops, oldest one blocked for 894 sec, daemons [mon,ceph4,mon,ceph5,mon,ceph6] have slow ops.
    SLOW_OPS 376 slow ops, oldest one blocked for 894 sec, daemons [mon,ceph4,mon,ceph5,mon,ceph6] have slow ops.

Possible Causes

After the Ceph cluster is redeployed, the configuration file of the original Ceph cluster overwrites that of the current cluster. As a result, the NUMA affinity configuration does not match the actual situation.

Procedure

Reconfigure NUMA affinity and modify the NUMA affinity configuration in the ceph.conf file as required.

    [osd.N]
    osd_numa_node = 1
    public_network_interface = bond1
    cluster_network_interface = bond1
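To review what a redeployed ceph.conf currently pins, the NUMA node configured for each section can be listed side by side. The sketch below reads a configuration on stdin and prints each section header next to its osd_numa_node value; the helper name osd_numa_nodes and the sample sections are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<section> <osd_numa_node value>" for every
# osd_numa_node setting in the configuration read from stdin.
osd_numa_nodes() {
    awk '/^\[/ { section = $0 }
         /^[ \t]*osd_numa_node[ \t]*=/ {
             sub(/^[^=]*=[ \t]*/, "")
             print section, $0
         }'
}

# Sample configuration (made up).
osd_numa_nodes <<'EOF'
[osd.0]
osd_numa_node = 0
public_network_interface = bond1
[osd.1]
osd_numa_node = 1
public_network_interface = bond1
EOF
```

The listed values can then be compared with the NUMA node that each physical NIC actually belongs to, which on many Linux systems is exposed under /sys/class/net/<interface>/device/numa_node.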


A Change History

Date          Description

2021-03-23    This issue is the third official release.
              Changed the solution name from "Kunpeng SDS solution" to "Kunpeng BoostKit for SDS".

2021-01-25    This issue is the second official release.
              ● Deleted "Write Performance Deteriorates in the Second Round of the File Storage Test."
              ● Modified the procedure in 7 Only Some RGWs Are Displayed in the Cluster.

2020-06-10    This issue is the first official release.
