EMC World – May 2010
Navisphere Analyzer
Purpose of the script: EMC Navisphere Analyzer allows you to view storage system performance statistics in various types of charts. These charts can help you find and anticipate bottlenecks in the disk storage component of a computer system. Today's session takes a look at the Navisphere Analyzer user interface. In particular, it covers the different views available and provides some basic starting points for checking whether your existing configuration is being stressed or is working in a well-utilized manner. This script is designed for use with Navisphere Manager 6.x software.
Please do not alter any of the workstation or CLARiiON Storage System configuration details unless instructed
to do so by these instructions or by a member of the EMC presentation team.
In this session there are primarily two exercises, covering archive retrieval and viewing, and on-array real-time analysis. In addition to the instructions for these exercises, you will find more exercises and reference material in this handout.
If you have time to do so, please explore those additional sections during this session.
Before you begin: Fill in the following information:
a. Assigned Array from the desktop icon array.txt on your laptop: Array _________ SP ___
b. Storage System IP address to use: ___.___.___.___
c. Proceed to exercise A
Exercise A, NAR file viewing offline
This exercise directs you through checking the status of data logging on your array, retrieving an archive file containing performance data, and then looking at that data using Navisphere off-array software. This is primarily a walk-through exercise to gain familiarity with the steps involved in performance analysis. The second exercise will cover more details about the metrics you are looking at.
The NAR file is from a test environment where a total of 12 tests were run. You will clearly be able to segment the statistics into 12 areas. The odd-numbered tests used a single thread to each LUN and the even-numbered tests used 4 threads per LUN. The essence of this exercise is to gain familiarity with the interface, as well as to identify that increasing load on the array has various effects.
Start Internet Explorer. <Enter the IP address of your assigned managed node (SP IP address) into the address window of the browser>
The process of getting started with Navisphere Manager is simple. You begin by pointing your browser at the storage system's IP address.
Login <Enter the user as emcw and password emcw>
You will be presented with the standard Navisphere view of the
Domain you logged into.
<Select Tools -> Analyzer -> Data Logging>
Check that the logger is running and that periodic archiving is set as you want it. If checked, it saves the archives every 5 hours with the default 120-second sampling, or every 2.5 hours if using 60-second sampling.
If you do not have Analyzer installed on the array, you can still invoke logging, but this will be limited to 7 days of periodic archiving and will create encrypted archives for service use.
<Select Cancel>
If you are running pre-release 24 array code, the logging feature operates differently, so you will have to manually enable statistics logging at the SP level.
With release 24 and above, the logger will automatically enable or disable statistics logging as required.
Statistics logging is the process whereby the array collects statistics for each object within the storage system. The logger is required to facilitate collection of those statistics; however, if the logger isn't running, you can view a subset of statistics using the SP Properties view within Navisphere, or collect some raw statistics using secure CLI commands.
<Select Tools -> Analyzer -> Archive -> Retrieve>
The dialogue will present the current repository contents for the selected
SP.
<Scroll down to find the file called
emcw_2010_xxx-xx.nar
Select that file and click on Retrieve>
Note the location where the file will be
stored.
You will see the status of the operation
in the lower pane as the file is
uploaded to your workstation.
<Close this dialogue with Done, then
close the current browser instance>
The newest file listed could be up to 5.5 hours old, so you may need to use Create New to force the logger to create a new archive containing recent statistical data from its buffer.
You have the option of retrieving archives from SP-A or SP-B. Although they should contain almost identical data, it is worth retrieving from both SPs in case there is any problem with viewing one of the files, or in case one SP was rebooted during an archive and is missing samples for that time.
We are now going to use off-array Navisphere to view the archive we retrieved in the previous operation. We could use the array to view the archive; however, usual practice is to view archives independently of having an array resource available.
Recommended software components for off-array operations to enable viewing Analyzer data:
Ensure the Navisphere Management Server service is running (Start, Settings, Control Panel, Administrative Tools, Services). The service is called NaviGovernor.
Note: this is only required for off-array management and offline Analyzer archive file viewing.
<Start the off-array UI - double click the desktop icon labeled OffArrayUI>
If not available, you can explicitly run the off-array management UI by selecting START, RUN and pointing to the following link: "C:\Program Files\EMC\ManagementUI\6.29.x.x\WebContent\start.html"
<Enter Management Server IP
address 127.0.0.1 and use default
port of 80/443>
127.0.0.1 is the localhost address.
Alternatively you can use the IP
address assigned by the DHCP
server.
Login <Enter the user as emcw and password emcw>
You will be presented with the standard Navisphere view of the Domain you logged into.
<Select Tools -> Analyzer -> Customize>
In the General TAB view, check that the Advanced box is ticked.
In the Archives TAB view, check the default path for archives and that Performance Survey is set for the initial view, and check that the Initially Check All Tree Objects box is ticked.
When you have many objects you may wish to be more granular in your selections and choose not to initially check all objects.
In the Survey Charts TAB view, check that Utilization, Response Time and Average Queue Length are selected and that the values shown for each are present.
<Click OK> to use these
settings.
Note: When viewing the analyzer
standard performance detail view,
there are 4 windows in the view. These are the object (top left), value
(bottom left), plot (top right) and plot
item list (bottom right). To display
the values available for a given
object, you must select that object. Selecting a plot item will highlight the
plot associated with that selection. This will be useful when selecting
many items to view.
We’ll see this view later in the exercise.
Customize is only required once for the off-array environment.
Customize is also available for the array environment, so when you set an array option, it will remain set for anyone logging into the array for viewing real-time data, covered in the next exercise.
Normally we might suggest a threshold of 10 samples, but for this exercise we'll use 4 for the off-array archive file we're looking at.
<Select Tools -> Analyzer -> Archive -> Open>
Open file emcw2010_1_xx-xx.nar. Use the default time points. Select the file in location "C:\Documents and Settings\emcperf".
Leave the default start and end time for this exercise; however, when doing this on a NAR file from your own storage system, you may want to narrow the time display to make it easier to view specific activity.
<Click OK>
You should now have the
Performance Survey View if you setup
the default open view as shown in prior steps.
<Scroll in this view to see if anything is highlighted by red boxes, indicating
a threshold set in the configuration
has been exceeded>.
A RED or YELLOW Utilization box may be an indication for concern, especially if the Response Time and/or Queue Length are also RED.
Make a note of any suspected LUNs in the space below:
____________________________
Make some notes on what you see in the Survey Chart.
You can merge NAR files to cover a longer period of time; however, when opening a large NAR file, it can speed up the interface to focus on a shorter time period. The merge option is referenced later in this paper.
It is not recommended to merge SP-A and SP-B NAR files together, as they should contain very similar data anyway.
<With the pointer in the Utilization display for LUN 50, either double click, or right click and select Performance Detail. You will now see a graph showing LUN 50 Utilization>
From this display, we want to check some things out:
SP Utilization – the two SPs should be about the same if the load is well balanced.
<You need to expand the LUN
using the + to see the SP check
box – check this to add SP detail>
Uncheck the LUN to increase the scaling of the SP Utilization if necessary. Is the load between the two SP’s balanced?
<In the Performance Detail View, right click on the SP and select
Properties. Check that under the
Cache Tab, Read and Write cache are both enabled, and check the cache
page size>
Now we want to check the cache dirty
pages – these are pages of data held in memory during writing. If these are
very high, it could be an indication of
a problem we need to work on.
<Make sure you have the SP selected with the pointer, i.e. SP A is shown highlighted. Now scroll down in the lower left window to the Dirty Pages (%) property – check the box>
Write cache works on a policy of watermarks. Here we see the dirty pages around 60% to 80%, which indicates the watermark processing is working well. You can see how the watermarks are set by selecting the SP TAB, then right clicking the array and selecting the Performance Overview view.
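As a minimal illustrative sketch (not the actual CLARiiON flushing algorithm), watermark processing behaves like a simple control loop: de-staging ramps up when dirty pages cross the high watermark and eases off once they fall back below the low watermark. The 60/80 values below match the range observed above but are otherwise assumed:

    # Illustrative watermark flushing loop - not the actual CLARiiON
    # algorithm. Watermark values are assumed for this sketch.
    HIGH_WATERMARK = 80   # % dirty pages: start de-staging aggressively
    LOW_WATERMARK = 60    # % dirty pages: enough free pages, ease off

    def should_flush(dirty_pct, flushing):
        """Decide whether write cache should be de-staging to disk."""
        if dirty_pct >= HIGH_WATERMARK:
            return True
        if dirty_pct <= LOW_WATERMARK:
            return False
        return flushing   # between watermarks: keep doing what we were doing

    flushing = False
    for pct in (55, 64, 77, 82, 74, 63, 58, 66):   # dirty pages oscillating
        flushing = should_flush(pct, flushing)
        print(f"dirty={pct}%  flushing={flushing}")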
<Now uncheck both SP’s and the Dirty Pages box>
Check the total memory allocated to read and write cache and that both are enabled. A reference for allocation of cache memory can be found in the CLARiiON Best Practices Guide, although it is typically recommended to reserve up to 20% of total available cache for read cache, and the rest for write cache.
If the Dirty Pages (%) were consistently high, this may be an issue we'd need to look at closer. Maybe the watermark settings would need changing? Maybe the cache allocation would need changing? You need to consider the write load on the array and its duration, combined with the distribution at the disk level on the back end; after all, the disks govern the speed at which we can de-stage data from dirty pages. Adding more spindles to a particular application can help de-stage data quicker for write-intensive applications.
Now let's look at the LUN details.
<Click once on LUN 50, then in the lower left window, un-check Utilization, then check both the Read Bandwidth and Write Bandwidth boxes>
Simultaneous reading and writing on a single LUN can be challenging. Let's look at more detail.
<Now select the property Forced Flushes/s>
If none are seen, that is a good reason to re-check that cache is on, although no forced flushes also indicates the write cache is not being worked too hard.
Although we don't have any forced flushes here, write throughput is the reported write cache hits combined with any forced flushes; i.e. a write that causes page(s) of write cache to be flushed to make room for it does not count in the write cache hit total (unless the write size satisfies the write-aside value and bypasses cache – more performance architecture understanding is required if that wasn't understood).
Let's just check the LUN properties.
<Right click on LUN 50 and select Properties>
You want to know the RAID type, number of disks and user capacity. Also check that the stripe Element Size is as expected – 128 blocks (64KB) is normal for striped RAID as per the CLARiiON Best Practices Guide.
Under the Cache TAB you may see the read and write cache enabled boxes empty – this can indicate cache wasn't enabled for the LUN, but in this instance it was. Always check the current code release notes for known issues with the interface.
You should also check which LUNs share the same disks to see if multiple hot LUNs are due to disk contention on the backend.
<In the Performance Detail view,
click on the Storage Pool TAB at the top – then you can expand each RAID
Group to see what LUNs share the
same disks>
As you will see, some of the suspected RED LUNs shown in the Performance Summary view are sharing disks in the second half of the test.
<Select the LUN TAB at the top of the window>
<Now deselect Forced Flushes/s, Read Bandwidth and Write Bandwidth>
<Select Read Throughput I/O/Sec and Read Cache Hits/s>
No read cache hits would indicate either random access or reads too big for pre-fetch.
Check the IO size to the LUN.
<Right click on LUN 50 and select IO
Size Distribution Summary>
<In the view, you can quickly select all values by right clicking in the left pane and clicking on Select All – Values>
You can see here all IO’s are small (4KB), both read and write – this
indicates a totally random profile as no read cache hits were seen.
<Close the IO Distribution Summary
window> and <de-select Read
Throughput I/O/Sec and Read Cache
Hits/s>
<With LUN 50 selected, select the
Full Stripe Writes/s check box>
<None of the LUNs are doing FSW’s
in this archive>
Another useful view would be the IO Size distribution detail.
<Right click LUN 50 and select IO
Size Distribution Detail>
<Select Read and Write IO size of
4KB only as we confirmed that as the only IO size used by checking the IO
Size Distribution Summary view
previously>
This view is useful to see the read/write ratio over time.
As we saw from the IO Distribution Summary, the writes were small and no Full Stripe Writes were taking place; this suggests the writes are also random, as no full stripe coalescing in cache is taking place.
We could get some write IO coalescing taking place, resulting in larger than 4KB writes at the disk layer – we would need to check the write IO size at the disk to validate that.
<Close the IO Distribution Detail window. Also, with LUN 50 selected, uncheck the Full Stripe Writes/s box. Also, uncheck the LUN 50 box as well>
Now expand LUN 50; we can see 5 disk drives.
<Select the last disk check box, then
in the parameter window, select
Utilization and Average Seek
Distance>
What do we see here? Disk seeks are a few GB, indicating a moderate level of randomness. Uncheck Utilization to get a better view of the seek distance (or zoom in).
<Uncheck Utilization and Average Seek Distance>
<Select LUN 50 and then check
Write Size>
As you can see here, over time the disk IO size tracks the LUN IO size – again indicating a random workload with little or no coalescing of data. If disk IOs were bigger, that would be a good indication of coalescing taking place – always check you have write cache enabled.
Don’t forget to check you have write cache enabled (LUN and Array).
<Uncheck Write Size and also
uncheck LUN 50>
<Check both of the first 2 disks in LUN 50 and then check the Total Throughput for these disks>
The disks at varying times are working very hard. We can see them reaching over 350 IOPs per disk. Now, for small random IO we have a rule of thumb (ROT) stating a 15K rpm disk can be used for 180 IOPs of mixed random load with good response time. When running disks at higher loads we can expect an impact in the response time observed.
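As a back-of-the-envelope check based on that rule of thumb (a sketch only; the 180 IOPs figure is the ROT quoted above, and this ignores RAID write penalties and cache effects):

    # Spindle-count sanity check using the handout's rule of thumb of
    # ~180 IOPs per 15K rpm disk for mixed small random IO. Ignores
    # RAID write penalties and cache effects.
    ROT_IOPS_15K = 180

    def disks_needed(host_iops, per_disk=ROT_IOPS_15K):
        """Minimum spindles to keep per-disk load at or below the ROT."""
        return -(-host_iops // per_disk)   # ceiling division

    print(350 / ROT_IOPS_15K)   # ~1.9x the ROT, as seen here: expect
                                # longer response times at this load
    print(disks_needed(1750))   # 10 disks to serve 1750 IOPs comfortably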
High disk utilization means we are working the disks well. Low utilization would indicate additional load could be placed on the drives with consistent service and response times.
Some 32KB writes may be seen; these could be protection bits being set rather than coalesced user data.
<Uncheck Total Throughput and also uncheck the second disk drive>
<With the first disk drive highlighted, check the Queue, Average Busy Queue Length and Service Time boxes>
The Average Busy Queue Length compared to the regular Queue can give an indication of burstiness; however, at the disk level, activity includes de-staging writes from cache, which can arrive in bursts.
Always check the release notes for
known issues relating to accuracy of statistics.
The disk service time is how long the disk takes to service each request. If IO gets queued at the disk, the disk response time is then the service time multiplied by the queue depth. Therefore, the higher the queue, the longer the response time.
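A simple worked example of that relationship (illustrative numbers only):

    # Illustrative only: response time grows with queue depth because
    # each IO waits behind those queued ahead of it.
    service_time_ms = 5.0   # time for the disk to service one IO

    for queue_depth in (1, 2, 4, 8):
        print(f"queue={queue_depth}  response~{service_time_ms * queue_depth} ms")
    # queue=8 -> ~40 ms: the higher the queue, the longer the response time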
A point to remember, though, is that writes will typically be serviced by cache and have a very fast response time. Reads will have a more directly impacted response time as disk queues increase. Of course, this all depends on IO size and also on how writes are being de-staged from cache and how efficiently cache is working to optimize that process.
<Uncheck Service Time>
<Uncheck Queue>
<Check Response Time>
You can see the Response Time follows the average busy queue depth
as the service time was pretty stable.
<Uncheck all objects>
Now select the SP TAB, click on SP-A
and select Total Throughput.
You can see that as the load increases,
so does the overall capability of the array i.e. more threads per test, higher IOPs. More LUNs tested, more IOPs.
The second set of tests where you see
the IOPs starting around 270 increasing steadily to 1286 are where
both SP’s are being used, so the aggregate IOPs would be higher still.
Those peaks seen at the start of tests are normally writes being absorbed into the protected write cache until watermark processing
starts to write data to disks on the back end.
<Uncheck SP A and Total Throughput>
Select the LUN Tab.
Now check LUNs 60 and 61. These
are both using the same disks and you
can see the impact when both LUNs are under load as the LUN response
time increases.
Select the disks for LUN 60.
This contention results in fewer aggregate IOs across both SPs, due to an increased queue as well as a small increase in seek distance at the drive level.
The other small peaks seen here are associated with disk statistics and the SNiiFFER process where there is no host load accessing the disks. This is a process that validates data availability in the background (performing 512KB read operations at 1 IO/s). Take a look at the disk read size and you will see.
Summary of Exercise A
We got you to look at LUN 50 as it was used in all 12 tests. In the first test area you see we get moderate throughput at the disks, but when we increase the threads accessing the same disks, we get much better work from them. The detrimental effect of driving a higher load is an increase in response time to the application due to the increased queuing at the disks (go back and take a look if you have time).
Now, as we add more load to more disks, the same effect can be seen between the single thread tests and the 4 thread tests, i.e. per-disk IOPs is higher if we have more processes accessing the LUNs. In all tests where we have a single thread per LUN, the per-disk IOPs is low compared to the 4 thread tests. This highlights some key performance notes:
Concurrency, when using small IO sizes, is essential for good
performance i.e. multiple
threads/processes.
Also, as we observed in the SP
statistics, overall array performance scales with how many LUNs are being
accessed concurrently, so it was clear
to see as more LUNs were busy, the
overall throughput increased.
Do not expect to get maximum performance from an array unless you have the necessary disk count to service the load. Please reference the CLARiiON Performance and Availability: Applied Best Practices guide for scaling capability guidance for each array type.
When finished with Exercise-A, close down the off-array browser and proceed to Exercise-B.
Here we see the 6 distinct areas of the first 6 tests performed on one Storage Processor. Each lower level shows the single thread performance for 1, 2 & 3 LUNs. The higher peaks represent throughput when 4 concurrent threads per LUN are generating IO.
Exercise B – Analyzer Statistics Viewing Real-Time
This exercise is to direct you around some of the views in Analyzer while the array is under a simulated load from a Windows server. You'll be directed to look at some of the key statistics that indicate whether a system is functioning within acceptable parameters. This exercise is to extend your experience and expand upon some descriptions of the statistics you are looking at – select and deselect components and statistics to overlay graphs, but consider the scale of your selections: with high IO/s on the same graph as disk queue length, queue variation will not be easy to distinguish. If you have time, you'll be directed to look at some specific statistics in order to determine where there is a problem with the current load on the array.
Start Internet Explorer. <Enter the IP address of your assigned managed node (SP IP address) into the address window of the browser>
Repeat as in the first steps of Exercise-A.
Login <Enter the user as emcw and
password emcw>
Also refer back to Exercise-A for the customize options required to be set on the array.
You will be presented with the standard Navisphere view of the
Domain you logged into.
You have already looked at the logging mechanism so now we want to
start viewing real-time statistics.
If only the Local Domain is shown, expand the view by clicking on the +
by the Domain icon.
<Right Click on your array and move
the pointer over the Analyzer
selection to see the expanded list of
options>
<Select Performance Survey>
Performance statistics can be viewed for individual components or you
can select the storage system and then view a selection of components.
To get this window, you can select the array, SP, raid group, Thin Pool, storage group, LUN, Thin LUN or disk to choose which analyzer
view to look at. Here we’ll select the array to present all objects
available.
The Performance survey view will start to plot current statistics based
on a 60 second sample period – please wait until you have at least two
plots to continue i.e. wait at least 2 minutes for the plots to show.
The objective here is to watch the real-time view develop and start to look at some of the performance statistics that are being logged.
You have to wait for 2 samples to get data plotted. Each plotted point is an average of the statistics between samples, except for write cache dirty pages, which is an absolute value at the sample point.
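A minimal sketch of why two samples are needed before anything is plotted, assuming rate metrics are derived from the difference between successive cumulative counters (dirty pages is read directly):

    # Sketch: rate metrics need two samples (a delta over the interval);
    # dirty pages is an absolute reading at each sample point.
    samples = [   # (time_s, cumulative_ios, dirty_pages_pct)
        (0, 0, 55),
        (60, 9000, 62),
        (120, 21000, 58),
    ]

    for (t0, c0, _), (t1, c1, dirty) in zip(samples, samples[1:]):
        iops = (c1 - c0) / (t1 - t0)   # average IO/s over the interval
        print(f"t={t1}s  avg IOPS={iops:.0f}  dirty pages={dirty}% (absolute)")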
We’ll have a look at how to view some
of the key statistics used in analyzing
an array performance.
Exercise-A gave you familiarity with the interface and through this
exercise we’ll expand upon what some statistics mean.
Utilization – LUNs, SPs and Disks
In the Performance Survey view you
can double click on a graph to open
the Performance Detail view for that statistic. Then you can select more
components to view, that will be
placed on the same graph.
Try it – pick one of the utilization
graphs in the Performance Survey View and double click on it.
SP Properties
Right click on the SP and select properties. Ensure Cache is allocated
and enabled.
Total Size indicates possible maximum – look at the Read Cache
Size and Write Cache Size for
allocated cache.
If you set up the survey chart thresholds as instructed earlier, you will start to see green, yellow or red boxes appear. These give you an indication of where to start looking for possible performance issues.
Expand the LUN component to reveal the disks and the storage processor that currently owns this LUN.
The SP properties view is limited within a NAR file compared to the same view when connected to an array in real-time, as displayed here.
Dirty Pages
Dirty pages are protected write cache data that hasn't been committed to disk yet.
To see the appropriate value selections available in the lower left part of the detail view, you must select a component item in the upper left part of the view. Dirty Pages will only be an available option when you have clicked on a Storage Processor (SP). Dirty pages that peak at 99% indicate cache saturation, resulting in force flushing that can hurt performance. We'll look at LUN forced flushes later in this exercise.
LUN Bandwidth
Selecting both read and write bandwidth tells us about the load on the LUN; however, you will need to check IO sizes and data locality to determine if the values seen are expected based on the load.
We can check locality by looking at seek distances at the drive level later in this exercise.
LUN Forced Flushes
Forced flushes are an indication of write cache saturation – if you have many forced flushes taking place, this will impact the system, seen as increased SP utilization as well as increased response times.
Although dirty pages may not have shown 100%, that statistic is an absolute value at the sample time. If you are seeing forced flushes taking place, that indicates the cache pages were 100% dirty at some point.
To help when viewing a graph plot, you can click on the legend item in the lower right window pane and it will highlight that statistic in the graph. Also, you can customize the graph views by right-clicking on the graph and selecting Chart Configuration.
Remember to uncheck previously viewed selections to change the graph scale, unless you need to see how one statistic plots against another one.
It's very important to see if any forced flushes are taking place.
LUN Read I/O/sec
Looking at the IO/sec and Read Cache Hits/Misses, you can tell if the read pre-fetching is working.
A high ratio of read cache hits per second to LUN Read IO/sec is a good indicator of pre-fetching working. You can see this ratio directly by looking at the Used Prefetches %.
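A quick hedged calculation of that ratio, with illustrative values:

    # Illustrative values: read cache hits/s versus total read IO/s as a
    # rough pre-fetch effectiveness signal for a sequential workload.
    read_iops = 400.0         # LUN Read Throughput (IO/s)
    read_hits_per_s = 360.0   # Read Cache Hits/s

    print(f"read hit ratio = {read_hits_per_s / read_iops:.0%}")   # 90%
    # A ratio near zero on a supposedly sequential workload would prompt
    # a look at pre-fetch settings and the Used Prefetches (%) metric.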
LUN IO Size distribution summary
This will enable you to determine
where your host IO sizes fit.
Right click LUN-2 in the Detail View,
select IO Size Distribution Summary, then in that view, you can select all
values by right clicking in the value
pane of the window on the left.
Right-click / Select All / Values
This is a histogram where each column represents IO in the range from that size up to the next size minus 1 block; e.g. in the view here, we see a value for reads that are 8KB and above, but lower than 16KB in size.
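A minimal sketch of that bucketing, assuming power-of-two column boundaries as described:

    # Sketch of the histogram bucketing described above: each column
    # covers IO from its size up to the next size minus one block.
    # Power-of-two boundaries are assumed for illustration.
    BUCKETS_KB = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

    def bucket_for(io_size_kb):
        """Return the largest column boundary <= the IO size."""
        label = BUCKETS_KB[0]
        for b in BUCKETS_KB:
            if io_size_kb >= b:
                label = b
        return label

    print(bucket_for(12))   # -> 8: a 12KB read lands in the 8KB column
    print(bucket_for(4))    # -> 4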
LUN Write Size and Full Stripe Writes
You can view these back in the Performance Detail view to see, over time, if coalesced writes are resulting in full stripe writes to the LUN. This indicates that write cache is working well and some writes are sequential. If no Full Stripe Writes are seen, writes to this LUN are more random and small, with little or no locality.
Looking at the average LUN IO size for read or write in the detail
view can be misleading as it will be an average and a low write IO rate
will not be accurately shown. You really need to use the IO
Distribution Summary for the LUN to see the IO distribution.
In the lower left pane, you can choose to show the I/O rate at each size. The default is I/O count, which means the total I/Os for this sample period.
Another method to detect sequential write access is comparing the disk write IO size with the LUN write IO size, i.e. cache coalesces smaller IOs into fewer, larger IOs when de-staging data.
Read hits are when a host read comes in and the data is already in cache. Remember also that pre-fetch activity may span sample periods, i.e. data pre-fetched in one period may not be read until the next period.
Disks – Average Seek Distance & Utilization
This will give an indication of data locality and whether the disk is working hard. Be aware that a disk that shows 100% utilized isn't necessarily bad, as that just indicates the disk was never idle during the sample, and utilization is reported from the SP showing the higher value (the other SP may have some more usage, up to 100% also).
You could look at the disk Average Busy Queue Length and compare it with the Queue Length. If the Average Busy Queue Length is bigger than the reported Queue Length, this may be an indication of bursty activity.
LUN & Disks write size
This can give an indication that
coalescing is taking place in write
cache such that disk writes are bigger
than the LUN writes.
If we see the LUN and disk write size
is the same, typically this implies the writes are very random and not
coalescing in cache to become larger
IO’s – or write cache is not being used.
LUN 50 will show this but LUN 2 doesn't – can you explain why? Tip: check the LUN 2 IO Distribution and write IO rate.
CLARiiON cache is great at optimizing back-end disk access, particularly benefiting the RAID 5 and RAID 6 options, which have write penalties associated with small-block random write activity.
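To make the write penalty concrete, here is a hedged back-of-envelope calculation using the commonly quoted small random write penalties (RAID 1/0 = 2, RAID 5 = 4, RAID 6 = 6 disk IOs per host write); actual back-end load depends on how well cache coalesces writes:

    # Back-of-envelope back-end disk load for small random writes, using
    # commonly quoted per-write penalties. Full stripe writes and cache
    # coalescing can reduce these substantially.
    WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}

    def backend_iops(read_iops, write_iops, raid):
        return read_iops + write_iops * WRITE_PENALTY[raid]

    print(backend_iops(300, 200, "raid5"))    # 300 + 200*4 = 1100 disk IOPs
    print(backend_iops(300, 200, "raid10"))   # 300 + 200*2 = 700 disk IOPs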
Cached writes may also result in bursty activity at the disk
level as write cache flushes data. The trick is to not let that activity lead you to think your host activity is bursty when it
isn’t.
Performance Overview View
Select the SP TAB in the performance detail view then right click on the
array;
<Select Performance Overview>
In the Overview view you can see more detailed properties of cache, together with 3 key statistics for the overall array – Throughput, Bandwidth and Dirty Pages. One particularly useful detail is the watermark settings, as they are not visible anywhere else when looking at an Analyzer NAR file away from the array it came from.
The cache states shown here are those in effect when the logger started the current nar/naz file, so be sure to determine actual settings from measured metrics.
Dirty pages on each SP indicate write cache is enabled at the array level.
Read cache hits and pre-fetch bandwidth are indicators that read cache is enabled.
Watermark settings are used to intelligently flush write cache pages
out to disks on the backend and keep a level of write cache available
for bursts of activity.
Raid Group / Thin Pool
Select the Storage Pool TAB in the Performance Detail view. Be careful not to get confused here, as the Raid Group and Thin Pool statistics are derived from disk statistics, not LUN statistics. Thus, the values will depend on all activity for all LUNs within that raid group, or for all Thin LUNs in a Thin Pool. Thin Pools aren't covered in any specific detail here, although disk statistics are logged and can be analyzed in the same way as for a regular Raid Group.
This reference was more for information as we’re not going to be looking at raid group statistics specifically for the exercise. The
Storage Pool TAB is the only method to analyze disk activity within a
Thin Pool. Thin Pools do have regular LUNs that are considered
private and hidden from view, including Analyzer.
The best overview of the cache configuration, and the only place you can see the watermark settings, is the Overview screen.
Don't be fooled by settings that may have been changed during the logging period though, i.e. it may show cache as enabled or disabled here, but you should verify that with the read cache statistics and dirty pages in the other views.
The 2 exercises have explored the options and views available to you, with some explanation and
guidance on what the statistics mean and how they help in characterizing your IO.
The following section, should you have time to look at it, will guide you through looking at the loads on a specific set of LUNs sharing the same set of disks.
You have explored the views; now the task is to analyze a specific area where we have an issue. Please explore the interface and look at the following attributes for this load on the array, with a focus on Raid Group 0, LUNs 50 & 51. Look at the following for each of these LUNs and see if you can draw any conclusions (make notes on the worksheet table at the back of this handout):
LUN read and write throughput
LUN read and write IO size
LUN Response Time
LUN Queue Length
LUN IO sizes
Disk read and write throughput
Disk seek distance
Disk IO sizes
Disk Queue Length
Disk response time
Define the IO profile associated with each LUN and think about what the applications could be. There is an area where we do have an issue that we want to fix.
Hint: one of them is doing large sequential reads – it could be a video or data warehousing application.
Look at the profiles for these two LUNs…. How are they different and what would be a suggestion on improving performance?
Using this hint, think about what helps sequential operations and what could also hurt them.
If it is sequential – are we seeing pre-fetching and a high pre-fetch used rate? We do need to know what the application is trying to do, then correlate that with the characterized IO on the storage system to see if it is doing what we think it should be doing.
So, what about the other LUNs that showed up with red and/or yellow boxes? As explained during the overview, the red/yellow boxes give an indication of where to investigate and are not indicative of an absolute problem.
Raid Group 4 has multiple LUNs with different IO characteristics. Check some of the metrics and see if you can conclude anything about
this raid group. Don’t spend too much time on this task as the prior
elements focusing on LUNs 50 and 51 are the main points of this exercise. If you have time, you might want to take some notes on
LUNs 2 and 3 statistics for reference (are they busy? Is that bad?)
Additional reference notes
Although these exercises cover the performance statistics from hosts accessing LUNs in the array, there may be additional load generated internally to the array. This load could include the operations listed below.
Typically, layered application IO will not be logged at the LUN level but
will be visible at the disk level. If you understand what is taking place, once
accustomed to the user interface and
operational characteristics of layered
applications, you can determine what disk activity relates to host access or
internally generated IO.
Sometimes, you will observe a blip in the statistics i.e. a value for a statistic
outside normal range. To overcome
this being a nuisance, you can either restart the plotting or adjust the
scaling of the graph plots by zooming
in or setting the Chart Configuration
Axes options in the graph view.
Real-time viewing of Analyzer
statistics isn’t the preferred method
due to the requirement to be there at the time, as well as the additional
impact to the array in presenting the
information. Typically you would
look at a captured Analyzer NAR file as covered in Exercise-A.
SnapView Snapshot sessions
SnapView clones
MirrorView/S activity
MirrorView/A activity
SAN Copy activity
Background zeroing for bind operations
Background verifying
Raid Group rebuild activity
Hot spare equalize activity
LUN migration operations
In the metric selection window, since Release-26 of code, you will see options for Optimal and Nonoptimal metrics for LUNs. These are used when you have LUNs using the ALUA failover mode (mode=4) of operation. Typically, you would see Optimal values when a LUN is accessed from the current owning SP, and Nonoptimal values when it is accessed from the non-owning SP, which is a slightly longer path. When not running in ALUA mode, selecting either the regular metric or the metric-Optimal will display the same values.
Supplemental, Command Line NAR file retrieval and export capabilities
This is to direct you to the capabilities for scripting NAR file retrieval for lights-out performance statistics gathering. As the Navisphere archive file collects data covering the previous 5 hours of statistics, the capability to script retrieval of the NAR file is useful when you want statistics for a period of activity and you're unable to retrieve the file in the normal way using the Navisphere GUI, e.g. statistics logged on Saturday would need to be retrieved sometime Sunday or they would be overwritten by Monday. With the release of revision 24 and later revisions of code, the Analyzer Archive facility allows automatic archiving of Analyzer files on the array itself for later retrieval via the GUI or CLI, retained for a much longer period than the previous 5 hours (or 25 hours for older code archives). Remember though that if Periodic Archiving is not enabled, you will only grab the prior 5 hours of data by default when you retrieve the archive.
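A hedged sketch of scripting that retrieval around the naviseccli command shown below; the destination path, scheduling and security-file setup are assumptions for illustration, and where retrieved files land may vary by release (check the Admin Guide for a path option):

    # Sketch of lights-out NAR retrieval wrapping naviseccli. Assumes
    # naviseccli is on PATH and a security file holds the credentials
    # (otherwise add -user/-password/-scope as shown in this section).
    import datetime
    import pathlib
    import subprocess

    def retrieve_archives(sp_ip, dest="C:/nar_archive"):
        day_dir = pathlib.Path(dest) / datetime.date.today().isoformat()
        day_dir.mkdir(parents=True, exist_ok=True)
        # -all pulls every archive on the SP and can take a long time;
        # retrieved files are assumed to land in the working directory.
        subprocess.run(
            ["naviseccli", "-address", sp_ip, "analyzer", "-archive", "-all"],
            cwd=day_dir, check=True)

    # Run from a scheduled task (e.g. Sunday) so Saturday's statistics
    # are captured before they age out of the on-array repository.
    retrieve_archives("10.0.0.1")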
<Ensure the NaviCLI utility is installed. This is easily verified if the Navisphere CLI directory is present. Here you can double click the shortcut on the desktop called NaviCLI>
This will start a command window that will go to the default installation
directory c:\Program
Files\EMC\Navisphere CLI
(Username and password will be
emcw for the following commands)
<Retrieve the Navisphere archive files using the following command:
"naviseccli -user <username> -password <password> -scope 0 -address <SP IP> analyzer -archive -all">
Be careful with this command as you may have many archives to download when selecting all, and it could take a long time to complete. By omitting the "-all" you are presented with a selection list where you can select one or more archives to retrieve.
Scope will be 1 if the account details used are local and not global.
Do not do this here, but you can reset the statistical data by using the following command if you are looking to collect data for a specific test period only and you are not interested in previously collected data:
"naviseccli -user <username> -password <pwd> -scope <0|1> -address <SP IP> analyzer -logging -reset"
The username and password can be omitted if you have set up the security file for naviseccli.
The username used here does not have privileges to reset data logging on the arrays being used.
Note: the desktop shortcut used here is not created for you during installation. You have to do that yourself if you want that shortcut available on your own systems.
Prior to release 24 you would need to use the java archiveretrieve command to get the archive from the array:
"java -jar archiveretrieve.jar -User <username> -Password <password> -Scope 0 -Address <array IP> -File archive_emc.nar -Location "C:\program files\emc\Navisphere cli" -Overwrite 1 -Retry 2 -v"
Now you can follow these steps and open the retrieved NAR file using the on-array or off-array capability, or you can convert the NAR file data to CSV format for import into Excel. You can use the following command to do this:
"naviseccli -user <username> -password <password> -scope 0 -address <SP IP> analyzer -archivedump -data test.nar -out test.csv -object s,l,d"
You can also filter the output to only get specific statistics, like read throughput, by adding the qualifier -format rio.
This example outputs stats for SPs, LUNs and disks (-object s,l,d). To get stats for metaLUNs, etc., please consult the document Navisphere Analyzer Administrator's Guide.pdf.
If Excel format is required, you can use the archivedump command to convert the NAR file data to a format readable by Excel, typically CSV.
Some more qualifiers for the -format option are as follows (separate with a comma if used):
Utilization (%) u
Response Time (ms) rt
Dirty Pages (%) dp
For other qualifiers please consult the Admin Guide.
If you leave off the -object qualifier, it will output all statistics for all objects.
The Navisphere UI has an Analyzer dump wizard that guides you
through device and attributes selection prior to dumping to a CSV file.
Start Excel, select "open file", browse to the c:\Program Files\EMC\Navisphere CLI directory, select the file type as CSV, then open the test.csv file you created in the last step to view the statistics as presented in Excel.
If using Excel 2007, use the INSERT TAB to display graphing options.
If you’re not too familiar with Excel
but would like to plot a graph showing
one of the statistics over time, you can easily do this by selecting a column by
clicking on the header letter, then
once the column is highlighted, click on the chart wizard icon in the tool
bar, select line as chart type, then
click next to see what the chart would
look like. You can then customize it as required.
Please note that each device selected in the dump command, like SPs and LUNs, will be listed down the left column, so selecting an entire column to plot would actually plot all SP stats followed by all LUN stats, and so on. You would need to be more selective and manipulate the data when plotting graphs in a logical manner.
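If you prefer scripting to Excel, a hedged sketch of filtering one object's rows out of an archivedump CSV before plotting; the column name is an assumption, so check the header row of your own dump file:

    # Sketch: pull one object's rows from an archivedump CSV so a chart
    # covers a single SP or LUN rather than all devices stacked in one
    # column. The "Object Name" column header is an assumption.
    import csv

    def rows_for_object(csv_path, object_name, name_col="Object Name"):
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                if row.get(name_col, "").strip() == object_name:
                    yield row

    for row in rows_for_object("test1.csv", "SP A"):
        print(row)   # one SP's samples, ready to chart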
You can try the archivedump command and be more specific with some of the qualifiers shown previously. You could also have a go at the dump wizard from the Analyzer drop-down in the Navisphere Manager GUI – using off-array Navisphere.
"naviseccli -user <username> -password <password> -scope 0 -address <SP IP> analyzer -archivedump -data test1.nar -out test1.csv -object s -format u,dp"
This will output test1.csv containing SP statistics of utilization and write cache dirty pages.
Another option is archivemerge, used to merge multiple NAR files together. We don't use that in this session, but remember that this is useful if you want to view data access trends that span more than the typical NAR file coverage of 5 hours. It is not necessary to merge NAR files from both SPs, as each SP has the same data.
The array-based archivedump wizard provides an easy way to dump specific statistics associated with individual devices, rather than using the CLI method shown above.
With either on-array or off-array,
select Tools, Analyzer, Archive, Dump
Then select where the source file is
located and follow the wizard to select
objects to dump and what statistics you require.
Supplemental, Thin LUN Analysis
This is to highlight some differences in metrics available for Thin LUNs in a CLARiiON
environment and the way in which we view them.
There is a read/write load running to LUN201 on the array. This is a Thin LUN provisioned
from a pool of 3 disks.
Thin Pools in a CLARiiON have a private structure that isn’t visible in the Navisphere
interface. This structure has private LUNs that Thin LUNs utilize in 1GB increments. With the
experience gained from the primary exercises you can take a look at the active Thin LUN and
how to observe IO to both it and the Thin Pool disks.
Check Thin LUN properties.
<Right click on LUN 201 and select Properties>
<Here you can see the Pool properties that this LUN is serviced
from and the Thin LUN virtual size
and actual consumed capacity from
the pool>
<When selecting the Thin LUN, you do not see cache operations
associated with that LUN as these are
associated with the private LUNs
servicing the IO to the Pool and those are hidden from view >
<Unlike regular LUNs in the LUN TAB view, you only see the SP a Thin
LUN is assigned to. To see the disks
servicing the Thin LUN and its Pool,
you have to select the Storage Pool TAB >
<Select the Storage Pool TAB> <In the view, you can expand the Pool
to see the disks servicing the total
Pool load. You cannot see the private
LUNs that are hidden in the Pool>
End of exercises
These are metrics you will not see when selecting Thin LUNs to analyze. This may change in a future release, but for now you have to look at the Thin Pool disk characteristics to determine what's happening in the Pool as a whole. Regular IO metrics like throughput, bandwidth, and response time are available for each Thin LUN.
There are no specific instructions on what to investigate here, although if you have time, compare the disks within the Pool and how those align with the Thin LUN characteristics.
Supplemental notes
The following operations are executed at the disk level to provide data integrity features associated with
redundant RAID types as well as consistency of data stripes that could be at risk due to media issues.
Background zero: Before user data can be written to the physical disks within a LUN, the area has to undergo a zero operation. New disks are initially supplied in a zero state where data can be written to the disks immediately after binding LUNs; however, if the disks have been used before, i.e. bound and unbound, they have to be re-zeroed.
You can zero the disks using a naviseccli command in readiness for grouping and binding LUNs later on, or the array will zero the disks when you create new LUNs on them. This zero operation results in 512KB SCSI write-same commands to the disks in a sequential manner, unless the array has to zero-on-demand an area the user is writing to that is in the queue but hasn't been zeroed yet. There is some other small activity on the disks during zeroing as checkpoint operations keep track of progress. Typically, with no access to the LUNs, any zeroing will complete in a matter of a few hours, although a busy array and activity to the disks being zeroed will delay completion. Also, the 512KB write-same command will not consume back-end bandwidth but will affect disk load and utilization.
Background verify: This operation validates the consistency of data protection at the disk level and is automatically performed on newly created LUNs. The IO profile at the disk level is 64KB reads and, like zeroing, it can take hours to complete and is also governed by array and disk activity.
Background zero, zero-on-demand, and background verify operations exhibit relatively large IO sizes that can affect one's analysis of the array. Also, if considering user testing, it's worth noting these operations may affect the performance the array can present, due to the parallel action of user data access and these preliminary operations.
Also be aware these operations run in a sequential manner for any given raid group (RG), e.g. if you bind 5 LUNs (0 through 4) on an RG, LUN 0 will start to zero and, when complete, will perform a background verify. This is followed by the second LUN in that RG. Each LUN will zero then verify until all newly created LUNs complete that process. Thereafter, the only regular IO you will see at the disk level due to internal operations will be SNiiFFER, where you will see approximately 1 IO per second, 512KB in size, to each disk in an RG. SNiiFFER is a data checking operation that cycles through every block in every LUN in the array to ensure data availability, even for data you might not have touched for months or years. Any data inconsistency detected through SNiiFFER will automatically invoke recovery and remap of affected blocks. RGs will run through zero, verify and SNiiFFER operations independently of each other. Zeroing will have the most effect on performance, so consider this when testing. Verify may have a small effect and SNiiFFER will have a negligible effect on performance.
Always check disk stats to see what IO sizes are taking place at that level. With an RG idle, disk activity showing 512KB writes indicates zeroing, 64KB reads indicate verifying, and 512KB reads indicate sniffing.
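Those IO-size signatures lend themselves to a quick lookup; a sketch of the rules in this paragraph, not a diagnostic tool:

    # Quick lookup of the idle-RG disk IO signatures listed above.
    # A sketch of this paragraph's rules, not a diagnostic tool.
    def classify_idle_disk_io(op, size_kb):
        if op == "write" and size_kb == 512:
            return "background zeroing (SCSI write-same)"
        if op == "read" and size_kb == 64:
            return "background verify"
        if op == "read" and size_kb == 512:
            return "SNiiFFER data check (~1 IO/s per disk)"
        return "host or other internal activity"

    print(classify_idle_disk_io("read", 512))    # SNiiFFER
    print(classify_idle_disk_io("write", 512))   # background zeroing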
{end}
Worksheet – use as needed during exercises.
LUN ID 50 51
Owner SP
SP Utilization
LUN Read IOPs
LUN Read size
LUN Write IOPs
LUN Write size
LUN Read MB/s
LUN Write MB/s
LUN response time
LUN Queue
Disk Read IOPs
Disk Read size
Disk Write IOPs
Disk Write size
Disk Queue
Average disk seek
Disk response time
LUN ID
Owner SP
SP Utilization
LUN Read IOPs
LUN Read size
LUN Write IOPs
LUN Write size
LUN Read MB/s
LUN Write MB/s
LUN response time
LUN Queue
Disk Read IOPs
Disk Read size
Disk Write IOPs
Disk Write size
Disk Queue
Average disk seek
Disk response time