EMC World – May 2010
Navisphere Analyzer
Purpose of the script: EMC Navisphere Analyzer allows you to view storage system performance statistics in various types of charts. These charts can help you find and anticipate bottlenecks in the disk storage component of a computer system. Today's session takes a look at the Navisphere Analyzer user interface. In particular, it covers the different views available and provides some basic starting points for checking whether your existing configuration is being stressed or is working in a well-utilized manner. This script is designed for use with Navisphere Manager 6.x software.
Please do not alter any of the workstation or CLARiiON Storage System configuration details unless instructed
to do so by these instructions or by a member of the EMC presentation team.
In this session there are primarily two exercises, covering archive retrieval and viewing, and on-array real-time analysis. In addition to the instructions for these exercises, you will find more exercises and reference material in this handout.
If you have time to do so, please explore those additional sections during this session.
Before you begin: Fill in the following information:
a. Assigned Array from the desktop icon array.txt on your laptop: Array _________ SP ___
b. Storage System IP address to use: ___.___.___.___
c. Proceed to exercise A
Exercise A, NAR file viewing offline
This exercise directs you through checking the status of data logging on your array, retrieving an archive file containing performance data, and then looking at that data using Navisphere off-array software. This is primarily a walk-through exercise to gain familiarity with the steps involved in performance analysis. The second exercise will cover more details about the metrics you are looking at.
The NAR file is from a test environment where a total of 12 tests were run. You will clearly be able to segment the statistics into 12 areas. The odd-numbered tests used a single thread to each LUN and the even-numbered tests used 4 threads per LUN. The essence of this exercise is to gain familiarity with the interface, as well as to identify that increasing load on the array has various effects.
Start Internet Explorer. <Enter the IP address of your assigned managed node (SP IP address) into the address window of the browser>
The process of getting started with Navisphere Manager is simple. You begin by pointing your browser at the storage system's IP address.
Login <Enter the user as emcw and password emcw>
You will be presented with the standard Navisphere view of the
Domain you logged into.
<Select Tools -> Analyzer -> Data Logging>
Check that the logger is running and that periodic archiving is set as you want it. If checked, it saves the archives every 5 hours with the default 120-second sampling, or every 2.5 hours if using 60-second sampling.
If you do not have Analyzer installed on the array, you can still invoke logging, but this will be limited to 7 days of periodic archiving and will create encrypted archives for service use.
<Select Cancel>
If you are running pre-release 24 array code, the logging feature operates differently, so you will have to manually enable statistics logging at the SP level.
With release 24 and above, the logger will automatically enable or disable statistics logging as required.
Statistics logging is the process whereby the array collects statistics for each object within the storage system. The logger is required to facilitate collection of those statistics; however, if the logger isn't running, you can view a subset of statistics using the SP Properties view within Navisphere, or collect some raw statistics using secure CLI commands.
<Select Tools -> Analyzer -> Archive -> Retrieve>
The dialogue will present the current repository contents for the selected
SP.
<Scroll down to find the file called
emcw_2010_xxx-xx.nar
Select that file and click on Retrieve>
Note the location where the file will be
stored.
You will see the status of the operation
in the lower pane as the file is
uploaded to your workstation.
<Close this dialogue with Done, then
close the current browser instance>
The newest file listed could be up to 5.5 hours old, so you may need to use Create New to force the logger to create a new archive containing recent statistical data from its buffer.
You have the option of retrieving archives from SP-A or SP-B. Although they should contain almost identical data, it is worth retrieving from both SPs in case there is any problem with viewing one of the files, or in case one SP was rebooted during an archive and is missing samples for that time.
We are now going to use off-array Navisphere to view the archive we retrieved in the previous operation. We could use the array to view the archive; however, usual practice is to view archives independently of having an array resource available.
Recommended software components for off-array operations to enable viewing Analyzer data:
Ensure the Navisphere Management Server service is running (Start, Settings, Control Panel, Administrative Tools, Services). The service is called NaviGovernor.
Note: this is only required for off-array management and offline Analyzer archive file viewing.
<Start the off-array UI - double click the desktop icon labeled OffArrayUI>
If not available, you can explicitly run the off-array management UI by selecting START, RUN and pointing to the following link: "C:\Program Files\EMC\ManagementUI\6.29.x.x\WebContent\start.html"
<Enter Management Server IP
address 127.0.0.1 and use default
port of 80/443>
127.0.0.1 is the localhost address.
Alternatively you can use the IP
address assigned by the DHCP
server.
Login <Enter the user as emcw and password emcw>
You will be presented with the standard Navisphere view of the Domain you logged into.
<Select Tools -> Analyzer -> Customize>
In the General TAB view, check that the Advanced box is ticked.
In the Archives TAB view, check the default path for archives and that Performance Survey is set for the initial view, and check that the Initially Check All Tree Objects box is ticked.
When you have many objects you may wish to be more granular in your selections and choose not to initially check all objects.
In the Survey Charts TAB view, check that Utilization, Response Time and Average Queue Length are selected and that the values shown for each are present.
<Click OK> to use these
settings.
Note: When viewing the analyzer
standard performance detail view,
there are 4 windows in the view. These are the object (top left), value
(bottom left), plot (top right) and plot
item list (bottom right). To display
the values available for a given
object, you must select that object. Selecting a plot item will highlight the
plot associated with that selection. This will be useful when selecting
many items to view.
We’ll see this view later in the exercise.
Customize is only required once for the off-array environment.
Customize is also available for the array environment, so when you set an array option, it will remain set for anyone logging into the array for viewing real-time data, covered in the next exercise.
Normally we might suggest a threshold of 10 samples, but for this exercise we'll use 4 for the off-array archive file we're looking at.
<Select Tools -> Analyzer -> Archive -> Open>
Open file emcw2010_1_xx-xx.nar. Use the default time points. Select the file in location "C:\Documents and Settings\emcperf".
Leave the default start and end time for this exercise; however, when doing this on a NAR file from your own storage system, you may want to narrow the time display to make it easier to view specific activity.
<Click OK>
You should now have the
Performance Survey View if you setup
the default open view as shown in prior steps.
<Scroll in this view to see if anything is highlighted by red boxes, indicating
a threshold set in the configuration
has been exceeded>.
A RED or YELLOW Utilization box may be an indication for concern, especially if the Response Time and/or Queue Length are also RED.
Make a note of any suspected LUNs in the space below:
____________________________
Make some notes on what you see in the Survey Chart.
You can merge NAR files to cover a longer period of time; however, when opening a large NAR file, it can speed up the interface to focus on a shorter time period. The merge option is referenced later in this paper.
It is not recommended to merge SP-A and SP-B NAR files together, as they should contain very similar data anyway.
<With the pointer in the Utilization display for LUN 50, either double click, or right click and select Performance Detail. You will now see a graph showing LUN 50 Utilization>
From this display, we want to check some things out:
SP Utilization – the two SPs should be about the same if the load is well balanced.
<You need to expand the LUN
using the + to see the SP check
box – check this to add SP detail>
Uncheck the LUN to increase the scaling of the SP Utilization if necessary. Is the load between the two SP’s balanced?
<In the Performance Detail View, right click on the SP and select
Properties. Check that under the
Cache Tab, Read and Write cache are both enabled, and check the cache
page size>
Now we want to check the cache dirty
pages – these are pages of data held in memory during writing. If these are
very high, it could be an indication of
a problem we need to work on.
<Make sure you have the SP selected with the pointer, i.e. SP A is shown highlighted. Now scroll down in the lower left window to the Dirty Pages (%) property – check the box>
Write cache works on a policy of watermarks. Here we see the dirty pages around 60% to 80%, which indicates the watermark processing is working well. You can see how the watermarks are set by selecting the SP TAB, then right clicking the array and selecting the Performance Overview view.
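As a minimal illustrative sketch (not the actual CLARiiON flushing algorithm), watermark processing behaves like a simple control loop: de-staging ramps up when dirty pages cross the high watermark and eases off once they fall back below the low watermark. The 60/80 values below match the range observed above but are otherwise assumed:

    # Illustrative watermark flushing loop - not the actual CLARiiON
    # algorithm. Watermark values are assumed for this sketch.
    HIGH_WATERMARK = 80   # % dirty pages: start de-staging aggressively
    LOW_WATERMARK = 60    # % dirty pages: enough free pages, ease off

    def should_flush(dirty_pct, flushing):
        """Decide whether write cache should be de-staging to disk."""
        if dirty_pct >= HIGH_WATERMARK:
            return True
        if dirty_pct <= LOW_WATERMARK:
            return False
        return flushing   # between watermarks: keep doing what we were doing

    flushing = False
    for pct in (55, 64, 77, 82, 74, 63, 58, 66):   # dirty pages oscillating
        flushing = should_flush(pct, flushing)
        print(f"dirty={pct}%  flushing={flushing}")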
<Now uncheck both SP’s and the Dirty Pages box>
Check the total memory allocated to read and write cache and that both are enabled. A reference for allocation of cache memory can be found in the CLARiiON Best Practices Guide, although it is typically recommended to reserve up to 20% of total available cache for read cache, and the rest for write cache.
If the Dirty Pages (%) were consistently high, this may be an issue we'd need to look at closer. Maybe the watermark settings would need changing? Maybe the cache allocation would need changing? You need to consider the write load on the array and its duration, combined with the distribution at the disk level on the back end; after all, the disks govern the speed at which we can de-stage data from dirty pages. Adding more spindles to a particular application can help de-stage data quicker for write-intensive applications.
Now let's look at the LUN details.
<Click once on LUN 50, then in the lower left window, un-check Utilization, then check both the Read Bandwidth and Write Bandwidth boxes>
Simultaneous reading and writing on a single LUN can be challenging. Let's look at more detail.
<Now select the property Forced Flushes/s>
If none are seen, that is a good reason to re-check that cache is on, although no forced flushes also indicates the write cache is not being worked too hard.
Although we don't have any forced flushes here, write throughput is the reported write cache hits combined with any forced flushes; i.e. a write that causes page(s) of write cache to be flushed to make room for it does not count in the write cache hit total (unless the write size satisfies the write-aside value and bypasses cache – more performance architecture understanding is required if that wasn't understood).
Let's just check the LUN properties.
<Right click on LUN 50 and select Properties>
You want to know the RAID type, number of disks and user capacity. Also check that the stripe Element Size is as expected – 128 blocks (64KB) is normal for striped RAID as per the CLARiiON Best Practices Guide.
Under the Cache TAB you may see the read and write cache enabled boxes empty – this can indicate cache wasn't enabled for the LUN, but in this instance it was. Always check the current code release notes for known issues with the interface.
You should also check which LUNs share the same disks to see if multiple hot LUNs are due to disk contention on the backend.
<In the Performance Detail view,
click on the Storage Pool TAB at the top – then you can expand each RAID
Group to see what LUNs share the
same disks>
As you will see, some of the suspected RED LUNs shown in the Performance Summary view are sharing disks in the second half of the test.
<Select the LUN TAB at the top of the window>
<Now deselect Forced Flushes/s, Read Bandwidth and Write Bandwidth>
<Select Read Throughput I/O/Sec and Read Cache Hits/s>
No read cache hits would indicate either random access or reads too big for pre-fetch.
Check the IO size to the LUN.
<Right click on LUN 50 and select IO
Size Distribution Summary>
<In the view, you can quickly select all values by right clicking in the left pane and clicking on Select All – Values>
You can see here all IO’s are small (4KB), both read and write – this
indicates a totally random profile as no read cache hits were seen.
<Close the IO Distribution Summary
window> and <de-select Read
Throughput I/O/Sec and Read Cache
Hits/s>
<With LUN 50 selected, select the
Full Stripe Writes/s check box>
<None of the LUNs are doing FSW’s
in this archive>
Another useful view would be the IO Size distribution detail.
<Right click LUN 50 and select IO
Size Distribution Detail>
<Select Read and Write IO size of
4KB only as we confirmed that as the only IO size used by checking the IO
Size Distribution Summary view
previously>
This view is useful to see the read/write ratio over time.
As we saw from the IO Distribution Summary, the writes were small and no Full Stripe Writes were taking place; this suggests the writes are also random, as no full stripe coalescing in cache is taking place.
We could get some write IO coalescing taking place, resulting in larger than 4KB writes at the disk layer – we would need to check the write IO size at the disk to validate that.
<Close the IO Distribution Detail window. Also, with LUN 50 selected, uncheck the Full Stripe Writes/s box. Also, uncheck the LUN 50 box as well>
Now expand LUN 50; we can see 5 disk drives.
<Select the last disk check box, then
in the parameter window, select
Utilization and Average Seek
Distance>
What do we see here? Disk seeks are a few GB, indicating a moderate level of randomness. Uncheck Utilization to get a better view of the seek distance (or zoom in).
<Uncheck Utilization and Average Seek Distance>
<Select LUN 50 and then check
Write Size>
As you can see here, over time the disk IO size tracks the LUN IO size – again indicating a random workload with little or no coalescing of data. If disk IOs were bigger, that would be a good indication of coalescing taking place – always check you have write cache enabled.
Don’t forget to check you have write cache enabled (LUN and Array).
<Uncheck Write Size and also
uncheck LUN 50>
<Check both of the first 2 disks in LUN 50 and then check the Total Throughput for these disks>
The disks at varying times are working very hard. We can see them reaching over 350 IOPs per disk. Now, for small random IO we have a rule of thumb (ROT) stating a 15K rpm disk can be used for 180 IOPs of mixed random load with good response time. When running disks at higher loads we can expect an impact in the response time observed.
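As a back-of-the-envelope check based on that rule of thumb (a sketch only; the 180 IOPs figure is the ROT quoted above, and this ignores RAID write penalties and cache effects):

    # Spindle-count sanity check using the handout's rule of thumb of
    # ~180 IOPs per 15K rpm disk for mixed small random IO. Ignores
    # RAID write penalties and cache effects.
    ROT_IOPS_15K = 180

    def disks_needed(host_iops, per_disk=ROT_IOPS_15K):
        """Minimum spindles to keep per-disk load at or below the ROT."""
        return -(-host_iops // per_disk)   # ceiling division

    print(350 / ROT_IOPS_15K)   # ~1.9x the ROT, as seen here: expect
                                # longer response times at this load
    print(disks_needed(1750))   # 10 disks to serve 1750 IOPs comfortably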
High disk utilization means we are working the disks well. Low utilization would indicate additional load could be placed on the drives with consistent service and response times.
Some 32KB writes may be seen; these could be protection bits being set rather than coalesced user data.
<Uncheck Total Throughput and also uncheck the second disk drive>
<With the first disk drive highlighted, check the Queue, Average Busy Queue Length and Service Time boxes>
The Average Busy Queue Length compared to the regular Queue can give an indication of burstiness; however, at the disk level, activity includes de-staging writes from cache, which can arrive in bursts.
Always check the release notes for
known issues relating to accuracy of statistics.
The disk service time is how long the disk takes to service each request. If IO gets queued at the disk, the disk response time is then the service time multiplied by the queue depth. Therefore, the higher the queue, the longer the response time.
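A simple worked example of that relationship (illustrative numbers only):

    # Illustrative only: response time grows with queue depth because
    # each IO waits behind those queued ahead of it.
    service_time_ms = 5.0   # time for the disk to service one IO

    for queue_depth in (1, 2, 4, 8):
        print(f"queue={queue_depth}  response~{service_time_ms * queue_depth} ms")
    # queue=8 -> ~40 ms: the higher the queue, the longer the response time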
A point to remember, though, is that writes will typically be serviced by cache and have a very fast response time. Reads will have a more directly impacted response time as disk queues increase. Of course, this all depends on IO size and also on how writes are being de-staged from cache and how efficiently cache is working to optimize that process.
<Uncheck Service Time>
<Uncheck Queue>
<Check Response Time>
You can see the Response Time follows the average busy queue depth
as the service time was pretty stable.
<Uncheck all objects>
Now select the SP TAB, click on SP-A
and select Total Throughput.
You can see that as the load increases,
so does the overall capability of the array i.e. more threads per test, higher IOPs. More LUNs tested, more IOPs.
The second set of tests where you see
the IOPs starting around 270 increasing steadily to 1286 are where
both SP’s are being used, so the aggregate IOPs would be higher still.
Those peaks seen at the start of tests are normally writes being absorbed into the protected write cache until watermark processing
starts to write data to disks on the back end.
<Uncheck SP A and Total Throughput>
Select the LUN Tab.
Now check LUNs 60 and 61. These
are both using the same disks and you
can see the impact when both LUNs are under load as the LUN response
time increases.
Select the disks for LUN 60.
This contention results in fewer aggregate IOs across both SPs, due to an increased queue as well as a small increase in seek distance at the drive level.
The other small peaks seen here are associated with disk statistics and the SNiiFFER process where there is no host load accessing the disks. This is a process that validates data availability in the background (performing 512KB read operations at 1 IO/s). Take a look at the disk read size and you will see.
Summary of Exercise A
We got you to look at LUN 50 as it was used in all 12 tests. In the first test area you see we get moderate throughput at the disks, but when we increase the threads accessing the same disks, we get much better work from them. The detrimental effect of driving a higher load is an increase in response time to the application due to the increased queuing at the disks (go back and take a look if you have time).
Now, as we add more load to more disks, the same effect can be seen between the single thread tests and the 4 thread tests, i.e. per-disk IOPs is higher if we have more processes accessing the LUNs. In all tests where we have a single thread per LUN, the per-disk IOPs is low compared to the 4 thread tests. This highlights some key performance notes:
Concurrency, when using small IO sizes, is essential for good
performance i.e. multiple
threads/processes.
Also, as we observed in the SP
statistics, overall array performance scales with how many LUNs are being
accessed concurrently, so it was clear
to see as more LUNs were busy, the
overall throughput increased.
Do not expect to get maximum performance from an array unless you have the necessary disk count to service the load. Please reference the CLARiiON Performance and Availability: Applied Best Practices guide for scaling capability guidance for each array type.
When finished with Exercise-A, close down the off-array browser and proceed to Exercise-B.
Here we see the 6 distinct areas of the first 6 tests performed on one Storage Processor. Each lower level shows the single thread performance for 1, 2 & 3 LUNs. The higher peaks represent throughput when 4 concurrent threads per LUN are generating IO.
Exercise B – Analyzer Statistics Viewing Real-Time
This exercise is to direct you around some of the views in Analyzer while the array is under a simulated load from a Windows server. You'll be directed to look at some of the key statistics that indicate whether a system is functioning within acceptable parameters. This exercise is to extend your experience and expand upon some descriptions of the statistics you are looking at – select and deselect components and statistics to overlay graphs, but consider the scale of your selections: with high IO/s on the same graph as disk queue length, queue variation will not be easy to distinguish. If you have time, you'll be directed to look at some specific statistics in order to determine where there is a problem with the current load on the array.
Start Internet Explorer. <Enter the IP address of your assigned managed node (SP IP address) into the address window of the browser>
Repeat as in the first steps of Exercise-A.
Login <Enter the user as emcw and
password emcw>
Also refer back to Exercise-A for the customize options required to be set on the array.
You will be presented with the standard Navisphere view of the
Domain you logged into.
You have already looked at the logging mechanism so now we want to
start viewing real-time statistics.
If only the Local Domain is shown, expand the view by clicking on the +
by the Domain icon.
<Right Click on your array and move
the pointer over the Analyzer
selection to see the expanded list of
options>
<Select Performance Survey>
Performance statistics can be viewed for individual components or you
can select the storage system and then view a selection of components.
To get this window, you can select the array, SP, raid group, Thin Pool, storage group, LUN, Thin LUN or disk to choose which analyzer
view to look at. Here we’ll select the array to present all objects
available.
The Performance survey view will start to plot current statistics based
on a 60 second sample period – please wait until you have at least two
plots to continue i.e. wait at least 2 minutes for the plots to show.
The objective here is to watch the real-time view develop and start to look at some of the performance statistics that are being logged.
You have to wait for 2 samples to get data plotted. Each plotted point is an average of the statistics between samples, except for write cache dirty pages, which is an absolute value at the sample point.
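A minimal sketch of why two samples are needed before anything is plotted, assuming rate metrics are derived from the difference between successive cumulative counters (dirty pages is read directly):

    # Sketch: rate metrics need two samples (a delta over the interval);
    # dirty pages is an absolute reading at each sample point.
    samples = [   # (time_s, cumulative_ios, dirty_pages_pct)
        (0, 0, 55),
        (60, 9000, 62),
        (120, 21000, 58),
    ]

    for (t0, c0, _), (t1, c1, dirty) in zip(samples, samples[1:]):
        iops = (c1 - c0) / (t1 - t0)   # average IO/s over the interval
        print(f"t={t1}s  avg IOPS={iops:.0f}  dirty pages={dirty}% (absolute)")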
We’ll have a look at how to view some
of the key statistics used in analyzing
an array performance.
Exercise-A gave you familiarity with the interface and through this
exercise we’ll expand upon what some statistics mean.
Utilization – LUNs, SPs and Disks
In the Performance Survey view you
can double click on a graph to open
the Performance Detail view for that statistic. Then you can select more
components to view, that will be
placed on the same graph.
Try it – pick one of the utilization
graphs in the Performance Survey View and double click on it.
SP Properties
Right click on the SP and select properties. Ensure Cache is allocated
and enabled.
Total Size indicates possible maximum – look at the Read Cache
Size and Write Cache Size for
allocated cache.
If you set up the survey chart thresholds as instructed earlier, you will start to see green, yellow or red boxes appear. These give you an indication of where to start looking for possible performance issues.
Expand the LUN component to reveal the disks and the storage processor that currently owns this LUN.
The SP properties view is limited within a NAR file compared to the same view when connected to an array in real-time, as displayed here.
Dirty Pages
Dirty pages are protected write cache data that hasn't been committed to disk yet.
To see the appropriate value selections available in the lower left part of the detail view, you must select a component item in the upper left part of the view. Dirty Pages will only be an available option when you have clicked on a Storage Processor (SP). Dirty pages that peak at 99% indicate cache saturation, resulting in force flushing that can hurt performance. We'll look at LUN forced flushes later in this exercise.
LUN Bandwidth
Selecting both read and write bandwidth tells us about the load on the LUN; however, you will need to check IO sizes and data locality to determine if the values seen are expected based on the load.
We can check locality by looking at seek distances at the drive level later in this exercise.
LUN Forced Flushes
Forced flushes are an indication of write cache saturation – if you have many forced flushes taking place, this will impact the system, seen as increased SP utilization as well as increased response times.
Although dirty pages may not have shown 100%, that statistic is an absolute value at the sample time. If you are seeing forced flushes taking place, that indicates the cache pages were 100% dirty at some point.
To help when viewing a graph plot, you can click on the legend item in the lower right window pane and it will highlight that statistic in the graph. Also, you can customize the graph views by right-clicking on the graph and selecting Chart Configuration.
Remember to uncheck previously viewed selections to change the graph scale, unless you need to see how one statistic plots against another one.
It's very important to see if any forced flushes are taking place.
LUN Read I/O/sec
Looking at the IO/sec and Read Cache Hits/Misses, you can tell if the read pre-fetching is working.
A high ratio of read cache hits per second to LUN Read IO/sec is a good indicator of pre-fetching working. You can see this ratio directly by looking at the Used Prefetches %.
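A quick hedged calculation of that ratio, with illustrative values:

    # Illustrative values: read cache hits/s versus total read IO/s as a
    # rough pre-fetch effectiveness signal for a sequential workload.
    read_iops = 400.0         # LUN Read Throughput (IO/s)
    read_hits_per_s = 360.0   # Read Cache Hits/s

    print(f"read hit ratio = {read_hits_per_s / read_iops:.0%}")   # 90%
    # A ratio near zero on a supposedly sequential workload would prompt
    # a look at pre-fetch settings and the Used Prefetches (%) metric.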
LUN IO Size distribution summary
This will enable you to determine
where your host IO sizes fit.
Right click LUN-2 in the Detail View,
select IO Size Distribution Summary, then in that view, you can select all
values by right clicking in the value
pane of the window on the left.
Right-click / Select All / Values
This is a histogram where each column represents IO in the range from that size up to the next size minus 1 block; e.g. in the view here, we see a value for reads that are 8KB and above, but lower than 16KB in size.
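A minimal sketch of that bucketing, assuming power-of-two column boundaries as described:

    # Sketch of the histogram bucketing described above: each column
    # covers IO from its size up to the next size minus one block.
    # Power-of-two boundaries are assumed for illustration.
    BUCKETS_KB = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

    def bucket_for(io_size_kb):
        """Return the largest column boundary <= the IO size."""
        label = BUCKETS_KB[0]
        for b in BUCKETS_KB:
            if io_size_kb >= b:
                label = b
        return label

    print(bucket_for(12))   # -> 8: a 12KB read lands in the 8KB column
    print(bucket_for(4))    # -> 4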
LUN Write Size and Full Stripe Writes
You can view these back in the Performance Detail view to see, over time, if coalesced writes are resulting in full stripe writes to the LUN. This indicates that write cache is working well and some writes are sequential. If no Full Stripe Writes are seen, writes to this LUN are more random and small, with little or no locality.
Looking at the average LUN IO size for read or write in the detail
view can be misleading as it will be an average and a low write IO rate
will not be accurately shown. You really need to use the IO
Distribution Summary for the LUN to see the IO distribution.
In the lower left pane, you can choose to show the I/O rate at each size. The default is I/O count, which means the total I/Os for this sample period.
Another method to detect sequential write access is comparing the disk write IO size with the LUN write IO size, i.e. cache coalesces smaller IOs into fewer, larger IOs when de-staging data.
Read hits are when a host read comes in and the data is already in cache. Remember also that pre-fetch activity may span sample periods, i.e. data pre-fetched in one period may not be read until the next period.
Disks – Average Seek Distance & Utilization
This will give an indication of data locality and whether the disk is working hard. Be aware that a disk that shows 100% utilized isn't necessarily bad, as that just indicates the disk was never idle during the sample, and utilization is reported from the SP showing the higher value (the other SP may have some more usage, up to 100% also).
You could look at the disk Average Busy Queue Length and compare it with the Queue Length. If the Average Busy Queue Length is bigger than the reported Queue Length, this may be an indication of bursty activity.
LUN & Disks write size
This can give an indication that
coalescing is taking place in write
cache such that disk writes are bigger
than the LUN writes.
If we see the LUN and disk write size
is the same, typically this implies the writes are very random and not
coalescing in cache to become larger
IO’s – or write cache is not being used.
LUN 50 will show this but LUN 2 doesn't – can you explain why? Tip: check the LUN 2 IO Distribution and write IO rate.
CLARiiON cache is great at optimizing back-end disk access, particularly benefiting the RAID 5 and RAID 6 options, which have write penalties associated with small-block random write activity.
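To make the write penalty concrete, here is a hedged back-of-envelope calculation using the commonly quoted small random write penalties (RAID 1/0 = 2, RAID 5 = 4, RAID 6 = 6 disk IOs per host write); actual back-end load depends on how well cache coalesces writes:

    # Back-of-envelope back-end disk load for small random writes, using
    # commonly quoted per-write penalties. Full stripe writes and cache
    # coalescing can reduce these substantially.
    WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}

    def backend_iops(read_iops, write_iops, raid):
        return read_iops + write_iops * WRITE_PENALTY[raid]

    print(backend_iops(300, 200, "raid5"))    # 300 + 200*4 = 1100 disk IOPs
    print(backend_iops(300, 200, "raid10"))   # 300 + 200*2 = 700 disk IOPs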
Cached writes may also result in bursty activity at the disk
level as write cache flushes data. The trick is to not let that activity lead you to think your host activity is bursty when it
isn’t.
Performance Overview View
Select the SP TAB in the performance detail view then right click on the
array;
<Select Performance Overview>
In the Overview view you can see more detailed properties of cache, together with 3 key statistics for the overall array – Throughput, Bandwidth and Dirty Pages. One particularly useful detail is the watermark settings, as they are not visible anywhere else when looking at an Analyzer NAR file away from the array it came from.
The cache states shown here are those in effect when the logger started the current nar/naz file, so be sure to determine actual settings from measured metrics.
Dirty pages on each SP indicate write cache is enabled at the array level.
Read cache hits and pre-fetch bandwidth are indicators that read cache is enabled.
Watermark settings are used to intelligently flush write cache pages
out to disks on the backend and keep a level of write cache available
for bursts of activity.
Raid Group / Thin Pool
Select the Storage Pool TAB in the Performance Detail view. Be careful not to get confused here, as the Raid Group and Thin Pool statistics are derived from disk statistics, not LUN statistics. Thus, the values will depend on all activity for all LUNs within that raid group, or for all Thin LUNs in a Thin Pool. Thin Pools aren't covered in any specific detail here, although disk statistics are logged and can be analyzed in the same way as for a regular Raid Group.
This reference was more for information as we’re not going to be looking at raid group statistics specifically for the exercise. The
Storage Pool TAB is the only method to analyze disk activity within a
Thin Pool. Thin Pools do have regular LUNs that are considered
private and hidden from view, including Analyzer.
The best overview of the cache configuration, and the only place you can see the watermark settings, is the Overview screen.
Don't be fooled by settings that may have been changed during the logging period though, i.e. it may show cache as enabled or disabled here, but you should verify that with the read cache statistics and dirty pages in the other views.
The 2 exercises have explored the options and views available to you, with some explanation and
guidance on what the statistics mean and how they help in characterizing your IO.
The following section, should you have time to look at it, will guide you through looking at the loads on a specific set of LUNs sharing the same set of disks.
You have explored the views; now the task is to analyze a specific area where we have an issue. Please explore the interface and look at the following attributes for this load on the array, with a focus on Raid Group 0, LUNs 50 & 51. Look at the following for each of these LUNs and see if you can draw any conclusions (make notes on the worksheet table at the back of this handout):
LUN read and write throughput
LUN read and write IO size
LUN Response Time
LUN Queue Length
LUN IO sizes
Disk read and write throughput
Disk seek distance
Disk IO sizes
Disk Queue Length
Disk response time
Define the IO profile associated with each LUN and think about what the applications could be. There is an area where we do have an issue that we want to fix.
Hint: one of them is doing large sequential reads – it could be a video or data warehousing application.
Look at the profiles for these two LUNs…. How are they different and what would be a suggestion on improving performance?
Using this hint, think about what helps sequential operations and what could also hurt them.
If it is sequential – are we seeing pre-fetching and a high pre-fetch used rate? We do need to know what the application is trying to do, then correlate that with the characterized IO on the storage system to see if it is doing what we think it should be doing.
So, what about the other LUNs that showed up with red and/or yellow boxes? As explained during the overview, the red/yellow boxes give an indication of where to investigate and are not indicative of an absolute problem.
Raid Group 4 has multiple LUNs with different IO characteristics. Check some of the metrics and see if you can conclude anything about
this raid group. Don’t spend too much time on this task as the prior
elements focusing on LUNs 50 and 51 are the main points of this exercise. If you have time, you might want to take some notes on
LUNs 2 and 3 statistics for reference (are they busy? Is that bad?)
Additional reference notes
Although these exercises cover the performance statistics from hosts accessing LUNs in the array, there may be additional load generated internally to the array. This load could include the operations listed below.
Typically, layered application IO will not be logged at the LUN level but
will be visible at the disk level. If you understand what is taking place, once
accustomed to the user interface and
operational characteristics of layered
applications, you can determine what disk activity relates to host access or
internally generated IO.
Sometimes, you will observe a blip in the statistics i.e. a value for a statistic
outside normal range. To overcome
this being a nuisance, you can either restart the plotting or adjust the
scaling of the graph plots by zooming
in or setting the Chart Configuration
Axes options in the graph view.
Real-time viewing of Analyzer
statistics isn’t the preferred method
due to the requirement to be there at the time, as well as the additional
impact to the array in presenting the
information. Typically you would
look at a captured Analyzer NAR file as covered in Exercise-A.
SnapView Snapshot sessions
SnapView clones
MirrorView/S activity
MirrorView/A activity
SAN Copy activity
Background zeroing for bind operations
Background verifying
Raid Group rebuild activity
Hot spare equalize activity
LUN migration operations
In the metric selection window, since Release-26 of code, you will see options for Optimal and Nonoptimal metrics for LUNs. These are used when you have LUNs using the ALUA failover mode (mode=4) of operation. Typically, you would see Optimal values when a LUN is accessed from the current owning SP, and Nonoptimal values when it is accessed from the non-owning SP, which is a slightly longer path. When not running in ALUA mode, selecting either the regular metric or the metric-Optimal will display the same values.
Supplemental, Command Line NAR file retrieval and export capabilities
This is to direct you to the capabilities for scripting NAR file retrieval for lights-out performance statistics gathering. As the Navisphere archive file collects data covering the previous 5 hours of statistics, the capability to script retrieval of the NAR file is useful when you want statistics for a period of activity and you're unable to retrieve the file in the normal way using the Navisphere GUI, e.g. statistics logged on Saturday would need to be retrieved sometime Sunday or they would be overwritten by Monday. With the release of revision 24 and later revisions of code, the Analyzer Archive facility allows automatic archiving of Analyzer files on the array itself for later retrieval via the GUI or CLI, retained for a much longer period than the previous 5 hours (or 25 hours for older code archives). Remember though that if Periodic Archiving is not enabled, you will only grab the prior 5 hours of data by default when you retrieve the archive.
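A hedged sketch of scripting that retrieval around the naviseccli command shown below; the destination path, scheduling and security-file setup are assumptions for illustration, and where retrieved files land may vary by release (check the Admin Guide for a path option):

    # Sketch of lights-out NAR retrieval wrapping naviseccli. Assumes
    # naviseccli is on PATH and a security file holds the credentials
    # (otherwise add -user/-password/-scope as shown in this section).
    import datetime
    import pathlib
    import subprocess

    def retrieve_archives(sp_ip, dest="C:/nar_archive"):
        day_dir = pathlib.Path(dest) / datetime.date.today().isoformat()
        day_dir.mkdir(parents=True, exist_ok=True)
        # -all pulls every archive on the SP and can take a long time;
        # retrieved files are assumed to land in the working directory.
        subprocess.run(
            ["naviseccli", "-address", sp_ip, "analyzer", "-archive", "-all"],
            cwd=day_dir, check=True)

    # Run from a scheduled task (e.g. Sunday) so Saturday's statistics
    # are captured before they age out of the on-array repository.
    retrieve_archives("10.0.0.1")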
<Ensure the NaviCLI utility is installed. This is easily verified if the Navisphere CLI directory is present. Here you can double click the shortcut on the desktop called NaviCLI>
This will start a command window that will go to the default installation
directory c:\Program
Files\EMC\Navisphere CLI
(Username and password will be
emcw for the following commands)
<Retrieve the Navisphere archive files using the following command:
"naviseccli -user <username> -password <password> -scope 0 -address <SP IP> analyzer -archive -all">
Be careful with this command as you may have many archives to download when selecting all, and it could take a long time to complete. By omitting the "-all" you are presented with a selection list where you can select one or more archives to retrieve.
Scope will be 1 if the account details used are local and not global.
Do not do this here, but you can reset the statistical data by using the following command if you are looking to collect data for a specific test period only and you are not interested in previously collected data:
"naviseccli -user <username> -password <pwd> -scope <0|1> -address <SP IP> analyzer -logging -reset"
The username and password can be omitted if you have set up the security file for naviseccli.
The username used here does not have privileges to reset data logging on the arrays being used.
Note: the desktop shortcut used here is not created for you during installation. You have to do that yourself if you want that shortcut available on your own systems.
Prior to release 24 you would need to use the java archiveretrieve command to get the archive from the array:
"java -jar archiveretrieve.jar -User <username> -Password <password> -Scope 0 -Address <array IP> -File archive_emc.nar -Location "C:\program files\emc\Navisphere cli" -Overwrite 1 -Retry 2 -v"
Now you can follow these steps and open the retrieved NAR file using the on-array or off-array capability, or you can convert the NAR file data to CSV format for import into Excel. You can use the following command to do this:
"naviseccli -user <username> -password <password> -scope 0 -address <SP IP> analyzer -archivedump -data test.nar -out test.csv -object s,l,d"
You can also filter the output to only get specific statistics, like read throughput, by adding the qualifier -format rio.
This example outputs stats for SPs, LUNs and disks (-object s,l,d). To get stats for metaLUNs, etc., please consult the document Navisphere Analyzer Administrator's Guide.pdf.
If Excel format is required, you can use the archivedump command to convert the NAR file data to a format readable by Excel, typically CSV.
Some more qualifiers for the -format option are as follows (separate with a comma if used):
Utilization (%) u
Response Time (ms) rt
Dirty Pages (%) dp
For other qualifiers please consult the Admin Guide.
If you leave off the -object qualifier, it will output all statistics for all objects.
The Navisphere UI has an Analyzer dump wizard that guides you
through device and attributes selection prior to dumping to a CSV file.
Start Excel, select "open file", browse to the c:\Program Files\EMC\Navisphere CLI directory, select the file type as CSV, then open the test.csv file you created in the last step to view the statistics as presented in Excel.
If using Excel 2007, use the INSERT TAB to display graphing options.
If you’re not too familiar with Excel
but would like to plot a graph showing
one of the statistics over time, you can easily do this by selecting a column by
clicking on the header letter, then
once the column is highlighted, click on the chart wizard icon in the tool
bar, select line as chart type, then
click next to see what the chart would
look like. You can then customize it as required.
Please note that each device selected in the dump command, like SPs and LUNs, will be listed down the left column, so selecting an entire column to plot would actually plot all SP stats followed by all LUN stats, and so on. You would need to be more selective and manipulate the data when plotting graphs in a logical manner.
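If you prefer scripting to Excel, a hedged sketch of filtering one object's rows out of an archivedump CSV before plotting; the column name is an assumption, so check the header row of your own dump file:

    # Sketch: pull one object's rows from an archivedump CSV so a chart
    # covers a single SP or LUN rather than all devices stacked in one
    # column. The "Object Name" column header is an assumption.
    import csv

    def rows_for_object(csv_path, object_name, name_col="Object Name"):
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                if row.get(name_col, "").strip() == object_name:
                    yield row

    for row in rows_for_object("test1.csv", "SP A"):
        print(row)   # one SP's samples, ready to chart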
You can try the archivedump command and be more specific with some of the qualifiers shown previously. You could also have a go at the dump wizard from the Analyzer drop-down in the Navisphere Manager GUI – using off-array Navisphere.
"naviseccli -user <username> -password <password> -scope 0 -address <SP IP> analyzer -archivedump -data test1.nar -out test1.csv -object s -format u,dp"
This will output test1.csv containing SP statistics of utilization and write cache dirty pages.
Another option is archivemerge, used to merge multiple NAR files together. We don't use that in this session, but remember that this is useful if you want to view data access trends that span more than the typical NAR file coverage of 5 hours. It is not necessary to merge NAR files from both SPs, as each SP has the same data.
The array-based archivedump wizard provides an easy way to dump specific statistics associated with individual devices, rather than using the CLI method shown above.
With either on-array or off-array,
select Tools, Analyzer, Archive, Dump
Then select where the source file is
located and follow the wizard to select
objects to dump and what statistics you require.
Supplemental, Thin LUN Analysis
This is to highlight some differences in metrics available for Thin LUNs in a CLARiiON
environment and the way in which we view them.
There is a read/write load running to LUN201 on the array. This is a Thin LUN provisioned
from a pool of 3 disks.
Thin Pools in a CLARiiON have a private structure that isn’t visible in the Navisphere
interface. This structure has private LUNs that Thin LUNs utilize in 1GB increments. With the
experience gained from the primary exercises you can take a look at the active Thin LUN and
how to observe IO to both it and the Thin Pool disks.
Check Thin LUN properties.
<Right click on LUN 201 and select Properties>
<Here you can see the Pool properties that this LUN is serviced
from and the Thin LUN virtual size
and actual consumed capacity from
the pool>
<When selecting the Thin LUN, you do not see cache operations
associated with that LUN as these are
associated with the private LUNs
servicing the IO to the Pool and those are hidden from view >
<Unlike regular LUNs in the LUN TAB view, you only see the SP a Thin
LUN is assigned to. To see the disks
servicing the Thin LUN and its Pool,
you have to select the Storage Pool TAB >
<Select the Storage Pool TAB> <In the view, you can expand the Pool
to see the disks servicing the total
Pool load. You cannot see the private
LUNs that are hidden in the Pool>
End of exercises
These are metrics you will not see when selecting Thin LUNs to analyze. This may change in a future release, but for now you have to look at the Thin Pool disk characteristics to determine what's happening in the Pool as a whole. Regular IO metrics like throughput, bandwidth, and response time are available for each Thin LUN.
There are no specific instructions on what to investigate here, although if you have time, compare the disks within the Pool and how those align with the Thin LUN characteristics.
Supplemental notes
The following operations are executed at the disk level to provide data integrity features associated with
redundant RAID types as well as consistency of data stripes that could be at risk due to media issues.
Background zero: Before user data can be written to the physical disks within a LUN, the area has to undergo a zero operation. New disks are initially supplied in a zero state where data can be written to the disks immediately after binding LUNs; however, if the disks have been used before, i.e. bound and unbound, they have to be re-zeroed.
You can zero the disks using a naviseccli command in readiness for grouping and binding LUNs later on, or the array will zero the disks when you create new LUNs on them. This zero operation results in 512KB SCSI write-same commands to the disks in a sequential manner, unless the array has to zero-on-demand an area the user is writing to that is in the queue but hasn't been zeroed yet. There is some other small activity on the disks during zeroing as checkpoint operations keep track of progress. Typically, with no access to the LUNs, any zeroing will complete in a matter of a few hours, although a busy array and activity to the disks being zeroed will delay completion. Also, the 512KB write-same command will not consume back-end bandwidth but will affect disk load and utilization.
Background verify: This operation validates the consistency of data protection at the disk level and is automatically performed on newly created LUNs. The IO profile at the disk level is 64KB reads and, like zeroing, it can take hours to complete and is also governed by array and disk activity.
Background zero, zero-on-demand, and background verify operations exhibit relatively large IO sizes that can affect one's analysis of the array. Also, if considering user testing, it's worth noting these operations may affect the performance the array can present, due to the parallel action of user data access and these preliminary operations.
Also be aware these operations run in a sequential manner for any given raid group (RG), e.g. if you bind 5 LUNs (0 through 4) on an RG, LUN 0 will start to zero and, when complete, will perform a background verify. This is followed by the second LUN in that RG. Each LUN will zero then verify until all newly created LUNs complete that process. Thereafter, the only regular IO you will see at the disk level due to internal operations will be SNiiFFER, where you will see approximately 1 IO per second, 512KB in size, to each disk in an RG. SNiiFFER is a data checking operation that cycles through every block in every LUN in the array to ensure data availability, even for data you might not have touched for months or years. Any data inconsistency detected through SNiiFFER will automatically invoke recovery and remap of affected blocks. RGs will run through zero, verify and SNiiFFER operations independently of each other. Zeroing will have the most effect on performance, so consider this when testing. Verify may have a small effect and SNiiFFER will have a negligible effect on performance.
Always check disk stats to see what IO sizes are taking place at that level. With an RG idle, disk activity showing 512KB writes indicates zeroing, 64KB reads indicate verifying, and 512KB reads indicate sniffing.
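Those IO-size signatures lend themselves to a quick lookup; a sketch of the rules in this paragraph, not a diagnostic tool:

    # Quick lookup of the idle-RG disk IO signatures listed above.
    # A sketch of this paragraph's rules, not a diagnostic tool.
    def classify_idle_disk_io(op, size_kb):
        if op == "write" and size_kb == 512:
            return "background zeroing (SCSI write-same)"
        if op == "read" and size_kb == 64:
            return "background verify"
        if op == "read" and size_kb == 512:
            return "SNiiFFER data check (~1 IO/s per disk)"
        return "host or other internal activity"

    print(classify_idle_disk_io("read", 512))    # SNiiFFER
    print(classify_idle_disk_io("write", 512))   # background zeroing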
{end}
Worksheet – use as needed during exercises.
LUN ID 50 51
Owner SP
SP Utilization
LUN Read IOPs
LUN Read size
LUN Write IOPs
LUN Write size
LUN Read MB/s
LUN Write MB/s
LUN response time
LUN Queue
Disk Read IOPs
Disk Read size
Disk Write IOPs
Disk Write size
Disk Queue
Average disk seek
Disk response time
LUN ID
Owner SP
SP Utilization
LUN Read IOPs
LUN Read size
LUN Write IOPs
LUN Write size
LUN Read MB/s
LUN Write MB/s
LUN response time
LUN Queue
Disk Read IOPs
Disk Read size
Disk Write IOPs
Disk Write size
Disk Queue
Average disk seek
Disk response time