
    2009 IBM Corporation

    XIV Education

    XIV System Architecture


    Overview

    Phase 10 Features and Capabilities

    Gen II Systems Hardware Design and Layout

    XIV Software Framework

    XIV Systems Management

The XIV Storage System architecture incorporates a variety of features designed to uniformly distribute data across key internal resources. This unique data distribution method fundamentally differentiates the XIV Storage System from conventional storage subsystems and yields numerous availability, performance, and management benefits across both the physical and logical elements of the system.


    Session I: Phase 10 Features and Capabilities

    System Components

    Architectural design

    Grid Architecture

    Storage Virtualization and Logical Parallelism

    Logical System Concepts

    Usable Storage Capacity

    Storage Pool Concepts

    Capacity Allocation and Thin Provisioning

The XIV Storage System architecture is designed to deliver performance, scalability, and ease of management while harnessing the high capacity and cost benefits of SATA drives. The system employs off-the-shelf components, in contrast to traditional offerings that rely on more expensive components using proprietary designs.


    System Components

The XIV Storage System comprises the following components:

Host Interface Modules: six modules, each containing 12 SATA disk drives

Data Modules: nine modules, each containing 12 SATA disk drives

A UPS module complex made up of three redundant UPS units

Two Ethernet switches and an Ethernet switch Redundant Power Supply (RPS)

A Maintenance Module

An Automatic Transfer Switch (ATS) for external power supply redundancy

A modem, connected to the Maintenance Module, for externally servicing the system

All the modules in the system are linked through an internal, redundant Gigabit Ethernet network that enables maximum bandwidth utilization and is resilient to at least any single component failure. The system and all of its components come pre-assembled and wired in a lockable rack.


System Components (cont.)

    Hardware elements

The primary components of the XIV Storage System are known as modules. Modules provide processing, cache, and host interfaces and are based on standard Intel/Linux systems. They are redundantly connected to one another via an internal switched Ethernet fabric. All of the modules work together concurrently as elements of a grid architecture, and therefore the system harnesses the powerful parallelism inherent to a distributed computing environment.

    Data Modules

At a conceptual level, the Data Modules function as the elementary building blocks of the system, providing physical capacity, processing power, and caching, in addition to advanced system-managed services that make up the system's internal operating environment. The equivalence of hardware across Data Modules and their ability to share and manage system software and services are key elements of the physical architecture, as shown in the slide.

    Interface Modules

Fundamentally, Interface Modules are equivalent to Data Modules in all respects, with the following exceptions:

1. In addition to disk, cache, and processing resources, Interface Modules include both Fibre Channel and iSCSI interfaces for host system connectivity as well as remote mirroring.

2. The system services and software functionality associated with managing external I/O reside exclusively on the Interface Modules.

The slide conceptually illustrates the placement of Interface Modules within the topology of the IBM XIV Storage System architecture.

    Ethernet switches

The XIV Storage System contains a redundant switched Ethernet fabric that carries both data and metadata traffic between the modules. Traffic can flow in the following ways:

    Between two Interface Modules

    Between an Interface Module and a Data Module

    Between two Data Modules


    Architectural Design

    Massive Parallelism

Workload balancing

Self-Healing

    True virtualization

    Thin provisioning

[Figure: grid of Interface Modules and Data Modules interconnected through redundant switching]

    Massive Parallelism

The system architecture ensures full exploitation of all system components. Any I/O activity involving a specific logical volume in the system is always inherently handled by all spindles. The system harnesses all storage capacity and all internal bandwidth, and it takes advantage of all available processing power. This is equally true for host-initiated I/O activity as it is for system-initiated activity such as rebuild processes and snapshot generation. All disks, CPUs, switches, and other components of the system contribute at all times.

    Workload balancing

The workload is evenly distributed over all hardware components at all times. All disks and modules are utilized equally, regardless of access patterns. Even though applications may access certain volumes, or certain parts of a volume, more frequently than others, the load on the disks and modules remains balanced.

Pseudo-random distribution ensures consistent load balancing even after adding, deleting, or resizing volumes, as well as adding or removing hardware. This balancing of all data across all system components eliminates the possibility of a hot spot being created.

    Self-Healing

Protection against double disk failure is provided by an efficient rebuild process that brings the system back to full redundancy in minutes. In addition, the XIV Storage System extends the self-healing concept, restoring redundancy even after failures in components other than disks.

    True virtualization

Unlike other system architectures, storage virtualization is inherent to the basic principles of the XIV Storage System design. Physical drives and their locations are completely hidden from the user. This dramatically simplifies storage configuration, letting the system lay out the user's volume in the optimal way. The automatic layout maximizes the system's performance by leveraging system resources for each volume, regardless of the user's access patterns.

    Thin provisioning

The system enables thin provisioning: the capability to allocate storage to applications on a just-in-time, as-needed basis, allowing significant cost savings compared to traditional provisioning techniques. The savings are achieved by defining a logical capacity that is larger than the physical capacity. This capability allows users to improve storage utilization rates, thereby significantly reducing capital and operational expenses by allocating capacity based on total space consumed rather than total space allocated.


    Grid Architecture

    Relative effect of the loss of a given computing resource is

    minimized

    All modules are able to participate equally in handling the

    total workload

Modules consist of standard off-the-shelf components

Computing resources can be dynamically changed, in both capacity and performance:

    By scaling out

    By scaling up

    IBM XIV Storage System grid overview

    The XIV Grid design entails the following characteristics:

Both Interface Modules and Data Modules work together in a distributed computing sense. In other words, although Interface Modules have additional interface ports and assume some unique functions, they otherwise contribute to system operations equally with Data Modules.

    The modules communicate with each other via the internal, redundant Ethernet network.

    The software services and distributed computing algorithms running within the modules collectively manage allaspects of the operating environment.

    Design principles

    The XIV Storage System grid architecture, by virtue of its distributed topology and standard Intel/Linux building-block components, ensures that the following design principles are possible:

    Performance: The relative effect of the loss of a given computing resource, or module, is minimized.

    Performance: All modules are able to participate equally in handling the total workload.

This is true regardless of access patterns. The system architecture enables excellent load balancing, even if certain applications access certain volumes, or certain parts within a volume, more frequently.

Openness: Modules consist of standard off-the-shelf components.

Because components are not specifically engineered for the subsystem, the resources and time required for development with newer hardware technologies are minimized. This, coupled with the efficient integration of computing resources into the grid architecture, enables the subsystem to rapidly adopt the newest hardware technologies available without the need to deploy a whole new subsystem.

    Upgradability and scalability: Computing resources can be dynamically changed:

Scaled out by adding new modules to accommodate both new capacity and new performance demands, or even by tying together groups of modules.

    Scaled up by upgrading modules.


    Storage Virtualization and Logical Parallelism

    Easier volume management

    Consistent performance and scalability

    High availability and data integrity

    Flexible snapshots

    Data migration efficiency

    Pseudo-random algorithm

    Modular software design

    Flexible snapshots

Full storage virtualization incorporates snapshots that are differential in nature; only updated data consumes physical capacity.

Many concurrent snapshots are supported (up to 16,000 volumes and snapshots can be defined). Note that this is possible because a snapshot uses physical space only after a change has occurred on the source.

Multiple snapshots of a single master volume can exist independently of each other.

Snapshots can be cascaded, in effect creating snapshots of snapshots.

Snapshot creation/deletion does not require data to be copied, and hence occurs immediately.

As updates occur to master volumes, the system's virtualized logical structure enables it to elegantly and efficiently preserve the original point-in-time data associated with any and all dependent snapshots by simply redirecting the update to a new physical location on disk. This process, referred to as redirect on write, occurs transparently from the host perspective by virtue of the virtualized remapping of the updated data, and it minimizes any performance impact associated with preserving snapshots, regardless of the number of snapshots defined for a given master volume.

Because snapshots use redirect on write and do not necessitate data movement, the size of a snapshot is independent of the source volume size.

Data migration efficiency

XIV supports thin provisioning. When migrating from systems that only support


Logical System Concepts - Distribution Algorithm

    Each volume is spread across all drives

    Data is cut into 1MB Partitions and stored on the disks

XIV's distribution algorithm automatically distributes partitions across all disks in the system pseudo-randomly

[Figure: a volume's partitions spread across all Interface and Data Modules through the redundant switched fabric]

XIV disks behave like connected vessels, as the distribution algorithm aims for constant disk equilibrium. Thus, XIV's overall disk spindle usage approaches 100% in all usage scenarios.


Logical System Concepts - Distribution Algorithm (Cont.)

    Data distribution only changes when the system changes

    Equilibrium is kept when new hardware is added

    Equilibrium is kept when old hardware is removed

    Equilibrium is kept after a hardware failure



    Logical System Concepts - Partitions

    Data distribution only changes when the system changes

    Equilibrium is kept when new hardware is added

    Equilibrium is kept when old hardware is removed

    Equilibrium is kept after a hardware failure

[Figure: data redistribution across Modules 1-4 after a hardware upgrade adds Module 4]

    Logical constructs

    The XIV Storage System logical architecture incorporates constructs that underlie the

    storage virtualization and distribution of data, integral to its design. The logical structure of

    the subsystem ensures there is optimum granularity in the mapping of logical elements to

    both modules and individual physical disks, thereby guaranteeing an ideal distribution of

    data across all physical resources.

    Partitions

The fundamental building block of logical volumes is known as a partition. Partitions have the following characteristics:

All partitions are 1MB (1024 KB) in size.

A partition contains either a primary copy or a secondary copy of data:

Each partition is mapped to a single physical disk.

This mapping is dynamically managed by the system via a proprietary pseudo-random distribution algorithm in order to preserve data redundancy and equilibrium. The storage administrator has no control over, or knowledge of, the specific mapping of partitions to drives.

Secondary partitions are always placed onto a physical disk that does not contain the primary partition.

In addition, secondary partitions are also placed in a module that does not contain the corresponding primary partition.
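The sketch below illustrates the placement rule just described: each 1MB partition gets a primary and a secondary copy, and the secondary never lands in the module that holds the primary. The disk/module counts match a full rack, but the hash-based choice is only a stand-in for the idea of a pseudo-random, system-managed mapping; it is not XIV's actual algorithm.

```python
# Hypothetical sketch of primary/secondary partition placement (not XIV's real code).
import hashlib

MODULES = 15
DISKS_PER_MODULE = 12

def _pick(volume_id: int, partition_no: int, copy: int, exclude_module: int = -1):
    """Deterministically pick a (module, disk) slot, optionally skipping one module."""
    candidates = [(m, d) for m in range(MODULES) if m != exclude_module
                  for d in range(DISKS_PER_MODULE)]
    digest = hashlib.sha256(f"{volume_id}:{partition_no}:{copy}".encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]

def place_partition(volume_id: int, partition_no: int):
    primary = _pick(volume_id, partition_no, copy=0)
    secondary = _pick(volume_id, partition_no, copy=1, exclude_module=primary[0])
    return primary, secondary

# Example: the two copies of partition 7 of volume 42 always sit in different modules.
p, s = place_partition(42, 7)
assert p[0] != s[0]
```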


Logical System Concepts - Slices

XIV data distribution architecture uses Slices for partition copies

Slices are spread across all disk drives in the system

Each slice has two copies: a Primary copy and a Secondary copy

There are 16,384 slices (times 2 copies, for a total of 32,768 slice copies)

Each disk holds approx. 182 slice copies [(16,384 x 2)/180]

A Primary Slice and its Secondary Slice will never reside on the same module

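The slice arithmetic on this slide, written out as a small check (numbers assume a full 180-drive rack, as quoted above):

```python
# Worked version of the slice arithmetic: 16,384 slices, each with a primary and
# a secondary copy, spread over the 180 disks of a full rack.
SLICES = 16_384
COPIES = 2
DISKS = 180

slice_copies = SLICES * COPIES            # 32,768 slice copies in total
per_disk = slice_copies / DISKS           # ~182 slice copies per disk
print(slice_copies, round(per_disk))      # 32768 182
```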


Logical System Concepts - Slices (Cont.)

When creating a Volume, it must span across all drives in the system

The minimum size of a volume is 17GB

1MB partition = 2^20 bytes (1,048,576); 16,384 partitions x 1MB = ~17GB

Each Partition is numbered with a Logical Partition number for its specific Volume

When creating another volume that is bigger than the minimum 17GB, the system allocates several 17GB chunks for it

In this example it is a 51GB Volume built from 3 x 17GB chunks

Logical Partition numbers are also assigned here

Numbering is always modulo 16,384

[Figure: logical partition numbering across Modules 1 and 2 for a 17GB volume and for a 51GB volume built from three 17GB chunks, with chunk offsets +0, +16384, and +32768]
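A short sketch of the sizing rules above: a volume is built from 17GB chunks of 16,384 x 1MB partitions, and logical partition numbers continue across chunks (equivalently, they wrap modulo 16,384 within each chunk). The helper name is illustrative only.

```python
# Sketch of 17GB chunk allocation and logical partition numbering.
PARTITION_BYTES = 2 ** 20                               # 1MB partition
PARTITIONS_PER_CHUNK = 16_384
CHUNK_BYTES = PARTITION_BYTES * PARTITIONS_PER_CHUNK    # 17,179,869,184 bytes ~= 17GB

def chunks_for(requested_gb: float) -> int:
    """Number of 17GB chunks allocated for a requested volume size (decimal GB)."""
    requested_bytes = int(requested_gb * 10 ** 9)
    return -(-requested_bytes // CHUNK_BYTES)           # ceiling division

print(CHUNK_BYTES / 10 ** 9)    # ~17.18 decimal GB -- the "17GB" increment
print(chunks_for(17))           # 1 chunk for the minimum-size volume
print(chunks_for(51))           # 3 chunks for the 51GB volume in the example

# Logical partition 20,000 of the 51GB volume sits in chunk 1 (offset +16384) and
# corresponds to slice number 20000 % 16384 = 3616 within that chunk.
logical_partition = 20_000
print(logical_partition // PARTITIONS_PER_CHUNK, logical_partition % PARTITIONS_PER_CHUNK)
```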


Logical System Concepts - Logical Volumes

Every logical volume consists of 1MB pieces of data

Interface modules will not hold partition copies from other interface modules

This is called FC Proof

Implemented because Interface modules are more prone to problems than Data modules

The physical capacity associated with a logical volume is always a multiple of 17GB

The maximum number of volumes that can be concurrently defined on the system is 4,605

The same address space is used for both volumes and snapshots and permits 16,377 addresses

Logical volumes are administratively managed within the context of Storage Pools

Logical volumes

The XIV Storage System presents logical volumes to hosts in the same manner as conventional subsystems; however, both the granularity of logical volumes and the mapping of logical volumes to physical disks are fundamentally different.

As discussed previously, every logical volume consists of 1MB (1024KB) pieces of data known as partitions.

The physical capacity associated with a logical volume is always a multiple of 17GB (decimal). Therefore, while it is possible to present a block-designated (discussed in module 3) logical volume to a host that is not a multiple of 17GB, the maximum physical space that is allocated for the volume will always be the sum of the minimum number of 17GB increments needed to meet the block-designated capacity. Note that the initial physical capacity actually allocated by the system upon volume creation may be less than this amount.

The maximum number of volumes that can be concurrently defined on the system is limited by:

1. The logical address space limit:

The logical address range of the system permits up to 16,377 volumes, although this constraint is purely logical and therefore should not normally be a practical consideration.

Note that the same address space is used for both volumes and snapshots.

2. The limit imposed by the logical and physical topology of the system for the minimum volume size:

The physical capacity of the system, based on 180 drives with 1TB of capacity per drive and assuming the minimum volume size of 17GB, limits the maximum volume count to 4,605 volumes. Again, since volumes and snapshots share the same address space, a system with active snapshots can have more than 4,605 addresses assigned collectively to both volumes and snapshots.

Logical volumes are administratively managed within the context of Storage Pools.

Since the concept of Storage Pools is administrative in nature, they are not part of the logical hierarchy inherent to the system's operational environment.
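A hedged back-of-envelope check (not an official formula) of how the 4,605 limit relates to figures quoted elsewhere in this deck: dividing the net usable capacity (~79,113 GB decimal, from the "Usable Storage Capacity" slide) by the real size of one 17GB increment (16,384 x 1MB) lands at roughly 4,605 minimum-size volumes.

```python
# Back-of-envelope consistency check, assuming the deck's own capacity figures.
USABLE_GB = 79_113                                   # decimal GB, quoted later in this deck
INCREMENT_BYTES = 16_384 * 2 ** 20                   # one "17GB" increment = ~17.18 decimal GB

max_min_size_volumes = USABLE_GB * 10 ** 9 / INCREMENT_BYTES
print(round(max_min_size_volumes))                   # ~4605 minimum-size (17GB) volumes
print(16_377)                                        # independent logical address-space limit
```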


Logical System Concepts - Volume Layout

The Partition Table maps between a logical partition number and the physical location on disk

The distribution algorithms seek to preserve the statistical equality of access among all physical disks

Each volume is allocated across at least 17GB (decimal) of capacity that is distributed evenly across all disks

Each disk has its data mirrored across all other disks, excluding the disks in the same module

The storage system administrator does not plan the layout of volumes on the modules

There are no unusable pockets of capacity, known as orphaned spaces

Upon component failure a new Goal Distribution is created

All disks participate in the enforcement and therefore rapidly return to full redundancy

Logical volume layout on physical disks

The XIV Storage System facilitates the distribution of logical volumes over disks and modules by means of a dynamic relationship between primary data copies, secondary data copies, and physical disks. This virtualization of resources in the XIV Storage System is governed by a pseudo-random algorithm.

Partition table

Mapping between a logical partition number and the physical location on disk is maintained in a Partition Table. The Partition Table maintains the relationship between the partitions that comprise a logical volume and their physical locations on disk.

Volume layout

At a high level, the data distribution scheme is an amalgam of mirroring and striping. While it is tempting to think of this in the context of RAID 1+0 (10) or 0+1, the low-level virtualization implementation precludes the use of traditional RAID algorithms in the architecture. This is because conventional RAID implementations cannot incorporate dynamic, intelligent, and automatic management of data placement based on knowledge of the volume layout, nor is it feasible for a traditional RAID system to span all drives in a subsystem, due to the vastly unacceptable rebuild times that would result.

Partitions are distributed on all disks using what is defined as a pseudo-random distribution function. The distribution algorithms seek to preserve the statistical equality of access among all physical disks under all conceivable real-world aggregate workload conditions and associated volume access patterns. Essentially, while not truly random in nature, the distribution algorithms in combination with the system architecture preclude the occurrence of the phenomenon traditionally known as hot spots.

The XIV Storage System contains 180 disks, and each volume is allocated across at least 17GB (decimal) of capacity that is distributed evenly across all disks.

Each logically adjacent partition on a volume is distributed across a different disk; partitions are not combined into groups before they are spread across the disks.

The pseudo-random distribution ensures that logically adjacent partitions are never striped sequentially across physically adjacent disks.

Each disk has its data mirrored across all other disks, excluding the disks in the same module.

Each disk holds approximately 1% of any other disk in other modules.
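A minimal sketch of the Partition Table idea described above: a per-volume map from logical partition number to the physical location of its primary and secondary copies. The record layout is illustrative only; the real table is internal to the system and never exposed to the storage administrator.

```python
# Hypothetical, simplified partition-table structure (illustration only).
from dataclasses import dataclass

@dataclass(frozen=True)
class PhysicalLocation:
    module: int      # 1..15 in a full rack
    disk: int        # 1..12 within the module
    offset_mb: int   # 1MB-aligned offset on the disk

# partition_table[volume_id][logical_partition_no] -> (primary, secondary)
partition_table: dict[int, dict[int, tuple[PhysicalLocation, PhysicalLocation]]] = {
    7: {
        0: (PhysicalLocation(module=4, disk=3, offset_mb=120),
            PhysicalLocation(module=11, disk=9, offset_mb=88)),   # different module
    }
}

primary, secondary = partition_table[7][0]
assert primary.module != secondary.module    # mirroring rule from the slide
```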


Logical System Concepts - Snapshots

A snapshot represents a point-in-time copy of a Volume

Snapshots are governed by almost all of the principles that apply to Volumes

Snapshots incorporate dependent relationships with their source volumes

Sources can be either logical volumes or other snapshots

A given partition of a primary volume and its snapshot are stored on the same disk

A write to this partition is redirected within the module, minimizing latency and utilization between modules

As updates occur to master volumes, the system's virtualized logical structure enables it to preserve the original point-in-time data

Snapshots

A snapshot represents a point-in-time copy of a Volume. Snapshots are governed by almost all of the principles that apply to Volumes. Unlike Volumes, snapshots incorporate dependent relationships with their source volumes, which can be either logical volumes or other snapshots. Because they are not independent entities, a given snapshot does not necessarily consist wholly of partitions that are unique to that snapshot. Conversely, a snapshot image will not share all of its partitions with its source volume if updates to the source occur after the snapshot was created.

Volumes and snapshots

Volumes and snapshots are mapped using the same distribution scheme.

A given partition of a primary volume and its snapshot are stored on the same disk drive.

As a result, a write to this partition is redirected within the module, minimizing the latency and utilization associated with additional interactions between modules.

As updates occur to master volumes, the system's virtualized logical structure enables it to elegantly and efficiently preserve the original point-in-time data associated with any and all dependent snapshots by simply redirecting the update to a new physical location on the disk. This process, referred to as redirect on write, occurs transparently from the host perspective by virtue of the virtualized remapping of the updated data, and minimizes any performance impact associated with preserving snapshots, regardless of the number of snapshots defined for a given master volume.
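A toy redirect-on-write model of the behaviour just described: a snapshot initially just points at the master's partitions; when the master is updated, the new data is written to a fresh location and only the master's pointer moves, so the snapshot keeps seeing the original point-in-time data. This is purely illustrative; the real XIV metadata structures are far more involved.

```python
# Toy redirect-on-write sketch (illustration only).
class Volume:
    def __init__(self):
        self.partitions = {}          # logical partition no -> physical "location"

    def snapshot(self):
        snap = Volume()
        snap.partitions = dict(self.partitions)   # copy pointers only, no data moved
        return snap

    def write(self, partition_no, data, allocator):
        new_location = allocator(data)            # redirect: write to a new place
        self.partitions[partition_no] = new_location

storage = []
def allocator(data):
    storage.append(data)
    return len(storage) - 1                       # index acts as the physical location

master = Volume()
master.write(0, "v1", allocator)
snap = master.snapshot()                          # instantaneous, metadata only
master.write(0, "v2", allocator)                  # redirected to a new location

print(storage[snap.partitions[0]])                # 'v1' -- snapshot unchanged
print(storage[master.partitions[0]])              # 'v2' -- master sees the update
```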


Logical System Concepts - Snapshots

As the host writes data, it is placed randomly across the system in 1MB chunks

Each Server has pointers in memory to the disks that hold the data locally

On a snapshot, each Server simply points to the original volume; it is a memory-only operation

Snapshot creation/deletion is instantaneous

Restore Volume from snapshot copy

[Figure: host logical view of a Volume and its Snapshot versus the XIV physical view of the data spread across the Data Modules]


    Usable Storage Capacity

    XIV Storage System reserves physical disk capacity for:

    Global spare capacity

    Metadata, including statistics and traces

    Mirrored copies of data

The global reserved space includes sufficient space to withstand the failure of a full module in addition to three disks

The system reserves roughly 4% of physical capacity for statistics and traces, as well as the distribution and partition tables

Net usable capacity is reduced by a factor of 50% to account for data mirroring

Usable capacity = [1,000GB x 0.96 x [180 - [12 + 3]]] / 2 = 79,113GB

The XIV Storage System reserves physical disk capacity for:

Global spare capacity

Metadata, including statistics and traces

Mirrored copies of data

Global spare capacity

The dynamically balanced distribution of data across all physical resources by definition obviates the dedicated spare drives that are necessary with conventional RAID technologies. Instead, the XIV Storage System reserves capacity on each disk in order to provide adequate space for the redistribution and recreation of redundant data in the event of a hardware failure.

The global reserved space includes sufficient space to withstand the failure of a full module in addition to three disks, enabling the system to execute a new Goal Distribution, discussed earlier, and return to full redundancy even after multiple hardware failures. Since the reserve spare capacity does not reside on dedicated disks, hot-spare space is reserved as a percentage of each individual drive's overall capacity.

Metadata and system reserve

The system reserves roughly 4% of physical capacity for statistics and traces, as well as the distribution and partition tables.

Net usable capacity

The calculation of the net usable capacity of the system consists of the total disk count, less the number of disks reserved for sparing, multiplied by the amount of capacity on each disk that is dedicated to data, and finally reduced by a factor of 50% to account for data mirroring.

Note: The calculation of the usable space is as follows:

Usable capacity = [drive capacity x (% utilized for data) x [Total Drives - Hot Spare reserve]] / 2

Usable capacity = [1,000GB x 0.96 x [180 - [12 + 3]]] / 2 = 79,113GB (decimal)
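The formula above, written out as a quick calculation. Note that with a flat 4% metadata reserve the arithmetic gives 79,200 GB; the deck's quoted 79,113 GB suggests the actual reserve is slightly larger than 4%, so treat 0.96 as an approximation rather than an exact factor.

```python
# Usable-capacity formula from this slide, with the deck's own inputs.
DRIVE_GB = 1_000            # decimal GB per SATA drive
DATA_FRACTION = 0.96        # ~4% reserved for metadata, statistics, traces, tables
TOTAL_DRIVES = 180
SPARE_DRIVES = 12 + 3       # reserve to survive a full module plus three disks
MIRROR_FACTOR = 2           # every partition is stored twice

usable_gb = DRIVE_GB * DATA_FRACTION * (TOTAL_DRIVES - SPARE_DRIVES) / MIRROR_FACTOR
print(usable_gb)            # 79200.0 (the deck quotes 79,113 GB)
```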


    Storage Pool Concepts

Storage Pools form the basis for controlling the usage of storage space

Manipulation of Storage Pools consists exclusively of metadata transactions

A logical volume is defined within the context of one and only one Storage Pool

A Consistency Group is a group of volumes that can be snapshotted at the same point in time

Storage Pool relationships:

A logical volume may have multiple independent snapshots. This logical volume is also known as a master volume

A master volume and all of its associated snapshots are always a part of only one Storage Pool

A volume may only be part of a single Consistency Group

All volumes of a Consistency Group must belong to the same Storage Pool

While the hardware resources within the XIV Storage System are virtualized in a global sense, the available capacity in the system can be administratively portioned into separate and independent Storage Pools. The concept of Storage Pools is purely administrative in that they are not a layer of the functional hierarchical logical structure employed by the system operating environment. Instead, the flexibility of Storage Pool relationships from an administrative standpoint derives from the granular virtualization within the system. Essentially, Storage Pools function as a means to effectively manage a related group of logical volumes and their snapshots.

Improved management of storage space

Storage Pools form the basis for controlling the usage of storage space by specific applications, groups of applications, or departments, enabling isolated management of relationships within the associated group of logical volumes and snapshots while imposing a capacity quota.

A logical volume is defined within the context of one and only one Storage Pool, and because a volume is equally distributed among all system disk resources, it follows that all Storage Pools must also span all system resources.

As a consequence of the system virtualization, there are no limitations on the size of Storage Pools or on the associations between logical volumes and Storage Pools. In fact, manipulation of Storage Pools consists exclusively of metadata transactions and does not impose any data copying from one disk or module to another. Hence, changes are completed instantly and without any system overhead or performance degradation.

Consistency Groups

A Consistency Group is a group of volumes that can be snapshotted at the same point in time, thus ensuring a consistent image of all volumes within the group at that time. The concept of Consistency Groups is ubiquitous among storage subsystems because there are many circumstances in which it is necessary to perform concurrent operations collectively across a set of volumes, so that the result of the operation preserves the consistency among volumes. For example, effective storage management activities for applications that span multiple volumes, or for creating point-in-time backups, would not be possible without first employing Consistency Groups.

    Storage Pool relationships

    Storage Pools facilitate administration of relationships between logical volumes, snapshots, and Consistency Groups.

    The following principles govern the relationships between logical entities within the Storage Pool:

    A logical volume may have multiple independent snapshots. This logical volume is also known as a master volume.

    A master volume and all of its associated snapshots are always a part of only one Storage Pool.

    A volume may only be part of a single Consistency Group.

    All volumes of a Consistency Group must belong to the same Storage Pool.
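A small validation sketch of the relationship rules listed above, using hypothetical record types: a master volume and its snapshots share one pool, a volume belongs to at most one Consistency Group, and every volume in a Consistency Group lives in that group's pool.

```python
# Hypothetical consistency checks for the pool/volume/CG rules (illustration only).
from dataclasses import dataclass, field

@dataclass
class VolumeRec:
    name: str
    pool: str
    snapshots: list = field(default_factory=list)   # snapshot records inherit the pool
    consistency_group: str | None = None            # at most one CG per volume

@dataclass
class ConsistencyGroupRec:
    name: str
    pool: str
    volumes: list = field(default_factory=list)

def check_rules(volumes, groups):
    for v in volumes:
        assert all(s.pool == v.pool for s in v.snapshots), "snapshots must share the master's pool"
    for g in groups:
        assert all(v.pool == g.pool for v in g.volumes), "CG volumes must share the CG's pool"
        assert all(v.consistency_group == g.name for v in g.volumes)

vol = VolumeRec("db_data", pool="prod")
vol.snapshots.append(VolumeRec("db_data.snap1", pool="prod"))
cg = ConsistencyGroupRec("db_cg", pool="prod", volumes=[vol])
vol.consistency_group = "db_cg"
check_rules([vol], [cg])      # passes silently when the rules hold
```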


    Storage Pool Concepts (Cont.)

    Storage pool size can vary from 17GB to full system capacity

Snapshot reserve capacity is defined within each regular Storage Pool and is maintained separately from logical volume capacity

Snapshots are structured as logical volumes; however, a Storage Pool's snapshot reserve capacity is granular at the partition level (1MB)

Snapshots will only be automatically deleted when there is inadequate physical capacity available in the Storage Pool

Space allocated for a Storage Pool can be dynamically changed

The designation of a Storage Pool as a regular pool or a thinly provisioned pool can be dynamically changed

The storage administrator can relocate logical volumes between Storage Pools without any limitations

Storage Pools have the following characteristics:

The size of a Storage Pool can range from as small as possible (17GB, the minimum size that can be assigned to a logical volume) to as large as possible (the entirety of the available space in the system) without any limitation imposed by the system (this is not true for hosts, however).

Snapshot reserve capacity is defined within each non-thinly provisioned, or regular, Storage Pool and is effectively maintained separately from logical, or master, volume capacity. The same principles apply for thinly provisioned Storage Pools, with the exception that space is not guaranteed to be available for snapshots due to the potential for hard space depletion.

Snapshots are structured in the same manner as logical volumes (also known as master volumes); however, a Storage Pool's snapshot reserve capacity is granular at the partition level (1MB). In effect, snapshots collectively can be thought of as being thinly provisioned within each increment of 17GB of capacity defined in the snapshot reserve space.

As discussed in the example above, snapshots will only be automatically deleted when there is inadequate physical capacity available within the context of each Storage Pool independently. This process is managed by a snapshot deletion priority scheme. Therefore, when a Storage Pool's size is exhausted, only the snapshots that reside in the affected Storage Pool are deleted.

The space allocated for a Storage Pool can be dynamically changed by the storage administrator:

The Storage Pool can always be increased in size, limited only by the unallocated space on the system.

The Storage Pool can always be decreased in size, limited only by the space consumed by the volumes and snapshots defined within that Storage Pool.

The designation of a Storage Pool as a regular pool or a thinly provisioned pool can be dynamically changed, even for existing Storage Pools.

The storage administrator can relocate logical volumes between Storage Pools without any limitations, provided there is sufficient free space in the target Storage Pool.

If necessary, the target Storage Pool capacity can be dynamically increased prior to volume relocation, assuming there is sufficient unallocated capacity available in the system.

When a logical volume is relocated to a target Storage Pool, sufficient space must be available for all of its snapshots to reside in the target Storage Pool as well.
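A sketch of the relocation rule above: moving a volume to another Storage Pool is a metadata operation, but it only succeeds if the target pool has room for the volume and all of its snapshots. The accounting here is simplified and purely illustrative, with sizes in GB.

```python
# Hypothetical pool accounting for volume relocation (illustration only).
class Pool:
    def __init__(self, name, size_gb):
        self.name = name
        self.size_gb = size_gb
        self.volumes = []                     # (volume_name, volume_gb, snapshots_gb)

    def used_gb(self):
        return sum(v + s for _, v, s in self.volumes)

    def free_gb(self):
        return self.size_gb - self.used_gb()

def relocate(volume_name, source: Pool, target: Pool):
    entry = next(e for e in source.volumes if e[0] == volume_name)
    needed = entry[1] + entry[2]              # volume plus all of its snapshots
    if target.free_gb() < needed:
        raise ValueError(f"target pool '{target.name}' needs {needed}GB free")
    source.volumes.remove(entry)              # no data is copied between disks
    target.volumes.append(entry)

src = Pool("dev", 1700);  src.volumes.append(("vol1", 51, 17))
dst = Pool("prod", 170)
relocate("vol1", src, dst)                    # 68GB fits into the 170GB pool
print(dst.used_gb(), src.used_gb())           # 68 0
```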


    Capacity Allocation and Thin Provisioning

    Soft volume size

    The size of the logical volume that is observed by the host

Soft volume size is specified in one of two ways, depending on units:

In terms of GB: The system allocates the soft volume size as the minimum number of discrete 17GB increments

In terms of blocks: The capacity is indicated as a discrete number of 512-byte blocks

The system still allocates capacity in 17GB increments; however, the precise size in blocks is reported to the host

Hard volume size

The physical space allocated to the volume following host writes to the volume

Upper limit is determined by the soft size assigned to the volume

Allocated to volumes by the system in increments of 17GB due to the underlying logical and physical architecture

Increasing the soft volume size does not affect the hard volume size

The XIV Storage System virtualization empowers storage administrators to thinly provision resources, vastly improving aggregate capacity utilization and tremendously simplifying resource allocation. Thin provisioning is a central theme of the virtualized design of the system, because it uncouples the virtual, or apparent, allocation of a resource from the underlying hardware allocation.

Hard and soft volume sizes

The physical capacity assigned to traditional, or fat, volumes is equivalent to the logical capacity presented to hosts. With the XIV Storage System, this does not need to be the case. All logical volumes by definition have the potential to be thinly provisioned as a consequence of the XIV Storage System's virtualized architecture, and therefore provide the most efficient capacity utilization possible. For a given logical volume, there are effectively two associated sizes. The physical capacity allocated for the volume is not static, but increases as host writes fill the volume.

Soft volume size

This is the size of the logical volume that is observed by the host, as defined upon volume creation or as a result of a resizing command. The storage administrator specifies the soft volume size in the same manner regardless of whether the Storage Pool itself will be thinly provisioned. The soft volume size is specified in one of two ways, depending on units:

1. In terms of GB: The system will allocate the soft volume size as the minimum number of discrete 17GB increments needed to meet the requested volume size.

2. In terms of blocks: The capacity is indicated as a discrete number of 512-byte blocks. The system will still allocate the soft volume size consumed within the Storage Pool as the minimum number of discrete 17GB increments needed to meet the requested size (specified in 512-byte blocks); however, the size that is reported to hosts is equivalent to the precise number of blocks defined.

Incidentally, the snapshot reserve capacity associated with each Storage Pool is a soft capacity limit and is specified by the storage administrator, though it effectively limits the hard capacity consumed collectively by snapshots as well.

Hard volume size

The volume's allocated hard space reflects the physical space allocated to the volume following host writes to the volume, and is discretely and dynamically provisioned by the system (not the storage administrator). The upper limit of this provisioning is determined by the soft size assigned to the volume.

The volume's consumed hard space is not necessarily equal to the volume's allocated hard capacity, because hard space allocation occurs in increments of 17GB, while actual space is consumed at the granularity of the 1MB partitions. Therefore, the actual physical space consumed by a volume within a Storage Pool is transient, because a volume's consumed hard space reflects the total amount of data that has been previously written by host applications:

Hard capacity is allocated to volumes by the system in increments of 17GB due to the underlying logical and physical architecture; there is no greater degree of granularity than 17GB, even if only a few partitions are initially written beyond each 17GB boundary.

Application write access patterns determine the rate at which the allocated hard volume capacity is consumed, and subsequently the rate at which the system allocates additional increments of 17GB, up to the limit defined by the soft size for the volume. As a result, the storage administrator has no direct control over the hard capacity allocated to the volume by the system at any given point in time.

During volume creation, or when a volume has been formatted, there is zero physical capacity assigned to the volume. As application writes accumulate to new areas of the volume, the physical capacity allocated to the volume will grow in increments of 17GB and may ultimately reach the full soft volume size.

Increasing the soft volume size does not affect the hard volume size.
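A sketch of the soft vs. hard size behaviour just described: the soft size is rounded up to whole 17GB increments (a block-defined size is still reported to the host exactly), and the hard size grows one 17GB increment at a time as writes reach new areas, never exceeding the soft size. The helper and class names are hypothetical.

```python
# Illustrative soft/hard size accounting, assuming 17GB = 16,384 x 1MB increments.
import math

INCREMENT_BYTES = 16_384 * 2 ** 20          # one "17GB" increment
BLOCK_BYTES = 512

def soft_size_bytes(*, gb=None, blocks=None):
    requested = int(gb * 10 ** 9) if gb is not None else blocks * BLOCK_BYTES
    increments = math.ceil(requested / INCREMENT_BYTES)
    reported = requested if blocks is not None else increments * INCREMENT_BYTES
    return increments * INCREMENT_BYTES, reported   # (allocated soft size, size seen by host)

class ThinVolume:
    def __init__(self, soft_bytes):
        self.soft_bytes = soft_bytes
        self.hard_bytes = 0                         # nothing allocated at creation

    def write(self, offset_bytes):
        needed = (offset_bytes // INCREMENT_BYTES + 1) * INCREMENT_BYTES
        self.hard_bytes = min(max(self.hard_bytes, needed), self.soft_bytes)

soft, reported = soft_size_bytes(blocks=20_000_000)   # ~10GB block-defined volume
vol = ThinVolume(soft)
vol.write(5 * 10 ** 9)                                # first writes land in increment 0
print(soft // INCREMENT_BYTES, vol.hard_bytes // INCREMENT_BYTES)   # 1 1
```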


    Capacity Allocation and Thin Provisioning (Cont.)

[Figure: logical view vs. physical view of two volumes in a Storage Pool, each built from 17GB increments, showing allocated soft space, allocated hard space, consumed hard space, snapshot reserve, and unused space. Callouts: the block definition lets hosts see a precise number of blocks, but even block-defined volumes have logical capacity allocated in increments of 17GB; consumed hard space grows as host writes accumulate to new areas of the volume; when snapshots are taken, snapshot consumed hard space grows as snapshot writes accumulate to new areas within the allocated snapshot reserve soft space; when a new thinly provisioned pool is created, soft space is allocated but no physical space is actually allocated.]

    Storage Pool level thin provisioning

While volumes are effectively thinly provisioned automatically by the system, Storage Pools can be defined by the storage administrator (when using the GUI) as either regular or thinly provisioned. Note that when using the XCLI, there is no specific parameter to indicate thin provisioning for a Storage Pool; you indirectly and implicitly create a Storage Pool as thinly provisioned by specifying a pool soft size greater than its hard size.

With a regular pool, the host-apparent capacity is guaranteed to be equal to the physical capacity reserved for the pool. The total physical capacity allocated to the constituent individual volumes and collective snapshots at any given time within a regular (non-thinly provisioned) pool will reflect the current usage by hosts, because the capacity is dynamically consumed as required. However, the remaining unallocated space within the pool remains reserved for the pool and cannot be used by other Storage Pools. Therefore, the pool will not achieve full utilization unless the constituent volumes are fully utilized, but conversely there is no chance of exceeding the physical capacity that is available within the pool, as is possible with a thinly provisioned pool.

In contrast, a thinly provisioned Storage Pool is not fully backed by hard capacity, meaning the entirety of the logical space within the pool cannot be physically provisioned unless the pool is first transformed into a regular pool. However, benefits may be realized when physical space consumption is less than the logical space assigned, because the amount of logical capacity assigned to the pool that is not covered by physical capacity is available for use by other Storage Pools.
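A minimal sketch of the pool-level distinction explained above: a pool is effectively thinly provisioned when its soft (host-apparent) size is larger than its hard (physically reserved) size; as noted, the XCLI has no explicit "thin" flag, so the relationship between the two sizes is what matters. The function name is hypothetical.

```python
# Minimal illustration of the soft-vs-hard pool size rule.
def pool_is_thin(soft_gb: int, hard_gb: int) -> bool:
    return soft_gb > hard_gb

print(pool_is_thin(soft_gb=1020, hard_gb=1020))   # False -> regular pool
print(pool_is_thin(soft_gb=2040, hard_gb=1020))   # True  -> thinly provisioned pool
```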


    Session II: Gen II Systems Hardware Design and Layout

    XIV Storage System Model 2810-A14

    Full Rack Systems

    Partially Populated Rack Systems

    The Rack, ATS and UPS modules

    Data Modules

    Interface Module

    SATA Disk Drives

    The Patch Panel

    Interconnection and Switches

    Maintenance Module

This chapter describes the hardware architecture of the XIV Storage System. The physical structures that make up the XIV Storage System are presented, such as the system rack, Interface, Data and Management modules, disks, switches, and power distribution devices.


    XIV Storage System Model 2810-A14

The XIV Storage System seen in this slide is designed to be a scalable enterprise storage system based upon a grid array of hardware components. The architecture offers the highest performance through maximized utilization of all disks and a true distributed cache implementation, coupled with more effective bandwidth. It also offers superior reliability through its distributed architecture, redundant components, self-monitoring, and self-healing.


    Full Rack Systems

    Hardware characteristics

The IBM 2810-A14 is a new generation of IBM high-performance, high-availability, and high-capacity enterprise disk storage subsystem. This slide summarizes the main hardware characteristics.

All XIV hardware components come pre-installed in a standard APC AR3100 rack. At the bottom of the rack, a UPS module complex made up of three redundant UPS units is installed and provides power to the Data Modules, Interface Modules, and switches.

A fully populated rack contains 15 modules, of which 6 are combined Data and Interface Modules equipped with the connectivity adapters (FC and Ethernet). Each module includes twelve 1TB SATA disk drives. This translates into a total raw capacity of 180TB for the complete system.

Two 48-port 1Gbps Ethernet switches form the basis of an internal redundant Gigabit Ethernet network that links all the modules in the system. The switches are installed in the middle of the rack, between the Interface Modules.

The connections between the modules and switches, and also all internal power connections in the rack, are realized by a redundant set of cables. For power connections, standard power cables and plugs are used. Additionally, standard Ethernet cables are used for interconnection between the modules and switches.

All 15 modules (6 Interface Modules and 9 Data Modules) have redundant connections through the two 48-port 1Gbps Ethernet switches. This grid network ensures communication between all modules even if one of the switches or a cable connection fails. Furthermore, this grid network provides the capabilities for parallelism and execution of the data distribution algorithm that contribute to the excellent performance of the XIV Storage System.


    Partially Populated Rack Systems


Hardware characteristics

The IBM 2810-A14 Partially Populated Rack provides a solution for mid-size and large enterprises that need to begin working with XIV storage at a lower capacity. This slide summarizes the main hardware characteristics.

All XIV hardware components come pre-installed in a standard APC AR3100 rack. At the bottom of the rack, a UPS module complex made up of three redundant UPS units is installed and provides power to the Data Modules, Interface Modules, and switches.

A partially populated rack contains 6 modules, of which 3 are combined Data and Interface Modules equipped with the connectivity adapters (FC and Ethernet). Each module includes twelve 1TB SATA disk drives. This translates into a total raw capacity of 72TB for the complete system.

Two 48-port 1Gbps Ethernet switches form the basis of an internal redundant Gigabit Ethernet network that links all the modules in the system. The switches are installed in the middle of the rack, between the Interface Modules.

The connections between the modules and switches, and also all internal power connections in the rack, are realized by a redundant set of cables. For power connections, standard power cables and plugs are used. Additionally, standard Ethernet cables are used for interconnection between the modules and switches.

All 6 modules (3 Interface Modules and 3 Data Modules) have redundant connections through the two 48-port 1Gbps Ethernet switches. This grid network ensures communication between all modules even if one of the switches or a cable connection fails. Furthermore, this grid network provides the capabilities for parallelism and execution of the data distribution algorithm that contribute to the excellent performance of the XIV Storage System.


    Partially Populated Rack Systems (Cont.)

Total Modules           6      9     10     11     12     13     14     15
Usable Capacity (TB)   27     43     50     54     61     66     73     79
Interface Modules       3      6      6      6      6      6      6      6
Data Modules            3      3      4      5      6      7      8      9
Disk Drives            72    108    120    132    144    156    168    180
Fibre Channel Ports     8     16     16     20     20     24     24     24
iSCSI Ports             0      4      4      6      6      6      6      6
Memory (GB)            48     72     80     88     96    104    112    120
Plant/Field Orderable  Plant  Field  Field  Field  Field  Field  Field  Both

Additional capacity configurations

The XIV Storage System Model A14 is now available in a six module configuration consisting of three interface modules (feature number 1100) and three data modules (feature number 1105). This configuration is designed to support the same capabilities and functions as the current 15 module XIV Storage System with the IBM XIV Storage System Software V10. It has all of the same auxiliary components and ships in the same physical rack as the 15 module system.

The six module configuration is field-upgradeable with additional interface modules and data modules to achieve configurations with a total of nine, ten, eleven, twelve, thirteen, fourteen, or fifteen modules. The resulting configuration can subsequently continue to be upgraded with one or more additional modules, up to the maximum of fifteen modules.


    The Rack, ATS and UPS modules

In case of extended external power failure or outage, the UPS module complex maintains battery power long enough to allow a safe and ordered shutdown

The Automatic Transfer System (ATS) supplies power to all three UPSs and the Maintenance module

The rack

The IBM XIV hardware components are installed in a 19-inch NetShelter SX 42U rack (APC AR3100) from APC. The rack is 1070mm deep to accommodate deeper modules and to provide more space for cables and connectors. Adequate space is provided to house all components and to properly route all cables. The rack door and side panels are locked with a key to prevent unauthorized access to the installed components.

The UPS module complex

The Uninterruptible Power Supply (UPS) module complex consists of three UPS units. Each unit maintains an internal power supply in the event of a temporary failure of the external power supply. In case of extended external power failure or outage, the UPS module complex maintains battery power long enough to allow a safe and ordered shutdown of the XIV Storage System. The complex can sustain the failure of one UPS unit while still protecting against external power disturbances.

The three UPS modules are located at the bottom of the rack. Each of the modules has an output of 6 kVA to supply power to all other components in the rack and is 3U in height. The design allows proactive detection of temporary power problems and can correct them before the system goes down. In the case of a complete power outage, integrated batteries continue to supply power to the entire system. Depending on the load of the IBM XIV, the batteries are designed to continue system operation from 3.3 minutes to 11.9 minutes, which gives enough time to gracefully power off the system.

Automatic Transfer System (ATS)

The Automatic Transfer System (ATS) supplies power to all three Uninterruptible Power Supplies (UPS) and to the Maintenance Module. Two separate external main power sources supply power to the ATS.

In case of power problems or a failing UPS, the ATS reorganizes the power load balance between the power components. The operational components take over the load from the failing power source or power supply. This rearrangement of the internal power load is performed by the ATS in a seamless way, and system operation continues without any application impact.


The UPS Behavior

All component power connections in the box are distributed across the 3 UPSs

All three UPSs run self-test procedures to validate the batteries' operational state

Self-test schedule is cycled on the UPSs with a 5-day interval between them - WRONG

An operational system is one where at least two UPSs are running on utility power with at least a 70% battery charge level

A single UPS failure will not impact power distribution

If two or more UPSs are in a failed state, the system will wait for a 30-second grace period before deciding on a graceful system shutdown

If one UPS is in a failed state, the next self-test instance will be skipped to avoid the chance of a second UPS failure

UPS self-test procedures are controlled by the system MicroCode

The self-test cycle is once every 14 days, with a 9-hour interval between each UPS

The XIV Storage System uses its memory DIMMs for cache purposes. When a system has a problem with power distribution, it is imperative that a fail-safe power distribution scheme be in place to avoid data loss.

Power Distribution Rules

All system components use power connections that are distributed evenly across the 3 UPSs. A fully operational system is one where at least two UPSs are running on utility power with at least a 70% battery charge level.

In order to allow the system to perform a graceful shutdown (the process in which the system commits all remaining I/Os in cache to disk and properly shuts down all system components), the XIV system needs at least two UPSs with a minimum 70% charge level.

In case of a single UPS failure, whether from a self-test failure or from a physical problem with the UPS itself, power distribution to the system will not be impacted. The system will continue to function normally.

If two or more UPSs are in a failed state, the system will wait for an additional 30-second grace period before determining that the system is indeed experiencing a major problem (to avoid taking the system down in cases of short power spikes) and issuing a graceful shutdown.

If one UPS is in a failed state for whatever reason, the next UPS self-test instance will be skipped to avoid the chance of a second UPS failure that would cause the box to issue a graceful shutdown.

The UPS self-test procedures are controlled by the system MicroCode and can be configured, if needed, using developer-level commands.

The default self-test interval is 5 days between each UPS.
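A sketch of the power rules described in these notes: the system stays fully operational while at least two UPSs are on utility power with at least 70% charge; if two or more UPSs are failed, it waits a 30-second grace period (to ride out short spikes) before starting a graceful shutdown. The state names and timing handling here are hypothetical simplifications.

```python
# Illustrative decision logic for the UPS rules above (not the real MicroCode).
from dataclasses import dataclass

GRACE_SECONDS = 30

@dataclass
class UPS:
    on_utility_power: bool
    charge_pct: float
    failed: bool = False

def system_state(upses, seconds_in_bad_state=0):
    healthy = [u for u in upses if u.on_utility_power and u.charge_pct >= 70 and not u.failed]
    failed = [u for u in upses if u.failed]
    if len(healthy) >= 2:
        return "operational"
    if len(failed) >= 2 and seconds_in_bad_state >= GRACE_SECONDS:
        return "graceful_shutdown"
    return "waiting_grace_period"

upses = [UPS(True, 100), UPS(True, 95), UPS(True, 80)]
print(system_state(upses))                                   # operational
upses[0].failed = upses[1].failed = True
print(system_state(upses, seconds_in_bad_state=10))          # waiting_grace_period
print(system_state(upses, seconds_in_bad_state=35))          # graceful_shutdown
```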


Data Module

[Figure callouts: system fans x 10, motherboard, SAS expander card, system PSUs x 2, PCI slots (ETH), CF card]

The hardware of the Interface Modules and the Data Modules is a Xyratex 1235E-X1. The module is 87.9 mm (2U) tall, 483 mm wide, and 707 mm deep. The weight depends on configuration and type (Data Module or Interface Module) and is a maximum of 30 kg.

The fully populated rack hosts 9 Data Modules (Modules 1-3 and Modules 10-15). There is no difference in the hardware between Data Modules and Interface Modules, except for the additional host adapters and GigE adapters in the Interface Modules. The main components of the module, besides the 12 disk drives, are:

    System Planar

    Processor

    Memory / Cache

    Enclosure Management Card

    Cooling devices (fans)

    Memory Flash Card

    Redundant Power Supplies

In addition, each Data Module contains four redundant Gigabit Ethernet ports. These ports, together with the two switches, form the internal network, which is the communication path for data and metadata between all modules. One dual GigE adapter is integrated in the System Planar (ports 1 and 2); the remaining two ports (3 and 4) are on an additional dual GigE adapter installed in a PCIe slot.


    Data Module (Cont.)

    Back view picture of a Data module and the CF card with the Addonics adapter.


    Data Module (Cont.)

The same system planar with a built-in SAS controller is used in both Data and Interface modules

Each module has 1 Intel Xeon Quad Core CPU

8GB of fully buffered DIMM memory modules

10 fans for cooling of disks, CPU and board

An enclosure management card to issue alarms in case of problems with the module

1GB Compact Flash card

This card is the boot device of the module and contains the software and module configuration files

Due to the configuration files, the Compact Flash card is not interchangeable between modules

    System Planar

The System Planar used in the Data Modules and the Interface Modules is a standard ATX board from Intel. This high-performance server board with a built-in SAS adapter supports:

64-bit quad-core Intel Xeon processor to improve performance and headroom, and provide scalability and system redundancy with multiple virtual applications.

Eight fully buffered 533/667 MHz DIMMs to increase capacity and performance.

Dual Gb Ethernet with Intel I/O Acceleration Technology to improve application and network responsiveness by moving data to and from applications faster.

Four PCI Express slots to provide the I/O bandwidth needed by servers.

SAS adapter

    Processor: The processor is a Xeon Quad Core Processor. This 64-bit processor has the

    following characteristics: 2.33 GHz clock 12 MB cache 1.33 GHz Front Serial Bus

    Memory / Cache: Every module has 8 GB of memory installed (8 x 1GB FBDIMM). Fully

    Buffered DIMM memory technology increases reliability, speed and density of memory for use

    with Xeon Quad Core Processor platforms. This processor memory configuration can provide 3

    times higher memory throughput, enable increased capacity and speed to balance capabilities

    of quad core processors, perform reads and writes simultaneously and eliminate the previous

    read to write blocking latency. Part of the memory is used as module system memory, while

    the rest is used as cache memory for caching data previously read, pre-fetching of data from

    disk and for delayed destaging of previously written data.


    Interface Module

    Figure callouts: 10 system fans, motherboard, SAS expander card, 2 system PSUs, PCI slots (FC & ETH), CF card


    Interface Module (Cont.)

    Interface Module

    The Interface Module is similar to the Data Module. The only differences are:

    Each Interface Module contains iSCSI and Fibre Channel ports, through which hosts can attach to the XIV Storage System. These ports can also be used to establish Remote Mirror links with another, remote XIV Storage System.

    There are two 4-port GigE PCIe adapters installed for additional internal network connections and also for the iSCSI ports.

    There are six Interface Modules (modules 4-9) available in the rack. All Fibre Channel ports, iSCSI ports, and Ethernet ports used for external connections are internally connected to a patch panel where the external cables are actually hooked up.

    Fibre Channel connectivity

    There are 4 FC ports (two 2-port adapters) available in each Interface Module, for a total of 24 FCP ports. They support 4 Gbps (gigabits per second) full-duplex data transfer over shortwave fibre links, using 50 micron multi-mode cable. The cable needs to be terminated on one end by a Lucent Connector (LC).

    In each module the ports are allocated as follows:

    ports 1 and 2 are allocated for host connectivity

    ports 3 and 4 are allocated for remote connectivity

    4Gb FC PCI Express adapter

    Fibre Channel connections to the Interface Modules are realized by two 2-port 4Gb FC PCI Express adapters per Interface Module, from LSI Corporation, for faster connectivity and improved data protection.

    This Fibre Channel host bus adapter (HBA) is based on LSI's FC949E controller and features full-duplex-capable FC ports that automatically detect the connection speed and can each independently operate at 1, 2, or 4 Gbps. The ability to operate at slower speeds ensures that these adapters remain fully compatible with legacy equipment. New end-to-end error detection (CRC) for improved data integrity during reads and writes is also supported.


    Interface Module (Cont.)

    iSCSI connectivity

    There are six iSCSI service ports (two per Interface Module) available for iSCSI over IP/Ethernet services. These ports are located in Interface Modules 7, 8, and 9 and support 1 Gbps Ethernet host connections. These ports should connect, through the patch panel, to the user's IP network and provide connectivity to the iSCSI hosts.

    iSCSI connections can be operated with different functionalities:

    As an iSCSI target: serving hosts through the iSCSI protocol

    As an iSCSI initiator for remote mirroring when connected to another iSCSI port

    As an iSCSI initiator for data migration when connected to a third-party iSCSI storage system

    For CLI and GUI access over the iSCSI ports

    iSCSI ports can be defined for different uses:

    Each iSCSI port can be defined as an IP interface

    Groups of Ethernet iSCSI ports on the same module can be defined as a single link aggregation group (IEEE standard: 802.3ad)

    Ports defined as a link aggregation group must be connected to the same Ethernet switch, and a parallel link aggregation group must be defined on that Ethernet switch.

    Although a single port is defined as a link aggregation group of one, IBM XIV support can override this configuration if such a setup is not operable with the customer's Ethernet switches.

    For each iSCSI IP interface these configuration options are definable:

    IP address (mandatory)

    Network mask (mandatory)

    Default gateway (optional)

    MTU (default: 1,536; maximum: 8,192)
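
    As a concrete illustration of these options, here is a minimal Python sketch that models an iSCSI IP interface definition and enforces the constraints listed above (mandatory address and netmask, optional gateway, MTU bounded by the stated default and maximum). The class and field names are invented for this example and are not taken from the XIV software.

    from dataclasses import dataclass
    from typing import Optional

    DEFAULT_MTU = 1536   # default MTU quoted above
    MAX_MTU = 8192       # maximum MTU quoted above

    @dataclass
    class IscsiIpInterface:
        address: str                   # mandatory
        netmask: str                   # mandatory
        gateway: Optional[str] = None  # optional
        mtu: int = DEFAULT_MTU         # optional, defaults as above

        def validate(self) -> None:
            if not self.address or not self.netmask:
                raise ValueError("IP address and network mask are mandatory")
            if not (0 < self.mtu <= MAX_MTU):
                raise ValueError(f"MTU must be between 1 and {MAX_MTU}")

    # Example: a host-facing interface on an Interface Module iSCSI port.
    iface = IscsiIpInterface(address="192.0.2.10", netmask="255.255.255.0",
                             gateway="192.0.2.1")
    iface.validate()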


    SATA Disk Drives

    The SATA disk drives used in the IBM XIV are 1 TB, 7200 rpm hard drives designed for high-capacity storage in enterprise environments.

    All IBM XIV disks are installed in the front of the modules, twelve disks per module.

    Each single SATA disk is installed in a disk tray, which connects the disk to the backplane and includes the disk indicators on the front. If a disk is failing, it can be replaced easily from the front of the rack. The complete disk tray is one FRU, which is latched into its position by a mechanical handle.


    SATA Disk Drives (Cont.)

    Performance features

    3 Gb/s SAS interface supporting key features in SATA specification

    32 MB cache buffer for enhanced data transfer performance

    Rotation Vibration Safeguard (RVS) prevents performance degradation

    Reliability features

    Advanced magnetic recording heads and media

    Self-Protection Throttling (SPT) monitors I/O

    Thermal Fly-height Control (TFC) provides better soft error rate

    Fluid Dynamic Bearing (FDB) motor improves acoustics and positional accuracy

    R/W heads are placed on the load/unload ramp to protect user data when power is removed

    The IBM XIV was engineered with substantial protection against data corruption and data loss, not just relying on the sophisticated distribution and reconstruction methods. Several features and functions implemented in the disk drive also increase reliability and performance. The highlights are:

    Performance features and benefits

    SAS interface

    The disk drive features a 3 Gb/s SAS interface supporting key features of the Serial ATA specification, including NCQ (Native Command Queuing), staggered spin-up, and hot-swap capability.

    32 MB cache buffer

    Internal 32 MB cache buffer enhances the data transfer performance

    Rotation Vibration Safeguard (RVS)

    In multi-drive environments, rotational vibration, which results from the vibration of neighboring drives in a system, can degrade hard drive performance. To aid in maintaining high performance, the disk drive incorporates enhanced Rotation Vibration Safeguard (RVS) technology, providing up to 50% improvement over the previous generation against performance degradation, leading the industry.

    Reliability features and benefits

    Advanced magnetic recording heads and media

    Excellent soft error rate for improved reliability and performance

    Self-Protection Throttling (SPT)

    SPT monitors and manages I/O to maximize reliability and performance

    Thermal Fly-height Control (TFC)

    TFC provides better soft error rate for improved reliability and performance

    Fluid Dynamic Bearing (FDB) Motor

    FDB Motor to improve acoustics and positional accuracy

    Load/unload ramp

    The R/W Heads are placed outside the data area to protect user data when power is removed


    The Patch Panel

    4 FC ports on each Interface Module (4, 5, 6, 7, 8, 9)

    2 iSCSI ports on each Interface Module (7, 8, 9)

    3 connections for GUI and/or XCLI from the customer network (4, 5, 6)

    2 ports for VPN connectivity (4, 6)

    2 service ports (4, 5)

    1 maintenance module connection

    2 reserved ports

    The patch panel is located at the rear of the rack. Interface Modules are connected to the patch panel using 50 micron cables. All external connections should be made through the patch panel. In addition to the host and network connections, further ports are available on the patch panel for service connections.
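
    To make the port assignments above easier to scan, here is a small Python sketch that builds a patch-panel map keyed by Interface Module number. The dictionary layout is purely illustrative; it simply restates the counts and module numbers from the slide (the maintenance connection and the two reserved ports are not tied to a specific module here).

    # Illustrative patch-panel map: Interface Module number -> external ports.
    patch_panel = {m: [] for m in range(4, 10)}              # Interface Modules 4-9

    for m in range(4, 10):
        patch_panel[m] += [f"FC-{p}" for p in range(1, 5)]   # 4 FC ports each
    for m in (7, 8, 9):
        patch_panel[m] += ["iSCSI-1", "iSCSI-2"]             # 2 iSCSI ports each
    for m in (4, 5, 6):
        patch_panel[m] += ["management (GUI/XCLI)"]          # 3 management ports
    for m in (4, 6):
        patch_panel[m] += ["VPN"]                            # 2 VPN ports
    for m in (4, 5):
        patch_panel[m] += ["service"]                        # 2 service ports

    total_fc = sum(1 for ports in patch_panel.values()
                   for p in ports if p.startswith("FC"))
    assert total_fc == 24   # matches the 24 FCP ports mentioned earlier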


    Interconnection and Switches

    Internal module communication is based on 2 redundant 48-port Gigabit Ethernet switches

    The switches are connected to each other by interlink switch connections

    The switches use an RPS unit to eliminate the switch power supply as a single point of failure

    Internal Ethernet switches

    The internal network is based on two redundant 48-port Gigabit Ethernet switches (Dell PowerConnect 6248). Each of the modules (Data or Interface) is directly attached to each of the switches with multiple connections, and the switches are also linked to each other. This network topology enables maximal bandwidth utilization, because the switches are used in an active-active configuration, while being tolerant to any individual failure of network components such as a port, link, or switch. If one switch fails, the bandwidth of the remaining connections is sufficient to prevent a noticeable performance impact and still keep enough parallelism in the system.

    The Dell PowerConnect 6248 is a Gigabit Ethernet Layer 3 switch with 48 copper ports and 4 combined ports (SFP or 10/100/1000), robust stacking, and 10 Gigabit Ethernet uplink capability. The switches are powered by Dell RPS-600 redundant power supplies to eliminate the switch power supply as a single point of failure.


    Interconnection and Switches (Cont.)

    Module - USB to serial

    The module USB-to-serial connections are used by internal processes to keep communication to the modules alive in the event that the network connection is not operational. Modules are linked together with these USB-to-serial cables in groups of 3 modules. This emergency link is needed for the modules to communicate for internal processes, and it is used by maintenance only to repair internal network communication issues.

    The USB-to-serial connections always link a group of three modules:

    The USB port of Module 1 is connected to the serial port of Module 3

    The USB port of Module 3 is connected to the serial port of Module 2

    The USB port of Module 2 is connected to the serial port of Module 1

    This connection sequence is repeated for modules 4-6, 7-9, 10-12, and 13-15, as illustrated by the sketch below.
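
    The following small Python sketch generates that cabling pattern for all five module groups. It is only an illustration of the sequence described above; the function name and data layout are invented here.

    def usb_serial_cabling(groups=((1, 2, 3), (4, 5, 6), (7, 8, 9),
                                   (10, 11, 12), (13, 14, 15))):
        """Return (usb_module, serial_module) pairs for each group of three."""
        cables = []
        for a, b, c in groups:
            cables += [(a, c),   # USB of first module -> serial of third
                       (c, b),   # USB of third module -> serial of second
                       (b, a)]   # USB of second module -> serial of first
        return cables

    # The first group yields [(1, 3), (3, 2), (2, 1)], matching the list above.
    print(usb_serial_cabling()[:3])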


    Maintenance Module

    Used by IBM XIV support to maintain and repair the system

    When needed, XIV support can connect remotely

    Through a modem connection attached to the maintenance module

    The maintenance module is a 1U generic server

    It is powered through the ATS directly

    This is the only component in the system that is not redundant

    The maintenance module is not part of the XIV storage architecture

    If it is down, it does not affect the system

    The maintenance module is connected through Ethernet connections to modules 5 and 6

    The Maintenance Module and the modem, installed in the middle of the rack, are used by IBM XIV Support and the SSR/CE to maintain and repair the machine. When there is a software or hardware problem that needs the attention of the IBM XIV Support Center, a remote connection is required to analyze and possibly repair the faulty system. The connection can be established either via a VPN (virtual private network) broadband connection or via a phone line and modem.

    Modem

    The modem installed in the rack is needed and used for remote support. It enables the IBM XIV Support Center specialists and, if necessary, higher levels of support to connect to the XIV Storage System. Problem analysis and repair actions without a remote connection are complicated and time consuming.

    Maintenance Module

    A 1U remote support server is also required for full functionality and supportability of the IBM XIV. This device has fairly generic requirements, as it is only used by support personnel to gain remote access to the system via VPN or modem. The current choice for this device is a SuperMicro 1U server with an average commodity-level configuration.


    Session III: XIV Software Framework

    Basic Terminology

    Communication infrastructure

    Single Module Frameworks

    System Nodes

    File Systems on the Module


    Basic Terminology

    Module - a physical component

    Regular module (contains disks)

    Interface module (disks and also SCSI interfaces)

    Power module

    Switching module

    Node - XIV software component that runs on several modules

    Singleton Node - XIV software component that at any given time runs on a single module

    Basic Terminology

    There are several components that make up the XIV Storage System.

    Module - the physical components that are used to build the system

    Regular module - contains only disks

    Interface module - contains disks and host connectivity interfaces for FC and iSCSI

    Power module - the UPSs of the system

    Switching module - the switches used to interconnect all the different modules

    Node - a part of the XIV software components that runs on several modules

    Singleton Node - a part of the XIV software components that at any given time runs only on a specific module


    Communication infrastructure

    NetPatrol

    Guarantees network connectivity between every two modules

    MCL

    Provides a transactional layer between any two nodes

    Each node has a unique id, through which the node resolves to the type of the node and the module the node runs on

    RPC

    Exported MicroCode functions are called via RPC

    Transported over MCL

    XIV Configuration

    Each module holds a copy of the XIV system configuration with the current status of all XIV modules and nodes

    Transported over MCL

    Communication Infrastructure

    The XIV Storage architecture includes several communication components that allow the system to provide its capabilities and maintain reliability.

    NetPatrol

    guarantees network connectivity between every two modules in the system.

    MCL - Management Control Layer

    Each node has a unique identifier. The id is unique and resolves to the type of the node and the module the node runs on

    Each process may have several MCL queues to send/receive transactions on its own

    Handles timeouts and retries

    Resolves singleton roles to node id, aware of singleton election

    Uses a text-based, forward-compatible protocol

    RPC - Remote Procedure Call

    Any exported MicroCode function may be called via RPC. There are no limitations

    Marshalling/Unmarshalling is generic and relies on auto-generated information from an XML file

    Supports both sync/async client and server calls

    Marshals to a compact binary form and a forward-compatible XML form

    Easy to migrate to transactional transport: MCL, SCSI

    XIV Configuration

    Implemented over MCL

    The XIV Configuration is loaded into the memory of each data and interface module

    It holds the current status of all modules and nodes in the system

    Each change in the status will trigger a set of operations to be handled by the system to maintain its full redundancy and proper operational state
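
    As an illustration of the node-id idea used by MCL, the sketch below shows one way a unique node id could resolve to a node type and the module it runs on. The encoding, registry, and names are assumptions made for this example; the actual MCL id format is not described in this material.

    from collections import namedtuple

    NodeId = namedtuple("NodeId", ["node_type", "module"])

    # Hypothetical registry: unique id -> (node type, hosting module).
    # Callers address a node by id; the transport resolves where it lives.
    registry = {
        101: NodeId("platform_node", 1),
        102: NodeId("cte", 1),
        407: NodeId("i_node", 4),
        901: NodeId("manager", 2),   # current location of the singleton
    }

    def resolve(node_id: int) -> NodeId:
        """Resolve a unique node id to the node type and hosting module."""
        return registry[node_id]

    target = resolve(407)
    print(f"send transaction to {target.node_type} on module {target.module}")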


    Single Module Frameworks - Basic features

    Modules are symmetric and have exactly the same data

    All configuration is saved in a single XML file

    The only difference between the modules' file systems is the module_id file

    On replacement modules, the ID is assigned during the component testing phase, prior to moving them to an operational state

    Tight integration with cluster hardware

    Firmware management

    Hardware configuration

    Hardware monitoring

    Single Module Frameworks

    On the XIV Storage System all modules are symmetric and have exactly the same data. All the configuration data of a module is saved in a single XML file.

    Each module has a unique ID that helps the system determine the module's purpose and the services that should run on it; the only file that differs between modules is the module_id file. In the case of a module replacement, the new module's FRU id is zeroed out and is assigned during the component test phase, as sketched below.
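
    A minimal Python sketch of that idea, assuming a plain-text module_id file and invented path and helper names; the real file format and the logic used by the system are not detailed in this material.

    from pathlib import Path

    # Hypothetical location of the per-module identity file; everything else
    # on the module's file system is identical across modules.
    MODULE_ID_FILE = Path("/local/module_id")

    INTERFACE_MODULES = range(4, 10)   # modules 4-9 carry FC/iSCSI interfaces

    def module_role() -> str:
        """Derive the module's role from its module_id file."""
        module_id = int(MODULE_ID_FILE.read_text().strip())
        if module_id == 0:
            return "unassigned (replacement module awaiting the test phase)"
        if module_id in INTERFACE_MODULES:
            return f"interface module {module_id}"
        return f"data module {module_id}"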

    The MicroCode maximizes the use of the module capabilities to achieve a high level of availability and reliability from the system. The MicroCode is tightly integrated with:

    Firmware management

    Hardware configuration

    Hardware monitoring


    System Nodes

    Platform Node - all modules (process: platform_node)

    The Platform Node manages installation/upgrade of Module software

    Hardware configuration

    Running services and nodes, and keepalive messages handling

    Auto-generating service-specific configuration files

    Sending heartbeats to Management Node

    Handling configuration changes for the module

    Interface Node - modules 4-9 (process: i_node)

    Implements the necessary protocols to serve as a SCSI target for the FC and iSCSI transport

    DESCRIPTION OF ALL NODES

    This section covers all Nodes in the system:

    Platform Node (process: platform_node)

    The Platform Node runs on all Modules and manages the software and hardware of a Module. It manages installation/upgrade of Module software, configures all configurable hardware (WWNs, IP addresses, and so on), runs all Services and Nodes upon startup, auto-generates Service-specific and UNIX configuration files (/etc/ssh/sshd_config or /etc/hosts, for example), handles the keepalive messages of all Nodes on a Module, sends heartbeats to the Management Node, and handles Configuration changes for that Module. The Platform Node is normally the only process executed by xinit upon normal startup (it is hardcoded in xinit), and it spawns all Services and Nodes in accordance with the Configuration for the Module it is running on.
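
    A rough Python sketch of that supervision duty, to make the keepalive/heartbeat split concrete. The structure and names are assumptions for illustration; the actual platform_node implementation is not shown in this material.

    import time

    def supervise(nodes, send_heartbeat, respawn, interval=1.0):
        """Toy supervision loop in the spirit of the duties described above.

        `nodes` maps a node name to a callable that returns True while the
        local process is alive; `send_heartbeat` reports module health to the
        Management Node and `respawn` restarts a node whose keepalive stopped.
        """
        while True:
            for name, is_alive in nodes.items():
                if not is_alive():
                    respawn(name)      # restart a locally failed node
            send_heartbeat()           # heartbeat toward the Management Node
            time.sleep(interval)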

    Interface Node (process: i_node)

    The Interface Node implements the necessary protocols to serve as a SCSI Target for the FC and iSCSI transports. For iSCSI communication it relies on an external process called iscsi_host_session to set up iSCSI sessions.


    System Nodes (Cont.)

    Cache Node - all modules (process: cte)

    The storage backend of the XIV Storage system

    Each is a holder of partitions against which IOs are performed

    Gateway Node - modules 4-9 (process: gw_node)

    In charge of serving as the SCSI initiator for XDRP mirroring and data migration

    Admin Node - modules 4-9 (process: aserver)

    Listens on port 7777 (using STunnel from 7778) and receives XML describing commands, then passes them to the Administrator

    The Administrator parses and validates the XMLs, and translates them into an XIV RPC call to be executed by the Management Node

    Cache Node (process: cte)

    The Cache Node runs on all Modules with storage disks (at the time of this writing, all Gen2 Modules). It is the 'storage backend' of the XIV storage array. Each Cache Node is the primary or secondary holder of zero or more data chunks called Partitions, against which IOs are performed. A Cache Node services reads/writes from/to Partitions and decides which Partitions to keep in memory.
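
    The "decides which Partitions to keep in memory" part is essentially a caching policy. The sketch below shows a minimal LRU-style partition cache in Python as one plausible illustration; the actual cte caching and destaging logic is not described here, so treat the names and the eviction policy as assumptions.

    from collections import OrderedDict

    class PartitionCache:
        """Toy LRU cache of partition buffers held in module memory."""

        def __init__(self, capacity, read_from_disk):
            self.capacity = capacity           # max partitions kept in memory
            self.read_from_disk = read_from_disk
            self.buffers = OrderedDict()       # partition id -> bytes

        def read(self, partition_id):
            if partition_id in self.buffers:
                self.buffers.move_to_end(partition_id)    # mark recently used
                return self.buffers[partition_id]
            data = self.read_from_disk(partition_id)      # cache miss
            self.buffers[partition_id] = data
            if len(self.buffers) > self.capacity:
                self.buffers.popitem(last=False)          # evict the oldest
            return data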

    Gateway Node (process: gw_node)

    The Gateway Node runs on all Modules with an external data port (iSCSI/FC) and is in charge of serving as the SCSI Initiator for XDRP mirroring and Data Migration. Amongst other things, the Gateway Node writes blocks to the Target of an XDRP Volume for which a Secondary Volume is defined, reads new blocks from data-migrated Volumes, and recovers bad blocks from the Secondary Volume of a Primary Volume for which XDRP is defined and a Media Error occurred.

    Admin Node (process: aserver)

    The Admin Node runs on all Modules with an external management port and is in charge of processing and executing XCLI commands (the exterior API of the system). The Admin Node listens on port 7777 and uses STunnel to redirect commands coming from 7778. It receives XML-described commands over TCP and passes them to what is called "the Administrator". The Administrator (administrator.py) is a Python-written passive node which parses and validates the XMLs and translates them into an XIV RPC call to be executed by the Management Node (formerly called the Manager). The Admin Node spawns Administrators as necessary, and it may kill old ones and respawn new ones.
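
    Since the Administrator is described as Python-written, a tiny Python sketch of the XML-to-RPC translation step may help. The command format, tag names, and the rpc_call helper are invented for illustration and do not reflect the real administrator.py.

    import xml.etree.ElementTree as ET

    def translate_command(xml_text, rpc_call):
        """Parse an XML-described command and hand it to an RPC callable.

        Hypothetical format: <command name="vol_list">
                                 <param name="pool" value="db_pool"/>
                             </command>
        """
        root = ET.fromstring(xml_text)
        if root.tag != "command" or "name" not in root.attrib:
            raise ValueError("malformed command XML")     # validation step
        params = {p.attrib["name"]: p.attrib["value"]
                  for p in root.findall("param")}
        # Translate into an RPC call for the Management Node to execute.
        return rpc_call(root.attrib["name"], **params)

    # Example: a stand-in rpc_call that just echoes what it would execute.
    print(translate_command('<command name="vol_list"/>',
                            lambda name, **kw: (name, kw)))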


    System Nodes (Cont.)

    Management Node - singleton 1|2|3 (process: manager)

    In charge of managing system data redundancy by manipulating the data rebuilds and distributions

    Operation state changes (On, Maintenance, Shutting down)

    Processing XCLI commands as they are received from the Administrator in the form of an XIV RPC call

    Cluster Node - singleton 1|2|3 (process: cluster_hw)

    In charge of managing hardware which doesn't belong to any particular module (UPS, Switch)

    Event Node - singleton 1|2|3 (process: event_node)

    Processes event rules and acts as needed (SMTP, SNMP, SMS)

    Adds newly created events to the relevant part of the configuration

    Management Node (process: manager)

    The Management Node is a Singleton whose primary responsibility is managing system data redundancy by means of manipulating the Distribution. In simpler terms, the Management Node decides which Partition should reside on which Cache Nodes and manages the Rebuild and Redistribution processes. In addition, the Management Node is in charge of Operation State Changes (shutting down, Maintenance Mode, and so on) and of processing XCLI commands as they are received from the Administrator in the form of an XIV RPC call.
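
    To make "which Partition should reside on which Cache Nodes" more concrete, here is a deliberately simplified Python sketch that assigns each partition a primary and a secondary module while keeping the two copies on different modules. The real XIV distribution algorithm is pseudo-random and far more involved; this is only an illustration of the kind of mapping the Management Node maintains.

    import random

    def distribute(partition_ids, modules, seed=0):
        """Map each partition id to a (primary_module, secondary_module) pair."""
        rng = random.Random(seed)          # deterministic for the example
        distribution = {}
        for pid in partition_ids:
            primary, secondary = rng.sample(modules, 2)   # two distinct modules
            distribution[pid] = (primary, secondary)
        return distribution

    modules = list(range(1, 16))           # data-holding modules 1-15
    table = distribute(range(8), modules)
    print(table[0])                        # a (primary, secondary) pair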

    Cluster Node (process: cluster_hw)

    The Cluster Node is a Singleton Node in charge of managing hardware which doesn't belong to any particular Module. For example, the Cluster Node monitors the UPS and switches, updating their status in the Configuration as necessary.

    Event Node (process: event_node)

    The Event Node is a Singleton Node which runs on a Module with an external management port. Its first duty is to process Event rules and act upon them for every Event that is created (for example, rules may dictate sending an SMTP email or an SNMP trap). In addition, it adds newly created Events to the relevant part of the configuration, so that Event Saver Nodes (see below) will store them.
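
    A minimal Python sketch of rule-driven event handling as described above; the rule fields, severity scale, and action names are assumptions for illustration, not the actual event_node rule schema.

    # Hypothetical rules: match events by minimum severity and fire an action.
    SEVERITY = {"informational": 0, "warning": 1, "major": 2, "critical": 3}

    rules = [
        {"min_severity": "warning",  "action": "send_snmp_trap"},
        {"min_severity": "critical", "action": "send_smtp_email"},
    ]

    def process_event(event, actions):
        """Apply every matching rule's action to a newly created event."""
        for rule in rules:
            if SEVERITY[event["severity"]] >= SEVERITY[rule["min_severity"]]:
                actions[rule["action"]](event)

    actions = {name: (lambda e, n=name: print(n, "->", e["code"]))
               for name in ("send_snmp_trap", "send_smtp_email")}
    process_event({"severity": "critical", "code": "UPS_FAILED"}, actions)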


    System Nodes (Cont.)

    SCSI Reservation Node - singleton 1|2|3 (process: isync_node)

    Receives SCSI II & III commands (Reserve, Release, Register) for fast processing by the system

    The SCSI Reservation table is maintained on each Interface Node (for redundancy) based on updates coming from the isync_node

    Equip Master Node - all modules (process: equip_master)

    Lets foreign modules currently being equipped (test phase) download the XIV software

    Event Saver Node - all modules (process: event_saver)

    Receives all events created on the system and saves them to the VISDOR virtual disk

    HW Monitoring Node - all modules (process: hw_mon_node)

    Monitors all available hardware on the module

    SCSI Reservation Node (process: isync_node)

    The SCSI Reservation Node is a Singleton Node that handles incoming SCSI II and III commands from various hosts (Reserve, Release, Register). All incoming commands are routed through the i_nodes to the isync_node, which then updates all the i_nodes on its decision. All the i_nodes hold a copy of the SCSI Reservation table for redundancy.

    Equip Master Node (process: equip_master)

    The Equip Master Node runs on all Modules and lets foreign Modules which are currently being equipped (during the 'test' phase) download XIV software from the Module the Equip Master Node is currently running on. It is a bit like an XIV RPC based file server.

    Event Saver Node (process: event_saver)

    The Event Saver Node runs on all Modules with storage disks (at the time of this writing, all Gen2 Modules). It receives all Events created on the system and saves them to permanent media (at the time of this writing, to a partition of the VISDOR virtual disk).

    Hardware Monitor Node (process: hw_mon_node)

    The Hardware Monitor runs on all Modules and checks all hardware in the module that the system knows how to monitor. Monitored hardware includes the HBAs (if any), disks, enclosure (PSU, fans, IPMI module, heat levels), Ethernet interfaces, and the SAS controller. The Hardware Monitor does not configure hardware or act upon its findings; this is the job of other Nodes.


    File Systems on the Module

    The XIV Storage System is based on the IBM-MCP Linux distribution

    Configuration for traditional Unix binaries is generated automatically

    The Compact Flash file systems are mounted as read only

    Persistent data is stored on a special logical volume (VISDOR), triple-mirrored on the HDDs and taking 2.5% of the system capacity

    VISDOR holds event logs and traces for the system

    ISV holds statistics data for interface modules only

    XIV File Systems

    The Gen 2 XIV Storage System is based on the IBM-MCP Linux distribution. All configurations for traditional Unix binaries are generated automatically.

    Inside each module there is a PCI Addonics card that hosts a CF card. That CF card holds the basic MCP file system, mounted read only. To accommodate some Unix-related metadata, the system creates a small RAM-based file system. Persistent data, such as event logs and traces, is stored on a special logical volume called VISDOR. VISDOR is triple-mirrored across all the disk drives in the system and takes about 2.5% of the overall system capacity. VISDOR is usually mounted over /dev/sdb; everything under /local resides in the VISDOR.

    ISV is an XIV hidden volume that is mapped to Interface Modules only. It is usually mounted as /dev/sdc. It can hold approximately a year's worth of statistics for general data and a month's worth when sampling a specific Host/Volume.


    Session IV: XIV Systems Management

    Managing XIV Systems

    XIV Management Architecture

    The XIV Graphic User Interface (GUI)

    Managing Multiple Systems

    The XIV Command Line Interface (XCLI)

    Session XCLI

    Lab 1: Working with the XIV Management Tools


    Managing XIV Systems

    XIV Systems management can be done through both GUI and CLI commands

    The XIV Storage Manager can be installed on:

    Microsoft Windows - GUI and CLI

    Mac OS X - GUI and CLI

    Linux - GUI and CLI

    SUN Solaris - CLI

    IBM AIX - CLI

    HP/UX - CLI

    More info on XIV Storage Management

    http://www.ibm.com/systems/storage/disk/xiv/index.html

    The XIV Storage System software supports the functions of the XIV Storage System and provides the functional capabilities of the system. It is preloaded on each module (Data and Interface Modules) within the XIV Storage System. The functions and nature of this software are equivalent to what is usually referred to as microcode or firmware on other storage systems.

    The XIV Storage Management software is used to communicate with the XIV Storage System software, which in turn interacts with the XIV Storage hardware.

    The XIV Storage Manager can be installed on a Linux, SUN Solaris, IBM AIX, HP/UX, Microsoft Windows, or MacOS-based management workstation that will then act as the management console for the XIV Storage System. The Storage Manager software is provided at time of installation, or is optionally downloadable from the following web site:

    http://www.ibm.com/systems/storage/disk/xiv/index.html

    For detailed information about XIV Storage Management software compatibility, refer to the XIV interoperability matrix or the System Storage Interoperability Center (SSIC) at:

    http://www.ibm.com/systems/support/storage/config/ssic/index.jsp

    The IBM XIV Storage Manager includes a user-friendly and intuitive Graphical User Interface (GUI) application, as well as an Extended Command Line Interface (XCLI) component offering a comprehensive set of commands to configure and monitor the system.

    Graphical User Interface (GUI)

    A simple and intuitive GUI allows a user to perform most administrative and technical operations (depending upon the user role) quickly and easily, with minimal training and knowledge. The main motivation behind the XIV management and GUI design is the desire to keep the complexities of the system and its internal workings completely hidden from the user. The most important operational challenges, such as overall configuration changes, volume creation or deletion, snapshot definitions, and many more, are achieved with a few clicks. This chapter contains descriptions and illustrations of tasks performed by a Storage administrator when using the XIV graphical user interface (GUI) to interact with the system.

    Extended Command Line Interface (XCLI)

    The XIV Extended Command Line Interface (XCLI) is a powerful text-based, command-line tool that enables an administrator to issue simple commands to configure, manage, or maintain the system, including the definitions required to connect with hosts and applications. The XCLI can be used in a shell environment to interactively configure th