
    2009 IBM Corporation

    XIV Education

    XIV System Architecture


    Overview

    Phase 10 Features and Capabilities

    Gen II Systems Hardware Design and Layout

    XIV Software Framework

    XIV Systems Management

The XIV Storage System architecture incorporates a variety of features designed to uniformly distribute data across key internal resources. This unique data distribution method fundamentally differentiates the XIV Storage System from conventional storage subsystems and yields numerous availability, performance, and management benefits across both the physical and logical elements of the system.


    Session I: Phase 10 Features and Capabilities

    System Components

    Architectural design

    Grid Architecture

    Storage Virtualization and Logical Parallelism

    Logical System Concepts

    Usable Storage Capacity

    Storage Pool Concepts

    Capacity Allocation and Thin Provisioning

The XIV Storage System architecture is designed to deliver performance, scalability, and ease of management while harnessing the high capacity and cost benefits of SATA drives. The system employs off-the-shelf components, in contrast to traditional offerings that rely on more expensive components using proprietary designs.


    System Components

The XIV Storage System comprises the following components:

Host Interface Modules: six modules, each containing 12 SATA disk drives

Data Modules: nine modules, each containing 12 SATA disk drives

A UPS module complex made up of three redundant UPS units

Two Ethernet switches and an Ethernet switch Redundant Power Supply (RPS)

A Maintenance Module

An Automatic Transfer Switch (ATS) for external power supply redundancy

A modem, connected to the Maintenance Module, for externally servicing the system

All the modules in the system are linked through an internal, redundant Gigabit Ethernet network that enables maximum bandwidth utilization and is resilient to at least any single component failure. The system and all of its components come pre-assembled and wired in a lockable rack.


System Components (cont.)

    Hardware elements

The primary components of the XIV Storage System are known as modules. Modules provide processing, cache, and host interfaces and are based on standard Intel/Linux systems. They are redundantly connected to one another via an internal switched Ethernet fabric. All of the modules work together concurrently as elements of a grid architecture, and therefore the system harnesses the powerful parallelism inherent to a distributed computing environment.

    Data Modules

At a conceptual level, the Data Modules function as the elementary building blocks of the system, providing physical capacity, processing power, and caching, in addition to advanced system-managed services that make up the system's internal operating environment. The equivalence of hardware across Data Modules and their ability to share and manage system software and services are key elements of the physical architecture, as shown in the slide.

    Interface Modules

Fundamentally, Interface Modules are equivalent to Data Modules in all respects, with the following exceptions:

1. In addition to disk, cache, and processing resources, Interface Modules include both Fibre Channel and iSCSI interfaces for host system connectivity as well as remote mirroring.

2. The system services and software functionality associated with managing external I/O reside exclusively on the Interface Modules.

The slide conceptually illustrates the placement of Interface Modules within the topology of the IBM XIV Storage System architecture.

    Ethernet switches

The XIV Storage System contains a redundant switched Ethernet fabric that carries both data and metadata traffic between the modules. Traffic can flow in the following ways:

    Between two Interface Modules

    Between an Interface Module and a Data Module

    Between two Data Modules


    Architectural Design

    Massive Parallelism

Workload balancing

Self-Healing

    True virtualization

    Thin provisioning

[Figure: grid of Interface Modules and Data Modules interconnected through redundant switching]

    Massive Parallelism

The system architecture ensures full exploitation of all system components. Any I/O activity involving a specific logical volume in the system is always inherently handled by all spindles. The system harnesses all storage capacity and all internal bandwidth, and it takes advantage of all available processing power. This is equally true for host-initiated I/O activity as it is for system-initiated activity such as rebuild processes and snapshot generation. All disks, CPUs, switches, and other components of the system contribute at all times.

    Workload balancing

The workload is evenly distributed over all hardware components at all times. All disks and modules are utilized equally, regardless of access patterns. Even though applications may access certain volumes, or certain parts of a volume, more frequently than others, the load on the disks and modules remains balanced.

Pseudo-random distribution ensures consistent load balancing even after adding, deleting, or resizing volumes, as well as adding or removing hardware. This balancing of all data across all system components eliminates the possibility of a hot spot being created.

    Self-Healing

Protection against double disk failure is provided by an efficient rebuild process that brings the system back to full redundancy in minutes. In addition, the XIV Storage System extends the self-healing concept, restoring redundancy even after failures in components other than disks.

    True virtualization

Unlike other system architectures, storage virtualization is inherent to the basic principles of the XIV Storage System design. Physical drives and their locations are completely hidden from the user. This dramatically simplifies storage configuration, letting the system lay out the user's volume in the optimal way. The automatic layout maximizes the system's performance by leveraging system resources for each volume, regardless of the user's access patterns.

    Thin provisioning

The system enables thin provisioning: the capability to allocate storage to applications on a just-in-time, as-needed basis, allowing significant cost savings compared to traditional provisioning techniques. The savings are achieved by defining a logical capacity that is larger than the physical capacity. This capability allows users to improve storage utilization rates, thereby significantly reducing capital and operational expenses by allocating capacity based on total space consumed rather than total space allocated.


    Grid Architecture

    Relative effect of the loss of a given computing resource is

    minimized

    All modules are able to participate equally in handling the

    total workload

Modules consist of standard off-the-shelf components

Computing resources can be dynamically changed, in both capacity and performance:

    By scaling out

    By scaling up

    IBM XIV Storage System grid overview

    The XIV Grid design entails the following characteristics:

Both Interface Modules and Data Modules work together in a distributed computing sense. In other words, although Interface Modules have additional interface ports and assume some unique functions, they otherwise contribute to system operations equally with Data Modules.

    The modules communicate with each other via the internal, redundant Ethernet network.

    The software services and distributed computing algorithms running within the modules collectively manage allaspects of the operating environment.

    Design principles

    The XIV Storage System grid architecture, by virtue of its distributed topology and standard Intel/Linux building-block components, ensures that the following design principles are possible:

    Performance: The relative effect of the loss of a given computing resource, or module, is minimized.

    Performance: All modules are able to participate equally in handling the total workload.

This is true regardless of access patterns. The system architecture enables excellent load balancing, even if certain applications access certain volumes, or certain parts within a volume, more frequently.

Openness: Modules consist of standard off-the-shelf components.

Because components are not specifically engineered for the subsystem, the resources and time required for development with newer hardware technologies are minimized. This, coupled with the efficient integration of computing resources into the grid architecture, enables the subsystem to rapidly adopt the newest hardware technologies available without the need to deploy a whole new subsystem.

    Upgradability and scalability: Computing resources can be dynamically changed:

Scaled out by adding new modules to accommodate both new capacity and new performance demands, or even by tying together groups of modules.

    Scaled up by upgrading modules.


    Storage Virtualization and Logical Parallelism

    Easier volume management

    Consistent performance and scalability

    High availability and data integrity

    Flexible snapshots

    Data migration efficiency

    Pseudo-random algorithm

    Modular software design

    Flexible snapshots

Full storage virtualization incorporates snapshots that are differential in nature; only updated data consumes physical capacity.

Many concurrent snapshots are supported (up to 16,000 volumes and snapshots can be defined). Note that this is possible because a snapshot uses physical space only after a change has occurred on the source.

Multiple snapshots of a single master volume can exist independently of each other.

Snapshots can be cascaded, in effect creating snapshots of snapshots.

Snapshot creation/deletion does not require data to be copied, and hence occurs immediately.

As updates occur to master volumes, the system's virtualized logical structure enables it to elegantly and efficiently preserve the original point-in-time data associated with any and all dependent snapshots by simply redirecting the update to a new physical location on disk. This process, referred to as redirect on write, occurs transparently from the host perspective by virtue of the virtualized remapping of the updated data, and it minimizes any performance impact associated with preserving snapshots, regardless of the number of snapshots defined for a given master volume.

Because snapshots use redirect on write and do not necessitate data movement, the size of a snapshot is independent of the source volume size.

Data migration efficiency

XIV supports thin provisioning. When migrating from systems that only support


Logical System Concepts - Distribution Algorithm

    Each volume is spread across all drives

    Data is cut into 1MB Partitions and stored on the disks

XIV's distribution algorithm automatically distributes partitions across all disks in the system pseudo-randomly

[Figure: a volume's partitions spread across all Interface and Data Modules through the redundant switched fabric]

XIV disks behave like connected vessels, as the distribution algorithm aims for constant disk equilibrium. Thus, XIV's overall disk spindle usage approaches 100% in all usage scenarios.


Logical System Concepts - Distribution Algorithm (Cont.)

    Data distribution only changes when the system changes

    Equilibrium is kept when new hardware is added

    Equilibrium is kept when old hardware is removed

    Equilibrium is kept after a hardware failure



    Logical System Concepts - Partitions

    Data distribution only changes when the system changes

    Equilibrium is kept when new hardware is added

    Equilibrium is kept when old hardware is removed

    Equilibrium is kept after a hardware failure

[Figure: data redistribution across Modules 1-4 after a hardware upgrade adds Module 4]

    Logical constructs

    The XIV Storage System logical architecture incorporates constructs that underlie the

    storage virtualization and distribution of data, integral to its design. The logical structure of

    the subsystem ensures there is optimum granularity in the mapping of logical elements to

    both modules and individual physical disks, thereby guaranteeing an ideal distribution of

    data across all physical resources.

    Partitions

The fundamental building block of logical volumes is known as a partition. Partitions have the following characteristics:

All partitions are 1MB (1024 KB) in size.

A partition contains either a primary copy or a secondary copy of data:

Each partition is mapped to a single physical disk.

This mapping is dynamically managed by the system via a proprietary pseudo-random distribution algorithm in order to preserve data redundancy and equilibrium. The storage administrator has no control over, or knowledge of, the specific mapping of partitions to drives.

Secondary partitions are always placed onto a physical disk that does not contain the primary partition.

In addition, secondary partitions are also placed in a module that does not contain the corresponding primary partition.
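The sketch below illustrates the placement rule just described: each 1MB partition gets a primary and a secondary copy, and the secondary never lands in the module that holds the primary. The disk/module counts match a full rack, but the hash-based choice is only a stand-in for the idea of a pseudo-random, system-managed mapping; it is not XIV's actual algorithm.

```python
# Hypothetical sketch of primary/secondary partition placement (not XIV's real code).
import hashlib

MODULES = 15
DISKS_PER_MODULE = 12

def _pick(volume_id: int, partition_no: int, copy: int, exclude_module: int = -1):
    """Deterministically pick a (module, disk) slot, optionally skipping one module."""
    candidates = [(m, d) for m in range(MODULES) if m != exclude_module
                  for d in range(DISKS_PER_MODULE)]
    digest = hashlib.sha256(f"{volume_id}:{partition_no}:{copy}".encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]

def place_partition(volume_id: int, partition_no: int):
    primary = _pick(volume_id, partition_no, copy=0)
    secondary = _pick(volume_id, partition_no, copy=1, exclude_module=primary[0])
    return primary, secondary

# Example: the two copies of partition 7 of volume 42 always sit in different modules.
p, s = place_partition(42, 7)
assert p[0] != s[0]
```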


Logical System Concepts - Slices

XIV data distribution architecture uses Slices for partition copies

Slices are spread across all disk drives in the system

Each slice has two copies: a Primary copy and a Secondary copy

There are 16,384 slices (times 2 copies, for a total of 32,768 slice copies)

Each disk holds approx. 182 slice copies [(16,384 x 2)/180]

A Primary Slice and its Secondary Slice will never reside on the same module

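The slice arithmetic on this slide, written out as a small check (numbers assume a full 180-drive rack, as quoted above):

```python
# Worked version of the slice arithmetic: 16,384 slices, each with a primary and
# a secondary copy, spread over the 180 disks of a full rack.
SLICES = 16_384
COPIES = 2
DISKS = 180

slice_copies = SLICES * COPIES            # 32,768 slice copies in total
per_disk = slice_copies / DISKS           # ~182 slice copies per disk
print(slice_copies, round(per_disk))      # 32768 182
```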


Logical System Concepts - Slices (Cont.)

When creating a Volume, it must span across all drives in the system

The minimum size of a volume is 17GB

1MB partition = 2^20 bytes (1,048,576); 16,384 partitions x 1MB = ~17GB

Each Partition is numbered with a Logical Partition number for its specific Volume

When creating another volume that is bigger than the minimum 17GB, the system allocates several 17GB chunks for it

In this example it is a 51GB Volume built from 3 x 17GB chunks

Logical Partition numbers are also assigned here

Numbering is always modulo 16,384

[Figure: logical partition numbering across Modules 1 and 2 for a 17GB volume and for a 51GB volume built from three 17GB chunks, with chunk offsets +0, +16384, and +32768]
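A short sketch of the sizing rules above: a volume is built from 17GB chunks of 16,384 x 1MB partitions, and logical partition numbers continue across chunks (equivalently, they wrap modulo 16,384 within each chunk). The helper name is illustrative only.

```python
# Sketch of 17GB chunk allocation and logical partition numbering.
PARTITION_BYTES = 2 ** 20                               # 1MB partition
PARTITIONS_PER_CHUNK = 16_384
CHUNK_BYTES = PARTITION_BYTES * PARTITIONS_PER_CHUNK    # 17,179,869,184 bytes ~= 17GB

def chunks_for(requested_gb: float) -> int:
    """Number of 17GB chunks allocated for a requested volume size (decimal GB)."""
    requested_bytes = int(requested_gb * 10 ** 9)
    return -(-requested_bytes // CHUNK_BYTES)           # ceiling division

print(CHUNK_BYTES / 10 ** 9)    # ~17.18 decimal GB -- the "17GB" increment
print(chunks_for(17))           # 1 chunk for the minimum-size volume
print(chunks_for(51))           # 3 chunks for the 51GB volume in the example

# Logical partition 20,000 of the 51GB volume sits in chunk 1 (offset +16384) and
# corresponds to slice number 20000 % 16384 = 3616 within that chunk.
logical_partition = 20_000
print(logical_partition // PARTITIONS_PER_CHUNK, logical_partition % PARTITIONS_PER_CHUNK)
```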


Logical System Concepts - Logical Volumes

Every logical volume consists of 1MB pieces of data

Interface modules will not hold partition copies from other interface modules

This is called FC Proof

Implemented because Interface modules are more prone to problems than Data modules

The physical capacity associated with a logical volume is always a multiple of 17GB

The maximum number of volumes that can be concurrently defined on the system is 4,605

The same address space is used for both volumes and snapshots and permits 16,377 addresses

Logical volumes are administratively managed within the context of Storage Pools

Logical volumes

The XIV Storage System presents logical volumes to hosts in the same manner as conventional subsystems; however, both the granularity of logical volumes and the mapping of logical volumes to physical disks are fundamentally different.

As discussed previously, every logical volume consists of 1MB (1024KB) pieces of data known as partitions.

The physical capacity associated with a logical volume is always a multiple of 17GB (decimal). Therefore, while it is possible to present a block-designated (discussed in module 3) logical volume to a host that is not a multiple of 17GB, the maximum physical space that is allocated for the volume will always be the sum of the minimum number of 17GB increments needed to meet the block-designated capacity. Note that the initial physical capacity actually allocated by the system upon volume creation may be less than this amount.

The maximum number of volumes that can be concurrently defined on the system is limited by:

1. The logical address space limit:

The logical address range of the system permits up to 16,377 volumes, although this constraint is purely logical and therefore should not normally be a practical consideration.

Note that the same address space is used for both volumes and snapshots.

2. The limit imposed by the logical and physical topology of the system for the minimum volume size:

The physical capacity of the system, based on 180 drives with 1TB of capacity per drive and assuming the minimum volume size of 17GB, limits the maximum volume count to 4,605 volumes. Again, since volumes and snapshots share the same address space, a system with active snapshots can have more than 4,605 addresses assigned collectively to both volumes and snapshots.

Logical volumes are administratively managed within the context of Storage Pools.

Since the concept of Storage Pools is administrative in nature, they are not part of the logical hierarchy inherent to the system's operational environment.
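A hedged back-of-envelope check (not an official formula) of how the 4,605 limit relates to figures quoted elsewhere in this deck: dividing the net usable capacity (~79,113 GB decimal, from the "Usable Storage Capacity" slide) by the real size of one 17GB increment (16,384 x 1MB) lands at roughly 4,605 minimum-size volumes.

```python
# Back-of-envelope consistency check, assuming the deck's own capacity figures.
USABLE_GB = 79_113                                   # decimal GB, quoted later in this deck
INCREMENT_BYTES = 16_384 * 2 ** 20                   # one "17GB" increment = ~17.18 decimal GB

max_min_size_volumes = USABLE_GB * 10 ** 9 / INCREMENT_BYTES
print(round(max_min_size_volumes))                   # ~4605 minimum-size (17GB) volumes
print(16_377)                                        # independent logical address-space limit
```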


Logical System Concepts - Volume Layout

The Partition Table maps between a logical partition number and the physical location on disk

The distribution algorithms seek to preserve the statistical equality of access among all physical disks

Each volume is allocated across at least 17GB (decimal) of capacity that is distributed evenly across all disks

Each disk has its data mirrored across all other disks, excluding the disks in the same module

The storage system administrator does not plan the layout of volumes on the modules

There are no unusable pockets of capacity, known as orphaned spaces

Upon component failure a new Goal Distribution is created

All disks participate in the enforcement and therefore rapidly return to full redundancy

Logical volume layout on physical disks

The XIV Storage System facilitates the distribution of logical volumes over disks and modules by means of a dynamic relationship between primary data copies, secondary data copies, and physical disks. This virtualization of resources in the XIV Storage System is governed by a pseudo-random algorithm.

Partition table

Mapping between a logical partition number and the physical location on disk is maintained in a Partition Table. The Partition Table maintains the relationship between the partitions that comprise a logical volume and their physical locations on disk.

Volume layout

At a high level, the data distribution scheme is an amalgam of mirroring and striping. While it is tempting to think of this in the context of RAID 1+0 (10) or 0+1, the low-level virtualization implementation precludes the use of traditional RAID algorithms in the architecture. This is because conventional RAID implementations cannot incorporate dynamic, intelligent, and automatic management of data placement based on knowledge of the volume layout, nor is it feasible for a traditional RAID system to span all drives in a subsystem, due to the vastly unacceptable rebuild times that would result.

Partitions are distributed on all disks using what is defined as a pseudo-random distribution function. The distribution algorithms seek to preserve the statistical equality of access among all physical disks under all conceivable real-world aggregate workload conditions and associated volume access patterns. Essentially, while not truly random in nature, the distribution algorithms in combination with the system architecture preclude the occurrence of the phenomenon traditionally known as hot spots.

The XIV Storage System contains 180 disks, and each volume is allocated across at least 17GB (decimal) of capacity that is distributed evenly across all disks.

Each logically adjacent partition on a volume is distributed across a different disk; partitions are not combined into groups before they are spread across the disks.

The pseudo-random distribution ensures that logically adjacent partitions are never striped sequentially across physically adjacent disks.

Each disk has its data mirrored across all other disks, excluding the disks in the same module.

Each disk holds approximately 1% of any other disk in other modules.
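A minimal sketch of the Partition Table idea described above: a per-volume map from logical partition number to the physical location of its primary and secondary copies. The record layout is illustrative only; the real table is internal to the system and never exposed to the storage administrator.

```python
# Hypothetical, simplified partition-table structure (illustration only).
from dataclasses import dataclass

@dataclass(frozen=True)
class PhysicalLocation:
    module: int      # 1..15 in a full rack
    disk: int        # 1..12 within the module
    offset_mb: int   # 1MB-aligned offset on the disk

# partition_table[volume_id][logical_partition_no] -> (primary, secondary)
partition_table: dict[int, dict[int, tuple[PhysicalLocation, PhysicalLocation]]] = {
    7: {
        0: (PhysicalLocation(module=4, disk=3, offset_mb=120),
            PhysicalLocation(module=11, disk=9, offset_mb=88)),   # different module
    }
}

primary, secondary = partition_table[7][0]
assert primary.module != secondary.module    # mirroring rule from the slide
```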


Logical System Concepts - Snapshots

A snapshot represents a point-in-time copy of a Volume

Snapshots are governed by almost all of the principles that apply to Volumes

Snapshots incorporate dependent relationships with their source volumes

Sources can be either logical volumes or other snapshots

A given partition of a primary volume and its snapshot are stored on the same disk

A write to this partition is redirected within the module, minimizing latency and utilization between modules

As updates occur to master volumes, the system's virtualized logical structure enables it to preserve the original point-in-time data

Snapshots

A snapshot represents a point-in-time copy of a Volume. Snapshots are governed by almost all of the principles that apply to Volumes. Unlike Volumes, snapshots incorporate dependent relationships with their source volumes, which can be either logical volumes or other snapshots. Because they are not independent entities, a given snapshot does not necessarily consist wholly of partitions that are unique to that snapshot. Conversely, a snapshot image will not share all of its partitions with its source volume if updates to the source occur after the snapshot was created.

Volumes and snapshots

Volumes and snapshots are mapped using the same distribution scheme.

A given partition of a primary volume and its snapshot are stored on the same disk drive.

As a result, a write to this partition is redirected within the module, minimizing the latency and utilization associated with additional interactions between modules.

As updates occur to master volumes, the system's virtualized logical structure enables it to elegantly and efficiently preserve the original point-in-time data associated with any and all dependent snapshots by simply redirecting the update to a new physical location on the disk. This process, referred to as redirect on write, occurs transparently from the host perspective by virtue of the virtualized remapping of the updated data, and minimizes any performance impact associated with preserving snapshots, regardless of the number of snapshots defined for a given master volume.
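A toy redirect-on-write model of the behaviour just described: a snapshot initially just points at the master's partitions; when the master is updated, the new data is written to a fresh location and only the master's pointer moves, so the snapshot keeps seeing the original point-in-time data. This is purely illustrative; the real XIV metadata structures are far more involved.

```python
# Toy redirect-on-write sketch (illustration only).
class Volume:
    def __init__(self):
        self.partitions = {}          # logical partition no -> physical "location"

    def snapshot(self):
        snap = Volume()
        snap.partitions = dict(self.partitions)   # copy pointers only, no data moved
        return snap

    def write(self, partition_no, data, allocator):
        new_location = allocator(data)            # redirect: write to a new place
        self.partitions[partition_no] = new_location

storage = []
def allocator(data):
    storage.append(data)
    return len(storage) - 1                       # index acts as the physical location

master = Volume()
master.write(0, "v1", allocator)
snap = master.snapshot()                          # instantaneous, metadata only
master.write(0, "v2", allocator)                  # redirected to a new location

print(storage[snap.partitions[0]])                # 'v1' -- snapshot unchanged
print(storage[master.partitions[0]])              # 'v2' -- master sees the update
```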


Logical System Concepts - Snapshots

As the host writes data, it is placed randomly across the system in 1MB chunks

Each Server has pointers in memory to the disks that hold the data locally

On a snapshot, each Server simply points to the original volume; it is a memory-only operation

Snapshot creation/deletion is instantaneous

Restore Volume from snapshot copy

[Figure: host logical view of a Volume and its Snapshot versus the XIV physical view of the data spread across the Data Modules]


    Usable Storage Capacity

    XIV Storage System reserves physical disk capacity for:

    Global spare capacity

    Metadata, including statistics and traces

    Mirrored copies of data

The global reserved space includes sufficient space to withstand the failure of a full module in addition to three disks

The system reserves roughly 4% of physical capacity for statistics and traces, as well as the distribution and partition tables

Net usable capacity is reduced by a factor of 50% to account for data mirroring

Usable capacity = [1,000GB x 0.96 x [180 - [12 + 3]]] / 2 = 79,113GB

The XIV Storage System reserves physical disk capacity for:

Global spare capacity

Metadata, including statistics and traces

Mirrored copies of data

Global spare capacity

The dynamically balanced distribution of data across all physical resources by definition obviates the dedicated spare drives that are necessary with conventional RAID technologies. Instead, the XIV Storage System reserves capacity on each disk in order to provide adequate space for the redistribution and recreation of redundant data in the event of a hardware failure.

The global reserved space includes sufficient space to withstand the failure of a full module in addition to three disks, enabling the system to execute a new Goal Distribution, discussed earlier, and return to full redundancy even after multiple hardware failures. Since the reserve spare capacity does not reside on dedicated disks, hot-spare space is reserved as a percentage of each individual drive's overall capacity.

Metadata and system reserve

The system reserves roughly 4% of physical capacity for statistics and traces, as well as the distribution and partition tables.

Net usable capacity

The calculation of the net usable capacity of the system consists of the total disk count, less the number of disks reserved for sparing, multiplied by the amount of capacity on each disk that is dedicated to data, and finally reduced by a factor of 50% to account for data mirroring.

Note: The calculation of the usable space is as follows:

Usable capacity = [drive capacity x (% utilized for data) x [Total Drives - Hot Spare reserve]] / 2

Usable capacity = [1,000GB x 0.96 x [180 - [12 + 3]]] / 2 = 79,113GB (decimal)
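The formula above, written out as a quick calculation. Note that with a flat 4% metadata reserve the arithmetic gives 79,200 GB; the deck's quoted 79,113 GB suggests the actual reserve is slightly larger than 4%, so treat 0.96 as an approximation rather than an exact factor.

```python
# Usable-capacity formula from this slide, with the deck's own inputs.
DRIVE_GB = 1_000            # decimal GB per SATA drive
DATA_FRACTION = 0.96        # ~4% reserved for metadata, statistics, traces, tables
TOTAL_DRIVES = 180
SPARE_DRIVES = 12 + 3       # reserve to survive a full module plus three disks
MIRROR_FACTOR = 2           # every partition is stored twice

usable_gb = DRIVE_GB * DATA_FRACTION * (TOTAL_DRIVES - SPARE_DRIVES) / MIRROR_FACTOR
print(usable_gb)            # 79200.0 (the deck quotes 79,113 GB)
```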


    Storage Pool Concepts

Storage Pools form the basis for controlling the usage of storage space

Manipulation of Storage Pools consists exclusively of metadata transactions

A logical volume is defined within the context of one and only one Storage Pool

A Consistency Group is a group of volumes that can be snapshotted at the same point in time

Storage Pool relationships:

A logical volume may have multiple independent snapshots. This logical volume is also known as a master volume

A master volume and all of its associated snapshots are always a part of only one Storage Pool

A volume may only be part of a single Consistency Group

All volumes of a Consistency Group must belong to the same Storage Pool

While the hardware resources within the XIV Storage System are virtualized in a global sense, the available capacity in the system can be administratively portioned into separate and independent Storage Pools. The concept of Storage Pools is purely administrative in that they are not a layer of the functional hierarchical logical structure employed by the system operating environment. Instead, the flexibility of Storage Pool relationships from an administrative standpoint derives from the granular virtualization within the system. Essentially, Storage Pools function as a means to effectively manage a related group of logical volumes and their snapshots.

Improved management of storage space

Storage Pools form the basis for controlling the usage of storage space by specific applications, groups of applications, or departments, enabling isolated management of relationships within the associated group of logical volumes and snapshots while imposing a capacity quota.

A logical volume is defined within the context of one and only one Storage Pool, and because a volume is equally distributed among all system disk resources, it follows that all Storage Pools must also span all system resources.

As a consequence of the system virtualization, there are no limitations on the size of Storage Pools or on the associations between logical volumes and Storage Pools. In fact, manipulation of Storage Pools consists exclusively of metadata transactions and does not impose any data copying from one disk or module to another. Hence, changes are completed instantly and without any system overhead or performance degradation.

Consistency Groups

A Consistency Group is a group of volumes that can be snapshotted at the same point in time, thus ensuring a consistent image of all volumes within the group at that time. The concept of Consistency Groups is ubiquitous among storage subsystems because there are many circumstances in which it is necessary to perform concurrent operations collectively across a set of volumes, so that the result of the operation preserves the consistency among volumes. For example, effective storage management activities for applications that span multiple volumes, or for creating point-in-time backups, would not be possible without first employing Consistency Groups.

    Storage Pool relationships

    Storage Pools facilitate administration of relationships between logical volumes, snapshots, and Consistency Groups.

    The following principles govern the relationships between logical entities within the Storage Pool:

    A logical volume may have multiple independent snapshots. This logical volume is also known as a master volume.

    A master volume and all of its associated snapshots are always a part of only one Storage Pool.

    A volume may only be part of a single Consistency Group.

    All volumes of a Consistency Group must belong to the same Storage Pool.
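A small validation sketch of the relationship rules listed above, using hypothetical record types: a master volume and its snapshots share one pool, a volume belongs to at most one Consistency Group, and every volume in a Consistency Group lives in that group's pool.

```python
# Hypothetical consistency checks for the pool/volume/CG rules (illustration only).
from dataclasses import dataclass, field

@dataclass
class VolumeRec:
    name: str
    pool: str
    snapshots: list = field(default_factory=list)   # snapshot records inherit the pool
    consistency_group: str | None = None            # at most one CG per volume

@dataclass
class ConsistencyGroupRec:
    name: str
    pool: str
    volumes: list = field(default_factory=list)

def check_rules(volumes, groups):
    for v in volumes:
        assert all(s.pool == v.pool for s in v.snapshots), "snapshots must share the master's pool"
    for g in groups:
        assert all(v.pool == g.pool for v in g.volumes), "CG volumes must share the CG's pool"
        assert all(v.consistency_group == g.name for v in g.volumes)

vol = VolumeRec("db_data", pool="prod")
vol.snapshots.append(VolumeRec("db_data.snap1", pool="prod"))
cg = ConsistencyGroupRec("db_cg", pool="prod", volumes=[vol])
vol.consistency_group = "db_cg"
check_rules([vol], [cg])      # passes silently when the rules hold
```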


    Storage Pool Concepts (Cont.)

    Storage pool size can vary from 17GB to full system capacity

Snapshot reserve capacity is defined within each regular Storage Pool and is maintained separately from logical volume capacity

Snapshots are structured as logical volumes; however, a Storage Pool's snapshot reserve capacity is granular at the partition level (1MB)

Snapshots will only be automatically deleted when there is inadequate physical capacity available in the Storage Pool

Space allocated for a Storage Pool can be dynamically changed

The designation of a Storage Pool as a regular pool or a thinly provisioned pool can be dynamically changed

The storage administrator can relocate logical volumes between Storage Pools without any limitations

Storage Pools have the following characteristics:

The size of a Storage Pool can range from as small as possible (17GB, the minimum size that can be assigned to a logical volume) to as large as possible (the entirety of the available space in the system) without any limitation imposed by the system (this is not true for hosts, however).

Snapshot reserve capacity is defined within each non-thinly provisioned, or regular, Storage Pool and is effectively maintained separately from logical, or master, volume capacity. The same principles apply for thinly provisioned Storage Pools, with the exception that space is not guaranteed to be available for snapshots due to the potential for hard space depletion.

Snapshots are structured in the same manner as logical volumes (also known as master volumes); however, a Storage Pool's snapshot reserve capacity is granular at the partition level (1MB). In effect, snapshots collectively can be thought of as being thinly provisioned within each increment of 17GB of capacity defined in the snapshot reserve space.

As discussed in the example above, snapshots will only be automatically deleted when there is inadequate physical capacity available within the context of each Storage Pool independently. This process is managed by a snapshot deletion priority scheme. Therefore, when a Storage Pool's size is exhausted, only the snapshots that reside in the affected Storage Pool are deleted.

The space allocated for a Storage Pool can be dynamically changed by the storage administrator:

The Storage Pool can always be increased in size, limited only by the unallocated space on the system.

The Storage Pool can always be decreased in size, limited only by the space consumed by the volumes and snapshots defined within that Storage Pool.

The designation of a Storage Pool as a regular pool or a thinly provisioned pool can be dynamically changed, even for existing Storage Pools.

The storage administrator can relocate logical volumes between Storage Pools without any limitations, provided there is sufficient free space in the target Storage Pool.

If necessary, the target Storage Pool capacity can be dynamically increased prior to volume relocation, assuming there is sufficient unallocated capacity available in the system.

When a logical volume is relocated to a target Storage Pool, sufficient space must be available for all of its snapshots to reside in the target Storage Pool as well.
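A sketch of the relocation rule above: moving a volume to another Storage Pool is a metadata operation, but it only succeeds if the target pool has room for the volume and all of its snapshots. The accounting here is simplified and purely illustrative, with sizes in GB.

```python
# Hypothetical pool accounting for volume relocation (illustration only).
class Pool:
    def __init__(self, name, size_gb):
        self.name = name
        self.size_gb = size_gb
        self.volumes = []                     # (volume_name, volume_gb, snapshots_gb)

    def used_gb(self):
        return sum(v + s for _, v, s in self.volumes)

    def free_gb(self):
        return self.size_gb - self.used_gb()

def relocate(volume_name, source: Pool, target: Pool):
    entry = next(e for e in source.volumes if e[0] == volume_name)
    needed = entry[1] + entry[2]              # volume plus all of its snapshots
    if target.free_gb() < needed:
        raise ValueError(f"target pool '{target.name}' needs {needed}GB free")
    source.volumes.remove(entry)              # no data is copied between disks
    target.volumes.append(entry)

src = Pool("dev", 1700);  src.volumes.append(("vol1", 51, 17))
dst = Pool("prod", 170)
relocate("vol1", src, dst)                    # 68GB fits into the 170GB pool
print(dst.used_gb(), src.used_gb())           # 68 0
```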


    Capacity Allocation and Thin Provisioning

    Soft volume size

    The size of the logical volume that is observed by the host

Soft volume size is specified in one of two ways, depending on units:

In terms of GB: The system allocates the soft volume size as the minimum number of discrete 17GB increments

In terms of blocks: The capacity is indicated as a discrete number of 512-byte blocks

The system still allocates capacity in 17GB increments; however, the precise size in blocks is reported to the host

Hard volume size

The physical space allocated to the volume following host writes to the volume

Upper limit is determined by the soft size assigned to the volume

Allocated to volumes by the system in increments of 17GB due to the underlying logical and physical architecture

Increasing the soft volume size does not affect the hard volume size

The XIV Storage System virtualization empowers storage administrators to thinly provision resources, vastly improving aggregate capacity utilization and tremendously simplifying resource allocation. Thin provisioning is a central theme of the virtualized design of the system, because it uncouples the virtual, or apparent, allocation of a resource from the underlying hardware allocation.

Hard and soft volume sizes

The physical capacity assigned to traditional, or fat, volumes is equivalent to the logical capacity presented to hosts. With the XIV Storage System, this does not need to be the case. All logical volumes by definition have the potential to be thinly provisioned as a consequence of the XIV Storage System's virtualized architecture, and therefore provide the most efficient capacity utilization possible. For a given logical volume, there are effectively two associated sizes. The physical capacity allocated for the volume is not static, but increases as host writes fill the volume.

Soft volume size

This is the size of the logical volume that is observed by the host, as defined upon volume creation or as a result of a resizing command. The storage administrator specifies the soft volume size in the same manner regardless of whether the Storage Pool itself will be thinly provisioned. The soft volume size is specified in one of two ways, depending on units:

1. In terms of GB: The system will allocate the soft volume size as the minimum number of discrete 17GB increments needed to meet the requested volume size.

2. In terms of blocks: The capacity is indicated as a discrete number of 512-byte blocks. The system will still allocate the soft volume size consumed within the Storage Pool as the minimum number of discrete 17GB increments needed to meet the requested size (specified in 512-byte blocks); however, the size that is reported to hosts is equivalent to the precise number of blocks defined.

Incidentally, the snapshot reserve capacity associated with each Storage Pool is a soft capacity limit and is specified by the storage administrator, though it effectively limits the hard capacity consumed collectively by snapshots as well.

Hard volume size

The volume's allocated hard space reflects the physical space allocated to the volume following host writes to the volume, and is discretely and dynamically provisioned by the system (not the storage administrator). The upper limit of this provisioning is determined by the soft size assigned to the volume.

The volume's consumed hard space is not necessarily equal to the volume's allocated hard capacity, because hard space allocation occurs in increments of 17GB, while actual space is consumed at the granularity of the 1MB partitions. Therefore, the actual physical space consumed by a volume within a Storage Pool is transient, because a volume's consumed hard space reflects the total amount of data that has been previously written by host applications:

Hard capacity is allocated to volumes by the system in increments of 17GB due to the underlying logical and physical architecture; there is no greater degree of granularity than 17GB, even if only a few partitions are initially written beyond each 17GB boundary.

Application write access patterns determine the rate at which the allocated hard volume capacity is consumed, and subsequently the rate at which the system allocates additional increments of 17GB, up to the limit defined by the soft size for the volume. As a result, the storage administrator has no direct control over the hard capacity allocated to the volume by the system at any given point in time.

During volume creation, or when a volume has been formatted, there is zero physical capacity assigned to the volume. As application writes accumulate to new areas of the volume, the physical capacity allocated to the volume will grow in increments of 17GB and may ultimately reach the full soft volume size.

Increasing the soft volume size does not affect the hard volume size.
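A sketch of the soft vs. hard size behaviour just described: the soft size is rounded up to whole 17GB increments (a block-defined size is still reported to the host exactly), and the hard size grows one 17GB increment at a time as writes reach new areas, never exceeding the soft size. The helper and class names are hypothetical.

```python
# Illustrative soft/hard size accounting, assuming 17GB = 16,384 x 1MB increments.
import math

INCREMENT_BYTES = 16_384 * 2 ** 20          # one "17GB" increment
BLOCK_BYTES = 512

def soft_size_bytes(*, gb=None, blocks=None):
    requested = int(gb * 10 ** 9) if gb is not None else blocks * BLOCK_BYTES
    increments = math.ceil(requested / INCREMENT_BYTES)
    reported = requested if blocks is not None else increments * INCREMENT_BYTES
    return increments * INCREMENT_BYTES, reported   # (allocated soft size, size seen by host)

class ThinVolume:
    def __init__(self, soft_bytes):
        self.soft_bytes = soft_bytes
        self.hard_bytes = 0                         # nothing allocated at creation

    def write(self, offset_bytes):
        needed = (offset_bytes // INCREMENT_BYTES + 1) * INCREMENT_BYTES
        self.hard_bytes = min(max(self.hard_bytes, needed), self.soft_bytes)

soft, reported = soft_size_bytes(blocks=20_000_000)   # ~10GB block-defined volume
vol = ThinVolume(soft)
vol.write(5 * 10 ** 9)                                # first writes land in increment 0
print(soft // INCREMENT_BYTES, vol.hard_bytes // INCREMENT_BYTES)   # 1 1
```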


    Capacity Allocation and Thin Provisioning (Cont.)

[Figure: logical view vs. physical view of two volumes in a Storage Pool, each built from 17GB increments, showing allocated soft space, allocated hard space, consumed hard space, snapshot reserve, and unused space. Callouts: the block definition lets hosts see a precise number of blocks, but even block-defined volumes have logical capacity allocated in increments of 17GB; consumed hard space grows as host writes accumulate to new areas of the volume; when snapshots are taken, snapshot consumed hard space grows as snapshot writes accumulate to new areas within the allocated snapshot reserve soft space; when a new thinly provisioned pool is created, soft space is allocated but no physical space is actually allocated.]

    Storage Pool level thin provisioning

While volumes are effectively thinly provisioned automatically by the system, Storage Pools can be defined by the storage administrator (when using the GUI) as either regular or thinly provisioned. Note that when using the XCLI, there is no specific parameter to indicate thin provisioning for a Storage Pool; you indirectly and implicitly create a Storage Pool as thinly provisioned by specifying a pool soft size greater than its hard size.

With a regular pool, the host-apparent capacity is guaranteed to be equal to the physical capacity reserved for the pool. The total physical capacity allocated to the constituent individual volumes and collective snapshots at any given time within a regular (non-thinly provisioned) pool will reflect the current usage by hosts, because the capacity is dynamically consumed as required. However, the remaining unallocated space within the pool remains reserved for the pool and cannot be used by other Storage Pools. Therefore, the pool will not achieve full utilization unless the constituent volumes are fully utilized, but conversely there is no chance of exceeding the physical capacity that is available within the pool, as is possible with a thinly provisioned pool.

In contrast, a thinly provisioned Storage Pool is not fully backed by hard capacity, meaning the entirety of the logical space within the pool cannot be physically provisioned unless the pool is first transformed into a regular pool. However, benefits may be realized when physical space consumption is less than the logical space assigned, because the amount of logical capacity assigned to the pool that is not covered by physical capacity is available for use by other Storage Pools.
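A minimal sketch of the pool-level distinction explained above: a pool is effectively thinly provisioned when its soft (host-apparent) size is larger than its hard (physically reserved) size; as noted, the XCLI has no explicit "thin" flag, so the relationship between the two sizes is what matters. The function name is hypothetical.

```python
# Minimal illustration of the soft-vs-hard pool size rule.
def pool_is_thin(soft_gb: int, hard_gb: int) -> bool:
    return soft_gb > hard_gb

print(pool_is_thin(soft_gb=1020, hard_gb=1020))   # False -> regular pool
print(pool_is_thin(soft_gb=2040, hard_gb=1020))   # True  -> thinly provisioned pool
```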


    Session II: Gen II Systems Hardware Design and Layout

    XIV Storage System Model 2810-A14

    Full Rack Systems

    Partially Populated Rack Systems

    The Rack, ATS and UPS modules

    Data Modules

    Interface Module

    SATA Disk Drives

    The Patch Panel

    Interconnection and Switches

    Maintenance Module

This chapter describes the hardware architecture of the XIV Storage System. The physical structures that make up the XIV Storage System are presented, such as the system rack, Interface, Data and Management modules, disks, switches, and power distribution devices.


    XIV Storage System Model 2810-A14

The XIV Storage System seen in this slide is designed to be a scalable enterprise storage system based upon a grid array of hardware components. The architecture offers the highest performance through maximized utilization of all disks and a true distributed cache implementation, coupled with more effective bandwidth. It also offers superior reliability through its distributed architecture, redundant components, self-monitoring, and self-healing.


    Full Rack Systems

    Hardware characteristics

The IBM 2810-A14 is a new generation of IBM high-performance, high-availability, and high-capacity enterprise disk storage subsystem. This slide summarizes the main hardware characteristics.

All XIV hardware components come pre-installed in a standard APC AR3100 rack. At the bottom of the rack, a UPS module complex made up of three redundant UPS units is installed and provides power to the Data Modules, Interface Modules, and switches.

A fully populated rack contains 15 modules, of which 6 are combined Data and Interface Modules equipped with the connectivity adapters (FC and Ethernet). Each module includes twelve 1TB SATA disk drives. This translates into a total raw capacity of 180TB for the complete system.

Two 48-port 1Gbps Ethernet switches form the basis of an internal redundant Gigabit Ethernet network that links all the modules in the system. The switches are installed in the middle of the rack, between the Interface Modules.

The connections between the modules and switches, and also all internal power connections in the rack, are realized by a redundant set of cables. For power connections, standard power cables and plugs are used. Additionally, standard Ethernet cables are used for interconnection between the modules and switches.

All 15 modules (6 Interface Modules and 9 Data Modules) have redundant connections through the two 48-port 1Gbps Ethernet switches. This grid network ensures communication between all modules even if one of the switches or a cable connection fails. Furthermore, this grid network provides the capabilities for parallelism and execution of the data distribution algorithm that contribute to the excellent performance of the XIV Storage System.


    Partially Populated Rack Systems


Hardware characteristics

The IBM 2810-A14 Partially Populated Rack provides a solution for mid-size and large enterprises that need to begin working with XIV storage at a lower capacity. This slide summarizes the main hardware characteristics.

All XIV hardware components come pre-installed in a standard APC AR3100 rack. At the bottom of the rack, a UPS module complex made up of three redundant UPS units is installed and provides power to the Data Modules, Interface Modules, and switches.

A partially populated rack contains 6 modules, of which 3 are combined Data and Interface Modules equipped with the connectivity adapters (FC and Ethernet). Each module includes twelve 1TB SATA disk drives. This translates into a total raw capacity of 72TB for the complete system.

Two 48-port 1Gbps Ethernet switches form the basis of an internal redundant Gigabit Ethernet network that links all the modules in the system. The switches are installed in the middle of the rack, between the Interface Modules.

The connections between the modules and switches, and also all internal power connections in the rack, are realized by a redundant set of cables. For power connections, standard power cables and plugs are used. Additionally, standard Ethernet cables are used for interconnection between the modules and switches.

All 6 modules (3 Interface Modules and 3 Data Modules) have redundant connections through the two 48-port 1Gbps Ethernet switches. This grid network ensures communication between all modules even if one of the switches or a cable connection fails. Furthermore, this grid network provides the capabilities for parallelism and execution of the data distribution algorithm that contribute to the excellent performance of the XIV Storage System.


    Partially Populated Rack Systems (Cont.)

Total Modules           6      9     10     11     12     13     14     15
Usable Capacity (TB)   27     43     50     54     61     66     73     79
Interface Modules       3      6      6      6      6      6      6      6
Data Modules            3      3      4      5      6      7      8      9
Disk Drives            72    108    120    132    144    156    168    180
Fibre Channel Ports     8     16     16     20     20     24     24     24
iSCSI Ports             0      4      4      6      6      6      6      6
Memory (GB)            48     72     80     88     96    104    112    120
Plant/Field Orderable  Plant  Field  Field  Field  Field  Field  Field  Both

Additional capacity configurations

The XIV Storage System Model A14 is now available in a six module configuration consisting of three interface modules (feature number 1100) and three data modules (feature number 1105). This configuration is designed to support the same capabilities and functions as the current 15 module XIV Storage System with the IBM XIV Storage System Software V10. It has all of the same auxiliary components and ships in the same physical rack as the 15 module system.

The six module configuration is field-upgradeable with additional interface modules and data modules to achieve configurations with a total of nine, ten, eleven, twelve, thirteen, fourteen, or fifteen modules. The resulting configuration can subsequently continue to be upgraded with one or more additional modules, up to the maximum of fifteen modules.


    The Rack, ATS and UPS modules

In case of extended external power failure or outage, the UPS module complex maintains battery power long enough to allow a safe and ordered shutdown

The Automatic Transfer System (ATS) supplies power to all three UPSs and the Maintenance module

The rack

The IBM XIV hardware components are installed in a 19-inch NetShelter SX 42U rack (APC AR3100) from APC. The rack is 1070mm deep to accommodate deeper modules and to provide more space for cables and connectors. Adequate space is provided to house all components and to properly route all cables. The rack door and side panels are locked with a key to prevent unauthorized access to the installed components.

The UPS module complex

The Uninterruptible Power Supply (UPS) module complex consists of three UPS units. Each unit maintains an internal power supply in the event of a temporary failure of the external power supply. In case of extended external power failure or outage, the UPS module complex maintains battery power long enough to allow a safe and ordered shutdown of the XIV Storage System. The complex can sustain the failure of one UPS unit while still protecting against external power disturbances.

The three UPS modules are located at the bottom of the rack. Each of the modules has an output of 6 kVA to supply power to all other components in the rack and is 3U in height. The design allows proactive detection of temporary power problems and can correct them before the system goes down. In the case of a complete power outage, integrated batteries continue to supply power to the entire system. Depending on the load of the IBM XIV, the batteries are designed to continue system operation from 3.3 minutes to 11.9 minutes, which gives enough time to gracefully power off the system.

Automatic Transfer System (ATS)

The Automatic Transfer System (ATS) supplies power to all three Uninterruptible Power Supplies (UPS) and to the Maintenance Module. Two separate external main power sources supply power to the ATS.

In case of power problems or a failing UPS, the ATS reorganizes the power load balance between the power components. The operational components take over the load from the failing power source or power supply. This rearrangement of the internal power load is performed by the ATS in a seamless way, and system operation continues without any application impact.


The UPS Behavior

All component power connections in the box are distributed across the 3 UPSs

All three UPSs run self-test procedures to validate the batteries' operational state

Self-test schedule is cycled on the UPSs with a 5-day interval between them - WRONG

An operational system is one where at least two UPSs are running on utility power with at least a 70% battery charge level

A single UPS failure will not impact power distribution

If two or more UPSs are in a failed state, the system will wait for a 30-second grace period before deciding on a graceful system shutdown

If one UPS is in a failed state, the next self-test instance will be skipped to avoid the chance of a second UPS failure

UPS self-test procedures are controlled by the system MicroCode

The self-test cycle is once every 14 days, with a 9-hour interval between each UPS

The XIV Storage System uses its memory DIMMs for cache purposes. When a system has a problem with power distribution, it is imperative that a fail-safe power distribution scheme be in place to avoid data loss.

Power Distribution Rules

All system components use power connections that are distributed evenly across the 3 UPSs. A fully operational system is one where at least two UPSs are running on utility power with at least a 70% battery charge level.

In order to allow the system to perform a graceful shutdown (the process in which the system commits all remaining I/Os in cache to disk and properly shuts down all system components), the XIV system needs at least two UPSs with a minimum 70% charge level.

In case of a single UPS failure, whether from a self-test failure or from a physical problem with the UPS itself, power distribution to the system will not be impacted. The system will continue to function normally.

If two or more UPSs are in a failed state, the system will wait for an additional 30-second grace period before determining that the system is indeed experiencing a major problem (to avoid taking the system down in cases of short power spikes) and issuing a graceful shutdown.

If one UPS is in a failed state for whatever reason, the next UPS self-test instance will be skipped to avoid the chance of a second UPS failure that would cause the box to issue a graceful shutdown.

The UPS self-test procedures are controlled by the system MicroCode and can be configured, if needed, using developer-level commands.

The default self-test interval is 5 days between each UPS.
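A sketch of the power rules described in these notes: the system stays fully operational while at least two UPSs are on utility power with at least 70% charge; if two or more UPSs are failed, it waits a 30-second grace period (to ride out short spikes) before starting a graceful shutdown. The state names and timing handling here are hypothetical simplifications.

```python
# Illustrative decision logic for the UPS rules above (not the real MicroCode).
from dataclasses import dataclass

GRACE_SECONDS = 30

@dataclass
class UPS:
    on_utility_power: bool
    charge_pct: float
    failed: bool = False

def system_state(upses, seconds_in_bad_state=0):
    healthy = [u for u in upses if u.on_utility_power and u.charge_pct >= 70 and not u.failed]
    failed = [u for u in upses if u.failed]
    if len(healthy) >= 2:
        return "operational"
    if len(failed) >= 2 and seconds_in_bad_state >= GRACE_SECONDS:
        return "graceful_shutdown"
    return "waiting_grace_period"

upses = [UPS(True, 100), UPS(True, 95), UPS(True, 80)]
print(system_state(upses))                                   # operational
upses[0].failed = upses[1].failed = True
print(system_state(upses, seconds_in_bad_state=10))          # waiting_grace_period
print(system_state(upses, seconds_in_bad_state=35))          # graceful_shutdown
```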


Data Module

[Figure callouts: system fans x 10, motherboard, SAS expander card, system PSUs x 2, PCI slots (ETH), CF card]

The hardware of the Interface Modules and the Data Modules is a Xyratex 1235E-X1. The module is 87.9 mm (2U) tall, 483 mm wide, and 707 mm deep. The weight depends on configuration and type (Data Module or Interface Module) and is a maximum of 30 kg.

The fully populated rack hosts 9 Data Modules (Modules 1-3 and Modules 10-15). There is no difference in the hardware between Data Modules and Interface Modules, except for the additional host adapters and GigE adapters in the Interface Modules. The main components of the module, besides the 12 disk drives, are:

    System Planar

    Processor

    Memory / Cache

    Enclosure Management Card

    Cooling devices (fans)

    Memory Flash Card

    Redundant Power Supplies

In addition, each Data Module contains four redundant Gigabit Ethernet ports. These ports, together with the two switches, form the internal network, which is the communication path for data and metadata between all modules. One dual GigE adapter is integrated in the System Planar (ports 1 and 2); the remaining two ports (3 and 4) are on an additional dual GigE adapter installed in a PCIe slot.


    Data Module (Cont.)

    Back view picture of a Data module and the CF card with the Addonics adapter.


    Data Module (Cont.)

The same system planar with a built-in SAS controller is used in both Data and Interface modules

Each module has 1 Intel Xeon Quad Core CPU

8GB of fully buffered DIMM memory modules

10 fans for cooling of disks, CPU and board

An enclosure management card to issue alarms in case of problems with the module

1GB Compact Flash card

This card is the boot device of the module and contains the software and module configuration files

Due to the configuration files, the Compact Flash card is not interchangeable between modules

    System Planar

The System Planar used in the Data Modules and the Interface Modules is a standard ATX board from Intel. This high-performance server board with a built-in SAS adapter supports:

64-bit quad-core Intel Xeon processor to improve performance and headroom, and provide scalability and system redundancy with multiple virtual applications.

Eight fully buffered 533/667 MHz DIMMs to increase capacity and performance.

Dual Gb Ethernet with Intel I/O Acceleration Technology to improve application and network responsiveness by moving data to and from applications faster.

Four PCI Express slots to provide the I/O bandwidth needed by servers.

SAS adapter

    Processor: The processor is a Xeon Quad Core Processor. This 64-bit processor has the

    following characteristics: 2.33 GHz clock 12 MB cache 1.33 GHz Front Serial Bus

    Memory / Cache: Every module has 8 GB of memory installed (8 x 1GB FBDIMM). Fully

    Buffered DIMM memory technology increases reliability, speed and density of memory for use

    with Xeon Quad Core Processor platforms. This processor memory configuration can provide 3

    times higher memory throughput, enable increased capacity and speed to balance capabilities

    of quad core processors, perform reads and writes simultaneously and eliminate the previous

    read to write blocking latency. Part of the memory is used as module system memory, while

    the rest is used as cache memory for caching data previously read, pre-fetching of data from

    disk and for delayed destaging of previously written data.


    Interface Module

    Figure callouts: 10 system fans, motherboard, SAS expander card, 2 system PSUs, PCI slots (FC & ETH), CF card


    Interface Module (Cont.)

    Interface Module

    The Interface Module is similar to the Data Module. The only differences are:

    Each Interface Module contains iSCSI and Fibre Channel ports, through which hosts can attach to the XIV Storage System. These ports can also be used to establish Remote Mirror links with another, remote XIV Storage System.

    There are two 4-port GigE PCIe adapters installed for additional internal network connections and also for the iSCSI ports.

    There are six Interface Modules (modules 4-9) available in the rack. All Fibre Channel ports, iSCSI ports, and Ethernet ports used for external connections are internally connected to a patch panel where the external cables are actually hooked up.

    Fibre Channel connectivity

    There are 4 FC ports (two 2-port adapters) available in each Interface Module, for a total of 24 FCP ports. They support 4 Gbps (gigabits per second) full-duplex data transfer over shortwave fibre links, using 50 micron multi-mode cable. The cable needs to be terminated on one end by a Lucent Connector (LC).

    In each module the ports are allocated as follows:

    ports 1 and 2 are allocated for host connectivity

    ports 3 and 4 are allocated for remote connectivity

    4Gb FC PCI Express adapter

    Fibre Channel connections to the Interface Modules are realized by two 2-port 4Gb FC PCI Express adapters per Interface Module, from LSI Corporation, for faster connectivity and improved data protection.

    This Fibre Channel host bus adapter (HBA) is based on LSI's FC949E controller and features full-duplex-capable FC ports that automatically detect the connection speed and can each independently operate at 1, 2, or 4 Gbps. The ability to operate at slower speeds ensures that these adapters remain fully compatible with legacy equipment. New end-to-end error detection (CRC) for improved data integrity during reads and writes is also supported.


    Interface Module (Cont.)

    iSCSI connectivity

    There are six iSCSI service ports (two per Interface Module) available for iSCSI over IP/Ethernet services. These ports are located in Interface Modules 7, 8, and 9 and support 1 Gbps Ethernet host connections. These ports should connect, through the patch panel, to the user's IP network and provide connectivity to the iSCSI hosts.

    iSCSI connections can be operated with different functionalities:

    As an iSCSI target: serving hosts through the iSCSI protocol

    As an iSCSI initiator for remote mirroring when connected to another iSCSI port

    As an iSCSI initiator for data migration when connected to a third-party iSCSI storage system

    For CLI and GUI access over the iSCSI ports

    iSCSI ports can be defined for different uses:

    Each iSCSI port can be defined as an IP interface

    Groups of Ethernet iSCSI ports on the same module can be defined as a single link aggregation group (IEEE standard: 802.3ad)

    Ports defined as a link aggregation group must be connected to the same Ethernet switch, and a parallel link aggregation group must be defined on that Ethernet switch.

    Although a single port is defined as a link aggregation group of one, IBM XIV support can override this configuration if such a setup is not operable with the customer's Ethernet switches.

    For each iSCSI IP interface these configuration options are definable:

    IP address (mandatory)

    Network mask (mandatory)

    Default gateway (optional)

    MTU (default: 1,536; maximum: 8,192)
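
    As a concrete illustration of these options, here is a minimal Python sketch that models an iSCSI IP interface definition and enforces the constraints listed above (mandatory address and netmask, optional gateway, MTU bounded by the stated default and maximum). The class and field names are invented for this example and are not taken from the XIV software.

    from dataclasses import dataclass
    from typing import Optional

    DEFAULT_MTU = 1536   # default MTU quoted above
    MAX_MTU = 8192       # maximum MTU quoted above

    @dataclass
    class IscsiIpInterface:
        address: str                   # mandatory
        netmask: str                   # mandatory
        gateway: Optional[str] = None  # optional
        mtu: int = DEFAULT_MTU         # optional, defaults as above

        def validate(self) -> None:
            if not self.address or not self.netmask:
                raise ValueError("IP address and network mask are mandatory")
            if not (0 < self.mtu <= MAX_MTU):
                raise ValueError(f"MTU must be between 1 and {MAX_MTU}")

    # Example: a host-facing interface on an Interface Module iSCSI port.
    iface = IscsiIpInterface(address="192.0.2.10", netmask="255.255.255.0",
                             gateway="192.0.2.1")
    iface.validate()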


    SATA Disk Drives

    The SATA disk drives used in the IBM XIV are 1 TB, 7200 rpm hard drives designed for high-capacity storage in enterprise environments.

    All IBM XIV disks are installed in the front of the modules, twelve disks per module.

    Each single SATA disk is installed in a disk tray, which connects the disk to the backplane and includes the disk indicators on the front. If a disk is failing, it can be replaced easily from the front of the rack. The complete disk tray is one FRU, which is latched into its position by a mechanical handle.


    SATA Disk Drives (Cont.)

    Performance features

    3 Gb/s SAS interface supporting key features in SATA specification

    32 MB cache buffer for enhanced data transfer performance

    Rotation Vibration Safeguard (RVS) prevents performance degradation

    Reliability features

    Advanced magnetic recording heads and media

    Self-Protection Throttling (SPT) monitors I/O

    Thermal Fly-height Control (TFC) provides better soft error rate

    Fluid Dynamic Bearing (FDB) motor improves acoustics and positional accuracy

    R/W heads are placed on the load/unload ramp to protect user data when power is removed

    The IBM XIV was engineered with substantial protection against data corruption and data loss, not just relying on the sophisticated distribution and reconstruction methods. Several features and functions implemented in the disk drive also increase reliability and performance. The highlights are:

    Performance features and benefits

    SAS interface

    The disk drive features a 3 Gb/s SAS interface supporting key features of the Serial ATA specification, including NCQ (Native Command Queuing), staggered spin-up, and hot-swap capability.

    32 MB cache buffer

    Internal 32 MB cache buffer enhances the data transfer performance

    Rotation Vibration Safeguard (RVS)

    In multi-drive environments, rotational vibration, which results from the vibration of neighboring drives in a system, can degrade hard drive performance. To aid in maintaining high performance, the disk drive incorporates enhanced Rotation Vibration Safeguard (RVS) technology, providing up to 50% improvement over the previous generation against performance degradation, leading the industry.

    Reliability features and benefits

    Advanced magnetic recording heads and media

    Excellent soft error rate for improved reliability and performance

    Self-Protection Throttling (SPT)

    SPT monitors and manages I/O to maximize reliability and performance

    Thermal Fly-height Control (TFC)

    TFC provides better soft error rate for improved reliability and performance

    Fluid Dynamic Bearing (FDB) Motor

    FDB Motor to improve acoustics and positional accuracy

    Load/unload ramp

    The R/W Heads are placed outside the data area to protect user data when power is removed


    The Patch Panel

    4 FC ports on each Interface Module (4, 5, 6, 7, 8, 9)

    2 iSCSI ports on each Interface Module (7, 8, 9)

    3 connections for GUI and/or XCLI from the customer network (4, 5, 6)

    2 ports for VPN connectivity (4, 6)

    2 service ports (4, 5)

    1 maintenance module connection

    2 reserved ports

    The patch panel is located at the rear of the rack. Interface Modules are connected to the patch panel using 50 micron cables. All external connections should be made through the patch panel. In addition to the host and network connections, further ports are available on the patch panel for service connections.
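
    To make the port assignments above easier to scan, here is a small Python sketch that builds a patch-panel map keyed by Interface Module number. The dictionary layout is purely illustrative; it simply restates the counts and module numbers from the slide (the maintenance connection and the two reserved ports are not tied to a specific module here).

    # Illustrative patch-panel map: Interface Module number -> external ports.
    patch_panel = {m: [] for m in range(4, 10)}              # Interface Modules 4-9

    for m in range(4, 10):
        patch_panel[m] += [f"FC-{p}" for p in range(1, 5)]   # 4 FC ports each
    for m in (7, 8, 9):
        patch_panel[m] += ["iSCSI-1", "iSCSI-2"]             # 2 iSCSI ports each
    for m in (4, 5, 6):
        patch_panel[m] += ["management (GUI/XCLI)"]          # 3 management ports
    for m in (4, 6):
        patch_panel[m] += ["VPN"]                            # 2 VPN ports
    for m in (4, 5):
        patch_panel[m] += ["service"]                        # 2 service ports

    total_fc = sum(1 for ports in patch_panel.values()
                   for p in ports if p.startswith("FC"))
    assert total_fc == 24   # matches the 24 FCP ports mentioned earlier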


    Interconnection and Switches

    Internal module communication is based on 2 redundant 48-port Gigabit Ethernet switches

    The switches are connected to each other by interlink switch connections

    The switches use an RPS unit to eliminate the switch power supply as a single point of failure

    Internal Ethernet switches

    The internal network is based on two redundant 48-port Gigabit Ethernet switches (Dell PowerConnect 6248). Each of the modules (Data or Interface) is directly attached to each of the switches with multiple connections, and the switches are also linked to each other. This network topology enables maximal bandwidth utilization, because the switches are used in an active-active configuration, while being tolerant to any individual failure of network components such as a port, link, or switch. If one switch fails, the bandwidth of the remaining connections is sufficient to prevent a noticeable performance impact and still keep enough parallelism in the system.

    The Dell PowerConnect 6248 is a Gigabit Ethernet Layer 3 switch with 48 copper ports and 4 combined ports (SFP or 10/100/1000), robust stacking, and 10 Gigabit Ethernet uplink capability. The switches are powered by Dell RPS-600 redundant power supplies to eliminate the switch power supply as a single point of failure.


    Interconnection and Switches (Cont.)

    Module - USB to serial

    The module USB-to-serial connections are used by internal processes to keep communication to the modules alive in the event that the network connection is not operational. Modules are linked together with these USB-to-serial cables in groups of 3 modules. This emergency link is needed for the modules to communicate for internal processes, and it is used by maintenance only to repair internal network communication issues.

    The USB-to-serial connections always link a group of three modules:

    The USB port of Module 1 is connected to the serial port of Module 3

    The USB port of Module 3 is connected to the serial port of Module 2

    The USB port of Module 2 is connected to the serial port of Module 1

    This connection sequence is repeated for modules 4-6, 7-9, 10-12, and 13-15, as illustrated by the sketch below.
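
    The following small Python sketch generates that cabling pattern for all five module groups. It is only an illustration of the sequence described above; the function name and data layout are invented here.

    def usb_serial_cabling(groups=((1, 2, 3), (4, 5, 6), (7, 8, 9),
                                   (10, 11, 12), (13, 14, 15))):
        """Return (usb_module, serial_module) pairs for each group of three."""
        cables = []
        for a, b, c in groups:
            cables += [(a, c),   # USB of first module -> serial of third
                       (c, b),   # USB of third module -> serial of second
                       (b, a)]   # USB of second module -> serial of first
        return cables

    # The first group yields [(1, 3), (3, 2), (2, 1)], matching the list above.
    print(usb_serial_cabling()[:3])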


    Maintenance Module

    Used by IBM XIV support to maintain and repair the system

    When needed, XIV support can connect remotely

    Through a modem connection attached to the maintenance module

    The maintenance module is a 1U generic server

    It is powered through the ATS directly

    This is the only component in the system that is not redundant

    The maintenance module is not part of the XIV storage architecture

    If it is down, it does not affect the system

    The maintenance module is connected through Ethernet connections to modules 5 and 6

    The Maintenance Module and the modem, installed in the middle of the rack, are used by IBM XIV Support and the SSR/CE to maintain and repair the machine. When there is a software or hardware problem that needs the attention of the IBM XIV Support Center, a remote connection is required to analyze and possibly repair the faulty system. The connection can be established either via a VPN (virtual private network) broadband connection or via a phone line and modem.

    Modem

    The modem installed in the rack is needed and used for remote support. It enables the IBM XIV Support Center specialists and, if necessary, higher levels of support to connect to the XIV Storage System. Problem analysis and repair actions without a remote connection are complicated and time consuming.

    Maintenance Module

    A 1U remote support server is also required for full functionality and supportability of the IBM XIV. This device has fairly generic requirements, as it is only used by support personnel to gain remote access to the system via VPN or modem. The current choice for this device is a SuperMicro 1U server with an average commodity-level configuration.


    Session III: XIV Software Framework

    Basic Terminology

    Communication infrastructure

    Single Module Frameworks

    System Nodes

    File Systems on the Module


    Basic Terminology

    Module - a physical component

    Regular module (contains disks)

    Interface module (disks and also SCSI interfaces)

    Power module

    Switching module

    Node - XIV software component that runs on several modules

    Singleton Node - XIV software component that at any given time runs on a single module

    Basic Terminology

    There are several components that make up the XIV Storage System.

    Module - the physical components that are used to build the system

    Regular module - contains only disks

    Interface module - contains disks and host connectivity interfaces for FC and iSCSI

    Power module - the UPSs of the system

    Switching module - the switches used to interconnect all the different modules

    Node - a part of the XIV software components that runs on several modules

    Singleton Node - a part of the XIV software components that at any given time runs only on a specific module


    Communication infrastructure

    NetPatrol

    Guarantees network connectivity between every two modules

    MCL

    Provides a transactional layer between any two nodes

    Each node has a unique id, through which the node resolves to the type of the node and the module the node runs on

    RPC

    Exported MicroCode functions are called via RPC

    Transported over MCL

    XIV Configuration

    Each module holds a copy of the XIV system configuration with the current status of all XIV modules and nodes

    Transported over MCL

    Communication Infrastructure

    The XIV Storage architecture includes several communication components that allow the system to provide its capabilities and maintain reliability.

    NetPatrol

    guarantees network connectivity between every two modules in the system.

    MCL - Management Control Layer

    Each node has a unique identifier. The id is unique and resolves to the type of the node and the module the node runs on

    Each process may have several MCL queues to send/receive transactions on its own

    Handles timeouts and retries

    Resolves singleton roles to node id, aware of singleton election

    Uses a text-based, forward-compatible protocol

    RPC - Remote Procedure Call

    Any exported MicroCode function may be called via RPC. There are no limitations

    Marshalling/Unmarshalling is generic and relies on auto-generated information from an XML file

    Supports both sync/async client and server calls

    Marshals to a compact binary form and a forward-compatible XML form

    Easy to migrate to transactional transport: MCL, SCSI

    XIV Configuration

    Implemented over MCL

    The XIV Configuration is loaded into the memory of each data and interface module

    It holds the current status of all modules and nodes in the system

    Each change in the status will trigger a set of operations to be handled by the system to maintain its full redundancy and proper operational state
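
    As an illustration of the node-id idea used by MCL, the sketch below shows one way a unique node id could resolve to a node type and the module it runs on. The encoding, registry, and names are assumptions made for this example; the actual MCL id format is not described in this material.

    from collections import namedtuple

    NodeId = namedtuple("NodeId", ["node_type", "module"])

    # Hypothetical registry: unique id -> (node type, hosting module).
    # Callers address a node by id; the transport resolves where it lives.
    registry = {
        101: NodeId("platform_node", 1),
        102: NodeId("cte", 1),
        407: NodeId("i_node", 4),
        901: NodeId("manager", 2),   # current location of the singleton
    }

    def resolve(node_id: int) -> NodeId:
        """Resolve a unique node id to the node type and hosting module."""
        return registry[node_id]

    target = resolve(407)
    print(f"send transaction to {target.node_type} on module {target.module}")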


    Single Module Frameworks - Basic features

    Modules are symmetric and have exactly the same data

    All configuration is saved in a single XML file

    The only difference between the modules' file systems is the module_id file

    On replacement modules, the ID is assigned during the component testing phase, prior to moving them to an operational state

    Tight integration with cluster hardware

    Firmware management

    Hardware configuration

    Hardware monitoring

    Single Module Frameworks

    On the XIV Storage System all modules are symmetric and have exactly the same data. All the configuration data of a module is saved in a single XML file.

    Each module has a unique ID that helps the system determine the module's purpose and the services that should run on it; the only file that differs between modules is the module_id file. In the case of a module replacement, the new module's FRU id is zeroed out and is assigned during the component test phase, as sketched below.
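
    A minimal Python sketch of that idea, assuming a plain-text module_id file and invented path and helper names; the real file format and the logic used by the system are not detailed in this material.

    from pathlib import Path

    # Hypothetical location of the per-module identity file; everything else
    # on the module's file system is identical across modules.
    MODULE_ID_FILE = Path("/local/module_id")

    INTERFACE_MODULES = range(4, 10)   # modules 4-9 carry FC/iSCSI interfaces

    def module_role() -> str:
        """Derive the module's role from its module_id file."""
        module_id = int(MODULE_ID_FILE.read_text().strip())
        if module_id == 0:
            return "unassigned (replacement module awaiting the test phase)"
        if module_id in INTERFACE_MODULES:
            return f"interface module {module_id}"
        return f"data module {module_id}"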

    The MicroCode maximizes the use of the module capabilities to achieve a high level of availability and reliability from the system. The MicroCode is tightly integrated with:

    Firmware management

    Hardware configuration

    Hardware monitoring


    System Nodes

    Platform Node - all modules (process: platform_node)

    The Platform Node manages installation/upgrade of Module software

    Hardware configuration

    Running services and nodes, and keepalive messages handling

    Auto-generating service-specific configuration files

    Sending heartbeats to Management Node

    Handling configuration changes for the module

    Interface Node - modules 4-9 (process: i_node)

    Implements the necessary protocols to serve as a SCSI target for the FC and iSCSI transport

    DESCRIPTION OF ALL NODES

    This section covers all Nodes in the system:

    Platform Node (process: platform_node)

    The Platform Node runs on all Modules and manages the software and hardware of a Module. It manages installation/upgrade of Module software, configures all configurable hardware (WWNs, IP addresses, and so on), runs all Services and Nodes upon startup, auto-generates Service-specific and UNIX configuration files (/etc/ssh/sshd_config or /etc/hosts, for example), handles the keepalive messages of all Nodes on a Module, sends heartbeats to the Management Node, and handles Configuration changes for that Module. The Platform Node is normally the only process executed by xinit upon normal startup (it is hardcoded in xinit), and it spawns all Services and Nodes in accordance with the Configuration for the Module it is running on.
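
    A rough Python sketch of that supervision duty, to make the keepalive/heartbeat split concrete. The structure and names are assumptions for illustration; the actual platform_node implementation is not shown in this material.

    import time

    def supervise(nodes, send_heartbeat, respawn, interval=1.0):
        """Toy supervision loop in the spirit of the duties described above.

        `nodes` maps a node name to a callable that returns True while the
        local process is alive; `send_heartbeat` reports module health to the
        Management Node and `respawn` restarts a node whose keepalive stopped.
        """
        while True:
            for name, is_alive in nodes.items():
                if not is_alive():
                    respawn(name)      # restart a locally failed node
            send_heartbeat()           # heartbeat toward the Management Node
            time.sleep(interval)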

    Interface Node (process: i_node)

    The Interface Node implements the necessary protocols to serve as a SCSI Target for the FC and iSCSI transports. For iSCSI communication it relies on an external process called iscsi_host_session to set up iSCSI sessions.


    System Nodes (Cont.)

    Cache Node - all modules (process: cte)

    The storage backend of the XIV Storage system

    Each is a holder of partitions against which IOs are performed

    Gateway Node - modules 4-9 (process: gw_node)

    In charge of serving as the SCSI initiator for XDRP mirroring and data migration

    Admin Node - modules 4-9 (process: aserver)

    Listens on port 7777 (using STunnel from 7778) and receives XML describing commands, then passes them to the Administrator

    The Administrator parses and validates the XMLs, and translates them into an XIV RPC call to be executed by the Management Node

    Cache Node (process: cte)

    The Cache Node runs on all Modules with storage disks (at the time of this writing, all Gen2 Modules). It is the 'storage backend' of the XIV storage array. Each Cache Node is the primary or secondary holder of zero or more data chunks called Partitions, against which IOs are performed. A Cache Node services reads/writes from/to Partitions and decides which Partitions to keep in memory.
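
    The "decides which Partitions to keep in memory" part is essentially a caching policy. The sketch below shows a minimal LRU-style partition cache in Python as one plausible illustration; the actual cte caching and destaging logic is not described here, so treat the names and the eviction policy as assumptions.

    from collections import OrderedDict

    class PartitionCache:
        """Toy LRU cache of partition buffers held in module memory."""

        def __init__(self, capacity, read_from_disk):
            self.capacity = capacity           # max partitions kept in memory
            self.read_from_disk = read_from_disk
            self.buffers = OrderedDict()       # partition id -> bytes

        def read(self, partition_id):
            if partition_id in self.buffers:
                self.buffers.move_to_end(partition_id)    # mark recently used
                return self.buffers[partition_id]
            data = self.read_from_disk(partition_id)      # cache miss
            self.buffers[partition_id] = data
            if len(self.buffers) > self.capacity:
                self.buffers.popitem(last=False)          # evict the oldest
            return data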

    Gateway Node (process: gw_node)

    The Gateway Node runs on all Modules with an external data port (iSCSI/FC) and is in charge of serving as the SCSI Initiator for XDRP mirroring and Data Migration. Amongst other things, the Gateway Node writes blocks to the Target of an XDRP Volume for which a Secondary Volume is defined, reads new blocks from data-migrated Volumes, and recovers bad blocks from the Secondary Volume of a Primary Volume for which XDRP is defined and a Media Error occurred.

    Admin Node (process: aserver)

    The Admin Node runs on all Modules with an external management port and is in charge of processing and executing XCLI commands (the exterior API of the system). The Admin Node listens on port 7777 and uses STunnel to redirect commands coming from 7778. It receives XML-described commands over TCP and passes them to what is called "the Administrator". The Administrator (administrator.py) is a Python-written passive node which parses and validates the XMLs and translates them into an XIV RPC call to be executed by the Management Node (formerly called the Manager). The Admin Node spawns Administrators as necessary, and it may kill old ones and respawn new ones.
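
    Since the Administrator is described as Python-written, a tiny Python sketch of the XML-to-RPC translation step may help. The command format, tag names, and the rpc_call helper are invented for illustration and do not reflect the real administrator.py.

    import xml.etree.ElementTree as ET

    def translate_command(xml_text, rpc_call):
        """Parse an XML-described command and hand it to an RPC callable.

        Hypothetical format: <command name="vol_list">
                                 <param name="pool" value="db_pool"/>
                             </command>
        """
        root = ET.fromstring(xml_text)
        if root.tag != "command" or "name" not in root.attrib:
            raise ValueError("malformed command XML")     # validation step
        params = {p.attrib["name"]: p.attrib["value"]
                  for p in root.findall("param")}
        # Translate into an RPC call for the Management Node to execute.
        return rpc_call(root.attrib["name"], **params)

    # Example: a stand-in rpc_call that just echoes what it would execute.
    print(translate_command('<command name="vol_list"/>',
                            lambda name, **kw: (name, kw)))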


    System Nodes (Cont.)

    Management Node - singleton 1|2|3 (process: manager)

    In charge of managing system data redundancy by manipulating the data rebuilds and distributions

    Operation state changes (On, Maintenance, Shutting down)

    Processing XCLI commands as they are received from the Administrator in the form of an XIV RPC call

    Cluster Node - singleton 1|2|3 (process: cluster_hw)

    In charge of managing hardware which doesn't belong to any particular module (UPS, Switch)

    Event Node - singleton 1|2|3 (process: event_node)

    Processes event rules and acts as needed (SMTP, SNMP, SMS)

    Adds newly created events to the relevant part of the configuration

    Management Node (process: manager)

    The Management Node is a Singleton whose primary responsibility is managing system data redundancy by means of manipulating the Distribution. In simpler terms, the Management Node decides which Partition should reside on which Cache Nodes and manages the Rebuild and Redistribution processes. In addition, the Management Node is in charge of Operation State Changes (shutting down, Maintenance Mode, and so on) and of processing XCLI commands as they are received from the Administrator in the form of an XIV RPC call.
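
    To make "which Partition should reside on which Cache Nodes" more concrete, here is a deliberately simplified Python sketch that assigns each partition a primary and a secondary module while keeping the two copies on different modules. The real XIV distribution algorithm is pseudo-random and far more involved; this is only an illustration of the kind of mapping the Management Node maintains.

    import random

    def distribute(partition_ids, modules, seed=0):
        """Map each partition id to a (primary_module, secondary_module) pair."""
        rng = random.Random(seed)          # deterministic for the example
        distribution = {}
        for pid in partition_ids:
            primary, secondary = rng.sample(modules, 2)   # two distinct modules
            distribution[pid] = (primary, secondary)
        return distribution

    modules = list(range(1, 16))           # data-holding modules 1-15
    table = distribute(range(8), modules)
    print(table[0])                        # a (primary, secondary) pair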

    Cluster Node (process: cluster_hw)

    The Cluster Node is a Singleton Node in charge of managing hardware which doesn't belong to any particular Module. For example, the Cluster Node monitors the UPS and switches, updating their status in the Configuration as necessary.

    Event Node (process: event_node)

    The Event Node is a Singleton Node which runs on a Module with an external management port. Its first duty is to process Event rules and act upon them for every Event that is created (for example, rules may dictate sending an SMTP email or an SNMP trap). In addition, it adds newly created Events to the relevant part of the configuration, so that Event Saver Nodes (see below) will store them.
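
    A minimal Python sketch of rule-driven event handling as described above; the rule fields, severity scale, and action names are assumptions for illustration, not the actual event_node rule schema.

    # Hypothetical rules: match events by minimum severity and fire an action.
    SEVERITY = {"informational": 0, "warning": 1, "major": 2, "critical": 3}

    rules = [
        {"min_severity": "warning",  "action": "send_snmp_trap"},
        {"min_severity": "critical", "action": "send_smtp_email"},
    ]

    def process_event(event, actions):
        """Apply every matching rule's action to a newly created event."""
        for rule in rules:
            if SEVERITY[event["severity"]] >= SEVERITY[rule["min_severity"]]:
                actions[rule["action"]](event)

    actions = {name: (lambda e, n=name: print(n, "->", e["code"]))
               for name in ("send_snmp_trap", "send_smtp_email")}
    process_event({"severity": "critical", "code": "UPS_FAILED"}, actions)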


    System Nodes (Cont.)

    SCSI Reservation Node - singleton 1|2|3 (process: isync_node)

    Receives SCSI II & III commands (Reserve, Release, Register) for fast processing by the system

    The SCSI Reservation table is maintained on each Interface Node (for redundancy) based on updates coming from the isync_node

    Equip Master Node - all modules (process: equip_master)

    Lets foreign modules currently being equipped (test phase) download the XIV software

    Event Saver Node - all modules (process: event_saver)

    Receives all events created on the system and saves them to the VISDOR virtual disk

    HW Monitoring Node - all modules (process: hw_mon_node)

    Monitors all available hardware on the module

    SCSI Reservation Node (process: isync_node)

    The SCSI Reservation Node is a Singleton Node that handles incoming SCSI II and III commands from various hosts (Reserve, Release, Register). All incoming commands are routed through the i_nodes to the isync_node, which then updates all the i_nodes on its decision. All the i_nodes hold a copy of the SCSI Reservation table for redundancy.

    Equip Master Node (process: equip_master)

    The Equip Master Node runs on all Modules and lets foreign Modules which are currently being equipped (during the 'test' phase) download XIV software from the Module the Equip Master Node is currently running on. It is a bit like an XIV RPC based file server.

    Event Saver Node (process: event_saver)

    The Event Saver Node runs on all Modules with storage disks (at the time of this writing, all Gen2 Modules). It receives all Events created on the system and saves them to permanent media (at the time of this writing, to a partition of the VISDOR virtual disk).

    Hardware Monitor Node (process: hw_mon_node)

    The Hardware Monitor runs on all Modules and checks all hardware in the module that the system knows how to monitor. Monitored hardware includes the HBAs (if any), disks, enclosure (PSU, fans, IPMI module, heat levels), Ethernet interfaces, and the SAS controller. The Hardware Monitor does not configure hardware or act upon its findings; this is the job of other Nodes.


    File Systems on the Module

    The XIV Storage System is based on the IBM-MCP Linux distribution

    Configuration for traditional Unix binaries is generated automatically

    The Compact Flash file systems are mounted as read only

    Persistent data is stored on a special logical volume (VISDOR), triple-mirrored on the HDDs and taking 2.5% of the system capacity

    VISDOR holds event logs and traces for the system

    ISV holds statistics data for interface modules only

    XIV File Systems

    The Gen 2 XIV Storage System is based on the IBM-MCP Linux distribution. All configurations for traditional Unix binaries are generated automatically.

    Inside each module there is a PCI Addonics card that hosts a CF card. That CF card holds the basic MCP file system, mounted read only. To accommodate some Unix-related metadata, the system creates a small RAM-based file system. Persistent data, such as event logs and traces, is stored on a special logical volume called VISDOR. VISDOR is triple-mirrored across all the disk drives in the system and takes about 2.5% of the overall system capacity. VISDOR is usually mounted over /dev/sdb; everything under /local resides in the VISDOR.

    ISV is an XIV hidden volume that is mapped to Interface Modules only. It is usually mounted as /dev/sdc. It can hold approximately a year's worth of statistics for general data and a month's worth when sampling a specific Host/Volume.


    Session IV: XIV Systems Management

    Managing XIV Systems

    XIV Management Architecture

    The XIV Graphic User Interface (GUI)

    Managing Multiple Systems

    The XIV Command Line Interface (XCLI)

    Session XCLI

    Lab 1: Working with the XIV Management Tools


    Managing XIV Systems

    XIV Systems management can be done through both GUI and CLI commands

    The XIV Storage Manager can be installed on:

    Microsoft Windows - GUI and CLI

    Mac OS X - GUI and CLI

    Linux - GUI and CLI

    SUN Solaris - CLI

    IBM AIX - CLI

    HP/UX - CLI

    More info on XIV Storage Management

    http://www.ibm.com/systems/storage/disk/xiv/index.html

    The XIV Storage System software supports the functions of the XIV Storage System and provides the functional capabilities of the system. It is preloaded on each module (Data and Interface Modules) within the XIV Storage System. The functions and nature of this software are equivalent to what is usually referred to as microcode or firmware on other storage systems.

    The XIV Storage Management software is used to communicate with the XIV Storage System software, which in turn interacts with the XIV Storage hardware.

    The XIV Storage Manager can be installed on a Linux, SUN Solaris, IBM AIX, HP/UX, Microsoft Windows, or MacOS-based management workstation that will then act as the management console for the XIV Storage System. The Storage Manager software is provided at time of installation, or is optionally downloadable from the following web site:

    http://www.ibm.com/systems/storage/disk/xiv/index.html

    For detailed information about XIV Storage Management software compatibility, refer to the XIV interoperability matrix or the System Storage Interoperability Center (SSIC) at:

    http://www.ibm.com/systems/support/storage/config/ssic/index.jsp

    The IBM XIV Storage Manager includes a user-friendly and intuitive Graphical User Interface (GUI) application, as well as an Extended Command Line Interface (XCLI) component offering a comprehensive set of commands to configure and monitor the system.

    Graphical User Interface (GUI)

    A simple and intuitive GUI allows a user to perform most administrative and technical operations (depending upon the user role) quickly and easily, with minimal training and knowledge. The main motivation behind the XIV management and GUI design is the desire to keep the complexities of the system and its internal workings completely hidden from the user. The most important operational challenges, such as overall configuration changes, volume creation or deletion, snapshot definitions, and many more, are achieved with a few clicks. This chapter contains descriptions and illustrations of tasks performed by a Storage administrator when using the XIV graphical user interface (GUI) to interact with the system.

    Extended Command Line Interface (XCLI)

    The XIV Extended Command Line Interface (XCLI) is a powerful text-based, command-line tool that enables an administrator to issue simple commands to configure, manage, or maintain the system, including the definitions required to connect with hosts and applications. The XCLI can be used in a shell environment to interactively configure th