
Introduction to Object Storage and Hitachi Content Platform

The Fundamentals of Hitachi Content Platform

WHITE PAPER

By Hitachi Data Systems

May 2013


Contents

Executive Summary
Introduction
Main Concepts and Features
Object-Based Storage
Object Structure
Distributed Design
Open Architecture
Multitenancy
Object Versioning
Spin-Down and Storage Tiering
Search
Replication
Common Use Cases
Fixed-Content Archiving
Backup-Free Data Protection and Content Preservation
Cloud-Enabled Storage
E-Discovery, Compliance and Metadata Analysis
System Fundamentals
Hardware Overview
Software Overview
System Organization
Namespaces and Tenants
Main Concepts
User and Group Accounts
System and Tenant Management
Policies
Content Management Services
Conclusion


Executive Summary

One of IT’s greatest challenges today is the explosive, uncontrolled growth of unstructured data. The continual growth of email, documents, video, Web pages, presentations, medical images and so forth increases both complexity and risk. This effect is seen particularly in distributed IT environments, such as cloud service providers and organizations with branch or remote office sites. The vast quantity of data being created, the difficulty of managing and properly handling unstructured content, and the complexity of supporting more users and applications all pose challenges to IT departments. Organizations often end up with sprawling storage silos for a multitude of applications and workloads, with few resources available to manage, govern, protect and search the data.

Hitachi Data Systems provides an alternative solution to these challenges: a single object storage platform that can be divided into virtual storage systems, each configured for the desired level of service. The great scale and rich features of this solution help IT organizations, in both private enterprises and cloud service providers, manage distributed IT environments, control the flood of storage requirements for unstructured content, and address a variety of workloads.


Introduction

Hitachi Content Platform (HCP) is a multipurpose distributed object-based storage system designed to support large-scale repositories of unstructured data. HCP enables IT organizations and cloud service providers to store, protect, preserve and retrieve unstructured content with a single storage platform. It supports multiple levels of service and readily evolves with technology and scale changes. With a vast array of data protection and content preservation technologies, the system can significantly reduce or even eliminate tape-based backups of the platform itself or of edge devices connected to it. HCP obviates the need for a siloed approach to storing unstructured content. Massive scale, multiple storage tiers, Hitachi reliability, nondisruptive hardware and software updates, multitenancy and configurable attributes for each tenant allow the platform to support a wide range of applications on a single physical HCP instance. By dividing the physical system into multiple, uniquely configured tenants, administrators create “virtual content platforms” that can be subdivided into namespaces for further organization of content, policies and access. With support for thousands of tenants, tens of thousands of namespaces, and petabytes of capacity in one system, HCP is truly cloud-ready.

Main Concepts and Features

Object-Based Storage

Hitachi Content Platform, as a general-purpose object store, allows unstructured data files to be stored as objects. An object is essentially a container that includes both file data and associated metadata that describes the data. The objects are stored in a repository. The metadata is used to define the structure and administration of the data. HCP can also leverage object metadata to apply specific management functions, such as storage tiering, to each object. The objects have intelligence that enables them to automatically take advantage of advanced storage and data management features to ensure proper placement and distribution of content.

HCP architecture isolates stored data from the hardware layer. Internally, ingested files are represented as objects that encapsulate both the data and metadata required to support applications. Externally, HCP presents each object either as a set of files in a standard directory structure or as a uniform resource locator (URL) accessible by users and applications via HTTP/HTTPS.

HCP stores objects in a repository. Data that is ingested and stored in the repository is permanently associated with the information about that data, called metadata. Each data object encapsulates both object data and metadata, and is treated within HCP as a single unit for all intents and purposes.

Object Structure

An HCP repository object is composed of file data and the associated metadata, which in turn consists of system metadata and, optionally, custom metadata and an access control list (ACL). The structure of the object is shown in Figure 1.

File data is an exact digital copy of the actual file contents at the time of its ingestion. If the object is under retention, it cannot be deleted before the expiration of its retention period, except when using a special privileged operation. If versioning is enabled, multiple versions of a file can be retained. If appendable objects are enabled, data can be appended to an object (with the CIFS or NFS protocols) without modifying the original fixed-content data.


Figure 1. HCP Object

Metadata is system- or user-generated data that describes the fixed-content data of an object and defines the object’s properties. System metadata, the system-managed properties of the object, includes HCP-specific metadata and POSIX metadata.

HCP-specific metadata includes the date and time the object was added to the namespace (ingest time), the date and time the object was last changed (change time), the cryptographic hash value of the object along with the namespace hash algorithm used to generate that value, and the protocol through which the object was ingested. It also includes the object’s policy settings, such as data protection level (DPL), retention, shredding, indexing, and, for HCP namespaces only, versioning.

POSIX metadata includes a user ID and group ID, a POSIX permissions value, and POSIX time attributes.

Custom metadata is optional, user-supplied descriptive information about a data object that is usually provided as well-formed XML. It is typically intended for more detailed description of the object. This metadata can also be used by future users and applications to understand and repurpose the object content. HCP supports multiple custom metadata fields for each object.

ACL is optional, user-provided metadata containing a set of permissions granted to users or user groups to perform operations on an object. The ACLs are supported only in HCP namespaces.
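To make the structure concrete, here is a minimal Python sketch of an object as just described: fixed-content data plus system metadata, an optional custom metadata document and an optional ACL. The field names, types and default values are illustrative assumptions drawn from this description, not HCP's internal schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional
import hashlib

@dataclass
class SystemMetadata:
    """System-managed properties (an illustrative subset)."""
    ingest_time: datetime       # when the object was added to the namespace
    change_time: datetime       # when the object was last changed
    hash_algorithm: str         # namespace hash algorithm, e.g. "SHA-256"
    content_hash: str           # cryptographic hash of the object data
    dpl: int                    # data protection level (copy count)
    retention: str              # retention setting
    shred: bool                 # shred on delete?
    index: bool                 # index custom metadata?

@dataclass
class HCPObject:
    """Object = fixed-content data + metadata, treated as a single unit."""
    data: bytes                              # exact copy of the file at ingest
    system_metadata: SystemMetadata
    custom_metadata: Optional[str] = None    # optional, well-formed XML
    acl: Optional[dict] = None               # optional per-user/group permissions

def ingest(data: bytes, custom_xml: Optional[str] = None) -> HCPObject:
    now = datetime.now(timezone.utc)
    return HCPObject(
        data=data,
        system_metadata=SystemMetadata(
            ingest_time=now, change_time=now,
            hash_algorithm="SHA-256",
            content_hash=hashlib.sha256(data).hexdigest(),
            dpl=2, retention="deletion allowed", shred=False, index=True,
        ),
        custom_metadata=custom_xml,
    )

# Example: a medical image with descriptive custom metadata in XML.
obj = ingest(b"...scan bytes...", "<scan><patient_id>12345</patient_id></scan>")
```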

The complete metadata structure, as supported in HCP namespaces, is shown in Figure 2. It includes all metafiles supported by HCP for objects, which were generated for the sample data structure (assuming that custom metadata and ACLs were added for each object).


Figure 2. HCP Namespace: Complete Metadata Structure

Distributed Design

An HCP system consists of both hardware and software and comprises many different components that are connected together to form a robust, scalable architecture for object-based storage. HCP runs on an array of servers, or nodes, that are networked together to form a single physical instance. Each node is a storage node that stores data objects. All runtime operations and physical storage, including data and metadata, are distributed among the storage nodes. All objects in the repository are distributed across all available storage space but are still presented as files in a standard directory structure. Objects that are physically stored on any particular node are available from all other nodes.

Open Architecture

HCP has an open architecture that insulates stored data from technology changes, as well as from changes in HCP itself due to product enhancements. This open architecture ensures that users will have access to the data long after it has been added to the repository. HCP acts as both a repository that can store customer data and an online portal that enables access to that data by means of several industry-standard interfaces, as well as through an integrated search facility, Hitachi Data Discovery Suite (HDDS). The HTTP or HTTPS, WebDAV, CIFS and NFS protocols support various operations. These operations include storing data, creating and viewing directories, viewing and retrieving objects and their metadata, modifying object metadata, and deleting objects. Objects that were added using any protocol are immediately accessible through any other supported protocol. These protocols can be used to access the data with a Web browser, the HCP client tools, 3rd-party applications, Microsoft® Windows® Explorer, or native Windows or UNIX tools.
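As a simple illustration of HTTP access, the sketch below stores a file and reads it back using Python's requests library. The hostname, path and credentials are placeholders, and HTTP basic authentication appears only as a stand-in; the exact URL syntax and authentication scheme depend on the HCP configuration and release.

```python
import requests

# Assumed namespace-qualified endpoint (see "Namespaces and Tenants" below)
# and placeholder credentials.
BASE = "https://ns1.tenant1.hcp.example.com/rest"
AUTH = ("svc_user", "secret")

# Store a file as an object at the given path.
with open("invoice.pdf", "rb") as f:
    requests.put(f"{BASE}/invoices/2013/invoice.pdf",
                 data=f, auth=AUTH).raise_for_status()

# The same object is immediately readable back over HTTP...
pdf_bytes = requests.get(f"{BASE}/invoices/2013/invoice.pdf", auth=AUTH).content

# ...and, because the protocols interoperate, it is also visible as the file
# invoices/2013/invoice.pdf through CIFS, NFS or WebDAV.
```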

HCP allows special-purpose access to the repository through the SMTP protocol, which is used only for storing email. For data backup and restore, HCP supports the NDMP protocol.


Multitenancy

Multitenancy support allows the repository in a single physical HCP instance to be partitioned into multiple namespaces. A namespace is a logical partition that contains a collection of objects particular to one or more applications. Each namespace is a private object store that is represented by a separate directory structure and has a set of independently configured attributes. Namespaces provide segregation of data, while tenants, or groupings of namespaces, provide segregation of management. An HCP system can have up to 1,000 tenants. Each tenant and its set of namespaces constitute a virtual HCP system that can be accessed and managed independently by users and applications. This HCP feature is essential in enterprise, cloud and service-provider environments.

Data access to HCP namespaces can be either authenticated or nonauthenticated, depending on the type and configuration of the access protocol. Authentication can be performed using HCP local accounts or Microsoft Active Directory® groups.

Object Versioning

HCP supports object versioning, which is the capability of a namespace to create, store and manage multiple versions of objects in the HCP repository. This ability provides a history of how the data has changed over time. Versioning facilitates storage and replication of evolving content, thereby creating new opportunities for HCP in markets such as content depots and workflow applications.

Versioning is available in HCP namespaces and is configured at the namespace level. Versioning is supported only with HTTP or REST; other protocols cannot be enabled if versioning is enabled for the namespace. Versioning applies only to objects, not to directories or symbolic links. A new version of an object is created when an object with the same name and location as an existing object is added to the namespace. A special type of version, called a deleted version, is created when an object is deleted. Updates to the object metadata affect only the current version of an object and do not create new versions.

Previous versions of objects that are older than a specified amount of time can be automatically deleted, or pruned. It is not possible to delete specific historical versions of an object; however, a user or application with appropriate permissions can purge the object to delete all its versions, including the current one.
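A brief sketch of this versioning behavior, under the same assumed endpoint form and placeholder credentials as the earlier example: with versioning enabled on the namespace, re-storing an object at the same name and location creates a new version instead of violating WORM semantics.

```python
import requests

BASE = "https://ns1.tenant1.hcp.example.com/rest"  # assumed endpoint form
AUTH = ("svc_user", "secret")                       # placeholder credentials
url = f"{BASE}/reports/q1.xlsx"

# Each PUT to the same name and location creates a new version.
requests.put(url, data=b"first draft", auth=AUTH).raise_for_status()
requests.put(url, data=b"revised draft", auth=AUTH).raise_for_status()

# A plain GET returns the current version.
current = requests.get(url, auth=AUTH).content

# DELETE creates a special "deleted version"; a purge, given appropriate
# permissions, would remove all versions, including the current one.
requests.delete(url, auth=AUTH).raise_for_status()
```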

Spin-Down and Storage Tiering

HCP implements spin-down disk support as an early step towards the long-term goal of supporting information life-cycle management (ILM) and intelligent objects. In the near term, the goal of the HCP spin-down feature is to take advantage of the energy savings potential of the spin-down technology.

HCP spindown-capable storage is based on the power savings feature of Hitachi midrange storage systems and is a core element of the new storage tiering functionality, which is implemented as an HCP service. According to the storage tiering strategy specified by customers, the storage tiering service identifies objects that are eligible to reside on spin-down storage and moves them to and from spin-down storage as needed.

Tiering selected content to spindown-enabled storage lowers overall cost by reducing energy consumption for large-scale unstructured data storage, such as deep archives and disaster recovery sites. Storage tiering can be used very effectively with customer-identified “dark data” (rarely accessed data) or with data replicated for disaster recovery, by moving that data to spin-down storage some time after ingestion or replication. Customer sites where data protection is critical can use storage tiering to move all redundant data copies to spin-down storage, which makes the cost of keeping data protection copies competitive with a tape solution.

Storage tiering also enables service providers to use a turnkey framework to offer differentiated object data management plans. This capability further enhances HCP as an attractive target for fixed content, especially for archive-oriented use cases where tape may be considered an alternative.


Search

HCP provides the only integrated metadata query engine on the market. HCP includes comprehensive search capabilities that enable users to search for objects in namespaces, analyze namespace contents, and manipulate groups of objects. To satisfy government requirements, HCP supports e-discovery for audits and litigation.

The metadata query engine is always available in any HCP system, but the content search facility requires installation of a separate HDS product, Hitachi Data Discovery Suite.

Replication

Replication, an add-on feature to HCP, is the process that keeps selected tenants and namespaces in 2 or more HCP systems in sync with each other. The replication service copies one or more tenants or namespaces from one HCP system to another, propagating object creations, object deletions, and metadata changes. HCP also replicates tenant and namespace configuration, tenant-level user accounts, compliance and tenant log messages, and retention classes.

The HCP system in which the objects are initially created is called the primary system. The 2nd system is called the replica. Typically, the primary system and the replica are in separate geographic locations and connected by a high-speed wide area network. HCP supports different replication topologies including many-to-one and chain configurations.

Common Use Cases

Fixed-Content Archiving

Hitachi Content Platform is optimized for fixed-content data archiving. Fixed-content data is information that does not change but must be kept available for future reference and be easily accessible when needed. A fixed-content storage system is one in which the data cannot be modified. HCP uses “write-once, read-many” (WORM) storage technology, and a variety of policies and services (such as retention, content verification and protection) to ensure the integrity of data in the repository. The WORM storage means that data, once ingested into the repository, cannot be updated or modified; that is, the data is guaranteed to remain unchanged from when it was originally stored. If the versioning feature is enabled within the HCP system, different versions of the data can be stored and retrieved, in which case each version is WORM.

Backup-Free Data Protection and Content Preservation

HCP is a true backup-free platform. HCP protects content without the need for backup. It uses sophisticated data preservation technologies, such as configurable data and metadata protection levels (MDPL), object versioning and change tracking, multisite replication with seamless application failover, and many others. HCP includes a variety of features designed to protect integrity, provide privacy, and ensure availability and security of stored data. Below is a summary of the key HCP data protection features:

■■ Content immutability. This intrinsic feature of HCP WORM storage design protects the integrity of the data in the repository.

■■ Content verification. The content verification service maintains data integrity and protects against data corruption or tampering by ensuring that the data of each object matches its cryptographic hash value. Any violation is repaired in a self-healing fashion.

■■ Scavenging. The scavenging service ensures that all objects in the repository have valid metadata. In case metadata is lost or corrupted, the service tries to reconstruct it by using the secondary, or scavenging, metadata (a copy of the metadata stored with each copy of the object data).


■■ Data encryption. HCP supports an encryption-at-rest capability that allows seamless encryption of data on the physical volumes of the repository. This ensures data privacy by preventing unauthorized access to the stored data. Encryption and decryption are handled automatically and transparently to users and applications.

■■ Versioning. HCP uses versioning to protect against accidental deletion and the storing of wrong copies of objects.

■■ Data availability.

■■ RAID protection. RAID storage technology provides efficient protection from simple disk failures. SAN-based HCP systems typically use RAID-6 erasure coding protection to guard against dual drive failures.

■■ Multipathing and zero-copy failover. These features provide data availability in SAN-attached array of independent nodes (SAIN) systems.

■■ Data protection level and protection service. In addition to using RAID and SAN technologies to provide data integrity and availability, HCP can use software mirroring to store the data for each object in multiple locations on different nodes. HCP groups storage nodes into protection sets with the same number of nodes in each set, and tries to store all the copies of the data for an object in a single protection set, where each copy is stored on a different node. The protection service enforces the required level of data redundancy by checking and repairing protection sets. In case of violation, it creates additional copies or deletes extra copies of an object to bring the object into compliance. If replication is enabled, the protection service can use an object copy from a replica system if the copy on the primary system is unavailable. (A sketch of this copy-count enforcement logic appears after this list.)

■■ Metadata redundancy. In addition to the data redundancy specified by DPL, HCP creates multiple copies of the metadata for an object on different nodes. The metadata protection level (MDPL) is a system-wide setting that specifies the number of copies of the metadata that the HCP system must maintain (normally 2 copies, MDPL2). Management of MDPL redundancy is independent of the management of data copies for DPL.

■■ Nondisruptive software and hardware upgrades. HCP employs a number of techniques that minimize or eliminate disruption of normal system functions during software and hardware upgrades. Nondisruptive software upgrade (NDSU) is one of these techniques; it includes greatly enhanced online upgrade support, nondisruptive patch management, and online upgrade performance improvements. HCP supports media-free and remote upgrades, HTTP or REST drain mode, and parallel operating system (OS) installation. It also supports automatic online upgrade commit, offline upgrade duration estimates, enhanced monitoring and email alerts, and other features. Storage nodes can be added to an HCP system without causing any downtime. HCP also supports nondisruptive storage upgrades that allow online storage addition to SAIN systems without any data outage.

■■ Seamless application failover. This feature is supported by HCP systems in a replicated topology. It includes a seamless failover routing feature that enables direct integration with customer-owned load balancers by allowing HTTP requests to be serviced by any HCP system in a replication topology. Seamless domain name system (DNS) failover is a built-in HCP multisite load-balancing and high-availability technology that is ideal for cost-efficient, best-effort customer environments.

■■ Replication. If enabled, this feature provides a multitude of mechanisms that ensure data availability. The replica system can be used both as a source for disaster recovery and to maintain data availability by providing good object copies for the protection and content verification services. If an object cannot be read from the primary system, HCP can try to read the object from the replica if the read-from-replica feature is enabled.

■■ Data security.

■■ Authentication of management and data access.


■■ Granular, multilayer data access permission scheme.

■■ IP filtering technology and protocol-specific allow or deny lists.

■■ Secure Sockets Layer (SSL) for HTTP or WebDAV data access, management access, and replication.

■■ Node login prevention.

■■ Shredding policy and service.

■■ Autonomic technology refresh. This feature, implemented as the HCP migration service, enables organizations to maintain continuously operating content stores, allowing them to preserve their digital content assets for the long term.
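Below is the copy-count enforcement sketch promised in the data protection level bullet above, in minimal Python. It is an illustration under stated assumptions only: the data structures, node selection and names are invented for clarity, and the real protection service additionally verifies copies against their hash values and can repair from a replica system.

```python
import random

def enforce_dpl(dpl: int, copies: set, protection_set: list) -> set:
    """Bring one object's copy count into DPL compliance (hypothetical).

    copies: nodes currently holding a copy of the object's data
    protection_set: the nodes in the object's protection set
    """
    if len(copies) < dpl:
        # Too few copies: replicate onto set members that lack one.
        candidates = [n for n in protection_set if n not in copies]
        copies |= set(random.sample(candidates, dpl - len(copies)))
    elif len(copies) > dpl:
        # Too many copies: delete extras to reclaim space.
        copies -= set(random.sample(sorted(copies), len(copies) - dpl))
    return copies

# Example: a DPL2 object that lost a copy to a node failure.
print(enforce_dpl(2, {"node2"}, ["node1", "node2", "node3", "node4"]))
```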

Cloud-Enabled Storage

The powerful, industry-leading capabilities of HCP make it well suited to the cloud storage space. An HCP-based infrastructure solution is sufficiently flexible to accommodate any cloud deployment models (public, private or hybrid) and simplify the migration to the cloud for both service providers and subscribers. HCP provides edge-to-core, secure multitenancy and robust management capabilities, and a host of features to optimize cloud storage operations.

HCP, in its role as an online data repository, is truly ready for a cloud-enabled market. While numerous HCP features were already discussed earlier in this paper, the purpose of this section is to summarize those that contribute the most to HCP cloud capabilities. They include:

■■ Large-scale multitenancy.

■■ Management segregation. HCP supports up to 1,000 tenants, each of which can be uniquely configured for use by a separate cloud service subscriber.

■■ Data segregation. HCP supports up to 10,000 namespaces, each of which can be uniquely configured for a particular application or workload.

■■ Massive scale.

■■ Petabyte-scale repository offering 40PB of storage, 80 nodes, 32 billion user objects, and 15 million files per directory, all on a single physical system.

■■ Best node density in the object storage industry: 500TB per node and 400+ million objects per node. With fewer nodes, HCP requires less power, less cooling and less floor space.

■■ Unparalleled expandability that allows organizations to “start small” and expand according to demand.

■■ Nodes and/or storage can be added to expand an HCP system’s storage and throughput capacity, without disruptions. Multiple storage systems are supported by a single HCP system.

■■ Easy tenant and storage provisioning.

■■ Geographical dispersal and global accessibility.

■■ WAN-friendly REST interface for namespace data access and replication.

■■ Replication of content across multiple sites using advanced, flexible replication topologies.

■■ WAN-optimized, high-throughput data transfer.


■■ High availability.

■■ Fully redundant hardware.

■■ Automatic routing of client requests around hardware failures.

■■ Load balancing across all available hardware.

■■ Multiple REST interfaces. These interfaces include the REST API for namespace data access, the management API, and the metadata query API. REST is a technology of choice for cloud enablers and consumers. Some of the reasons for its popularity include high efficiency and low overhead, caching at both the client and the server, and API uniformity. In addition, its stateless nature accommodates the latencies of Internet access and potentially complex firewall configurations.

■■ Secure, granular access to tenants, namespaces and objects, which is crucial in any cloud environment. This access is facilitated by the HCP multilayer, flexible permission mechanism, including object-level ACLs.

■■ Usage metering. HCP has built-in chargeback capabilities, indispensable for cloud use, to facilitate provider and subscriber transactions. HCP also provides tools for 3rd-party vendors and customers to write to the API for easy integration with the HDS solution for billing and reporting.

■■ Low-touch system that is self-monitoring, self-managing and self-healing. HCP features advanced monitoring, audit and reporting capabilities. HCP services can automatically repair issues if they arise.

■■ Support for multiple levels of service. This support is provided through HCP policies, service plans and quotas that can be configured for each tenant to help enforce service-level agreements (SLAs). It allows the platform to accommodate a wide range of subscriber use cases and business models on a single physical system.

■■ Edge-to-core solution. HCP, working in tandem with Hitachi Data Ingestor (HDI), provides an integrated edge-to-core solution for cloud storage deployments. HCP serves as the “engine” at the core of the HDS cloud architecture. HDI resides at the edge of the storage cloud (for instance, at a remote office or subscriber site) and serves as the “on-ramp” for application data to enter the cloud infrastructure. HDI acts as a local storage cache while migrating data into HCP and maintaining links to stored content for later retrieval. Users and applications interact with HDI at the edge of the cloud but perceive bottomless, backup-free storage provided by HCP at the core.

E-Discovery, Compliance and Metadata Analysis

Custom metadata enables organizations to build massive unstructured data stores: it provides the means for faster, more accurate access to content and gives storage managers the meaningful information they need to process data efficiently and intelligently and to apply the right object policies to meet business, compliance and protection requirements. Regulatory compliance features include namespace retention mode (compliance and enterprise), retention classes, retention hold, automated content disposition, and privileged delete and purge. HCP search capabilities include support for e-discovery for litigation or audit purposes. On HCP, open APIs allow direct 3rd-party integration.

HCP supports search facilities that provide an interactive interface. The search console offers a structured environment for creating and executing queries (sets of criteria that each object in the search results must satisfy). Users can apply various selection criteria, such as objects stored before a certain date or larger than a specified size. Queries return metadata for objects included in the search result. This metadata can be used to retrieve the object. From the search console, users can open objects, perform bulk operations on objects (hold, release, delete, purge, privileged delete and purge, change owner, set ACL), and export search results in standard file formats for use as input to other applications.


The metadata query engine (MQE) is integrated with HCP and is always available in the HCP system. It is also used by the metadata query API, a programmatic interface for querying namespaces. The MQE index resides on designated logical volumes on the HCP storage nodes and may or may not share space on those volumes with object data, depending on the type of system and volume configuration.

Search is enabled at both the tenant and namespace levels. Indexing is enabled on a per-namespace basis. Settings at the system and namespace levels determine whether custom metadata is indexed in addition to system metadata and ACLs. If indexing of custom metadata is disabled, the MQE indexes do not include custom metadata. If a namespace is not indexed at all, searches do not return any results for objects in that namespace.

Each object has an index setting that determines what content the metadata query engine indexes for that object. If indexing is enabled for a namespace, MQE always indexes system metadata and ACLs regardless of the index setting for an object. If the index setting is true, MQE also indexes the object’s custom metadata.
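As an illustration of programmatic search, the sketch below issues a metadata query over HTTP. The endpoint, request shape and field names are assumptions made for this example; the authoritative syntax is defined by the HCP metadata query API documentation.

```python
import requests

# Assumed tenant-level query endpoint and placeholder credentials.
QUERY_URL = "https://tenant1.hcp.example.com/query"
AUTH = ("svc_user", "secret")

# Illustrative criteria: objects in one namespace ingested before a cutoff.
criteria = {"query": "namespace:ns1 AND ingestTime:[0 TO 1367366400]",
            "maxResults": 100}
r = requests.post(QUERY_URL, json=criteria, auth=AUTH,
                  headers={"Accept": "application/json"})
r.raise_for_status()

# Queries return object metadata; each hit's URL can then be used
# to retrieve the object itself.
for hit in r.json().get("results", []):
    print(hit.get("urlName"), hit.get("ingestTime"))
```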

System Fundamentals

Hardware Overview

An individual physical HCP instance, or HCP system, is not a single device; it is a collection of devices that, combined with HCP software, can provide all the features of an online object repository while tolerating node, disk and other component failures.

From a hardware perspective, each HCP system consists of the following categories of components:

■■ Nodes (servers).

■■ Internal or SAN-attached storage.

■■ Networking components (switches and cabling).

■■ Infrastructure components (racks and power distribution units).

Storage nodes are the vital part of HCP. They store and manage the objects that reside in the physical system storage. The nodes are conventional off-the-shelf servers. Each node can have multiple internal physical drives and/or connect to external Fibre Channel storage (SAN). In addition to using RAID and SAN technologies and a host of other features to protect the data, HCP uses software mirroring to store the data and metadata for each object in multiple locations on different nodes. For data, this feature is managed by the namespace DPL setting, which specifies the number of copies of each object HCP must maintain in the repository to ensure the required level of data protection. For metadata, this feature is managed by the MDPL, which is a system-wide setting.

A storage node runs the complete HCP software and serves as both a repository for objects and a gateway to the data and metadata they contain. All runtime operations are distributed among the storage nodes, ensuring reliability and performance.

HCP runs on a redundant array of independent nodes (RAIN) or a SAN-attached array of independent nodes (SAIN). RAIN systems use the internal storage in each node. SAIN systems use the external SAN storage. HCP is offered as 2 products: HCP 300 (based on RAIN configuration) and HCP 500 (based on SAIN configuration).

HCP RAIN (HCP 300)

The nodes in an HCP 300 system are Hitachi Compute Rack 220 (CR 220) servers. RAIN nodes contain internal storage: RAID controller and disks. All nodes use hardware RAID-5 data protection. In an HCP RAIN system, the physical disks in each node form a single RAID group, normally RAID-5 (5D+1P) (see Figure 3). This helps ensure the integrity of the data stored on each node.


An HCP 300 (RAIN) system must have a minimum of 4 storage nodes. Additional storage nodes are added in 4-node increments. An HCP 300 system can have a maximum of 20 nodes.

HCP 300 systems are normally configured with a DPL setting of 2 (DPL2), which, coupled with hardware RAID-5, yields an effective RAID-5+1 total protection level.

Figure 3. HCP 300 Hardware Architecture

HCP SAIN (HCP 500/500XL)

The nodes in an HCP 500 system are either Hitachi Compute Rack 220 (CR 220) servers or blades in Hitachi Compute Blade 320 (CB 320) servers. The HCP 500 nodes contain Fibre Channel host bus adapters (HBAs) and use external Fibre Channel SAN storage; they are diskless servers that boot from the SAN-attached storage.

The nodes in a SAIN system can have internal storage in addition to being connected to external storage. These nodes are called HCP 500XL nodes. They are an alternative to the standard HCP 500 nodes and have the same hardware configuration, except for the addition of a RAID controller and internal hard disk drives. In HCP 500XL nodes, the system metadata database resides on the local disks, which leads to more efficient and faster database operations. As a result, the system can better support larger capacity and higher object counts per node, and can address higher performance requirements.

A typical 500XL node internal storage configuration includes six 500GB 7200RPM SATA II drives in a single RAID-5 (5D+1P) RAID group, with 2 LUNs: 31GB (operating system) and 2.24TB (database). The HCP 500XL nodes are usually considered when the system configuration exceeds 4 standard nodes.


HCP 500 and 500XL (SAIN) systems are supported with a minimum of 4 storage nodes. With a SAIN system, additional storage nodes are added in pairs, so the system always has an even number of storage nodes. A SAIN system can have a maximum of 80 nodes.

Both RAIN and SAIN systems can have a DPL as high as 4, which affords maximum data availability but greatly sacrifices storage utilization. Typically, the external SAN-attached storage uses RAID-6. Best protection and high availability of an HCP 500 system is achieved by giving each node its own RAID group or Hitachi Dynamic Provisioning (HDP) pool containing 1 RAID group.

Software Overview

HCP system software consists of an operating system (the appliance operating system) and core software. The core software includes components that:

■■ Enable access to the object repository through the industry-standard HTTP or HTTPS, WebDAV, CIFS, NFS, SMTP and NDMP protocols.

■■ Ingest fixed-content data, convert it into HCP objects, and manage the objects’ data and metadata over time.

■■ Maintain the integrity, stability, availability and security of stored data by enforcing repository policies and executing system services.

■■ Enable configuration, monitoring and management of the HCP system through a human-readable interface.

■■ Support searching the repository through an interactive Web interface (the search console) and a programmatic interface (the metadata query API).

System Organization

HCP is a fully symmetric, distributed application that stores and manages objects (see Figure 4). An HCP object encapsulates the raw fixed-content data that is written by a client application, and its associated system and custom metadata. Each node in an HCP system is a Linux-based server that runs a complete HCP instance. The HCP system can withstand multiple simultaneous node failures, and acts automatically to ensure that all object and namespace policies are valid.


Figure 4. The High-Level Structure of an HCP System

External system communication is managed by the DNS manager, a distributed network component that balances client requests across all nodes to ensure maximum system throughput and availability. The DNS manager works in conjunction with a corporate DNS server to allow clients to access the system as a single entity, even though the system is made up of multiple independent nodes.

The HCP system is configured as a subdomain of an existing corporate domain. Clients access the system using predefined protocol-specific or namespace-specific names.

While not required, using DNS is important in ensuring balanced and problem-free client access to an HCP system, especially for the HTTP or REST clients.
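The short sketch below illustrates the kind of namespace-specific naming this implies, with each namespace resolving as a subdomain of its tenant inside the HCP subdomain of the corporate domain. The hostname scheme shown is an assumption for illustration; the actual names are defined when the system is configured in DNS.

```python
# Assumed convention: namespace.tenant.<hcp-subdomain>.<corporate-domain>
CORPORATE_DOMAIN = "example.com"
HCP_SUBDOMAIN = "hcp"   # the HCP system's delegated subdomain

def namespace_url(tenant: str, namespace: str, path: str) -> str:
    """Build a namespace-qualified HTTPS URL for an object path."""
    host = f"{namespace}.{tenant}.{HCP_SUBDOMAIN}.{CORPORATE_DOMAIN}"
    return f"https://{host}/rest/{path.lstrip('/')}"

print(namespace_url("finance", "invoices", "/2013/invoice.pdf"))
# -> https://invoices.finance.hcp.example.com/rest/2013/invoice.pdf
```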

Namespaces and Tenants

Main Concepts

An HCP repository is partitioned into namespaces. A namespace is a logical repository as viewed by an application. Each namespace consists of a distinct logical grouping of objects with its own directory structure, such that the objects in one namespace are not visible in any other namespace. Access to one namespace does not grant a user access to any other namespace. To the user of a namespace, the namespace is the repository. Namespaces are not associated with any preallocated storage; they share the same underlying physical storage. Namespaces provide a mechanism for separating the data stored for different applications, business units or customers. For example, there may be one namespace for accounts receivable and another for accounts payable. While a single namespace can host one or more applications, it typically hosts only one application. Namespaces also enable operations to work against selected subsets of repository objects. For example, a search could target the accounts receivable and accounts payable namespaces but not the employees namespace.

Figure 5 shows the logical structure of an HCP system with respect to its multitenancy features.

Figure 5. HCP System Logical Layout: Namespaces and Tenants

Namespaces are owned and managed by tenants. Tenants are administrative entities that provide segregation of management, while namespaces offer segregation of data. A tenant typically represents an actual organization, such as a company or a department within a company, that uses a portion of a repository. A tenant can also correspond to an individual person. Namespace administration is done at the owning tenant level.

Clients can access HCP namespaces through HTTP or HTTPS, WebDAV, CIFS, NFS and SMTP protocols. These protocols can support authenticated and/or anonymous types of access (types of access and their combinations are discussed in more detail later in this document). HCP namespaces are owned by HCP tenants. An HCP system can have multiple HCP tenants, each of which can own multiple namespaces. The number of namespaces each HCP tenant can own can be limited by an administrator.

User and Group Accounts

User and group accounts control access to various HCP interfaces and give users permission to perform administrative tasks and access namespace content.

An HCP user account is defined in HCP; it has a set of credentials (username and password) that is stored locally in the system. The HCP system uses these credentials to authenticate a user, performing local authentication.

An HCP group account is a representation of an Active Directory (AD) group. To create group accounts, HCP must be configured to support Active Directory. A group account enables AD users in the corresponding AD group to access one or more HCP interfaces.

Like HCP user accounts, HCP group accounts are defined separately at the system and tenant levels. Different tenants have different user and group accounts. These accounts cannot be shared across tenants. Group membership is different at the system and tenant levels.

HCP administrative roles can be associated with both system-level and tenant-level user and group accounts. Data access permissions can be associated with only tenant-level user and group accounts. Consequently, system-level local and AD users can only be administrative users, while tenant-level local and AD users can both be administrative users and have data access permissions. Tenant-level users can have only administrative roles, only namespace data permissions, or any combination of administrative roles and namespace data permissions.

System and Tenant Management

The implementation of segregation of management in the HCP system is illustrated in Figure 6.

An HCP system has both system-level and tenant-level administrators:

■■ System-level administrative accounts are used for configuring system-wide features, monitoring system hardware and software and overall repository usage, and managing system-level users. The system administrator user interface, the system management console, provides the functionality needed by the maintainer of the physical HCP system. For example, it allows the maintainer to shut down the system, see information about nodes, manage policies and services, and create HCP tenants. System administrators have a view of the system as a whole, including all of the HCP software and hardware that make up the system, and can perform all of the administration for actions that have system scope.

■■ Tenant-level administrative accounts are used for creating HCP namespaces. They can configure individual tenants and namespaces, monitor namespace usage at the tenant and namespace level, manage tenant-level users, and control access to namespaces. The required functionality is provided by the tenant administrator user interface, the tenant management console. This interface is intended for use by the maintainer of the virtual HCP system (an individual tenant with the set of namespaces it owns). The tenant-level administration feature facilitates segregation of management, which is essential in cloud environments.


An HCP tenant can optionally grant system-level users administrative access to itself. In this case, system-level users with the monitor, administrator, security or compliance role can log into the tenant management console or use the HCP management API for that tenant. System-level users with the monitor or administrator role can also access the tenant management console directly from the system management console. This effectively enables a system administrator to function as a tenant administrator, as shown in Figure 6. System-level users can perform all the activities allowed by the tenant-level roles that correspond to their system-level roles. An AD user may belong to AD groups for which the corresponding HCP group accounts exist at both the system and tenant levels. This user has the roles associated with both the applicable system-level group accounts and the applicable tenant-level group accounts.

Policies

Objects in a namespace have a variety of properties, such as the retention setting or index setting. These properties are defined for each object by the object system metadata. Objects can also be affected by some namespace properties, such as the default metadata settings that are inherited by new objects stored in the namespace, or the versioning setting. Both the namespace-level settings and the properties that are part of the object metadata serve as parameters for the HCP system’s transactions and services, and determine the object’s behavior during its life cycle within the repository. These settings are called policies.

An HCP policy is one or more settings that influence how transactions and internal processes (services) affect objects in a namespace. Policies ensure that objects behave in expected ways.

The HCP policies are described in Table 1.

Table 1. Hitachi Content Platform Policies

Policy Name | Policy Description and Components | Transactions and Services Influenced
DPL | System DPL setting, namespace DPL setting | Object creation; protection service
Retention | Default retention setting, object retention setting, hold setting, system metadata and custom metadata options for objects under retention | Object creation, object deletion, system and custom metadata handling; disposition and garbage collection services
Shredding | Default shred setting, object shred setting | Object deletion; shredding service
Indexing | Default index setting, object index setting | MQE
Versioning | Versioning setting, pruning setting | Object creation and deletion; garbage collection service
Custom Metadata Validation | XML syntax validation | Add/replace custom metadata operations

Each policy may consist of one or more settings that may have different scopes of application and methods of configuration. Policy settings are defined at the object and the namespace level. Note that the same policy setting may be set at different levels depending on the namespace. The default retention, shred and index settings are set at the namespace level in HCP namespaces.


Table 2 lists all policy settings sorted according to their scope and method of configuration.

Table 2. Hitachi Content Platform Policy Settings: Scope and Configuration (shown for HCP namespaces)

Policy | Policy Setting | Scope/Level | Configured Via
Data Protection Level | System DPL: 1-4 | System | System UI
Data Protection Level | Namespace DPL: 1-4, dynamic | Namespace | Tenant UI, MAPI
Retention | Default retention setting: fixed date, offset, special value, retention class | Namespace | Tenant UI, MAPI
Retention | Retention setting: fixed date, offset, special value, retention class | Object | REST API, retention.txt
Retention | Hold setting: true or false | Object | REST API
Retention | Ownership and POSIX permission changes under retention: true or false | Namespace | Tenant UI, MAPI
Retention | Custom metadata operations allowed under retention | Namespace | Tenant UI, MAPI
Indexing | Index setting: true or false (1/0) | Object | REST API, index.txt
Indexing | Default index setting: true or false | Namespace | Tenant UI, MAPI
Shredding | Shred setting: true or false (1/0) | Object | REST API, shred.txt
Shredding | Default shred setting: true or false | Namespace | Tenant UI, MAPI
Custom Metadata Validation | XML validation: true or false | Namespace | Tenant UI, MAPI
Versioning | Versioning setting: true or false | Namespace | Tenant UI, MAPI
Versioning | Pruning setting: true/false and number of days for primary or replica | Namespace | Tenant UI, MAPI
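To make the scope column of Table 2 concrete, the sketch below contrasts the two configuration paths: object-level settings supplied with a REST request, and namespace-level defaults set once by an administrator through the tenant management console or management API (MAPI). The parameter names and values are illustrative stand-ins, not the documented syntax.

```python
import requests

BASE = "https://ns1.tenant1.hcp.example.com/rest"  # assumed endpoint form
AUTH = ("svc_user", "secret")                       # placeholder credentials

# Object scope: per-object policy settings ride along with the request.
# The query parameter names here are illustrative, not HCP's documented ones.
with open("msa.pdf", "rb") as f:
    requests.put(f"{BASE}/contracts/msa.pdf", data=f, auth=AUTH,
                 params={"retention": "2033-05-01",
                         "index": "true",
                         "shred": "false"}).raise_for_status()

# Namespace scope: defaults such as default retention, default shred,
# versioning and XML validation are not sent per request; an administrator
# configures them through the tenant UI or MAPI.
```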

Content Management Services

A Hitachi Content Platform service is a background process that performs a specific function that is targeted at preserving and improving the overall health of the HCP system. In particular, services are responsible for optimizing the use of system resources and maintaining the integrity and availability of the data stored in the HCP repository.

HCP implements 12 services: protection, content verification, scavenging, garbage collection, duplicate elimination, shredding, disposition, compression, capacity balancing, storage tiering, migration and replication.

HCP services are briefly described in Table 3.


Table 3. Hitachi Content Platform Services

Protection: Enforces DPL policy compliance by ensuring that the proper number of copies of each object exists in the system, and that damaged or lost objects can be recovered. Any policy violation invokes a repair process. Offers both scheduled and event-driven service. Events trigger a full service run, even if the service is disabled, after a configurable amount of time: 90 minutes after node shutdown; 1 minute after logical volume failure; 10 minutes after node removal.

Content Verification: Guarantees data integrity of repository objects by ensuring that the content of a file matches its digital signature. Repairs the object if the hash does not match. Detects and repairs discrepancies between primary and secondary metadata. The SHA-256 hash algorithm is used by default. Checksums are computed on external and internal files. Computationally intensive and time-consuming; runs according to the active service schedule.

Scavenging: Ensures that all objects in the repository have valid metadata, and reconstructs metadata when it is lost or corrupted but the data files exist. The service verifies that both the primary metadata for each data object and the copies of the metadata stored with the object data (secondary metadata) are complete, valid and in sync with each other. Computationally intensive and time-consuming. Scheduled service.

Garbage Collection: Reclaims storage space by purging hidden data and metadata for objects marked for deletion or left behind by incomplete transactions. It also deletes old versions of objects that are eligible for pruning. When applicable, the deletion triggers the shredding service. Scheduled service, not event driven.

Duplicate Elimination: Identifies and eliminates redundant objects in the repository, merging duplicate data to free space. The hash signature of external file representations is used to select objects as input to the service. These objects are then compared byte for byte to ensure that the data contents are indeed identical. Scheduled service.

Shredding: Overwrites the storage locations where copies of a deleted object were stored so that none of its data or metadata can be reconstructed, for security reasons. Also called secure deletion. The default HCP shredding algorithm uses 3 passes to overwrite an object and complies with the DoD 5220.22-M standard. The algorithm is selected at install time. Event-driven only, not scheduled; triggered by the deletion of an object marked for shredding.

Disposition: Automatically cleans up expired objects. All HCP namespaces can be configured to automatically delete objects after their retention period expires. Can be enabled or disabled at both the system and namespace level; enabling disposition for a namespace has no effect if the service is disabled at the system level. The disposition service deletes only the current versions of versioned objects. Scheduled service.

Compression: Compresses object data to make more efficient use of system storage space. The space reclaimed by compression can be used for additional storage. A number of configurable parameters are provided via the system management console. Scheduled service.

Capacity Balancing: Attempts to keep usable storage capacity balanced (roughly equivalent) across all storage nodes in the system. If storage utilization differs by a wide margin between nodes, the service moves objects to bring the nodes closer to a balanced state. Runs only when started manually; additions and deletions of objects do not trigger it. Typically, an authorized HCP service provider starts this service after adding new storage nodes to the system. In addition, while not part of the service, during normal system operation new objects tend to spread naturally among all storage nodes in fairly even proportion, due to the nature of the storage manager selection algorithm and the resource monitoring of the administrative engine.

Storage Tiering: Determines which storage tiering strategy applies to an object, evaluates where the copies of the object should reside based on the rules in the applied service plan, and moves objects between running and spin-down storage as needed. Active only in spindown-capable HCP SAIN systems. Scheduled service.

Conclusion

Hitachi Data Systems object storage solutions avoid the limitations of traditional file systems by intelligently storing content in far larger quantities and in a much more efficient manner. These solutions provide for the new demands imposed by the explosion of unstructured data and its growing importance to organizations, their partners, their customers, their governments and their shareholders.


The Hitachi Data Systems object storage solutions treat file data, file metadata and custom metadata as a single object that is tracked and stored among a variety of storage tiers. With secure multitenancy and configurable attributes for each logical partition, the object store can be divided into a number of smaller virtual object stores that present configurable attributes to support different service levels. This allows the object store to support a wide range of workloads, such as content preservation, data protection, content distribution and even cloud, from a single physical infrastructure. One infrastructure is far easier to manage than disparate silos of technology for each application or set of users. By integrating many key technologies in a single storage platform, Hitachi Data Systems object storage solutions provide a path to short-term return on investment and significant long-term efficiency improvements. They help IT evolve to meet new challenges, stay agile over the long term and address future change and growth.


© Hitachi Data Systems Corporation 2013. All rights reserved. HITACHI is a trademark or registered trademark of Hitachi, Ltd. Microsoft, Windows and Active Directory are trademarks or registered trademarks of Microsoft Corporation. All other trademarks, service marks, and company names are properties of their respective owners.

Notice: This document is for informational purposes only, and does not set forth any warranty, expressed or implied, concerning any equipment or service offered or to be offered by Hitachi Data Systems Corporation.

WP-425-B DG May 2013

Corporate Headquarters
2845 Lafayette Street
Santa Clara, CA 95050-2639 USA
www.HDS.com

Regional Contact Information
Americas: +1 408 970 1000 or [email protected]
Europe, Middle East and Africa: +44 (0) 1753 618000 or [email protected]
Asia Pacific: +852 3189 7900 or [email protected]