EV 2007 - Open Storage Layer


Symantec

Enterprise Vault

TECHNICAL WHITE PAPER

A Technical Overview of the Enterprise Vault™ 2007 Storage Layer

December 2007


TABLE OF CONTENTS

Introduction
  Purpose of This Whitepaper
  Target Audience
Virtualization of the storage layer—the Open Storage Layer
The “logical” Storage concept
  The highest level of abstraction: Vault Store
  Defining Storage devices and locations: Vault Store Partitions
  Logically separating a Vault Store: Archives
  What is SIS?
  Retention management
The “physical” Storage architecture
  Inside a Vault Store Partition
  Closing of Vault Store Partitions — Improving Backups
  What is a DVS file?
  DVS files for open standards
  Using a flat file system
  Collection services — a more intelligent flat file system
  Migration service — migration as a central activity
  Summary of the interaction of Enterprise Vault with an NTFS file system
Storage hardware and systems
  Storing Enterprise Vault archives on Network Attached Storage
    NetApp storage systems
    Hitachi Content Archive Platform (HCAP)
  Systems integrating with the Enterprise Vault Migrator
    Pegasus Disk Technologies – InveStore
    Fujitsu Eternus
    Veritas NetBackup
    Integration with Tivoli Storage Manager / IBM DR 550
  Content Addressed Storage and other API-based storage
    EMC Centera
    Read-only storage and retention management in Centera
    Single Instance Storage and collections with Centera
    Centera summary
  Other supported storage systems
Summary of the Open Storage Layer of Enterprise Vault 2007


Introduction

Purpose of This Whitepaper

Customers implement archiving solutions to reduce the cost of storage for primary applications such as email or file shares while allowing for long-term retention of the large amounts of business-critical information produced by these systems. As this paper will show, Enterprise Vault is architected to meet these two objectives and specifically provides:

Reduced overall cost of storage

– Storage tiering: Keeping inactive or noncritical data on less expensive storage media
– Storage rationalization: Minimizing content duplication and size
– Backup optimization: Shrinking overall backup and disaster recovery (DR) time

Long-term data retention

– Data integrity: Ensuring that data is captured accurately and reliably
– Data resilience: Maintaining data across document formats and storage platforms
– Data fidelity: Optionally leveraging specialist storage to prevent data tampering

The intent of this paper is to show how, through its interaction with storage systems, Enterprise Vault achieves the above goals. In addition, we will show you how architectural decisions we have made in building Enterprise Vault will enable the solution to continue to meet customer needs for the foreseeable future.

Target Audience

The primary target audience for this white paper is Enterprise Vault partners and customers who are looking for an introduction to the concepts that Enterprise Vault uses for storing archived items, indices, and metadata.


Virtualization of the storage layer—the Open Storage Layer

The Open Storage Layer (OSL) is the virtual layer that contains all functions that affect and control the way Enterprise Vault interacts with physical storage systems and devices. End users need not be aware that archiving is taking place, especially that an item has moved from the primary application to the archive. In addition, they should not know that items may subsequently move to other storage systems as they age, in a process called migration (storage tiering, or Information Lifecycle Management). Furthermore, most users will not be interested in the particular characteristics of a storage system (e.g., basic NTFS versus a WORM media store). The Open Storage Layer allows Enterprise Vault to virtualize the underlying storage, so that users of the archive are not aware of the storage system they are using today, and more importantly, a new storage system can be introduced to the archive at any time. At a very high level, the benefits that the OSL, and hence Enterprise Vault, gives to the customer are:

– Storage tiering: Automated migration of data (by policy) from the primary tier of the archive to a secondary tier—whether that second tier is disk, tape, or other media

– Storage rationalization: Compression of archived data and Single Instance Storage to remove duplicate copies of archived data

– Backup optimization: Reduction of primary data to back up as well as an efficient format (leveraging archive container files and/or partitions) to reduce the amount of archived data to back up

– Data integrity: A “safety copy” feature that optionally ensures that archived items are backed up or replicated before they are removed from the primary server, to reduce the risk of data loss

– Data resilience: The ability to move to new storage platforms in the future and an archive file format that inherently preserves a “future-proofed” copy of every archived document (in HTML, or alternatively, an XML rendition), while also retaining the original in a transparent format that can be viewed independent of Enterprise Vault

– Data fidelity: An approach that prevents data/metadata elements from being lost during archiving, and integration with underlying WORM or WORM-like technologies that ensure that data is retained for the desired period and is not tampered with during that period

Figure 1 shows the OSL and its components. Each of these virtual and physical components will be explained during the course of this paper, and this diagram will serve as a convenient way to highlight all of the value points of the Enterprise Vault solution’s advanced interaction with storage systems.


Figure 1. The Open Storage Layer contains all the functions that determine how Enterprise Vault interacts with storage systems and devices

Storage for archiving requires a longer-term view

One important consideration when examining the correct “type” of storage in which to house an archive is the length of time that items reside in the archive. Typically, IT purchase decision cycles are based on the accounting write-off period (the interval after which, in accounting terms, the item no longer has any value); a typical figure for this is three years. Companies are finding that they often need, or are even obliged by law, to retain content for longer than this write-off period. An average figure is often seven years. This means that, on average, we would expect any single item in the archive to “live” on more than two storage systems during its life span. In an increasing number of cases, much longer retention periods are required, so the ability of an archive system to evolve painlessly through multiple generations of storage is mandatory rather than a luxury. To this end, migrating between storage systems is a key facility of the Open Storage Layer and allows the archive to consume new storage and new storage systems over time.
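A quick back-of-the-envelope check of these figures, using the paper’s illustrative three-year write-off and seven-year retention numbers:

```python
import math

# Illustrative figures from the discussion above: hardware is written off
# after ~3 years, while content must be retained for ~7 years.
WRITE_OFF_YEARS = 3
RETENTION_YEARS = 7

# Number of storage generations an average item lives through.
generations = math.ceil(RETENTION_YEARS / WRITE_OFF_YEARS)
print(generations)  # 3 -- i.e., more than two storage systems per item
```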


The key elements of Enterprise Vault

Before we take a look at the interaction of Enterprise Vault and storage systems, we should have an overview of what elements are involved in an Enterprise Vault installation and what content is stored in an archive. Figure 2 shows that there are three main content sources that combine to make Enterprise Vault.

Figure 2 - The three main content sources in Enterprise Vault

• The structured Metabase: Enterprise Vault has a dependency on Microsoft® SQL Server to maintain a database of structured information about archived items. This information could be the current location of the item or extra meta tags describing the item. Use of the SQL Server database also means that several Enterprise Vault servers can share information about the same item quickly and easily. Archived items are not stored physically in the Microsoft SQL Server database.

• Full text index: Every item that is stored in Enterprise Vault will undergo full text indexing so users can perform rapid keyword searches. This is not performed by the Microsoft SQL Server database but by the AltaVista index engine. The index services store their items in flat file structures on disk.

• Physical storage: The full text search and the Microsoft SQL Server database both aid management of the physical storage. It is this area of Enterprise Vault that this paper will discuss. Individual items are added to the archive as DVS files and later may be combined into larger collections. Migration, rationalization of content, and partitioning are all actions that may happen against the stored items and will be explained in this paper.


The “logical” Storage concept

The highest level of abstraction: Vault Store

Imagine two storage administrators discussing their daily problems in the company’s cafeteria:

– “I have real trouble tracking the information on all our different storage devices. Wouldn’t it be great to have a database that tracks data across all our file systems, tapes, and WORM storage devices? If I move a file to some other system, a simple database query would still get me to the information.”

– “That’s a great idea! But the database probably shouldn’t track filenames; it should use a hashing checksum so that a rename of a file does not break the database.”

– “Indeed! Using a hash also allows us to check for identical files, so that we can reduce the number of duplicate copies in the system.”

– “That’s going to save us a lot of storage. But if we also track the age of the information and expire content, we can make sure that we keep only the data we need to retain.”

– “Hey, hang on. If we also check who ‘owns’ which objects, we could manage permissions from a central point, without the need to drill down into all file systems and manually check the permissions.”

In fact, those two storage experts have just described the Enterprise Vault “Vault Store”. It is essentially a SQL database that stores the following information about any item stored in Enterprise Vault:

– Current storage location
– Hash code (checksum)
– Number of sharers (Single Instance Storage)
– Archived and modified date
– Retention category
– Permissions
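The two administrators have, in effect, described a hash-keyed item directory. A minimal in-memory sketch of that idea follows; the field names, SHA-256 choice, and paths are illustrative assumptions, not the actual Vault Store schema:

```python
import hashlib

# Hypothetical model of the per-item records a hash-keyed item directory
# tracks; field names are illustrative, not Enterprise Vault's schema.
vault_store = {}  # content hash -> item record

def archive(content: bytes, owner: str, location: str, retention: str):
    digest = hashlib.sha256(content).hexdigest()
    record = vault_store.get(digest)
    if record is None:
        # First copy: create the tracking record for this content.
        vault_store[digest] = {
            "location": location,
            "sharers": [owner],          # Single Instance Storage
            "retention_category": retention,
        }
    else:
        # Duplicate content: just register another sharer.
        record["sharers"].append(owner)
    return digest

h1 = archive(b"10 MB attachment", "alice", r"\\nas\part1", "7-years")
h2 = archive(b"10 MB attachment", "bob", r"\\nas\part1", "7-years")
print(h1 == h2, len(vault_store))  # True 1 -- one stored copy, two sharers
```

Because lookups are keyed by the content hash rather than the filename, renaming or relocating a file changes only the `location` field, never the key.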

Note that although it is called a Vault Store, it does not refer to a storage device itself. At this level, we only track in which of the “known” storage locations the item is residing at the moment. You can actually see each Vault Store in the Vault Administration Console (VAC). It is the top-level container referring to the storage sub-system within the Enterprise Vault product.


Figure 3 - The Vault Administration Console exposes the content for management via the Vault Store, a collection of Vault Store partitions physically on disk.

The main benefit of the Vault Store is efficient management of the archive’s storage. It allows the administrator to abstract the physical storage of archived items—multiple devices, locations, and tiers—by tracking the location of items inside a SQL database. For a user accessing the data, it is completely transparent whether the item is stored on CAS, NAS, or DAS, whether it is still stored as a single file or in a container, or whether it has been migrated to another storage tier – Enterprise Vault will look up the location of the requested item and manage the retrieval accordingly. The Vault Store consists of two structures:

• Vault Store Database: This is the SQL database that acts as a directory to all items inside a Vault Store. Each item, on average, uses between 500 and 3,000 bytes of storage in the Vault Store Database (the size depends on what is being archived, i.e., File System data, Exchange mailbox data, or Domino mailbox data). This small amount of information is all that is needed to identify an archived item’s storage location. Therefore even a large installation with hundreds of millions of objects will have a relatively small and manageable database for each store.

• Vault Store Partitions: These physical subdivisions of a Vault Store contain the “managed locations” where archived information is stored. A Vault Store consists of one or more Vault Store Partitions. These Partitions could be dubbed “physical storage locations” as they refer to existing storage locations like a particular share on a file server or a mounted volume on a SAN.
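As a rough worked example of the Vault Store Database sizing figures above (the 1,500-byte average is an assumption within the quoted 500-3,000 byte range, and the item count is illustrative):

```python
# Rough sizing of the Vault Store Database from the per-item figures.
# Assumption: 1,500 bytes per item on average, within the 500-3,000 range.
items = 100_000_000              # one hundred million archived objects
avg_bytes_per_item = 1_500

db_size_gb = items * avg_bytes_per_item / 1024**3
print(round(db_size_gb))  # ~140 GB of directory data for 100M items
```

Even at this scale, the directory remains a small fraction of the multi-terabyte archive it indexes.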

One Enterprise Vault server can have multiple Vault Stores, but a single Vault Store cannot span multiple Enterprise Vault servers.


When you create a new Vault Store in the Vault Administration Console, you select the Enterprise Vault Server and storage service that will host the new Vault Store. The storage devices and other storage related settings will be configured at the partition level.

Defining Storage devices and locations: Vault Store Partitions

As shown above, the Vault Store is a high-level container, visible to the administrator as a major subdivision of the archive, but without any reference to the storage itself. Partitions, in contrast, are the representation of a specific storage location and policy. Settings include:

– The device and technology used
– Settings for the “Volume” on the device
– Whether Single Instance Storage should be used
– Whether the device is WORM capable
– A flag that NTFS ACLs are available on the device
– All settings regarding Collections
– The device and settings used for migrations to another tier of storage

Partitions within the Vault Store are usually (but do not have to be) located on specific physical devices, which can be mixed within a particular Vault Store, not only in their location (e.g., a particular NTFS volume) but also in their storage technology. For example, a tape-based partition could be closed and a new magnetic disk partition opened within the same Vault Store, or a NAS or SAN device could be replaced by a CAS store. This means that an organization changing to a new storage technology does not have to perform an immediate store migration before starting to use the new device – all devices can be used in parallel.

While a Vault Store can contain more than one Vault Store Partition, only one of these partitions will be open for writing at any one time. The only activity in the other partitions will be to delete items as they expire, and no further items will be added to them. This means that different backup and DR regimes can be applied to different partitions of a Vault Store depending on their status, as described later. In addition, Vault Stores can contain multiple types of data — for example, both the end-user mailbox archives as well as the journal archives for a given set of users. As we will describe later in this paper, this allows companies to achieve Single Instance Storage across instances of email from journal and end-user mailbox archives.

The Vault Store Partition is the most critical item in the storage structure and, to this end, is designed to be fully self-contained. This means that if a Vault Store Partition were recovered from a backup tape, for example, then all content required to make that Vault Store operational (e.g., item security, retention periods, and original locations) is contained in the Vault Store. Given the size and number of objects held in an average Enterprise Vault Partition, administrators are urged to back up the SQL database and the AltaVista indices, as recreating that information alone can literally take weeks – still, it is good to know that there are tools to recreate it from the stored items in the partition as a “last line of defense” in a complete disaster scenario.


By creating multiple Vault Store databases, each containing multiple partitions that may each easily contain millions of individual archived items, you can easily scale Enterprise Vault to manage huge amounts of data across dozens of completely different storage devices. This approach is the key to effective business continuity and to ultimate scalability, even if the system needs to service tens of thousands of users and contain many terabytes of data.

Logically separating a Vault Store: Archives

We have now seen how the Vault Administration Console helps the administrator by virtualizing the view of the physical storage to aid in management of the archive. These container structures not only aid the management of the archive, but they also improve the resiliency and scalability of Enterprise Vault. When looking at the Vault Administration Console, you will see another logical concept that should be mentioned: the Archives container, shown in Figure 4 below:

Figure 4 - The Archives container is used to manage and represent logically connected items inside Enterprise Vault (e.g., a single user’s mailbox), regardless of their physical type or location.

Where the Vault Stores are the container structures used for managing the physical data on the disk, Archives are used to manage logical collections of items (e.g., a single user’s mailbox, a file server share, or a SharePoint document library), regardless of the storage location of this archived content. The Archives container in the Vault Administration Console also helps manage the indexing services and other tasks on “logically connected items,” like the management of security. (You can find more about the index concepts relating to archives in the white paper “Enterprise Vault Technical White Paper - Indexing and Search.”) There are several different types of archives available in Enterprise Vault 2007:

– Exchange Mailbox Archives


References all items archived from an Exchange user’s mailbox and keeps the security and index settings accordingly. Mailbox archives are structured, meaning that they keep information about the folder hierarchy inside the archive.

– Exchange Journal Archives

References all items that are archived from one or more Exchange journal mailboxes. Journal archives are flat archives (without information about folder hierarchies).

– Exchange Public Folder Archives

References all items archived from one or more public folder root paths. Permissions are synchronized from the Exchange permissions on the Public Folder.

– Domino Mailbox Archives

References all items stored from a single Domino user’s mailbox. Domino archives are always flat, as Notes users use “views” instead of folders, so that items can potentially belong to several different views.

– Domino Journal Archives

References all items archived from one or more Domino journal mailboxes. Domino journal archives are also always flat.

– File Archives

References all items archived from one or more file system archive points. Permissions are synchronized from the file system. These archives are also structured.

– Sharepoint Archives

References all items archived from one or more SharePoint document libraries. These archives are flat.

– Shared Archives

References an archive containing information that can be shared across sources and groups of users. Shared archives are flat archives.

Over time, a single archive may grow very large, and the actions of the EV migration services may mean that its content is physically located across several different storage systems. A single archive can span multiple Vault Store partitions, but when an end user searches an archive, he or she is presented with a consistent set of results and need not be aware of the location of the actual items.

WORM - Building a tamper-proof read-only archive


The default Enterprise Vault deployment, regardless of the underlying storage technology, will create a read-only archive. Even if the archive is housed on a standard NTFS disk with read and write permissions, the default behavior is to disallow users from modifying or deleting content from the vault directly. There are options that can be set to allow users to delete content (over which they have rights) directly, but they can never modify previously archived content. Enterprise Vault is effectively a read-only, or “fixed content,” store. Enterprise Vault does not provide any way for item content to be changed once written to the archive, unlike, say, a regular file share. If an item that has already been copied to the archive (e.g., from a file system) is subsequently edited and then re-archived, this modified file will be treated as a new item and archived separately. A search request would show multiple versions, distinguished by their time stamps. Most of the time this is considered sufficient protection against tampering, but in certain regulatory compliance situations, the additional protection of WORM storage may be required to defend against illicit tampering at the media level. Enterprise Vault supports the latest WORM technologies from all leading vendors. You will find a description of the available storage options later in this document.

What is SIS?

If the same 10 MB email is sent to every user in the organization, why store many copies of it when we can store a single version of the document and “describe” that the email is owned by a number of different people? Using hashing algorithms, each file generates its own unique ID. If we determine that this unique ID has been archived before, we do not store the item again, but instead simply add a second, user-specific header to the existing saveset file containing the item already stored. This will contain all the user-specific properties for the second user sharing the same item, but without the need to store another copy of the main content.

To check for the SIS opportunity for a single file (e.g., a Word file), the metadata properties of the file are separated out and then a full hash of the main file body is performed to create a unique ID for that file. For email messages, the “per user” attributes (e.g., read/unread status or follow-up flags) are separated out, and then the potentially shareable part of the email (e.g., recipient lists, subject, message body, and attachments) is examined and a unique hash created for SIS checking. By doing this, we not only maximize the effectiveness of SIS storage, but we also ensure that when items are presented back to the end user, they exactly match “their” copy of the file. Note that this is a key feature of the way that Enterprise Vault performs SIS. If an apparently identical message were shared between recipients without taking into account that per-user properties could differ, then we would lose, or worse, change information, such as the fact that a message was unread or that the user had changed the title in their own mailbox copy.

Single Instance Storage operates between items within a single Vault Store Partition. When Enterprise Vault is installed, a key design consideration is defining partitions in a way that optimizes SIS benefits. The reasons for having more than one partition in a Vault Store will be discussed later.
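The SIS check described above can be sketched as follows. The message structure, the fields chosen as “shareable,” and the use of SHA-256 are illustrative assumptions, not Enterprise Vault’s actual format:

```python
import hashlib

# Sketch of the SIS check: per-user attributes are split off, and only
# the shareable part of the message is hashed. Field names are assumed.

def sis_key(message: dict) -> str:
    shareable = (
        message["subject"] + "|" +
        ",".join(message["recipients"]) + "|" +
        message["body"]
    ).encode("utf-8")
    return hashlib.sha256(shareable).hexdigest()

alice_copy = {"subject": "Q3 results", "recipients": ["all@corp"],
              "body": "See attached.", "read": True,  "flagged": False}
bob_copy   = {"subject": "Q3 results", "recipients": ["all@corp"],
              "body": "See attached.", "read": False, "flagged": True}

# The per-user flags differ, but the shareable parts hash identically, so
# the second copy only adds a user-specific header to the existing saveset.
print(sis_key(alice_copy) == sis_key(bob_copy))  # True
```

Because `read` and `flagged` never enter the hash, each user still gets back exactly “their” copy while the bulky content is stored once.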


Retention management

So far we have spoken about putting items into the archive and how the system is both resilient and easy to recover in the event of a disaster. However, another key activity any archive should be able to perform is the deletion of content once it has been kept long enough to meet either business or regulatory commitments. Even though using Enterprise Vault means that the cost of ownership of older items can be significantly lower than keeping everything in the front-line stores, it is still even more cost-effective to remove items when they have no further use. From the pure storage point of view, it saves direct storage costs as well as indirect storage ownership costs. From the regulatory point of view, deleting items when they are no longer required to be retained can potentially save the considerable cost of unnecessary discovery of those items in response to a subsequent litigation or request from a regulator.

Every item that is added to the archive is assigned a retention category that indicates to Enterprise Vault how long the item should be retained, and from which the expiry service derives the date when it will delete the item from the archive. The retention category is not a physical date added to the archived item; rather, it is a category associated with the item. If a physical date were added to the file, there would be no efficient way to make wholesale changes to the retention dates assigned to items. A retention category allows the expiry service to determine on a file-by-file basis whether the expiry time has been reached and, at the same time, allows easy updates to the retention category, for example, if a regulatory body extended a retention period. The expiry service runs as a separate service, meaning that it can be run at a “quiet” time, for example, when users are not accessing the system to any great extent.

Note that although automatic expiration is the most common way to remove items from the archive, there are both end-user and administrator functions to delete items explicitly, but these are subject to permissions. Though not directly connected to the storage service, Enterprise Vault also has the ability to manage the lifecycle of any shortcuts created in Microsoft Exchange. This means that not only are items in the vault retained exactly for as long as needed, minimizing the risk and the amount of storage consumed, but the same benefits can also be applied to the users’ shortcuts in a centrally controlled fashion, so that shortcuts can be removed sooner than the item expiration period. This is necessary as, over time, the many shortcuts in a user’s mailbox could fill the mailbox quota. An example of this is to allow shortcuts to be retained in a user’s mailbox for one year, while the items themselves remain in the archive for a further five years. Once the shortcuts have expired, these items can still be searched and retrieved by the various search applications and Archive Explorer, and the old shortcuts no longer clutter up the mailbox. In any case, shortcuts are automatically deleted when archived items are expired.
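The category-versus-stamped-date design can be illustrated with a small sketch; the category names, dates, and record layout are hypothetical:

```python
from datetime import date, timedelta

# Sketch of why retention is a category reference rather than a date
# stamped on each file: extending the category retroactively affects
# every associated item at once. All names and dates are hypothetical.

retention_categories = {"business": timedelta(days=7 * 365)}

archived_items = [
    {"id": 1, "archived": date(1999, 6, 1), "category": "business"},
    {"id": 2, "archived": date(2006, 6, 1), "category": "business"},
]

def expired(item, today):
    keep_until = item["archived"] + retention_categories[item["category"]]
    return today > keep_until

today = date(2007, 12, 1)
print([i["id"] for i in archived_items if expired(i, today)])  # [1]

# A regulator extends the period to 10 years: a single category update,
# with no need to rewrite a date on millions of stored files.
retention_categories["business"] = timedelta(days=10 * 365)
print([i["id"] for i in archived_items if expired(i, today)])  # []
```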


The “physical” Storage architecture

Inside a Vault Store Partition

As we have stated before, the Vault Store Partition is the level where the “physical storage” is added to the Enterprise Vault configuration. For most storage devices this will be a primary storage path and (if needed) a second tier where the data should be migrated after a given period of time. Only when using EMC Centera will you add the IP addresses of a Centera frame, without the option to further migrate the data. In this chapter we will focus on file system based storage devices like NTFS volumes or CIFS shares.

File structure on disk

Most archiving use cases are based on archiving data according to its age. When archiving from mailboxes or file shares, this age might be a few weeks, while journal archiving of email messages means that the item is only a few seconds old. Nevertheless, customers see the archive build up as a timeline of information; therefore Enterprise Vault is designed to organize the data in a flexible folder structure that represents the last modification date of the information. This structure will automatically be created during the archive process and will grow in a very predictable and efficient way, resulting in a folder hierarchy that never exceeds 4 levels of subfolders and splits any day’s worth of information into a maximum of 25 folders (1 “day” folder containing 24 “hour” subfolders). This is illustrated below:

Figure 5 - Enterprise Vault stores content in a flat file structure and, initially, as separate items

As can be seen, the partition is divided into folders in the format YEAR (YYYY) \ MONTH (MM) \ DATE (DD) \ HOUR (HH). The lowest folder level shows items collected from the same time period, and by default, these items will come from the same location in the primary application.
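The folder layout above can be sketched as a simple path-building function; the partition root path is hypothetical:

```python
from datetime import datetime

# Sketch of the YYYY\MM\DD\HH partition folder layout described above:
# four folder levels under the partition root, keyed by the item's
# last-modified timestamp.
def partition_path(root: str, modified: datetime) -> str:
    return "\\".join([root,
                      f"{modified.year:04d}", f"{modified.month:02d}",
                      f"{modified.day:02d}", f"{modified.hour:02d}"])

print(partition_path("E:\\EVStore\\Ptn1", datetime(2007, 12, 3, 14, 25)))
# E:\EVStore\Ptn1\2007\12\03\14
```

Each day contributes at most one day folder plus 24 hour subfolders, which keeps directory growth predictable regardless of archive volume.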


Every single item that is archived has the ability to create its own “Saveset” file; however, as will be shown later when discussing Single Instance Storage, not every item will create a file, as it can be referenced to an already existing one (provided they are the same).

Closing of Vault Store Partitions — Improving Backups

As we have already seen, the Vault Store contains one or more Vault Store Partitions. At any one time, only one Vault Store Partition is open and being written to. Organizations can take advantage of this to further reduce the TCO of Enterprise Vault.

As described earlier, Enterprise Vault stores items in individual files rather than in a single large structured file, which lends itself to incremental file backups rather than having to back up the entire archive store every time. However, with some backup solutions even this approach becomes inefficient, as the time taken to detect new or modified files can become prohibitive when vast numbers of files are being targeted. The concept of Vault Store Partitions addresses this problem. Partitions are given a maximum theoretical size, and once this size is reached they can be closed and another partition opened to store all future data. Once a partition is closed, nothing new is written to it. Even items that are recalled from that partition, edited, and re-archived will be stored in the current "open" partition.

Imagine an archive where 80%1 of your corporate email content now resides in Enterprise Vault, and that this equates to 5 TB of archive storage. If the Vault Store Partitions were limited to 200 GB, then at any one time only a maximum of 200 GB out of the 5 TB will be liable to change due to newly archived items. This makes the archive very "backup-friendly," as little of the overall corporate content now changes, and entire areas of the archive that are now closed can be backed up much less frequently, if at all. In practice, the only changes made to a closed partition are deletions, which means that the frequency of backups for these closed partitions can be greatly reduced without fear of data loss. Even restoring an older closed partition may be acceptable, as references to the deleted items will have been removed from the directory and the indices; this is acceptable in any but the most closely regulated environments.

With large amounts of your corporate knowledge potentially residing in Enterprise Vault, it is imperative that consideration is given to how to effectively back up the archive and, more importantly, how to recover it in the event of a disaster. The self-contained nature of the DVS file and the concept of closing Vault Store Partitions meet these goals, creating a storage system that is highly resilient to failure and that can be efficiently recovered after a disaster.
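The rollover behavior of partitions, one open partition at a time, closed partitions frozen as read-only, can be sketched as follows. The class and field names are illustrative, not Enterprise Vault APIs:

```python
class VaultStore:
    """Minimal sketch: one open partition; roll over when the size cap
    is reached. Closed partitions are never written to again."""

    def __init__(self, cap_bytes: int):
        self.cap = cap_bytes
        self.partitions = [{"closed": False, "used": 0}]

    @property
    def open_partition(self):
        return self.partitions[-1]

    def write(self, size: int) -> int:
        """Store an item; return the index of the partition it landed in.
        New and re-archived items always go to the open partition."""
        if self.open_partition["used"] + size > self.cap:
            self.open_partition["closed"] = True      # frozen: back up rarely
            self.partitions.append({"closed": False, "used": 0})
        self.open_partition["used"] += size
        return len(self.partitions) - 1
```

With a 200 GB cap, only the last (open) partition ever accumulates new files, so incremental backups only need to scan that partition.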

Moving a Vault Store Partition

It is perfectly possible to move NTFS-based Vault Store partitions between volumes on the same storage device or between NTFS storage devices. Please refer to the following technotes on the Symantec support site for more information:

1 It is estimated that 70-80% of all content in a typical messaging system is older than 30 days.


273271 – How to move a Vault Store partition or Vault Store on the same Enterprise Vault server from one location to another
282880 – How to move a Vault Store and Vault Store partition to a different Enterprise Vault (EV) server in the SAME site

What is a DVS file?

The DVS file is a reminder of how long Enterprise Vault has been working in the enterprise environment. DVS stands for Digital Vault SaveSet, and the "Digital" part of that name refers to Digital Equipment Corporation, the company where Enterprise Vault was originally created. A DVS file is often referred to as a "Saveset."

A DVS file is a single piece of archived content. Each DVS file consists of two main sections: the main section contains the actual archived content and its index (HTML) rendition in a compressed format, and the second section describes where the content came from originally and who owns it (per-user information). The DVS file obtains its name from the following parameters: <checksum><date><time><seconds><saveset_uniqueid>.dvs

In addition to holding the original item, Enterprise Vault retains an HTML text version of the item in the DVS file (provided that the content can be converted). This offers a degree of future-proofing, as it means that there is a version of the item that can be read without the original viewing application, such as Outlook or Notes. This is important because of the increasing longevity of items held in the archive. Many companies are adhering to retention periods that could be several decades, so they should ask themselves the question, "Will applications such as Microsoft Office be available in 100 years' time, and if not, can I access the stored content?" In addition, maintaining an HTML rendition means that an item can be rapidly viewed from a Web application without having to perform on-the-fly conversions.
This is extremely valuable for saving time and cost during a larger legal discovery, where highly paid legal staff do not want to wait for applications to open or, even worse, to be installed on demand by IT staff.

DVS files for open standards

Imagine a situation where you walk into the office and find a discarded backup tape on the foyer floor. On examination of the tape, you find that it contains a partial backup of Enterprise Vault with lots of DVS files. Can you open these files without Enterprise Vault? We have already stated that this is a proprietary format, so surely this is impossible? It should, of course, be stated that every effort is made during the design and implementation of Enterprise Vault to ensure that the archive is secure and that a discarded backup tape is an extreme situation out of the control of Enterprise Vault, but the answer to the above question is, yes, you will be able to open the DVS file. There are tools available from Enterprise Vault support and professional services that will allow you to open these files and recover the content. The actual workings of this process are beyond the scope of this paper, and if you require further information, please contact Symantec support.
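The Saveset naming parameters listed earlier (<checksum><date><time><seconds><saveset_uniqueid>.dvs) can be illustrated with a short sketch. The actual checksum algorithm and field widths are internal to Enterprise Vault, so a hash prefix and simple timestamp fields stand in here:

```python
import hashlib
from datetime import datetime

def dvs_filename(content: bytes, archived: datetime, saveset_id: str) -> str:
    """Illustrative composition of a Saveset file name from the parameters
    the paper lists. The checksum shown is a SHA-1 prefix standing in for
    Enterprise Vault's internal checksum."""
    checksum = hashlib.sha1(content).hexdigest()[:8]
    date = archived.strftime("%Y%m%d")     # <date>
    hhmm = archived.strftime("%H%M")       # <time>
    seconds = archived.strftime("%S")      # <seconds>
    return f"{checksum}{date}{hhmm}{seconds}{saveset_id}.dvs"
```

The point of the scheme is that the name alone embeds both a content fingerprint and the archive timestamp, so a DVS file remains identifiable even when separated from its Vault Store database record.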


Using a flat file system

By looking at the files on the disk, we can already determine a great deal about how content is stored and arranged in Enterprise Vault to maximize the efficiency of storing the archive. We can see that DVS files roughly equate to a single archived item, meaning that Enterprise Vault is a flat file system.

To see why Enterprise Vault benefits from storing items in a flat-file manner, we have to look at some of the problems associated with the longer-term retention of content within Microsoft Exchange. Exchange stores its content in large database files (.EDB files). This lends itself to the performance requirements of a dynamic front-line system but is not ideal for long-term retention of large and growing volumes of information. Each EDB file could contain millions of items, and if viewed from the file system, only one huge file would be visible. This huge file highlights the problems of long-term storage scalability in Exchange that Enterprise Vault solves. The EDB file and Exchange themselves will scale very well, but often the critical supporting applications alongside Exchange will fail long before Exchange reaches a limit. For example, backup applications will struggle to recover large EDB files in a limited time period.

Enterprise Vault initially stores items in a flat file format rather than within a database or pseudo-database structure. This means that there are far fewer scalability headaches, and in addition, single items can be accessed without the need to load a large database file. For archive stores, we find that there is usually no advantage in caching archived items, as the access pattern is that of relatively infrequent and random retrieval from very large volumes of items.
However, while writing individual flat files is optimal in the short term for the reasons outlined above, and for good transactional integrity and single-instancing, this approach needs to be balanced in the long term against certain inefficiencies in holding extremely large numbers of discrete files. Our approach to this is discussed later in the "Collections" section.

A further advantage of the flat file approach is that a more granular transactional approach can be taken to guarantee the integrity of an archived item. Enterprise Vault only deletes an item from the target system (or replaces it with a shortcut) when it is assured of the safety of the archived copy. This can simply mean deleting only once the item has been successfully written to the store, but it can be further enhanced by deferring deletion until the archived version has itself been backed up. This is achieved by a separate "watcher" service that monitors the backup status of archive files and only triggers the deletion of the original file once the archive file has been backed up. This "safety copy" feature is a critical element for our customers in maintaining the integrity of the overall archive.

The current location and status of the DVS file is stored in the Vault Store Database, held in Microsoft SQL Server. It is important to note that only Enterprise Vault data such as file name and location is stored in the SQL database, which does not store any of the archived content or metadata. Maintaining an accurate record of both where a DVS file currently resides and what content is stored within it is very important when we look further at the lifecycle of a DVS file, as it moves to another storage system or becomes part of a larger container.
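The "safety copy" rule can be sketched as a simple sweep: originals stay in the primary application until their archived DVS file is confirmed as backed up. The function and data shapes are illustrative; the real watcher service tracks backup status internally:

```python
def safety_copy_sweep(pending: dict, backed_up: set) -> list:
    """Sketch of the safety-copy rule: the original item may be deleted
    (or replaced with a shortcut) only once its DVS file is known to be
    backed up. 'pending' maps original item id -> DVS file name;
    'backed_up' is the set of DVS files the backup has covered."""
    released = []
    for original, dvs_file in list(pending.items()):
        if dvs_file in backed_up:
            released.append(original)   # now safe to delete / shortcut
            del pending[original]       # no longer awaiting backup
    return released
```

Items whose DVS file has not yet been backed up simply remain pending and are re-checked on the next sweep.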

Collection services — a more intelligent flat file system

It was previously stated that Exchange and other primary applications were not designed for the long-term storage of content, as they store content in large file structures, unlike Enterprise Vault, which stores its content in a flat file format, that is, as single files.


Initially, storing items as single files has many advantages, most especially speed of access to the most recent content. However, as items age, it becomes less and less favorable to store them as single files. For example, most file-based backup software can struggle to perform incremental backups of file structures with huge numbers of files. There is an ideal balance in the middle: initially storing items as single files for performance efficiency, then later collecting them for improved storage occupancy and backup optimization. Enterprise Vault does just this.

The collector service works within a Vault Store Partition to collect, by policy, lots of DVS files into larger containers (CAB files), which dramatically reduces the number of files in a Vault Store Partition. The backup software is now presented with fewer, larger items rather than lots of smaller items, which ultimately increases backup performance. The collector has a configurable policy, allowing the administrator to define both the age at which collection starts and the maximum size of each collection file. In addition, larger files can be excluded from the collection process but treated as collected if their size alone equates to a collected file. This behavior is completely transparent to end users and to applications layered on Enterprise Vault. The mapping of items to containers is maintained in the normal SQL-based Vault Store directory, which is updated to reflect the new storage structure.
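The collection policy described above, an age threshold, a maximum container size, and an exclusion for files that are already container-sized, can be sketched as follows. The tuple layout and parameter names are illustrative:

```python
def plan_collections(files, min_age_days, max_cab_bytes):
    """Sketch of the collector policy: files older than the policy age are
    packed into CAB-sized groups; files at or over the CAB size are left
    alone but treated as already collected. 'files' is a list of
    (name, size_bytes, age_days) tuples."""
    eligible = [f for f in files if f[2] >= min_age_days]
    collections, current, used = [], [], 0
    for name, size, _age in eligible:
        if size >= max_cab_bytes:
            continue                      # excluded: counts as collected as-is
        if used + size > max_cab_bytes and current:
            collections.append(current)   # container full, start a new one
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        collections.append(current)
    return collections
```

The backup software then sees one CAB file per group instead of many small DVS files.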

Migration service — migration as a central activity

We have already discussed how an average item in the archive will likely be stored for longer than the expected life span of the storage system. We would then expect that an item will, at some point, have to be moved to another storage system. This migration of content, which is built into the core of Enterprise Vault, can happen in two dimensions. First, new storage technology can replace the existing store as a primary archive repository, and the partition mechanism described earlier allows this to be done rapidly and in a way that is transparent to the end user. Second, as described below, a different storage technology can be introduced as a secondary archive store sitting behind the primary archive store.

Being able to quickly and easily migrate content not only allows new storage systems to be brought on stream as and when needed, it also allows for the continued lowering of the TCO of the archive store. For example, items may initially reside on a fast NAS system, but as they age they can be moved to a slower but far less expensive storage system (e.g., magnetic tape). This migration of content between tiers of storage systems is part of overall Information Lifecycle Management.

The migration service is configured at the Vault Store Partition level, and the standard service can copy content from NTFS volume to NTFS volume. The target volume could, for example, be magnetic tape or an optical store fronted by software that presents it as an NTFS volume. Alternatively, special migrators can be plugged into the mechanism to support non-NTFS stores. For example, there is a Symantec NetBackup™ migrator that allows the same tape infrastructure that is being used by NetBackup to also be used by Enterprise Vault. Note that this is a further advantage of collections, since it is collections that are moved to the secondary store; they are a much more efficient way of handling "slow" storage, such as tape, than relatively small files.

Again, this migration behavior is transparent to users and layered applications.
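Selecting what to hand to a migrator can be sketched in one function: only collection files past the policy age move to the secondary tier, mirroring the note that collections, not many small discrete files, are what get written to slow storage. Names and the `.cab` suffix check are illustrative:

```python
def select_for_migration(partition_files, migrate_after_days):
    """Sketch of age-based secondary migration: return the collection
    (.cab) files old enough to move to the secondary tier.
    'partition_files' is a list of (file_name, age_days) pairs."""
    return [name for name, age in partition_files
            if name.endswith(".cab") and age >= migrate_after_days]
```

Uncollected DVS files and young collections stay on the primary store until they qualify.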


Figure 6. Enterprise Vault is able to migrate content, by policy, from a Vault Store Partition

Note: The migration service discussed above is a policy-based service that allows for multiple tiers of storage. Professional Services offers a migration consultancy service for organizations that want to move completely to a new storage system (e.g., from NTFS to EMC Centera) and cease using the initial one. This is not the same as the collector/migrator services outlined above: a consultancy-led migration moves everything off the old store, rather than leaving items in place alongside the new store.

Summary of the interaction of Enterprise Vault with an NTFS file system

Since the largest opportunity for Enterprise Vault has been, and will continue to be, archiving from environments that are compatible with Active Directory security, the predominant file system tends to be NTFS. In addition, the adoption of NTFS as an open standard interface to other non-disk-based systems (e.g., tape) means that Enterprise Vault support for NTFS is the base standard from which all other storage support is derived. There are other, non-NTFS-based storage systems available for use with Enterprise Vault, but these need to be carefully checked against the Enterprise Vault certification and compatibility tables (available on the Symantec support website). Key benefits to the storage infrastructure maintained by Enterprise Vault are:


• Storage tiering: Enterprise Vault inherently moves less active data out of primary applications such as Microsoft Exchange and, as described above, can optionally migrate older archived data to a secondary tier of archive storage, whether disk, tape, or other media.

• Storage rationalization: The minimum amount of content is actually stored on the storage system. Not only is every item in the archive compressed and single-instanced, but this is done without compromising data reliability.

• Backup optimization: Since Vault Store Partitions can be closed and DVS files are read-only, the archive is very efficient to back up and recover. Enterprise Vault solves many of the problems of primary applications that store content in massive data files or on relatively expensive storage solutions. Enterprise Vault stores items as individual files where appropriate and combines those into larger collections, managed by a centralized policy.

• Data integrity: As described earlier, the "safety copy" functionality ensures that no archived data is lost, by (optionally) not deleting the original item until a backup or replica has been created.

• Data resilience: From the DVS file all the way through to the Vault Store, each of the storage areas is self-contained, meaning that it can be recovered without the rest of the archive or any of the support databases. As described, HTML copies "future-proof" archived data. In addition, while so far we have spoken only about NTFS, the API infrastructure built into the Open Storage Layer means that we are able to support a wide range of storage devices today and seamlessly bring new storage technologies on stream in the future.

• Data fidelity: As mentioned, we have never sacrificed data fidelity in the design of Enterprise Vault, ensuring that no data or metadata, including per-user items such as the "read-receipt flag," is lost during the archiving process.

Storage hardware and systems


Figure 7 – Vault Store partition options

Enterprise Vault supports a large number of different storage devices and systems, covering all relevant media, sizes, and vendors. There are three different categories of systems:

- File System based storage (First Tier)
- Systems integrating with the Enterprise Vault Migrator (Second Tier)
- CAS and other API-based Storage (No Tiering)

This section focuses on the specific implementations and characteristics of the various storage systems.

Storing Enterprise Vault archives on Network Attached Storage

We will now examine the interaction of Enterprise Vault with non-Windows NTFS storage. Many vendors have written front ends to allow open access to other non-disk-based storage hardware (e.g., the Pegasus Disk Technologies NTFS front end to UDO storage). Later in the paper, we will examine some of the more specific integrations available between Enterprise Vault and non-NTFS storage systems.

NetApp storage systems

NetApp has a wide range of solutions for customers, from high-end storage area networks to lower-cost network attached storage. One of the key benefits that NetApp offers is that the same storage infrastructure can be repurposed between these two offerings at any time. In general, NetApp filers present themselves as NTFS volumes via the CIFS protocol, so they work normally with Enterprise Vault; the exception is SnapLock, where some special handling is required.


NetApp SnapLock

There are many additions to the OnTap operating system to improve resiliency or help with disaster recovery. One such addition is SnapLock, which enables the creation of WORM volumes in a disk-based NetApp environment. Just as with standard NetApp volumes, these are fully supported by Enterprise Vault, but there are a few considerations that mean special handling is needed when implementing SnapLock. These storage volumes are an explicit administrator choice for a Vault Store target:

1. If you are using a SnapLock volume, you are unable to take advantage of Single Instance Storage. SnapLock turns a filer into a true WORM device, meaning that data can be neither deleted nor changed.

2. Retention management remains the same from an Enterprise Vault administration point of view, but it is mapped to the SnapLock WORM mechanism, where the retention period is written as an attribute of the file.

NetApp summary

Because of the open nature of the OnTap operating system, NetApp file systems operate in much the same way as NTFS volumes. The only difference arises if SnapLock has been added to any part of the system. In addition (and the subject for another paper), Enterprise Vault 2007 and NetApp OnTap now work together to provide placeholder support, so organizations that use Enterprise Vault File System Archiving to do policy-driven archiving from NetApp filers can now leave transparent placeholders on the target filer.
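The "retention period written as an attribute of the file" mapping mentioned for SnapLock is conventionally expressed by setting a file's last-access time to the retention expiry date and then removing write permission, which commits the file to WORM state on a SnapLock volume. The sketch below illustrates that convention under that assumption; on an ordinary file system it merely changes timestamps and permissions:

```python
import os
import stat

def commit_to_snaplock(path: str, retain_until: float) -> None:
    """Illustrative SnapLock commit convention (assumption, not an
    Enterprise Vault API): the retention expiry is stored as the file's
    last-access time, and removing write permission then commits the
    file to WORM state on a SnapLock volume."""
    mtime = os.stat(path).st_mtime
    os.utime(path, (retain_until, mtime))   # atime carries the retention expiry
    os.chmod(path, stat.S_IREAD)            # read-only: the WORM commit signal
```

Enterprise Vault's retention categories would map onto this by computing `retain_until` from the item's retention period.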

Hitachi Content Archive Platform (HCAP)

The Hitachi Content Archive Platform is a file system that natively supports the NFS, CIFS, HTTP, and WebDAV communication protocols. It is a storage system with a single file system that can support hundreds of millions of objects across large amounts of capacity. Enterprise Vault has been integrated with HCAP to use the CIFS protocol for storing information on the HCAP system while utilizing the device's WORM and retention management functionality. HCAP provides a network clustered architecture that allows customers to keep adding nodes, processors, cache memory, host ports, and capacity as needed. This effectively allows larger partitions than normally feasible for CIFS file systems. A customer can start out with a four-node archive and connect additional nodes, two or more at a time, with no theoretical limit.

Systems integrating with the Enterprise Vault Migrator

Enterprise Vault is designed to provide tiered storage across technologies, vendors, and devices. While some devices are disk-based and provide fast, random access to the archived data, there is another, more cost-effective group of systems that lack the performance of the primary store but provide a viable platform for storing very old information that is only infrequently accessed.


Figure 8 – Vault Store partition migration options

These systems should be used for data that is

– Outside the scope of Offline Vault (an age limit for Offline Vault is set)
– Unlikely to be exported to PSTs
– Has a low probability of being relevant to legal discovery and investigation cases

It should also be noted that re-indexing from slower storage devices can be very time-consuming, so it is recommended to back up indices regularly if secondary storage devices are used. Enterprise Vault 2007 supports the following secondary storage devices:

- Enterprise Vault File-System Migrator
- Fujitsu Eternus
- Veritas NetBackup
- IBM Tivoli Storage Manager / DR 550

Pegasus Disk Technologies – InveStore

Pegasus InveStore presents optical jukeboxes, such as DVD or UDO, to Enterprise Vault as a fully compatible NTFS file system. Therefore, no proprietary integration is needed, and the Enterprise Vault Migrator is used. This has compelling advantages over a jukebox integration at the driver level, as the file system presented by Pegasus can be used by other applications as well, allowing organizations to share the jukebox and unify their WORM storage without any driver or hardware issues.


Fujitsu Eternus

Fujitsu Eternus Archive Storage support for NTFS partitions in Enterprise Vault 2007 enables Enterprise Vault collection files and large Saveset files to be migrated to a Fujitsu Eternus storage device. The Eternus migrator connects to the Eternus device via Fujitsu's Content Archive Manager Client software. Due to the potentially huge overhead of migrating and retrieving millions of very small files (the underlying media could be tape), only collection files and large Saveset files are migrated, making the Fujitsu Eternus an attractive second-tier storage device that complements a front-end file-system solution for younger items. To provide the necessary interfaces to Enterprise Vault, Fujitsu Content Archive Manager Server 1.4 or later needs to be installed and configured on the Fujitsu Eternus storage device, and Fujitsu Content Archive Manager Client 1.4 or later needs to be installed on the Enterprise Vault server.

Figure 9 – Fujitsu Eternus migration options

After selecting the Fujitsu Eternus as an archiving target on the Migrations tab, additional configuration properties need to be specified on the Advanced tab. Refer to the Enterprise Vault documentation for further instructions.

Veritas NetBackup

Enterprise Vault 2007 integrates with the NetBackup media manager. This means that customers who have already invested in NetBackup can reuse the same hardware management infrastructure for storing Enterprise Vault archives alongside their backups. Usually the hardware would be some type of offline storage, typically tape, but NetBackup can utilize many different media types, including disk.


Figure 10 – Veritas NetBackup migration options

Typically this would be done in the context of secondary archive migration, where the NetBackup infrastructure would be used for the second tier of archive storage. In this case, Enterprise Vault will only migrate older, archived files to NetBackup as collections. It would be possible to use the NetBackup infrastructure as the primary archive store, but there would still need to be a primary disk store acting as a temporary cache until the collection process had completed.

Integration with Tivoli Storage Manager / IBM DR 550

Enterprise Vault is able to talk to either a Tivoli Storage Manager (TSM) or IBM DR 550 system using the TSM Client API. To enable this integration, the TSM Client needs to be installed on the Enterprise Vault server. Another requirement is that the "Data Retention Manager" option is licensed and available on the TSM server (it is included in the DR 550); this option controls the retention and expiration of the content managed by the TSM environment. The integration itself provides the same feature set as the NetBackup solution.

Content Addressed Storage and other API-based storage

Enterprise Vault supports more than just NTFS-based file systems. As the archiving market has matured, storage vendors have modified their offerings to meet this market or have created completely new storage systems for the long-term storage of content, with reduced costs and management overheads. This section looks at the differences or additions we have made to support important non-NTFS-based storage platforms.


EMC Centera

EMC Centera is an example of a storage system ideally suited for archiving. It is classified as a content addressed storage (CAS) system. CAS systems create unique identifiers for the items stored. These unique IDs are, in effect, very similar to the SIS "hash codes" Enterprise Vault creates. Centera can optionally act as a WORM (Write Once Read Many) store, which makes it very suitable for regulatory retention. In this case, Centera has a concept of retention classes. Items cannot be deleted from Centera until the time period defined by their retention class has expired, and the Enterprise Vault expiration service will delete expired items after this period. Any attempt to delete items before then will result in an error. This has an advantage over classic WORM devices, such as optical disks, because individual items can be deleted after their retention periods have expired. In addition, since it is magnetic disk–based, its fast recall times make extensive searching and discovery feasible.

Enterprise Vault has to take four main considerations into account when interacting with Centera:

• Centera may act as a WORM storage device, and all content is read-only.
• Centera has Single Instance Storage built into its core.
• Enterprise Vault retention management capabilities have to be integrated with the Centera capabilities.
• Centera has a replication mechanism that is used for data integrity and can reduce the need for a separate backup.

When you create a new Vault Store Partition, you can define this partition as a Centera partition, and Enterprise Vault will act accordingly, taking into account the points above. It is also important to note that only archived items (Enterprise Vault Savesets) are committed to Centera, and not indexing information. Centera is not suitable for storing the index files because they contain dynamic and constantly changing data.
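The content-addressed model, a blob stored once per unique content hash, with a clip holding metadata plus a pointer to the shared blob, can be sketched as follows. The class, hash choice, and clip-ID derivation are illustrative, not the Centera API:

```python
import hashlib

class CasStore:
    """Sketch of a content-addressed store: one blob per unique content
    hash; each archive operation creates a clip holding metadata and a
    reference to the shared blob. The clip ID is what an application
    like Enterprise Vault would keep in its own database."""

    def __init__(self):
        self.blobs = {}    # content hash -> data (shared, stored once)
        self.clips = {}    # clip id -> (metadata, blob hash)

    def write(self, data: bytes, metadata: dict) -> str:
        blob_id = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(blob_id, data)   # hardware SIS: reuse existing blob
        clip_id = hashlib.sha256(
            (blob_id + repr(sorted(metadata.items()))).encode()).hexdigest()
        self.clips[clip_id] = (metadata, blob_id)
        return clip_id

    def read(self, clip_id: str) -> bytes:
        _meta, blob_id = self.clips[clip_id]
        return self.blobs[blob_id]
```

Archiving the same content twice with different per-user metadata yields two clips but only one blob, which is the single-instancing behavior described above.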

Read-only storage and retention management in Centera

Centera is unusual in that it is a read-only, disk-based storage system. Since Enterprise Vault content is already read-only, the integration is very straightforward. Retention management undertaken by Enterprise Vault is extended to Centera. Centera has a retention policy mechanism that is very similar to that of Enterprise Vault. Enterprise Vault retention categories are mapped to Centera "retention classes," and the Centera retention class name is stored with the item. The retention periods specified in Centera retention classes can subsequently be extended in much the same way as with Enterprise Vault retention categories.

Single Instance Storage and collections with Centera

The retention and read-only characteristics of Centera are really extensions to features that already exist in Enterprise Vault. While Single Instance Storage already exists in the vault, it has to be treated differently with Centera. In addition, the way items are "collected" also changes when dealing with Centera.


First, we need to discuss how applications like Enterprise Vault pass content to Centera, and why this is different from NTFS for example. When an item is passed to Centera, a hash is generated that is unique to that piece of content, and two files are stored on Centera: a Clip file, which primarily contains metadata about the item, and a “blob” file that contains the main data The unique ID (or Clip ID) is the reference to the stored item that Enterprise Vault will retain in its Vault Store directory. When a file is written to Centera, if the hash that was generated already exists (meaning that the item has already been archived), the new Clip file for that item will contain a pointer to the existing “blob.” If it is the first time that this piece of content has been archived by Centera, a new blob is created, and the Clip file points to it The details of the actual items that are stored in Centera (header, body, and attachments) are then referenced in the original Clip file. Enterprise Vault will read the Clip content to see how to recover the original content. Along with all the information stored in the Clip are checksums generated by Enterprise Vault. These checksums help Enterprise Vault guarantee that the content committed to Centera is valid when recalled by cross-checking with the Clip ID held in the vault’s directory. After much experience with Centera, Enterprise Vault uses a variety of methods for storing items to achieve the best balance of performance and store occupation 1. Small items are stored in the Clip file directly. Items that are smaller than 15 KB can be stored in the Clip file without any associated blob file. For small files such as this, the disadvantage of losing the potential for SIS is offset by the performance advantage of having just one file to read (a file in a database) and the space saving achieved by avoiding “rounding up” due to the store allocation unit size. 
In fact, doing SIS for such small files would cost more space than not doing it. 2. Larger items can be stored individually or in collections (see 3 below). Where these items are messages with attachments that exceed a defined threshold size, those attachments are stored separately. This leads to a different SIS model than for NTFS messages, as attachments will be shared between different messages (assuming the attachment is identical) as well as between multiple copies of the same message. In addition, if the same file is archived from a file system or MS SharePoint, then it will again achieve SIS with the attachment version. As with NTFS SIS, user-specific attributes are all retained separately in the Clip files and only the main blob will be shared. 3. Similarly, collections are done differently than in NTFS. In this case, messages are gathered into collections, but again, attachments over a certain size are stored separately. Collection size is limited by the number of messages contained in it or the total size, whichever limit is arrived at first. Collections are also organized so that only items with the same retention policy are stored in the same collection. This optimizes store occupancy and minimizes the number of discrete files stored, with a consequent performance advantage for replication and recovery. The compromise is that the Clips for collections need to be read and traversed in order to retrieve messages, but the items themselves are retrieved from the collection by a partial read, so the effect is minimal. To the user this is transparent behavior, and collections will not be deleted until all the contained items have expired; but in practice, since all items have the same retention period and are gathered in a short time frame, this is not a problem. 4. 
In a similar way to NTFS based partitions being configured to only allow deletion of the original item once a complete and sucesful backup has taken place, Enterprise Vault can be configured to wait for Centera replication to take place before deletion is done. In this case, the “watcher” process only initiates the delete (or replacement by shortcut) when it detects that the replica


exists on the secondary Centera store. An important point to note is that on a Centera, the domain of SIS is the whole Centera system, regardless of how Vault Stores are mapped to it. Figure 11 shows the decision flow that determines whether Enterprise Vault stores items on Centera in collection or "non-collection" mode.

Figure 11 – Flow diagram for storing items in Centera
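The collection rules described above (a per-retention-policy collection that is closed when either an item-count or a total-size limit is reached first) might be sketched as follows. This is an illustrative model only; the function name, limits, and item representation are assumptions, not Enterprise Vault's actual implementation:

```python
from collections import defaultdict

def pack_into_collections(items, max_items=3, max_bytes=1000):
    """Group archived items into collections, closing a collection when
    either the item-count or the total-size limit is reached, whichever
    comes first. Items with different retention policies never share a
    collection, so a whole collection expires at the same time."""
    open_colls = defaultdict(list)   # retention policy -> current open collection
    closed = []
    for item in items:
        coll = open_colls[item["retention"]]
        coll.append(item)
        if (len(coll) >= max_items
                or sum(i["size"] for i in coll) >= max_bytes):
            closed.append(list(coll))  # close this collection and start a new one
            coll.clear()
    # any partially filled collections are closed at the end of the run
    closed.extend(list(c) for c in open_colls.values() if c)
    return closed
```

Keeping each collection homogeneous by retention policy is what allows the whole collection file to be deleted in one operation once its items expire, rather than being rewritten piecemeal.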

Centera summary

Many of Centera's features are also features of Enterprise Vault, but the combination of hardware and software capabilities makes for a powerful offering to customers. The major area of development for Enterprise Vault has been the Centera collections service, which takes advantage of the hardware SIS offered by a CAS system, but does so in a way that maximizes the throughput of Centera.
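The "watcher" behavior described in this section, where the original item is deleted (or replaced by a shortcut) only after the replica is confirmed on the secondary Centera store, could be modeled as a simple polling loop. The names, the polling model, and the callback signatures are all illustrative assumptions:

```python
import time

def watcher_safety_copy(pending, replica_exists, delete_original, interval=60):
    """Hypothetical sketch of the safety-copy 'watcher': originals are only
    removed once a replica is confirmed on the secondary store. Items whose
    replicas have not yet appeared are re-checked on the next polling pass."""
    while pending:
        for item in list(pending):          # iterate over a copy while mutating
            if replica_exists(item):        # query the secondary Centera store
                delete_original(item)       # safe: a replica now exists
                pending.remove(item)
        if pending:
            time.sleep(interval)            # wait before polling again
```

The key property, as in Enterprise Vault's real safety-copy mechanism, is that deletion is driven by confirmed replication rather than by a timer, so no archived item exists in only one place.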


Other supported storage systems

Enterprise Vault works with a wide variety of storage systems, and the list of supported platforms is updated regularly. Please consult the full Certification and Compatibility Matrix on the Symantec support website.

Summary of the Open Storage Layer of Enterprise Vault 2007

This paper has shown that there are unique considerations when deciding on a strategy for an archiving application. To the business, there are many benefits in ensuring that all content is stored (and expired) safely and reliably and that it can be accessed rapidly when needed. At the core of all of these activities is the actual storage being used and how effectively it enforces the core requirements of the archiving application. We have seen how the Open Storage Layer in Enterprise Vault helps:

• Storage tiering: Enterprise Vault inherently moves less active data out of primary applications such as Microsoft Exchange and, as described earlier, can optionally migrate older archived data to a secondary tier of archive storage, whether disk, tape, or other media.

• Storage rationalization: The minimum amount of content is actually stored on the storage system. Every item in the archive is compressed and Single Instanced, and this is done without compromising data reliability.

• Backup optimization: Because Vault Store Partitions can be closed and DVS files are read-only, the archive is very efficient to back up and recover. Enterprise Vault avoids many of the problems of primary applications that store content in massive data files or on relatively expensive storage solutions: it stores items as individual files where appropriate and combines them into larger collections, managed by a centralized policy.

• Data integrity: As described earlier, the "safety copy" functionality can optionally ensure that no archived data is lost, by retaining the original item until a backup or replica has been created.

• Data resilience: From the DVS file all the way through to the Vault Store, each of the storage areas is self-contained, meaning that it can be recovered without the rest of the archive or any of the supporting databases. As described, HTML copies "future-proof" archived data.
• Data fidelity: As mentioned, we have never sacrificed data fidelity in the design of Enterprise Vault, ensuring that no data or metadata—including per-user items such as the “read-receipt flag”—is lost during the archiving process.