Online Hierarchical Storage Manager - LinuxsecretsRohit Vashist [email protected] Rohit K...

14
Online Hierarchical Storage Manager Sandeep K Sinha NetApp, India [email protected] Rishi B Agrawal Symantec, India [email protected] Vineet Agarwal [email protected] Rohit Vashist [email protected] Rohit K Sharma [email protected] Sneha Hendre [email protected] Abstract Intel, Sandisk, and Samsung are investing billions of dollars into Solid State Drive technology and manufac- turing capacity. Unfortunately due to the extreme cost of building the manufacturing facilities, SSD manufac- turing capacity is not likely to exceed HDD manufac- turing capability in the near future. Most data centre applications heavily lean toward database applications which use random read/write disk activity. For random read/write activity, the performance of SSDs is 10x to 100x that of a single rotational disk. Unfortunately, the cost is also 10x to 100x that of a single rotational disk. Due to the limited manufacturing capability of SSD, most applications are going to remain on rotational disk for the foreseeable future. Online Hierarchical Storage Manager has been developed to allow SSD and tradi- tional HDD (including RAID) to be seamlessly merged into a single operational environment thus leveraging SSD while using only a modest amount of SSD capacity. In an OHSM enabled environment, data is migrated to and from the high performing SSD storage to traditional storage based on various user defined policies. Thus, if widely deployed, OHSM has the ability to improve computer performance in a significant way without a commiserate increase in cost. OHSM being developed as open source software also abolishes the licensing is- sues and the costs involved in using storage solution software. OHSM being “online” signifies the complete abolishment of the downtime and any changes to the ex- isting namespace. 1 Introduction Hierarchical Storage Management is a data management technique that uses devices in an economically efficient manner, thus reducing the storage space and administra- tive costs associated with managing data. OHSM is an online hierarchical storage manager for Linux which offers policy based transparent movement of data from one class of storage to another. Being the first attempt towards an open source data manager, it provides a base platform for all further developments in similar areas. It supports policy based migration of files i.e. it defines a set of policies which decide the correct placement tier for a file during its initial creation, as well as block allocation and relocation of the file from one placement tier to another. A placement tier is basically a storage class, which consist of a collection of storage devices with similar properties defined in the policy file based on its speed, cost or any other attribute. These placement tiers can be priority-ordered and can be over- lapping as well. These policies are enforced on a OHSM enabled file system through an XML based placement policy file. A placement policy file contains a collection of rules which decides both, the initial file location and the circumstances under which existing files are relo- cated. Therefore, placement policies have been broadly categorized into placement and relocation policies. Whenever a file is created it will be allocated in accor- dance to the placement policy which has been enforced on the file system. If the file fails to match any of the rules specified in the policy file, it falls into the default allocation method that is used by the underlying file sys- tem. Similarly for relocation, whenever a file matches any of the relocation policy, it is relocated from the 263

Transcript of Online Hierarchical Storage Manager - LinuxsecretsRohit Vashist [email protected] Rohit K...

  • Online Hierarchical Storage Manager

    Sandeep K SinhaNetApp, India

    [email protected]

    Rishi B AgrawalSymantec, India

    [email protected]

    Vineet [email protected]

    Rohit [email protected]

    Rohit K [email protected]

    Sneha [email protected]

    Abstract

    Intel, Sandisk, and Samsung are investing billions ofdollars into Solid State Drive technology and manufac-turing capacity. Unfortunately due to the extreme costof building the manufacturing facilities, SSD manufac-turing capacity is not likely to exceed HDD manufac-turing capability in the near future. Most data centreapplications heavily lean toward database applicationswhich use random read/write disk activity. For randomread/write activity, the performance of SSDs is 10x to100x that of a single rotational disk. Unfortunately, thecost is also 10x to 100x that of a single rotational disk.Due to the limited manufacturing capability of SSD,most applications are going to remain on rotational diskfor the foreseeable future. Online Hierarchical StorageManager has been developed to allow SSD and tradi-tional HDD (including RAID) to be seamlessly mergedinto a single operational environment thus leveragingSSD while using only a modest amount of SSD capacity.

    In an OHSM enabled environment, data is migrated toand from the high performing SSD storage to traditionalstorage based on various user defined policies. Thus,if widely deployed, OHSM has the ability to improvecomputer performance in a significant way without acommiserate increase in cost. OHSM being developedas open source software also abolishes the licensing is-sues and the costs involved in using storage solutionsoftware. OHSM being “online” signifies the completeabolishment of the downtime and any changes to the ex-isting namespace.

    1 Introduction

    Hierarchical Storage Management is a data managementtechnique that uses devices in an economically efficientmanner, thus reducing the storage space and administra-tive costs associated with managing data.

    OHSM is an online hierarchical storage manager forLinux which offers policy based transparent movementof data from one class of storage to another. Being thefirst attempt towards an open source data manager, itprovides a base platform for all further developments insimilar areas. It supports policy based migration of filesi.e. it defines a set of policies which decide the correctplacement tier for a file during its initial creation, as wellas block allocation and relocation of the file from oneplacement tier to another. A placement tier is basicallya storage class, which consist of a collection of storagedevices with similar properties defined in the policy filebased on its speed, cost or any other attribute. Theseplacement tiers can be priority-ordered and can be over-lapping as well. These policies are enforced on a OHSMenabled file system through an XML based placementpolicy file. A placement policy file contains a collectionof rules which decides both, the initial file location andthe circumstances under which existing files are relo-cated. Therefore, placement policies have been broadlycategorized into placement and relocation policies.

    Whenever a file is created it will be allocated in accor-dance to the placement policy which has been enforcedon the file system. If the file fails to match any of therules specified in the policy file, it falls into the defaultallocation method that is used by the underlying file sys-tem. Similarly for relocation, whenever a file matchesany of the relocation policy, it is relocated from the

    • 263 •

  • 264 • Online Hierarchical Storage Manager

    source tier to destination tier as specified in the reloca-tion policy file. The migration of data is non disruptiveand completely transparent to the placement tiers. Theplacement policy file also contains the mapping infor-mation between the storage devices and the respectiveplacement tiers to which they belong. OHSM does notimpose constraints on the placement tiers as far as ca-pacity, performance and availability are concerned.

    OHSM also provides functionality to remove files oncertain events which can be specified through the policyfile. Defragmentation of the relocating files could alsobe achieved by enabling defragmentation at the time oftriggering relocation. Though OHSM doesn’t guaranteecomplete defragmentation of data blocks, it does a best-effort attempt. Relocation is an event triggered opera-tion based on single or multiple policies selected fromthe set of relocation policies aiming to provide greaterflexibility and usability to system administrators.

    2 Design

    The idea here is to leverage the underlying device topol-ogy of the block-device logical volume and use this in-formation to optimize the block allocation methodolo-gies used by the file system, thus achieving storage effi-ciency. OHSM provides a framework to implement hi-erarchy based storage in the existing environment us-ing support from the device mapper. This requires somemodifications to the file system’s file creation and blockallocation routines.

    The overall design of the system is composed of a groupof inter-operating modules implemented as shared li-braries, daemons and kernel modules. OHSM has var-ious components including the user interface, an XMLparser, OHSM Admin and the Kernel Driver with eachof them offering different functionality to support thevarious services offered by OHSM.

    2.1 User interface

    Another key component of the OHSM system from anadministrator’s perspective is the administrative inter-face. It is comprised of both a graphical as well as acommand line interface. Apart from the basic function-ality like enabling, disabling and querying the state ofthe OHSM system, it also provides the administrator anopportunity to generate, modify and validate the XML

    policy file through a graphical interface. The graphicalinterface also provides various other statistical data in-cluding state of tiers, space utilization, number of filesrelocated and various other information related to theeach placement tier. Apart from these facilities both theinterfaces have a lot of services to offer such as gettingand setting various OHSM runtime tuneable features,enforcing of placement and relocation policies on filesystems and many more. In all, both the GUI and CLIoffer a simple and easy to use interface to the adminis-trators.

    2.2 XML Parser

    OHSM is a policy based hierarchical storage managerand it uses an XML based policy file for defining all theplacement and relocation policies. The XML parser isresponsible not just for parsing the administrator definedXML policy file but also for validating the informationprovided. The XML parser takes the policy files as inputand further parses it to extract various information fromit like the tier-device mapping, placement and reloca-tion policies. Then it validates that information againstthe device topology map and checks for any conflict-ing policies. In case of any errors or conflicts it eitherreports back or else it transforms the parsed policy in-formation into relevant data structures and passes it tothe OHSM Admin module for further processing.

    2.3 OHSM Administrator

    This is the central communication hub which differen-tiates and communicates with all other components ofthe OHSM system. It also helps keep the design mod-ular and simpler. It is also responsible for all the re-quired communications between the user space and ker-nel driver to facilitate all the requests through the userinterface. Most of the error handling is done by theOHSM Admin. All the communication with the user-space device mapper library to get the device topologymapping also goes through the OHSM Admin. The er-rors received from the parser and the device mapperlibrary are processed and converted into user readablestrings and passed to the user interface.

    2.4 Device Mapper API

    OHSM uses the user space device mapper library in or-der to extract the device topology beneath the logical

  • 2009 Linux Symposium • 265

    Figure 1: OHSM Architecture

    volume on which the file system is already mounted.This helps the file system to optimize its data block al-location to provide better storage efficiency. The devicetopology along with the tier-device mapping informa-tion helps OHSM build its complete internal mappingdata structures. The final mapping information residesin OHSM kernel driver. The library also offers certainother callbacks providing data regarding the underly-ing devices which can be used in later implementationsof OHSM. This will help provide a more administratorfriendly solution. This information can also be utilizedto derive certain heuristics and statistical information re-garding the placement tiers.

    2.5 OHSM Kernel Driver

    OHSM kernel driver is the most important componentof the OHSM system. This is the core of the systemand helps service all the requests from the user space. Itstores various metadata and service routines associatedwith OHSM. It also holds the tier-device mapping table,placement and relocation policies associated with the

    file systems. The file system scanner and the completemechanism of relocation have also been implemented asa part of the kernel driver. Apart from these, it also im-plements the various ioctl service routines. The kerneldriver is loaded in the memory during boot time, so thatthe required information and functionality is available tothe file system soon after mounting. Any problem withthe kernel driver can lead to a complete freeze of theworking OHSM. Hence special care has been taken tohandle most of the error conditions gracefully. The ker-nel driver is the core of OHSM. However, it is the filesystem implementation that gives it its power. Keep-ing the two independent allows the file system changesto be minimally invasive and not require major OHSMspecific patches to address inode updates and preferreddata block allocation needs. The driver comes into pic-ture once the administrator triggers relocation on the filesystem.

  • 266 • Online Hierarchical Storage Manager

    2.6 File System

    OHSM system can work with most of the GNU/Linuxfile systems with some basic modifications to the waya file is created and extended. OHSM broadly dividesthose changes into two sub categories:

    2.6.1 Inode Updater

    As the name signifies, the changes revolve around theinode allocation mechanism for a file system. In anOHSM enabled environment, it is expected that the ini-tial file creation be governed through some file creationpolicy. OHSM changes the way the normal file cre-ation works and imposes an additional check of the file’sphysical characteristics against the placement policiesspecified. In case of any match, the home tier id is setfor that file, eventually directing the file system to serveall the data blocks requests for that file from the pool ofblocks belonging to its home tier.

    2.6.2 Preferred Block Allocator

    In an OHSM enabled environment, all the data blockrequests for a file are restricted to its home tier. Thismight lead to a situation wherein there is no free spaceleft on the specified tier. In order to overcome such cir-cumstances, OHSM offers an option for administratorsto provide the tiers in a prioritized order, with the mostdesirable destination listed first. This change overridesthe normal block allocation policy of the file system andmakes sure that it follows the policies specified by theadministrator, if any. Those files that don’t qualify toany of the file placement criteria follow the usual se-mantics of file creation offered by the file system. Also,all further data block requests of such files are servicedfrom blocks spanning over the complete file system.

    3 OHSM Value Proposition

    The most broadly applicable benefit of the Online Hier-archical Storage Manager is to reduce the average on-line storage cost by migrating inactive files to a less-expensive placement tier in a hierarchical based storageenvironment. It should be assumed that the lower place-ment tiers have a significantly lower per-byte cost thanstorage in next higher placement tier. In most of the

    Figure 2: Value Proposition

    cases, the cost differential between different types of on-line storage creates the economic justification for suchhierarchical storage environment. If the highest place-ment tier storage costs around $5 per gigabyte and midrange placement tier storage costs around $2 per giga-byte, an enterprise whose online data is 50% inactivecould save around 30% of its storage acquisition cost bymoving the inactive files to mid range placement tiers.Larger percentages of inactive files result in higher sav-ings.

    For enterprises that keep a significant amount of non-critical data online, a multi-tier storage strategy can of-fer substantial cost savings without adverse effects onbusiness operations. The challenge in attaining the ben-efits of multi-tier storage is to get the right files on theright storage tier at the right time. OHSM achieves itvery efficiently with the help of policy based allocationand relocation. The purpose of OHSM is to eliminateany administrative cost and complexity in a hierarchi-cal storage environment by automating the relocation offiles as levels of I/O activity against them rise and fall,as well as when their sizes, owners, or logical positionsin the file system hierarchy change.

    4 Working with OHSM

    Information Lifecycle Management provides effectivemanagement of information throughout its useful lifecy-cle. ILM also provides strategies to allow a computingdevice to administer the storage systems. These strate-gies consist of policies which differentiate and admin-ister data based on its usage and priority. In order towork with the Online Hierarchical Storage Manager theadministrator needs to be quite aware of the various file

  • 2009 Linux Symposium • 267

    placement and relocation policies supported by OHSM.The administrator needs to put together all this infor-mation along with the tier device mapping informationinto a XML file. This file is passed to OHSM to enableand enforce these policies on a file system. It should notbe forgotten that OHSM also offers a graphical user in-terface to generate the XML policy file. The user caneither use the graphical interface or the command lineinterface in order to enable OHSM. Below we examinevarious file placement and relocation policies with theirrespective sample XML policy grammar as supportedby OHSM.

    4.1 Tier Device Map

    The tier device mapping information is required in or-der to define the set of devices that belong to a partic-ular placement tier. Both the file placement and relo-cation policies are validated against the tier device mapinformation specified in this section. All the informa-tion provided in the tier-device map section is also val-idated against the configuration of the system. Someof the validations involve checks for making sure thatthe specified devices exist on the system, all the devicesare part of the same logical volume over which the filesystem is mounted, etc. A simple tier device mappinginformation:

    36

    1

    /dev/md4/dev/md5

    2/dev/md3

    3/dev/md1/dev/md2/dev/md6

    4.2 File Placement

    Transparent and non disruptive relocation of data acrossvarious placement tiers is undoubtedly the most obvioususe of the OHSM but it is not restricted to that. OHSM

    also offers various functionality to give targeted files apreferential placement at the time of file creation. Wehave seen an enterprise using database management sys-tem to manage their most critical data and provide thatcritical data a preference over all other data. OHSM cur-rently supports four file placement policies which can beused individually or can be combined together to formvarious new rules. If a file qualifies for multiple place-ment policies then the first match prevails over all oth-ers.

    For special situations where the target placement tiermight run out of free blocks, the administrators havethe facility to provide the preferential order of tiers foreach such rule. This directs the file system to allocateblocks from a lower tier in case the target placementtier is already full. This feature is optional and can beenabled/disabled as per administrator’s discretion. Cur-rently OHSM supports a very primitive set of file place-ment criterion including file type, user ID, group ID andthe logical placement of files in the file system hierar-chy, directory name. Allocation policy based on direc-tory name can be both recursive and non-recursive. Thefile placement policy can be based on the following:

    4.2.1 File Type (FTYP)

    In today’s time, there is a strong likelihood that applica-tions follow a pattern in the file name extension to de-termine the kind of data that the file holds. This patterncan be utilized to differentiate between different typesof files like database files, media files and log files etc.Using the file extension we can also associate the filewith different applications most of the time and deriveits criticality based upon that. This information can beused to provide preferential placement to various typeof files. We can also dedicate placement tiers to specificfile types based on its type.

    4.2.2 User ID (UID)

    In a large server environment the file systems are mostlyorganized based more on the users rather than the ap-plications. Consider the case of an enterprise, wheredifferent users have their home directory on the sameshared file system. In such situations there can alwaysbe reasons to allot higher placement tiers to varioususers while restricting others to have a mid range or alower one.

  • 268 • Online Hierarchical Storage Manager

    4.2.3 Group ID (GID)

    Similar to file placements based on users, various groupsin an organization can share a placement tier. This canbe based on the criticality of data they operate on and attimes the speed of data retrieval. Consider the case ofan engineering team and a marketing group; it may bedesirable to have the engineering data on a more reliableplacement tier as compared to the marketing team. Theaccounting group can have opted for a separate place-ment tier for various other reasons.

    4.2.4 Directory Name (DIR)

    Often we create directories based on the current timeor date. This helps us in keeping the data in a morestructured manner. For instance, if someone keeps itsreports for the last couple of years, there will be a num-ber of directories present on the file system, for example,report-2007, report-2008 and report-2009 and so on. Itis expected that the latest reports will be the most fre-quently accessed one. Though it is just an assumption,this can be used to provide placement tiers to variousstructured data classified on the basis of their age. Fileplacement based on directory names can be recursiveand non-recursive depending on the specification in thepolicy file. A simple file placement policy:

    1211

    0 1

    0 1501 3

    ora 1

    /foo 1 1

    4.3 File Relocation

    One of the most desirable things to have is to store in-active files on placement tiers of lesser quality so that itdoes not affect the applications adversely. If you lookat it from the I/O performance perspective, if the fileis accessed rarely and it is mostly inactive, the perfor-mance of the storage device underneath that holds it isirrelevant. This makes the ability to relocate files acrossplacement tiers very critical and important. Also, thereare situations where you have thousands of small files ina file system. It is seen that under such circumstancesmost of these files soon become inactive. Some of thescenarios are a document management system, a mailserver or any database application using opaque data ob-jects stored as small files. It would be highly desirableto have the ability to relocate data across placement tiersunder such circumstances. OHSM currently has supportfor the following file relocation policy criteria.

    4.3.1 File Access Age (FAA)

    It signifies the time since the last access to the file whichcan be one of the most appropriate qualifiers for a down-ward relocation. Based on the time of last access to thefile, it could easily be relocated to a lower placement tier.This can be useful in a search engines or mail server en-vironment, where the files access rates go down as thetime increases. This would not be a good candidate as aqualifier for relocation from lower to higher placementtier as this can be misleading at times. There can be fileswhich are just accessed to know that it’s not of use andthe data is stale. Still, because of the file’s access agebeing quite small, it can get relocated to a higher tier,which would be highly undesirable. The recent intro-duction to realtime in the kernel really changes how thelast access time is managed in a significant way. Anduse of such a feature might eliminate most of the valueof FAA.

    4.3.2 File Modification Age (FMA)

    This is a true qualifier for a relocation to happen froma lower to a higher placement tier. It can be fairly as-sumed that a file which has recently been modified orwhich has a smaller modification age would surely beaccessed more frequently in the near future. Hence, this

  • 2009 Linux Symposium • 269

    can be used for deciding upon the conditions for reloca-tion from lower to higher placement tiers. Most of thestub based implementations for HSM also use modifica-tion age as one of the primary qualifier to bring back thedata from their archival storage to their actual placementtier.

    4.3.3 File Size (FSZ)

    There may be various situations where it would be desir-able to allow a certain size of file to reside on a specificplacement tier. The reason is the limited size constraintsof the higher placement tiers due to their higher costs.We move the file to a lower placement tier if the filesize exceeds a specific threshold. This threshold can bebased upon the amount of space that a higher placementtier has. So, the file size qualifier can be easily used toprevent situations where the higher placement tiers don’trun out of space.

    4.3.4 File I/O Temp (FIOT)

    File I/O temperature is defined as the average number ofbytes transferred to or from a file over a period of time.This is independent of the file size and is one of the morepowerful qualifiers which can be used to automate theprocess of relocation.

    4.3.5 File Access Temp (FAT)

    File access temperature is defined as the ratio of thenumber of times the file has been accessed over a pe-riod of time. This helps us to determine the average I/Oactivity that is taking place on a file against all otherfiles in the file system. Such a measure can be useful tofind suitable candidates for relocation from both lowerto higher placement tier and vice versa.

    4.3.6 FTYP, UID and GID

    Relocation policies can also be based on the file type,user ID and group ID qualifiers. Since, the initial allo-cation can be based on these qualifiers, there is a greatchance that when these are combined with other reloca-tion qualifiers, form a finer granularity relocation crite-rion. A simple file relocation policy:

    1

    1

    12

    50 LT50 LT

    5 Prototype Implementation for ext2/ext3

    The prototype of OHSM involves basic implementationof the idea presented in the previous sections. The gen-eral concept of OHSM involves various modules andtheir relationship with the file system. OHSM consistsof roughly four components, namely the User Interface,OHSM Admin, Kernel Driver and File system. Our pro-totype provides functionality for creating policy files,enabling and disabling of OHSM, and triggering relo-cation manually. The User interface provides a set ofcommands to control and monitor the various function-ality offered. It also allows the user to create XMLbased policy files and logical volumes at the same time.In the prototype implementation the user is required tocreate separate policy files for allocation, relocation andtier device mapping. These policy files were requiredto be specified at the time of enabling OHSM on a filesystem. Before OHSM could be enabled, these policyfiles are required to be parsed and validated for any con-flicts. After verifying the policy files, the information isstored in internal data structures. The OHSM Adminis-trator uses the ioctl interface provided by OHSM kerneldriver to control and administer the system. On receiptof the data structures the kernel driver replicates thesedata structures in the kernel, and acknowledges back tothe administrator module success or errors if any. Onsuccess, EXT3_OHSM_ENABLE flag is set inside thefile system’s super block. When OHSM is disabled allthe data structures are cleared and the flag is reset. In or-der to achieve this, minor changes were made to structext3_inode and a new flag was introduced to be usedwithin struct ext3_super_block.

  • 270 • Online Hierarchical Storage Manager

    struct ext3_inode {......__u8 ohsm_home_tid;__u8 ohsm_dest_tid;};

    /** Misc. file system flags

    */

    #define EXT3_OHSM_ENABLE 0x0008 /* OHSM enabled */

    When a file is created it is intercepted by the OHSM in-ode updater and an additional check is made against theallocation policy enforced on the file system. In casea file qualifies, its ohsm_home_tid is set to the corre-sponding tier id. Otherwise, it remains zero. Later,for any data block requests for files having a non-zeroohsm_home_tid, the call to the block allocation rou-tine is diverted to OHSM’s block allocation routine, ifOHSM is enabled on the file system. This implementa-tion requires variations in the existing file system struc-tures and its block allocation strategy also known asranged block allocation. Ranged block allocation im-proves proficiency of file system in restricting alloca-tion of data blocks to a range of block groups. A ta-ble containing the map of tier against the block groupranges is maintained by OHSM kernel driver. Rangedblock allocation also uses this information to allocatenew data blocks for the file in a specific tier. This blockgroup range table is used by the block allocation rou-tine to identify the block group ranges of device hierar-chy. This table is created at the time of enabling OHSMand remains active in memory until the file system re-mains mounted or OHSM stays enabled. At the time ofunmounting or disabling of OHSM this information isdumped on disk to /etc/ohsm. This information is usedlater to reconstruct this table back when the file systemis mounted back or OHSM is re-enabled on a file sys-tem.

    In user space, the OHSM Admin uses libdevmapper toget the device topology and passes the extents of de-vices to the kernel driver. The driver later maps theseextents to file system specific block group ranges. Theohsm_home_tid field of inode is used as an index in thistable to get the specific block group range. The alloca-tion routine bounds the data block allocation within theselected range. Figure 3 illustrates two scenarios. Theleft half of it illustrates the scenario when a file is cre-ated. It shows that initially the ohsm_home_tid is setto zero, which later gets updated by the inode updater

    where it is qualified against the various file placementpolicies. If qualified, the files ohsm_home_tid is up-dated accordingly with the specific tier id. On the rightside, it shows the later scenario where there is block al-location request for a file. The block allocator checksfor a non-zero ohsm_home_tid and extracts the rele-vant block group ranges for the same. For a file havingohsm_home_tid equal to zero, the block group rangespans the complete file system. The call to the blockallocation routine is diverted to Range block allocationin place of file systems normal block allocation routine,which eventually serves the purpose. In case the tier isfull, the file’s ohsm_home_tid is set to zero and the filesystem’s normal block allocation routine is invoked.

    Relocation currently is a triggered event and has to bestarted manually by the administrator. When relocationis triggered, OHSM kernel driver scans all the inodesin the file system and pushes each qualifying inode tothe work queue for relocation. Prior to adding each in-ode to the work queue, the ohsm_dest_tid is set to therelevant tier for that inode. The workqueue handler rou-tine is implemented in the OHSM kernel driver whichpicks and does the task of relocation. After the reloca-tion is completed, the ohsm_dest_tid becomes the newohsm_home_tid for the file. As an optimization, OHSMuses a Tricky copy and swap algorithm to complete re-location as fast as possible.

    Tricky copy and swap algorithm starts by allocating anew ghost inode and reading the source inode in mem-ory. It then takes a lock on the source inode to stop anyfurther modifications to the inode in the course of re-location. Later, it reads the data blocks for the sourceinode and copies them to the destination inode’s blocks,block by block. The reading of source block data is donethrough block buffers and they are then copied to desti-nation buffer. The destination buffer is marked dirty. Fi-nally, when all the data is copied to the ghost inode, thesource inode is re-assigned with the contents of destina-tion inode’s data blocks by swapping them with that ofthe ghost inode. The source inode now contains point-ers to new data blocks. At this point, the source inodeis unlocked, synced and destination inode is released.OHSM Administrator is acknowledged of the comple-tion of these event. See Figure 4 which illustrates theprocess of relocation.

  • 2009 Linux Symposium • 271

    Figure 3: File placement and Block allocation mechanism

    6 Issues and Concerns

    During the course of OHSM’s prototype implementa-tion which was done primarily for ext2/ext3 file system,the process revealed several issues. Some of the majorones include the following:

    6.1 struct ext3_super_block

    Currently both the tier-block group range map and theallocation policies reside in OHSM kernel drivers. Thisenforces a dependency of the file system on OHSM ker-nel driver. We ensure to handle this situation currentlyby starting the OHSM services before any local file sys-tem is mounted. It required some modification to thesystem startup script. Since, this information is per filesystem this information should ideally reside in the su-per block of the file system. OHSM still struggles tofind an easy way out of this. Since, this table can behuge and number of policies can be quite high, keepingsuch information in the super block is undesirable.

    6.2 struct ext3_inode

    Allocation and relocation are the key components ofOHSM and as the object on which OHSM operates is

    a file, it is very tightly coupled with the inode structure.So, it was required to make on-disk changes in the structext3_inode in order to support allocation and relocation.We added "ohsm_home_tid" which stores the tier IDassigned upon allocation of a file and "ohsm_dest_tid"which stores the tier ID assigned during relocation of afile. Furthermore, to support different criteria of relo-cation like File Access Temp (FAT) and File Input Out-put Temp (FIOT) respective fields are to be added tothe inode structure of the file system. These changesmake the compact inode structure slightly bulky. Thesechanges might also disturb any existing file system par-titions present on the current system. OHSM plans touse extended attributes in order to avoid this problemand currently lacks a concrete way to handle this.

    6.3 Exporting internal functions

    The complete working of Online Hierarchical StorageManager requires support from the file system on whichit operates. In order to provide support to OHSM a fewstatic functions residing inside the file system are re-quired to be exported. This compromises the integrity ofthe file system code. For recent file systems like ext4 therequired functionality (EXT4_IOC_MOVE_VICTIM)is soon going to be present which will help OHSM tonot violate such integrity issues in the future. We are

  • 272 • Online Hierarchical Storage Manager

    Figure 4: File relocation mechanism

    trying to make OHSM completely functional with theext4 file system during the writing of this paper. Weare also monitoring and reviewing the implementationof the patches from Akira Fujita for restricted block al-location.

    6.4 Crash during relocation

    Currently OHSM uses a temporary inode in order to re-locate an inode’s data blocks from one tier to another. Incase, if the system crashes during this relocation, therecan be a chance of data loss. Currently we only releasethe original blocks after the complete relocation is com-pleted. This reserves some space for the time duringwhich relocation is going on. OHSM still needs a bet-ter way to handle this. Journaling may be required toovercome this problem.

    6.5 Inode lock contention

    During the process of relocation the inode is locked un-til all the data blocks attached with the inode are suc-cessfully relocated to a new destination tier. The timetaken during relocation is file size dependent. This timewould be more for large files, so any I/O pending on itwill have to wait for that period of time and the requestsmay even time out. We are in the process of dividing thiswhole relocation into chunks of 64K in order to reducethis lock contention period.

    7 One step Further

    OHSM looks forward to a list of enhancements in-orderto make it complete and stable. Here are a couple of

    them to start with:

    7.1 User space implementation

    The most desirable aspect with OHSM is to move asmuch as possible of its implementation into user space.This helps us remove the kernel components and otherdependencies of the file system. It will also help main-tain the source code integrity for the file system. Suchan implementation would surely require a lot of supportfrom the file system. Going ahead with an ioctl basedinterface would be one of the best options. Eventuallythe file system would need to support the OHSM basedfile creation and ranged block allocation. To achievethis without breaking the integrity and consistency ofthe code is a big challenge.

    7.2 Automatic Relocation Engine

    Currently relocation is a triggered event which in mostof the server based environments will not be a pleasantexperience for the administrators. Going a step furtherand designing an automatic relocation engine would beone of the most fascinating features OHSM can offer.The most important challenge in designing such an en-gine would be to derive the heuristics which would drivesuch an engine. A very frequent invocation to reloca-tion can damage the file systems performance to a greatextent. Also, a long interval between relocations canaffect the storage efficiency adversely. So, we need anintelligent mechanism which could be based on the I/Oand activities on the file system. Using FAA and FMAas criteria can impose hard restrictions with their values

  • 2009 Linux Symposium • 273

    being constant, which might not always yield optimumresults. FIOT and FAT can be the most efficient candi-date as they are softer and can be based truly on the fileactivities in real time. OHSM is still looking forward togood heuristics and measures which would provide anoptimum and efficient methods for making relocation adynamic event.

    7.3 Optimize mdraid Support

    OHSM uses a new block allocation strategy for the filesystem and also has the underlying device topology. IfOHSM is used over mdraid array, OHSM’s block al-location strategies can be further optimized to leveragethe underlying device layout. This may enhance the I/Ospeed over the devices in the mdraid array.

    8 Conclusion

    Online Hierarchical Storage Manager for GNU/Linuxcreates a platform and opens up various opportunitiesfor further work in the area of Hierarchical storage for aLinux based environment. OHSM sets up the basic in-frastructure where we can think of systematically merg-ing traditional and SSD based storage devices to reducethe overall cost of the system administration and alsoattaining a degree of storage efficiency. Moreover dueto the support for policy based migration of data, theadministrative cost of managing data also reduces. Theidea is to effectively reduce the cost of storage admin-istration and at the same time keep the system efficientand consistent. OHSM with some changes to the filesystems file creation and block allocation algorithms canachieve its goal of implementing a complete open sourcestorage software solution.

    9 Acknowledgment

    We would like to sincerely thank all the people whohelped us in this project, especially Greg Freemyer andManish Katiyar for providing us their valuable time andsupport on various technical and design issues. Also wewould like to thank Bharti Alatgi and Uma Nagaraj fortheir keen interest in OHSM from the early beginningand motivating us in the overall course of development.

    10 References

    1 Retrieving Multimedia Objects From HierarchicalStorage Systems, Eighteenth IEEE Symposium onMass Storage Systems and Technologies, 2001.

    2 Planned Extensions to the Linux Ext2Ext3 Filesystem,Proceedings of the FREENIX Track: 2002 USENIXAnnual Technical Conference.

    3 On Configuring Hierarchical Storage Structures(1998), Ali Esmail Dashti, Shahram Gh. InProceedings of the Joint NASA/IEEE Mass StorageConference.

    4 Ensuring Performance in Activity-Based FileRelocation, Wu, J.C. Bo Hong Brandt, S.A. Dept. ofComput. Sci., California Univ., Santa Cruz, CA.

    5 DHIS: discriminating hierarchical storage,Proceedings of SYSTOR 2009: The IsraeliExperimental Systems.

  • 274 • Online Hierarchical Storage Manager

  • Proceedings of theLinux Symposium

    July 13th–17th, 2009Montreal, Quebec

    Canada

  • Conference Organizers

    Andrew J. Hutton, Steamballoon, Inc., Linux Symposium,Thin Lines Mountaineering

    Programme Committee

    Andrew J. Hutton, Steamballoon, Inc., Linux Symposium,Thin Lines Mountaineering

    James Bottomley, NovellBdale Garbee, HPDave Jones, Red HatDirk Hohndel, IntelGerrit Huizenga, IBMAlasdair Kergon, Red HatMatthew Wilson, rPath

    Proceedings Committee

    Robyn BergeronChris Dukes, workfrog.comJonas FonsecaJohn ‘Warthog9’ Hawley

    With thanks toJohn W. Lockhart, Red Hat

    Authors retain copyright to all submitted papers, but have granted unlimited redistribution rightsto all as a condition of submission.