Data Footprint Reduction: Understanding IBM Storage Options

#IBMEDGE © 2012 IBM Corporation

sSE20

Data Footprint Reduction: Understanding IBM Storage Efficiency Options

Tony Pearson, Master Inventor and Senior Managing Consultant, IBM Corp

Sanjay S Bhikot, Advisory Unix and Storage Administrator, Ricoh Americas Corp


Data Footprint Reduction is the catch-all term for a variety of technologies designed to help reduce storage costs. This session will cover thin provisioning, space-efficient copies, deduplication and compression technologies, and describe the IBM storage products that provide these capabilities.


Sessions -- Tony Pearson

• Monday
  – 1:00pm Storing Archive Data for Compliance Challenges
  – 4:15pm IBM Watson: What it Means for Society

• Tuesday
  – 4:15pm Using Social Media: Birds of a Feather (BOF)

• Wednesday
  – 9:00am Data Footprint Reduction: IBM Storage options
  – 2:30pm IBM's Storage Strategy in the Smarter Computing era
  – 4:15pm IBM SONAS and the Cloud Storage Taxonomy

• Thursday
  – 9:00am IBM Watson: What it Means for Society
  – 10:30am Tivoli Storage Productivity Center Overview
  – 5:30pm IBM Edge “Free for All” hosted by Scott Drummond


Agenda

• Thin Provisioning
• Space-Efficient Copy
• Data Deduplication
• Compression


History of Thin Provisioning

• 1994: The StorageTek Iceberg 9200 Array introduced Thin Provisioning on slower 7200 RPM drives for mainframe systems

• 1997: IBM resold this as the RAMAC Virtual Array (RVA) for mainframe servers

• Today: Thin Provisioning is available for many operating systems on IBM storage, including DS8000, XIV, SVC, N series, Storwize V7000, DS3500 and DCS3700


Why Space is Over-Allocated

• Scenario 1 – Space requirements under-estimated
  – Running out of space requires larger volume
  – New request may take weeks to accommodate
    • Application outage if not addressed in time
  – Data must be moved to the larger volume
    • Application outage during data movement

• Scenario 2 – Space requirements over-estimated
  – Capacity lasts for years
    • No data migration
    • No application outages
    • No penalties

When faced with this dilemma, most will err on the side of over-estimating


Fully Allocated vs. Thin Provisioned

• Fully Allocated
  – Host sees fully allocated amount
  – Actual data written
  – Allocated but unused space dedicated to this host, wasted until written to

• Thin Provisioned
  – Host sees full virtual amount
  – Actual data written (Physical Space Allocated)
  – Empty space available to others


Fully Allocated vs. Thin Provisioned

Host sees a volume or LUN that consists of blocks numbered 0 to nnnnnnnnnn

• Volume/LUN – one or more extents
• Extent – Allocation Unit, one or more grains
• Grain – range of 1 or more blocks
• Block – typically 512 or 4096 bytes


Coarse and Fine-Grain

[Figure: a 10 x 10 grid of blocks numbered 00 to 99, shown three ways: Fully Allocated, Coarse-Grain and Fine-Grain]

Blocks 00, 55, and 99 written:

• Fully Allocated: all 10 extents allocated
• Coarse-Grain: only 3 extents allocated (each extent is one grain, for example Grain 90-99 = extent)
• Fine-Grain: only 1 extent allocated, holding Grain 00-01, Grain 54-55 and Grain 98-99
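The extent counts on this slide can be reproduced with a short Python sketch. It assumes the layout implied by the figure: a 100-block volume, 10 blocks per extent, and 2 blocks per fine grain. The function name and parameters are illustrative only and do not come from any IBM product.

```python
def extents_allocated(written_blocks, mode, volume_blocks=100,
                      blocks_per_extent=10, blocks_per_grain=2):
    """Count the extents consumed after the given blocks have been written."""
    if mode == "full":
        # Every extent is reserved up front, whether or not it is written.
        return volume_blocks // blocks_per_extent
    if mode == "coarse":
        # A whole extent is allocated on the first write to any block in it.
        return len({b // blocks_per_extent for b in written_blocks})
    if mode == "fine":
        # Only the touched grains are allocated, and grains from anywhere
        # in the volume are packed together into shared extents.
        grains = {b // blocks_per_grain for b in written_blocks}
        grains_per_extent = blocks_per_extent // blocks_per_grain
        return -(-len(grains) // grains_per_extent)  # ceiling division
    raise ValueError(f"unknown mode: {mode}")


written = [0, 55, 99]
for mode in ("full", "coarse", "fine"):
    print(mode, extents_allocated(written, mode))
# full 10
# coarse 3
# fine 1
```

The fine-grain case wins here only because the three written grains fit into a single extent; a real implementation also has to keep a map from each grain to its physical location.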


How IBM has implemented TP

                  IBM DS8000   IBM XIV   SVC and Storwize V7000   DS3500, DCS3700
Type              Coarse       Fine      Fine                     Fine
Allocation Unit   1 GB         17 GB     16 MB to 8 GB            4 GB
Grain size        -            1 MB      32-256 KB                64 KB


Thick-to-Thin Migration

[Figure: a volume mirror from Copy 0, a fully-allocated volume, to Copy 1, a thin-provisioned volume]

Only non-zero blocks copied



Empty Space Reclaim

• Thin Provisioning: allocations in 17 GB units, with 1 MB chunks (grains). Only non-zero blocks consume physical space.

• Avoid writing empty blocks: any I/O request that tries to write a block of all zeros to unallocated space is ignored.

• Background task to find empty chunks: a background task scans all blocks, looking for chunks containing all zeros.

• Empty space reclaimed: empty chunks are returned to unallocated space, so that they can be used for other volumes.
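The two reclaim mechanisms above (ignoring zero writes to unallocated space, and a background scan for all-zero chunks) can be sketched in Python. This is a toy model under stated assumptions, not the storage system's implementation: the grain is 4 bytes instead of 1 MB, and the class and method names are hypothetical.

```python
GRAIN = 4            # bytes per grain; tiny for illustration (the slide cites 1 MB)
ZERO = bytes(GRAIN)  # an all-zero grain


class ThinVolume:
    def __init__(self):
        # grain index -> data; a missing index means "unallocated"
        self.grains = {}

    def write(self, index, data):
        if len(data) != GRAIN:
            raise ValueError("writes must be exactly one grain")
        if data == ZERO and index not in self.grains:
            return  # all-zero write to unallocated space is ignored
        self.grains[index] = data

    def read(self, index):
        # Unallocated grains read back as zeros.
        return self.grains.get(index, ZERO)

    def reclaim(self):
        """Background scan: free allocated grains that contain only zeros."""
        empty = [i for i, data in self.grains.items() if data == ZERO]
        for i in empty:
            del self.grains[i]
        return len(empty)


vol = ThinVolume()
vol.write(0, b"data")
vol.write(1, ZERO)        # ignored, nothing is allocated
vol.write(0, ZERO)        # grain 0 is already allocated, so zeros are stored
print(len(vol.grains))    # 1
print(vol.reclaim())      # 1
print(len(vol.grains))    # 0
```

Reads return the same bytes before and after the reclaim, which is why the host never notices that space was returned to the pool.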


Thin Provisioning

• Pros
  – Just-in-Time increased utilization percentage
  – Eliminates the pressure to make accurate space estimates
  – Dynamically expand volume without impacting applications or rebooting server
  – Reduces the data footprint and lowers costs
  – Shifts focus from volumes to storage pool capacity

• Cons
  – Not all file systems are cooperative or friendly
    • Deletion of files does not free space for others
    • “sdelete” writes zeros over deleted file space
  – Some implementations may impact I/O performance
  – May not support same set of features, copy services, or replication
    • “Writing checks you can’t cash”





History of Space-Efficient Copies

• 1993: NetApp introduces Snapshot in its WAFL file system

• 1997: IBM Enterprise Storage Server (ESS) introduces NOCOPY parameter on FlashCopy

• Today: Space-Efficient Copy is available on many IBM storage systems, including DS8000, XIV, SVC, N series, Storwize V7000, DS3500, DS5000 and DCS3700


Space-Efficient Copies

[Figure: one Source volume with three copies: Destination 1, Destination 2 and Destination 3]

• Source: 100 GB allocated, 40 GB written
• Traditional Copies: 300 GB
• Space-Efficient Copies, 10% reserved: 30 GB


Method 1: Copy on Write (COW)

• Copy-On-Write (COW)
  – Copy is set of pointers to original data
  – Write to original volume:
    • Pause I/O
    • Copy original block of data to destination
    • Update original block
  – Slows performance
  – May limit # of destination copies
  – Can be combined with background copy for a full copy

[Figure: before the write, Source holds blocks A B C D and Destination is empty; after the write, Source holds A B C2 D and Destination holds C]
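The copy-on-write steps above can be sketched in a few lines of Python. This is a minimal illustration using an in-memory list as the volume; the class name and methods are hypothetical and do not represent FlashCopy or any other product's interface.

```python
class CowSnapshot:
    """Copy-on-write: the copy starts out as pointers to the source blocks."""

    def __init__(self, source):
        self.source = source   # the live volume, a list of blocks
        self.preserved = {}    # block index -> original data moved to destination

    def write(self, index, data):
        # Copy the original block to the destination the first time it
        # changes, then update the original block in place.
        if index not in self.preserved:
            self.preserved[index] = self.source[index]
        self.source[index] = data

    def read_snapshot(self, index):
        # Unchanged blocks are still read through to the source.
        return self.preserved.get(index, self.source[index])


source = ["A", "B", "C", "D"]
snap = CowSnapshot(source)
snap.write(2, "C2")
print(source)                                      # ['A', 'B', 'C2', 'D']
print([snap.read_snapshot(i) for i in range(4)])   # ['A', 'B', 'C', 'D']
print(snap.preserved)                              # {2: 'C'}
```

Each first write to a block costs an extra read and write, which is the performance penalty the slide refers to.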


Method 2: Redirect on Write (ROW)

• Redirect-On-Write (ROW)
  – Copy is set of pointers to original data
  – Write to original volume:
    • Re-directed to new empty space
    • Previous data left alone
  – Does not impact performance
  – Supports many destination copies

[Figure: before the write, Source holds blocks A B C D; after the write, blocks A B C D are left in place and C2 is written to new empty space]
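For comparison, here is a redirect-on-write sketch in the same style. Again this is a toy model with hypothetical names: physical space is an append-only list, and both the live volume and each snapshot are just maps of pointers into it.

```python
class RowVolume:
    """Redirect-on-write: new data goes to empty space, old data stays put."""

    def __init__(self, blocks):
        self.store = list(blocks)             # physical space, append-only
        self.live = list(range(len(blocks)))  # logical block -> physical slot
        self.snapshots = []                   # frozen copies of the pointer map

    def snapshot(self):
        # Taking a copy only duplicates pointers, never data.
        self.snapshots.append(list(self.live))
        return len(self.snapshots) - 1

    def write(self, index, data):
        # Redirect the write to new empty space and repoint the live volume.
        self.store.append(data)
        self.live[index] = len(self.store) - 1

    def read(self, index):
        return self.store[self.live[index]]

    def read_snapshot(self, snap_id, index):
        return self.store[self.snapshots[snap_id][index]]


vol = RowVolume(["A", "B", "C", "D"])
snap = vol.snapshot()
vol.write(2, "C2")
print([vol.read(i) for i in range(4)])                 # ['A', 'B', 'C2', 'D']
print([vol.read_snapshot(snap, i) for i in range(4)])  # ['A', 'B', 'C', 'D']
```

A write never copies existing data, so the cost of a write does not grow with the number of snapshots, which is why this method can support many destination copies.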


Space-Efficient Copies

• Pros
  – Supports both Fully-allocated and Thin-Provisioned Sources
  – Reduces the data footprint and lowers costs
  – Allows you to keep more copies online
  – Allows you to take copies more frequently
    • Can be used as checkpoint copies during batch processing

• Cons
  – Some implementations may impact I/O performance
  – Requires that you estimate the maximum percentage changed
    • Typically 10-20%
  – Exceeding the reserved space invalidates destination copy
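The capacity figures from the earlier Space-Efficient Copies slide (a 100 GB source, three destination copies, 10% reserved) and the reserve rule in the Cons list can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and do not correspond to any IBM tool.

```python
def traditional_copies_gb(source_gb, copies):
    """Each traditional copy is as large as the source volume."""
    return source_gb * copies


def space_efficient_copies_gb(source_gb, copies, reserve_percent):
    """Each space-efficient copy reserves only a percentage of the source."""
    return source_gb * copies * reserve_percent / 100


def copy_still_valid(changed_gb, source_gb, reserve_percent):
    """A destination copy is invalidated once changes exceed its reserve."""
    return changed_gb <= source_gb * reserve_percent / 100


print(traditional_copies_gb(100, 3))          # 300
print(space_efficient_copies_gb(100, 3, 10))  # 30.0
print(copy_still_valid(8, 100, 10))           # True
print(copy_still_valid(12, 100, 10))          # False
```

The last call shows why the reserve percentage has to be estimated with care: 12 GB of changes against a 10 GB reserve would invalidate that copy.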





History of Data Deduplication

• 2007: Advanced Single Instance Store (A-SIS) brings deduplication for the IBM N series and NetApp disk storage

• 2008: IBM acquires Diligent and introduces the ProtecTIER TS7600 virtual tape library with data deduplication

• Today: IBM offers a variety of choices, including ProtecTIER, N series, and Tivoli Storage Manager (TSM v6)


Data Deduplication

• Data deduplication reduces capacity requirements by only storing one unique instance of the data on disk and creating pointers for duplicate data elements

Deduplication reduces disk required for backup copies

31-May-12

Two Primary Data Deduplication Approaches

• Hash based Deduplication: sometimes referred to as a Content Addressable Storage approach

• HyperFactor: a different approach based on an agnostic view of data


Hash-Based Approach

1. Slice data into chunks (fixed or variable)
2. Generate Hash per chunk and save
3. Slice next data into chunks and look for Hash Match
4. Reference data previously stored

[Figure: data chunks A B C D E and their hashes Ah Bh Ch Dh Eh]
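The four steps of the hash-based approach can be sketched in Python. The assumptions here are mine, not the slide's: fixed-size chunks of 4 bytes, SHA-1 from the standard library as the hash, and a plain dictionary as the chunk store. Hash collisions are deliberately not handled.

```python
import hashlib


def ingest(data, store, chunk_size=4):
    """Deduplicate one data stream into `store` and return its recipe."""
    recipe = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]   # 1. slice data into chunks
        digest = hashlib.sha1(chunk).digest()    # 2. generate a hash per chunk
        if digest not in store:                  # 3. look for a hash match
            store[digest] = chunk                #    unique data is saved once
        recipe.append(digest)                    # 4. reference stored data
    return recipe


def rehydrate(store, recipe):
    """Rebuild the original stream from its list of chunk references."""
    return b"".join(store[digest] for digest in recipe)


store = {}
first = ingest(b"AAAABBBBCCCC", store)
second = ingest(b"AAAABBBBDDDD", store)
print(len(store))                 # 4
print(rehydrate(store, second))   # b'AAAABBBBDDDD'
```

Six chunks were written but only four are stored, because the second stream shares its first two chunks with the first.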


HyperFactor Approach

1. Look through data for similarity
2. Read elements that are most similar
3. Diff reference with version – will use several elements
4. Matches factored out – unique data added to repository

[Figure: a New Data Stream compared against Element A, Element B and Element C]


Assessment of Hash-based Approaches

• Applicable for all chunking methods
• Hash Table in Memory
  – Overhead for in-band deduplication
  – Hash table will grow with data volume
  – Growing hash-table may become performance bottleneck
  – Scalability issues
• Hash-Collisions must be handled
• Hash table must be protected
  – One copy might not be sufficient

Example: Imagine a chunk size of 8 KB

• 1 TB repository has ~125,000,000 8 KB chunks
• Each hash is 20 bytes long
• Need pointers scheme to reference 1 TB
• The hash-table requires 2.5 GB RAM » no issue
• With a 100 TB repository » ~250 GB of RAM is required
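The RAM figures in this example follow directly from the chunk count. The sketch below reproduces the arithmetic under the assumption that the slide uses decimal units (1 TB = 10^12 bytes, 8 KB = 8,000 bytes) and counts only the 20-byte hashes, not the pointers that reference the stored data.

```python
TB = 10**12
GB = 10**9


def hash_table_bytes(repository_bytes, chunk_bytes=8_000, hash_bytes=20):
    """RAM needed to hold one hash per chunk (pointer overhead excluded)."""
    chunks = repository_bytes // chunk_bytes
    return chunks * hash_bytes


print(1 * TB // 8_000)                   # 125000000
print(hash_table_bytes(1 * TB) / GB)     # 2.5
print(hash_table_bytes(100 * TB) / GB)   # 250.0
```

The table grows linearly with the repository, so a 100x larger repository needs 100x more memory for hashes alone.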


When Deduplication Occurs

1. In-line Processing
  – As data is received by the target device it is
    • Deduplicated in real time
    • Only unique data stored on disk
  – Data written to the disk storage is deduplicated

2. Post-Processing
  – As data is received by the target device it is
    • Temporarily stored on disk storage
  – Data is subsequently read back in to be processed by a deduplication engine


Comparison of Offerings

                  Hash-based                                      HyperFactor
In-line Process   Other vendors                                   IBM ProtecTIER: TS7680G, TS7650G, TS7650, TS7620 Express, TS7610 Express
Post-Process      IBM Tivoli Storage Manager (TSM), N series      -


IBM ProtecTIER with HyperFactor

• Gateways
  – Attaches up to 1 PB of disk
  – Two models:
    • TS7680 for IBM System z
    • TS7650G for distributed systems

• Appliances
  – Disk included inside
  – Three models for distributed systems
    • TS7650 … in three sizes
    • TS7620 (New!)
    • TS7610 ... in two sizes


Complementary Solutions Today!

ProtecTIER (IBM TS7600) vs. Tivoli Storage Manager (TSM)

Can be used together, but don’t deduplicate the same data twice

• Both Solutions Offer the Benefits of Target-side Deduplication:
  – Greatly reduced storage capacity requirements
  – Lower operational costs, energy usage and TCO
  – Faster recoveries with more data on disk

• Use ProtecTIER When:
  – Highest performance and capacity scaling are required!
  – Up to 1400 MB/sec (2.5 GB/s with 2 nodes) deduplication rates are needed
  – Deduplicated capacities up to 25 PB are required
  – You wish to avoid operational impact of post-processing deduplication
  – A VTL appliance model is desired
  – Deduplicating across multiple TSM (or other backup) servers

• Use TSM 6 Built-in Deduplication When:
  – You desire deduplication operations be completely integrated within TSM
  – The benefits of deduplication are desired without separate hardware or software dependencies or licenses (ships with TSM Extended Edition)
  – You desire end to end data lifecycle management with minimized data store


Data Deduplication

• Pros
  – Designed for backups
  – Can offer up to 25x data footprint reduction
    • Allows disk backup repositories to approach cost of tape-based solutions
  – Allows more backup copies to remain on disk for faster restores
  – Available with a variety of interfaces, including VTL, OST and NAS

• Cons
  – Dealing with Hash Collisions
    • May require byte-for-byte comparisons or keeping secondary copy of data
  – Some systems do not scale
  – Some systems have slow restores
    • Re-hydrating data back to normal
  – Primary data may not dedupe very well
    • Your mileage may vary!





History of Compression

• 1973: NASA and IBM developed the Houston Aerospace Spooling Protocol (HASP) with compression for long distance data transmission.

• 1986: IBM introduced the Improved Data Recording Capability (IDRC) for the 3480 tape drive

• Today: IBM offers real-time compression for file and block level access to disk storage


Lossy vs. Lossless Methods

• Lossy: Good enough?
  – Used with music, photos, video, medical images, scanned documents, fax machines
  – Compress, then Decompress does not return data back to its original contents

• Lossless: Exactly the same
  – Used with databases, emails, spreadsheets, office documents, source code
  – Compress, then Decompress returns data back to its original contents


How Compression Works

[Figure: an excerpt from “Lord of the Rings” with repeated sequences highlighted in red]

• Lempel-Ziv lossless compression builds a dictionary of repeated phrases: sequences of two or more characters that can be represented with fewer bits

• In the above excerpt from “Lord of the Rings”, all of the red text represents repeated sequences eligible for compression!

Source: The Lempel Ziv Algorithm, Christian Zeeh, 2003
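To make the dictionary idea concrete, here is a minimal LZ78 sketch in Python. LZ78 is one member of the Lempel-Ziv family; the slide does not say which variant any IBM product uses, so treat this purely as an illustration of how repeated phrases become short dictionary references.

```python
def lz78_compress(text):
    """Encode text as (dictionary index, next character) pairs."""
    dictionary = {"": 0}   # phrase -> index; index 0 is the empty phrase
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch   # keep extending a phrase already seen
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Flush the leftover phrase as its known prefix plus its last character.
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


def lz78_decompress(pairs):
    """Rebuild the text by replaying the same dictionary construction."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


text = "one ring to rule them all, one ring to find them"
pairs = lz78_compress(text)
print(len(text), len(pairs))
print(lz78_decompress(pairs) == text)   # True
```

The round trip returns exactly the original text, which is what makes the method lossless; the saving comes from emitting fewer pairs than there are characters.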


Compressed Volumes

[Figure: volume layouts, with these labels]

• Host sees full virtual amount
• Actual data written
• Allocated but unused space dedicated to this host, wasted until written to
• Physical Space Allocated
• Physical Space Allocated, up to 80% reduction from actual data written


Real-time Compression!

• Real-time Compression for primary data
  – Less data stored on primary storage (up to 80%)
  – No changes to applications or procedures

• Before it gets to the storage array
  – Larger effective storage cache
  – Disk Array can serve more requests from its read / write cache
  – Lower storage CPU overhead

• Does not cause performance degradation
  – Much smaller I/O / lower disk workload
  – Reads/Writes are faster due to storage array’s response from cache instead of disk
  – Additionally reads may come from advanced read ahead cache (no write cache)

[Figure labels: Workstations, Application Servers, IP Network, Cache, Cache, Disk Array]


FIVO vs. VIFO

• Fixed Input, Variable Output
  – WAN transmission
  – Sequential tape
  – IBM Tivoli Storage Manager
  – zip, tar, etc.

• Variable Input, Fixed Output
  – Random Access Compression Engine™ (RACE)
  – IBM Real-Time Compression Appliances
  – IBM SVC, Storwize V7000

[Figure: numbered Data blocks 1 to 6 mapped to Compressed Data blocks 1 to 6, once for each method]
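The difference between the two methods can be sketched with Python's standard zlib module. This is not the RACE algorithm; it only illustrates the idea. The function names, the 4096-byte sizes and the 512-byte growth step are assumptions chosen for the example.

```python
import zlib


def fivo(data, in_size=4096):
    """Fixed input, variable output: compress equal-sized input chunks."""
    return [zlib.compress(data[i:i + in_size])
            for i in range(0, len(data), in_size)]


def vifo(data, out_size=4096, step=512):
    """Variable input, fixed output: pack as much input as fits into each
    output block of at most out_size compressed bytes."""
    blocks = []   # list of (input length, compressed bytes)
    pos = 0
    while pos < len(data):
        take = min(step, len(data) - pos)
        best = zlib.compress(data[pos:pos + take])
        # Keep growing the input while the compressed result still fits.
        while pos + take < len(data):
            trial_take = min(take + step, len(data) - pos)
            trial = zlib.compress(data[pos:pos + trial_take])
            if len(trial) > out_size:
                break
            take, best = trial_take, trial
        blocks.append((take, best))
        pos += take
    return blocks


data = b"storage efficiency " * 2000 + bytes(range(256)) * 40
print(sorted({len(c) for c in fivo(data)}))   # output sizes vary per chunk
print([n for n, _ in vifo(data)])             # input sizes vary per block
```

With fixed-size output blocks, the location of any piece of data can be found through a map, which is the property the following slides rely on. Recompressing on every growth step is wasteful and is done here only to keep the sketch short.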


Traditional Approaches

Compression for Disk data

• Extra work to ‘edit’ a file
• All blocks shift
  – Only one common block (this example)
  – Negative impact to deduplication
• No notion of data location

[Figure: Compression after Modification. The new compressed file is ABC DMN FGH I (Blocks Shift)]

Real-time Compression

• Small amount of work / I/O to edit
• Only modified block changes
  – Multiple common blocks
  – Enhances deduplication
• Data location via map

[Figure: Compression after Modification. The compressed file is ABC DEF GHI; the new compressed file is ABC DEF1 GHI MN (Identical Blocks)]

Compression Without Compromise

Expected Compression Ratios

Databases                                      Up to 80%
Server Virtualization   Linux virtual OSes     Up to 70%
                        Windows virtual OSes   Up to 55%
Collaboration           Office 2003            Up to 75%
                        Office 2007 or later   Up to 25%
CAD/CAM                 Engineering/Design     Up to 75%


Objectives:

• Run over a block device
• Estimate:
  – Portion of non-zero blocks in the volume.
  – Compression rate of non-zero blocks with RTC.

Performance:

• Runs FAST! < 60 seconds, no matter what the volume size
  – Typical running time on a machine with multiple disks: < 20 seconds
• Give guarantees on the estimation: ~5% max error guarantee
  – Can improve guarantee with more running time

Method:

• Random sampling and compression throughout the volume
• Collect enough non-zero samples to gain desired confidence
  – More zero blocks → slower (takes more time to find non-zero blocks)
• Mathematical analysis gives confidence guarantees
• Note: we are estimating compression during migration of a volume into RTC (data at rest)
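The sampling method can be sketched in Python. This is a rough illustration, not the estimation tool itself: zlib stands in for the RTC compressor, the volume is an in-memory bytes object, and the names and defaults (`nonzero_target`, `max_draws`) are hypothetical. The mathematical confidence analysis mentioned above is not reproduced.

```python
import random
import zlib


def estimate(volume, block_size=4096, nonzero_target=100,
             max_draws=10_000, seed=0):
    """Return (estimated non-zero fraction, estimated compression savings)."""
    n_blocks = len(volume) // block_size
    if n_blocks == 0:
        return 0.0, 0.0
    rng = random.Random(seed)
    zero = bytes(block_size)
    draws = nonzero = raw = packed = 0
    # Sample random blocks until enough non-zero ones have been collected.
    while nonzero < nonzero_target and draws < max_draws:
        draws += 1
        start = rng.randrange(n_blocks) * block_size
        block = volume[start:start + block_size]
        if block == zero:
            continue   # zero blocks only count toward the non-zero fraction
        nonzero += 1
        raw += block_size
        packed += len(zlib.compress(block))
    nonzero_fraction = nonzero / draws
    savings = 1 - packed / raw if raw else 0.0
    return nonzero_fraction, savings


# A synthetic volume: half zero blocks, half highly repetitive data.
volume = bytes(4096 * 50) + (b"abcd" * 1024) * 50
fraction, savings = estimate(volume)
print(f"non-zero blocks: ~{fraction:.0%}, savings on non-zero data: ~{savings:.0%}")
```

The loop also shows why a volume with many zero blocks takes longer: more draws are needed before the target number of non-zero samples is reached.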


IBM Real-Time Compression

• For NAS devices
  – IBM Real-Time Compression Appliance: STN 6800, STN 6500

• For Block devices
  – SAN Volume Controller
  – Storwize V7000

Migrating to Compressed Disk

[Figure: a volume mirror from Copy 0, a fully-allocated or thin-provisioned volume, to Copy 1, a compressed volume]

Only non-zero blocks copied


Data Compression

• Pros
  – Can be used for data transmission, tape and disk data
  – Can offer up to 80% data footprint reduction
  – Available as front-end appliance or integrated into storage system
  – Can be “Dedupe-Friendly”

• Cons
  – Some implementations are post-process
    • Stores uncompressed data first, compress later
  – Some implementations impact performance and/or consume substantial CPU resources
  – Benefits vary by data type, and whether applications do their own compression or encryption
    • Your mileage may vary


Intel, the Intel logo, Xeon and Xeon Inside are trademarks or registered trademarks of Intel Corporation in the U.S. and /or other countries.

Thank You!

Session: sSE20

Presenters: Tony Pearson, Sanjay Bhikot


Additional Resources

Email: [email protected]

Twitter: http://twitter.com/az99Øtony

Blog: http://ibm.co/brAeZØ

Books: http://www.lulu.com/spotlight/99Ø_tony

IBM Expert Network: http://www.slideshare.net/az99Øtony


Trademarks and disclaimers

© IBM Corporation 2012. All rights reserved.

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of The Minister for the Cabinet Office, and is registered in the U.S. Patent and Trademark Office. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries.

Other product and service names might be trademarks of IBM or other companies. Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web at http://www.ibm.com/legal/copytrade.shtml.

Information is provided "AS IS" without warranty of any kind.

The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system or other features. Contact your IBM representative or Business Partner for the most current pricing in your geography.

Photographs shown may be engineering prototypes. Changes may be incorporated in production models.

References in this document to IBM products or services do not imply that IBM intends to make them available in every country.
