Centera Integration Training

Page 1: Centera Integration Training

EMC Centera – API Best Practices

© Copyright 2007 EMC Corporation. All rights reserved.

Centera Integration Training

Centera API Best Practices

Corporate Systems Engineering

July 2007

Page 2: Centera Integration Training

Agenda

Refresher on Centera API

Performance Optimizations

Metadata

Storage Strategies & ClipID Formats

SDK Options (Buffers / Failover)

Managing Retention

Wrap Up and Questions

Page 3: Centera Integration Training

API Background

Understand the main Centera API concepts
– Pools, Clips, Streams, Tags, Blobs, Query

Appreciate the following points:
– Open, read, write, query and delete transactions contact the Centera, while clip create, get/set attribute and close are local SDK operations
– 1 transaction uses 1 socket, which is 1 thread of access
– C-Clips only have relationships with blobs, never other C-Clips
– The high transactional overhead for any write operation and how it affects the writing of small objects

Understanding these points makes integration easier, more efficient and more effective!

Page 4: Centera Integration Training

Deploy In A Multi-Tier Architecture

• CentraStar Software resides on the Centera.
• API Library resides on the application server.
• Client application on PC connects to application server.
• Server application makes calls to Centera-supplied API.
• API interacts with Centera over TCP/IP using HPP protocol.

[Diagram: clients running the application; application server stack (Application, API Library, OS); Centera running CentraStar Software]

Page 5: Centera Integration Training

Transaction Example (Write)

• The client sends a write (import) request to the server
• The server sends an acknowledgement to the client
• The client begins streaming its data to the server
• After the client has sent its last packet of data, it begins waiting for an ACK from the server
• The server distributes the data to the Storage Nodes, and updates its indices
• The server sends an acknowledgement to the client and the write operation returns

[Diagram: message sequence between Client and Centera]
Client → Centera: Request
Centera → Client: ACK
Client → Centera: Data (repeated)
(Client waiting for ACK)
Centera → Client: ACK

[Diagram: application server stack: Application, API Library, Operating System]

Page 6: Centera Integration Training

Performance Optimizations

Multithreading

Small Object Management
– Embedded Blobs
– Containerization
– Hybrid Containerization

Huge Object Management
– Blob Slicing

Pool Management

Page 7: Centera Integration Training

Optimizing Write Performance

So for a single stream over time we see …

[Diagram: transaction timelines for a large object (~5 MB) and a small object (~50 KB), each bracketed by setup and commit phases]

… with small objects, most of the total transaction time is spent in setup and commit – hardly any in data transfer!
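The effect can be put into rough numbers. The sketch below is a simple model, not a measurement: it assumes each write costs a fixed setup-plus-commit overhead and then transfers data at a constant rate. The overhead and bandwidth figures are hypothetical placeholders, not Centera specifications.

```python
def write_time_s(size_bytes, overhead_s, bandwidth_bytes_per_s):
    """Time for one write: fixed setup/commit overhead plus data transfer."""
    return overhead_s + size_bytes / bandwidth_bytes_per_s


def overhead_fraction(size_bytes, overhead_s, bandwidth_bytes_per_s):
    """Share of the total transaction time spent in setup and commit."""
    return overhead_s / write_time_s(size_bytes, overhead_s, bandwidth_bytes_per_s)


# Hypothetical figures, chosen only to illustrate the shape of the problem.
OVERHEAD_S = 0.05               # combined setup + commit time per transaction
BANDWIDTH = 50 * 1024 * 1024    # sustained transfer rate in bytes per second

small = overhead_fraction(50 * 1024, OVERHEAD_S, BANDWIDTH)        # ~50 KB object
large = overhead_fraction(5 * 1024 * 1024, OVERHEAD_S, BANDWIDTH)  # ~5 MB object

print(f"small object: {small:.0%} of the transaction is overhead")
print(f"large object: {large:.0%} of the transaction is overhead")
```

Under these assumed figures the small object spends almost all of its transaction time in overhead, which is the motivation for the small-object techniques that follow.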

Page 8: Centera Integration Training

Multithreaded Writes

[Diagram: eight parallel write timelines, Thread 1 to Thread 8]

Multithreading provides overlapped I/O – more data is transferred in a given period of time.

Page 9: Centera Integration Training

Multithreading

Advantages
– Better utilization of available bandwidth
– Overlapped I/O yields better throughput
– Takes advantage of multiple access nodes
– Shared PoolConnection for improved load balancing

Disadvantages
– More coding required
– Multithreading coding/debugging generally trickier than single threaded programming
– Thread packages differ from platform to platform
– Scales to a point, then rolls off

Page 10: Centera Integration Training

Multithreading

Make the number of threads configurable individually for Read and Write

A good combined number is 20 threads per access node
– This needs to be configured at install time
– For large numbers of threads, increase the value of FP_OPTION_MAXCONNECTIONS (default is 100)

No application exists in a vacuum
– Be conscious of workload imposed by other applications
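The sizing guidance above can be turned into a small install-time calculation. This is a minimal sketch: the figures of 20 threads per access node and a default connection limit of 100 come from the slide, while the function name `plan_threads` and the read/write split parameter are hypothetical.

```python
def plan_threads(access_nodes, write_share=0.5, per_node=20, max_connections=100):
    """Split a combined thread budget into read and write threads.

    per_node and max_connections default to the figures quoted on the slide;
    write_share is an application-specific assumption.
    """
    total = access_nodes * per_node
    write_threads = max(1, round(total * write_share))
    read_threads = max(1, total - write_threads)
    combined = read_threads + write_threads
    return {
        "read_threads": read_threads,
        "write_threads": write_threads,
        "total": combined,
        # True means FP_OPTION_MAXCONNECTIONS should be raised above its default.
        "raise_max_connections": combined > max_connections,
    }


print(plan_threads(access_nodes=4))
print(plan_threads(access_nodes=8, write_share=0.25))
```

Keeping the read and write counts as separate configuration values lets an installer rebalance them without a code change.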

Page 11: Centera Integration Training

Managing Small Content - Object Count Limitations

CentraStar 3.1 / Gen 4 has a limit of 50 million objects per node

This count includes all types of Centera objects
– CDF
– Blobs
– Mirror copies
– Parity fragments
– Reflections

CDFs should be designed to fully utilise capacity (bytes) before these object count limits are encountered.

Embedded Blobs cut down object usage by at least 50%
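A rough count shows where the "at least 50%" figure comes from. The sketch below assumes mirrored protection (two copies of every object) and ignores parity fragments and reflections; the node capacity used at the end is a hypothetical figure.

```python
OBJECT_LIMIT_PER_NODE = 50_000_000  # CentraStar 3.1 / Gen 4 figure from the slide


def objects_per_clip(blobs_per_clip, embedded, copies=2):
    """Stored objects for one C-Clip: the CDF plus any linked blobs, times copies.

    copies=2 assumes mirrored protection; parity and reflections are ignored.
    """
    logical = 1 if embedded else 1 + blobs_per_clip
    return logical * copies


def min_avg_bytes_per_object(node_capacity_bytes):
    """Average object size needed to fill the node before the object limit is hit."""
    return node_capacity_bytes / OBJECT_LIMIT_PER_NODE


linked = objects_per_clip(blobs_per_clip=1, embedded=False)
embedded = objects_per_clip(blobs_per_clip=1, embedded=True)
print(f"linked: {linked} objects, embedded: {embedded} objects")

# Hypothetical node capacity of 10**12 bytes.
print(f"minimum average object size: {min_avg_bytes_per_object(10**12):.0f} bytes")
```

With one blob per clip the saving is exactly half; with several blobs embedded in one CDF it is proportionately larger.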

Page 12: Centera Integration Training

Managing Small Content - Embedded Blobs

Whenever a write or read is done, two objects are transferred

Writes
– the Blob is written, added to the Tag in the CDF being constructed and, when fully constructed, the CDF is written.

Reads
– the CDF is read, the application navigates to the Tag containing the content and the Blob is retrieved

[Diagram: a CDF and a separate Blob]

Page 13: Centera Integration Training

Managing Small Content - Embedded Blobs

With embedded blobs, there is no separate blob so all data is transferred as a single object when the CDF is read or written

The SDK transparently stores the Blob as an Attribute on the Tag inside the CDF
– Base64 encoded to adhere to the XML character set
– Developer does not write any “special” code other than enabling the feature

I/O operations are reduced by at least half
– only the CDF is read/written
– proportionately greater savings if multiple blobs are stored in the CDF

Can only be used for relatively small objects (< 100KB)

[Diagram: the Blob contained inside the CDF]

Page 14: Centera Integration Training

Managing Small Content - Embedded Blobs

Embedded Blobs are easily enabled within an application

– Globally via an FPPool option
    FPPool_SetGlobalOption(FP_OPTION_EMBEDDED_DATA_THRESHOLD, 100*1024)
  The threshold (100KB in the example above, which is also the maximum) is then used to determine how the data is stored

– Explicitly on the FPTag_BlobWrite call
    FPTag_BlobWrite(theTag, theStream, FP_OPTION_LINK_DATA)
    FPTag_BlobWrite(theTag, theStream, FP_OPTION_EMBED_DATA)
  This overrides any Global setting that is in force
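The precedence between the global threshold and the per-call option can be modelled as below. This is a sketch of the rule as described on the slide, not SDK code: the strings "LINK" and "EMBED" stand in for FP_OPTION_LINK_DATA and FP_OPTION_EMBED_DATA, and treating a blob whose size equals the threshold as linked is an assumption.

```python
MAX_EMBED_BYTES = 100 * 1024  # maximum threshold stated on the slide


def storage_mode(blob_size, global_threshold=0, per_call_option=None):
    """Return "embedded" or "linked" for a blob of blob_size bytes.

    per_call_option ("LINK" or "EMBED") overrides the global threshold.
    The strict "<" comparison is an assumption, not documented behaviour.
    """
    if per_call_option == "LINK":
        return "linked"
    if per_call_option == "EMBED":
        return "embedded"
    threshold = min(global_threshold, MAX_EMBED_BYTES)
    return "embedded" if blob_size < threshold else "linked"


print(storage_mode(10 * 1024, global_threshold=100 * 1024))
print(storage_mode(200 * 1024, global_threshold=100 * 1024))
print(storage_mode(10 * 1024, global_threshold=100 * 1024, per_call_option="LINK"))
```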

Page 15: Centera Integration Training

Managing Small Content - Embedded Blobs

Advantages
– Can dramatically decrease object count usage if multiple blobs are stored embedded in each CDF
– Reduces I/Os (blob does not need to be read separately)
– Easy to code

Disadvantages
– Single instance storage is lost
– XML-Compatible Data Encoding (Base64) increases storage requirement by 33%
– Read performance can be impacted
  • All blob content is retrieved when opening the clip
  • The larger CDF takes longer to parse
  • Standard guidelines should be followed i.e. CDF size < 10MB
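The 33% storage increase follows directly from Base64, which emits 4 output characters for every 3 input bytes. A quick check using Python's standard `base64` module (the 60 KB payload is an arbitrary example size):

```python
import base64


def embedded_size(raw: bytes) -> int:
    """Number of bytes the payload occupies once Base64 encoded."""
    return len(base64.b64encode(raw))


raw = bytes(60 * 1024)          # 60 KB of zero bytes as a stand-in payload
encoded = embedded_size(raw)
overhead = encoded / len(raw) - 1

print(f"raw: {len(raw)} bytes, encoded: {encoded} bytes, overhead: {overhead:.1%}")
```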

Page 16: Centera Integration Training

Managing Small Content - Containerization

Small objects are collected and inserted into a larger container object

Each individual object’s byte offset and length is stored in the metadata

When an object is retrieved, a byte offset read is done through the API and only the small object is returned

[Diagram: a Content Descriptor File whose metadata points into a Container Blob]

Image name   Byte offset   Length
1023.jpg     21078         2497
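The container idea can be sketched independently of the SDK: concatenate the small objects, record each one's offset and length, and slice on read. The function names below are hypothetical; in a real integration the index would live in the CDF metadata and the slice would be a partial blob read.

```python
def build_container(objects):
    """Pack (name, data) pairs into one container and index them.

    Returns (container_bytes, index) where index maps name -> (offset, length).
    """
    container = bytearray()
    index = {}
    for name, data in objects:
        index[name] = (len(container), len(data))
        container += data
    return bytes(container), index


def read_object(container, index, name):
    """Return only the requested object using its stored offset and length."""
    offset, length = index[name]
    return container[offset:offset + length]


container, index = build_container([
    ("a.jpg", b"AAAA"),
    ("b.jpg", b"BB"),
    ("c.jpg", b"CCC"),
])
print(index)
print(read_object(container, index, "b.jpg"))
```

The index is what makes individual deletion expensive: removing one object shifts every later offset, so the container has to be rewritten and re-indexed.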

Page 17: Centera Integration Training

Managing Small Content - Containerization

Advantages
– Better utilization of available bandwidth
– Much faster ingest of a large number of small pieces of content
– Reduces the object count

Disadvantages
– More coding involved
– Deletion of individual object requires re-writing and re-indexing entire container
– No Single Instance Storage

Limited “use case”
– Only applicable where huge numbers of small objects need to be stored in the same C-Clip
  • Size of CDF would become unmanageable / non-performant (100MB limit)
– Embedded Blobs strategy is preferable in most situations

Page 18: Centera Integration Training

Managing Small Content - Hybrid Containerization

Combines aspects of Embedded Blobs and Containers

“Containers” are constructed using multiple embedded blobs
– CDF effectively becomes the container

Each blob still represents a single application-level object

[Diagram: one CDF containing the embedded blobs Check1003, Check1004, Check1005 and Check1006]

Page 19: Centera Integration Training

Managing Small Content - Hybrid vs. Classic Containers

Individual object indexing not required

Local storage managed by SDK rather than application
– Application does not need to build a local container
– The CDF becomes the container

Simplified deletion of individual objects from container

Automatic Single Instance Storage for objects larger than the embedded blob threshold
– No code changes required

Page 20: Centera Integration Training

Managing Huge Content - Blob Slicing

Write blob data to Centera using multiple threads provided by the application

Enables increased performance at time of write
– No increase in performance for blobs < 5MB in size

Segments are exported to Centera as if they are different blobs

Segments are referenced by a single tag
– The same method as the internal 100MB blob segmentation feature

Page 21: Centera Integration Training

Managing Huge Content - Blob Slicing

FPTag_BlobWritePartial(Tag, Stream, options, sequenceID)

Sequence ID determines the sequence of data written by multiple threads for one tag
– sets the order in which data is to be read back by the SDK
– must be greater than 0
– Duplicate IDs on a tag are not allowed and will return an error

Read-back is performed in ascending Sequence ID order

Transparently supports FPTag_BlobRead() and FPTag_BlobReadPartial().
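The sequence ID rules can be illustrated with a small in-memory model. This is not SDK code: the class `SlicedBlob` and its methods are hypothetical stand-ins that only mimic the rules stated above (IDs greater than 0, no duplicates, read-back in ascending ID order regardless of write order).

```python
class SlicedBlob:
    """In-memory stand-in for a tag that receives partial blob writes."""

    def __init__(self):
        self._segments = {}

    def write_partial(self, sequence_id, data):
        if sequence_id <= 0:
            raise ValueError("sequence ID must be greater than 0")
        if sequence_id in self._segments:
            raise ValueError(f"duplicate sequence ID {sequence_id}")
        self._segments[sequence_id] = data

    def read(self):
        # Read-back is in ascending sequence ID order, not arrival order.
        return b"".join(self._segments[k] for k in sorted(self._segments))


blob = SlicedBlob()
# Threads may finish in any order; the IDs fix the final layout.
blob.write_partial(3, b"cc")
blob.write_partial(1, b"aa")
blob.write_partial(2, b"bb")
print(blob.read())
```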

Page 22: Centera Integration Training

Managing Huge Content - Blob Slicing

Does not operate with any embedded options
– Linked data only
– FP_OPTION_EMBED_DATA causes an error
– FP_OPTION_EMBEDDED_DATA_THRESHOLD setting is ignored

Page 23: Centera Integration Training

Managing Huge Content - Blob Slicing

FPTag_CreatePartialFileForInput(FilePath, Perm, BuffSize, Offset, Length)

Similar to CreateFileForInput
– Allows for a section of the input file to be written

Two additional parameters
– Offset where reading should begin within the file
– Length of the segment to be read

Transparently supports FPTag_BlobWrite() and FPTag_BlobWritePartial().
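Before slicing, the application has to decide which byte range each thread will write. One way to compute the (sequence ID, offset, length) triples is sketched below; the function name `plan_segments` is hypothetical, and spreading any remainder over the first segments is just one reasonable choice.

```python
def plan_segments(file_size, segment_count):
    """Split file_size bytes into segment_count contiguous ranges.

    Returns a list of (sequence_id, offset, length) with sequence IDs
    starting at 1, as the Sequence ID must be greater than 0.
    """
    base, extra = divmod(file_size, segment_count)
    segments = []
    offset = 0
    for seq in range(1, segment_count + 1):
        # The first `extra` segments take one additional byte each.
        length = base + (1 if seq <= extra else 0)
        segments.append((seq, offset, length))
        offset += length
    return segments


for seq, offset, length in plan_segments(file_size=10, segment_count=3):
    print(f"sequence {seq}: offset={offset} length={length}")
```

Each triple maps onto one partial input stream (Offset, Length) and one partial write (sequence ID) handled by its own thread.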

Page 24: Centera Integration Training

Pool Management – Creation process

Pool of sockets (and associated data structures) is allocated

Probe packet is sent to the first address provided
– if a response is not received before timeout (configurable, default is 2 minutes), subsequent addresses will be tried in sequence

Response is received from another AN, which contains a list of all known ANs ordered by load

The replica for the cluster is probed (with the same default timeout of 2 minutes)

The entire replica chain is walked
– Allowing for timeouts, this could be a lengthy process

FPPool_Open("10.0.0.1,10.0.0.2")

[Diagram: FPPool_Open probing the Primary cluster, then Replica1 and Replica2; one probe is marked X as failed]
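To see why pool creation can be lengthy, consider a simplified worst case in which every unreachable address or replica costs one full probe timeout. The 2 minute default is taken from the slide; the additive model and the function name are assumptions made for illustration.

```python
PROBE_TIMEOUT_S = 120  # default probe timeout of 2 minutes, per the slide


def worst_case_open_delay_s(unreachable_addresses, unreachable_replicas,
                            timeout_s=PROBE_TIMEOUT_S):
    """Upper bound on time lost to timeouts during one pool open.

    Assumes probes are sequential and each failure waits out the full timeout.
    """
    return (unreachable_addresses + unreachable_replicas) * timeout_s


# One dead primary address plus one dead replica in the chain.
delay = worst_case_open_delay_s(unreachable_addresses=1, unreachable_replicas=1)
print(f"up to {delay / 60:.0f} minutes lost to timeouts")
```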

Page 25: Centera Integration Training

Pool Management – Recommended Strategy

FPPool_Open should be called once when the application is started

Subsequent Centera I/O should be done using this single Pool Reference

When the application shuts down, call FPPool_Close

[Diagram: timeline starting with FPPool_Open and ending with FPPool_Close, with a series of Write and Read operations in between]

Page 26: Centera Integration Training

Centera in a Multi-Application Repository Environment

[Diagram: three applications sharing one Centera]
Application #1: ECM
Application #2: Email Archiving
Application #3: PC Backup

Page 27: Centera Integration Training

How can we partition data access?

Virtual Pools

[Diagram: a Cluster Pool containing the Default Pool and Pool 1, Pool 2 and Pool 3, labelled App Pool 1, App Pool 2 and App Pool 3; the legend marks CDFs and Blobs]

Page 28: Centera Integration Training

Virtual Pools

Virtual Pools implement a form of Data Partitioning

Think of Virtual Pools as Virtual Centeras within a physical Centera cluster

Applications should
– connect to Virtual Pools through Access Profiles and Capabilities

Page 29: Centera Integration Training

Metadata

Metadata is the key to disaster recovery

Rich metadata helps to identify individual pieces of content given local repository failure

Clip level metadata can be retrieved as part of a Query Result set when performing Disaster Recovery
– FPQueryExpression_SelectField(myQueryExp, "aClipAttribute");

CenteraSeek relies on metadata
– capability to “Google” the Centera (this is not what Query is intended for!)
– chargeback reporting
– uses metadata only, not the document content itself
– all types of Metadata can be queried
  • Standard SDK metadata e.g. creation.date
  • Clip level metadata e.g. ApplicationName (attribute added by application)
  • Tag level metadata e.g. InvoiceNumber (attribute added by application)

Page 30: Centera Integration Training

Metadata - Sample CDF

<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<ecml version="3.0">
  <eclipdescription>
    <meta name="type" value="Standard"/>
    <meta name="name" value="TrainingClip"/>
    <meta name="creation.date" value="2004.08.05 09:31:19 GMT"/>
    <meta name="modification.date" value="2004.08.05 09:35:51 GMT"/>
    <meta name="numfiles" value="1"/>
    <meta name="totalsize" value="2082"/>
    <meta name="refid" value="5DD0B54HG7OCG3UTUGV1FP004Q"/>
    <meta name="prev.clip" value=""/>
    <meta name="clip.naming.scheme" value="MD5"/>
    <meta name="numtags" value="1"/>
    <meta name="sdk.version" value="3.0.377"/>
    <custom-meta name="ApplicationName" value="MetadataExample"/>
    <custom-meta name="ApplicationVendor" value="EMC_Engineering"/>
    <custom-meta name="ApplicationVersion" value="3.4"/>
  </eclipdescription>
  <eclipcontents>
    <myMainTag someMeaningfulAttribute="aValue" anotherAttribute="andAnotherValue">
      <eclipblob md5="DM6JEBLJFH9I" size="694327" offset="0"/>
    </myMainTag>
  </eclipcontents>
</ecml>
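To make the three kinds of metadata concrete, the sketch below parses a trimmed copy of the sample CDF with Python's standard `xml.etree.ElementTree`. It is for illustration only: an application would normally read these values through the SDK rather than parse the XML itself, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample CDF, with plain ASCII quotes.
CDF_XML = """<ecml version="3.0">
  <eclipdescription>
    <meta name="creation.date" value="2004.08.05 09:31:19 GMT"/>
    <meta name="prev.clip" value=""/>
    <custom-meta name="ApplicationName" value="MetadataExample"/>
    <custom-meta name="ApplicationVersion" value="3.4"/>
  </eclipdescription>
  <eclipcontents>
    <myMainTag someMeaningfulAttribute="aValue">
      <eclipblob md5="DM6JEBLJFH9I" size="694327" offset="0"/>
    </myMainTag>
  </eclipcontents>
</ecml>"""


def clip_metadata(root):
    """Return (standard, custom) clip-level metadata as two dicts."""
    desc = root.find("eclipdescription")
    standard = {m.get("name"): m.get("value") for m in desc.findall("meta")}
    custom = {m.get("name"): m.get("value") for m in desc.findall("custom-meta")}
    return standard, custom


def tag_metadata(root):
    """Return the attributes of each top-level tag, keyed by tag name."""
    return {tag.tag: dict(tag.attrib) for tag in root.find("eclipcontents")}


root = ET.fromstring(CDF_XML)
standard, custom = clip_metadata(root)
tags = tag_metadata(root)
print(standard["creation.date"], custom["ApplicationName"], tags["myMainTag"])
```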

Page 31: Centera Integration Training

Disaster Recovery

What happens if the local Content Address repository is lost?

Protect the relationship between the archive and the local store
– Metadata assists in local store reconstruction using Query
– Store the Transaction Logs or Incremental database backups on Centera
  • Email resulting C-Clip IDs to DBA or store on Profile Clip
  • Use a separate Virtual Pool exclusively for these backups
– Logs / Backups can be easily retrieved in DR scenario

[Diagram: application server and DB storing backups on Centera; the resulting Content Address (EFG242769LH32e57R23E2IBC4FE) is emailed "To: DBA"]

Page 32: Centera Integration Training

Storage Strategy Capacity (SSC)

Allows for Single Instance Storage
– prevents identical content being stored numerous times

Uses M or M++ naming schemes
– M++ performs additional SHA-256 calculation (performance overhead)
– Reduces collision potential

Content Address is 27 (M) or 53 (M++) bytes long
– M++ CA incorporates part of the SHA-256 calculated ID

Page 33: Centera Integration Training

Storage Strategy Performance (SSPP / SSPF)

Blobs below the threshold (default 250K) are written using a 53 byte Content Address incorporating a Time element

2 different types are available
– Partial (SSPP)
  • C-Clips retain the standard 27 byte Content Address
  • Should only be used by applications which cannot handle a 53 byte CA
– Full (SSPF)
  • C-Clips and Blobs both use the 53 byte CA

Always use FP_OPTION_CLIENT_CALCID_STREAMING as content cannot pre-exist due to the change in the Content Address

When using Generic Streams of unknown size, set FP_OPTION_PREFETCH_SIZE >= SSP threshold to ensure the correct strategy is used

Use the SDK defined constant for Content Address Length to ensure compatibility with future naming schemes

Page 34: Centera Integration Training

SDK Options

The SDK allows for configuration of many options to control different aspects of PoolConnection behaviour
– Buffer sizes
– Failover strategies
– Storage strategies
– Timeouts / Retry limits

2 main types of options
– Global options which apply to all PoolReferences created by the application
– Local options which affect individual PoolReference instances

Page 35: Centera Integration Training

Buffer Sizes

FP_OPTION_BUFFERSIZE
– Size of the CDF buffer in bytes.
  • Will swap out to disc if CDF is larger than the buffer
  • Default 16KB / Min 1KB / Max 10MB
– Set this value to exceed the total CDF size (XML + Blob data) to avoid swapping to disc
  • e.g. (150 * 1024) for a C-Clip with a single embedded blob

FP_OPTION_PREFETCH_SIZE
– Size of the temporary buffer used by the SDK to assist with content length based storage decisions
  • Storage Strategy Performance or Capacity
  • Parity or Mirroring
  • Default 32 KB / Maximum 1 MB.
– If using Generic Streams of unknown length, set this value to 1MB
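The (150 * 1024) example for FP_OPTION_BUFFERSIZE is consistent with a 100KB embedded blob once Base64 expansion is included. The sketch below reproduces that arithmetic; the 16KB allowance for the XML portion of the CDF is a hypothetical figure, and the 1KB / 10MB clamps are the limits quoted above.

```python
MIN_BUFFER = 1024               # 1KB minimum, per the slide
MAX_BUFFER = 10 * 1024 * 1024   # 10MB maximum, per the slide


def base64_len(n):
    """Length of n raw bytes after Base64 encoding (4 chars per 3 bytes, padded)."""
    return 4 * ((n + 2) // 3)


def cdf_buffer_size(blob_sizes, xml_overhead_bytes=16 * 1024):
    """Buffer size needed to hold the CDF XML plus all embedded blobs.

    xml_overhead_bytes is an assumed allowance for tags and attributes.
    The result is clamped to the documented minimum and maximum.
    """
    needed = xml_overhead_bytes + sum(base64_len(n) for n in blob_sizes)
    return min(max(needed, MIN_BUFFER), MAX_BUFFER)


size = cdf_buffer_size([100 * 1024])   # one 100KB embedded blob
print(size, size <= 150 * 1024)
```

If the computed size hits the 10MB clamp, the CDF will not fit in the buffer and swapping to disc cannot be avoided by tuning this option alone.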

Page 36: Centera Integration Training

Failover Strategies

Can be enabled or disabled for all operations
– FP_OPTION_ENABLE_MULTICLUSTER_FAILOVER

Network failover occurs when connectivity to the Primary cluster is lost

Content failover occurs when an operation (read, write, delete, query or exists) relating to a particular C-Clip does not return success
– Can be configured separately for each operation
  • FP_OPTION_MULTICLUSTER_XXXX_STRATEGY
– Three possible strategies
  • FP_NO_STRATEGY / FP_FAILOVER_STRATEGY / FP_REPLICATION_STRATEGY
  • NB: Not all strategies are supported for all operations and default behavior differs.

Page 37: Centera Integration Training

Storage Strategies

FP_OPTION_DEFAULT_COLLISION_AVOIDANCE
– Enable (by supplying value FP_TRUE) if
  • Single Instance Storage is unlikely to be of value
  • Even a remote chance of a collision is unacceptable
– can also be used as an option on FPTag_BlobWrite to control behaviour at the Object (rather than Pool) level

FP_OPTION_EMBEDDED_DATA_THRESHOLD
– Already covered in depth

Page 38: Centera Integration Training

Setting Options as Environment Variables

Any pool option may be set as an environment variable
– When set in this way, all options behave as if they are GlobalOptions
– e.g. set FP_OPTION_OPENSTRATEGY="FP_LAZY_OPEN"

Options set within the application code take precedence over those set in the environment

Benefits:
– Allows customers to adopt options that are not used by the developer/ISV
– Increases options for troubleshooting applications in the field
– Can sometimes enable immediate bug fixes in the field as the developer/ISV works on a patch release
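The precedence rule (code first, then environment, then the built-in default) can be modelled in a few lines. This sketch only mirrors the rule as stated above; the function `resolve_option` and the value "VALUE_SET_IN_CODE" are hypothetical, and the SDK performs this resolution internally.

```python
import os


def resolve_option(name, code_value=None, default=None, environ=None):
    """Return the effective value of an option.

    A value set in application code wins; otherwise the environment
    variable of the same name is used; otherwise the default applies.
    """
    environ = os.environ if environ is None else environ
    if code_value is not None:
        return code_value
    return environ.get(name, default)


env = {"FP_OPTION_OPENSTRATEGY": "FP_LAZY_OPEN"}

# Option not set in code: the environment value applies.
print(resolve_option("FP_OPTION_OPENSTRATEGY", environ=env))

# Option set in code: the environment value is ignored.
print(resolve_option("FP_OPTION_OPENSTRATEGY", code_value="VALUE_SET_IN_CODE", environ=env))
```

One consequence for developers: an option hard-coded in the application cannot later be overridden in the field through the environment.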

Page 39: Centera Integration Training

Authorized Capabilities

Identifies the Authorized operations that can be performed by an Application connecting to a Pool / Profile combination
– Read, Write, Delete, Exists, Query, Privileged Delete, Monitor
– Purge capability is deprecated and should not be used

Applications should proactively determine which capabilities are available (see FPPool_GetCapability() in API Guide)
– Configure the application user interface / options accordingly

Page 40: Centera Integration Training

Managing Retention

C-Clips can have an associated Retention Period or Retention Class
– Mandating the setting of retention can be enforced at the pool level.

If the application does not require Retention, the Retention Period should be explicitly set to zero.
– Do not rely on default Retention Period of the cluster
  • Default on CE+ is infinity!

Retention is applied to the whole CDF rather than individual Tags.
– C-Clip should contain objects with the same retention requirements

Use Retention Classes for well defined “common” retention Periods (typically set by Regulatory Body)
– Allows “painless” update should the period change

CentraStar 3.1 introduced Advanced Retention features
– Separate license
– Event Based Retention / Litigation Hold / Retention Governors

Page 41: Centera Integration Training

Managing Retention

Retention is calculated from creation.date timestamp
– set when the FPClip_Create is called

Retention can be changed only by creating new clip
– existing C-Clip is opened
– updates are made e.g. retention period / class changed
– FPClip_Write is called and a new Content Address is returned
– old Content Address is saved in prev.clip
– new creation timestamp is set in modification.date

Important: a new clip is created, so you must either
– Delete the old one immediately using DELETE_PRIVILEGED
  • Not available on CE+ editions
– Or “Walk” the clip chain and delete when *all* the retentions have expired
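Walking the clip chain can be sketched with plain dictionaries. This is a model of the logic only, not SDK code: each clip record here holds a hypothetical `prev_clip`, `creation_date` and `retention_s` field, and the chain is considered deletable only once every clip's retention (counted from its creation date) has expired.

```python
from datetime import datetime, timedelta


def retention_expiry(clip):
    """Retention is calculated from the clip's creation date."""
    return clip["creation_date"] + timedelta(seconds=clip["retention_s"])


def walk_chain(clips, newest_id):
    """Follow prev_clip links from the newest clip back to the original."""
    chain = []
    clip_id = newest_id
    while clip_id:
        chain.append(clip_id)
        clip_id = clips[clip_id]["prev_clip"]
    return chain


def deletable_chain(clips, newest_id, now):
    """Return the chain if *all* retentions have expired, else an empty list."""
    chain = walk_chain(clips, newest_id)
    if all(retention_expiry(clips[c]) <= now for c in chain):
        return chain
    return []


t0 = datetime(2007, 1, 1)
clips = {
    "A": {"prev_clip": "", "creation_date": t0, "retention_s": 100},
    "B": {"prev_clip": "A", "creation_date": t0, "retention_s": 50},
}
print(deletable_chain(clips, "B", t0 + timedelta(seconds=60)))
print(deletable_chain(clips, "B", t0 + timedelta(seconds=100)))
```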

Page 42: Centera Integration Training

Application Registration

Applications should register via FPPool_RegisterApplication() prior to calling FPPool_Open()

For each pool connection established, the Centera records:
– Application name / version (as specified in the API call)
– Hostname
– Hardware platform
– Operating system
– Profile used to connect
– SDK version

The Application Name supplied to the API call should be constant
– do not use argv[0]!

Information is retained for all application instances which have authenticated for a period equal to the ‘audit log retention period’.

Page 43: Centera Integration Training

Questions?

Please Visit SDK Forums on Centera Developer’s Portal
– http://lighthouse.emc.com

Corporate Systems Engineering team provides:
– Info, Training, Whitepapers
– Design Advice / Design Reviews
– Debugging Support
– Code Reviews

Email Support Available from:
– [email protected]
