Centera Integration Training
EMC Centera – API Best Practices
© Copyright 2007 EMC Corporation. All rights reserved.
Centera Integration Training
Centera API Best Practices
Corporate Systems Engineering
July 2007
Agenda
Refresher on Centera API
Performance Optimizations
Metadata
Storage Strategies & ClipID Formats
SDK Options (Buffers / Failover)
Managing Retention
Wrap Up and Questions
API Background
Understand the main Centera API concepts
  – Pools, Clips, Streams, Tags, Blobs, Query
Appreciate the following points:
  – Open, read, write, query and delete transactions contact the Centera, while clip create, get/set attribute and close are local SDK operations
  – One transaction uses one socket, which is one thread of access
  – C-Clips only have relationships with blobs, never with other C-Clips
  – Every write operation carries a high transactional overhead, which affects the writing of small objects
Understanding these points makes integration easier, more efficient and more effective!
Deploy In A Multi-Tier Architecture
[Diagram: Clients Running Application → application server (Application, API Library, OS) → Centera (CentraStar Software)]
• CentraStar Software resides on the Centera.
• API Library resides on the application server.
• Client application on PC connects to application server.
• Server application makes calls to Centera-supplied API.
• API interacts with Centera over TCP/IP using HPP protocol.
Transaction Example (Write)
• The client sends a write (import) request to the server
• The server sends an acknowledgement to the client
• The client begins streaming its data to the server
• After the client has sent its last packet of data, it begins waiting for an ACK from the server
• The server distributes the data to the Storage Nodes, and updates its indices
• The server sends an acknowledgement to the client and the write operation returns
[Diagram: the Client is the Application Server (Application, API Library, Operating System); the server is the Centera]
Client                        Centera
Request  ------>
         <------  ACK
Data     ------>
Data     ------>
Data     ------>
Data     ------>
(Client waiting for ACK)
         <------  ACK
Performance Optimizations
Multithreading
Small Object Management
  – Embedded Blobs
  – Containerization
  – Hybrid Containerization
Huge Object Management
  – Blob Slicing
Pool Management
Optimizing Write Performance
So for a single stream over time we see …
[Diagram: timeline of one write transaction (setup, data transfer, commit) for a large object (~5 MB) and for a small object (~50 KB)]
… with small objects, most of the total transaction time is spent in setup and commit, hardly any in data transfer!
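A rough model makes the point concrete. The sketch below assumes a fixed setup-plus-commit cost of 40 ms per transaction and a transfer rate of 50 MB/s; both figures are illustrative assumptions, not measured Centera values.

```python
def overhead_fraction(size_bytes, fixed_overhead_s=0.040, bytes_per_s=50e6):
    """Share of one write transaction spent in setup and commit.

    fixed_overhead_s and bytes_per_s are assumed, illustrative figures.
    """
    transfer_s = size_bytes / bytes_per_s
    return fixed_overhead_s / (fixed_overhead_s + transfer_s)


small = overhead_fraction(50 * 1024)        # ~50 KB object
large = overhead_fraction(5 * 1024 * 1024)  # ~5 MB object
print(f"small object: {small:.0%} overhead, large object: {large:.0%} overhead")
```

Whatever the real constants are on a given cluster, the shape is the same: the fixed cost dominates as the object shrinks.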
Multithreaded Writes
[Diagram: eight threads (Thread 1 to Thread 8) writing at the same time, their transactions overlapping]
Multithreading provides overlapped I/O: more data is transferred in a given period of time.
Multithreading
Advantages
  – Better utilization of available bandwidth
  – Overlapped I/O yields better throughput
  – Takes advantage of multiple access nodes
  – Shared PoolConnection for improved load balancing
Disadvantages
  – More coding required
  – Multithreaded coding/debugging is generally trickier than single-threaded programming
  – Thread packages differ from platform to platform
  – Scales to a point, then rolls off
Multithreading
Make the number of threads configurable individually for Read and Write
A good combined number is 20 threads per access node
  – This needs to be configured at install time
  – For large numbers of threads, increase the value of FP_OPTION_MAXCONNECTIONS (default is 100)
No application exists in a vacuum
  – Be conscious of workload imposed by other applications
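The sizing guideline can be sketched as follows. This is a minimal model, not SDK code: `fake_write` is a hypothetical stand-in for a write transaction, and the even read/write split is an assumption that a real application would make configurable.

```python
import time
from concurrent.futures import ThreadPoolExecutor

THREADS_PER_ACCESS_NODE = 20   # combined read + write guideline from the slide
DEFAULT_MAX_CONNECTIONS = 100  # default of FP_OPTION_MAXCONNECTIONS per the slide


def thread_budget(access_nodes, write_share=0.5):
    """Split the combined thread budget into (writers, readers)."""
    total = THREADS_PER_ACCESS_NODE * access_nodes
    writers = round(total * write_share)
    return writers, total - writers


def needs_more_connections(access_nodes):
    """True when the thread budget exceeds the default connection limit."""
    return THREADS_PER_ACCESS_NODE * access_nodes > DEFAULT_MAX_CONNECTIONS


def fake_write(item):
    """Hypothetical stand-in for one write transaction."""
    time.sleep(0.01)  # simulated setup/commit latency
    return len(item)


def write_all(items, writers):
    """Run the writes on worker threads so their latencies overlap."""
    with ThreadPoolExecutor(max_workers=writers) as pool:
        return list(pool.map(fake_write, items))
```

With four access nodes the budget is 80 threads, which still fits the default connection limit; six access nodes would need the limit raised.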
Managing Small Content - Object Count Limitations
CentraStar 3.1 on Gen 4 hardware has a limit of 50 million objects per node
This count includes all types of Centera objects
  – CDF
  – Blobs
  – Mirror copies
  – Parity fragments
  – Reflections
CDF should be designed to fully utilise capacity (bytes) before these object count limits are encountered.
Embedded Blobs cut down object usage by at least 50%
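The "at least 50%" figure follows from simple counting. The sketch below assumes mirrored protection (each object stored twice) and ignores reflections and parity fragments; it is a simplified model, not the cluster's actual accounting.

```python
def objects_per_clip(blobs_per_clip, embedded, copies=2):
    """Stored objects consumed by one C-Clip.

    copies=2 assumes mirrored protection (original plus one mirror copy);
    reflections and parity fragments are ignored in this simplified model.
    """
    logical = 1 if embedded else 1 + blobs_per_clip  # CDF only vs. CDF + blobs
    return logical * copies


def reduction(blobs_per_clip):
    """Fraction of objects saved by embedding instead of linking."""
    linked = objects_per_clip(blobs_per_clip, embedded=False)
    embedded = objects_per_clip(blobs_per_clip, embedded=True)
    return 1 - embedded / linked


for n in (1, 4, 9):
    print(f"{n} blob(s) per clip: {reduction(n):.0%} fewer objects")
```

One blob per clip gives exactly half; the saving grows as more blobs share a CDF.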
Managing Small Content - Embedded Blobs
Whenever a write or read is done, two objects are transferred
Writes
  – the Blob is written and added to the Tag in the CDF being constructed; when fully constructed, the CDF is written.
Reads
  – the CDF is read, the application navigates to the Tag containing the content, and the Blob is retrieved
[Diagram: a CDF and a separate Blob]
Managing Small Content - Embedded Blobs
With embedded blobs, there is no separate blob so all data is transferred as a single object when the CDF is read or written
The SDK transparently stores the Blob as an Attribute on the Tag inside the CDF
  – Base64 encoded to adhere to the XML character set
  – Developer does not write any “special” code other than enabling the feature
I/O operations are reduced by at least half
  – only the CDF is read/written
  – proportionately greater savings if multiple blobs are stored in the CDF
Can only be used for relatively small objects (< 100KB)
[Diagram: a single CDF with the Blob embedded inside it]
Managing Small Content - Embedded Blobs
Embedded Blobs are easily enabled within an application
  – Globally via an FPPool option
      FPPool_SetGlobalOption(FP_OPTION_EMBEDDED_DATA_THRESHOLD, 100*1024)
    • Threshold (100KB in the example above, max is 100KB) is then used to determine how the data is stored
  – Explicitly on the FPTag_BlobWrite call
      FPTag_BlobWrite(theTag, theStream, FP_OPTION_LINK_DATA)
      FPTag_BlobWrite(theTag, theStream, FP_OPTION_EMBED_DATA)
    • Overrides any Global setting that is in force
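The interaction between the global threshold and the per-call option can be modelled as below. This is a sketch of the decision only: the function name and the string values are hypothetical, and the strict "less than" comparison at the threshold boundary is an assumption, as is the omission of any error the SDK may raise for oversized explicit embeds.

```python
MAX_EMBED_THRESHOLD = 100 * 1024  # maximum threshold stated on the slide


def storage_mode(blob_size, global_threshold=0, per_call=None):
    """Return "embed" or "link" for one blob write (illustrative model).

    per_call mimics FP_OPTION_EMBED_DATA / FP_OPTION_LINK_DATA and, when given,
    overrides the global threshold. The boundary comparison is an assumption.
    """
    if per_call in ("embed", "link"):
        return per_call
    threshold = min(global_threshold, MAX_EMBED_THRESHOLD)
    return "embed" if blob_size < threshold else "link"


print(storage_mode(10 * 1024, global_threshold=100 * 1024))
print(storage_mode(10 * 1024, global_threshold=100 * 1024, per_call="link"))
```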
Managing Small Content - Embedded Blobs
Advantages
  – Can dramatically decrease object count usage if multiple blobs are stored embedded in each CDF
  – Reduces I/Os (blob does not need to be read separately)
  – Easy to code
Disadvantages
  – Single instance storage is lost
  – XML-compatible data encoding (Base64) increases the storage requirement by 33%
  – Read performance can be impacted
    • All blob content is retrieved when opening the clip
    • The larger CDF takes longer to parse
    • Standard guidelines should be followed, i.e. CDF size < 10MB
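The 33% figure is a property of Base64 itself (every 3 input bytes become 4 output characters) and can be checked with the standard library. The payload below is arbitrary test data, not Centera content.

```python
import base64

payload = bytes(range(256)) * 300          # 76,800 bytes of arbitrary binary data
encoded = base64.b64encode(payload)        # what would sit inside the XML attribute
overhead = len(encoded) / len(payload) - 1

print(f"{len(payload)} raw bytes -> {len(encoded)} encoded bytes (+{overhead:.0%})")
```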
Managing Small Content - Containerization
Small objects are collected and inserted into a larger container object
Each individual object's byte offset and length is stored in the metadata
When an object is retrieved, a byte offset read is done through the API and only the small object is returned
[Diagram: a Content Descriptor File holding index entries that point into a Container Blob]
Image name   Byte offset   Length
1023.jpg     21078         2497
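The container-plus-index idea can be sketched in memory as follows. The function names are hypothetical; in a real integration the index would live in the clip metadata and the read would be a byte-offset read through the API.

```python
def build_container(objects):
    """Concatenate small objects and record (offset, length) for each name."""
    container = bytearray()
    index = {}
    for name, data in objects.items():
        index[name] = (len(container), len(data))
        container += data
    return bytes(container), index


def read_object(container, index, name):
    """Return only the requested object using its stored offset and length."""
    offset, length = index[name]
    return container[offset:offset + length]


container, index = build_container({"1023.jpg": b"\xff\xd8AAAA", "1024.jpg": b"\xff\xd8BB"})
print(index)
print(read_object(container, index, "1024.jpg"))
```

Deleting one entry means rebuilding both the container and the index, which is the main cost listed among the disadvantages.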
Managing Small Content - Containerization
Advantages
  – Better utilization of available bandwidth
  – Much faster ingest of a large number of small pieces of content
  – Reduces the object count
Disadvantages
  – More coding involved
  – Deletion of an individual object requires re-writing and re-indexing the entire container
  – No Single Instance Storage
Limited “use case”
  – Only applicable where huge numbers of small objects need to be stored in the same C-Clip
    • Size of CDF would become unmanageable / non-performant (100MB limit)
  – Embedded Blobs strategy is preferable in most situations
Managing Small Content - Hybrid Containerization
Combines aspects of Embedded Blobs and Containers
“Containers” are constructed using multiple embedded blobs
  – CDF effectively becomes the container
Each blob still represents a single application-level object
[Diagram: one CDF containing embedded blobs Check1003, Check1004, Check1005 and Check1006]
Managing Small Content - Hybrid vs. Classic Containers
Individual object indexing not required
Local storage managed by SDK rather than application
  – Application does not need to build a local container
  – The CDF becomes the container
Simplified deletion of individual objects from container
Automatic Single Instance Storage for objects larger than the embedded blob threshold
  – No code changes required
Managing Huge Content - Blob Slicing
Write blob data to Centera using multiple threads provided by the application
Enables increased performance at time of write
  – No increase in performance for blobs < 5MB in size
Segments are exported to Centera as if they are different blobs
Segments are referenced by a single tag
  – The same method as the internal 100MB blob segmentation feature
Managing Huge Content - Blob Slicing
FPTag_BlobWritePartial(Tag, Stream, options, sequenceID)
Sequence ID determines the sequence of data written by multiple threads for one tag
  – sets the order in which data is to be read back by the SDK
  – must be greater than 0
  – Duplicate IDs on a tag are not allowed and will return an error
Read-back is performed in ascending Sequence ID order
Transparently supports FPTag_BlobRead() and FPTag_BlobReadPartial().
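The sequence ID rules can be modelled with a plain dictionary standing in for the tag. This is an illustrative model of the ordering and validation rules only; `write_partial` and `read_back` are hypothetical names, not SDK calls.

```python
def write_partial(tag, data, sequence_id):
    """Record one slice on the tag, enforcing the sequence ID rules."""
    if sequence_id <= 0:
        raise ValueError("sequence ID must be greater than 0")
    if sequence_id in tag:
        raise ValueError("duplicate sequence ID on this tag")
    tag[sequence_id] = data


def read_back(tag):
    """Reassemble the slices in ascending sequence ID order."""
    return b"".join(tag[seq] for seq in sorted(tag))


tag = {}
write_partial(tag, b"world", 2)   # slices may arrive in any order
write_partial(tag, b"hello ", 1)
print(read_back(tag))
```

Because ordering comes from the ID rather than arrival time, writer threads do not need to coordinate with each other beyond choosing distinct IDs.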
Managing Huge Content - Blob Slicing
Does not operate with any embedded options
  – Linked data only
  – FP_OPTION_EMBED_DATA causes an error
  – FP_OPTION_EMBEDDED_DATA_THRESHOLD setting is ignored
Managing Huge Content - Blob Slicing
FPStream_CreatePartialFileForInput(FilePath, Perm, BuffSize, Offset, Length)
Similar to CreateFileForInput
  – Allows for a section of the input file to be written
Two additional parameters
  – Offset where reading should begin within the file
  – Length of the segment to be read
Transparently supports FPTag_BlobWrite() and FPTag_BlobWritePartial().
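Before starting the writer threads, the application has to decide which offset and length each one gets. A minimal sketch of that planning step is shown below; `plan_slices` is a hypothetical helper, and an even split is only one possible policy.

```python
def plan_slices(file_size, slice_count):
    """Return (sequence_id, offset, length) tuples that cover the whole file.

    Sequence IDs start at 1 because they must be greater than 0.
    """
    base, extra = divmod(file_size, slice_count)
    slices, offset = [], 0
    for seq in range(1, slice_count + 1):
        length = base + (1 if seq <= extra else 0)  # spread the remainder
        slices.append((seq, offset, length))
        offset += length
    return slices


for seq, offset, length in plan_slices(10, 3):
    print(f"slice {seq}: offset={offset} length={length}")
```

Each tuple maps onto one partial input stream (Offset, Length) and one partial blob write (sequence ID).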
Pool Management – Creation process
Pool of sockets (and associated data structures) is allocated
Probe packet is sent to the first address provided
  – if a response is not received before timeout (configurable, default is 2 minutes), subsequent addresses will be tried in sequence
Response is received from another AN (access node), which contains a list of all known ANs ordered by load
The replica for the cluster is probed (with the same default timeout of 2 minutes)
The entire replica chain is walked
  – Allowing for timeouts, this could be a lengthy process
FPPool_Open("10.0.0.1,10.0.0.2")
[Diagram: Primary, Replica1 and Replica2 clusters in a chain; an X marks a failed probe]
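To see why this can be lengthy, consider a back-of-the-envelope estimate. It assumes probes are made one after another and that each unreachable target waits out the full timeout; real behaviour may differ.

```python
PROBE_TIMEOUT_S = 120  # default probe timeout from the slide (2 minutes)


def worst_case_open_seconds(unreachable_addresses, unreachable_replicas,
                            timeout_s=PROBE_TIMEOUT_S):
    """Upper bound on time lost to timeouts during one pool open (model only)."""
    return (unreachable_addresses + unreachable_replicas) * timeout_s


# One dead address in the connection string plus two unreachable replicas.
print(worst_case_open_seconds(1, 2) / 60, "minutes")
```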
Pool Management – Recommended Strategy
FPPool_Open should be called once when the application is started
Subsequent Centera I/O should be done using this single Pool Reference
When the application shuts down, call FPPool_Close
[Diagram: FPPool_Open, then a long series of Write and Read operations on the same pool reference, then FPPool_Close]
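The open-once pattern looks like this in outline. `FakePool` is a hypothetical stand-in that only counts how often it is opened; it does not talk to a cluster.

```python
class FakePool:
    """Hypothetical stand-in for a pool reference; counts how often it is opened."""

    opens = 0

    def __init__(self, addresses):
        FakePool.opens += 1
        self.addresses = addresses.split(",")
        self.is_open = True

    def close(self):
        self.is_open = False


def run_application(operations):
    """Open once at start-up, reuse the reference for all I/O, close on shutdown."""
    pool = FakePool("10.0.0.1,10.0.0.2")
    try:
        return [(op, pool.is_open) for op in operations]
    finally:
        pool.close()
```

Opening a pool per operation would repeat the probing sequence every time, which is the cost the recommended strategy avoids.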
Centera in a Multi-Application Repository Environment
[Diagram: one Centera shared by three applications]
Application #1: ECM
Application #2: Email Archiving
Application #3: PC Backup
Virtual Pools
How can we partition data access?
[Diagram: a Cluster Pool containing the Default Pool and Pool 1, Pool 2 and Pool 3, with App Pool 1, App Pool 2 and App Pool 3 alongside; legend symbols for CDF and Blob]
Virtual Pools
Virtual Pools implement a form of Data Partitioning
Think of Virtual Pools as Virtual Centeras within a physical Centera cluster
Applications should
  – connect to Virtual Pools through Access Profiles and Capabilities
Metadata
Metadata is the key to disaster recovery
Rich metadata helps to identify individual pieces of content given local repository failure
Clip level metadata can be retrieved as part of a Query Result set when performing Disaster Recovery
  – FPQueryExpression_SelectField(myQueryExp, "aClipAttribute");
CenteraSeek relies on metadata
  – capability to “Google” the Centera
    • this is not what Query is intended for!
  – chargeback reporting
  – uses metadata only, not the document content itself
  – all types of Metadata can be queried
    • Standard SDK metadata e.g. creation.date
    • Clip level metadata e.g. ApplicationName (attribute added by application)
    • Tag level metadata e.g. InvoiceNumber (attribute added by application)
Metadata - Sample CDF
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<ecml version="3.0">
<eclipdescription>
<meta name="type" value="Standard"/>
<meta name="name" value="TrainingClip"/>
<meta name="creation.date" value="2004.08.05 09:31:19 GMT"/>
<meta name="modification.date" value="2004.08.05 09:35:51 GMT"/>
<meta name="numfiles" value="1"/>
<meta name="totalsize" value="2082"/>
<meta name="refid" value="5DD0B54HG7OCG3UTUGV1FP004Q"/>
<meta name="prev.clip" value=""/>
<meta name="clip.naming.scheme" value="MD5"/>
<meta name="numtags" value="1"/>
<meta name="sdk.version" value="3.0.377"/>
<custom-meta name="ApplicationName" value="MetadataExample"/>
<custom-meta name="ApplicationVendor" value="EMC_Engineering"/>
<custom-meta name="ApplicationVersion" value="3.4"/>
</eclipdescription>
<eclipcontents>
<myMainTag someMeaningfulAttribute="aValue" anotherAttribute="andAnotherValue">
<eclipblob md5="DM6JEBLJFH9I" size="694327" offset="0"/>
</myMainTag>
</eclipcontents>
</ecml>
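To show where the three kinds of metadata sit in a CDF like the one above, the sketch below parses a shortened copy with Python's standard XML parser. This is for illustration only: an application would normally read these values through the SDK's attribute calls rather than parse the XML itself, and `clip_metadata` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample CDF, for illustration.
SAMPLE_CDF = """<ecml version="3.0">
  <eclipdescription>
    <meta name="name" value="TrainingClip"/>
    <meta name="creation.date" value="2004.08.05 09:31:19 GMT"/>
    <custom-meta name="ApplicationName" value="MetadataExample"/>
    <custom-meta name="ApplicationVersion" value="3.4"/>
  </eclipdescription>
  <eclipcontents>
    <myMainTag someMeaningfulAttribute="aValue">
      <eclipblob md5="DM6JEBLJFH9I" size="694327" offset="0"/>
    </myMainTag>
  </eclipcontents>
</ecml>"""


def clip_metadata(cdf_xml):
    """Return (standard, clip-level custom, tag-level) metadata dictionaries."""
    root = ET.fromstring(cdf_xml)
    desc = root.find("eclipdescription")
    standard = {m.get("name"): m.get("value") for m in desc.findall("meta")}
    custom = {m.get("name"): m.get("value") for m in desc.findall("custom-meta")}
    tags = {t.tag: dict(t.attrib) for t in root.find("eclipcontents")}
    return standard, custom, tags


standard, custom, tags = clip_metadata(SAMPLE_CDF)
print(standard["creation.date"], custom["ApplicationName"], tags["myMainTag"])
```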
Disaster Recovery
What happens if local Content Address repository is lost?
Protect the relationship between the archive and the local store
  – Metadata assists in local store reconstruction using Query
  – Store the Transaction Logs or Incremental database backups on Centera
    • Email resulting C-Clip IDs to DBA or store on Profile Clip
    • Use a separate Virtual Pool exclusively for these backups
Logs / Backups can be easily retrieved in DR scenario
[Diagram: Application Server with its DB and the Centera; a C-Clip ID (EFG242769LH32e57R23E2IBC4FE) is emailed To: DBA]
Storage Strategy Capacity (SSC)
Allows for Single Instance Storage
  – prevents identical content being stored numerous times
Uses M or M++ naming schemes
  – M++ performs additional SHA-256 calculation (performance overhead)
  – Reduces collision potential
Content Address is 27 (M) or 53 (M++) bytes long
  – M++ CA incorporates part of the SHA-256 calculated ID
Storage Strategy Performance (SSPP / SSPF)
Blobs below the threshold (default 250K) are written using a 53 byte Content Address incorporating a Time element
2 different types are available
  – Partial (SSPP)
    • C-Clips retain the standard 27 byte Content Address
    • Should only be used by applications which cannot handle a 53 byte CA
  – Full (SSPF)
    • C-Clips and Blobs both use the 53 byte CA
Always use FP_OPTION_CLIENT_CALCID_STREAMING as content cannot pre-exist due to the change in the Content Address
When using Generic Streams of unknown size, set FP_OPTION_PREFETCH_SIZE >= SSP threshold to ensure the correct strategy is used
Use the SDK defined constant for Content Address Length to ensure compatibility with future naming schemes
SDK Options
The SDK allows for configuration of many options to control different aspects of PoolConnection behaviour
  – Buffer sizes
  – Failover strategies
  – Storage strategies
  – Timeouts / Retry limits
2 main types of options
  – Global options which apply to all PoolReferences created by the application
  – Local options which affect individual PoolReference instances
Buffer Sizes
FP_OPTION_BUFFERSIZE
  – Size of the CDF buffer in bytes.
    • Will swap out to disc if CDF is larger than the buffer
    • Default 16KB / Min 1KB / Max 10MB
  – Set this value to exceed the total CDF size (XML + Blob data) to avoid swapping to disc
    • e.g. (150 * 1024) for a C-Clip with a single embedded blob
FP_OPTION_PREFETCH_SIZE
  – Size of the temporary buffer used by the SDK to assist with content length based storage decisions
    • Storage Strategy Performance or Capacity
    • Parity or Mirroring
    • Default 32 KB / Maximum 1 MB.
  – If using Generic Streams of unknown length, set this value to 1MB
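The 150 * 1024 example can be sanity-checked by estimating the CDF size for a clip with embedded blobs. In the sketch below the 8 KB allowance for XML markup and attributes is an assumed figure, and `suggested_cdf_buffer` is a hypothetical helper, not an SDK call.

```python
import math

KB = 1024
MIN_BUFFER = 1 * KB           # minimum from the slide
MAX_BUFFER = 10 * 1024 * KB   # maximum from the slide (10MB)


def base64_len(raw_bytes):
    """Length of the Base64 encoding of raw_bytes bytes (with padding)."""
    return 4 * math.ceil(raw_bytes / 3)


def suggested_cdf_buffer(blob_sizes, xml_overhead=8 * KB):
    """Estimate a CDF buffer size that avoids swapping to disc.

    xml_overhead is an assumed allowance for tags and attributes.
    """
    needed = xml_overhead + sum(base64_len(size) for size in blob_sizes)
    return max(MIN_BUFFER, min(needed, MAX_BUFFER))


estimate = suggested_cdf_buffer([100 * KB])  # one embedded blob at the 100KB limit
print(estimate, "bytes needed;", 150 * KB, "bytes in the slide's example")
```

Under these assumptions a single 100KB embedded blob needs roughly 141KB once encoded, so a 150KB buffer leaves some headroom.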
Failover Strategies
Can be enabled or disabled for all operations
  – FP_OPTION_ENABLE_MULTICLUSTER_FAILOVER
Network failover occurs when connectivity to the Primary cluster is lost
Content failover occurs when an operation (read, write, delete, query or exists) relating to a particular C-Clip does not return success
  – Can be configured separately for each operation
    • FP_OPTION_MULTICLUSTER_XXXX_STRATEGY
  – Three possible strategies
    • FP_NO_STRATEGY / FP_FAILOVER_STRATEGY / FP_REPLICATION_STRATEGY
    • NB: Not all strategies are supported for all operations and default behavior differs.
Storage Strategies
FP_OPTION_DEFAULT_COLLISION_AVOIDANCE
  – Enable (by supplying value FP_TRUE) if
    • Single Instance Storage is unlikely to be of value
    • Even a remote chance of a collision is unacceptable
  – can also be used as an option on FPTag_BlobWrite to control behaviour at the Object (rather than Pool) level
FP_OPTION_EMBEDDED_DATA_THRESHOLD
  – Already covered in depth
Setting Options as Environment Variables
Any pool option may be set as an environment variable
  – When set in this way, all options behave as if they are GlobalOptions
  – e.g. set FP_OPTION_OPENSTRATEGY="FP_LAZY_OPEN"
Options set within the application code take precedence over those set in the environment
Benefits:
  – Allows customers to adopt options that are not used by the developer/ISV
  – Increases options for troubleshooting applications in the field
  – Can sometimes enable immediate bug fixes in the field as the developer/ISV works on a patch release
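The precedence rule (code over environment) can be sketched as a simple lookup. This models the rule only; `effective_option` is a hypothetical helper, and the SDK performs the real resolution internally.

```python
import os


def effective_option(name, code_options, default=None, environ=None):
    """Resolve one option: a value set in code wins over the environment."""
    environ = os.environ if environ is None else environ
    if name in code_options:
        return code_options[name]
    return environ.get(name, default)


env = {"FP_OPTION_OPENSTRATEGY": "FP_LAZY_OPEN"}  # as a customer might set it

# Not set in code: the environment value applies.
print(effective_option("FP_OPTION_OPENSTRATEGY", {}, environ=env))

# Set in code: the application's value takes precedence.
print(effective_option("FP_OPTION_OPENSTRATEGY",
                       {"FP_OPTION_OPENSTRATEGY": "value_set_in_code"},
                       environ=env))
```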
Authorized Capabilities
Identifies the Authorized operations that can be performed by an Application connecting to a Pool / Profile combination
  – Read, Write, Delete, Exists, Query, Privileged Delete, Monitor
  – Purge capability is deprecated and should not be used
Applications should proactively determine which capabilities are available (see FPPool_GetCapability() in API Guide)
  – Configure the application user interface / options accordingly
Managing Retention
C-Clips can have an associated Retention Period or Retention Class
  – Mandating the setting of retention can be enforced at the pool level.
If the application does not require Retention, the Retention Period should be explicitly set to zero.
  – Do not rely on default Retention Period of the cluster
    • Default on CE+ is infinity!
Retention is applied to the whole CDF rather than individual Tags.
  – C-Clip should contain objects with the same retention requirements
Use Retention Classes for well defined “common” retention Periods (typically set by Regulatory Body)
  – Allows “painless” update should the period change
CentraStar 3.1 introduced Advanced Retention features
  – Separate license
  – Event Based Retention / Litigation Hold / Retention Governors
Managing Retention
Retention is calculated from creation.date timestamp
  – set when FPClip_Create is called
Retention can be changed only by creating new clip
  – existing C-Clip is opened
  – updates are made e.g. retention period / class changed
  – FPClip_Write is called and a new Content Address is returned
  – old Content Address is saved in prev.clip
  – new creation timestamp is set in modification.date
Important: a new clip is created, so you must either
  – Delete the old one immediately using DELETE_PRIVILEGED
    • Not available on CE+ editions
  – Or “Walk” the clip chain and delete when *all* the retentions have expired
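Walking the clip chain can be sketched with plain dictionaries standing in for C-Clips. The field names (`creation_date`, `retention_days`, `prev_clip`) and the clip IDs are hypothetical; the model only illustrates that the chain may be deleted once every clip in it has passed its own retention expiry.

```python
from datetime import datetime, timedelta, timezone


def retention_expiry(clip):
    """Expiry is counted from the clip's creation timestamp."""
    return clip["creation_date"] + timedelta(days=clip["retention_days"])


def walk_chain(clips, clip_id):
    """Yield the newest clip, then each predecessor via its previous-clip link."""
    while clip_id:
        clip = clips[clip_id]
        yield clip
        clip_id = clip["prev_clip"]


def chain_deletable(clips, newest_id, now):
    """True only when *all* retentions in the chain have expired."""
    return all(retention_expiry(c) <= now for c in walk_chain(clips, newest_id))


created = datetime(2007, 7, 1, tzinfo=timezone.utc)
clips = {
    "OLD": {"creation_date": created, "retention_days": 10, "prev_clip": ""},
    "NEW": {"creation_date": created, "retention_days": 30, "prev_clip": "OLD"},
}
print(chain_deletable(clips, "NEW", created + timedelta(days=20)))
print(chain_deletable(clips, "NEW", created + timedelta(days=31)))
```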
Application Registration
Applications should register via FPPool_RegisterApplication() prior to calling FPPool_Open()
For each pool connection established, the Centera records:
  – Application name / version (as specified in the API call)
  – Hostname
  – Hardware platform
  – Operating system
  – Profile used to connect
  – SDK version
The Application Name supplied to the API call should be constant
  – do not use argv[0]!
Information is retained for all application instances which have authenticated for a period equal to the ‘audit log retention period’.
Questions?
Please Visit SDK Forums on Centera Developer’s Portal
  – http://lighthouse.emc.com
Corporate Systems Engineering team provides:
  – Info, Training, Whitepapers
  – Design Advice / Design Reviews
  – Debugging Support
  – Code Reviews
Email Support Available from:
  – [email protected]