
Implementing Oracle on EMC CLARiiON Storage Systems

Best Practices Planning

Abstract

This white paper outlines the issues to consider when implementing an Oracle database using an EMC® CLARiiON® CX or CX3 UltraScale™ series Fibre Channel storage system. It contrasts the general Oracle recommendations with the specific performance characteristics of the CLARiiON systems and offers general recommendations for using a CLARiiON storage system with Oracle.

February 2008


Copyright © 2004, 2008 EMC Corporation. All rights reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com

All other trademarks used herein are the property of their respective owners.

Part Number H796.3



Table of Contents

Executive summary
Introduction
  Audience
  Terminology
About this paper
Qualification of CLARiiON storage for Oracle
Oracle database design considerations for CLARiiON storage
  Dealing with the “magic bullet” syndrome
  Database logical layout and performance
    Oracle OFA and special areas and tables
  How application types affect performance
    Typical OLTP application characteristics
    Typical DSS application characteristics
    Sequential or random I/O?
    How to determine the I/O profile
    Temporal patterns and peak activities
  The Oracle I/O structure
    Data table I/O
    Redo log I/O
    Archive I/O
    Database block size (DB_BLOCK_SIZE)
  Redo logs
    Considerations for the redo logs
  Instance parameters
    Instance parameter example
  Backing up the database
    Cold backup
    Hot backup
    Hot backup with SnapView
    Impact-free backups with SnapView
Other considerations for performance
  Host OS and HBA considerations
    Max I/O size
    Alignment
  File system or raw partition
    Raw partitions
    File systems
  Host-based striping (plaids)
    Oracle’s SAME
    Guidelines for host-based striping
  Using metaLUNs
    Pure metaLUNs and round-robin logging
    Hybrid use of metaLUNs and traditional log devices
  The CLARiiON cache
    Cache page size
    Which LUNs to cache
  Spindles and stripes
    Stripe element size
  RAID levels and performance
    When to use RAID 6
    When to use RAID 5
    When to use RAID 1/0
    When to use RAID 1
    When to use RAID 0
    RAID levels and redundancy
  Disks
Conclusion
References
Appendix A: The redo log
  The need for consistency
  Leveraging for performance
    Further optimization: Buffer coalescing
Appendix B: DB tuning basic steps


Executive summary

Most Oracle performance problems are rooted in the design phase: application and logical infrastructure design are critical to performance. Assuming these problems are solved, object-level contention is the primary focus of storage performance optimization. Object-level analysis requires knowledge of the database design. Optimization is then achieved by deploying contending objects on different physical disk drives in the storage subsystem.

The design of an Oracle database results in some components with known I/O patterns, such as the redo log. Exploiting these patterns with RAID and disk configurations usually means ensuring that contending tables do not share spindles. Certain Oracle instance parameters should also be set so that they work with EMC® CLARiiON® storage characteristics.

Introduction

Oracle is the industry-leading relational database management system (RDBMS). It is a highly available, robust data engine. To provide availability, robustness, and performance, Oracle makes general recommendations for implementing storage for the database tables. These recommendations are not always applicable to all implementations and storage systems. This white paper examines how best to implement Oracle databases on the CLARiiON CX and CX3 UltraScale™ series Fibre Channel storage systems. Note that this paper is not intended for those planning to deploy under the Oracle Automatic Storage Management model, available since Oracle 10g.

This white paper is divided into three sections:

• “Qualification of CLARiiON storage for Oracle”
• “Oracle database design considerations for CLARiiON storage”
• “Other considerations for performance”

For implementation guidance in the context of Oracle ASM deployments, use the white paper EMC CLARiiON SnapView and MirrorView for Oracle Database 10g Automatic Storage Management – Best Practices Planning.

Audience

The intended audience for this white paper is the CLARiiON system engineer or any database administrator (DBA) who is interested in implementing an Oracle RDBMS using CLARiiON storage. The reader should have a general knowledge of RDBMS basic features and terminology, as well as familiarity with Oracle-specific terms and technology.

Terminology

Atomic: An atomic change is one that happens in a single discrete step, so that a system failure leaves the state either unchanged or updated—no partially updated state is possible with an atomic transaction.

Automatic Storage Management (ASM): A feature in Oracle Database 10g that provides the database administrator with a simple storage management interface that is consistent across all server and storage platforms. As a vertically integrated file system and volume manager, purpose-built for Oracle database files, ASM provides the performance of async I/O with the easy management of a file system.

Checkpoint: A process by which the RDBMS processes all items in the redo log that have not been executed, and updates the tables with a new bookmark (the System Change Number).

Coalescing: The bundling of smaller I/Os, at the file-system level, into larger ones. This can be accomplished if there is some data locality (for example, the blocks are close together in the file system).



Elevator Algorithm: A method by which the disk drive orders the access requests in the most efficient manner.

Oracle Parallel Server (OPS): The name of Oracle’s clustering technology, launched with Oracle 7. An Oracle server engine instance, running on a given server host, typically has exclusive access to a set of OS files or raw partitions. To provide both scaling and high availability, the Oracle RDBMS software was extended to allow two or more Oracle engine instances to work together to support application access to the same set of data sitting on these OS files or raw partitions.

Real Application Clusters (RAC): With Oracle9i and later, Oracle Parallel Server (OPS) was renamed RAC, reflecting the technology’s support for real Oracle application scaling and high availability.

Redo Log/Online Log/Transaction Log: The redo log is Oracle’s most important tool for providing high performance. It is a scratchpad for recording intended changes to the database, which can then be placed in a queue for DBWR to write while the Oracle engine goes on to the next job. This special area must be implemented on fast, reliable storage that is optimized for sequential I/O.

Reserve Capacity: The concept of having more performance potential than is used during normal operations. For example, if the response time target is 10 ms, under normal operation the storage system may deliver 8 ms. Thus, in peak demand periods, the system can still deliver 10 ms performance.

Stripe/Stripe Element: RAID algorithms achieve speed and reliability by distributing sequential chunks of storage across many disks. The number of blocks written before I/O jumps to the next disk in the group is the stripe element size. The number of disks in the group multiplied by the stripe element size gives the size of the RAID stripe. The stripe size is more commonly used in various calculations, but the stripe element size is crucial as well.

System Global Area (SGA): The memory region occupied by the Oracle database engine. In it, Oracle manages buffers, such as the DBWR buffer, in order to provide very fast access to recently used data, and to do lazy writes (writes committed to a queue but not yet written to disk). Because Oracle uses so much memory and does so much of its own buffering, OS tuning is important.

System Change Number (SCN): A global bookmark used to determine which changes were made after the last checkpoint.

Write-Aside Size: The largest request size, measured in 512-byte blocks, that the storage system will accept into write cache; requests larger than this are written directly to disk. The EMC CLARiiON Best Practices for Fibre Channel Storage white paper has more information.

About this paper

To help you use this paper as a reference, specific recommendations are called out in marked paragraphs:

Recommendation: If you do not understand a concept, check the appendix.

Also, a glossary of terms specific to this subject is located in the “Terminology” section. Words defined in the “Terminology” section are shown in boldface at their first use. The appendices provide more detail on several subjects:

• The Oracle redo log process
• Oracle’s basic database tuning steps

Qualification of CLARiiON storage for Oracle

To ensure compatibility between Oracle software and storage technologies, Oracle created a partner program known as the Oracle Storage Compatibility Program (OSCP). This program validated storage solutions’ compatibility with Oracle in one of three areas: network-attached file servers, remote mirroring technologies, and snapshot technologies. Test kits were provided by Oracle, and testing was done by the partner, with Oracle validating the results. Testing included an emphasis on correctness of behavior over various fault-injection scenarios. EMC participated in this program through its duration, which extended until January 2007.

Oracle now believes that these storage system technologies have been well received by the industry and have matured to the point where validation is no longer required for the newer generations of systems offered by storage vendors. The CLARiiON product line was validated by the OSCP program, and customers can rest assured that new generations of CLARiiON storage systems continue to conform to those requirements: the OSCP testing methodology has been factored into the validation testing done before new CLARiiON products are released.

Additionally, EMC qualifies its storage systems for use with clustering solutions as well as backup tools. The EMC E-Lab™ qualifies Oracle 11g, Oracle 10g, and Oracle9i Real Application Clusters (RAC) and Oracle Cluster File System (OCFS) for various operating systems. These qualifications ensure that EMC and Oracle technologies will work together and can be fully supported should a customer problem arise.

For the most current interoperability information, see: http://www.emc.com/products/interoperability/index.htm

EMC maintains a Partner Engineering group that works closely with Oracle. EMC engineers test new versions of Oracle software on various server types attached to CLARiiON storage systems. These laboratories produce the research that explains how to integrate CLARiiON functionality, such as SnapView™ and MirrorView®, with the latest Oracle software.

Oracle database design considerations for CLARiiON storage

Setting up Oracle database files on any storage system requires a good practical understanding of the database’s logical data layout and the pattern in which the data is used. Concerns include:

• Database logical layout and performance
• Applications and performance
• Oracle I/O structure
• Redo logs
• Oracle instance parameters
• Backup

This section addresses each of these concerns in the order shown.

Dealing with the “magic bullet” syndrome

Some nontechnical personnel may question the approach taken in this paper. After all, why should database design be addressed in a paper about storage-system recommendations? The storage system is seen as a magic bullet that eliminates any performance issues for the database. However, the bottom line is: design is critical.

“Appendix B: DB tuning basic steps” lists Oracle’s recommended tuning steps. Step 8 is for tuning the storage system. Any doubts about the importance of tuning the software layer can perhaps be dispelled by Oracle’s own recommendations:

The tuning process does not begin when users complain about poor response time … it is usually too late to use some of the most effective tuning strategies. At that point, if you are unwilling to completely redesign the application, you may only improve performance marginally by reallocating memory and tuning I/O.1

… additional hardware does not improve the system’s performance at all. Poorly designed systems perform poorly no matter how much extra hardware is allocated.2

So, we will take some time up front to address how application design can affect storage system performance.

Database logical layout and performance

It is necessary to ensure that high-usage tables are located on storage that can handle their I/O requirements. It is important to know the schema—which tables are accessed in concert—as that will determine the physical layout of data on disk.

Figure 1 shows a simplistic—but illustrative—example. Four tables share a set of spindles, and all four will be used extensively in the sample SQL statement, causing much disk contention.

SQL:

    SELECT FirstName, LastName, Acctnum, Balance
      FROM Users
     INNER JOIN Accounts ON (Users.Acctnum = Accounts.Acctnum)
     WHERE Balance > 0
     ORDER BY Balance

Figure 1. Example of disk contention

Note that the important aspect of database layout is that the interactions of I/O are at the table level, not the file level: the relationships between tables and indexes must be known. The general idea in table placement on physical media is that objects that are accessed simultaneously should be located on different physical spindles. In the example shown in Figure 1, two tables make up part of a single query. By placing them both on the same disks, the read requests cause the disk heads to seek between two locales on disk. Although the CLARiiON RAID driver makes use of the disk elevator algorithm—thus making disk access as efficient as possible—it is nonetheless recommended that you avoid such contention.

Of course, there are exceptions. If the table and the index are small and are referenced with high frequency, and there is sufficient System Global Area (SGA) and server memory, the index is cached in memory and thus disk access will not be an issue.

Recommendation: Objects that are accessed simultaneously should be located on different physical drives. Objects to include in this approach are indexes, tables, and the TEMP table, RBS (rollback segments), and logs.
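As a minimal SQL sketch of this recommendation, assume the Users and Accounts schema from Figure 1 and two mount points (/u02 and /u03) that reside on LUNs built from different CLARiiON RAID groups. All names, paths, and sizes here are illustrative, not prescriptive:

    -- Tablespaces whose datafiles live on LUNs from different RAID groups
    CREATE TABLESPACE users_data
      DATAFILE '/u02/oradata/orcl/users_data01.dbf' SIZE 10G;
    CREATE TABLESPACE users_indx
      DATAFILE '/u03/oradata/orcl/users_indx01.dbf' SIZE 4G;

    -- Place the table and its index on different physical spindles
    CREATE TABLE users (
      acctnum   NUMBER,
      firstname VARCHAR2(40),
      lastname  VARCHAR2(40)
    ) TABLESPACE users_data;

    CREATE INDEX users_acct_ix ON users (acctnum) TABLESPACE users_indx;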


1 Oracle8i Tuning Release 8.1.5, A67775-01.
2 Oracle9i Database Performance Planning Release 2 (9.2), Part Number A96532-01.



Oracle OFA and special areas and tables

Oracle has guidelines in the form of Optimal Flexible Architecture (OFA) that should be followed as much as possible. The goal is to organize large amounts of data on disk to avoid device bottlenecks and poor performance.

OFA recommends separation between these components:

• DATA tables
• INDX tables (indexes), as they correspond to DATA
• RBS (rollback segments)
• Redo logs
• TEMP
• SYSTEM
• SYSAUX (new in 10g)
• Flash recovery area (new in 10g)

However, the guidelines adjust for the type of storage used. Given a striped configuration—either host- or array-based striping—the following exceptions can be made:

• RBS and TEMP can coexist if the majority of sorts are executed in memory. (Optimize SORT_AREA_SIZE and verify that there are not sustained concurrent writes to rollback and temporary segments; see the sketch after this list.)
• For most environments, sharing SYSTEM with RBS or TEMP is fine.
• In some cases, an index and table can coexist. The assumption is that the index and table are not accessed concurrently (as in a table scan). However, in an OLTP environment, if the index is not held in memory, unacceptable performance could result.
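One way to check the in-memory sort assumption is to compare the standard sort counters in V$SYSSTAT; if 'sorts (disk)' is a significant fraction of the total, keep RBS and TEMP on separate spindles. A minimal sketch:

    -- In-memory vs. on-disk sorts since instance startup
    SELECT name, value
      FROM v$sysstat
     WHERE name IN ('sorts (memory)', 'sorts (disk)');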

How application types affect performance

Applications fall into two broad categories: Online Transaction Processing (OLTP) and Decision Support Systems (DSS). These categories provide a convenient way to approach storage, but typically any commercial RDBMS installation includes aspects of both. The question is, which type of I/O will you optimize for?

As you read this paper, you will see recommendations based on OLTP or DSS workloads. Use the following guidelines to determine if your application fits more closely with one or the other.

Typical OLTP application characteristics

From the end user’s perspective, an OLTP application looks like this:

• Small amount of data (for example, a page of text fields) for read/update per transaction
• Response time per (user) transaction is short, usually measured in seconds
• Large number (tens to thousands) of user connections
• Database data must be current, so availability is critical

Typical OLTP applications are order entry, account update, insurance, and government forms.

OLTP I/O typical profile

From the storage system’s perspective, the I/O profile of an OLTP application looks like this:

• Small I/O size (usually less than 8 KB)
• Mostly random I/O access (data tables and indexes)
• More than 30 percent of the workload is random writes
• Periodic checkpoints synchronize the DB on disk with cyclical write bursts



• The redo log device can be very active with intensive sequential write activity
• Occasional backup workload—usually sequential and with larger I/O size—differs radically from the actual application profile

Performance ramifications

The OLTP workload benefits from drives with fast access times and low latency. This helps speed random reads, and helps flush random writes out of the cache faster, allowing more load on the storage system. The highly random nature makes striped RAID solutions a requirement. RAID 1/0 is a better choice for OLTP than RAID 5, as the write load on the disks is lower—again, allowing the random writes to be flushed more quickly from the cache. Reads will be about the same. (Refer to the “Sequential or random I/O?” section.)

Typical DSS application characteristics

From the end-user point of view, a DSS application looks like this:

• No (or few) updates of data
• Complex queries retrieving large amounts (reports of tens to hundreds of lines) of data, with many different types of records being related
• Elapsed time of a query is expected to be in the range of minutes to hours, depending on the complexity
• Data age may be measured in hours or days, and updates are applied as batch jobs
• Output data contains aggregate quantities (totals), sorted and/or grouped
• Data retrieved may be large files (geographical database)

DSS I/O typical profile

From the storage system’s perspective, the I/O profile of a DSS application looks like this:

• Large I/O size (typically from 16 to 512 KB)
• Multiple sequential streams reading data/indexes
• During query execution, significant write activity takes place against temporary DB storage
• The periodic batch update workload differs from the actual application workload
• The log device and checkpoints are relevant only during batch updates

Performance ramifications

Striped RAID types are necessary to get high bandwidth. Separation of concurrently accessed tables is critical, as sequential access of the drives should be maximized. Managing the cache is important, since forced flushes may make it impossible to do RAID 5 stripe (MR3) writes. RAID 5 is quite effective for DSS.

Due to the sorting activity, large, fast TEMP spaces (using striped RAID) are required.

Sequential or random I/O?

Some database access patterns do not fit cleanly into OLTP or DSS. In this case, the chief characteristic that affects storage-system performance is the randomness of the I/O access. Random write behavior causes much more stress on cache resources than sequential write behavior, especially with RAID 5. Random reads benefit from more spindles per table—either RAID 1/0 or larger RAID 5 groups. In both cases, faster drives will help, but the effect is more pronounced with random access patterns.

Sequential I/O does not stress the cache as much as random I/O because sequential requests can be bundled into large transfers for more efficient disk operation. In fact, sequential writes are better handled by RAID 5 than RAID 1/0. Sequential reads can be effectively prefetched from either RAID type.

Examples of random updates include:

• Client account balance updates
• Inventory tracking
• Accounting

Examples of sequential I/O are:

• Any type of cumulative addition to tables, such as adding new users
• Appending realtime data for later analysis
• File backup

The randomness of the access pattern determines Oracle design decisions—such as the decision whether to index a table—and the RAID type on which to deploy the table.

How to determine the I/O profile

Suppose a DBA with an existing database wants to migrate data to a newer system. In such a case, an empirical analysis can be made. If the current storage is a CLARiiON system, the best tool to characterize the I/O is Navisphere® Analyzer. In a Symmetrix® environment, Workload Analyzer is the best tool.

Prior to Oracle 10g, the utlbstat/utlestat scripts3 can be used to output a file called report.txt, in which the long and short table scans are reported. With Oracle 10g and later, the Automatic Workload Repository (AWR) can be used to collect more comprehensive performance statistics for the database engine.
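For example, on 10g and later, a pair of manual AWR snapshots bracketing a representative workload yields a report of the I/O profile. A minimal SQL*Plus sketch (the report script path assumes a default installation, where ? expands to ORACLE_HOME):

    -- Snapshot before and after the interval of interest
    EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;
    -- ... run or wait out the representative workload ...
    EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;

    -- Generate the report over the two snapshots
    @?/rdbms/admin/awrrpt.sql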

Temporal patterns and peak activities

Plan for transaction batching, such as daily receipts or weekly reports. These can cause spikes in the service, both on a file-system level and globally. Global resources such as database buffers and storage-system cache must have the reserve capacity to accommodate spikes.

Peak activity is also caused by events that can be anticipated. Events to plan for include meal times, scheduled events, and busy days (for example, Friday, payday, holidays).

The Oracle I/O structure

Performance tradeoffs discussed in this paper can be best understood if the Oracle I/O structure is known.

The Oracle structure uses a set of buffers to which the database engine writes. These buffers reside in the System Global Area (SGA), which resides in the host’s RAM. Oracle uses the buffers extensively to optimize its I/O. This highlights a key fact: For good performance, an RDBMS host requires large amounts of RAM and virtual memory.

Oracle processes operate on these buffers, reading and writing files that are implemented either on raw partitions or file systems (Figure 2).

3 $ORACLE_HOME/rdbms/admin/utlbstat.sql to start, utlestat.sql to end



Figure 2. Oracle I/O structure

Note that Oracle uses a divide-and-conquer approach to parallel access of storage. For various functions, the database has discrete SGA buffers in memory, separate processes to flush the buffers to disk, and separate tables or files with which to do I/O. Furthermore, Oracle can achieve concurrency by implementing multiple instances of each process type, and by using asynchronous I/O.

Ensure that the Oracle processes have sufficient resources to perform I/O at their highest possible rates.

Data table I/O

As changes or requests are made on the database, Oracle uses its buffers as a way to optimize read and write performance. For example, Oracle can perform read-ahead, putting data it expects to need into the data buffers. It also uses write back, where data is placed in an in-memory data buffer. Oracle does not wait for disk I/O to complete before continuing on to the next transaction4. The DBWR process fulfills the reads, writes, and read-ahead operations indicated in the buffers.

The characterization of DBWR I/O—large or small block, random or sequential—is largely determined by the application, as described in the “How application types affect performance” section.

Redo log I/O

The redo log (also called the transaction log or online log) is Oracle’s way of executing tasks that update the database in shorthand: changes that Oracle intends to make are recorded here. The actual changes to tables are made later. Since it is quicker to make a shorthand note of the change than to execute the change itself, Oracle can write a log entry and ensure the log is written to disk before going on to the next task. The writing of the modified table pages can be deferred, to be performed by the DBWR as background bulk dirty-page flushing, without blocking the commit of the database transaction.

Redo log I/O is sequential and synchronous, meaning that each operation must complete before another one begins. The redo log is written in small chunks in multiples of 512 bytes. Note that as the log tracks changes to the database, read-only applications will not utilize the redo log.

The LGWR process executes redo log operations. Typically, the online redo log file is written until full, at which time LGWR switches to another redo log file. If ARCHIVELOG mode is set, the filled redo log file is archived to a defined location. After it has been archived, the redo log file is ready to be reused when the current redo log file fills up. If ARCHIVELOG mode is not set, the status of this filled log file is set to be reusable. Once reused, the previous redo log record content is overwritten and lost. Oracle tags the state of log files waiting for reuse as offline.

4 Refer to the “Redo logs” section to see how Oracle can use write back and maintain database integrity.
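For reference, ARCHIVELOG mode is enabled with the database mounted but not open. A typical SQL*Plus sequence looks like the following (archive destination setup is omitted here):

    SHUTDOWN IMMEDIATE
    STARTUP MOUNT
    ALTER DATABASE ARCHIVELOG;
    ALTER DATABASE OPEN;
    ARCHIVE LOG LIST    -- confirms the log mode and archive destination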

Performance of the redo log subsystem is important enough to warrant its own subsection (refer to “Redo logs”).

Archive I/O

The ARCH process is an optional feature. It backs up the redo log that is currently offline. A backup of the redo logs written since the last synchronization of the database (known as a checkpoint) allows rebuilding of the database in case of catastrophic failure. It can also be used to synchronize an offline backup or a remote copy of the database. For more information, refer to the “Redo logs” section.

Database block size (DB_BLOCK_SIZE)

DB_BLOCK_SIZE is an extremely important value for Oracle I/O efficiency. It determines the smallest increment of change that the database will attempt. Typically, this value is set small for OLTP applications, and as large as possible for DSS applications. The intent is that for small changes, as little data as possible is exchanged. For large operations, fewer large I/Os are more efficient, and large blocks can be coalesced into larger requests to the storage system.

Recommendation: You should set the CLARiiON cache page size to the same value as the Oracle DB_BLOCK_SIZE. Oracle and CLARiiON suggest that the DB_BLOCK_SIZE be the same size as the file system block size.

In the case where the OS page size is smaller than the file-system block size, or where file systems are not used, a very conservative DBA will set the DB_BLOCK_SIZE to the OS page size. A small Oracle block size operating against a larger file system block size wastes file system resources, as more data is fetched than is needed.

However, an Oracle block size that is too large can have a side effect: unintentional prefetch. This prefetch can occur at the file-system (if used) or storage-system level. For example, a database block size two times the size of the file-system block size requires two requests from the file system. This may result in the file system performing an unneeded prefetch, thereby wasting storage-system resources.
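Before sizing the CLARiiON cache page, the block size of an existing instance can be checked from SQL*Plus; either form below reports the current value:

    SHOW PARAMETER db_block_size
    SELECT value FROM v$parameter WHERE name = 'db_block_size';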

Redo logs

As stated earlier, the redo log is Oracle’s scratch pad: changes that Oracle intends to make are recorded here. Redo log implementation has two aspects: the storage concerns of the logs themselves and of the archive.

Considerations for the redo logs

Due to Oracle’s need to ensure atomicity, the redo log is written to synchronously—the database will not continue from any write operation to the redo log until the write to physical media is complete. This is because the contents of the redo log are critical for the recovery of the database. It must be kept consistent and secure.

Write-through file systems

File systems should not be used to hold redo log files, unless the write caching of the file system can be bypassed5. Oracle requires the redo log writes to be saved in a persistent manner before continuing.

Write-through storage systems

Storage devices should not use write caching unless the write-cache scheme guarantees that, in the event of a failure, data in the write cache is coherent. The best case is that write-cache data is stored to disk in a failure6. Some systems cannot protect the log even in non-catastrophic failures. For example, systems such as the LSI “E” series and the HP StorageWorks EVA allow unmirrored writes to cache. In this case, a LUN trespass can result in a corrupt database.

5 A catastrophic failure of the server may lose the data Oracle assumes was committed to physical media.

OFA and the redo log devices

As the log writer process does sequential writes in multiples of 512-byte blocks, write caching will be very effective for the log LUNs. OFA suggests that the online log be on drives that are not used for other I/O. This is unrealistic when the storage system is a CLARiiON Fibre Channel system. Why?

• Write caching decouples host writes from disk access: what is important is that the write cache be able to flush the log writes quickly enough to avoid filling the cache.
• Log writes are sequential; sequential writes allow the write cache to flush at an optimal rate, so even a shared RAID 5 group can keep up with an Oracle LGWR.
• As drives get larger, few users can accept dedicating several spindles to only a few GB for the Oracle logs.

Navisphere QoS Manager (NQM), introduced in FLARE® release 24, can eliminate the need to dedicate a distinct set of spindles to servicing redo log I/O. Multiple LUNs can be created from the same RAID group so that the extra capacity of the spindles is not wasted. NQM can then be used to maintain a specified service level on the redo log LUN(s), ensuring a desired level of database performance even if the other LUNs within the same RAID group are being accessed by other applications. LUNs in the shared RAID group should be allocated to applications with limited bandwidth and throughput requirements so the spindles do not become overloaded. Information on the use of NQM can be found in the white paper Using Navisphere QoS Manager in Oracle Database Deployments, available on Powerlink®.

Use of the redo log archive is determined by the {NO}ARCHIVELOG setting in Oracle. ARCHIVELOG mode is usually set in a production environment. The redo log archive is written to sequentially, and in much larger blocks than the redo log itself. The large I/O size can be a factor with older (FC series) storage systems, which have limited write-cache bandwidth. A write-intensive database running on these older systems will benefit from bypassing write cache of archive writes. (The CX and CX3 series systems provide much better write bandwidth, so this is typically not necessary.)

Bypassing cache for archival writes

The DBA can take advantage of flexibility in the CLARiiON write cache architecture to free up cache pages normally filled by archive log activity. If these writes can bypass the cache, more pages will be free for production I/O. Use the CLARiiON write-aside setting to control this. Set the write-aside size for the archive LUN to be just smaller than the I/Os used to back up the log7. For example, if the archive process will be executing 512 KB writes, set the write-aside size to 1023 blocks (511.5 KB). Alternatively, you can turn off write cache for the archive LUNs.

To ensure the log archival process completes before the offline log is needed to go online, ensure the uncached writes are going to be executed efficiently. There are several approaches to this technique.

The first approach requires that the I/O is aligned to the RAID stripe on the archive LUN. In that case, the archive device can be RAID 5, which will perform stripe writes (modified RAID 3, or MR3), which are very efficient. Refer to the EMC CLARiiON Best Practices for Fibre Channel Storage white paper for alignment and optimizing for MR3.

The second approach assumes that either the file system precludes aligned writes, or that the archival process and file system combination preclude I/Os of sufficient size to fill a parity stripe. In this case, the archive must be on mirrored storage—RAID 1 or RAID 1/0. Disk activity will be greater than that in an optimized RAID 5 stripe, but will nevertheless be efficient. If I/O to the archive process is very small, it may make sense to turn write cache off, rather than trying to bypass cache with the write-aside parameter.

6 All CLARiiON and Symmetrix arrays store the cache on disk in the case of a failure.
7 Refer to EMC CLARiiON Best Practices for Fibre Channel Storage for details on write aside.



Use of multiple log devices

Many large systems use multiple logs. These are typically numbered in order of access from the LGWR process and are grouped (Figure 3). Ideally, the log groups are located on two dedicated RAID groups—one for the active log and one for the inactive log (which is being archived).

Figure 3. Multiple log layout

Note that all three log operations—writing to the online log, reading from the offline log, and writing to the archive—are sequential I/O. Dedicating RAID groups to these devices maximizes the disk’s ability to do sequential I/O; the result is that log writes and archive are flushed from the cache quickly, leaving more overhead for database writes. In systems that are under a heavy write load, this technique will help overall write performance (Figure 4).

Figure 4. Oracle Log I/O configuration with ARCHIVELOG mode active

Recommendation: In large systems (more than 40 drives in use) with heavy write loads (more than 30 percent of all host traffic), deploy the online, offline, and archive log devices on separate sets of drives. Sharing the archive drives with other nonperformance-critical data is fine as long as the disk group is striped RAID.
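A sketch of laying out redo log groups this way, where /u04 and /u05 are assumed to be mount points on the two dedicated RAID groups (group numbers, paths, and sizes are illustrative only):

    ALTER DATABASE ADD LOGFILE GROUP 1 ('/u04/oradata/orcl/redo01.log') SIZE 512M;
    ALTER DATABASE ADD LOGFILE GROUP 2 ('/u05/oradata/orcl/redo02.log') SIZE 512M;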

Adapting OFA to smaller systems

Oracle databases can be modest in size, with modest I/O requirements. In systems with fewer drives, redo log data may have to coexist with other files. In this case, the ability of the storage system to effectively cache the redo log is critical.

The redo log writes hit write cache, so redo log operations execute at cache speed. With a small Oracle deployment, it is unlikely that this load alone will stress the write cache. However, if the storage system is shared with other write-intensive processes, cache saturation must be anticipated. When the cache saturates, forced flushes result, all I/O is slowed, and redo log performance suffers. Monitoring the system with Navisphere Analyzer is the best way to detect that the cache is saturating.

An effective tactic when deploying Oracle on a small storage system (under 20 drives) is to partition disk groups so that the RDBMS files are using as many drives as possible—even if that means they must be shared. MetaLUNs are ideal for this case, as drives can be shared among low-access hosts (such as departmental file servers) and the database.

For example, without metaLUNs, a 20-disk system would typically have four RAID groups. Under the usual deployment method (five-disk groups), the most disks any one LUN could access would be five. With metaLUNs, each disk group would be partitioned, and one LUN per group assigned to a host’s metaLUN. Each metaLUN would have up to 20 drives available to absorb a burst of I/O.

Archiving of the redo log in a shared environment requires monitoring of the storage-system cache dirty pages. When the log and data tables share disks, if writing to the archive log causes forced flushes, other requests for those drives are impacted. As long as there are enough drives to absorb the archive process without forced flushing, the host can drive concurrent I/O to those drives. Also, use of write aside (or turning off write cache altogether) for the archive device frees cache pages for production I/O.

Recommendation: In small systems (fewer than 20 drives) spread the busiest tables over as many drives as possible in order to absorb bursts. It is acceptable to share data with log devices, as the write cache will buffer drive access. Use metaLUNs to maximize use of your disk drives.

Instance parameters

An Oracle database instance has parameters that control how it interacts with its storage (Table 1). When setting up the database, take the storage configuration into account. These parameters should work with the storage, not against it. For example, in Table 2, the parameters are tuned in order to match stripe sizes on the RAID groups used in the example.

These parameters can usually be found in $ORACLE_HOME/dbs/init.ora or in files included by the init.ora file. Use the show parameters command from the SQL*Plus prompt (or, for 8i systems, the svrmgr prompt) to produce a report containing the Oracle instance parameters.
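For example, from SQL*Plus:

    SHOW PARAMETERS db_block_size
    SHOW PARAMETERS db_file_multiblock_read_count
    SHOW PARAMETERS hash_multiblock_io_count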

Table 1. Important Oracle settings and their default values

| Parameter | Set in increments of | Typical default | Description and recommendation |
|---|---|---|---|
| DB_BLOCK_SIZE | Bytes | 2048 | Equal to the file system block size and also greater than or equal to the OS page size8. |
| DB_BLOCK_CHECKPOINT_WRITE_BATCH | DB_BLOCK_SIZE | 8 | Write chunk size for checkpoint writes. Set to the CLARiiON LUN stripe element size, up to the CLARiiON stripe size, but not greater than the OS maximum I/O size. |
| DB_FILE_MULTIBLOCK_READ_COUNT | DB_BLOCK_SIZE | 8 | Read chunk size for table and index full scans. Set to the CLARiiON LUN stripe element size, up to the CLARiiON stripe size, but not greater than the OS maximum I/O size. |
| HASH_MULTIBLOCK_IO_COUNT | DB_BLOCK_SIZE | 8 | I/O chunk size for hash joins. Set to the CLARiiON LUN stripe element size, up to the CLARiiON stripe size, but not greater than the OS maximum I/O size. |
| USE_DIRECT_IO | Boolean | N/A | This is where you can bypass the file system, if available. |

8 Refer to the “Database block size (DB_BLOCK_SIZE)” section for more detail on DB_BLOCK_SIZE.



Note that the sizing of the parameters (from CLARiiON stripe element size up to stripe size) depends somewhat on the type of application. For OLTP, use the stripe element size. For DSS, use the stripe size. If using metaLUNs, use the stripe element size of the base LUN as a guide, not the metaLUN stripe element size.

Instance parameter example

Table 2 clarifies the settings of the instance parameters. It lists the instance parameters based on this configuration:

• Windows 2000 host, deploying tables on an NTFS file system (file system uses a 4 KB block size)
• 64 KB (128 blocks) stripe element size for data LUNs

Table 2. Instance parameters for OLTP configuration

| Parameter | Value | Comment |
|---|---|---|
| DB_BLOCK_SIZE | 4096 | The file system block size. |
| DB_BLOCK_CHECKPOINT_WRITE_BATCH | 16 | 16 * 4096 = 64 KB, which is our base LUN stripe element size. |
| DB_FILE_MULTIBLOCK_READ_COUNT | 16 | 16 * 4096 = 64 KB, which is our base LUN stripe element size. |
| HASH_MULTIBLOCK_IO_COUNT | 16 | 16 * 4096 = 64 KB, which is our base LUN stripe element size. |
| USE_DIRECT_IO | Not used | |

Recommendation: Ensure that the Oracle read-ahead and write cluster sizes are aligned with the base LUN RAID stripe or stripe element size.
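As a sketch, the Table 2 values map to init.ora entries like the following (parameter availability varies by Oracle release; tune the multiblock counts to your own stripe element size):

    # 4 KB file system block size; 64 KB base LUN stripe element
    db_block_size                 = 4096
    db_file_multiblock_read_count = 16
    hash_multiblock_io_count      = 16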

Backing up the database

A database may be backed up in two ways: cold (database brought down) or hot (database still running, but in hot backup mode). Both operations are important to consider as they affect the design of the storage subsystem. The time and I/O load incurred by the backup must be considered while designing the system to ensure sufficient performance during hot backups.

Details for Oracle backup can be found in the Oracle document User-Managed Backup and Recovery Guide.

Cold backup

In a cold backup, the database is down. The Oracle database files that make up the database can be copied to backup media in parallel. Oracle RMAN cannot be used, as it requires the database to be running.

Since the database is stopped, the requirements for performance are predictable: The number of LUNs to be backed up is noted, and total bandwidth for read access computed. The total bandwidth is compared to the maximum bandwidth of the storage system itself, the backup host, and the media to determine which is the bottleneck. Read access should be large block and sequential.
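The set of files to copy can be enumerated from the data dictionary before the shutdown; a minimal sketch:

    -- Datafiles, control files, and online redo logs that make up the database
    SELECT name   FROM v$datafile
    UNION ALL
    SELECT name   FROM v$controlfile
    UNION ALL
    SELECT member FROM v$logfile;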

Hot backup

DBAs have great incentive to find a way to back up the database without halting all I/O. Oracle has a hot backup mode that can be used; details of this method can be found in the Oracle 9i or 10g with SnapView in SAN environments white papers mentioned in the “References” section. Briefly:

• The database must be operating in logging enabled mode.
• Hot backup mode is initiated.
• The database is checkpointed and the System Change Number (SCN) is frozen.
• Backup is done, while changes to the database are permitted; the redo log records changes.
• When backup is completed, the database is taken out of hot backup mode, and redo logging reverts back to the normal logging mode. The SCN is unfrozen, and advanced to correctly reflect all committed changes to the database.
• Redo logs generated during the hot backup interval should be captured, archived, and saved along with the database backup files. Also, a backup copy of the control file should be made after the redo log archival is completed.

Because more bookkeeping is required by Oracle during the hot backup interval for this portion of the database, the redo logs grow rapidly if there is heavy activity against the database. The response time for transactions is negatively impacted while the database remains in hot backup mode.

Since the database is still doing I/O during a hot backup, the requirements for performance are less predictable than a cold backup. The read sequentiality of the backup process is broken up by database activity, and I/O is impacted for both the production applications and the backup process. Significant performance headroom must be designed into the database, or the hot backup must be run during a slow period.
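A minimal sketch of the bracketing commands for one tablespace (names and paths are illustrative; each tablespace being backed up gets the same treatment):

    ALTER TABLESPACE users BEGIN BACKUP;
    -- ... copy the tablespace's datafiles with OS tools, or start the SnapView
    -- session described in the next section ...
    ALTER TABLESPACE users END BACKUP;

    -- Capture the redo generated during the backup, then back up the control file
    ALTER SYSTEM ARCHIVE LOG CURRENT;
    ALTER DATABASE BACKUP CONTROLFILE TO '/u06/backup/orcl_control.bkp';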

Hot backup with SnapView

An advantage to CLARiiON SnapView software is that the hot backup can be executed with a reduced impact on the database system. Placing the database in hot backup mode—which imparts overhead on the system—is only required for as long as it takes to execute the SnapView session start, which is scriptable and very quick. (The details of the procedure can be found in the Oracle 9i or 10g with SnapView in SAN environments white papers mentioned in the “References” section.) Thus, the impact on database transaction rates due to the schema being locked is minimized.

Performance implications of a snapped backup

SnapView creates instantaneous images of the database LUNs, and the contents of those LUNs are backed up. Changes to the database cause some of the snapped data to migrate to the SnapView save LUN in an operation known as copy on first write (COFW). COFW operations are done in chunks, so a small write (4 KB) to the production volume results in a larger write (typically 64 KB) to the snap cache LUN. However, further changes to the chunk area on the production disk need not be copied.

The load on the redo log and archive is much less than during a non-SnapView hot backup. The sequentiality of the backup process is still affected by application I/O contending for the disk drives—most of the data read is from the production LUN, not the snap save LUN—and this affects the speed of the backup. The system should be designed with some performance overhead available, or the backup should be run in an off-peak period.

The overhead for SnapView running concurrently with the production system depends on the write volume to the data tables. Preliminary data from CLARiiON Performance Engineering indicates that the impact of maintaining SnapView images of production data is 5 percent to 15 percent. This assumes no disk contention or cache saturation.

Note: Drives used for the SnapView save area should not be shared with production data.

If, under normal production load, the CLARiiON cache is near saturation, then the additional load of the COFW operations may cause contention in the cache, and thus impact ongoing operations. The locality of updates to the tables being written has a significant impact on this load: SnapView performs COFW in chunks, and subsequent writes to that chunk do not require I/O to the production or SnapView save LUN.

If backups during busy periods are a requirement and the minimum impact on production I/O is desired, consider an impact-free backup.


Impact-free backups with SnapView
An impact-free backup strategy can be designed with SnapView clones. A clone relationship differs from that of a SnapView snapshot relationship: a clone receives mirrors of the writes to the host volume. If the production volume receives a 4 KB write, so does the clone. If the production volume writes to that same 4 KB block, the write is mirrored to the clone. Unlike COFW, clones get copied on every write, which is referred to as a Clone Mirrored Write (CMW). CMWs increase the load on the write cache; up to eight clones can be maintained for each production LUN, and each CMW hits write cache. The extra load during production, however, pays off when backups are required.

Two techniques can be used to execute an impact-free backup:

• Fracture the clones and use them as backup targets
• Use SnapView to make snapshots of the clones

The following sections describe these techniques.

Fracturing the clones and using them as backup targets
In this case, the database is put into hot backup mode, and the clone (one clone per production LUN to be backed up) is fractured. The database is then returned to normal operation.
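A minimal sketch of that sequence follows, using standard Oracle hot backup commands. The tablespace name is a placeholder, and the fracture step is left as a comment because the exact Navisphere CLI or admsnap invocation depends on the SnapView release in use.

    #!/bin/sh
    # 1. Put the tablespace in hot backup mode (USERS is a hypothetical name)
    sqlplus -s "/ as sysdba" <<EOF
    alter tablespace users begin backup;
    EOF

    # 2. Fracture one clone per production LUN here (Navisphere CLI step;
    #    syntax omitted -- consult the SnapView documentation for your release)

    # 3. Return the database to normal operation and archive the current log
    sqlplus -s "/ as sysdba" <<EOF
    alter tablespace users end backup;
    alter system archive log current;
    EOF

In practice every tablespace containing data on the cloned LUNs must be placed in backup mode, and the archived logs and control file backup belong with the backup set, as described earlier.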

As compared to a SnapView snapshot backup, this approach reduces the load on production LUNs during the backup. First, CMW operations cease as soon as the clone is fractured. Second, the backup reads are executed against different spindles than the production spindles (the best practice for clones is to put them on different disks from the LUNs they mirror). There is no impact on the write cache or the production disks during the backup.

The downside to this technique is that a resynchronization of the clone is required when the backup is done, and until the resynchronization is completed, the database image on the clones is not in a coherent state. However, as up to eight clones can be associated with a LUN, users requiring a coherent online mirror at all times can be accommodated.

A variant of this technique is to keep the clone fractured until the backup window approaches. The resynchronization of the clone is part of the backup preparation phase. After the clone synchronization is complete, the database is placed in hot backup mode and the clone is fractured. It stays fractured until the next backup.

Using SnapView to make snapshots of the clones
In this case, a clone relationship exists for all LUNs to be backed up, but the clone is not fractured for the backup. Instead, after the database is put in hot backup mode, SnapView makes a snapshot of the clone. The database is then returned to normal operation. The backup is executed against this snapshot.

Backup read activity impacts the clone and SnapView cache disks only; no backup-related contention occurs for the production data. However, CMW activity occurs between the production and the clone disks, and this incurs contention for the clone disks between the CMW and the backup reads. As the clone disks experience a heavier I/O load, the backup may affect the responsiveness of production data. Although a CMW to the clone hits write cache, the extra load on the clone disks may slow write cache flushes to those disks. This can lead to watermark or forced cache flushing, which can affect overall system performance.

The advantage to this technique is that a resynchronization of the clone is not required, and thus the clone always contains a valid image of the database.

Recommendation: For the lowest impact on RDBMS performance, use SnapView clones as the backup source.

Other considerations for performance
This section addresses considerations inherent in implementing the Oracle RDBMS, starting at the host and working back to the storage system:


• Host OS considerations
• File system or raw partition
• Host-based striping (plaids)
• MetaLUNs
• The CLARiiON cache
• LU distribution
• Spindles and stripes
• RAID levels and performance
• Disks

Host OS and HBA considerations
This section outlines the ways in which the Oracle database can be tuned to leverage the performance characteristics of the storage system.

Max I/O size
The default maximum I/O size on most systems is 64 KB to 128 KB. This is sufficient for OLTP-type applications, but even in that case, database backups and redo log archival benefit from a larger value. A practical target for I/O sizes for a CLARiiON storage system is 1 MB.

DSS-type applications also benefit from a larger I/O size. The following are examples of parameters to change in order to increase the I/O size for a file system:

• Solaris file system settings: maxphys, set in bytes; maxcontig, set in file system blocks, to the same capacity as maxphys.
• AIX settings, applied on a per-hdisk level using chdev: max_transfer, set in bytes; max_coalesce, set in bytes.
• VERITAS VxFS: vxio:vol_maxio, set in 512-byte units; it can be set as high as 2048 (which translates to 1 MB).

Recommendation: For best performance with RAID 5, the file system maximum I/O size should be a multiple of, or equal to, the stripe size used on the logical volume and CLARiiON LUN.
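As a rough sketch of how the parameters named above are applied (device names and values here are illustrations, not recommendations; verify attribute names and limits against your OS release):

    # Solaris: raise the kernel maximum physical I/O size to 1 MB (reboot required)
    echo 'set maxphys=1048576' >> /etc/system
    # Solaris UFS: set maxcontig to match (128 x 8 KB blocks = 1 MB,
    # assuming an 8 KB file system block size)
    tunefs -a 128 /dev/rdsk/c1t0d0s6

    # AIX: per-hdisk attributes, set with chdev (values in bytes)
    chdev -l hdisk4 -a max_transfer=0x100000
    chdev -l hdisk4 -a max_coalesce=0x100000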

The TEMP database
Write bandwidth to the TEMP database may behave in a counterintuitive fashion when the file system maximum physical size is increased. The expectation is that bandwidth will increase. However, if the maximum physical size is greater than the CLARiiON write-aside size, then large writes to TEMP bypass cache. Sequential writes may reach the write-aside size as they are coalesced by the file system into a single large I/O.

This results in slow writes to TEMP—write-aside bypasses cache, and disk I/O is always slower than cached I/O. Furthermore, when TEMP is reread, if the file system buffer does not have the data requested, the file system will request the data from the CLARiiON storage system. If write aside was used, the data is not in the cache, and the read request must go to the disk to retrieve the data. This results in slower reads than if the data was in the CLARiiON write cache. The effect can be quite dramatic.

There are a few ways to work around this, depending on the flexibility of the operating system:

• Create the TEMP file system with a lower maximum physical or maximum contiguous setting.


• Increase the write-aside size for the LUN above the maximum contiguous size of the file system housing TEMP.
• Create the TEMP database on a raw device.
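For example, on Solaris UFS the first workaround might look like the following sketch; the device name is hypothetical, and 16 blocks x 8 KB = 128 KB, safely below a typical write-aside setting:

    # Build the TEMP file system with a small maxcontig so coalesced writes
    # stay below the CLARiiON write-aside size and continue to hit write cache
    newfs -C 16 /dev/rdsk/c2t1d0s0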

Alignment
To make the most of the RAID algorithms, if a file system is deployed on a striped RAID LUN (RAID 0, RAID 1/0, RAID 5, RAID 3), ensure that the writes from the file system are aligned. This reduces disk and stripe crossings, which increase latency. Disk and stripe crossings are particularly costly for parity RAID types, such as RAID 5. For example, when writing to a 256 KB CLARiiON RAID 5 stripe, a 512 KB I/O that is misaligned causes two partial stripe operations, requiring parity reads and writes. If aligned, the 512 KB I/O fills two stripes and allows the RAID 5 stripe to be written more efficiently to disk (no parity operations).

Refer to the EMC CLARiiON Best Practices for Fibre Channel Storage white paper on Powerlink® for details on alignment and fixing the alignment issue.
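As an illustration, a quick Linux-side check of partition alignment against the 64 KB stripe element; the sysfs path and device names are examples, and other operating systems expose the partition start offset through different tools:

    #!/bin/sh
    # A 64 KB element is 128 x 512-byte sectors; the partition start
    # must be a multiple of 128 sectors to be element-aligned
    START=$(cat /sys/block/sdb/sdb1/start)
    if [ $((START % 128)) -eq 0 ]; then
        echo "sdb1 starts at sector $START: 64 KB aligned"
    else
        echo "sdb1 starts at sector $START: misaligned"
    fi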

File system or raw partition
The Oracle DBA has the option of implementing the tables on raw partitions or file systems. Each has advantages.

Raw partitions
Raw devices have a number of advantages over file systems that benefit performance. Many production databases are implemented on raw partitions for these reasons.

Advantages of raw partitions are that they:

• Avoid file system caching, which costs CPU cycles on the host, and can be wasteful due to double buffering of I/O. It is more efficient to allow Oracle to perform its own buffer management using more memory dedicated to the Oracle SGA.

• Avoid file system locking. File systems permit only a single writer per file at a time, thus serializing I/O that Oracle could perform simultaneously (using its own table- and row-level locking).

• Have no file system block size; Oracle can write to raw devices at DB_BLOCK_SIZE.
• Allow you to define large I/O sizes to a raw partition, whereas many file systems have upper limitations.
• Are easier to snap (that is, perform a SnapView function) and keep in a coherent state: you only need to put the database in hot backup mode; you do not need to synchronize the file system.

Disadvantages of raw partitions are as follows:

• There is no file-system-level management available.
• To reload tables on a backup machine, the partition on which they are located must have the exact device context and permissions as on the production server, or the database engine cannot load them.

File systems
File systems may offer some benefit from OS-level caching. Rereads of blocks from the OS cache are very fast, taking some of the read cache load off of the storage system. This helps in systems configured with very large amounts of RAM (above 8 GB) that can load entire indexes and tables into file system buffers.

Advantages of file systems are that they:

• Offer coalescing of writes, which is useful in maximizing bandwidth in sequential access operations.
• Are easier to back up and mount from a backup host.
• Are easier to manage and have more tools for analysis than raw partitions.


A file system should not be used for the Oracle redo logs unless the write caching can be bypassed. For example, on Solaris, Oracle opens the redo log file with O_DSYNC to force writes through to the physical medium.

Disadvantages of file systems are as follows:

• There is an extra layer of indirection and logic.
• File system buffering requires that a sync of the file system be done before a backup is commenced.

Advanced file systems
Some file systems offer advanced features that improve performance. Advanced file systems, such as XFS, JFS, and VxFS, are preferred over UFS file systems. These advanced file systems offer improved journaling and performance (by eliminating double buffering).

Advanced file systems’ direct options
Some advanced file systems offer direct I/O features, which bypass file system caching. This improves performance, though you are still limited by file-level locking. On the other hand, direct I/O options retain the advantages of a file system—such as file-level indirection for tables in the Oracle metadata. Note that when using direct I/O, you should disable asynchronous I/O in the Oracle system. In the init.ora file, set:

disk_asynch_io=false

Some third-party file systems claim to offer both an efficient I/O model (which does support asynchronous mode) and elimination of file-level locking for raw device performance.
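For example, with VxFS the direct options are typically enabled at mount time; a hedged sketch follows, in which the volume and mount point names are hypothetical:

    # Mount a VxFS file system with direct I/O semantics for Oracle data files
    mount -F vxfs -o mincache=direct,convosync=direct \
        /dev/vx/dsk/oradg/datavol /u02/oradata
    # ...and in init.ora, per the note above:
    #   disk_asynch_io=false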

Host-based striping (plaids)
Striping host volumes across many LUNs is an effective tactic for creating large file systems and distributing bursty, random I/O over many drives. However, taking it to an extreme is not suggested, as this approach can hurt large I/O performance (where one I/O spans many drives). The advantages and problems of plaiding are discussed in the EMC CLARiiON Best Practices for Fibre Channel Storage white paper.

Oracle’s SAME
An aging philosophy for implementing storage, SAME, or Striping And Mirroring Everywhere, is not cost-effective when considering CLARiiON storage. The fundamental emphasis behind SAME is to ensure that I/O to Oracle database files is evenly spread out to as many physical resources (for example, disks, I/O channels, and others) as possible to minimize resource access contention. This concept has been replaced by OFA when modern high-capacity storage systems are used.

Guidelines for host-based striping
Plaiding is recommended when the size of the file system cannot be accommodated by a single RAID group, or when random I/O performance requires that the I/O be spread across a large number of drives, or across both storage processors.

Stripe a reasonable depth
Oracle recommends host-based striping with a stripe depth as small as 16 KB. This recommendation does not take into account the storage system’s caching abilities, and such a small depth is not recommended on CLARiiON.

The host-level stripe depth (stripe element size) should be the same or a multiple of the CLARiiON stripe size. RAID 5 stripe optimizations benefit from this more than RAID 1/0, but the approach applies to both RAID types.

For example, suppose you have four 4+1 RAID 5 LUs with 256 KB stripe sizes. Your host stripe should be 1 MB wide, as the resulting stripe depth (1 MB divided by 4 LUs = 256 KB) then matches the stripe size of each CLARiiON LU.
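The arithmetic from that example, as a small shell sketch (the element size, disk counts, and LUN count mirror the text):

    #!/bin/sh
    ELEMENT_KB=64          # CLARiiON stripe element
    DATA_DISKS=4           # data disks in each 4+1 RAID 5 group
    LUNS=4                 # LUNs in the host-level stripe
    LUN_STRIPE_KB=$((ELEMENT_KB * DATA_DISKS))   # 256 KB CLARiiON stripe per LUN
    HOST_STRIPE_KB=$((LUN_STRIPE_KB * LUNS))     # 1024 KB full host stripe
    echo "stripe depth ${LUN_STRIPE_KB} KB per LUN; full host stripe ${HOST_STRIPE_KB} KB"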


Using metaLUNs
As discussed in the EMC CLARiiON Best Practices for Fibre Channel Storage white paper, metaLUNs are an effective tool for distributing bursts of I/O over many disk drives. Consider your goals for performance and data growth before implementing metaLUNs. For example, high-bandwidth DSS systems might perform better with traditional RAID-group-based LUNs, as fencing disk access between tables may allow the drives to work more efficiently.

MetaLUNs will assist performance when access to various tables is bursty and unpredictable. For example, with a large application such as SAP over an Oracle database, it is extremely difficult to determine ahead of deployment which tables will be busy, and which tables will interact. In this case, using metaLUNs (or a host plaid) is sensible as it guarantees all disks will be loaded evenly.

For an RDBMS such as Oracle, the design considerations are:

• Disk pooling: associating RAID groups into sets that will host a common set of metaLUNs
• Data fencing
• Location of log devices

MetaLUNs and traditional LUNs can work together to effectively deploy the Oracle database. For example, for a single database server or cluster that is the sole client of a storage system, the approach might be pure metaLUNs, or a hybrid: metaLUNs for data storage plus traditional LUNs for the logs.

Pure metaLUNs and round-robin logging
In this design, a minimum of three disk groups is created. Each disk group holds a subset of the data, RBS, and TEMP tables, and one of the following:

• Online log
• Offline log
• Archive log

One disk group is always performing log writes, while the other two are doing archive reads and archive writes. The total log and archive activity is distributed evenly over all disk drives. All disks are involved in data (DBWR) I/O (Figure 5).

Figure 5. An example of a pure metaLUN Oracle implementation

Note that the three groups allow some fencing of data I/O between disks: if interrelations between tables and indexes are known, the tables and indexes should be located on different metaLUNs. Also, the preceding example has RAID group sets of different RAID types. The use of multiple RAID types is optional, of course, but it points out the flexibility of the multiple-RAID-set design.

Hybrid use of metaLUNs and traditional log devices
In all but the largest, most write-intensive databases, implementing logs on separate spindles from data is unnecessary. The write cache insulates the LGWR process from disk flushing latency, and prefetch allows the archive process to efficiently read the offline log even if those drives are experiencing significant random I/O.

However, the flexibility of the CLARiiON metaLUN design plays well into the hands of database managers who absolutely insist that log devices be implemented on separate spindles from data. This approach is difficult with some “virtual” RAID schemes but it is easy to do with CLARiiON, and there really is no downside, if the client is comfortable with dedicating quite a lot of capacity for logs.

Figure 6 illustrates a dedicated log spindle approach, using RAID 1 sets for each log device. RAID 1 is chosen simply because it represents the fewest disks necessary to implement a redundant volume.

Figure 6. An example of a hybrid metaLUN Oracle implementation

The CLARiiON cache
The CLARiiON storage-system cache is a key element in providing good response times and throughput for Oracle.

Cache page size
CLARiiON storage systems allow you to set the cache page size. It is a global setting and thus affects all LUNs. The cache page size can be 2, 4, 8, or 16 KB. It should be set to the Oracle DB_BLOCK_SIZE.

Warning: If the database is stored on a file system, and the file system has a different block size, the database will be effectively working with the file system block size at its back end. So, in this case, use the file system block size as the cache page size. It is a best practice to match the database block size with the file system block size.
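Before setting the cache page size, it is worth confirming the database block size from the running instance; a minimal check from the host (requires SYSDBA access):

    # Query the instance for its block size
    sqlplus -s "/ as sysdba" <<EOF
    show parameter db_block_size
    EOF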

Which LUNs to cache
All tables benefit from read cache, and any nonstatic table benefits from write cache. Redo log devices should have write and read caching enabled. The only LUNs that should be considered for disabling write cache are:


• The redo log archive, in order to keep redo log archive activity out of the write cache
• Static tables (no writes)

The “Redo logs” section has more details.

Spindles and stripes
Calculating the number of spindles to dedicate for an RDBMS file system is not difficult, given an expected I/O profile. However, calculating an expected I/O load can be very difficult. Very often, DBAs do not know their I/O profiles, as host-based tools are limited in scope. The most accurate predictions are those based on empirical data: Workload Analyzer or Navisphere Analyzer is run during production, and the load recorded.

Ensuring the appropriate number of spindles for each workload (DSS, OLTP, archiving) is critical to delivering optimum performance. This is all the more important given the high capacity of modern disk drives, which encourages sharing multiple applications on the same spindles. Such sharing can result in spindle contention and will impact performance-sensitive applications. For additional guidance on minimizing disk contention, refer to the white paper EMC CLARiiON Fibre Channel Storage Fundamentals.
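A back-of-envelope spindle estimate, assuming the 100-mixed-IOPS-per-drive rule of thumb cited in the “Disks” section and a parity RAID penalty of four back-end I/Os per random host write; both are planning approximations, not guarantees, and the workload figures are illustrative:

    #!/bin/sh
    READ_IOPS=1200         # expected host random reads/s (from Analyzer data)
    WRITE_IOPS=400         # expected host random writes/s
    WRITE_PENALTY=4        # back-end I/Os per host write for RAID 5 (assumption)
    PER_DRIVE=100          # rule-of-thumb mixed IOPS per drive
    BACKEND=$((READ_IOPS + WRITE_IOPS * WRITE_PENALTY))
    echo "back-end IOPS: $BACKEND; drives needed: $(( (BACKEND + PER_DRIVE - 1) / PER_DRIVE ))"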

Stripe element size
The stripe element size suggested by CLARiiON Performance Engineering is 64 KB (128 blocks). The resulting stripe sizes for typical RAID stripes are shown in Table 3.

Table 3. Stripe segment and stripe sizes

RAID type and size    Stripe size (64 KB element size)
5-disk RAID 5         256 KB
8-disk RAID 1/0       256 KB
9-disk RAID 5         512 KB
16-disk RAID 1/0      512 KB

In high-bandwidth operations, use these values to match up with the expected maximum I/O sizes and maximum coalescing values on the file system. The goal is to have the host I/O equal to the CLARiiON stripe or an even multiple. That is why 4+1, 4+4, 8+1, and 8+8 stripe sets are popular—it is easy to align the CLARiiON RAID stripe with host I/O sizes when the number of effective disks is a power of two.

RAID levels and performance
EMC CLARiiON Best Practices for Fibre Channel Storage provides details on the relative performance of RAID 5 and RAID 1/0, and CLARiiON RAID 5 optimizations.

When to use RAID 6
New to FLARE® 26 is RAID 6, which offers increased protection against double drive failures in parity RAID. In terms of performance, RAID 6 is comparable to RAID 5 but requires an additional parity calculation.

For random workloads, RAID 6 performs the same as RAID 5 with regards to read operations. Because of the additional parity drive, the back-end activity for RAID 6 can be increased by as much as 50 percent for writes. If the workload can be destaged without the need for forced flushing, RAID 5 and RAID 6 can have similar behavior from a host response time point of view.

For sequential workloads, the read performance is nearly identical between RAID 5 and RAID 6. Due to the double parity protection in RAID 6, the sequential write performance will be lower by about 10 percent. Therefore, RAID 6 can be used as an alternative to RAID 5 when the need for increased reliability outweighs the overhead of the additional parity drive. Refer to the “When to use RAID 5” section to get an idea of what kinds of workloads are best for RAID 5.

When to use RAID 5
Since RAID 5 works best on very large I/O size workloads and in cases of sequential I/O, it is considered the best option for an Oracle implementation in which the DBA is effectively implementing read-ahead, write-behind, and vector writes. If the host OS and HBA are capable of larger than 64 KB transfers, RAID 5 is also very attractive. Note the term effectively—random I/O over a large data structure does not make use of the data coalescing the Oracle database is attempting.

Applications that would benefit from this performance profile would be:

• Any table space where the record size is greater than 64 KB and access is random (personnel records with binary data such as photographs, geophysical databases).

• A DSS database in which access is sequential (performing statistical analysis on sales records).
• A scenario in which cost concerns outweigh performance concerns.

When to use RAID 1/0
RAID 1/0 allows more random write I/O for any given storage system, for any given useable capacity, as compared to RAID 5. Thus, there are two effects favoring RAID 1/0 over RAID 5 in an environment with heavy write I/O:

• The system can maintain more random writes before saturating write cache.
• The disks are less heavily loaded, which helps random read response times.

Otherwise, given equal capacity, random read performance of the two RAID types is very close.

Examples of random small I/O workloads are:

• Data tables containing small records that are updated frequently (account balances)
• TEMP space where the database will be doing a lot of sorting (DSS with structured reports)

Most large application sets, such as SAP and Oracle Solutions, use data buffering in the application servers. This technique results in heavy table scans at initialization, but very random access during the run of the applications.

When to use RAID 1
Use RAID 1 when a dedicated RAID group is required (such as for a log device), and the storage needs are small enough to make a RAID 1/0 LU too costly. RAID 1 handles sequential I/O fairly well.

Be careful with RAID 1 since it does not offer striping. A RAID 1 volume is at a disadvantage when the I/O size increases beyond 128 KB.

When to use RAID 0
RAID 0 offers striping but no data protection. It should only be used for TEMP tables. It offers very good performance in that role, but ensure the client is aware of the cost of a failed drive—the database must be restarted in the case of a drive failure in an unprotected TEMP space.

RAID levels and redundancy
During the planning process, tolerance to service loss due to component failure needs to be assessed. Certain tables are so critical to the database that they may drive the choice of RAID types, irrespective of performance or cost. For example, the cost—in time to rebuild—of a faulted drive in a RAID 5 group is higher than in a mirrored pair or mirrored stripe. A RAID 5 LU also suffers more performance loss than a mirrored device during a rebuild.


Refer to EMC CLARiiON Fibre Channel Storage Fundamentals for details on the relative redundancy of RAID types.

For tables that are under constant heavy use, or which have frequent updates, the increased cost (in deployed drives for the capacity) of a RAID 1/0 group could pay off in a failure situation. Compared to a RAID 5 group, a RAID 1/0 group:

• Has a lower impact on host performance during a rebuild.
• Rebuilds faster.
• Can sustain a multidisk failure (a RAID 1/0 group can lose up to half of its disks and still function).

Disks
The difference in performance among disks of the same rotational speed (rpm) is minimal. However, Oracle performance is highly dependent on the number of drives used. For highest performance, the smallest drives available should be used. This gives more performance per gigabyte; the rule of thumb is 100 mixed IOPS per drive.

Also, the high-rpm (15k rpm) drives offer significantly more performance—up to 30 percent—for random I/O, and thus they are ideal for OLTP applications with high transaction rates. For sequential I/O, they offer very little benefit, if any. Refer to EMC CLARiiON Best Practices for Fibre Channel Storage for details.

Conclusion
Capacity, performance, and reliability are the primary considerations for choosing storage for an Oracle database.

Performance planning requires careful analysis of the database design, the host memory configuration, and performance characteristics of the storage system. Performance starts with good application design, and follows through from there to host tuning. The storage system cannot deliver data any faster than the host requests it. In any case where performance is an issue, refer your client to Oracle’s recommended steps in performance tuning (refer to “Appendix B: DB tuning basic steps”).

Given a good application design, interrelated objects should be deployed on different physical spindles. RAID 1/0 is preferred for OLTP databases, as are 15k rpm drives, and the smallest-capacity drives that fit the client’s capacity needs should be used. DSS applications are served well by RAID 5 and larger 10k rpm drives.

Reliability is a hallmark of CLARiiON CX and CX3 UltraScale series storage systems. You should consider RAID 1/0 for sensitive portions of the database, due to the lower impact of component failure.

Time spent planning before implementation is well spent.


References
These EMC technical white papers are available on Powerlink:

• EMC CLARiiON Fibre Channel Storage Fundamentals
• EMC CLARiiON Best Practices for Fibre Channel Storage
• EMC CLARiiON Data Replication Options for Oracle Deployments – Technology Concepts and Business Considerations
• EMC CLARiiON SnapView and MirrorView for Oracle Database 10g Automatic Storage Management – Best Practices Planning
• EMC CLARiiON Database Storage Solutions: Oracle 10g with SnapView in SAN Environments
• EMC CLARiiON Database Storage Solutions: Oracle 9i with SnapView in SAN Environments

From Oracle TechNet, the key documents are:

• Oracle9i User-Managed Backup and Recovery Guide
• Oracle Database Backup and Recovery Basics 10g Release 2
• The Oracle Database Administrator’s Guide for Oracle 9i, 10g, or 11g


Appendix A: The redo log
Oracle writes to the redo log when:

• A COMMIT is executed against the database
• The log buffer is one-third full
• DBWR writes to the database files

The redo log design obviously costs some in complexity. Why is there a redo log?

The need for consistency
The initial goal of the redo log was to record multistage transactions in an atomic fashion. This was to ensure that, in the event of a catastrophic failure, the database could be brought into a consistent state.

For instance, when two tables have to be updated in one operation, and the updates are interdependent (for example, the database will be out of sync if one executes and the other does not), a single redo log entry—which can be written atomically—is used as a way to record the intent of both changes with one operation. The data does not get updated in the tables until later.

In case of a catastrophic failure, the contents of the redo log are used along with bookmarks in the database itself to finish executing transactions logged but not completed.

Leveraging for performance
The redo log is key to transaction processing performance in Oracle.

Early database designers realized that by essentially bundling multiple writes into one operation, the redo log provided a way to increase performance. Complex business transactions often involve multiple modifications to data in different tables. Multiple table writes are expensive: the latency of writing to multiple files, even multiple file systems, with multiple pieces of storage hardware behind them, means latency in the execution of complex transactions. But unless the modified data are properly written back to durable storage, there is an inherent risk of losing committed transactions if the system should crash.

In modern databases, only the logging data is flushed to persistent storage in a synchronous manner. Overall I/O performance on the dirty database pages is improved, partly because multiple changes to the same page can be effectively written with only one physical I/O, and partly because the writes of multiple dirty pages can be performed as batches by the DBWR processes asynchronously in the background (Figure 7).


Figure 7. An example of a transaction

After step 3 in Figure 7, the database can execute the next operation; the DBWR process can execute the writes to REG2 and CENTRAL asynchronously. Note that the hint in the redo log—that a row is to be moved—is all the database needs in case of a failure. Since the first table operation—adding the row to REG2—is an add, a failure means that to make the database consistent, the old row of CENTRAL needs to be deleted. No data is lost and the database can be made consistent.

Further optimization: Buffer coalescing
The fact that the data to be written is held in the DBWR memory buffer allows a further performance advantage. The DBWR process can scan the buffers for multiple instances of a table update. This gives the database an opportunity to bundle multiple changes to a table into a few operations, thus reducing I/O load, disk contention, and latency. For example, a record with 200 fields receives individual requests changing 20 of its fields; all of these changes would be coalesced into a single write to the table, rather than 20 writes.


Appendix B: DB tuning basic steps
Oracle’s recommended procedure for tuning a database is as follows:

1. Tune the business rules.

2. Tune the data design.

3. Tune the application design.

4. Tune the logical structure of the database.

5. Tune database operations.

6. Tune the access paths.

7. Tune memory allocation.

8. Tune I/O and the physical structure.

9. Tune resource contention.

10. Tune the underlying platform(s).
