In the past, large database systems were often synonymous with costly mainframes. Today, however, grid computing with commodity servers can provide many advantages for large databases, including cost-effectiveness, scalability, high performance and availability, network consolidation, and simple installation and management. Two key technologies enabling this type of system for large databases are Oracle Real Application Clusters (RAC) and industry-standard grid components with efficient interconnects that provide high throughput and low latency for data traffic between components. Oracle Database 10g and Oracle RAC enable the creation of scalable, high-performance, highly available shared databases on clusters of cost-effective industry-standard servers, helping increase return on investment and reduce total cost of ownership.
The cluster interconnect can have a major effect on Oracle RAC performance. In Oracle RAC systems with interconnect-intensive workloads, Gigabit Ethernet can often become a bottleneck for high-volume cluster messaging and Oracle Cache Fusion traffic between nodes for many applications. InfiniBand, in contrast, can provide significant advantages in both raw bandwidth and reduced latency compared with Gigabit Ethernet, and typically can provide higher performance than Gigabit Ethernet for Oracle RAC systems.
Although support problems and the lack of a standard protocol have historically made implementing InfiniBand for clusters of this size a challenge, Oracle Database 10g Release 2 (R2) and the 10.2.0.3 patch set support a cluster interconnect protocol developed by Oracle and QLogic specifically for Oracle RAC called Reliable Datagram Sockets (RDS), which is agnostic to underlying Remote Direct Memory Access (RDMA)-capable devices and simplifies implementation. This protocol can work over either an RDMA-capable Ethernet network interface card (NIC) or an InfiniBand host channel adapter (HCA).
The RDS protocol uses InfiniBand delivery capabilities that offload end-to-end error checking to the InfiniBand fabric, which frees processor cycles for application processing and enables significant increases in processor scaling compared to Gigabit Ethernet implementations. The key advantages of using this method with Oracle Database 10g and Oracle RAC can include high bandwidth and availability, low latency and processor utilization, reliable packet delivery with no discards or retransmissions, ease of use, and a simplified infrastructure compared with Gigabit Ethernet. Figure 1 illustrates the architecture differences between traditional interconnect protocols and RDS.
Using Reliable Datagram Sockets Over InfiniBand for Oracle Database 10g Clusters
Reliable Datagram Sockets (RDS) Over InfiniBand can provide a horizontally scalable, high-performance alternative to traditional vertical scaling for enterprises using Oracle® Database 10g and Oracle Real Application Clusters (RAC). This article discusses the advantages of using RDS Over InfiniBand to build scalable, high-performance Oracle RAC clusters with cost-effective, industry-standard Dell™ and QLogic components.
By Zafar Mahmood, Anthony Fernandez, and Gunnar K. Gunnarsson
Reprinted from Dell Power Solutions, May 2007. Copyright © 2007 Dell Inc. All rights reserved.
Oracle RAC configurations: Gigabit Ethernet and InfiniBand
Typical Oracle RAC 10g configurations, such as those using Gigabit Ethernet, require three distinct networks:
• Dedicated, secure Oracle RAC cluster interconnect: Provides data coherency between multiple servers to scale out the database; interconnect bandwidth, latency, and processor overhead are critical factors in database scaling and performance
• Storage area network (SAN): Provides access to shared storage resources supporting the database cluster; these connections are sensitive to latency and available I/Os per second
• Public network for client and application tier: Provides network communications between the database tier and the client and application tiers that require the data; these connections are sensitive to processor overhead from transmission protocols
These three separate networks, with redundant components and connections, can require four to six NICs and host bus adapters (HBAs) in each server in the database tier, increasing network complexity and cost. An additional challenge in building Oracle RAC clusters is that rack-mount and blade servers include a limited number of PCI slots, typically making it difficult or impossible to construct clusters with the necessary I/O connectivity, throughput, and availability. And because these same servers now include increased component density and multi-core technology, their processing capacity creates high demands on cluster interconnects and I/O resources.
An InfiniBand infrastructure can be implemented in two ways:
• RDS Over InfiniBand only for the interconnect traffic, which can reduce
latency and increase throughput for cluster messaging and Oracle
Cache Fusion traffic between Oracle RAC nodes
• RDS Over InfiniBand for the interconnect traffic, and SCSI RDMA Protocol
(SRP) for the SAN running over the InfiniBand infrastructure
In the second implementation, a Fibre Channel gateway allows SRP to translate InfiniBand traffic into Fibre Channel Protocol (FCP) frames to utilize the Fibre Channel SAN. This architecture helps greatly increase throughput while reducing cabling, component complexity, and latency. In addition, applications can take advantage of the InfiniBand infrastructure's inherent RDMA capability by utilizing Socket Direct Protocol (SDP) for zero-copy data transfers between application servers and database servers.
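As a rough sketch of how an unmodified sockets application could attempt to use SDP, the commonly used library-preload approach is shown below; the library path and client command are assumptions for illustration only, and the QuickSilver stack used in these tests may expose SDP through its own mechanism.

# Hypothetical preload approach to SDP: sockets programs started from this
# shell attempt SDP instead of TCP. The library path and application name
# are assumptions; they depend on the installed InfiniBand software stack.
export LD_PRELOAD=/usr/lib64/libsdp.so
./app_tier_client --db-host node4-ib   # hypothetical application-tier client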
Oracle RAC scalability in a test environment
In August 2006, the Dell Database and Application Solutions engineering team evaluated the scalability of different Oracle RAC configurations by building an eight-node Oracle RAC 10g cluster on Dell PowerEdge™ 1850 servers, then running the Transaction Processing Performance Council TPC-H workload with both Gigabit Ethernet and RDS Over InfiniBand to evaluate total runtime and average response time. The RDS Over InfiniBand tests used the second implementation described in the preceding section, deploying RDS Over InfiniBand for the interconnect traffic and SRP for the SAN running over the InfiniBand infrastructure. In both tests the storage consisted of two Dell/EMC CX700 arrays with 60 spindles each. Figure 2 shows the test configuration; Gigabit Ethernet was used as the interconnect in the first round of tests, and was then replaced by RDS Over InfiniBand. Figure 3 summarizes the hardware and software used in the test environment.
Figure 1. Architectures for traditional interconnect and Reliable Datagram Sockets protocols (traditional model: Oracle RAC database and IPC library in user space, UDP and IP in the kernel, over a NIC or IPoIB on an HCA; RDS library model: Oracle RAC database and IPC library in user space, RDS in the kernel, over an RDMA-capable NIC or HCA)
Figure 2. Test configuration (Gigabit Ethernet, InfiniBand, LAN/wide area network, SAN 1, and SAN 2)
Implementing RDS Over InfiniBand
The Dell Database and Application Solutions engineering team conducted five tests to compare the scalability of Oracle Database 10g R2 clusters using various components. For test 5, the Gigabit Ethernet interconnect was replaced by an InfiniBand infrastructure. The process of replacing a Gigabit Ethernet interconnect with InfiniBand is simple and does not require reinstalling Oracle binaries or making changes to the data. Enabling the RDS Over InfiniBand fabric can be accomplished with the following steps:
1. After cabling the InfiniBand HCAs and switches, install the required InfiniBand drivers and RDS support libraries on each host. These include the InfiniBand network stack, Fast Fabric, IP over InfiniBand (IPoIB), and RDS drivers. Use the ifconfig command to ensure that the InfiniBand interface appears in the host network interface list, as shown in the ib1 section in Figure A. Oracle Database 10g should initially be configured to use IPoIB to validate Oracle RAC functionality over the InfiniBand fabric.
2. Define the set of host names and associated IP addresses that will be used for the Oracle RAC private Interprocess Communication (IPC) network and specify these in /etc/hosts along with IPoIB host names, as shown in Figure B. Note: Figure B assumes a single InfiniBand connection to each host; however, redundant InfiniBand connections, if configured in IPoIB, are used automatically as part of RDS failover.
3. Define the set of host names that make up the Oracle RAC cluster and start Oracle Database.
4. Once Oracle RAC cluster operation is validated over Gigabit Ethernet, shut down Oracle Database and modify the /etc/hosts files on each node to reference the IPoIB interface IP addresses for the Oracle RAC private interconnect, as shown in Figure C.
5. Through the Oracle Interface Configuration tool (oifcfg), change the cluster interconnect interface on each node with the three commands listed after the Figure A output below (note that oifcfg requires Oracle Cluster Ready Services to be running). Then restart Oracle Database. Oracle RAC should now utilize InfiniBand, with User Datagram Protocol (UDP) IPC traffic passed using IPoIB. Confirm IPoIB use through IPoIB interface statistics or by checking switch port statistics through the switch graphical user interface.
6. After validating the Oracle RAC cluster operation over IPoIB, shut down Oracle Database and build the Oracle RAC IPC library (libskgxp.so) by performing the following commands (as user "oracle") on each Oracle RAC node:
[root@node3 ~]# cd $ORACLE_HOME/rdbms/lib
bond0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:192.168.0.34 Bcast:192.168.0.255 Mask:255.255.255.0
inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
eth0 Link encap:Ethernet HWaddr 00:14:22:21:E7:F1
inet addr:155.16.5.34 Bcast:155.16.255.255 Mask:255.255.0.0
inet6 addr: fe80::214:22ff:fe21:e7f1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:38093292 errors:0 dropped:0 overruns:0 frame:0
TX packets:3080668 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2979697141 (2.7 GiB) TX bytes:318248063 (303.5 MiB)
Base address:0xdcc0 Memory:fe4e0000-fe500000
eth0:2 Link encap:Ethernet HWaddr 00:14:22:21:E7:F1
inet addr:155.16.5.134 Bcast:155.16.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Base address:0xdcc0 Memory:fe4e0000-fe500000
ib1 Link encap:Ethernet HWaddr 06:06:6A:00:6E:6D
inet addr:192.168.1.34 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::406:6aff:fe00:6e6d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:21764931 errors:0 dropped:0 overruns:0 frame:0
TX packets:21672564 errors:0 dropped:2 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4786497450 (4.4 GiB) TX bytes:2947031664 (2.7 GiB)
Figure A. Host network interface list from the ifconfig command
oifcfg getif -global
oifcfg delif -global <ifname, e.g., bond0>
oifcfg setif -global ib1/192.168.1.0:cluster_interconnect
[root@node3 ~]# make -f ins_rdbms.mk ipc_rds
Restart Oracle Database. Oracle RAC should now utilize InfiniBand and RDS.
7. To confirm that Oracle Database is utilizing RDS, look for the IPC version string in the Oracle Database alert logs, as follows:
[root@node4 ~]# vi /opt/oracle/admin/dwdb/bdump/alert_dwdb2.log
Here, dwdb is the name of the Oracle RAC 10g database. When RDS is enabled for Oracle RAC, the log contains an entry similar to the one shown in Figure D, indicating the interconnect protocol and IP address associated with the InfiniBand HCA.
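As a quicker check than paging through the full log, the IPC version string can also be located with grep; the path and database name below follow the example above and will differ per installation.

[root@node4 ~]# grep -i "cluster interconnect IPC version" /opt/oracle/admin/dwdb/bdump/alert_dwdb2.log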
# GE RAC private network
# bond0 – Ethernet RAC private IPC
192.168.0.33 node3-ge node3-priv
192.168.0.34 node4-ge node4-priv
192.168.0.35 node5-ge node5-priv
192.168.0.36 node6-ge node6-priv
192.168.0.37 node7-ge node7-priv
192.168.0.38 node8-ge node8-priv
192.168.0.39 node9-ge node9-priv
192.168.0.40 node10-ge node10-priv
192.168.0.41 node11-ge node11-priv
192.168.0.42 node12-ge node12-priv
# InfiniBand
# ib1 – InfiniBand IPC (not in use for RAC)
192.168.1.100 sst9024
192.168.1.33 node3-ib
192.168.1.34 node4-ib
192.168.1.35 node5-ib
192.168.1.36 node6-ib
192.168.1.37 node7-ib
192.168.1.38 node8-ib
192.168.1.39 node9-ib
192.168.1.40 node10-ib
192.168.1.41 node11-ib
192.168.1.42 node12-ib
Figure B. Set of host names, IPoIB host names, and their associated IP addresses used for the Oracle RAC private IPC network
# GE RAC private network
# bond0 – Ethernet RAC private IPC (no longer used for RAC)
192.168.0.33 node3-ge
192.168.0.34 node4-ge
192.168.0.35 node5-ge
192.168.0.36 node6-ge
192.168.0.37 node7-ge
192.168.0.38 node8-ge
192.168.0.39 node9-ge
192.168.0.40 node10-ge
192.168.0.41 node11-ge
192.168.0.42 node12-ge
# InfiniBand
# ib1 – InfiniBand RAC IPC (now used for RAC)
192.168.1.100 sst9024
192.168.1.33 node3-ib node3-priv
192.168.1.34 node4-ib node4-priv
192.168.1.35 node5-ib node5-priv
192.168.1.36 node6-ib node6-priv
192.168.1.37 node7-ib node7-priv
192.168.1.38 node8-ib node8-priv
192.168.1.39 node9-ib node9-priv
192.168.1.40 node10-ib node10-priv
192.168.1.41 node11-ib node11-priv
192.168.1.42 node12-ib node12-priv
Figure C. /etc/hosts file edited to reference IPoIB interface IP addresses for the Oracle RAC private interconnect
Cluster communication is configured to use the
following interface(s) for this instance
192.168.1.34
Sat Aug 14 12:46:40 2010
cluster interconnect IPC version:Oracle RDS/IP
(generic)
IPC Vendor 1 proto 3
Version 1.0
Figure D. Oracle Database 10g alert log showing IPC version string
Storage test configuration
The Dell/EMC CX700 storage was laid out so that the tablespaces spanned across the maximum number of disks utilizing both storage processors on both storage arrays, and both storage arrays were configured so that 100 percent of the cache was allocated to read ahead, to accommodate the sequential read-intensive nature of data warehousing queries. Each storage processor had 4 GB of cache, all of which was made available for read cache after data load.
Oracle Database 10g performs I/O in sizes specified by the value of the DB_BLOCK_SIZE and DB_FILE_MULTIBLOCK_READ_COUNT parameters. These parameters were set to 8 KB and 128, respectively, to achieve an I/O size of 1 MB (8 KB × 128). Each logical unit (LUN) was created using 16 spindles with a stripe size of 64 KB, matching the Oracle Database I/O size to optimize large sequential read performance.
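For reference, these parameters can be checked (and the multiblock read count adjusted) from any database node. The sketch below is a minimal example assuming a local SYSDBA connection and an spfile-based configuration; note that DB_BLOCK_SIZE is fixed at database creation time.

# Verify the I/O-size parameters discussed above (assumed SYSDBA access and spfile).
# 8 KB blocks x 128 blocks per multiblock read = 1 MB per read request,
# matching each LUN's 16-spindle, 64 KB stripe layout.
sqlplus -s "/ as sysdba" <<'EOF'
SHOW PARAMETER db_block_size
SHOW PARAMETER db_file_multiblock_read_count
-- Only the multiblock read count can be changed after database creation
ALTER SYSTEM SET db_file_multiblock_read_count = 128 SCOPE = BOTH SID = '*';
EOF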
Tests
The test team performed five tests, as shown in Figure 4, and compared total runtime and average response time for each test. Test 5 used RDS Over InfiniBand instead of Gigabit Ethernet for the cluster interconnect; for more information about how the test team reconfigured the environment for this test, see the "Implementing RDS Over InfiniBand" sidebar in this article.
The goal of the testing was to determine the relative scalability of Oracle
Database 10g R2 clusters running a decision support system (DSS) workload
for each test, which involved scaling the Oracle cluster from one to eight
nodes using Oracle parallel execution and partitioning. The TPC-H database
size was 300 GB, and the large tables were partitioned and sub-partitioned
based on range and hash keys. Data was spread across the two Dell/EMC
CX700 storage arrays as described in the preceding section.
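For illustration only, a range-hash composite partitioning scheme of the kind described above might resemble the following for a TPC-H-style lineitem table; the table name, partition bounds, and subpartition count are assumptions rather than the exact test schema.

# Illustrative range-hash composite partitioning (assumed bounds and counts),
# run from a shell on a database node with SYSDBA access.
sqlplus -s "/ as sysdba" <<'EOF'
CREATE TABLE lineitem_part
PARTITION BY RANGE (l_shipdate)
SUBPARTITION BY HASH (l_orderkey) SUBPARTITIONS 8
( PARTITION p1995 VALUES LESS THAN (DATE '1996-01-01'),
  PARTITION p1996 VALUES LESS THAN (DATE '1997-01-01'),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE) )
AS SELECT * FROM lineitem;
EOF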
Test results
Figures 5 and 6 show the total runtimes and average response times when using Gigabit Ethernet as the cluster interconnect in the test environment. This data has been normalized (by dividing all of the data by a constant) to show relative performance gains and does not represent actual test results.
While test 4 was running, the cluster interconnect began showing large cluster latency and Oracle consistent-read global-cache transfer times. Test 5 repeated this test, but used RDS Over InfiniBand as the cluster interconnect in place of Gigabit Ethernet. Figure 7 shows the average response times for test 4 (with Gigabit Ethernet) and test 5 (with RDS Over InfiniBand) for three TPC-H queries that seemed to be interconnect intensive. This data has also been normalized (by dividing all of the data by a constant) to show relative performance gains and does not represent actual test results. Adding the times for these three queries together for each interconnect shows that RDS Over InfiniBand provided an average performance gain of 33 percent over Gigabit Ethernet for these queries.
Oracle RAC best practices
Several best practices can help administrators configure and scale Oracle RAC clusters to increase performance and availability in enterprise environments:
Figure 3. Hardware and software used in the test environment
Servers: Eight Dell PowerEdge 1850 servers
Processors: Single-core Intel® Xeon® processors at 2.8 GHz with a 2 MB L2 cache and an 800 MHz frontside bus; dual-core Intel Xeon processors at 2.8 GHz with two 2 MB caches and an 800 MHz frontside bus
Memory: 8 GB
I/O slots: Two PCI Extended (PCI-X) slots
LAN NIC: Intel Gigabit* Ethernet PCI-X
HBA: Dual-port QLogic QLA2342
Cluster interconnects: Gigabit Ethernet tests: two Intel Gigabit Ethernet interconnects; RDS Over InfiniBand tests: one dual-port QLogic 9000** double data rate (DDR) HCA
Switches: Gigabit Ethernet tests: Dell PowerConnect™ 5224 switch; RDS Over InfiniBand tests: QLogic 9024** DDR switch
Storage: Two Dell/EMC CX700 arrays
OS: Red Hat® Enterprise Linux® AS 4 Update 3
Software: Oracle Database 10g R2 Enterprise Edition with 10.2.0.2 patch set; EMC PowerPath 4.5; QLogic QuickSilver** host access software version 3.3.0.5.2 (for RDS Over InfiniBand tests); Quest Benchmark Factory, Spotlight on RAC, and Toad for Oracle; TPC-H 300 GB database using a scaling factor of 300
*This term does not connote an actual operating speed of 1 Gbps. For high-speed transmission, connection to a Gigabit Ethernet server and network infrastructure is required.
**At the time of the Dell tests, the QLogic 9000 HCA, 9024 switch, and QuickSilver software were products of SilverStorm Technologies. QLogic acquired SilverStorm in November 2006.
Figure 4. Tests used to evaluate Oracle RAC cluster scalability for different configurations
Test | Parallelism | Streams | Nodes | Processors | Cluster interconnect
1 | Node-level | 4 | 1 | Single-core Intel Xeon | Gigabit Ethernet
2 | Node- and cluster-level | 4 | 4 | Single-core Intel Xeon | Gigabit Ethernet
3 | Node- and cluster-level | 4 | 4 | Dual-core Intel Xeon | Gigabit Ethernet
4 | Node- and cluster-level | 4 | 8 | Dual-core Intel Xeon | Gigabit Ethernet
5 | Node- and cluster-level | 4 | 8 | Dual-core Intel Xeon | RDS Over InfiniBand
• Establish a baseline on a single node to provide a point of comparison
when measuring performance increases.
• Use Oracle parallel execution to optimize I/O subsystems for
parallelism. Adding new I/O paths (HBAs) as needed can mitigate high
I/O waits, and administrators can use multipath software such as the
EMC® PowerPath® and Microsoft® Multipath I/O applications for I/O
balancing and failover across I/O paths.
• Use Oracle parallel execution to enable node-level parallelism with
two parallel threads per processor, and to add nodes and enable
cluster-level parallelism.
• For processor-intensive queries, scale up by adding processing power to existing nodes (for example, by upgrading to multi-core processors), which can be more efficient than scaling out by adding nodes. In contrast, I/O throughput bottlenecks (which are typical in data warehouses) require additional I/O channels, meaning that scaling out can be more efficient than scaling up.
• Use NIC bonding to provide interconnect scalability (see the configuration sketch after this list).
• If using Gigabit Ethernet, use jumbo frames for the cluster interconnect (also illustrated in the sketch after this list).
• Use a high-bandwidth, low-latency interconnect such as InfiniBand for
Oracle RAC configurations with more than eight nodes or for database
applications with high interconnect demand (for example, applications
for which Oracle Enterprise Manager reflects significant cluster waits).
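As a sketch of the NIC bonding and jumbo frame practices above, the Gigabit Ethernet private interconnect on Red Hat Enterprise Linux might be configured as follows. The device names, addresses, bonding mode, and 9,000-byte MTU are assumptions, and jumbo frames also require matching support on the interconnect switch.

# /etc/sysconfig/network-scripts/ifcfg-bond0 (hypothetical bonded private interconnect)
DEVICE=bond0
IPADDR=192.168.0.34
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes
MTU=9000                      # jumbo frames; the interconnect switch must allow this

# /etc/sysconfig/network-scripts/ifcfg-eth1 (one of the bonded slave NICs)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

# /etc/modprobe.conf (loads the bonding driver; mode and miimon values are illustrative)
alias bond0 bonding
options bond0 mode=active-backup miimon=100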
Scalable Oracle RAC clusters
Horizontally scalable InfiniBand-enabled Oracle RAC 10g clusters can provide a viable alternative to vertical scaling through traditional symmetric multiprocessing platforms. Deploying such clusters enables enterprises to achieve the processing power they need using cost-effective, industry-standard commodity servers, and to add capacity incrementally as application demands increase by adding cluster nodes. Oracle RAC clusters with InfiniBand also include effective high-availability features: unlike mainframes, Oracle RAC 10g clusters with n+1 or more database servers and a redundantly configured InfiniBand network can deliver a scalable, high-performance database platform with no single points of failure, enabling failed servers or switches to be replaced without application interruption or impact to users. Implementing these types of configurations helps provide scalable, high-performance, high-availability database clusters in enterprise data centers.
Zafar Mahmood is a senior consultant in the Dell Database and Application
Solutions team in Enterprise Solutions Engineering, Dell Product Group.
Zafar has a B.S. and an M.S. in Electrical Engineering, with specialization
in Computer Communications, from the City University of New York.
Anthony Fernandez is a senior analyst with the Dell Database and Application Solutions team in Enterprise Solutions Engineering, Dell Product Group. His focus is on database optimization and performance. Anthony has a bachelor's degree in Computer Science from Florida International University.
Gunnar K. Gunnarsson is a program manager and the Oracle global alliance
manager at QLogic Corporation. Gunnar has a Bachelor of Electrical Engineering
from the University of Delaware, a Master of Engineering with specialization
in computer architecture from the Pennsylvania State University, and an
M.B.A. from the Wharton School of the University of Pennsylvania.
Figure 5. Normalized total runtimes in tests using Gigabit Ethernet interconnects (total runtime in minutes for the four Gigabit Ethernet tests)
Figure 6. Normalized average response times in tests using Gigabit Ethernet interconnects (average response time in seconds for the four Gigabit Ethernet tests)
Figure 7. Normalized average response times for three TPC-H queries using Gigabit Ethernet and RDS Over InfiniBand interconnects (time in milliseconds for test 4 with Gigabit Ethernet and test 5 with RDS Over InfiniBand)