Reliable Datagram Sockets and InfiniBand
Hanan Hit, NoCOUG Staff, 2010
Agenda
• InfiniBand Basics
• What is RDS (Reliable Datagram Sockets)?
• Advantages of RDS over InfiniBand
• Architecture Overview
• TPC-H over 11g Benchmark
• InfiniBand vs. 10GE
November 11, 2010
Value Proposition - Oracle Database RAC
• Oracle Database Real Application Clusters (RAC) provides the ability to build an application platform from multiple systems clustered together
• Benefits
– Performance: increase the performance of a RAC database by adding additional servers to the cluster
– Fault Tolerance: a RAC database is constructed from multiple instances; loss of an instance does not bring down the entire database
– Scalability: scale a RAC database by adding instances to the cluster database
[Diagram: multiple Oracle instances sharing access to a single shared database]
Some Facts
• High-end OLTP database applications range from 10-20 TB in size with 2-10k IOPS.
• High-end DW applications fall into the 20-40 TB range with I/O bandwidth requirements of around 4-8 GB per second.
• Two-socket x86_64 servers currently seem to offer the best price/performance.
• The major limitations of these servers are the limited number of slots available for external I/O cards and the CPU cost of processing I/O in conventional kernel-based I/O mechanisms.
• The main challenge in building cluster databases that run on multiple servers is providing low-cost, balanced I/O bandwidth.
• Conventional fibre-channel-based storage arrays, with their expensive plumbing, do not scale well enough to create the balance at which these DB servers could be optimally utilized.
IBA/Reliable Datagram Sockets (RDS) Protocol
What is IBA?
InfiniBand Architecture (IBA) is an industry-standard, channel-based, switched-fabric, high-speed interconnect architecture with low latency and high throughput. The InfiniBand architecture specification defines a connection between processor nodes and high-performance I/O nodes such as storage devices.

What is RDS?
• A low-overhead, low-latency, high-bandwidth, ultra-reliable, supportable Inter-Process Communication (IPC) protocol and transport system
• Matches Oracle's existing IPC models for RAC communication
• Optimized for transfers from 200 bytes to 8 MB
• Based on the socket API
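Because RDS is exposed through the standard socket API, an application opens it like any other datagram socket. A minimal sketch in Python, assuming the Linux AF_RDS address-family value of 21 (Python's socket module does not export an AF_RDS constant); the call only succeeds on a host with the rds kernel module loaded, so failure is handled:

```python
import socket

AF_RDS = 21  # Linux address-family number for RDS (not exported by Python's socket module)

def open_rds_socket():
    """Try to create an RDS socket; return it, or None if RDS is unavailable."""
    try:
        # RDS presents a datagram (SOCK_SEQPACKET) interface with reliable,
        # in-order delivery handled by the transport underneath
        return socket.socket(AF_RDS, socket.SOCK_SEQPACKET, 0)
    except OSError:
        # e.g. EAFNOSUPPORT: the rds kernel module is not loaded on this host
        return None

sock = open_rds_socket()
print("RDS available" if sock is not None else "RDS not available on this host")
```

On a host with RDS loaded, the socket would then be bound to a local IP/port and used with ordinary sendmsg/recvmsg calls, which is what lets Oracle's IPC layer adopt it without changing its programming model.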
Reliable Datagram Sockets (RDS) Protocol
• Leverages InfiniBand's built-in high-availability and load-balancing features
– Port failover on the same HCA
– HCA failover on the same system
– Automatic load balancing
• Open source via OpenFabrics / OFED
http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-docs/
Advantages of RDS over InfiniBand
• Lowering data center TCO requires efficient fabrics
• Oracle RAC 11g will scale for database-intensive applications only with the proper high-speed protocol and an efficient interconnect
• RDS over 10GE
– 10 Gbps is not enough to feed the I/O needs of multi-core servers; each core may require more than 3 Gbps
– Packets can be lost and require retransmission
– Throughput statistics are not an accurate indication; efficiency is much lower than reported
• RDS over InfiniBand
– Network efficiency is effectively 100%
– 40 Gbps today
– Uses InfiniBand delivery capabilities that offload end-to-end checking to the InfiniBand fabric
– Integrated in the Linux kernel
– More tools are being ported to support RDS (e.g., netstat)
– Shows significant real-world application performance boosts for decision support systems and mixed batch/OLTP workloads
Infiniband considerations
Why does Oracle use InfiniBand?
• High bandwidth (1x SDR = 2.5 Gbps, 1x DDR = 5.0 Gbps, 1x QDR = 10.0 Gbps)
– The V2 DB machine uses 4x QDR links (40 Gbps in each direction, simultaneously)
• Low latency (a few µs end-to-end, 160 ns per switch hop)
• RDMA capable
– Exadata cells receive/send large transfers using RDMA, saving CPU for other operations
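The per-link figures above are just lanes multiplied by the per-lane signaling rate; note that SDR, DDR, and QDR all use 8b/10b line encoding, so a "40 Gbps" 4x QDR link carries 32 Gbps of actual data. A quick sketch of the arithmetic:

```python
# Per-lane signaling rate in Gbps for each InfiniBand generation
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

def raw_gbps(lanes, gen):
    """Raw link signaling rate: number of lanes x per-lane rate."""
    return lanes * LANE_GBPS[gen]

def data_gbps(lanes, gen):
    """Usable data rate after 8b/10b line encoding (used by SDR/DDR/QDR)."""
    return raw_gbps(lanes, gen) * 8 / 10

print(raw_gbps(4, "QDR"))   # 4x QDR link as on the V2 machine: 40.0 Gbps raw
print(data_gbps(4, "QDR"))  # 32.0 Gbps of payload bandwidth
```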
#1 Price/Performance TPC-H over 11g Benchmark
• 11g over DDR
– Servers: 64 x ProLiant BL460c (CPU: 2 x quad-core Intel Xeon X5450)
– Fabric: Mellanox DDR InfiniBand
– Storage: native InfiniBand storage, 6 x HP Oracle Exadata

World-record clustered TPC-H performance and price/performance
[Chart: Price / QphH @ 1000 GB, 11g over 1GE vs. 11g over DDR, showing a 73% TCO saving]
POC Hardware Configuration
• Application Servers: 2x HP BL480c; 2 processors / 8 cores, X560 3.16 GHz; 64 GB RAM; 4x 72 GB 15K drives; NIC: HP NC373i 1Gb
• Concurrent Manager Servers: 6x HP BL480c; 2 processors / 8 cores, X560 3.16 GHz; 64 GB RAM; 4x 72 GB 15K drives; NIC: HP NC373i 1Gb
• Database Servers: 6x HP DL580 G5; 4 processors / 24 cores, X7460 2.67 GHz; 256 GB RAM; 8x 72 GB 15K drives; NIC: Intel 10GbE XF SR 2-port PCIe; Interconnect: Mellanox 4x PCIe InfiniBand
• Storage Array: HP XP24000; 64 GB cache / 20 GB shared memory; 60 array groups of 4 spindles (240 spindles total); 146 GB 15K fibre channel disk drives
[Diagram: application, concurrent manager, and database servers connected to the storage array over 1 GbE, 10 GbE, InfiniBand, and 4Gb Fibre Channel networks]
CPU Utilization
• InfiniBand maximizes CPU efficiency, enabling more than 20% higher utilization than 10GE
[Chart: CPU utilization, InfiniBand interconnect vs. 10GigE interconnect]
Disk IO Rate
• InfiniBand maximizes disk utilization, delivering 46% higher I/O traffic than 10GE
[Chart: disk I/O rate, InfiniBand interconnect vs. 10GigE interconnect]
InfiniBand delivers 63% more TPS than 10GE

InfiniBand interconnect
  #  Activity                     Start Time     End Time       Duration  Records    TPS
  1  Invoice Load - Load File     6/17/09 7:48   6/17/09 7:54   0:06:01   9,899,635  27,422.81
  2  Invoice Load - Auto Invoice  6/17/09 8:00   6/17/09 9:54   1:54:21   9,899,635  1,442.89
  3  Invoice Load - Total         N/A            N/A            2:00:22   9,899,635  1,370.76

10GigE interconnect
  1  Invoice Load - Load File     6/25/09 17:15  6/25/09 17:20  0:05:21   7,196,171  22,417.98
  2  Invoice Load - Auto Invoice  6/25/09 18:22  6/25/09 20:39  2:17:05   7,196,171  874.91
  3  Invoice Load - Total         N/A            N/A            2:22:26   7,196,171  842.05
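The TPS column is simply the record count divided by the elapsed duration in seconds. A quick check against the InfiniBand invoice-load rows:

```python
def tps(records, duration):
    """Transactions per second: records / elapsed time, duration given as 'h:mm:ss'."""
    h, m, s = (int(x) for x in duration.split(":"))
    return records / (h * 3600 + m * 60 + s)

# InfiniBand invoice-load rows from the benchmark table
print(round(tps(9_899_635, "0:06:01"), 2))  # 27422.81 (load file)
print(round(tps(9_899_635, "1:54:21"), 2))  # 1442.89 (auto invoice)
print(round(tps(9_899_635, "2:00:22"), 2))  # 1370.76 (total)
```

The headline claim checks out the same way: 1,370.76 / 842.05 is about 1.63, i.e. 63% more TPS on InfiniBand than on 10GigE.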
• Workload
– Nodes 1 through 4: batch processing
– Node 5: extra node, not used
– Node 6: EBS other activity
• Database size: 2 TB (ASM, 5 LUNs @ 400 GB)
• TPS rates for the invoice load use case
• InfiniBand needs only 6 servers vs. the 10 servers needed by 10GE
[Chart: Oracle RAC workload TPS per node, 10GE vs. InfiniBand, scale 0-1600]
Sun Oracle Database Machine
• Clustering is the architecture of the future: highest performance, lowest cost, redundant, incrementally scalable
• The Sun Oracle Database Machine, based on 40Gb/s InfiniBand, delivers a complete clustering architecture for all data management needs
Sun Oracle Database Server Hardware
• 8 Sun Fire X4170 database servers per rack
• 8 CPU cores each
• 72 GB memory
• Dual-port 40Gb/s InfiniBand card
• Fully redundant power and cooling
Exadata Storage Server Hardware
• Building block of massively parallel Exadata Storage Grid
– Up to 1.5 GB/sec raw data bandwidth per cell
– Up to 75,000 IOPS with Flash
• Sun Fire™ X4275 Server
– 2 quad-core Intel® Xeon® E5540 processors
– 24 GB RAM
– Dual-port 4x QDR (40Gb/s) InfiniBand card
– Disk options: 12 x 600 GB SAS disks (7.2 TB total) or 12 x 2 TB SATA disks (24 TB total)
– 4 x 96 GB Sun Flash PCIe cards (384 GB total)
• Software pre-installed– Oracle Exadata Storage Server Software
– Oracle Enterprise Linux
– Drivers, Utilities
• Single point of support from Oracle: 3-year, 24x7, 4-hour on-site response
Mellanox 40Gbps InfiniBand Networking
• Sun Datacenter InfiniBand Switch: 36 QSFP ports
• Fully redundant, non-blocking I/O paths from servers to storage
• 2.88 Tb/s bisectional bandwidth per switch
• 40Gb/s QDR, dual ports per server
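The 2.88 Tb/s figure follows directly from the port count, assuming all 36 ports count toward the bisection and both directions of each 40 Gb/s QDR port are included (as the slide's number implies):

```python
ports = 36          # QSFP ports per Sun Datacenter InfiniBand Switch
port_gbps = 40      # 4x QDR raw signaling rate per port
directions = 2      # full duplex: traffic flows both ways simultaneously

# 36 ports x 40 Gbps x 2 directions = 2880 Gbps = 2.88 Tbps
bisection_tbps = ports * port_gbps * directions / 1000
print(bisection_tbps)  # 2.88
```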
Highest Bandwidth and Lowest Latency
• DB machine protocol stack
[Diagram: protocol stack on the InfiniBand HCA: RAC traffic (Oracle IPC, iDB, SQL*Net, CSS, etc.) runs over RDS, while TCP/UDP runs over IPoIB]
RDS provides:
– Zero loss
– Zero copy (ZDP)
What's new in V2
V1 DB machine                           V2 DB machine
• 2 managed, 2 unmanaged switches       • 3 managed switches
• 24-port DDR switches                  • 36-port QDR switches
• 15-second min. SM failover timeout    • 5-second min. SM failover timeout
• CX4 connectors                        • QSFP connectors
• SNMP monitoring available             • SNMP monitoring coming soon
• Cell HCA in x4 PCIe slot              • Cell HCA in x8 PCIe slot
Infiniband Monitoring
• SNMP alerts on Sun IB switches are coming
• EM support for the IB fabric is coming
– A Voltaire EM plugin is available (at extra cost)
• In the meantime, customers can and should monitor using
– IB commands from the host
– The switch CLI, to monitor various switch components
• Self monitoring exists
– Exadata cell software monitors its own IB ports
– Bonding driver monitors local port failures
– SM monitors all port failures on the fabric
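The host-side IB commands referred to above come from the OFED infiniband-diags tools, for example ibstat for local HCA port state, ibhosts to list fabric nodes, perfquery for port counters, and iblinkinfo for link state; the selection here is illustrative, not a complete list. A small sketch that reports which of these tools are installed on a host:

```python
import shutil

# Common OFED diagnostic commands a DBA might script around (illustrative list)
IB_TOOLS = ["ibstat", "ibhosts", "perfquery", "iblinkinfo", "ibdiagnet"]

def available_ib_tools(tools=IB_TOOLS):
    """Return the subset of IB diagnostic commands found on this host's PATH."""
    return [t for t in tools if shutil.which(t) is not None]

print(available_ib_tools())  # empty list on a host without OFED installed
```

A monitoring script would typically run the available tools periodically and alert on error counters or port-state changes, complementing the self-monitoring the cells and SM already do.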
Scale Performance and Capacity
• Scalable
– Scales to an 8-rack database machine just by adding wires; more with external InfiniBand switches
– Scales to hundreds of storage servers for multi-petabyte databases
• Redundant and fault tolerant
– Failure of any component is tolerated
– Data is mirrored across storage servers