Technical white paper
Couchbase Server 2.0 performance on HP ProLiant DL380p Gen8 Server An Emerging Database Lab Reference Architecture
Table of contents

Executive summary
Couchbase overview
  Schema-free documents
  Queries and views
  Programmer interface
  Replication and sharding
Test load server – HP DL380p Gen8 Server
Load test – Yahoo! Cloud Serving Benchmark
  YCSB is a cloud serving data system load test and measurement tool
  Benchmark value
  Workloads used
  Test scenarios
HP DL380p Gen8 and Couchbase 2.0 performance
  Test setup
  Client setup properties
  DL380p Gen8 setup
  Interrupt assignments and memcache processor affinity
  Memory
  Storage
  Scale-out
  Replication
Couchbase and Hadoop
  Installing the Couchbase Hadoop Connector
  Exporting data from Hadoop to Couchbase Server
  Importing data from Couchbase Server to Hadoop
  HP Insight Cluster Management Utility
Summary
Appendices
  Appendix A: Increasing memcache threads
  Appendix B: Setting processor affinity
  Appendix C: Java jar files required to support the YCSB client
  Appendix D: Performance comparison NIC processor affinity data
  Appendix E: Hyper-Threading throughput comparison
  Appendix F: Comparison of operations per second and average read latency
  Appendix G: Performance implications of varying memory bucket size
  Appendix H: Scale-out performance of two and four nodes
  Appendix I: Replication tables
  Appendix J: JSON output for Hadoop
For more information
Simpler, Faster, Better – Making it Matter

At HP, we have tested the performance of Couchbase Server 2.0 on our ProLiant DL380p Gen8 server using the Yahoo! Cloud Serving Benchmark, and we have verified how to connect this solution with Apache Hadoop. We take the guesswork out of a new implementation by showing you how we implemented this solution in our test laboratory. Get a jump start on your application and system deployment by reviewing our test configurations and results – and save yourself some time and money.
Executive summary
“When all you have is a hammer, everything looks like a nail.”
– Bernard Baruch
This has certainly been true for most enterprise applications; we have used the hammer of the traditional relational database management system (RDBMS) to solve most database design problems. In this paper, we are not suggesting eliminating the RDBMS from the data center, but recognizing that a new class of database management tools is becoming mainstream; NoSQL database management systems (DBMSs) are among this class.
NoSQL DBMSs have a different operational model that can enhance throughput and increase application flexibility when compared with an RDBMS, leaving many traditional expectations lying on the cutting room floor. Many have no explicit table structure, no support for data joins, limited transaction support, and a security model delegated almost completely to the application tier … with demonstrable value for many of today’s most demanding applications. Among the first we have tested is Couchbase Server 2.0, an open-source NoSQL1 document-oriented DBMS.
In document-oriented DBMSs, documents do not have fixed structures like relational tables; they are simply collections of key-value pairs that can be accessed via unique document identifiers or indexed by fields in the documents. Many such DBMSs support JSON2 documents. There is no rigid metadata associated with JSON documents, so application designers can add new fields to a logical schema without rebuilding the entire database. This flexibility may increase developer productivity and assist in keeping pace with the rapid advance of modern web applications.
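As a concrete illustration of this flexibility, the following sketch shows two JSON documents with different structures stored side by side, and a field added later without any schema migration. The document IDs, field names, and values are invented for illustration; they are not from the tested workload.

```python
import json

# Two documents destined for the same logical container; neither shares a
# fixed schema with the other. IDs and fields are purely illustrative.
beer = {"type": "beer", "name": "Old Rasputin", "abv": 9.0}
brewery = {"type": "brewery", "name": "North Coast", "founded": 1988,
           "geo": {"lat": 39.4, "lon": -123.8}}

documents = {
    "beer_old_rasputin": json.dumps(beer),
    "brewery_north_coast": json.dumps(brewery),
}

# Adding a new field later requires no migration of existing documents --
# simply store the updated value under the same document ID.
beer["ibu"] = 75
documents["beer_old_rasputin"] = json.dumps(beer)
```

Documents that lack the new field remain valid and untouched, which is exactly the property that lets a logical schema evolve without rebuilding the database.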
The design paradigm of the NoSQL DBMS necessitates different performance tuning strategies than for an RDBMS. Couchbase Server is designed with a scale-out architecture, including built-in replication and sharding mechanisms to ensure data integrity, availability, horizontal scale and efficient remote client access. Proper configuration of NoSQL DBMS clusters and system-level tuning for optimal memory, disk, and network performance is critical for enterprise-class deployments.
Many companies are looking for solutions with Big Data in mind, so we have included an example configuration linking this solution to a Hadoop cluster.
1 NoSQL has multiple meanings at this time, but we will go forward with Not a Relational Database Management System.
2 JavaScript Object Notation (JSON) is a compact, yet human-readable object-oriented data structure language that is useful for the sharing of information between heterogeneous computer systems and languages.
To assist our customers in evaluating NoSQL solutions, HP has configured and tested Couchbase Server on the latest HP ProLiant DL380p Gen8 server. This reference architecture documents these efforts and highlights several unique advantages HP platforms offer when deploying this DBMS. Following is a small sample of DL380p features.
• Flexible network solutions including advanced LAN-On-Motherboard options that enable customer-tailored network fabrics without consuming PCI-E expansion slots.
• Storage solutions ranging from direct-access hard disks, embedded RAID solutions, high-performance solid-state disk drives, and storage area networks, enabling customers to select the appropriate price and performance point for their individual needs.
• HP Active Health System, which provides continuous, proactive monitoring of over 1,600 system parameters including hardware, operating system, and some application software via the HP Integrated Lights-Out (iLO 4) management processor.
Additionally, cluster management capabilities are provided by HP Insight Cluster Management Utility (CMU). CMU provides push-button scale-out and provisioning with industry-leading provisioning performance (deployment of 800 nodes in 30 minutes), reducing deployments from days to hours. CMU also provides real-time and historical monitoring of the infrastructure and of Hadoop, with 3D visualizations that allow customers to easily characterize Hadoop workloads and cluster performance, reducing complexity, improving system optimization, and lowering cost. HP Insight Management and HP Service Pack for ProLiant allow for easy management of firmware and the server.

In the following pages, we'll discuss the basics of Couchbase Server deployments on the HP ProLiant platform. We use industry quasi-standard workloads to demonstrate the impact of hardware and software configuration choices on server performance. These configuration choices are:
• A basic Couchbase system implementation
• A replication implementation for a highly-available configuration
• A sharded3 implementation for a scale-out, write- and update-heavy configuration
Target audience: This document is intended for decision makers, system and solution architects, system administrators
and experienced users who are interested in reducing time to design, purchase, and implement an HP ProLiant and Couchbase Server solution with an optional Hadoop integration.
This white paper describes testing performed November 2012 through January 2013.
DISCLAIMER OF WARRANTY
This document may contain the following HP or other software: XML, CLI statements, scripts, parameter files. These are provided as a courtesy, free of charge, “AS-IS” by Hewlett-Packard Company (“HP”). HP shall have no obligation to maintain or support this software. HP MAKES NO EXPRESS OR IMPLIED WARRANTY OF ANY KIND REGARDING THIS SOFTWARE INCLUDING ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. HP SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES, WHETHER BASED ON CONTRACT, TORT OR ANY OTHER LEGAL THEORY, IN CONNECTION WITH OR ARISING OUT OF THE FURNISHING, PERFORMANCE OR USE OF THIS SOFTWARE.
Couchbase overview
Couchbase Server 2.0 is an open-source NoSQL DBMS optimized for interactive applications that store data in key-value or document format. It has support for JSON, binary data, indexing and querying, replication for high availability (including cross-datacenter replication), and sharding that provides horizontal scaling. Couchbase is easily scalable, highly performant, with a driving philosophy of “always on”. In this section we present several high-level features of Couchbase, and we direct you to more detail in the footnotes and references.
3 Sharding is a partitioning of the database across two or more systems with the intent to increase throughput by enlarging the number of systems, disk spindles, and other resources working on a sharable task. Documents are partitioned based upon a document identifier or key.
Schema-free documents
Documents in Couchbase Server can be stored in JSON or binary format. Documents contain key-value pairs that are sent to the server and returned to applications by client libraries (APIs). Using the Couchbase SDKs, the document ID is hashed and documents are uniformly distributed across partitions in the cluster. On each node in the cluster, documents are stored in data containers called buckets; a bucket is similar to the concept of a database in an RDBMS. Each bucket can hold multiple documents, and documents with different structures can be collocated in the same bucket. Buckets provide a logical grouping of physical resources within a cluster. Based on the bucket's configuration settings, you can replicate a bucket up to three times within a cluster.
In contrast with RDBMSs where tables are normalized to minimize duplication of information, NoSQL database management systems typically store data in a de-normalized way. There are no joins.
Agile programming enablement is a key focus for a NoSQL DBMS. Its schema-free document model makes coding more flexible and adaptable to the needs of the application and to the availability of information provided as less-structured content.
Schema-free means that the database doesn't need to know the content of the documents, and that documents with different structures may be collocated in the same bucket if desirable. This doesn’t mean the contents are not used by the DBMS, as indexes are based on fields (key-value pairs) in the documents; it is simply a statement that the fields are not required to function as they would with an RDBMS. If the field does not exist, or is not of the value required, the document simply isn’t included in the results.
Performance is enhanced by including relevant data together in a document. Typically in an RDBMS environment, tables are normalized to minimize duplication of information, but even in these shops, selective denormalization is used to enhance application performance. NoSQL database management systems typically include redundant information in the documents based upon application requirements and information availability to enhance performance, especially in a web environment, to minimize the number of round-trips to access information.
Queries and views
“Couchbase Server incorporates the ability to summarize and query the information that is stored in the database through the use of views. Views in Couchbase are written as a JavaScript Map Reduce function. Views define the structure and content of the response generated from the stored key and value pairs. From the information generated by the view, you can query and select specific rows, ranges of rows, and summaries of this information such as counts or sums.”
Views create indexes based upon attribute values that are emitted. The index stores a key and value for each qualified document. A view key defines the search parameters, and a view value provides the fields emitted in the view. This is significant in that the entire document isn’t accessed or returned, only the attributes defined in a view. A view also provides the key to the original document, so if additional information beyond the value fields is required, the original document can be obtained. To provide information in various combinations, multiple views may be used.
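Couchbase views are written in JavaScript; the following Python sketch only mimics the map phase to show how an index of key-value pairs is built from emitted fields. The documents, document IDs, and field names are illustrative, not from the tested workload.

```python
# Simulated map phase of a view: documents with a qualifying field emit a
# (key, value) pair into the index; documents without it are simply skipped.
docs = {
    "u1": {"name": "alice", "city": "Austin", "age": 31},
    "u2": {"name": "bob", "city": "Boston", "age": 45},
    "u3": {"name": "carol", "city": "Austin"},  # no "age" field: not emitted
}

def map_fn(doc_id, doc, emit):
    # Analogous to: function (doc, meta) { if (doc.age) emit(doc.city, ...); }
    if "age" in doc:
        emit(doc["city"], {"id": doc_id, "age": doc["age"]})

index = []
for doc_id, doc in docs.items():
    map_fn(doc_id, doc, lambda k, v: index.append((k, v)))

# Query the index by key; whole documents are never touched, only the
# emitted values (plus the ID, which lets us fetch the full document later).
austin_rows = [v for k, v in index if k == "Austin"]
```

Note how the document lacking the indexed field is silently excluded from results, exactly as described above.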
Programmer interface
Programmers interface with the DBMS via Software Development Kits (SDKs); these are also known as client libraries or APIs. These language-specific APIs, provided by Couchbase, Inc. and the Couchbase Server open source community, are responsible for communicating with the server to perform database operations. Client libraries are cluster-aware, maintain database topology and node status in a dynamic environment, distribute read and write requests to the appropriate nodes, and compensate for failed nodes automatically by redirecting requests as appropriate.
Along with the standard get and set operations, Couchbase Server provides some integrity checking operations. The add() operation assures a record is not already in the database, the replace() operation assures the record already exists, and the cas() operation, compare and swap, allows us to verify a field’s state before an update operation occurs. Couchbase also provides atomic operations for binary values for atomically incrementing (incr) or decrementing (decr) a value for a given key.
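A minimal sketch of these semantics, using a toy in-memory store rather than the real SDK (in Couchbase these are network operations against the server, and the class below is purely illustrative):

```python
import itertools

class ToyStore:
    """Toy key-value store illustrating add/replace/cas semantics."""

    def __init__(self):
        self._data = {}                 # key -> (cas_token, value)
        self._cas = itertools.count(1)  # monotonically increasing CAS tokens

    def add(self, key, value):
        if key in self._data:
            return False                # add() fails if the key already exists
        self._data[key] = (next(self._cas), value)
        return True

    def replace(self, key, value):
        if key not in self._data:
            return False                # replace() fails if the key is absent
        self._data[key] = (next(self._cas), value)
        return True

    def get_with_cas(self, key):
        return self._data[key]          # returns (cas_token, value)

    def cas(self, key, expected_cas, value):
        token, _ = self._data[key]
        if token != expected_cas:
            return False                # key was modified since we read it
        self._data[key] = (next(self._cas), value)
        return True
```

The cas() path is the interesting one: a stale token means another writer got there first, so the update is rejected and the caller must re-read and retry.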
Replication and sharding
High availability comes with replication, and horizontal scale comes with sharding. Replication and sharding may be combined to provide a horizontal scale deployment that provides high availability with large-scale storage capacity.
Data stored in Couchbase is also partitioned using consistent hash partitioning. By design, each Couchbase data bucket is split into 1,024 partitions called vBuckets. Documents are auto-sharded and evenly distributed across these partitions.
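A simplified stand-in for this mapping (the real client libraries use a CRC32-based scheme whose exact bit manipulation is omitted here; the function below only demonstrates the principle of deterministically folding a document ID onto the fixed partition space):

```python
import zlib

NUM_VBUCKETS = 1024  # each Couchbase bucket is split into 1,024 partitions

def vbucket_for(doc_id: str) -> int:
    # Hash the document ID and fold it onto the fixed vBucket space.
    # Simplified sketch, not the SDK's exact algorithm.
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_VBUCKETS

# The mapping is deterministic, so every cluster-aware client routes a given
# document ID to the same vBucket (and hence the same node) without any
# coordination beyond knowing the current vBucket-to-node map.
assignments = {f"doc::{i}": vbucket_for(f"doc::{i}") for i in range(10_000)}
```

Because the partition count is fixed at 1,024, adding or removing nodes only moves vBuckets between nodes; document-to-vBucket assignments never change.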
Replication is a feature that allows copies of a database bucket to reside on alternate nodes; this is the basis of high availability. The system internally manages the replication of partitions for each bucket. The client library manages this for the developer, and based on the document ID referenced, routes requests to the appropriate node that manages the
partition to which the document belongs. For Couchbase Server, the number of replicas for a bucket must be established when the bucket is created.
From a user or developer perspective, the client library manages the failover of a partition automatically when a node becomes unresponsive. From a server perspective, the replica partition is promoted to active status and the administrator may rebalance the cluster to establish another replica partition for the data4.
When replication is applied, we must also recognize the eventually-consistent paradigm of NoSQL DBMSs. Relational database management systems normally use an ACID model (atomicity, consistency, isolation, durability) to ensure data consistency, whereas NoSQL DBMSs, including Couchbase Server, typically use a BASE model (basically available, soft state, eventually consistent). When replication is in use, we must recognize that write durability (persistence) is an important consideration regarding the availability of writes and updates to the replicas. Couchbase Server 2.0 provides the observe() operation5 that allows us to ensure a document is persisted to disk and, if replication is active, that the document is replicated.
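The following toy sketch models observe()-style durability polling: after a write, the client repeatedly checks whether the document has been persisted and replicated before declaring it durable. The status dictionary and polling loop are purely illustrative; they are not the SDK's actual API.

```python
import time

def wait_for_durability(status, timeout_s=1.0):
    """Poll a (hypothetical) durability status until the document is both
    persisted to disk and replicated, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if status["persisted"] and status["replicated"]:
            return True
        time.sleep(0.01)  # back off briefly between polls
    return False
```

The real observe() call reports per-node state; the essential idea is the same: the client blocks on durability criteria instead of trusting an in-memory acknowledgment.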
Nodes may be added to an active cluster to increase storage, and the administrator may rebalance the buckets as appropriate.6 These same nodes may also contain replication vBuckets.
Next, we will review the system configuration used for the performance load testing.
Test load server – HP DL380p Gen8 Server
The server hardware for the Couchbase Server test configuration is an HP ProLiant DL380p Gen8 server. Each HP DL380p Gen8 server is outfitted with two Intel Xeon E5-2690 8-core 2.9GHz CPUs, 192GB of RAM7, 24Gbps network bandwidth (48Gbps bidirectional bandwidth), and eight 300GB small form factor hard disk drives for a total of 2.4TB. When used in the scale-out configuration, a four-node cluster includes 64 cores, 128 threads, 768GB RAM, 96Gbps network bandwidth, and 9.6TB of disk storage. This is just the server test configuration. The HP DL380p Gen8 server has much more capacity than shown here; please see the data sheet for more information8.
Figure 1. HP DL380p Gen8 Server
4 See administration of node failover: couchbase.com/docs/couchbase-manual-2.0/couchbase-admin-tasks-failover.html
5 See observe in the SDK manual: couchbase.com/docs/couchbase-devguide-2.0/couchbase-sdk-when-to-observe.html
6 For more information on sharding, see: couchbase.com/docs/couchbase-devguide-2.0/brief-info-on-vBuckets.html
7 For the load test, we limit physical RAM to 192GB; we also decrease the available RAM to the database to assist in gathering load performance statistics when comparing database size with RAM availability.
8 HP DL380p Gen8 server data sheet: http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA3-9615ENW.
Table 1. HP DL380p Gen8 server configuration
Quantity Description
1 HP DL380p Gen8 8-SFF CTO Server
2 HP DL380p Gen8 Intel® Xeon® E5-2690 Performance Processors
1 HP Smart Array P420i SAS controller
1 HP 1GB P-series Smart Array Flash Backed Write Cache
12 HP 16GB 2Rx4 PC3-12800R-11
1 HP 1GbE 4 Port 331FLR Adapter
1 HP NC523SFP 10Gb 2-port Server Adapter
2 HP 750W Common Slot Gold Hot Plug Power Supply Kit
8 HP 300GB 6G SAS 15K 2.5in SC Enterprise Hard Drive
Next, we review the load generating benchmark.
Load test – Yahoo! Cloud Serving Benchmark
For every solution, we need to define the applicable areas for which the system may be used, and to show configuration trade-offs of the solution (like the amount of memory required, network bandwidth consumed, CPU usage). We have chosen the Yahoo! Cloud Serving Benchmark (YCSB) workload and its framework to characterize the performance of the NoSQL DBMS on the HP DL380p.
In this section we present a brief overview of YCSB and the workloads used in our tests. More information is available in the footnotes and references.
YCSB is a cloud serving data system load test and measurement tool
Yahoo!’s research arm developed a framework and a common set of workloads to help understand the performance characteristics of various DBMSs under various types of load. The stated purpose of YCSB is to use the workloads in evaluating the performance of "key-value" and "cloud" serving stores.9 This allows us to use this common benchmark to test the NoSQL database systems and determine their performance characteristics on a server. We may also use this performance information to compare with other DBMS solutions. For further information, there is an original Association for Computing Machinery paper, “Benchmarking Cloud Serving Systems with YCSB”,10 published by the Yahoo! research team.
The research team recognized a new class of DBMSs that didn’t follow the traditional ACID approach to transactions and write durability, nor the OLTP types of workloads. Therefore a new approach was proposed to facilitate the comparison of similar “cloud data serving systems”11. The YCSB framework is extensible to allow expansion of the workloads and DBMSs as desired.
The YCSB framework has a client that manages the workload and maintains and reports the statistics. This client has multiple threads that are used to send requests to the database and monitor the responses using a driver specific to the DBMS under test. This same client is also used to load the test documents into the database.
Source code for this benchmark is maintained at GitHub.12 Additional published benchmark results are available in ycsb-v4.pdf (2010-03-31).13
9 The benchmark purpose may be reviewed at: http://research.yahoo.com/Web_Information_Management/YCSB
10 Information about the paper (2010): http://research.yahoo.com/node/3202; paper download: http://research.yahoo.com/files/ycsb.pdf
11 http://research.yahoo.com/node/3202
12 Source code may be obtained at: http://github.com/brianfrankcooper/YCSB
13 http://research.yahoo.com/files/ycsb-v4.pdf
Benchmark value
The value of running these tests is that the results help us position the solution, understand best practices, and understand the benefits of the various configuration options. Latency and throughput are two results from these tests that indicate the value of our choices. There is a significant difference when running workloads where the working set of data is all cached in memory compared with workloads having a working set that is too large to be fully contained in memory, thus forcing disk I/O. We may also observe the impact of write durability options that force data to be preserved on disk before returning to the client, or replication options that require documents to be preserved across the network on another node. These results may also be compared with other solutions.
• Latency is the time it takes to return from a request.
• Throughput is the number of requests per second the solution is capable of achieving.
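These two metrics can be recovered directly from per-request timings. The helper below mirrors, in simplified form, how a YCSB-style client might summarize a run; the percentile calculation is a naive sketch, and the sample numbers are invented.

```python
def summarize(latencies_ms, run_seconds):
    """Summarize a benchmark run from individual request latencies.
    Naive percentile math; illustrative only."""
    ops = len(latencies_ms)
    return {
        # throughput: completed operations divided by wall-clock run time
        "throughput_ops_s": ops / run_seconds,
        # latency: mean time to return from a request
        "avg_latency_ms": sum(latencies_ms) / ops,
        # a rough 95th-percentile latency from the sorted samples
        "p95_latency_ms": sorted(latencies_ms)[int(0.95 * ops)],
    }

# Four requests completed over a 2-second window (illustrative data).
stats = summarize([1.0, 2.0, 2.0, 3.0], run_seconds=2.0)
```

Average latency alone can hide tail behavior, which is why benchmark reports usually include percentile latencies alongside the mean.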
Workloads used
There are two YCSB workloads we use to test the NoSQL database management systems; they are called Workload A and Workload C.14
Workload A is an update-heavy workload with operations split between 50% read and 50% write. Workload C is a read-only workload with 100% read operations.
• Update-heavy
• Read-only
The data in a document, for this benchmark, is approximately 1,000 bytes. Documents consist of ten fields with 100 bytes in each field, and a key. The entire document is brought in for each read and update.
When updating or writing to the disk, Couchbase Server processes 250k15 documents at a time, and then commits the writes to the disk. It only writes the latest update to a document, pruning multiple writes to the same document.
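A sketch of this pruning behavior: when the write queue is flushed, only the latest mutation per document ID reaches disk. The queue contents below are invented for illustration; this is not Couchbase's internal data structure.

```python
def coalesce(write_queue):
    """Collapse a write queue so only the newest value per document ID
    is committed. Illustrative model of write-queue pruning."""
    latest = {}
    for doc_id, value in write_queue:  # queue preserves arrival order
        latest[doc_id] = value         # later writes overwrite earlier ones
    return latest

# Five queued mutations, three of which target the same document "a".
queue = [("a", 1), ("b", 1), ("a", 2), ("a", 3), ("c", 1)]
committed = coalesce(queue)            # only three disk commits remain
```

Under an update-heavy workload this pruning can significantly reduce disk traffic, since hot documents absorb many in-memory updates between flushes.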
For both the update-record and read-record selection, we use the uniform distribution algorithm, giving any record an even chance of being selected. Zipfian is the default test algorithm, but for our tests, we choose the uniform algorithm to lessen the potential performance improvement that caching of hot documents would provide.16 When a document is selected, the entire document is returned to the client.
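Uniform selection is straightforward to express; the sketch below shows the record-selection step in isolation (the record count matches the per-node database size used in these tests, and the seed is arbitrary):

```python
import random

def pick_uniform(record_count, rng=random):
    """Select a record ID with uniform probability -- every document has
    the same chance of being chosen, which defeats hot-document caching."""
    return rng.randrange(record_count)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
picks = [pick_uniform(25_000_000, rng) for _ in range(1000)]
```

By contrast, YCSB's default Zipfian generator concentrates accesses on a small set of hot keys, which a memory-resident cache would serve disproportionately well; uniform selection removes that advantage.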
Test scenarios
The test database is loaded with 25 million documents per node, each document containing 10 fields of 100 bytes; a four-node cluster therefore holds 100 million documents.
Our experience indicates that NoSQL DBMSs prefer to have enough memory to hold the entire working set in memory, thus we have two sub-scenarios for each workload to document performance. In the first scenario, enough memory to hold the working set is provided, and in the second, half the memory required to hold the entire working set is provided.
• Working set cached in memory
• Working set twice the size of available memory
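The two sub-scenarios reduce to simple sizing arithmetic. The sketch below ignores per-document metadata overhead, which a real deployment must also budget for:

```python
# Back-of-the-envelope working-set sizing for one node: 25 million documents
# of roughly 1,000 bytes each (ten fields of 100 bytes). Metadata overhead
# is deliberately ignored to keep the arithmetic transparent.
DOCS_PER_NODE = 25_000_000
DOC_BYTES = 10 * 100  # ten fields of 100 bytes each

working_set_gb = DOCS_PER_NODE * DOC_BYTES / 1024**3  # ~23.3 GB per node

# Scenario 1: memory quota >= working set (fully cached).
fully_cached_gb = working_set_gb
# Scenario 2: memory quota = half the working set (forces disk I/O).
half_cached_gb = working_set_gb / 2
```

This is why the servers carry 192GB of RAM: there is ample headroom to grant or withhold cache relative to the roughly 23GB per-node working set.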
Now, let us review some of the performance results.
HP DL380p Gen8 and Couchbase 2.0 performance
Test setup
The test environment includes four HP ProLiant DL380p Gen8 servers, configured as previously described, as the Couchbase servers, and sixteen HP ProLiant DL380p servers as the clients (or drivers): twelve Gen8 servers with sixteen cores each and four G7 servers with twelve cores each. We utilize one 10GbE NIC on each node.
14 More information regarding workloads is available at: https://github.com/brianfrankcooper/YCSB/wiki/core-workloads
15 couchbase.com/docs/couchbase-manual-2.0/couchbase-monitoring-diskwritequeue.html
16 More information can be found in the YCSB document: http://research.yahoo.com/files/ycsb.pdf
The test setup consists of the following software:
• OS Version: Red Hat Enterprise Linux Server release 6.1 (Santiago), 2.6.32-131.0.15.el6.x86_64
• Oracle Java version: 1.7.0_0317
• YCSB – commit version 8afcc6e from https://github.com/Altoros/YCSB
• Couchbase version: Couchbase Community Server 2.0.0 Beta, Build 1723
The test setup consists of the following hardware:
• Sixteen ProLiant DL380p Gen8 with 2 Intel Xeon 2.9 GHz E5-2690 Sandy Bridge processors
• Four ProLiant DL380p G7 with 2 Intel Xeon 3.46 GHz X5690 Westmere processors
• One HP 6600-24XG18 Switch
Figure 2. Test setup
Client setup properties
During the test runs, we ran 1 to 32 YCSB clients (drivers) on each node, each client with 128 threads, for a maximum of 512 clients.
JVM options

The Java Virtual Machine on the client systems is run with the following parameters:
-Xms2048m -Xmx2048m -XX:MaxDirectMemorySize=2048m
-XX:+UseConcMarkSweepGC -XX:MaxGCPauseMillis=850
YCSB run parameters
Table 2 describes the YCSB parameters used for the test runs.
17 Appendix C provides a list of Java jar files required to support the YCSB client.
18 Because of the heavy network traffic, we suggest upgrading the test configuration network switch to an HP 5920AF high-performance 24-port top-of-rack 10GbE switch with ultra-deep packet buffering.
Table 2. YCSB client run parameters
Parameter Value Brief
couchbase.timeout 60,000
couchbase.readBufferSize -1
insertorder ordered
requestdistribution uniform
recordcount 25 million
maxexecutiontime 360
operationcount 999,999,999
writeallfields true
readallfields true
threads 128
readproportion 1 or 0.5 1 for workload ‘C’, or 0.5 for workload ‘A’
updateproportion 0 or 0.5 0 for workload ‘C’, or 0.5 for workload ‘A’
scanproportion 0
insertproportion 0
DL380p Gen8 setup
We recommend the DL380p Gen8 nodes be provisioned with 10G Ethernet ports.19
We recommend leaving Hyper-Threading disabled.20
We recommend the DL380p Gen8 nodes be provisioned with a minimum of 128GB of RAM. See the Memory section that follows. A two processor DL380p has 4 memory channels per processor with 3 DIMM slots in each channel for a total of 24 slots. Memory configuration is subject to DIMM population rules and guidelines described in the white paper “Configuring and using DDR3 memory with HP ProLiant Gen8 Servers”.21 The maximum number of DIMMs per channel (DPC) to utilize the highest supported DIMM speed of 1600 MHz is 2 DPC.
We recommend using the suggested settings from the white paper “Configuring and Tuning HP ProLiant Servers for Low-Latency Applications”22. For the tests we used all the recommended settings except for the Intel Turbo Boost Technology setting, which we enabled; during testing we were careful not to exceed the processor temperature limit, which would cause the processor to transition out of turbo boost mode.
19 See Appendix F for a comparison of operations per second and average read latency.
20 See Appendix E for Hyper-Threading throughput comparison.
21 Configuring and using DDR3 memory with HP ProLiant Gen8 Servers, http://h20000.www2.hp.com/bc/docs/support/SupportManual/c03293145/c03293145.pdf
22 Configuring and Tuning HP ProLiant Servers for Low-Latency Applications, http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf
Table 3. BIOS settings
Parameter Value
Intel Virtualization Technology Disabled
Intel Hyper-Threading Options Disabled
Intel Turbo Boost Technology Enabled
Intel VT-d Disabled
Thermal Configuration Maximum Cooling
Processor Power and Utilization Monitoring Disabled
Memory Pre-Failure Notification Disabled
HP Power Profile Maximum Performance
HP Power Regulator Static High Performance
Intel QPI Link Power Management Disabled
Minimum Processor Idle Power Core State No C-states
Minimum Processor Idle Power Package State No Package State
Energy/Performance Bias Maximum Performance
Collaborative Power Control Disabled
DIMM Voltage Preference Optimized for Performance
Interrupt assignments and memcache processor affinity
The Intel Sandy Bridge processors have an integrated PCI Express 3.0 controller. In our test configuration, we have two of these processors in each DL380p Gen8 server. Therefore, to maximize performance, we localize PCIe interrupt processing (interrupts from the 10GbE NICs) to the processor hosting the task requesting the network traffic. If the interrupts are instead hosted by the other processor, we incur an extra hop between processors to handle each interrupt, and our test results show the impact is significant.23
Table 4. DL380p PCIe slot to processor assignments24
PCIe Riser Slot Processor
Standard 1 1
Standard 2 1
Onboard LOM n/a 1
Expansion 4 2
Expansion 5 2
Expansion 6 2
23 More information on the E5-2600 processor is available at intel.com/p/en_US/embedded/hwsw/hardware/xeon-e5-c604/overview
24 For more information on the DL380p PCIe slot assignments, see the quickspecs at http://h18004.www1.hp.com/products/quickspecs/14212_div/14212_div.html
In the Couchbase server process, the Couchbase-memcache worker threads handle the network I/O for Couchbase client requests. For the system under test, the PCIe controller for the 10Gb NIC is connected to Processor 1; therefore, we set the processor affinity of these threads to Processor 1. With all the worker threads running on a single processor, we also gain additional performance from reduced memcache thread synchronization overhead between the processor chips.
By default, the Couchbase Server process starts with four Couchbase-memcache worker threads per Couchbase server instance. To improve performance, and to fully utilize all cores of the processor, we recommend starting one Couchbase-memcache worker thread per core. For the system under test, with eight cores per processor available, eight Couchbase-memcache worker threads were started.
The second processor provides additional compute capacity for all the other Couchbase processes and threads (the disk writer threads, the disk compaction process, and other overhead tasks).
Appendix A provides an example C program to increase the number of Couchbase-memcache worker threads, and Appendix B provides an example shell script to set processor affinity.
Figure 3 shows a throughput comparison when the NIC interrupts are processed by processor 1 and the memcache threads have varying processor affinities. The throughput is measured for YCSB benchmark ‘C’. The Couchbase Server was started with eight Couchbase-memcache worker threads. The detail data is provided in Appendix D, Table D-1.
Figure 3. Performance comparison with varying memcache thread processor affinity
[Chart: throughput (millions of operations/second) versus number of YCSB clients (1 to 512), for 8 memcache threads with Processor 1 affinity, with Processor 2 affinity, and split equally across Processors 1 and 2.]
Figure 4 shows average document read latency comparison when the NIC interrupts are processed by processor 1 and the memcache threads have varying processor affinities. The latency is measured for YCSB benchmark ‘C’. The Couchbase server was started with eight Couchbase-memcache worker threads. The detail data is provided in Appendix D, Table D-4.
Figure 4. Average document read latency comparison with varying memcache thread processor affinity
Memory
System dynamic RAM is a determining factor when calculating throughput performance. There are two portions to this section:
• Populating the DIMM slots for maximum performance
• Sizing memory correctly for Couchbase server data buckets
Populating DIMM slots
For maximizing Couchbase performance, DL380p Gen8 nodes should be provisioned with a minimum of 128GB RAM configured as eight 16GB DDR3-1600 RDIMMs, one DIMM per channel. The memory configuration may be expanded to 256GB by adding 8 x 16GB DIMMs, or 192GB by adding 8 x 8GB DIMMS, in the second DIMM slot.25
25 A two processor DL380p has 4 memory channels per processor with 3 DIMM slots in each channel for a total of 24 slots. Memory configuration is subject to
DIMM population rules and guidelines described in the white paper “Configuring and using DDR3 memory with HP ProLiant Gen8 Servers”
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c03293145/c03293145.pdf. To utilize the highest supported DIMM speed of 1600 MHz, a
maximum of two DIMMs per channel may be occupied. Page 20 states “For optimal throughput and latency, populate all four channels of each installed CPU
identically.”, and that in the second slot “There are no performance implications for mixing sets of different capacity DIMMs at the same operating speed.”
Therefore, the second DIMM slot may be populated with 8GB or 16GB modules while still maintaining optimal performance.
[Chart for Figure 4: average read latency (milliseconds) versus number of YCSB clients (1 to 512), for the same three memcache thread affinity cases as Figure 3.]
Figure 5: Memory bandwidth when populating slots, from “Configuring and Using DDR3 Memory”
Couchbase memory recommendations
Sizing the memory working set correctly has a significant impact on Couchbase server performance.
Buckets are used to compartmentalize data within Couchbase Server, and a per-node RAM quota can be set for each bucket. When memory usage exceeds 65% of the bucket's allocated RAM (the default low water mark) and replication is enabled, eviction of replica data items begins. When memory usage exceeds 75% (the default high water mark), Couchbase starts evicting items from the cache using a random eviction policy until usage falls below the low water mark. When memory usage reaches 90% of the bucket's allocated RAM, Couchbase returns temporary out-of-memory errors to clients when storing data.
To get to the appropriate memory recommendation, we need to calculate the working set size of the application using Couchbase server. You will need to perform a similar calculation for your application as we perform for YCSB. The formulas and values below follow the recommendations Couchbase Inc. provides in the sizing guidelines available on their website.26
The YCSB client distributes its requests evenly across the complete target data set, as configured by the YCSB requestdistribution parameter. Your application may only have a small percentage of “hot” or working set data; in our test, we use the entire database as the working set to assist with our performance analysis. These results can then serve as a baseline for specifying a configuration and estimating real-world performance expectations. For example, a 2.5TB dataset with a 1% working set would be expected to perform similarly to the systems tested, because the active dataset size is a comparable 25GB. Remember, since each application has its own system usage characteristics, your mileage may vary.
For our test, we use the following sizing input variables.
• Number of Documents: 25 million, the total number of documents in our working set – per server node
• Number of Replicas: zero or one, the number of copies of the original data we want to keep
• Document key size: 10 bytes, the size of the document IDs
• Document value size: 1,000 bytes, the size of the document values
• Working set percentage: 100%, or less to assist with performance analysis, the percentage of data we want in memory
There are also several sizing constants.
• Metadata per document, memory resident storage overhead for each document: 64 bytes.
• Headroom, an additional 25-30% space overhead over the size of the dataset required by the cluster
• High water mark, the default is set to 75% of allocated memory
26 Couchbase Sizing guidelines, couchbase.com/docs/couchbase-manual-2.0/couchbase-bestpractice-sizing.html
Following are the formulas to calculate RAM required.
• Total Size of Metadata = (Number of documents) * ( metadata per document + document key size) * (1 + Number of replicas)
• Total Size of Dataset = (Number of documents) * (document value size) * (1 + Number of replicas)
• RAM required = (Total size of Metadata + (Total size of dataset * (working set percentage/100))) * (1 + headroom)/ (high water mark)
Table 5: YCSB RAM size estimation
Number of Documents 25,000,000
Number of replicas 0 or 1 (one replica doubles the RAM required)
Metadata per document 64 bytes
Document key size 10 bytes
Document value size 1000 bytes
Total Metadata 1,850,000,000 bytes
Total Dataset 25,000,000,000 bytes
Working set % 100
Working set size 25,000,000,000
Headroom 0.3
High Water Mark 0.75
RAM Required 43.34 GB
Performance implications of varying memory configurations
Varying the memory assigned to the data bucket, we observe the following results. Supporting tabular data is in Appendix G, Table G-1 and Table G-2.
With the data bucket size set to 43GB, the minimum required to hold all 25 million data items in memory, we find benchmark ‘C’ performance comparable to the performance at 64GB. However, once the allocated RAM is reduced by 5% to 41GB, so that only 95% of the working set resides in memory, performance drops considerably, as shown in Figure 6.
Figure 6: YCSB workload ‘C’ with varying bucket memory configurations
In the case of YCSB workload ‘A’, the disk write queues that hold mutations to the dataset before they are persisted to disk require additional memory. So even though 43GB was sufficient for a read-only workload (YCSB ‘C’), with YCSB ‘A’ we see a performance degradation: the additional memory usage exceeds the high water mark threshold, causing items to be evicted from memory until usage drops to the low water mark, which in turn causes additional cache misses. In Figure 7, we see an additional memory requirement of approximately 10GB. With the data bucket size set to 53GB, YCSB workload ‘A’ performance is comparable to 64GB.
Figure 7: YCSB workload ‘A’ with varying bucket memory configurations
[Chart for Figure 6: maximum throughput (thousands of operations/second) versus number of YCSB clients (1 to 256) for YCSB workload ‘C’, with data bucket memory of 64GB, 53GB, 43GB, 41GB, and 32GB.]
[Chart for Figure 7: maximum throughput (thousands of operations/second) versus number of YCSB clients (1 to 256) for YCSB workload ‘A’, with the same five data bucket memory configurations.]
Figure 8 plots the Couchbase disk write queues for a 60 minute YCSB workload ‘A’ test run using one Couchbase server with 25 million documents configured with a 64GB data bucket. There are 32 YCSB clients driving the workload and updates are occurring at approximately 250,000 updates per second for the duration of the run.
Figure 8: YCSB workload ‘A’ showing memory consumption and disk write queue
Storage
Couchbase’s replication scheme addresses redundancy requirements, so drives may be striped (RAID 0) for maximum performance. The DL380p Gen8 provides a maximum internal storage of 25TB using 25 small form factor (SFF) serial attached SCSI (SAS) hot plug drives, or 36TB using large form factor (LFF) SATA drives.
Scale-out
For scale-out (sharding) performance tests, we increase the number of server nodes to two and four Couchbase Servers, comparing the results to a single node using the same number of clients. We loaded each server with 25 million additional documents, until we reached a total of 100 million documents with four nodes. Figure 9 plots the throughput and Figure 10 plots the average read latency for YCSB workloads ‘A’ and ‘C’ using 32 test clients. See Appendix H, Table H-1, for tabular data. The results show a near linear scaling in throughput with a comparable decrease in read latency.
[Chart for Figure 8: disk write queue size (millions of items), write queue drain rate, and memory used (MB) over the 60-minute test run.]
Figure 9: Scale-out performance of 2 and 4 nodes compared with a single node
Figure 10: Scale-out read latency of 2 and 4 nodes compared with a single node
Replication
For high availability, we recommend a three node configuration with one replica.
The following graphs provide some general statistics for three-node clusters with and without a replication factor of one. At moderate update rates, replication keeps pace with the write and update rates. As we exceed six clients (~230k writes or updates per second per node), replication backoff is requested. At high update rates, most of the document mutations remain in memory, are drained slowly to disk, and replication is delayed. Tables in Appendix I provide more detail on replication statistics and backlog.
When the number of items in the disk write and replication queues exceeds the throughput capability of a node, replication is asked to back off. As a node receives requests, both the update and write data received from client applications and the items to be replicated received from other servers are placed on a disk write queue. If too many items are waiting in the disk write queue at any given destination, Couchbase Server will request the other servers to reduce the rate
[Chart for Figure 9: throughput (millions of operations/second) versus number of nodes (1, 2, 4) at 32 clients, for YCSB workloads ‘C’ and ‘A’.]
[Chart for Figure 10: average read latency per operation (milliseconds) versus number of nodes (1, 2, 4) at 32 clients, for YCSB workloads ‘C’ and ‘A’.]
that replication data is sent to a destination. By default, this limit is very low and needs to be modified (see issue report couchbase.com/issues/browse/MB-4358). For our tests this limit is approximately 230k writes or updates per second per cluster node; your mileage will vary.
For our testing, we set the queue size to 100 million items.27
Figure 11 shows that throughput continues to grow beyond six clients; what it does not show is that replication is no longer keeping up with throughput. During burst traffic this may be acceptable, but if the volume is sustained, more nodes should be added to the cluster. We must observe the TAP queues28 to know whether documents are being replicated at an acceptable rate.
Figure 11: Throughput with and without replication for YCSB ‘A’ for 3 Couchbase nodes, 1 replica
To see that replication is backing off, observe the difference between the drain rate and the set rate in Table 6. At six clients, with consolidation of writes, the drain rate keeps up with the sets; at eight clients, there is a consistent backlog that drains after the test stops; and at 32 clients, the client write and update load is so heavy that the drain rate is reduced significantly during the test, delaying replication. (To optimize replication resource usage, sets to the same document are consolidated before transmission.)
Table 6: Replication set rate compared with drain rate for 6, 8, and 32 clients; 3 Couchbase nodes, 1 replica
6 Clients 8 Clients 32 Clients
Set Rate 323k 536k 734k
Drain Rate 232k 237k 40.1k
27 /opt/couchbase/bin/cbepctl <couchbase server nodes>:11210 set -b <bucket name> tap_param tap_throttle_queue_cap 100000000
28 The TAP protocol is an internal part of the Couchbase Server system and is used in a number of different areas to exchange data throughout the system. TAP provides a stream of data of the changes that are occurring within the system. Source: couchbase.com/docs/couchbase-manual-2.0/couchbase-introduction-architecture-tap.html
[Chart for Figure 11: throughput (operations/second) versus number of YCSB clients (1 to 32), with and without replication.]
Table 7 shows the replication backlog over a sixty-minute test run using three Couchbase nodes with one replica. Notice that the six-client run has no backlog at the end of the test, and that the backlog takes several minutes to clear with eight and thirty-two clients.
Table 7: Replication backlog during sixty minute test run for 6, 8, and 32 clients; 3 Couchbase nodes, 1 replica
Minutes 6 Clients 8 Clients 32 Clients
Test Starts 0 0 0
15 13 13,969,771 35,458,754
30 20 28,124,147 26,007,996
45 18 17,937,968 41,194,417
60 0 14,131,188 32,997,405
Test Stops - - -
TS+1 0 11,743,104 26,887,516
TS+2 0 3,592,056 15,294,458
TS+3 0 36,723 2,552,896
TS+4 0 0 0
Figure 12 shows that the consumption of processor 1 stabilizes between six and eight clients; all memcached threads are running on this processor. These memcached threads process all the sets for updates, writes, and replication.
Figure 12: Processor usage with and without replication for YCSB ‘A’ for 3 Couchbase nodes, 1 replica
[Chart for Figure 12: processor usage (% CPU) versus number of YCSB clients (1 to 32) for Processors 1 and 2, with and without replication.]
Figure 13 shows that network traffic stabilizes at approximately eight clients. This is where replication is asked to back off: the processor running the memcached threads is wholly consumed between six and eight clients. Again, watch your TAP queues.
Figure 13: Network usage with and without replication for YCSB ‘A’ for 3 Couchbase nodes, 1 replica
Figure 14 shows a general trend of greater disk write rate with replication. The observation is that each node has greater disk requirements due to replication.
Figure 14: Disk usage with and without replication for YCSB ‘A’ for 3 Couchbase nodes, 1 replica
[Chart for Figure 13: total network usage (Gbps) versus number of YCSB clients (1 to 32), with no replication and with one replica.]
[Chart for Figure 14: disk usage (transactions per second) versus number of YCSB clients (1 to 32), with no replication and with one replica.]
Couchbase and Hadoop
NoSQL DBMSs such as Couchbase Server are designed for interactive, OLTP-like applications. For batch processing or large-scale analytics, other solutions like Hadoop can be used; Couchbase and Hadoop are complementary. Over time, users may move data to Hadoop and the Hadoop Distributed File System (HDFS) for historical or pattern analysis. Couchbase provides a Hadoop Connector and is Cloudera certified, so it can be used for integration with existing HP Solutions for Apache Hadoop29. The Connector is a sqoop plugin that streams data between Couchbase and Hadoop clusters.
Installing the Couchbase Hadoop Connector
The currently distributed Couchbase Hadoop Connector only works with Cloudera CDH3. The next release of the Couchbase Hadoop Connector will support Cloudera CDH4. At the time of writing, the Couchbase Hadoop Connector for Cloudera CDH4 is not generally available and the version we used was a preview version provided by Couchbase Inc. At the HP Emerging Database Lab, we’ve tested with both Cloudera CDH3 and CDH4. Once you’ve installed Cloudera CDH3, or CDH4, and sqoop, download the Couchbase Hadoop Connector from couchbase.com/develop/connectors/hadoop.
The Connector is packaged as a zip file. Unzip it and run the install.sh script inside, pointing to the location of the sqoop libraries installed on the system (the default is /usr/lib/sqoop). For example, to install the Couchbase Hadoop Connector:
./install.sh /usr/lib/sqoop
Exporting data from Hadoop to Couchbase Server
To get data from Hadoop to the Couchbase cluster, we use sqoop to export the data from HDFS to Couchbase. As an example, we can use the Hadoop WordCount example from the map-reduce tutorial.30
The Hadoop WordCount example reads text files from HDFS and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. Each mapper takes a line as input and breaks it into words. It then emits a key-value pair of the word and 1. Each reducer sums the counts for each word and emits a single key-value with the word and the sum.
Assuming the text files are stored in the HDFS directory /wordcount-input, we can kick off a Hadoop map-reduce job as follows.
/usr/bin/hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar wordcount /wordcount-input /wordcount-output
To export these key-value pairs from HDFS so that they can be queried from Couchbase applications, we invoke sqoop as follows.
sqoop export --connect http://<ip address of Couchbase node>:8091/pools \
--table dummy_argument --export-dir /wordcount-output \
--fields-terminated-by '\t' --lines-terminated-by '\n'
The table parameter is required, but ignored. The export-dir parameter is the output directory from our Hadoop word count map-reduce job. The fields-terminated-by and lines-terminated-by parameters describe the format of the files.
To export the data as JSON documents, you can modify the reducer in the word count example to emit JSON formatted data from the reduce phase of the processing. See Appendix J for emitting JSON formatted data.
29 HP Solutions for Apache Hadoop, hp.com/go/hadoop
30 Hadoop word count example, http://hadoop.apache.org/docs/r0.19.1/mapred_tutorial.html and http://wiki.apache.org/hadoop/WordCount
Importing data from Couchbase Server to Hadoop
Similar to exporting data from Hadoop to Couchbase, we can import data from Couchbase into Hadoop to perform map-reduce processing in the Hadoop cluster.
sqoop import --connect http://<ip address of Couchbase node>:8091/pools \
--table DUMP --username default --target-dir /exporteddatafromcouch
For importing JSON documents from Couchbase to Hadoop with the current released version of the Couchbase Hadoop Connector, users must write their own class that implements the SqoopRecord abstract class and use it when invoking sqoop import with the jar-file and class-name parameters.
sqoop import --connect http://<ip address of Couchbase node>:8091/pools \
--table DUMP --jar-file /userdir/DUMP.jar --class-name DUMP --username default \
--target-dir /exporteddatafromcouch
The jar-file and class-name parameters are required until the upcoming release of the Couchbase Hadoop Connector.
HP Insight Cluster Management Utility
HP Insight Cluster Management Utility31 (CMU) is an efficient and robust hyperscale cluster lifecycle management framework and suite of tools for large Linux clusters such as those found in High Performance Computing (HPC) and Big Data environments. A simple graphical interface enables an ‘at-a-glance’ view of the entire cluster across multiple metrics, provides frictionless scalable remote management and analysis, and allows rapid provisioning of software to all the nodes of the system. Insight CMU makes the management of a cluster more user friendly, efficient, and error free than if it were being managed by scripts, or on a node-by-node basis. Insight CMU offers full support for iLO 2, iLO 3, iLO 4 and LO100i adapters on all HP ProLiant servers in the cluster.
Insight CMU is highly flexible and customizable, offers both GUI and CLI interfaces, and is being used to deploy a range of software environments, from simple compute farms to highly customized, application-specific configurations. Insight CMU is available for HP ProLiant and HP BladeSystem servers with Linux operating systems, including Red Hat Enterprise Linux, SUSE Linux Enterprise, CentOS, and Ubuntu. Insight CMU also includes options for monitoring Graphical Processing Units (GPUs) and for installing GPU drivers and software. Figure 15 provides an example throughput graphic available through CMU.
Figure 15: CMU example throughput graphics
31 For more information on HP Insight Cluster Management Utility, hp.com/go/cmu
Summary
The HP DL380p Gen8 server is an excellent host for Couchbase Server 2.0 and provides state-of-the-art capabilities and capacities in a compact 2U chassis.
• Two Intel Xeon E5-2690 8-core processors
• Up to 768GB RAM
• Up to 16 SFF HP SmartDrives with the latest HP Smart Array Controller
• Two 10Gbps Ethernet or FlexFabric LOM
• Six PCIe Gen3 slots
• HP Active Health System for always on diagnostics
• HP iLO 4th generation management processor
As shown in Table 8, using an HP DL380p Gen8 server paired with Couchbase 2.0 in a single-node configuration running YCSB, the benchmark throughput is 796,044 read (workload ‘C’) or 505,634 update-heavy (workload ‘A’) operations per second. In a scale-out four-node configuration with a sharded database, YCSB throughput scales near linearly, to 2,680,010 read and 1,777,368 update-heavy operations per second, and read latency decreases near linearly as nodes are added.
Table 8: YCSB workloads ‘A’ and ‘C’, one to four nodes, 25 to 100 million documents; throughput in operations per second, and latency in microseconds
Nodes ‘C’ Throughput ‘C’ Read Latency ‘A’ Throughput ‘A’ Read Latency
1 796,044 5,141 505,634 16,194
2 1,387,095 2,975 952,431 8,591
4 2,680,010 1,543 1,777,368 4,600
Appendices
Appendix A: Increasing memcache threads
Currently there is no simple way to increase the number of Couchbase-memcache worker threads; see Couchbase issue MB-5519, couchbase.com/issues/browse/MB-5519.
For our tests, we increased the number of Couchbase-memcache worker threads by creating a ‘stub’, which invokes the real executable binary with the required parameters. To set the number of memcached worker threads, the original memcached binary (memcached.orig) is invoked with the parameter string “-t 8”.
Move the original “/opt/couchbase/bin/memcached” to “/opt/couchbase/bin/memcached.orig”.
Compile the source for the following stub and move the executable to /opt/couchbase/bin/memcached and set the file permissions to the original memcached values.
#include <unistd.h>
#include <stdio.h>

/* Stub: exec the original memcached binary with eight worker threads (-t 8). */
int main(void)
{
    execl("/opt/couchbase/bin/memcached.orig","/opt/couchbase/bin/memcached.orig",
          "-t","8","-X", "/opt/couchbase/lib/memcached/stdin_term_handler.so",
          "-X","/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;"
          "sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
          "-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p","11210",
          "-E","/opt/couchbase/lib/memcached/bucket_engine.so","-B","binary","-r",
          "-c","10000","-e","admin=_admin;default_bucket_name=default;auto_create=false",
          (char *) 0);
    perror("execl"); /* reached only if execl fails */
    return 1;
}
Appendix B: Setting processor affinity
The following is a sample script to set processor affinity.
#!/bin/bash
TMP1=`mktemp`
TMP2=`mktemp`
TMP3=`mktemp`
#get a stack trace of the couchbase memcached process
pstack `ps -u couchbase | grep memcached | sed 's/^[ ]*//' | \
cut -d' ' -f1` > $TMP1
#the memcached worker threads all have 'worker_libevent' in the thread stack frame
grep -B 4 "worker_libevent" $TMP1 | grep LWP | sort > $TMP2
grep LWP $TMP1 | sort > $TMP3
#set the affinities of the non-memcached worker threads to the second processor
comm -3 $TMP2 $TMP3 | cut -d\( -f3 | cut -d\) -f1 | sed 's/LWP//' | \
sed 's/^/taskset -c -p 8-15/' > $TMP1
echo "will execute this script:"
cat $TMP1
chmod +x $TMP1
$TMP1
#set the affinities of the memcached worker threads to the first processor
cat $TMP2 | cut -d\( -f3 | cut -d\) -f1 | sed 's/LWP//' | \
sed 's/^/taskset -c -p 0-7/' > $TMP1
echo "will execute this script:"
cat $TMP1
chmod +x $TMP1
$TMP1
pstack `ps -u couchbase | grep beam.smp | sed 's/^[ ]*//' | cut -d' ' -f1` | \
grep LWP > $TMP1
cat $TMP1 | cut -d\( -f3 | cut -d\) -f1 | sed 's/LWP//' | \
sed 's/^/taskset -c -p 8-15/' > $TMP2
chmod +x $TMP2
$TMP2
rm -rf $TMP1 $TMP2 $TMP3
Appendix C: Java jar files required to support the YCSB client
A list of jar files used by the YCSB test client:
core-1.0.jar
couchbase-2.0-1.0.jar
couchbase-client-1.0.2.jar
slf4j-api-1.5.2.jar
slf4j-log4j12-1.5.2.jar
log4j-1.2.14.jar
jackson-core-asl-1.9.2.jar
jackson-mapper-asl-1.9.2.jar
spymemcached-2.8.0.jar
jettison-1.1.jar
netty-3.2.0.Final.jar
commons-codec-1.5.jar
Appendix D: Performance comparison NIC processor affinity data
All tests in this appendix are run with YCSB workload ‘C’ on one Couchbase server node with:
• 64GB bucket size
• No Hyper-Threading
• 25 million documents
• 8 Couchbase-memcache threads
The information related to each test displayed in Figure 3 follows.
Table D-1: Affinity set to Processor 1
YCSB Clients / Throughput (ops/sec) / Average Latency (microseconds) / Total %CPU / Processor 1 %CPU / Processor 2 %CPU / Network Usage (Gbps)
1 87,165 1,464 5 8 1 1
2 162,491 1,571 9 17 1 2
4 290,953 1,759 17 33 1 3
8 535,220 1,912 33 65 1 6
16 716,585 2,856 44 87 1 7
32 796,044 5,141 48 95 2 8
64 856,588 9,566 48 95 2 9
128 898,709 18,350 49 95 2 9
256 924,019 35,920 48 95 2 9
512 945,230 69,420 48 94 2 9
Table D-2: Affinity set to Processor 2
YCSB Clients | Throughput (ops/sec) | Average Latency (microseconds) | Total CPU % | Processor 1 CPU % | Processor 2 CPU % | Network Usage (Gbps)
1 | 77,558 | 1,646 | 5 | 1 | 8 | 1
2 | 147,474 | 1,732 | 9 | 1 | 17 | 2
4 | 270,725 | 1,894 | 17 | 1 | 32 | 3
8 | 515,503 | 1,983 | 32 | 1 | 61 | 5
16 | 684,837 | 2,991 | 42 | 1 | 79 | 7
32 | 728,545 | 5,625 | 45 | 1 | 84 | 8
64 | 715,887 | 11,427 | 44 | 1 | 84 | 8
128 | 711,053 | 23,016 | 45 | 1 | 84 | 8
256 | 712,483 | 45,687 | 45 | 1 | 84 | 8
512 | 700,544 | 91,975 | 45 | 1 | 85 | 7
Table D-3: Affinity equally split between Processor 1 and 2
YCSB Clients | Throughput (ops/sec) | Average Latency (microseconds) | Total CPU % | Processor 1 CPU % | Processor 2 CPU % | Network Usage (Gbps)
1 | 87,562 | 1,457 | 5 | 9 | 1 | 1
2 | 143,029 | 1,800 | 10 | 9 | 10 | 2
4 | 263,978 | 1,937 | 18 | 9 | 25 | 3
8 | 466,853 | 2,220 | 35 | 35 | 35 | 5
16 | 574,819 | 3,591 | 45 | 44 | 46 | 6
32 | 593,565 | 6,935 | 47 | 46 | 48 | 6
64 | 588,088 | 13,969 | 46 | 46 | 47 | 6
128 | 599,402 | 27,302 | 47 | 46 | 47 | 6
256 | 599,154 | 54,445 | 47 | 46 | 47 | 6
512 | 611,754 | 105,377 | 48 | 47 | 48 | 6
Table D-4: Average document read latency comparison with varying memcache thread processor affinity; latency in microseconds
YCSB Clients | Processor 1 Affinity | Processor 2 Affinity | Split Processor Affinity
1 | 1,464 | 1,646 | 1,457
2 | 1,571 | 1,732 | 1,800
4 | 1,759 | 1,894 | 1,937
8 | 1,912 | 1,983 | 2,220
16 | 2,856 | 2,991 | 3,591
32 | 5,141 | 5,625 | 6,935
64 | 9,566 | 11,427 | 13,969
128 | 18,350 | 23,016 | 27,302
256 | 35,920 | 45,687 | 54,445
512 | 69,420 | 91,975 | 105,377
Appendix E: Hyper-Threading throughput comparison
Figure E-1: Operations per second with Hyper-Threading enabled and disabled
Table E-1: Operations per second with Hyper-Threading enabled and disabled
YCSB Clients | Hyper-Threading Off (Disabled) | Hyper-Threading On (Enabled)
1 | 87,165 | 87,721
2 | 162,491 | 153,645
4 | 290,953 | 246,990
8 | 535,220 | 471,060
16 | 716,985 | 696,002
32 | 796,044 | 797,855
64 | 856,588 | 865,621
128 | 898,709 | 902,177
256 | 924,019 | 923,406
512 | 945,230 | 922,442
[Figure E-1: line chart titled "Performance with Hyperthreads" – x-axis: Number of YCSB Clients (1 to 512); y-axis: Throughput, millions of operations per second; series: HT 'off' with 8 memcache threads, and HT 'on' with 16 memcache threads]
Appendix F: Comparison of operations per second and average read latency
For maximum performance, the DL380p nodes should be provisioned with 10G Ethernet ports. The throughput shown in Table F-1 is for one DL380p Gen8 server node.
Table F-1: Comparison of operations per second and average read latency with 10GbE NIC
YCSB Clients | Throughput (ops/sec) | Average Read Latency (microseconds)
1 | 87,165 | 1,464
2 | 162,491 | 1,571
4 | 290,953 | 1,759
8 | 535,220 | 1,912
16 | 716,585 | 2,856
32 | 796,044 | 5,141
64 | 856,588 | 9,566
128 | 898,709 | 18,350
256 | 924,019 | 35,920
512 | 945,230 | 69,420
Appendix G: Performance implications of varying memory bucket size
Table G-1: YCSB workload ‘C’, one node, 25 million documents, with varying bucket memory configurations; throughput in operations per second
YCSB Clients | 64GB | 53GB | 43GB | 41GB | 32GB
1 | 87,358 | 89,393 | 90,679 | 70 | 5
2 | 164,010 | 170,792 | 167,272 | 162 | 14,293
4 | 292,671 | 302,591 | 299,893 | 35,534 | 26,885
8 | 540,882 | 548,807 | 547,101 | 92,717 | 26,561
16 | 714,945 | 730,184 | 720,451 | 94,775 | 30,873
32 | 802,232 | 809,897 | 805,374 | 101,311 | 34,666
64 | 871,037 | 888,026 | 875,130 | 100,629 | 33,072
128 | 913,413 | 937,368 | 918,457 | 95,732 | 35,525
256 | 946,124 | 969,239 | 949,576 | 103,886 | 35,840
Table G-2: YCSB workload ‘A’, one node, 25 million documents, with varying bucket memory configurations; throughput in operations per second
YCSB Clients | 64GB | 53GB | 43GB | 41GB | 32GB
1 | 87,047 | 87,204 | 5,645 | 155 | 10
2 | 156,855 | 157,582 | 160 | 178 | 1,906
4 | 278,978 | 281,590 | 24,982 | 1,727 | 23,884
8 | 504,167 | 494,053 | 109,627 | 75,093 | 24,628
16 | 515,176 | 508,931 | 98,790 | 70,338 | 25,715
32 | 521,661 | 519,218 | 90,692 | 67,388 | 25,769
64 | 523,903 | 521,885 | 95,937 | 67,359 | 26,384
128 | 521,993 | 522,801 | 101,818 | 67,384 | 27,073
256 | 518,300 | 519,793 | 102,405 | 71,504 | 29,416
Appendix H: Scale-out performance of two and four nodes
Table H-1: YCSB workloads ‘A’ and ‘C’, one to four nodes, 25 to 100 million documents; throughput in ops per second, and latency in microseconds
Nodes | ‘C’ Throughput | ‘C’ Read Latency | ‘A’ Throughput | ‘A’ Read Latency
1 | 796,044 | 5,141 | 505,634 | 16,194
2 | 1,387,095 | 2,975 | 952,431 | 8,591
4 | 2,680,010 | 1,543 | 1,777,368 | 4,600
Appendix I: Replication tables
The first set of tables below provides general statistics for two- and three-node clusters with and without replication. At high update rates, most document mutations reside in memory and are drained to disk slowly. For high availability we recommend a three-node configuration with one replica; however, we recognize that some sites may use only a two-node cluster. With that in mind, we include the results and statistics for both two- and three-node configurations, with and without one replica.
The second set of tables provides replication backlog information. When the number of items in the disk write and replication queues exceeds the throughput capability of the cluster, replication is asked to back off. As a node receives requests, both the updates and writes it receives from client applications and the items to be replicated received from other servers are placed on a disk write queue. If too many items are waiting in the disk write queue at a given destination, Couchbase Server asks the other servers to reduce the rate at which they send data to that destination. By default this limit is very low and needs to be raised (see issue report couchbase.com/issues/browse/MB-4358). In our tests the practical limit is approximately 230k writes or updates per second per cluster node; your mileage will vary.
For our testing, we use a three-node cluster and set the queue size to 100 million items. The following command is an example of setting the queue size.
/opt/couchbase/bin/cbepctl <couchbase server nodes>:11210 set -b <bucket name>
tap_param tap_throttle_queue_cap 100000000
Replication statistics for two and three nodes with and without replication
To demonstrate the replication capability of the cluster, we use only the update benchmark, YCSB ‘A’. Tables I-1 and I-2 show the results and statistics for two Couchbase nodes, without and with one replica; Tables I-3 and I-4 show the same for three Couchbase nodes.
Table I-1: Throughput and resource consumption without replication – YCSB workload ‘A’, 2 Couchbase nodes
YCSB Clients | Throughput (ops/sec) | Average Latency (microseconds) | Total CPU % | Processor 1 CPU % | Processor 2 CPU % | Memory % | Network Usage (Gbps) | Disk (tps)
1 | 167,302 | 1,495 | 13 | 12 | 14 | 42 | 0.88 | 3
2 | 300,916 | 1,669 | 19 | 24 | 13 | 42 | 1.58 | 2
4 | 528,127 | 1,908 | 30 | 45 | 13 | 42 | 2.78 | 2
6 | 737,731 | 2,051 | 40 | 66 | 13 | 44 | 3.74 | 2
8 | 944,832 | 2,135 | 51 | 86 | 14 | 42 | 4.65 | 2
16 | 955,568 | 4,266 | 51 | 86 | 14 | 40 | 4.67 | 2
32 | 947,329 | 8,667 | 53 | 84 | 19 | 42 | 4.58 | 1
Table I-2: Throughput and resource consumption with replication – YCSB workload ‘A’, 2 Couchbase nodes, 1 replica
YCSB Clients | Throughput (ops/sec) | Average Latency (microseconds) | Total CPU % | Processor 1 CPU % | Processor 2 CPU % | Memory % | Network Usage (Gbps) | Disk (tps)
1 | 153,800 | 1,633 | 20 | 22 | 17 | 74 | 1.58 | 2
2 | 278,665 | 1,805 | 34 | 43 | 24 | 78 | 2.68 | 2
4 | 397,187 | 2,946 | 37 | 50 | 24 | 77 | 3.18 | 12
6 | 617,684 | 2,509 | 44 | 60 | 28 | 75 | 3.73 | 17
8 | 832,199 | 2,507 | 54 | 82 | 25 | 76 | 4.70 | 24
16 | 902,888 | 4,589 | 56 | 84 | 26 | 76 | 4.66 | 24
32 | 955,981 | 8,580 | 57 | 87 | 24 | 76 | 4.85 | 32
Table I-3: Throughput and resource consumption without replication – YCSB workload ‘A’, 3 Couchbase nodes
YCSB Clients | Throughput (ops/sec) | Average Latency (microseconds) | Total CPU % | Processor 1 CPU % | Processor 2 CPU % | Memory % | Network Usage (Gbps) | Disk (tps)
1 | 187,912 | 1,328 | 16 | 8 | 23 | 64 | 0.64 | 19
2 | 366,488 | 1,361 | 16 | 18 | 13 | 65 | 1.25 | 2
4 | 682,691 | 1,502 | 31 | 36 | 25 | 63 | 2.29 | 12
6 | 975,603 | 1,583 | 35 | 56 | 13 | 62 | 3.35 | 2
8 | 1,246,078 | 1,647 | 50 | 74 | 24 | 61 | 4.35 | 21
16 | 1,393,160 | 2,899 | 50 | 82 | 15 | 62 | 4.61 | 6
32 | 1,384,244 | 5,901 | 52 | 81 | 21 | 64 | 4.52 | 16
Table I-4: Throughput and resource consumption with replication – YCSB workload ‘A’, 3 Couchbase nodes, 1 replica
YCSB Clients | Throughput (ops/sec) | Average Latency (microseconds) | Total CPU % | Processor 1 CPU % | Processor 2 CPU % | Memory % | Network Usage (Gbps) | Disk (tps)
1 | 186,958 | 1,335 | 17 | 16 | 17 | 78 | 1.24 | 3
2 | 231,865 | 2,683 | 27 | 25 | 29 | 84 | 1.58 | 35
4 | 293,408 | 6,463 | 39 | 44 | 33 | 79 | 1.99 | 32
6 | 666,862 | 9,378 | 51 | 61 | 41 | 78 | 3.98 | 21
8 | 1,031,963 | 2,077 | 50 | 80 | 18 | 80 | 5.11 | 8
16 | 1,273,242 | 3,215 | 54 | 81 | 25 | 76 | 4.83 | 31
32 | 1,364,619 | 5,985 | 53 | 79 | 25 | 76 | 4.69 | 27
Replication backlog
The following tables show the replication backlog during 60-minute test runs with 6, 8, and 32 YCSB clients on a three-node cluster. Replication backlog is obtained using the cbstats tool.
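Extracting the backlog counter from cbstats output can be scripted. A minimal sketch follows, assuming the stat appears in the usual `name: value` form; the sample line below is illustrative of that format, not captured output:

```shell
# Sample cbstats-style output line (illustrative)
sample=' ep_tap_total_backlog_size:              13969771'
# In practice the line would come from something like:
#   /opt/couchbase/bin/cbstats <node>:11210 all -b <bucket name>
# Split on the colon and strip the padding spaces from the value field
backlog=$(echo "$sample" | grep ep_tap_total_backlog_size | awk -F: '{gsub(/ /, "", $2); print $2}')
echo "$backlog"   # prints 13969771
```

Sampling this value every few minutes during a run yields the time series shown in the tables below.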
Tables I-5, I-6, and I-7 show the replication backlog. The database and its replication are eventually consistent, yet once we pass approximately 230k updates per second per node, the replication queue starts to grow and replication is asked to back off. When the clients consume the full processor capacity (as at 32 clients), replication progress approaches zero until processing cycles become available to drain the replication queue; once the load falls below maximum capacity, replication continues. The actual document writes and updates are consolidated: multiple writes to the same document are collapsed, and only the last write or update is replicated and written to disk.
Notice in Table I-5 that there is no backlog when the test stops. At rates above approximately 237k writes and updates per second, a backlog occurs.
Table I-5: Replication backlog, 6 clients
Time | Replication backlog (ep_tap_total_backlog_size)
05:00 | 0
05:15 | 13
05:30 | 20
05:45 | 18
06:00 | 0
Test stops
Table I-6: Replication backlog, 8 clients
Time | Replication backlog (ep_tap_total_backlog_size)
06:46 | 0
07:01 | 13,969,771
07:16 | 28,124,147
07:31 | 17,937,968
07:46 | 14,131,188
Test stops
07:47 | 11,743,104
07:48 | 3,592,056
07:49 | 36,723
07:50 | 0
Table I-7: Replication backlog, 32 clients
Time | Replication backlog (ep_tap_total_backlog_size)
08:51 | 0
09:06 | 35,458,754
09:21 | 26,007,996
09:36 | 41,194,417
09:50 | 32,997,405
Test stops
09:51 | 26,887,516
09:52 | 15,294,458
09:53 | 2,552,896
09:54 | 0
Appendix J: JSON output for Hadoop
The map-reduce word count job outputs its results in a key-value format. If you export that to Couchbase, the value part will be in binary format; we want the value part in JSON document format (new in Couchbase 2.0). Modify the map-reduce example to use a JSON library of your choice. http://hadoop.apache.org/docs/r0.19.1/mapred_tutorial.html is an example of exporting as key-value, and ibm.com/developerworks/opensource/library/ba-hadoop-couchbase/index.html is an example of exporting values as JSON documents.
Several Java/JSON libraries are available; two follow.
• http://code.google.com/p/google-gson/
• http://jackson.codehaus.org/
For more information
HP DL380p Gen8 server datasheet: http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA3-9615ENW
HP Insight Cluster Management Utility: hp.com/go/cmu
Configuring and Tuning HP ProLiant Servers for Low-Latency Applications: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf
Configuring and using DDR3 memory with HP ProLiant Gen8 Servers: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c03293145/c03293145.pdf
Couchbase Server Overview: couchbase.com/couchbase-server/overview
Couchbase Server 2.0 white paper: couchbase.com/sites/default/files/uploads/all/whitepapers/Introducing-Couchbase-Server-2_0.pdf
Couchbase 2.0 server manual: couchbase.com/docs/couchbase-manual-2.0/
Couchbase 2.0 developer’s guide: couchbase.com/docs/couchbase-devguide-2.0/
Couchbase Server technical overview: couchbase.com/sites/default/files/uploads/all/whitepapers/Couchbase-Server-Technical-Whitepaper.pdf
Couchbase important UI stats: couchbase.com/docs/couchbase-manual-2.0/couchbase-bestpractice-ongoing-ui.html
Couchbase Sizing guidelines: couchbase.com/docs/couchbase-manual-2.0/couchbase-bestpractice-sizing.html
YCSB source code: http://github.com/brianfrankcooper/YCSB
Additional YCSB published benchmark results, dated 2010-03-31: http://research.yahoo.com/files/ycsb-v4.pdf
To help us improve our documents, please provide feedback at hp.com/solutions/feedback.
Sign up for updates
hp.com/go/getupdated
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for
HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as
constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
Intel and Xeon are trademarks of Intel Corporation in the U.S. and other countries. Oracle and Java are registered trademarks of Oracle and/or its affiliates.
4AA4-6203ENW, June 2013, Rev. 1