Page 1: Gain more k-means clustering data analysis performance per ...

Gain more k-means clustering data analysis performance per dollar with 3rd Gen AMD EPYC 75F3 processor-powered Dell EMC PowerEdge R6525 servers

Compared to the same server powered by 2nd Gen AMD EPYC 7542 processors

Organizations using data analytics to inform their business strategies stand to benefit from a server solution that can quickly process compute-intensive analytics workloads. We put a Dell EMC™ PowerEdge™ R6525 server to the test in two otherwise-identical configurations: one with 3rd Gen AMD EPYC™ 75F3 processors, and one with 2nd Gen AMD EPYC 7542 processors.

In hands-on testing with Spark-Bench, the AMD EPYC 75F3 processor-powered Dell EMC PowerEdge R6525 server completed a k-means clustering workload targeting an Apache Spark database in 28.5 percent less time than the same server with AMD EPYC 7542 processors. We also found that the Dell EMC PowerEdge server with the 3rd Gen AMD EPYC processors not only processed up to 40 percent more data per hour, it also offered 24.5 percent more performance per dollar than the solution with the 2nd Gen AMD EPYC processors. By choosing Dell EMC PowerEdge R6525 servers with AMD EPYC 75F3 processors, organizations running k-means clustering workloads could reach business insights sooner while enjoying better analytics performance for every dollar they spend.

Analyze data in 28.5% less time*

Get 24.5% better performance for every dollar*†

Handle up to 40% more data per hour*

*vs. the same server with AMD EPYC 7542 processors

†Based on the total hardware cost with 3 years of Basic Next Business Day support

Gain more k-means clustering data analysis performance per dollar with 3rd Gen AMD EPYC 75F3 processor-powered Dell EMC PowerEdge R6525 servers April 2021

A Principled Technologies report: Hands-on testing. Real-world results.

Page 2: Gain more k-means clustering data analysis performance per ...

How we tested

In our data center, we tested the following configurations:

• A Dell EMC PowerEdge R6525 server powered by AMD EPYC 75F3 processors. As of April 5, 2021, the list price of this hardware plus Basic Next Business Day support (36 months) was $64,650.01.¹

• A Dell EMC PowerEdge R6525 server powered by AMD EPYC 7542 processors. As of April 5, 2021, the list price of this hardware plus Basic Next Business Day support (36 months) was $57,510.01.²

In each configuration, we used the same 1,024 GB of 3,200MHz RAM; two 480GB, 6Gbps SATA M.2 SSDs; and four 1.92TB PCIe® Gen4 NVMe™ SSDs.

During testing, we first ran a k-means clustering workload using Spark-Bench, targeting a Spark database on each server under test. We recorded the total time each configuration took to complete the workload, as reported by Spark. We then divided the dataset size (811GB) by that time to calculate the rate of megabytes per hour (MB/hour) that each solution processed. To calculate the performance-to-cost ratio, we divided the MB/hour throughput results by the price of hardware and support. For more details about our configurations, testing methodology, and cost calculations, see the science behind the report.
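The two calculations described above can be sketched as follows. This is a minimal illustration rather than the tooling used in the testing: the function names are hypothetical, the completion times are the rounded values from the report (2.0 and 2.8 hours), and the conversion of 1 GB to 1,000 MB is an assumption inferred from the published 405,500 MB/hour figure.

```python
def processing_rate_mb_per_hour(dataset_gb: float, hours: float) -> float:
    """Dataset size divided by completion time, in MB/hour (assumes 1 GB = 1,000 MB)."""
    return dataset_gb * 1000 / hours


def performance_per_dollar(rate_mb_per_hour: float, price_usd: float) -> float:
    """Processing rate divided by the list price of hardware plus support."""
    return rate_mb_per_hour / price_usd


DATASET_GB = 811

# Rounded completion times and list prices as published in the report.
configs = {
    "EPYC 75F3": {"hours": 2.0, "price_usd": 64650.01},
    "EPYC 7542": {"hours": 2.8, "price_usd": 57510.01},
}

results = {}
for name, cfg in configs.items():
    rate = processing_rate_mb_per_hour(DATASET_GB, cfg["hours"])
    ratio = performance_per_dollar(rate, cfg["price_usd"])
    results[name] = (rate, ratio)
    print(f"{name}: {rate:,.0f} MB/hour, {ratio:.3f} MB/hour per dollar")
```

Because the inputs are rounded, the last digit can differ slightly from the published table, which appears to truncate rather than round.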

About AMD EPYC 75F3 processors

These 32-core processors use AMD Infinity Architecture and are part of the AMD EPYC 7003 Series. The latest offering from AMD, 3rd Gen EPYC processors offer increased I/O with up to 32MB L3 cache per core, 7nm x86 hybrid die core, and new security features like Secure Encrypted Virtualization - Secure Nested Paging (SEV-SNP) and Encrypted State (SEV-ES).³ AMD positions the EPYC 75F3 model as being well suited for high-frequency use cases such as CAE/CFD/FEA, VM density, and VDI.⁴ Learn more at https://www.amd.com/en/processors/epyc-7003-series.


Page 3: Gain more k-means clustering data analysis performance per ...

Get insights sooner

We used a Spark-Bench k-means clustering workload to conduct our data analytics testing. K-means clustering is a machine learning algorithm that helps organizations identify patterns within datasets by sorting data into similar groups or clusters. A solution that can complete a k-means workload, and thus deliver data insights, in less time can enable your organization to act on these insights sooner, for example by adjusting your marketing strategy to target a new demographic. (See page 4 for more examples of real-world benefits.) Note: although our testing measured the analytics performance of just one Dell EMC PowerEdge server, organizations running big data analytics workloads typically use larger server clusters.
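For readers unfamiliar with the algorithm, the sketch below shows the basic idea of k-means (Lloyd's algorithm) in plain Python on a toy dataset. It is an illustration only: the benchmark runs k-means through Spark, not through this code, and the function name, the naive choice of the first k points as starting centroids, and the sample points are all assumptions made for the example.

```python
import math


def kmeans(points, k, max_iterations=4):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    centroids = [list(p) for p in points[:k]]  # naive initialization: first k points
    assignments = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Two visibly separate groups of 2-D points (made-up data).
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

The cost of each iteration grows with the number of points, dimensions, and clusters, which is why a dataset with hundreds of millions of rows makes this a compute-intensive workload.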

As Figure 1 illustrates, the 3rd Gen AMD EPYC 75F3 processor-powered Dell EMC PowerEdge R6525 completed the k-means workload in 2 hours—an improvement of 28.5 percent over the configuration with the AMD EPYC 7542 processors, which took 2 hours and 48 minutes to finish the workload.
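The percentage follows directly from the two completion times. A small worked calculation, using the rounded times quoted above (the helper name is hypothetical):

```python
def hours_from(hours: int, minutes: int) -> float:
    """Convert an hours-and-minutes duration to decimal hours."""
    return hours + minutes / 60


baseline = hours_from(2, 48)  # AMD EPYC 7542 configuration
upgraded = hours_from(2, 0)   # AMD EPYC 75F3 configuration

reduction_pct = (baseline - upgraded) / baseline * 100
print(f"{reduction_pct:.2f}% less time")
```

With these rounded inputs the reduction works out to roughly 28.57 percent, which the report states as 28.5 percent.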

About Dell EMC PowerEdge R6525 servers

According to Dell Technologies, these servers offer the following specifications:⁵

• Up to 64 high-performance 3rd Gen AMD EPYC cores

• Up to 32 DDR4 RDIMM/LRDIMM slots

• Support for PCIe Gen4 SSDs

• Integrated security features

• Embedded management tools

To learn more, visit https://www.dell.com/en-us/work/shop/povw/poweredge-R6525.

Figure 1: Time in hours to complete the Spark-Bench k-means workload on an 811GB dataset. Lower is better. Source: Principled Technologies.

Dell EMC PowerEdge R6525 server with 3rd Gen AMD EPYC 75F3 processors: 2.0 hours

Dell EMC PowerEdge R6525 server with 2nd Gen AMD EPYC 7542 processors: 2.8 hours

Page 4: Gain more k-means clustering data analysis performance per ...

More efficient data processing

To arrive at our processing rate (the rate of MB/hour each solution processed), we divided the size of the dataset (811GB) by the time each solution took to finish the k-means workload. The numbers below thus represent the rate of processing the dataset, rather than disk or networking throughput. Figure 2 shows that the AMD EPYC 75F3 processor-powered Dell EMC PowerEdge R6525 server processed 405,500 MB/hour, while the AMD EPYC 7542 processor-powered configuration handled 289,642 MB/hour, meaning the 3rd Gen AMD processor-powered Dell EMC PowerEdge server processed 40 percent more data per hour.
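Because the dataset size is the same for both configurations, the 40 percent higher rate and the 28.5 percent shorter run are two views of the same measurement. The sketch below shows the conversion between them; the function name is hypothetical and the times are the rounded values from the report.

```python
def throughput_gain_pct(time_reduction_pct: float) -> float:
    """Percentage more work per hour implied by a given percentage less time."""
    remaining_fraction = 1 - time_reduction_pct / 100
    return (1 / remaining_fraction - 1) * 100


time_reduction = (2.8 - 2.0) / 2.8 * 100  # about 28.57 percent
gain = throughput_gain_pct(time_reduction)
print(f"{gain:.1f}% more data per hour")
```

The same relationship holds in general: finishing in half the time, for instance, means processing twice as much data per hour.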

Figure 2: Average data processing rate (MB/hour) during the Spark-Bench k-means workload. Higher is better. Source: Principled Technologies.

Real-world benefits for financial services

Using Spark to run data analytics, financial institutions can quickly identify credit fraud, analyze regulatory filings, compile customer profiles, and target marketing, allowing them more time to make critical decisions, interface with clients, and keep operations running smoothly.⁶,⁷ Choosing Dell EMC PowerEdge R6525 servers with 3rd Gen AMD EPYC 75F3 processors could help analyze that data in 28.5 percent less time and at a 40 percent higher rate than the same servers with 2nd Gen AMD EPYC 7542 processors. And with quicker results, organizations could keep their clients’ assets secure and expand their customer base efficiently, all while getting 24.5 percent better performance for each dollar they’ve invested.

Data processing rate (MB/hour, higher is better)

Dell EMC PowerEdge R6525 server with 3rd Gen AMD EPYC 75F3 processors: 405,500

Dell EMC PowerEdge R6525 server with 2nd Gen AMD EPYC 7542 processors: 289,642


Page 5: Gain more k-means clustering data analysis performance per ...

Achieve better performance per dollar

To arrive at our performance-to-cost ratio, we first determined the list prices of hardware plus support for both configurations of the Dell EMC PowerEdge R6525 server, which we show on page 2. We then divided the processing rate we describe on page 4 by the system cost to arrive at a performance-to-cost ratio. We found that the Dell EMC PowerEdge R6525 with AMD EPYC 75F3 processors handled 6.27 MB per hour for each dollar of the list price, while the configuration with AMD EPYC 7542 processors offered a ratio of 5.03 MB per hour for each dollar (Figure 3). This means that for every dollar an organization spends on hardware and support for the Dell EMC PowerEdge R6525 server with 3rd Gen AMD EPYC 75F3 processors, they could get 24.5 percent higher application throughput than the same server with the 2nd Gen AMD EPYC processors.
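One way to read this result is that the throughput gain is partly offset by the higher list price. The following sketch, with a hypothetical function name and the published rates and prices as inputs, separates the two effects.

```python
def perf_per_dollar_gain_pct(rate_new, rate_old, price_new, price_old):
    """Relative gain in (rate / price), expressed as a percentage."""
    throughput_ratio = rate_new / rate_old
    price_ratio = price_new / price_old
    return (throughput_ratio / price_ratio - 1) * 100


# Published processing rates (MB/hour) and list prices (USD).
price_premium_pct = (64650.01 / 57510.01 - 1) * 100
gain_pct = perf_per_dollar_gain_pct(405500, 289642, 64650.01, 57510.01)

print(f"Price premium: {price_premium_pct:.1f}%")
print(f"Performance-per-dollar gain: {gain_pct:.1f}%")
```

In other words, a configuration that costs roughly 12 percent more but processes about 40 percent more data per hour comes out about 24.5 percent ahead per dollar.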

Figure 3: Ratio of k-means performance (as measured by the number of MB/hour each solution processed) to hardware and support cost per US dollar. Higher is better. Source: Principled Technologies.

Performance/cost ratio (MB/hour per dollar, higher is better)

Dell EMC PowerEdge R6525 server with 3rd Gen AMD EPYC 75F3 processors: 6.27

Dell EMC PowerEdge R6525 server with 2nd Gen AMD EPYC 7542 processors: 5.03


Page 6: Gain more k-means clustering data analysis performance per ...

Conclusion

Organizations that rely on data analytics workloads could benefit from new hardware that takes less time to run k-means machine learning algorithms while offering better performance per dollar. When we compared the k-means performance of a Dell EMC PowerEdge R6525 with 3rd Gen AMD EPYC 75F3 processors to that of the same server with AMD EPYC 7542 processors, we found that the solution with the 3rd Gen AMD EPYC processors completed a k-means clustering workload on the Spark-Bench benchmark in 28.5 percent less time. Based on our processing rate calculations, the 3rd Gen AMD EPYC processor-powered Dell EMC PowerEdge server handled up to 40 percent more data per hour. Factoring this performance into the hardware and support list price, we also found that this configuration offered 24.5 percent better performance per dollar.

These results indicate that organizations running k-means workloads could reach business insights sooner and get improved performance for every dollar they spend on the Dell EMC PowerEdge R6525 powered by 3rd Gen AMD EPYC 75F3 processors.

1 “PowerEdge R6525 Rack Server,” accessed April 5, 2021, https://www.dell.com/en-us/work/shop/cty/pdp/spd/poweredge-r6525/pe_r6525_13783_vi_vp.

2 “PowerEdge R6525 Rack Server.”

3 “AMD EPYC 7003 Series Processors,” accessed April 5, 2021, https://www.amd.com/en/processors/epyc-7003-series.

4 “AMD EPYC 75F3,” accessed April 5, 2021, https://www.amd.com/en/products/cpu/amd-epyc-75f3.

5 “R6525 Spec Sheet,” accessed April 7, 2021, https://i.dell.com/sites/csdocuments/Product_Docs/en/poweredge-r6525-spec-sheet.pdf.

6 Chris D’Agostino, “Credit Fraud Prevention with Spark and Graph Analysis.” Spark Summit, June 11, 2016. YouTube video, 18:59, accessed February 17, 2021, https://youtu.be/0VO-ts0dsbI.

7 Level Up Education, “How are Big Companies using Apache Spark,” accessed February 17, 2021, https://medium.com/@tao_66792/how-are-big-companies-using-apache-spark-413743dbbbae.

Principled Technologies is a registered trademark of Principled Technologies, Inc. All other product names are the trademarks of their respective owners. For additional information, review the science behind this report.

Principled Technologies®

Facts matter.®

This project was commissioned by Dell Technologies.

Read the science behind this report at http://facts.pt/MnKmAMQ


Page 7: Gain more k-means clustering data analysis performance per ...

Disclaimer:

The content on the following pages includes appendices and methodologies from our hands-on work.

We will publish this content as a separate document linked to the report.

We must receive your approval on both the report and this document before taking them public simultaneously.

Page 8: Gain more k-means clustering data analysis performance per ...

The science behind the report:

Gain more k-means clustering data analysis performance per dollar with 3rd Gen AMD EPYC 75F3 processor-powered Dell EMC PowerEdge R6525 servers

This document describes what we tested, how we tested, and what we found. To learn how these facts translate into real-world benefits, read the report Gain more k-means clustering data analysis performance per dollar with 3rd Gen AMD EPYC 75F3 processor-powered Dell EMC PowerEdge R6525 servers.

We concluded our hands-on testing on April 7, 2021. During testing, we determined the appropriate hardware and software configurations and applied updates as they became available. The results in this report reflect configurations that we finalized on March 31, 2021 or earlier. Unavoidably, these configurations may not represent the latest versions available when this report appears.

Our results

Table 1: The results of our testing.

Metric | AMD EPYC™ 7542 processor-powered Dell EMC™ PowerEdge™ R6525 | AMD EPYC 75F3 processor-powered Dell EMC PowerEdge R6525
Time to complete the Spark-Bench k-means workload on an 811GB dataset (hours) | 2.8 | 2.0
Percentage less time | - | 28.57%
Processing rate (MB/hour) | 289,642 | 405,500
Percentage higher rate | - | 40.00%
Hardware and support price (USD) | $57,510.01 | $64,650.01
Performance per dollar (MB/hour per dollar) | 5.036 | 6.272
Percentage higher performance per dollar | - | 24.53%


Page 9: Gain more k-means clustering data analysis performance per ...

CPU utilization charts

Figure 1: CPU utilization for the Dell EMC PowerEdge R6525 server with AMD EPYC 75F3 processors for the duration of the k-means clustering workload. Source: Principled Technologies.

Figure 2: CPU utilization for the Dell EMC PowerEdge R6525 server with AMD EPYC 7542 processors for the duration of the k-means clustering workload. Source: Principled Technologies.


Page 10: Gain more k-means clustering data analysis performance per ...

System configuration information

Table 2: Detailed information on the system we tested.

System configuration information | Dell EMC PowerEdge R6525
BIOS name and version | Dell 2.0.3
Non-default BIOS settings | N/A
Operating system name and version/build number | RHEL 8.3 (Kernel 4.18.0-240.15.1.el8_3.x86_64)
Date of last OS updates/patches applied | 4/2/21
Power management policy | Performance

Processor | 3rd Gen | 2nd Gen
Number of processors | 2 | 2
Vendor and model | AMD EPYC 75F3 | AMD EPYC 7542
Core count (per processor) | 32 | 32
Core frequency (GHz) | 2.95 | 2.90

Memory module(s)
Total memory in system (GB) | 1,024
Number of memory modules | 16
Vendor and model | Hynix HMAA8GR7AJR4N-XN
Size (GB) | 64
Type | PC4-3200
Speed (MHz) | 3,200
Speed running in the server (MHz) | 3,200

Storage controller 1
Vendor and model | Dell BOSS-S1 Adapter
Cache size | 0
Firmware version | 2.5.13.3024

Storage controller 2
Vendor and model | Dell PERC S150 Controller
Cache size | 0
Firmware version | 6.0.3-0005

Local storage (OS)
Number of drives | 2
Drive vendor and model | Intel® SSDSCKKB480G8R
Drive size (GB) | 480
Drive information | 6Gbps SATA M.2 SSD


Page 11: Gain more k-means clustering data analysis performance per ...

System configuration information | Dell EMC PowerEdge R6525

Local storage (data)
Number of drives | 4
Drive vendor and model | Samsung® MZ-WLJ1T90
Drive size (GB) | 1,920
Drive information | NVMe™ PCIe® Gen4 SSD

Network adapter
Vendor and model | Broadcom® BCM5720
Number and type of ports | 2x 1Gb Ethernet Adapter
Firmware version | 21.60.2

Cooling fans
Vendor and model | Foxconn PIH040M12P
Number of cooling fans | 6

Power supplies
Vendor and model | Dell L1400E-S0
Number of power supplies | 2
Wattage of each (W) | 1,400


Page 12: Gain more k-means clustering data analysis performance per ...

How we tested

We assessed which of two systems would take less time to complete a k-means algorithm from the Spark-Bench benchmark suite. The solutions we tested are as follows:

• Dell EMC PowerEdge R6525 powered by AMD EPYC 75F3 processors

• Dell EMC PowerEdge R6525 powered by AMD EPYC 7542 processors

We installed Red Hat® Enterprise Linux® 8.3 (RHEL 8.3) on each solution and ran a k-means clustering algorithm from the Spark-Bench benchmark suite on an 811GB dataset. We hosted this dataset on a Linux software RAID10 we built from four 1.92TB PCIe Gen4 NVMe drives.

Installing Spark on RHEL 8.3

We installed RHEL 8.3. During installation, we disabled kdump, enabled the Ethernet port, and changed the hostname to accommodate our environment.

1. After installing RHEL, use the subscription manager to register the operating system, update the software, and install mdadm and vim:

    subscription-manager register --username * --password * --auto-attach
    yum upgrade -y
    yum install mdadm vim -y

2. Disable the firewall, and disable SELinux:

    sudo systemctl stop firewalld
    sudo systemctl disable firewalld
    sudo setenforce 0
    #Edit the selinux config file
    vi /etc/selinux/config
    …
    SELINUX=disabled
    ...

3. Prepare each of the four drives you need for the software RAID. We used lsblk to determine which drives to include. Perform the following commands on each individual disk:

    parted
    #Select the target disk
    select /dev/nvme*n1
    #Clear and create a new partition table.
    mklabel gpt
    #Create new primary partition
    mkpart primary ext4 0 1.5T

4. Create the RAID10:

    #Create a RAID10 from the 4 target NVMe drives' partitions. List each of the target partitions for each NVMe
    mdadm --create /dev/md3 --level=10 --raid-devices=4 /dev/nvme*n1p1 /dev/nvme*n1p1 /dev/nvme*n1p1 /dev/nvme*n1p1
    #Define filesystem
    mkfs.ext4 /dev/md3
    #Mount the RAID
    mkdir /stor
    sudo mount /dev/md3 /stor
    #Add the disk to fstab so it mounts on reboot
    vim /etc/fstab
    /dev/md3 /stor ext4 defaults 0 2

5. Download the Java JDK, and install it:

yum install tar wget java-1.8.0-openjdk -y

6. Determine and set your JAVA home:

    export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-2.el8_3.x86_64/jre"
    #Edit the bash_profile
    vi ~/.bash_profile
    ...
    # User specific environment and startup programs
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-2.el8_3.x86_64/jre
    PATH=$PATH:$HOME/bin:$JAVA_HOME
    export PATH


Page 13: Gain more k-means clustering data analysis performance per ...

7. Download the Spark files:

    cd /home/
    wget http://www.gtlib.gatech.edu/pub/apache/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
    tar -xvf spark-2.4.7-bin-hadoop2.7.tgz

8. Navigate to the Spark directory, and start Spark:

    #Start the controller server
    ./sbin/start-master.sh
    #Verify that the server is running by navigating to http://[localhost]:8080
    #Start the worker server
    ./sbin/start-slave.sh spark://[local machine IP]:7077

9. Download and extract the Spark-Bench package from https://github.com/CODAIT/spark-bench. We downloaded spark-bench_2.3.0_0.4.0-RELEASE_99.tgz, and used SCP to copy it to our target server at /home/.

10. Set up Spark-Bench:

    #Unzip the file
    tar -xvzf spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
    #Create a symbolic link to the spark home directory
    ln -s /home/spark-2.4.7-bin-hadoop2.7 /opt/spark

11. To set up environment variables for Spark-Bench, add the following to the end of /root/.bashrc:

    vi /root/.bashrc
    export SPARK_HOME=/opt/spark
    export PATH=$SPARK_HOME/bin:$PATH

12. In the Spark-Bench folder, under examples, create the workload files KMeans_generator.conf and KMeans_run.conf. (We provide the text for these files at the end of this document.)

13. Start the test:

    cd spark-bench_2.3.0_0.4.0-RELEASE
    bin/spark-bench.sh examples/KMeans_generator.conf
    bin/spark-bench.sh examples/KMeans_run.conf
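As a sanity check on the storage layout from step 4, the sketch below estimates the usable capacity of the RAID10 volume and compares it with the dataset size. It is a rough estimate under stated assumptions: each partition is taken as 1.5 TB (from the mkpart command), the array is assumed to keep two copies of every block, and filesystem and metadata overhead are ignored. The function name is hypothetical.

```python
def raid10_usable_tb(member_count: int, member_size_tb: float, copies: int = 2) -> float:
    """Approximate usable capacity of a RAID10 array, ignoring metadata overhead."""
    return member_count * member_size_tb / copies


usable_tb = raid10_usable_tb(member_count=4, member_size_tb=1.5)
dataset_tb = 811 / 1000  # 811GB dataset, decimal units

print(f"Usable capacity: about {usable_tb:.1f} TB")
print(f"Dataset fits: {dataset_tb < usable_tb}")
```

Under these assumptions the volume leaves ample room for the generated dataset.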


Page 14: Gain more k-means clustering data analysis performance per ...

Workload files

Generating the dataset

We used the following configuration file to generate an 811GB dataset for the Spark-Bench k-means clustering workload. Note that:

• rows: The number of rows to generate for the dataset

• cols: The number of columns to generate for the dataset

• k: The number of clusters the workload generates

• scaling: The scaling factor of the dataset

• partitions: The number of partitions in the dataset

[sparkbench install]/examples/KMeans_generator.conf

spark-bench = {
  spark-home = "/opt/spark"
  spark-submit-config = [{
    spark-args = {
      master = "spark://hspark:7077"
    }
    workload-suites = [
      {
        descr = "KMean data generator"
        benchmark-output = "console"
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 450000000
            cols = 99
            output = "/stor/kmeans-data.csv"
            k = 2000
            scaling = 1.6
            partitions = 10
          }
        ]
      }
    ]
  }]
}
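To relate the generator settings to the 811GB figure, the rough calculation below derives the number of generated values and the average size per value. It assumes the reported 811GB is the decimal on-disk size of the CSV output; the variable names are illustrative only.

```python
rows = 450_000_000
cols = 99
dataset_bytes = 811 * 10**9  # 811GB, assumed decimal and on disk

values = rows * cols
bytes_per_value = dataset_bytes / values
bytes_per_row = dataset_bytes / rows

print(f"{values:,} values")
print(f"about {bytes_per_value:.1f} bytes per value, {bytes_per_row:,.0f} bytes per row")
```

An average of roughly 18 bytes per value is consistent with floating-point numbers written out as text with a separator, though the exact format depends on the generator.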


Page 15: Gain more k-means clustering data analysis performance per ...


DISCLAIMER OF WARRANTIES; LIMITATION OF LIABILITY:

Principled Technologies, Inc. has made reasonable efforts to ensure the accuracy and validity of its testing, however, Principled Technologies, Inc. specifically disclaims any warranty, expressed or implied, relating to the test results and analysis, their accuracy, completeness or quality, including any implied warranty of fitness for any particular purpose. All persons or entities relying on the results of any testing do so at their own risk, and agree that Principled Technologies, Inc., its employees and its subcontractors shall have no liability whatsoever from any claim of loss or damage on account of any alleged error or defect in any testing procedure or result.

In no event shall Principled Technologies, Inc. be liable for indirect, special, incidental, or consequential damages in connection with its testing, even if advised of the possibility of such damages. In no event shall Principled Technologies, Inc.’s liability, including for direct damages, exceed the amounts paid in connection with Principled Technologies, Inc.’s testing. Customer’s sole and exclusive remedies are as set forth herein.


Running the k-means clustering workload

We used the following configuration file to run the Spark-Bench k-means workload. Note that:

• The number of executors is based on the processor’s core count. We used 31 executors.

• We assigned a total of 992 GB of memory to the executors for both servers.

• The exec_mem is 992 divided by the number of executors. We used 32GB.

[sparkbench install]/examples/KMeans_run.conf

spark-bench = {
  spark-home = "/opt/spark"
  spark-submit-config = [{
    spark-args = {
      master = "spark://hspark:7077"
      num-executors = 31
      executor-cores = 4
      executor-memory = 32g
    }
    workload-suites = [
      {
        descr = "KMean data generator"
        benchmark-output = "console"
        workloads = [
          {
            name = "kmeans"
            input = "/stor/kmeans-data.csv"
            rows = 450000000
            cols = 99
            scaling = 1.6
            partitions = 10
            output = "/home/kmeans/results/results.csv"
            k = 1200
            maxiterations = 4
          }
        ]
      }
    ]
  }]
}
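The executor settings above can be checked with a little arithmetic. The sketch below recomputes the per-executor memory and the total number of executor cores; the comparison against 128 hardware threads assumes simultaneous multithreading is enabled on both 32-core processors, which the configuration table does not state.

```python
num_executors = 31
executor_cores = 4
total_executor_memory_gb = 992
system_memory_gb = 1024

# Memory per executor, as described in the notes above the configuration file.
executor_memory_gb = total_executor_memory_gb // num_executors
# Cores requested across all executors.
total_executor_cores = num_executors * executor_cores
# Memory left for the operating system and Spark daemons.
memory_headroom_gb = system_memory_gb - total_executor_memory_gb

# Assumption: 2 sockets x 32 cores x 2 threads per core.
hardware_threads = 2 * 32 * 2

print(f"{executor_memory_gb} GB per executor")
print(f"{total_executor_cores} executor cores of {hardware_threads} assumed hardware threads")
print(f"{memory_headroom_gb} GB of memory left outside the executors")
```

Leaving a few threads and some memory unallocated is a common way to keep room for the operating system and the Spark master and worker processes.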

Read the report at http://facts.pt/RRQ3nvZ
