Low-Level CPU Performance Profiling Examples




Tanel Põder, a long-time computer performance geek

@tanelpoder | blog.tanelpoder.com


Intro: About me

• Tanel Põder
• RDBMS performance geek, 20+ years (Oracle)
• Unix/Linux performance geek
• Hadoop performance geek
• Spark performance geek?

• http://blog.tanelpoder.com
• @tanelpoder

Expert Oracle Exadata book


Gluent as a data virtualization layer

[Diagram: Gluent as a data sharing platform for enterprise applications, connecting App X, App Y and App Z with Oracle, Teradata, MSSQL, NoSQL and Big Data sources]


Some microscopic-level stuff to talk about…

1. Some things worth knowing about modern CPUs
2. Measuring internal CPU efficiency (C++)
3. A columnar database scanning example (Oracle)
4. Low-level analysis of Spark performance

• RDD vs DataFrame
• DataFrame with bad code

This is gonna be a (hopefully fun) hacking session!


”100%” busy?

A CPU close to 100% busy?

What if I told you your CPU is not that busy?


CPU Performance Counters on Linux

# perf stat -d -p PID sleep 30

Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn
     7,386,220,210 branches                  #  269.828 M/sec
        22,056,397 branch-misses             #    0.30% of all branches
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle
    58,627,393,395 stalled-cycles-backend    #   67.83% backend cycles idle
       256,440,384 cache-references          #    9.368 M/sec
       222,036,981 cache-misses              #   86.584 % of all cache refs
       234,361,189 LLC-loads                 #    8.562 M/sec
       218,570,294 LLC-load-misses           #   93.26% of all LL-cache hits
        18,493,582 LLC-stores                #    0.676 M/sec
         3,233,231 LLC-store-misses          #    0.118 M/sec
     7,324,946,042 L1-dcache-loads           #  267.589 M/sec
       305,276,341 L1-dcache-load-misses     #    4.17% of all L1-dcache hits
        36,890,302 L1-dcache-prefetches      #    1.348 M/sec

      30.000601214 seconds time elapsed

Measure what's going on inside a CPU!

Metrics explained in my blog entry: http://bit.ly/1PBIlde


Modern CPUs can run multiple operations concurrently

http://software.intel.com

Multiple ports/execution units for computation & memory ops

If waiting for RAM – CPU pipeline stall!


Latency Numbers Every Programmer Should Know

Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                              5 ns
L2 cache reference                             7 ns                      14x L1 cache
Mutex lock/unlock                             25 ns
Main memory reference                        100 ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy               3,000 ns        3 us
Send 1K bytes over 1 Gbps network         10,000 ns       10 us
Read 4K randomly from SSD*               150,000 ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory       250,000 ns      250 us
Round trip within same datacenter        500,000 ns      500 us
Read 1 MB sequentially from SSD*       1,000,000 ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                             10,000,000 ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk      20,000,000 ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA      150,000,000 ns  150,000 us  150 ms

Source: https://gist.github.com/jboner/2841832


CPU = fast

CPU L2 / L3 cache in between

RAM = slow 


Tape is dead, disk is tape, flash is disk, RAM locality is king

Jim Gray, 2006

http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt


Just caching all your data in RAM does not give you a modern “in-memory” system!

• Columnar data structures to the rescue!


Row-Major Data Structures

SELECT SUM(column) FROM array


Variable field offsets; memory (cache) line size = 64 bytes


Columnar Data Structure (conceptual)

Store values of a column next to each other (data locality)

Much less data to scan (or filter) if accessing a subset of columns

Better compression due to adjacent repeating (or slightly differing) values
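To make the locality argument concrete, here is a minimal Scala sketch (my illustration, not from the talk) of summing one column in a row-major versus a columnar layout; on the JVM the row-major case is even worse than a C-style struct array, because each record is a separate heap object:

// Contrast a row-major layout (array of records) with a columnar layout
// (one primitive array per column). Names like SalesRow are made up.
object ColumnarLayoutDemo {

  // Row-major: every field of a record sits next to its neighbours, so summing
  // one column drags all the other fields (and object headers) through the caches.
  final case class SalesRow(orderId: Long, warehouseId: Int, orderTotal: Double)
  def sumRowMajor(rows: Array[SalesRow]): Double = {
    var sum = 0.0
    var i = 0
    while (i < rows.length) { sum += rows(i).orderTotal; i += 1 }
    sum
  }

  // Columnar: values of one column are adjacent in a packed primitive array,
  // so the scan only touches the cache lines it actually needs, sequentially.
  final case class SalesColumns(orderId: Array[Long], warehouseId: Array[Int], orderTotal: Array[Double])
  def sumColumnar(cols: SalesColumns): Double = {
    var sum = 0.0
    var i = 0
    val totals = cols.orderTotal
    while (i < totals.length) { sum += totals(i); i += 1 }
    sum
  }
}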


Single-Instruction-Multiple-Data (SIMD) processing

• Run an operation (like ADD) on multiple registers/memory locations in a single instruction:

Do the same work with fewer (but more complex) instructions

More concurrency inside the CPU

… if the underlying data structures “feed” data fast enough
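As a hedged illustration of SIMD (this uses the JDK Vector API, an incubating module from JDK 16 onwards, so it is not something from the original talk), an explicit vectorized sum over a packed Array[Int] looks roughly like this:

// Illustration only: run with --add-modules jdk.incubator.vector on JDK 16+.
import jdk.incubator.vector.{IntVector, VectorOperators, VectorSpecies}

object SimdSumDemo {
  private val Species: VectorSpecies[Integer] = IntVector.SPECIES_PREFERRED

  // Sums the array in chunks of Species.length() lanes: one vector ADD processes
  // several ints at a time, provided the values are densely packed in a primitive array.
  def simdSum(a: Array[Int]): Int = {
    var i = 0
    var acc = IntVector.zero(Species)
    val upper = Species.loopBound(a.length)
    while (i < upper) {
      acc = acc.add(IntVector.fromArray(Species, a, i))
      i += Species.length()
    }
    var sum = acc.reduceLanes(VectorOperators.ADD)
    while (i < a.length) { sum += a(i); i += 1 } // scalar tail for the leftover elements
    sum
  }
}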


A database example (Oracle)


A simple Data Retrieval test!

• Retrieve 1% of rows out of an 8 GB table:

SELECT COUNT(*), SUM(order_total)
FROM orders
WHERE warehouse_id BETWEEN 500 AND 510

The warehouse IDs range between 1 and 999

Test data generated by the SwingBench tool


Data Retrieval: Test Results

• Remember, this is a very simple scanning + filtering query:

TESTNAME                   PLAN_HASH   ELA_MS   CPU_MS      LIOS  BLK_READ
------------------------- ---------- -------- -------- --------- ---------
test1: index range scan *   16715356   265203    37438    782858    511231
test2: full buffered */ C  630573765   132075    48944   1013913    849316
test3: full direct path *  630573765    15567    11808   1013873   1013850
test4: full smart scan */  630573765     2102      729   1013873   1013850
test5: full inmemory scan  630573765      155      155        14         0
test6: full buffer cache   630573765     7850     7831   1014741         0

Test 5 & Test 6 run entirely from memory

Source: http://www.slideshare.net/tanelp/oracle-database-inmemory-option-in-action

But why 50x difference in CPU usage?


CPU & cache friendly data structures are key!

[Diagram: Oracle data block layout (8 kB block): block header and ITL entries, a row directory of row offsets (#0, #1, #2, …), and each row stored as a header byte, lock byte and column-count byte followed by repeating (column length, column data) pairs]

• OLTP: Block -> Row -> Column format
• 8 kB blocks
• Great for writes, changes
• Field-length encoding
• Reading column #100 requires walking through all preceding columns (see the sketch below)
• Columns (with similar values) not densely packed together
• Not CPU cache friendly for analytics!
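As a rough illustration of that cost, here is a conceptual Scala sketch (not Oracle's real row format, which also has flag/lock bytes and long-column markers) of walking length-prefixed columns to reach column N:

// A row modeled as: for each column, a 1-byte length followed by that many data bytes.
// Reaching column N means hopping over the lengths of all preceding columns.
object LengthEncodedRowDemo {
  def columnOffset(row: Array[Byte], targetCol: Int): Int = {
    var offset = 0
    var col = 0
    while (col < targetCol) {
      val len = row(offset) & 0xff   // column length byte
      offset += 1 + len              // skip length byte + column data
      col += 1
    }
    offset // position of the length byte of the target column
  }
}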


Scanning columnar data structures

Scanning a column in a row-oriented data block vs. scanning a column in a column-oriented compression unit:

[Diagram: in the row-oriented block, the values of one column are scattered across interleaved rows (col 1 … col 6 repeating); in the column-oriented compression unit, each column's values are stored adjacently]

Read filter column(s) first. Access only projected columns if matches found.

Reduced memory traffic. More sequential RAM access, SIMD on adjacent data.


Testing data access path differences on Oracle 12c

SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0

Run the same query on the same dataset stored in different formats/layouts.

Full details: http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles/

Test result data: http://bit.ly/1RitNMr


CPU instructions used for scanning/counting 69M rows


Average CPU instructions per row processed

• Knowing that the table has about 69M rows, I can calculate the average number of instructions issued per row processed
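The arithmetic itself is simple (the actual instruction counts are in the linked test results and charts, not reproduced here):

  instructions per row ≈ total instructions for the scan / 69,000,000

For example, a purely hypothetical scan issuing 6.9 billion instructions would average about 100 instructions per row processed.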


CPU cycles consumed (full scans only)


CPU efficiency (Instructions-per-Cycle)

Yes, modern superscalar CPUs can execute multiple instructions per cycle


Reducing memory writes within SQL execution

• Old approach:
  1. Read compressed data chunk
  2. Decompress data (write data to temporary memory location)
  3. Filter out non-matching rows
  4. Return data

• New approach (a conceptual sketch follows below):
  1. Read and filter compressed columns
  2. Decompress only required columns of matching rows
  3. Return data
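To make the difference concrete, here is a conceptual Scala sketch of the idea (dictionary encoding stands in for whatever compression the database really uses; this is not the actual Oracle implementation):

// "Filter first, decompress later" on a dictionary-encoded column. Names are made up.
object LateMaterializationDemo {
  final case class EncodedColumn(dictionary: Array[String], codes: Array[Int])

  // Old approach: decode every value, then filter the decoded strings.
  def filterAfterDecode(col: EncodedColumn, predicate: String => Boolean): Seq[String] =
    col.codes.map(c => col.dictionary(c)).filter(predicate).toSeq

  // New approach: evaluate the predicate once per dictionary entry, filter on the
  // small integer codes, and decode only the matching rows.
  def filterOnCodes(col: EncodedColumn, predicate: String => Boolean): Seq[String] = {
    val matching = col.dictionary.indices.filter(i => predicate(col.dictionary(i))).toSet
    col.codes.iterator.filter(matching.contains).map(c => col.dictionary(c)).toSeq
  }
}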


Memory reads & writes during internal processing

[Chart, unit = MB: memory reads and writes per access path. Annotations: read only the requested columns; rows counted from chunk headers; scanning compressed data causes few memory writes]


Spark Examples

• Will use:
  • Spark built-in tools
  • perf
  • Honest Profiler
  • Flame Graphs


Apache Spark Tungsten Data Structures

Much denser data structure

“Good memory locality”


Spark test setup (RDD)

Data flow: CSV -> RDD (partitioned) -> RDD (single partition) -> “for each” sum of column X

import scala.util.Try  // needed for Try(...) below; yearIndex and log() are defined elsewhere in the demo

val lines = sc.textFile("/tmp/simple_data.csv").repartition(1)

val stringFields = lines.map(line => line.split(","))
val fullFieldLength = stringFields.first.length
val completeFields = stringFields.filter(fields => fields.length == fullFieldLength)

// convert the "year" field to Int (defaulting to 0 when the value does not parse)
val data = completeFields.map(fields =>
  fields.patch(yearIndex, Array(Try(fields(yearIndex).toInt).getOrElse(0)), 1))

log("cache entire RDD in memory")
data.cache()

log("run map(length).max to populate cache")
println(data.map(r => r.length).reduce((l1, l2) => Math.max(l1, l2)))


I wanted to simplify this test as much as possible


“SELECT” sum(Year) from RDD

// SUM all values of the "year" column
println(data.map(d => d(yearIndex).asInstanceOf[Int]).reduce((y1, y2) => y1 + y2))

Cached RDD ~1M records, ~40 columns

1-column sum: 0.349 seconds!

17/01/19 18:43:36 INFO DAGScheduler: ResultStage 123 (reduce at demo.scala:89) finished in 0.349 s
17/01/19 18:43:36 INFO DAGScheduler: Job 61 finished: reduce at demo.scala:89, took 0.353754 s


Spark test setup (DataFrame)

Data flow: CSV -> RDD (partitioned) -> RDD (single partition) -> DataFrame -> “for each” sum of column X

// same CSV parsing as in the RDD test (yearIndex, log() and schema are defined elsewhere)
val lines = sc.textFile("/tmp/simple_data.csv").repartition(1)

val stringFields = lines.map(line => line.split(","))
val fullFieldLength = stringFields.first.length
val completeFields = stringFields.filter(fields => fields.length == fullFieldLength)

val data = completeFields.map(fields =>
  fields.patch(yearIndex, Array(Try(fields(yearIndex).toInt).getOrElse(0)), 1))

...

// ss is the SparkSession; Row comes from org.apache.spark.sql.Row
val dataFrame = ss.createDataFrame(data.map(d => Row(d: _*)), schema)

log("cache entire data-frame in memory")
dataFrame.cache()

log("run map(length).max to populate cache")
println(dataFrame.map(r => r.length).reduce((l1, l2) => Math.max(l1, l2)))



“SELECT” sum(Year) from DataFrame (silly example!)

// SUM all values of the "year" column
println(dataFrame.map(r => r(yearIndex).asInstanceOf[Int]).reduce((y1, y2) => y1 + y2))

17/01/19 19:39:25 INFO DAGScheduler: ResultStage 29 (reduce at demo.scala:71) finished in 4.664 s
17/01/19 19:39:25 INFO DAGScheduler: Job 14 finished: reduce at demo.scala:71, took 4.673204 s

Cached DataFrame: ~1M records, ~40 columns

1-column SUM: 4.67 seconds!  (13x more than RDD?)

This does not make sense!


“SELECT” sum(Year) from DataFrame (proper)

// SUM all values of the "year" column
println(dataFrame.agg(sum("Year")).first.get(0))

17/01/19 19:32:02 INFO DAGScheduler: ResultStage 118 (first at demo.scala:70) finished in 0.004 s
17/01/19 19:32:02 INFO DAGScheduler: Job 40 finished: first at demo.scala:70, took 0.041698 s

Cached DataFrame ~1M records, ~40 columns

1-column sum with aggregation pushdown: 0.041 seconds! 

(Over 100x faster than the previous silly DataFrame example and 8.5x faster than the first RDD example)
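Not something shown in the slides, but one way to see why the pushed-down aggregation is so much faster is to compare the physical plans Spark produces; explain() is the standard Dataset/DataFrame method for that (yearIndex and the sum import are the same as in the demo code above):

// Aggregation pushdown: the plan typically shows a whole-stage code generated hash
// aggregate running directly over the cached in-memory (columnar) relation.
dataFrame.agg(sum("Year")).explain()

// The "silly" version: the plan typically shows every cached row being deserialized back
// into objects (DeserializeToObject / MapElements) before the lambda can touch the column.
dataFrame.map(r => r(yearIndex).asInstanceOf[Int]).explain()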


Summary

• New data structures are required for CPU efficiency!
  • Columnar …

• On efficient data structures, efficient code becomes possible
  • Bad code still performs badly …

• It is possible to measure the CPU efficiency of your code
  • That should come after the usual profiling and DAG / execution plan validation

• All secondary metrics (like efficiency ratios) should be used in context of how much work got done


Past & Future


Future-proof Open Data Formats!

• Disk-optimized columnar data structures
  • Apache Parquet: https://parquet.apache.org/
  • Apache ORC: https://orc.apache.org/

• Memory / CPU-cache optimized data structures
  • Apache Arrow
    • Not only storage format
    • … also a cross-system/cross-platform IPC communication framework
    • https://arrow.apache.org/
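As a quick, hedged illustration (standard Spark DataFrameReader/DataFrameWriter API, not from the slides; the path is made up), persisting the demo DataFrame as Parquet and reading it back looks roughly like this:

// Write the demo DataFrame out as Parquet and read it back.
// "ss" is the SparkSession used earlier; /tmp/simple_data_parquet is a made-up path.
dataFrame.write.mode("overwrite").parquet("/tmp/simple_data_parquet")

val parquetDF = ss.read.parquet("/tmp/simple_data_parquet")
println(parquetDF.agg(sum("Year")).first.get(0))  // same aggregation as before, now over Parquet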


Future

1. RAM gets cheaper + bigger, not necessarily faster

2. CPU caches get larger

3. RAM blends with storage and becomes non-volatile

4. IO subsystems (flash) get even closer to CPUs

5. IO latencies shrink

6. The latency difference between non-volatile storage and volatile RAM shrinks - new database layouts!

7. CPU cache is king – new data structures needed!


The tools used here:

• Honest Profiler by Richard Warburton (@RichardWarburto)
  • https://github.com/RichardWarburton/honest-profiler

• Flame Graphs by Brendan Gregg (@brendangregg)
  • http://www.brendangregg.com/flamegraphs.html

• Linux perf tool
  • https://perf.wiki.kernel.org/index.php/Main_Page

• Spark-Prof demos:
  • https://github.com/gluent/spark-prof


Thanks!

http://gluent.com/

We are hiring developers & data engineers!!!

http://blog.tanelpoder.com
@tanelpoder