Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels J. Laukemann, J. Hammer, G. Hager, G. Wellein


Overview

1. Analytic Performance Modeling
   Why? Assumptions & Related Tools

2. Throughput & Latency - Nomenclature
   Definition of Throughput, Critical Path and Loop-Carried Dependency

3. OSACA: Automating the in-core model construction
   Overview, Structure and Output

4. Gauss-Seidel Method Example

5. Future Work

18.11.2019 | PMBS19 | OSACA | Jan Laukemann

Performance Modeling for Loop Kernels

• How fast can my kernel run at best?
• What are the relevant hardware bottlenecks?
• Apply simplified model of underlying hardware
  • In-core execution
  • Data transfer
  • Combining execution and data transfer

ECM Model
Roofline Model


Assumptions & Related Tools

1. All Data in L1
2. Average distribution of port scheduling

                            OSACA v0.2 [1]   OSACA v0.3   IACA [2] (EoL)   LLVM-MCA [3]
Throughput                  ✔                ✔            ✔                ✔
Critical Path               ✘                ✔            ✘                😕
Loop-Carried Dependencies   ✘                ✔            ✘                😕

[1] Presented at PMBS18
[2] Intel Architecture Code Analyzer (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
[3] LLVM Machine Code Analyzer (https://llvm.org/docs/CommandGuide/llvm-mca.html)

Throughput & Latency

TP: Throughput, CP: Critical Path

• Dependencies within loop
• No loop-carried dependencies

[Figure: the instructions ADD, MUL, SUB, DIV drawn along a time axis t]

TP: 1 cy
CP: 3 cy
1 cy/it

Throughput & Latency

TP: Throughput, CP: Critical Path, LCD: Loop-Carried Dependency

• Dependencies within loop
• Loop-carried dependencies

[Figure: timeline along a time axis t]

TP: 1 cy
CP: 3 cy
LCD: 3 cy
3 cy/it

Throughput & Latency

TP: Throughput, CP: Critical Path, LCD: Loop-Carried Dependency

[Figure: timeline along a time axis t]

TP: 1 cy
LCD: 3 cy
CP: 5 cy
3 cy/it
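
The three Throughput & Latency cases are consistent with a simple reading: in steady state the loop runs at the larger of the throughput bound and the loop-carried dependency length, while the critical path on its own does not limit the rate of successive iterations. The Python sketch below illustrates this under that assumption; the function name `expected_cycles_per_iteration` is hypothetical and not part of OSACA.

```python
def expected_cycles_per_iteration(tp, lcd=0.0):
    """Steady-state estimate (assumption): the larger of the throughput
    bound (TP) and the loop-carried dependency length (LCD), in cycles."""
    return max(tp, lcd)


# (description, TP, CP, LCD) as read from the three slide scenarios.
scenarios = [
    ("no loop-carried dependency", 1.0, 3.0, 0.0),
    ("loop-carried dependency", 1.0, 3.0, 3.0),
    ("LCD shorter than CP", 1.0, 5.0, 3.0),
]

for name, tp, cp, lcd in scenarios:
    estimate = expected_cycles_per_iteration(tp, lcd)
    print(f"{name}: TP={tp} CP={cp} LCD={lcd} -> {estimate} cy/it")
```

Note that CP is carried along only for reference: in the third case it is 5 cy, yet the slide reports 3 cy/it, matching the LCD.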

OSACA Workflow: Input – Database

Marked Assembly

x86

    movl $111,%ebx      # START MARKER
    .byte 100,103,144   # START MARKER
.L22:
    vmovapd 0(%r13,%rax),%ymm0
    vfmadd213pd (%r14,%rax),%ymm1,%ymm0
    vmovapd %ymm0,(%r12,%rax)
    addq $32,%rax
    cmpq %rax,%r15
    jne .L22
    movl $222,%ebx      # END MARKER
    .byte 100,103,144   # END MARKER

arm

    mov x1,#111         //START
    .byte 213,3,32,31   //START
.L18:
    ldr q2, [x20, x0]
    ldr q1, [x21, x0]
    fmla v1.2d, v2.2d, v0.2d
    str q1, [x19, x0]
    add x0, x0, #16
    cmp x22, x0
    bne .L18
    mov x1,#222         //END
    .byte 213,3,32,31   //END

Machine Files / Databases

Intel Cascade Lake

load_latency: {gpr: 4, xmm: 4, ymm: 4, zmm: 4}
load_throughput: {port_pressure: [0,0,0,0.5 ... ,0]}
- name: vfmadd213pd
  operands:
  - class: "register"
    name: "ymm"
    source: true
    destination: false
  - class: "register"
    name: "ymm"
    source: true
    destination: false
  - class: "register"
    name: "ymm"
    source: true
    destination: true
  throughput: 0.5
  latency: 4
  #               0   DV 1   2   D   3   D   4 5 6 7
  port_pressure: [0.5,0,0.5,0.5,0.5,0.5,0.5,0,0,0,0]
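
The marked-assembly input can be reduced to the kernel that sits between the start and end markers. The following is a minimal Python sketch, assuming the markers are a `mov` of the immediate 111 (start) or 222 (end) followed by a `.byte` line, as shown on the slide; `is_marker` and `extract_kernel` are hypothetical helpers for illustration, not OSACA's actual parser.

```python
def is_marker(line, marker_id):
    """True if the line is a 'mov' loading the marker immediate
    ($111/$222 in AT&T x86 syntax, #111/#222 in AArch64 syntax)."""
    tokens = line.replace(",", " ").split()
    if not tokens or not tokens[0].startswith("mov"):
        return False
    return "$" + marker_id in tokens or "#" + marker_id in tokens


def extract_kernel(lines):
    """Return the stripped lines between the start and end markers,
    skipping the '.byte' line that belongs to the start marker."""
    kernel = []
    inside = False
    for line in lines:
        if is_marker(line, "111"):
            inside = True
            continue
        if is_marker(line, "222"):
            break
        if inside and not line.strip().startswith(".byte"):
            kernel.append(line.strip())
    return kernel


x86_lines = [
    "movl $111,%ebx # START MARKER",
    ".byte 100,103,144 # START MARKER",
    ".L22:",
    "vmovapd 0(%r13,%rax),%ymm0",
    "vfmadd213pd (%r14,%rax),%ymm1,%ymm0",
    "vmovapd %ymm0,(%r12,%rax)",
    "addq $32,%rax",
    "cmpq %rax,%r15",
    "jne .L22",
    "movl $222,%ebx # END MARKER",
    ".byte 100,103,144 # END MARKER",
]

x86_kernel = extract_kernel(x86_lines)
for instruction in x86_kernel:
    print(instruction)
```

The same helper accepts the AArch64 listing, because the marker test only looks at the mnemonic prefix and the immediate token.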

OSACA Workflow: Output

Combined Analysis Report
------------------------
                                Port pressure in cycles
    | 0 - 0DV   | 1    | 2 - 2D    | 3 - 3D    | 4    | 5    | 6    | 7    || CP   | LCD |
-------------------------------------------------------------------------------------------------
179 |           |      |           |           |      |      |      |      ||      |     | .L22:
180 |           |      | 0.50 0.50 | 0.50 0.50 |      |      |      |      ||  4.0 |     | vmovapd 0(%r13,%rax), %ymm0
181 | 0.50      | 0.50 | 0.50 0.50 | 0.50 0.50 |      |      |      |      ||  4.0 |     | vfmadd213pd (%r14,%rax),%ymm1,%ymm0
182 |           |      | 0.50      | 0.50      | 1.00 |      |      |      ||  5.0 |     | vmovapd %ymm0, (%r12,%rax)
183 | 0.25      | 0.25 |           |           |      | 0.25 | 0.25 |      ||      | 1.0 | addq $32, %rax
184 | 0.25      | 0.25 |           |           |      | 0.25 | 0.25 |      ||      |     | cmpq %rax, %r15
185 |           |      |           |           |      |      |      |      ||      |     | * jne .L22

      1.00        1.00   1.50 1.00   1.50 1.00   1.00   0.50   0.50           13.0   1.0

Loop-Carried Dependencies Analysis Report
-----------------------------------------
183 | 1.0 | addq $32, %rax | [183]

[Figure: dependency graph with the nodes 179: label, 180: vmovapd, 181: LOAD, 181: vfmadd213pd, 182: vmovapd, 183: addq, 184: cmpq, 185: jne; edge labels 4, 4, 4 and 1]
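
The totals row of the x86 report can be reproduced by summing the per-instruction port pressure column by column. The Python sketch below does this with the values copied from the report; treating the largest column sum as the throughput bound is an assumption made here for illustration, and `port_totals` is a hypothetical helper, not an OSACA function.

```python
PORTS = ["0", "0DV", "1", "2", "2D", "3", "3D", "4", "5", "6", "7"]

# Per-instruction port pressure in cycles, copied from the report above.
kernel = [
    ("vmovapd 0(%r13,%rax), %ymm0", {"2": 0.5, "2D": 0.5, "3": 0.5, "3D": 0.5}),
    ("vfmadd213pd (%r14,%rax),%ymm1,%ymm0",
     {"0": 0.5, "1": 0.5, "2": 0.5, "2D": 0.5, "3": 0.5, "3D": 0.5}),
    ("vmovapd %ymm0, (%r12,%rax)", {"2": 0.5, "3": 0.5, "4": 1.0}),
    ("addq $32, %rax", {"0": 0.25, "1": 0.25, "5": 0.25, "6": 0.25}),
    ("cmpq %rax, %r15", {"0": 0.25, "1": 0.25, "5": 0.25, "6": 0.25}),
    ("jne .L22", {}),
]


def port_totals(instructions):
    """Sum the pressure each port receives over all instructions."""
    totals = {port: 0.0 for port in PORTS}
    for _, pressure in instructions:
        for port, cycles in pressure.items():
            totals[port] += cycles
    return totals


totals = port_totals(kernel)
bottleneck = max(totals.values())  # assumed throughput bound in cycles

for port in PORTS:
    print(f"port {port:>3}: {totals[port]:.2f}")
print(f"largest port total: {bottleneck:.2f} cy")
```

With these inputs the busiest ports are 2 and 3, which carry the address generation for both loads and the store.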

Gauss-Seidel Method Example

• Limited by loop-carried dependency
• Create code with -Ofast, -funroll-loops (+ architecture-specific flags)
• Analyze for Intel Cascade Lake, AMD Zen and Marvell ThunderX2

do it=1, itmax
  do k=1, kmax-1
    do i=1, imax-1
      phi(i,k,t0) = 0.25 * ( &
        phi(i,k-1,t0) + phi(i+1,k,t0) + &
        phi(i,k+1,t0) + phi(i-1,k,t0))
    enddo
  enddo
enddo
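
To make the loop-carried dependency in this kernel visible, here is a minimal Python sketch of one in-place sweep. It assumes a grid stored as a list of rows indexed `phi[k][i]` with fixed boundary values; the function name `gauss_seidel_sweep` and the tiny test grid are illustrative choices, not taken from the slides.

```python
def gauss_seidel_sweep(phi):
    """Update all interior points in place, in the same order as the
    Fortran loops (k outer, i inner)."""
    kmax = len(phi) - 1
    imax = len(phi[0]) - 1
    for k in range(1, kmax):
        for i in range(1, imax):
            # phi[k][i - 1] was written in the previous inner iteration,
            # so each update has to wait for the one before it.
            phi[k][i] = 0.25 * (
                phi[k - 1][i] + phi[k][i + 1] + phi[k + 1][i] + phi[k][i - 1]
            )
    return phi


# 3 x 4 grid: boundary values 1.0, two interior points starting at 0.0.
grid = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
gauss_seidel_sweep(grid)
print(grid[1])
```

The second interior point already uses the updated value of the first one, which is the dependency that serializes consecutive iterations of the inner loop.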

Gauss-Seidel Method Example

    mov x1, #111            // START MARKER
    .byte 213,3,32,31       // START MARKER
.L20:
    ldr d31, [x15, x18, lsl 3]
    ldr d0, [x15, 8]
    mov x14, x15
    add x16, x15, 24
    ldr d2, [x15, x30, lsl 3]
    add x15, x15, 32
    fadd d1, d31, d0
    fadd d3, d1, d30
    fadd d4, d3, d2
    fmul d5, d4, d9
    str d5, [x14], 8
    ldr d6, [x14, x18, lsl 3]
    ldr d16, [x14, 8]
    add x13, x14, 8
    ldr d7, [x14, x30, lsl 3]
    fadd d17, d6, d16
    fadd d18, d17, d5
    fadd d19, d18, d7
    fmul d20, d19, d9
    str d20, [x15, -24]
    ldr d21, [x13, x18, lsl 3]
    ldr d23, [x14, 16]
    ldr d22, [x13, x30, lsl 3]
    fadd d24, d21, d23
    fadd d25, d24, d20
    fadd d26, d25, d22
    fmul d27, d26, d9
    str d27, [x14, 8]
    ldr d30, [x15]
    ldr d28, [x16, x18, lsl 3]
    ldr d29, [x16, x30, lsl 3]
    fadd d31, d28, d30
    fadd d2, d31, d27
    fadd d0, d2, d29
    fmul d30, d0, d9
    str d30, [x15, -8]
    cmp x7, x15
    bne .L20
    mov x1, #222            // END MARKER
    .byte 213,3,32,31       // END MARKER


Gauss-Seidel Method Example – Output

                     Port pressure in cycles
    | 0 - 0DV | 1 - 1DV | 2    | 3    | 4    | 5    || CP   | LCD |
----------------------------------------------------------------------------
520 |         |         |      |      |      |      ||      |     | .L20:
521 |         |         |      | 0.50 | 0.50 |      ||  4.0 |     | ldr d31, [x15, x18, lsl 3]
522 |         |         |      | 0.50 | 0.50 |      ||      |     | ldr d0, [x15, 8]
523 | 0.50    | 0.50    |      |      |      |      ||      |     | mov x14, x15
524 | 0.33    | 0.33    | 0.33 |      |      |      ||      |     | add x16, x15, 24
525 |         |         |      | 0.50 | 0.50 |      ||      |     | ldr d2, [x15, x30, lsl 3]
526 | 0.33    | 0.33    | 0.33 |      |      |      ||      |     | add x15, x15, 32
527 | 0.50    | 0.50    |      |      |      |      ||  6.0 |     | fadd d1, d31, d0
528 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fadd d3, d1, d30
529 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fadd d4, d3, d2
530 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fmul d5, d4, d9
531 |         |         |      | 0.50 | 0.50 | 1.00 ||  4.0 |     | str d5, [x14], 8
532 |         |         |      | 0.50 | 0.50 |      ||  4.0 |     | ldr d6, [x14, x18, lsl 3]
533 |         |         |      | 0.50 | 0.50 |      ||      |     | ldr d16, [x14, 8]
534 | 0.33    | 0.33    | 0.33 |      |      |      ||      |     | add x13, x14, 8
535 |         |         |      | 0.50 | 0.50 |      ||      |     | ldr d7, [x14, x30, lsl 3]
536 | 0.50    | 0.50    |      |      |      |      ||  6.0 |     | fadd d17, d6, d16
537 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fadd d18, d17, d5
538 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fadd d19, d18, d7
539 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fmul d20, d19, d9
540 |         |         |      | 0.50 | 0.50 | 1.00 ||      |     | str d20, [x15, -24]
541 |         |         |      | 0.50 | 0.50 |      ||      |     | ldr d21, [x13, x18, lsl 3]
542 |         |         |      | 0.50 | 0.50 |      ||      |     | ldr d23, [x14, 16]
543 |         |         |      | 0.50 | 0.50 |      ||      |     | ldr d22, [x13, x30, lsl 3]
544 | 0.50    | 0.50    |      |      |      |      ||      |     | fadd d24, d21, d23
545 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fadd d25, d24, d20
546 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fadd d26, d25, d22
547 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fmul d27, d26, d9
548 |         |         |      | 0.50 | 0.50 | 1.00 ||      |     | str d27, [x14, 8]
549 |         |         |      | 0.50 | 0.50 |      ||      |     | ldr d30, [x15]
550 |         |         |      | 0.50 | 0.50 |      ||      |     | ldr d28, [x16, x18, lsl 3]
551 |         |         |      | 0.50 | 0.50 |      ||      |     | ldr d29, [x16, x30, lsl 3]
552 | 0.50    | 0.50    |      |      |      |      ||      |     | fadd d31, d28, d30
553 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fadd d2, d31, d27
554 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fadd d0, d2, d29
555 | 0.50    | 0.50    |      |      |      |      ||  6.0 | 6.0 | fmul d30, d0, d9
556 |         |         |      | 0.50 | 0.50 | 1.00 ||  4.0 |     | str d30, [x15, -8]
557 | 0.33    | 0.33    | 0.33 |      |      |      ||      |     | cmp x7, x15
558 |         |         |      |      |      |      ||      |     | * bne .L20

      9.83      9.83      1.33   8.00   8.00   4.00   100.0  72.0

Loop-Carried Dependencies Analysis Report
-----------------------------------------
526 |  1.0 | add x15, x15, 32 | [526]
555 | 72.0 | fmul d30, d0, d9 | [528, 529, 530, 537, 538, 539, 545, 546, 547, 553, 554, 555]

Block Throughput 2.46 cy (highlighted port totals: 9.83, 9.83)
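
The reported block throughput of 2.46 cy matches the largest port total divided by four. The Python sketch below reproduces that arithmetic and applies the same scaling to the loop-carried dependency. The unroll factor of 4 is an assumption inferred from the four `str` instructions and the 32-byte pointer increment in the listing, and the per-iteration LCD figure is derived here, not quoted from the slides.

```python
# Totals row of the report: ports 0, 1, 2, 3, 4, 5 (cycles per assembly block).
port_totals = [9.83, 9.83, 1.33, 8.00, 8.00, 4.00]
lcd_cycles = 72.0   # longest loop-carried dependency chain in the block
unroll_factor = 4   # assumption: four source-level iterations per block

# Throughput bound and dependency bound per source-level iteration.
tp_per_iteration = max(port_totals) / unroll_factor
lcd_per_iteration = lcd_cycles / unroll_factor

print(f"throughput bound: about {tp_per_iteration:.3f} cy/it")
print(f"loop-carried dependency bound: {lcd_per_iteration:.1f} cy/it")
```

Under these assumptions the dependency bound is several times larger than the throughput bound, in line with the slide's remark that the kernel is limited by its loop-carried dependency.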


Port pressure in cycles| 0 - 0DV | 1 - 1DV | 2 | 3 | 4 | 5 || CP | LCD |

----------------------------------------------------------------------------520 | | | | | | || | | .L20:521 | | | | 0.50 | 0.50 | || 4.0 | | ldr d31, [x15, x18, lsl 3]522 | | | | 0.50 | 0.50 | || | | ldr d0, [x15, 8]523 | 0.50 | 0.50 | | | | || | | mov x14, x15524 | 0.33 | 0.33 | 0.33 | | | || | | add x16, x15, 24525 | | | | 0.50 | 0.50 | || | | ldr d2, [x15, x30, lsl 3]526 | 0.33 | 0.33 | 0.33 | | | || | | add x15, x15, 32527 | 0.50 | 0.50 | | | | || 6.0 | | fadd d1, d31, d0528 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d3, d1, d30529 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d4, d3, d2530 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d5, d4, d9531 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d5, [x14], 8532 | | | | 0.50 | 0.50 | || 4.0 | | ldr d6, [x14, x18, lsl 3]533 | | | | 0.50 | 0.50 | || | | ldr d16, [x14, 8]534 | 0.33 | 0.33 | 0.33 | | | || | | add x13, x14, 8535 | | | | 0.50 | 0.50 | || | | ldr d7, [x14, x30, lsl 3]536 | 0.50 | 0.50 | | | | || 6.0 | | fadd d17, d6, d16537 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d18, d17, d5538 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d19, d18, d7539 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d20, d19, d9540 | | | | 0.50 | 0.50 | 1.00 || | | str d20, [x15, -24]541 | | | | 0.50 | 0.50 | || | | ldr d21, [x13, x18, lsl 3]542 | | | | 0.50 | 0.50 | || | | ldr d23, [x14, 16]543 | | | | 0.50 | 0.50 | || | | ldr d22, [x13, x30, lsl 3]544 | 0.50 | 0.50 | | | | || | | fadd d24, d21, d23545 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d25, d24, d20546 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d26, d25, d22547 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d27, d26, d9548 | | | | 0.50 | 0.50 | 1.00 || | | str d27, [x14, 8]549 | | | | 0.50 | 0.50 | || | | ldr d30, [x15]550 | | | | 0.50 | 0.50 | || | | ldr d28, [x16, x18, lsl 3]551 | | | | 0.50 | 0.50 | || | | ldr d29, [x16, x30, lsl 3]552 | 0.50 | 0.50 | | | | || | | fadd d31, d28, d30553 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d2, d31, d27554 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d0, d2, 
d29555 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d30, d0, d9556 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d30, [x15, -8]557 | 0.33 | 0.33 | 0.33 | | | || | | cmp x7, x15558 | | | | | | || | | * bne .L20

9.83 9.83 1.33 8.00 8.00 4.00 100.0 72.0

Loop-Carried Dependencies Analysis Report-----------------------------------------526 | 1.0 | add x15, x15, 32 | [526]555 | 72.0 | fmul d30, d0, d9 | [528, 529, 530, 537, 538, 539, 545, 546, 547, 553, 554, 555]

PMBS19 | OSACA | Jan Laukemann

100

Block Throughput 2.46 cy

Critical Path 25.0 cy

9.83 9.83

Gauss-Seidel Method Example – Output

Page 42: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 42

Port pressure in cycles| 0 - 0DV | 1 - 1DV | 2 | 3 | 4 | 5 || CP | LCD |

----------------------------------------------------------------------------520 | | | | | | || | | .L20:521 | | | | 0.50 | 0.50 | || 4.0 | | ldr d31, [x15, x18, lsl 3]522 | | | | 0.50 | 0.50 | || | | ldr d0, [x15, 8]523 | 0.50 | 0.50 | | | | || | | mov x14, x15524 | 0.33 | 0.33 | 0.33 | | | || | | add x16, x15, 24525 | | | | 0.50 | 0.50 | || | | ldr d2, [x15, x30, lsl 3]526 | 0.33 | 0.33 | 0.33 | | | || | | add x15, x15, 32527 | 0.50 | 0.50 | | | | || 6.0 | | fadd d1, d31, d0528 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d3, d1, d30529 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d4, d3, d2530 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d5, d4, d9531 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d5, [x14], 8532 | | | | 0.50 | 0.50 | || 4.0 | | ldr d6, [x14, x18, lsl 3]533 | | | | 0.50 | 0.50 | || | | ldr d16, [x14, 8]534 | 0.33 | 0.33 | 0.33 | | | || | | add x13, x14, 8535 | | | | 0.50 | 0.50 | || | | ldr d7, [x14, x30, lsl 3]536 | 0.50 | 0.50 | | | | || 6.0 | | fadd d17, d6, d16537 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d18, d17, d5538 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d19, d18, d7539 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d20, d19, d9540 | | | | 0.50 | 0.50 | 1.00 || | | str d20, [x15, -24]541 | | | | 0.50 | 0.50 | || | | ldr d21, [x13, x18, lsl 3]542 | | | | 0.50 | 0.50 | || | | ldr d23, [x14, 16]543 | | | | 0.50 | 0.50 | || | | ldr d22, [x13, x30, lsl 3]544 | 0.50 | 0.50 | | | | || | | fadd d24, d21, d23545 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d25, d24, d20546 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d26, d25, d22547 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d27, d26, d9548 | | | | 0.50 | 0.50 | 1.00 || | | str d27, [x14, 8]549 | | | | 0.50 | 0.50 | || | | ldr d30, [x15]550 | | | | 0.50 | 0.50 | || | | ldr d28, [x16, x18, lsl 3]551 | | | | 0.50 | 0.50 | || | | ldr d29, [x16, x30, lsl 3]552 | 0.50 | 0.50 | | | | || | | fadd d31, d28, d30553 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d2, d31, d27554 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d0, d2, 
d29555 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d30, d0, d9556 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d30, [x15, -8]557 | 0.33 | 0.33 | 0.33 | | | || | | cmp x7, x15558 | | | | | | || | | * bne .L20

9.83 9.83 1.33 8.00 8.00 4.00 100.0 72.0

Loop-Carried Dependencies Analysis Report-----------------------------------------526 | 1.0 | add x15, x15, 32 | [526]555 | 72.0 | fmul d30, d0, d9 | [528, 529, 530, 537, 538, 539, 545, 546, 547, 553, 554, 555]

PMBS19 | OSACA | Jan Laukemann

100 72

Block Throughput 2.46 cy

Critical Path 25.0 cy

Loop-Carried Dep. 18.0 cy9.83 9.83

Gauss-Seidel Method Example – Output
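
The three summary values can be reproduced from the column sums of the table. The following is a minimal sketch, assuming (as the numbers suggest) that each sum is divided by the unroll factor of 4 listed in the results table and that the block throughput is taken from the most heavily loaded port. The names are illustrative and not part of OSACA.

```python
# Column sums copied from the OSACA output for the ThunderX2 kernel
# (cycles per assembly block, i.e. per four unrolled iterations).
PORT_PRESSURE_SUMS = {"0": 9.83, "1": 9.83, "2": 1.33, "3": 8.00, "4": 8.00, "5": 4.00}
CP_SUM = 100.0   # CP column sum reported in the output
LCD_SUM = 72.0   # longest loop-carried dependency chain

UNROLL_FACTOR = 4  # assumption: the "4x" unroll factor from the results table


def per_iteration(cycles_per_block: float, unroll: int = UNROLL_FACTOR) -> float:
    """Convert cycles per assembly block into cycles per source-level iteration."""
    return cycles_per_block / unroll


# Throughput bound: the busiest port limits how fast blocks can be issued.
block_throughput = per_iteration(max(PORT_PRESSURE_SUMS.values()))
critical_path = per_iteration(CP_SUM)
loop_carried_dep = per_iteration(LCD_SUM)

print(f"Block Throughput  {block_throughput:.2f} cy")
print(f"Critical Path     {critical_path:.1f} cy")
print(f"Loop-Carried Dep. {loop_carried_dep:.1f} cy")
```

Dividing by the unroll factor is what makes these values comparable to the measured cycles per iteration.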

[Figure: dependency graph of the loop body (instructions 521 to 557). Edges are labeled with latencies in cycles. Highlighted: the critical path (CP) and the two loop-carried dependencies (LCD 1, LCD 2).]

Gauss-Seidel Method Example – Output
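
A critical path of this kind is the longest latency-weighted path through the dependency graph. The sketch below illustrates the idea on a hand-picked excerpt (the first stencil update, lines 521 to 531). It is not OSACA's implementation: the edge list only follows the floating-point register dependencies visible in the assembly, address-register dependencies are ignored, and the `ldr` latency of 4 cy is assumed for all loads.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Latencies in cycles, read off the CP column of the OSACA output.
LATENCY = {"ldr": 4.0, "fadd": 6.0, "fmul": 6.0, "str": 4.0}

# Illustrative excerpt: line number -> (mnemonic, lines it depends on).
NODES = {
    521: ("ldr", set()),         # ldr d31, ...
    522: ("ldr", set()),         # ldr d0, ...
    525: ("ldr", set()),         # ldr d2, ...
    527: ("fadd", {521, 522}),   # fadd d1, d31, d0
    528: ("fadd", {527}),        # fadd d3, d1, d30
    529: ("fadd", {528, 525}),   # fadd d4, d3, d2
    530: ("fmul", {529}),        # fmul d5, d4, d9
    531: ("str", {530}),         # str d5, ...
}


def critical_path(nodes):
    """Return (length in cycles, list of lines) of the longest path in the DAG."""
    graph = {line: deps for line, (_, deps) in nodes.items()}
    finish = {}     # earliest completion time of each instruction
    best_pred = {}  # predecessor that determines that completion time
    for line in TopologicalSorter(graph).static_order():
        mnemonic, deps = nodes[line]
        pred = max(deps, key=lambda d: finish[d], default=None)
        start = finish[pred] if pred is not None else 0.0
        finish[line] = start + LATENCY[mnemonic]
        best_pred[line] = pred
    # Walk back from the instruction that finishes last.
    node = max(finish, key=finish.get)
    length, path = finish[node], []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return length, path[::-1]


length, path = critical_path(NODES)
print(length, path)
```

Processing the nodes in topological order guarantees that every producer's completion time is known before its consumers are visited, so one pass over the graph suffices.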

Measured performance and predictions [cy/it] ("-" = no value on the slide)

Architecture         | Unroll | Measured       | OSACA              | IACA               | LLVM-MCA
                     | factor | MLUP/s  cy/it  | TP    LCD    CP    | TP    LCD     CP   | TP    LCD    CP
Intel Cascade Lake X | 4x     | 178.3   14.02  | 2.19  14.0   18.0  | 2.0   (14.0)  -    | 2.0   14.75  19.0
AMD Zen              | 4x     | 194.4   11.83  | 2.0   11.5   15.0  | -     -       -    | 3.0   18.0   24.0
Marvell ThunderX2    | 4x     | 118.9   18.50  | 2.46  18.0   25.0  | -     -       -    | -     -      -

Results & Comparison
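
The OSACA columns can be checked against the measurements with a few lines of arithmetic. This is a minimal sketch under two assumptions not spelled out on this slide: the larger of TP and LCD is treated as the lower runtime bound and CP as the upper bound, and one iteration corresponds to one lattice update, so that MLUP/s times cy/it gives the implied clock frequency in MHz.

```python
# (measured cy/it, measured MLUP/s, OSACA TP, OSACA LCD, OSACA CP),
# copied from the results table.
RESULTS = {
    "Intel Cascade Lake X": (14.02, 178.3, 2.19, 14.0, 18.0),
    "AMD Zen": (11.83, 194.4, 2.0, 11.5, 15.0),
    "Marvell ThunderX2": (18.50, 118.9, 2.46, 18.0, 25.0),
}


def analyze(measured, mlups, tp, lcd, cp):
    """Compare one measurement with the predicted runtime bounds."""
    lower = max(tp, lcd)  # assumption: throughput or LCD limits the best case
    return {
        "lower": lower,
        "upper": cp,      # assumption: critical path as the worst case
        "within_bounds": lower <= measured <= cp,
        "deviation": (measured - lower) / measured,
        "implied_mhz": mlups * measured,  # 1e6 updates/s * cycles/update
    }


report = {arch: analyze(*row) for arch, row in RESULTS.items()}
for arch, r in report.items():
    print(f"{arch}: bounds [{r['lower']}, {r['upper']}] cy/it, "
          f"deviation from lower bound {r['deviation']:.1%}, "
          f"implied clock {r['implied_mhz']:.0f} MHz")
```

On all three architectures the measurement lies between the two bounds and close to the LCD value, which is what one would expect for a kernel limited by its loop-carried dependency.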


• Automatic extraction, throughput and critical path analysis

• Cross-platform (Intel, AMD, ARM)

• Accurate predictions

• Open Source

• Allows architectural exploration


Summary – OSACA


• Support of hidden dependencies

• More precise LCD analysis

• Support new micro-architectures (Zen 2, Power 9, …)

• More precise latency analysis for FMA instructions

• Considering ROB, register renaming, retirement, …

• Optimally balanced port utilization


Future Work


Open Source Architecture Code Analyzer

github.com/RRZE-HPC/OSACA

Reproduce at: https://github.com/RRZE-HPC/OSACA-CP-2019/