
Integrated Systems Laboratory

Exercise: ASIP Programming

Introduction to Tensilica Xtensa Xplorer

Michael Gautschi

12.05.2014


Getting Started

• Copy data from master account:
    $ mkdir asip_ex
    $ cp /home/soc_master/asip_ex/asip_ex.xws asip_ex/.

• Start Xtensa Xplorer:
    $ cd asip_ex
    $ xtensa xplorer&

• Select the created directory ‘asip_ex’ as workspace

• Open the following preferences page to adjust the installation directories: Window -> Preferences -> Directories & Tools -> Directories

• Change the directories according to the figure.
  [Figure: directory settings. Paths shown: /home/soc_master/xtensa/downloads, /home/soc_XX/asip_ex/XtensaRegistry, /scratch/soc_XX/xtensa/install/builds]

• Install the workspace:
  – File -> Import -> Xtensa Xplorer -> Import Xtensa Xplorer Workspace
  – Select the asip_ex.xws we copied to our asip_ex folder
  – Import everything (select all), click Finish, and wait until everything is imported.

• Return to the workbench; we are ready to start!


Exercises

• 1. Simple TIE programming – Learn how to extend a processor with custom instructions

• 2. GCD (Greatest Common Divisor) – Accelerate the GCD computation using your own instruction!

• 3. Matrix Transform – Explore different options how to program instruction set extensions

• 4. CRC – Using the benefit of a bitstream co-processor

• 5. Viterbi Decoder – Using a Soft Stream Processor with a Viterbi Decoder



Getting to Know Xtensa Xplorer 1/4

[Figure: Xtensa Xplorer main window. Labeled elements:
  – Perspectives: C, Debug, Benchmark
  – Project Explorer
  – Console output
  – Select active project; select active processor configuration
  – Available processor configurations
  – Available TIE sources
  – Edit and see the “Lacerta” processor configuration
  – Get power/area/timing estimates of the “Lacerta” processor configuration (estimates assume a 28 nm technology)]



Adding a Tensilica Instruction Extension (TIE) source to a processor

1. Select Core Configuration
2. Add TIE Source to Processor Configuration
3. Select TIE source
4. Compile TIE Source
5. Observe Console Output
6. Compiled TIE source



How to debug an application

1. Switch to Debug Perspective
2. Add Breakpoints
3. Start the Debugger
4. Step through the code

Views shown in the figure: Observe Variables, Disassembled Code



How to profile an application

1. Switch to Benchmark Perspective
2. Start the Profiler
3. Select a line to see profiling results
4. Select an instruction of the disassembled code
5. See the status of the pipeline when executing the selected instruction

Callouts in the figure: profiling results; number of cycles used by the main function; number of cycles spent to complete the selected statement


Xtensa Xplorer Documentation

• Tensilica Xtensa Xplorer is very well documented!
  – Documentation can be accessed in the following two ways:
    • Help -> Welcome -> Help/Docs
    • Using a browser: file:///usr/pack/xtensa-2014.5-bt/XtDevTools/downloads/RE-2014.5/docs/index.html
  – Throughout this exercise session we will use:
    • Tensilica Instruction Extension (TIE) Language User's Guide (Exercise 1-3)
    • ConnX BSP3 Bit Stream Processor 3 User's Guide (Exercise 4)
    • ConnX SSP16 Soft Stream Processor User's Guide (Exercise 5)
  – Other useful sources:
    • Xtensa Instruction Set Architecture (ISA) Reference Manual
    • Xtensa Hardware User's Guide



Available Processor Configurations

• Creating your own processor core
  – Generate a basic configuration file
  – Customize it
  – Upload the configuration
  – Processors will be built and verified on the Tensilica servers (takes 30-60 minutes)
  – Download & install the provided package

• Pre-built processor configurations:

  Name     GPR regs  Mult   Pipeline depth  I$             I$/Mem. interface width  Coproc. type  Exercise
  Lacerta  16        None   5               None           32/64                    None          1/3
  Corvus   32        32bit  5               2 KB (direct)  32/32                    None          0/1/2/4
  Cygnus   32        32bit  5               None           64/32 (2*LSU)            BSP           4
  Pavo     64        16bit  7               32 KB (2way)   64/128 (2*LSU)           SSP           5


Exercise 1 – Simple TIE Programming 1/3

a) HelloWorld
  • Set Helloworld as the active project and select the “Corvus” processor core
  • Run, debug, and profile the application
  • The console should output “helloworld”

b) TIE-Example 1
  • Review the TIE source tie_example1.tie
  • Attach the provided TIE source to the Corvus core
  • Run the application


tie_example1:
1. Sum three vectors
  – Use the TIE function ADD3 to sum three elements in one instruction!
  – ADD3 is included in the header tie_example1.h
  – The definition is in tie_example1.tie
2. Divide each element by two

tie_example1.c:

    #include <stdio.h>
    #include <xtensa/tie/tie_example1.h>

    int main(void)
    {
        unsigned int a[10] = {0,0,0,0,0,0,0,0,0,0};
        unsigned int b[10] = {0,1,2,3,4,5,6,7,8,9};
        unsigned int c[10] = {9,8,7,6,5,4,3,2,1,0};
        int i;

        for (i = 0; i < 10; i++) {
            ADD3(a[i], b[i], c[i]);   /* a[i] = a[i] + b[i] + c[i] */
            a[i] = a[i] >> 1;         /* divide by two */
        }
    }



c) TIE-Example 2
  • Attach the TIE source, compile it, and run the application with the custom instruction

tie_example1.tie:

    operation ADD3 {inout AR res, in AR in0, in AR in1} {}
    {
        assign res = res + in0 + in1;
    }

Custom instruction ADD3:
  • New instruction name: ADD3
  • Argument list: inputs res, in0, in1; output res
  • Argument direction: in, out, or inout
  • Argument type: AR (normal processor register)
  • The operation body uses a Verilog-like syntax

tie_example2.tie:

    operation ADD3_DIV2 {inout AR res, in AR in0, in AR in1} {}
    {
        wire [31:0] tmp = res + in0 + in1;
        assign res = {1'b0, tmp[31:1]};
    }

Custom instruction ADD3_DIV2:
  • Sums up the three inputs res, in0, in1
  • Divides tmp by 2
  • Writes the result back to register res
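
To check the custom instructions against expected values, a plain-C reference model can be handy. The following is a minimal sketch, not part of the exercise sources: the names add3_model and add3_div2_model are hypothetical, and the model assumes that the 32-bit wire tmp simply drops any carry out of bit 31, as the TIE bodies above suggest.

```c
#include <stdint.h>

/* Hypothetical plain-C model of ADD3: res = res + in0 + in1.
 * Unsigned arithmetic wraps modulo 2^32, like a 32-bit AR register. */
static uint32_t add3_model(uint32_t res, uint32_t in0, uint32_t in1)
{
    return res + in0 + in1;
}

/* Hypothetical plain-C model of ADD3_DIV2: same sum, then a logical
 * shift right by one, which corresponds to {1'b0, tmp[31:1]}. */
static uint32_t add3_div2_model(uint32_t res, uint32_t in0, uint32_t in1)
{
    uint32_t tmp = res + in0 + in1;  /* mirrors the 32-bit wire tmp */
    return tmp >> 1;
}
```

For the vectors in tie_example1.c, add3_div2_model(0, b[i], c[i]) should match the value the loop leaves in a[i], since b[i] + c[i] is 9 for every i.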



d) ByteSwap
  • Swaps the bytes of a word (reverses the byte order) using a conventional C function and a custom instruction
  • Attach the TIE source and profile the application!

byteswap.tie:

    state COUNT 32 add_read_write

    operation BYTESWAP {out AR outR, in AR inpR} {inout COUNT}
    {
        assign outR = {inpR[7:0], inpR[15:8], inpR[23:16], inpR[31:24]};
        assign COUNT = COUNT + 1;
    }

New TIE element: state
  • A state is not stored in the processor's register file, but in the “accelerator” itself.
  • The keyword add_read_write automatically generates functions to read from and write to the state.
  • The state COUNT can be accessed in C with RCOUNT() and WCOUNT(value).
  • The state COUNT is incremented each time BYTESWAP() is called.

C code:
  • Compare the golden model with the TIE function
  • Read the final value of state COUNT with RCOUNT() and print it.

byteswap.c:

    …
    for (i = 0; i < NUM; i++) {
        s = data[i % N];
        if (GOLDEN_BYTESWAP(s) != BYTESWAP(s))
            fail++;
    }
    printf("State COUNT=%d\n", RCOUNT());
    …
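
The slides do not show GOLDEN_BYTESWAP itself. As a rough illustration of what such a golden model could look like, here is a sketch in plain C; the names byteswap_model and swap_count are hypothetical, and swap_count only imitates the behaviour of the TIE state COUNT in software.

```c
#include <stdint.h>

/* Hypothetical software stand-in for the TIE state COUNT. */
static uint32_t swap_count = 0;

/* Hypothetical golden model: reverses the byte order of a 32-bit word,
 * mirroring {inpR[7:0], inpR[15:8], inpR[23:16], inpR[31:24]},
 * and counts the calls the way BYTESWAP increments COUNT. */
static uint32_t byteswap_model(uint32_t x)
{
    swap_count++;
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x & 0x00FF0000u) >> 8)  |
           ((x & 0xFF000000u) >> 24);
}
```

Applying the model twice returns the original word, which is a quick sanity check for both the C function and the custom instruction.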


Exercise 2 – Greatest Common Divisor

• Find the greatest common divisor of two integers a and b.

• The function golden_gcd() returns the gcd.
• Accelerate this function by introducing a new instruction GCD().

• Tasks:
  1. Study the golden_gcd() function.
  2. Define a new TIE function GCD() which accelerates the C code inside the while loop.
  3. Add the TIE source to the “Corvus” processor core and compile it.
  4. Call the defined TIE function inside xtensa_gcd().
  5. Profile the application and find the speedup!

gcd.c:

    …
    int golden_gcd(int a, int b)
    {
        while (a != b) {
            if (a > b)
                a = a - b;
            else
                b = b - a;
        }
        return a;
    }
    …
    int xtensa_gcd(int a, int b)
    {
        while (a != b) {
            GCD(a, b);   /* one subtraction step, done by the custom instruction */
        }
        return a;
    }
    …
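
Before writing the TIE source, it can help to pin down in plain C what a single GCD() call is expected to do to its two operands. The sketch below is an assumption about that behaviour, derived from the loop body of golden_gcd(); gcd_step_model and gcd_model are hypothetical names and are not part of gcd.c.

```c
#include <assert.h>

/* Hypothetical model of one GCD() call: both operands are inout,
 * and the larger one is reduced by the smaller one. */
static void gcd_step_model(int *a, int *b)
{
    assert(*a > 0 && *b > 0);   /* the subtractive algorithm needs positive inputs */
    if (*a > *b)
        *a = *a - *b;
    else if (*b > *a)
        *b = *b - *a;
}

/* Same loop structure as xtensa_gcd(), using the software step. */
static int gcd_model(int a, int b)
{
    while (a != b)
        gcd_step_model(&a, &b);
    return a;
}
```

Comparing gcd_model() with golden_gcd() on a few input pairs gives a reference for the values the custom instruction should produce after each step.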


TIE Modules


• TIE modules offer some basic functionality and can be directly used in a TIE source.


Exercise 3 – Matrix Transform 1/7

• We want to compute the following matrix transformation, A = B * M:

                                     | M0  M4  M8   M12 |
    [a0 a1 a2 a3] = [b0 b1 b2 b3] *  | M1  M5  M9   M13 |
                                     | M2  M6  M10  M14 |
                                     | M3  M7  M11  M15 |

• golden_matrixtransform() computes A in pure C.

matrixtransform.c:

    …
    int golden_matrixtransform(int A[], short B[], short M[])
    {
        int i;
        int a0 = 0, a1 = 0, a2 = 0, a3 = 0;   /* accumulators must start at zero */

        for (i = 0; i < 4; i++) {
            a0 += B[i] * M[i];
            a1 += B[i] * M[i+4];
            a2 += B[i] * M[i+8];
            a3 += B[i] * M[i+12];
        }
        A[0] = a0;
        A[1] = a1;
        A[2] = a2;
        A[3] = a3;
        return 0;   /* the slide shows no return statement; 0 is assumed here */
    }
    …

• Task 1:
  – In matrixtransform.tie: define a MAC unit for each element of A. (Use the TIEmac() module to build a MAC.)
  – Since the result is only required at the end, use a state for the accumulation register.
  – Add matrixtransform.tie to the Lacerta core and compile it.
  – Initialize the accumulators at the beginning.
  – Read the accumulators at the end and assign them to A.
  – The read/write functions for the states are defined in matrixtransform.h
  – Run the application.
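
The flow of Task 1 (initialize the accumulators, issue one MAC per product, read the accumulators at the end) can be rehearsed in plain C first. This is only a sketch under assumptions: the array acc stands in for the four TIE states, and acc_init, mac_model, and matrixtransform_model are hypothetical names, not the functions generated in matrixtransform.h.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the four accumulator states. */
static int32_t acc[4];

/* Models writing zero to every accumulator state at the beginning. */
static void acc_init(void)
{
    for (int k = 0; k < 4; k++)
        acc[k] = 0;
}

/* Models one MAC instruction: signed 16x16-bit multiply, 32-bit accumulate. */
static void mac_model(int idx, int16_t x, int16_t y)
{
    acc[idx] += (int32_t)x * (int32_t)y;
}

/* Init, accumulate, then read the states back into A. */
static void matrixtransform_model(int32_t A[4], const int16_t B[4], const int16_t M[16])
{
    acc_init();
    for (int i = 0; i < 4; i++) {
        mac_model(0, B[i], M[i]);
        mac_model(1, B[i], M[i + 4]);
        mac_model(2, B[i], M[i + 8]);
        mac_model(3, B[i], M[i + 12]);
    }
    for (int k = 0; k < 4; k++)
        A[k] = acc[k];   /* corresponds to reading each state at the end */
}
```

With B = {1, 2, 3, 4} and M[i] = i, the model yields A = {20, 60, 100, 140}, which can serve as a small regression vector next to the golden function.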



Check the additional area of the TIE instruction:

[Figure: area estimates of a compiled TIE source, highlighting 4 MAC units and 4 accumulator states]

The four units are never used concurrently => This solution is not optimal!



Can we share the MAC units? Yes, with TIE functions!

TIE function mac16:
  • 32 bit return value
  • 32 bit accumulator
  • 16 bit inputs
  • shared keyword

matrixtransform_basic.tie:

    …
    function [31:0] mac16 ([31:0] accumulator, [15:0] multiplier, [15:0] multiplicand) shared
    {
        assign mac16 = TIEmac(multiplier, multiplicand, accumulator, 1'b1, 1'b0);
    }

• Task 2:
  – Use a TIE function and call it in each MAC unit.
  – Add matrixtransform_basicfunc.tie to the Lacerta core and compile it.
  – Initialize the accumulators at the beginning.
  – Read the accumulators at the end and assign them to A.
  – Set #define BASICFUNC to run this test.
  – Run the application.

• Check the size of the new TIE source!



A separate register file can be used for the accumulator registers!
TIE syntax: regfile <name> <width> <depth> <short_name>

matrixtransform_rf.tie:

    regfile accum 32 4 ac

    operation mac.accum {in AR oper0, in AR oper1, inout accum accumulator} {}
    {
        assign accumulator = TIEmac(oper0[15:0], oper1[15:0], accumulator, 1'b1, 1'b0);
    }

• The compiler generates load, store, and move instructions to access the register file.
• accum registers can be declared like integers, shorts, etc.: accum acc0;
• The registers can be accessed with a pointer: int *p_acc0 = (int*)&acc0;
• Task 3:
  – Define a register file in matrixtransform_rf.tie instead of a state.
  – Add matrixtransform_rf.tie to the Lacerta core and compile it.
  – Declare each register of the register file.
  – Initialize the register file at the beginning.
  – Read the register file at the end and assign its values to A.
  – Set #define RF to run this test.
  – Run the application.

• Compare the size and number of instructions



Single Instruction, Multiple Data (SIMD) of the Matrix Transformation

                                     | M0  M4  M8   M12 |
    [a0 a1 a2 a3] = [b0 b1 b2 b3] *  | M1  M5  M9   M13 |
                                     | M2  M6  M10  M14 |
                                     | M3  M7  M11  M15 |

matrixtransform.c:

    …
    for (i = 0; i < 4; i++) {
        a0 += B[i] * M[i];
        a1 += B[i] * M[i+4];
        a2 += B[i] * M[i+8];
        a3 += B[i] * M[i+12];
    }
    …

matrixtransform.c:

    …
    A[0] = B[0]*M[0]  + B[1]*M[1]  + B[2]*M[2]  + B[3]*M[3];
    A[1] = B[0]*M[4]  + B[1]*M[5]  + B[2]*M[6]  + B[3]*M[7];
    A[2] = B[0]*M[8]  + B[1]*M[9]  + B[2]*M[10] + B[3]*M[11];
    A[3] = B[0]*M[12] + B[1]*M[13] + B[2]*M[14] + B[3]*M[15];
    …

SISD: 16 instructions
SIMD: 4 instructions

Define a new instruction which computes the dot product!
=> The instruction requires 8 shorts (8*16 bit) as input and produces a 32 bit output.
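
A software model of such a dot-product instruction makes the expected result easy to verify. The sketch below assumes that four signed 16-bit lanes are packed into one 64-bit word with lane 0 in the least significant bits; this packing, as well as the names pack4, lane, and dotprod_model, are illustrative assumptions and say nothing about how the register file in the exercise must be laid out.

```c
#include <stdint.h>

/* Packs four signed 16-bit values into one 64-bit word (lane 0 = bits 15..0).
 * The lane order is an assumption made for this sketch. */
static uint64_t pack4(const int16_t s[4])
{
    uint64_t v = 0;
    for (int i = 0; i < 4; i++)
        v |= (uint64_t)(uint16_t)s[i] << (16 * i);
    return v;
}

/* Extracts lane i as a signed 16-bit value. */
static int16_t lane(uint64_t v, int i)
{
    return (int16_t)(uint16_t)(v >> (16 * i));
}

/* Hypothetical model of dotprod: four signed 16x16-bit products,
 * summed into one 32-bit result. */
static int32_t dotprod_model(uint64_t vec, uint64_t mat)
{
    int32_t acc = 0;
    for (int i = 0; i < 4; i++)
        acc += (int32_t)lane(vec, i) * (int32_t)lane(mat, i);
    return acc;
}
```

Each element of A then corresponds to one call, for example dotprod_model(pack4(B), pack4(&M[0])) for a0, which is where the reduction from 16 to 4 operations comes from.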



• Task 4a:
  – Complete the TIE source
  – Change the C code to use the new dotprod() instruction
  – Compare the number of cycles with the previous implementations!

matrixtransform_SIMD.tie:

    regfile vec16x4 WIDTH DEPTH vec

    operation dotprod {out AR acc, in vec16x4 vec, in vec16x4 mat} {}
    {
        // 4-way SIMD multiplication
        wire [31:0] prod0 = TIEmul(vec[..], mat[..], 1'b1);
        wire ...
        // fused accumulation
        assign acc = TIEaddn(….);
    }

• The SIMD-like instruction speeds up the matrix transformation by a factor of ~4.
• The TIE instruction above computes the sum of 4 multiplications in one cycle!
  => We have to expect a negative impact on our timing constraints!
• Solution: a multi-cycle instruction!
  => Define a schedule for this instruction:
     schedule <schedule_name> {operation-list} {stage_assignments}
  – Explicit scheduling: define the cycle in which each assignment is made.
  – Automatic scheduling: define the number of cycles the instruction can use.



• Task 4b:
  – Let dotprod take 3 cycles.
  – During synthesis, Synopsys Design Compiler will automatically retime the dotprod function.

matrixtransform_SIMD.tie (automatic scheduling):

    …
    schedule dotprod_sched {dotprod} {}
    {
        def acc 3;
    }

• Task 5:
  – Use explicit scheduling and share the multiplication units with a shared function (as with the MAC unit earlier in this exercise)!
  – Make sure that the multiplication unit is used only once in each cycle!

matrixtransform_SIMD.tie (explicit scheduling):

    …
    schedule dotprod_sched {dotprod} {}
    {
        def prod0 1;
        ….
        def acc ?;
    }

• Can we save a lot of resources by sharing the multiplication unit?
• What is the price we are paying?


Exercise 4 – Bit Stream Processor: CRC 1/2

• In this exercise we will experience the advantage of the Bit Stream Processor (BSP), using CRC computation as an example.
  – 3 coprocessors
  – FLIX (Flexible Length Instruction eXtension) instruction format (3 slots)
  – (2*32) bit memory / 64 bit instruction interface
  – Read the introduction of the BSP User's Guide (Chapter 1)


Exercise 4 – Bit Stream Processor: CRC 2/2

• Task 1:
  – Run the CRC example code with the Corvus core (without the BSP extension)

• Task 2:
  – Run the CRC example with the Cygnus core (which has a BSP coprocessor)
  – Check in the code what kind of BSP instructions have been used

• More details about the instruction format can be found under: file:///scratch/soc_XX/xtensa/install/builds/RE-2014.5-linux/Cygnus/html/ISA/ISAhtml/index.html

• Task 3:
  – Compare the runtime of the two versions.
  – Also have a look at the power/area/timing estimates of the two configuration options!
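
To see why a bit stream coprocessor pays off, it helps to look at what a CRC costs on a plain core: one shift and a conditional XOR per input bit. The following bit-serial reference is a sketch only. It assumes the common reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR 0xFFFFFFFF); the CRC used in the exercise code may use a different polynomial and width, and crc32_bitwise is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-serial CRC-32 (reflected form, polynomial 0xEDB88320).
 * The polynomial and init/final values are assumptions for this sketch. */
static uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        /* Eight iterations per byte: this inner loop is the part that a
         * bit-stream-oriented instruction set is meant to shorten. */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

For the nine ASCII bytes "123456789" this variant returns 0xCBF43926, the usual check value for this CRC-32 parameter set.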



Exercise 5 – Soft Stream Processor: LTE Viterbi Dec.

Soft Stream Processor (SSP)
  • 64 bit instruction interface
  • (2*128) bit memory interface
    ⇒ 16 way SIMD support!
    ⇒ 16*160 bit vector register file
  • Optional Viterbi decoder

• Task:
  – Read the introduction of the SSP User's Guide (Chapter 1)
  – Run the Viterbi decoder main function with the Pavo core
  – Compare the reference design with the SSP-assisted implementation.
