
Integrated Systems Laboratory

Exercise: ASIP Programming

Introduction to Xtensa Xplorer

Michael Gautschi

17.05.2016


Getting Started

• Copy data from master account:

$ mkdir 5_asip_ex

$ cp /home/soc_master/5_asip/asip_ex_2016_notools.xws 5_asip_ex/.

• Start xtensa xplorer

$ cd 5_asip_ex

$ xtensa-2015.2 xplorer & (use this version!)

• Select the created directory ‘5_asip_ex’ as workspace

• Now we have to adjust the installation directories and locate the Xtensa tools (which are installed on the soc_master account)



• Open the following menu to adjust the installation directories:

Window -> Preferences -> Directories & Tools -> Directories

– Change directories according to the figure:


/home/soc_master/xtensa/downloads

/home/soc_XX/5_asip_ex/XtensaRegistry

/scratch/soc_XX/xtensa/install/builds

/home/soc_master/xtensa/XtDevTools/install/tools

Manage Xtensa Tools => next slide


• Relocate the Xtensa Tools package:


• Install the workspace:

File -> Import -> Xtensa Xplorer -> Import Xtensa Xplorer Workspace

Select the asip_ex_2016_notools.xws file we copied to our 5_asip_ex folder.

Import everything (select all), click Finish, and wait until everything is imported.

• Return to workbench, we are ready to start!

Enter this path!


Exercises

• 1. Simple TIE programming

– Learn how to extend a processor with custom instructions

• 2. GCD (Greatest Common Divisor)

– Accelerate the GCD computation using your own instruction!

• 3. Matrix Transform

– Explore different options how to program instruction set extensions

• 4. ECDSA Example

– Speed up the squaring in the finite field

• 5. CRC

– Using the benefit of a bitstream co-processor

• 6. Viterbi Decoder

– Using a Soft Stream Processor with a Viterbi Decoder



Getting to Know Xtensa Xplorer 1/4

[Figure: Xtensa Xplorer main window with the following annotated elements]

• Perspectives: C Perspective, Debug Perspective, Benchmark Perspective

• Project Explorer

• Console Output

• Select Active Project

• Select Active Processor Configuration

• Available Processor Configurations

• Available TIE sources

• Edit and see the “Lacerta” Processor Configuration

• Get Power/Area/Timing Estimates of the “Lacerta” Processor Configuration

• Power/Area/Timing Estimates assuming a 28 nm Technology



Adding a Tensilica Instruction Extension (TIE) source to a processor

1. Select Core Configuration

2. Add TIE Source to Processor Configuration

3. Select TIE source

4. Compile TIE Source

5. Observe Console Output

6. Compiled TIE source



How to debug an application

1. Switch to Debug Perspective

2. Add Breakpoints

3. Start the Debugger

4. Step through the code

Views annotated in the figure: Observe Variables, Disassembled Code



How to profile an application

1. Switch to Benchmark Perspective

2. Start the Profiler

3. Select a line to see profiling results

4. Select an instruction of the disassembled code

5. See the status of the pipeline when executing the selected instruction

Annotations in the figure: Shows Profiling Results; Number of cycles used by main function; Number of cycles spent to complete this statement


Xtensa Xplorer Documentation

• Tensilica Xtensa Xplorer is very well documented!

– Documentation can be accessed in the following two ways:

• Help -> Welcome -> Help/Docs

• Using a browser:

file:///usr/pack/xtensa-2015.2-bt/XtDevTools/downloads/RF-2015.2/docs/index.html

– Throughout this exercise session we will use:

• Tensilica Instruction Extension (TIE) Language User's Guide (Exercises 1-4)

• ConnX BSP3 Bit Stream Processor 3 User's Guide (Exercise 5)

• ConnX SSP16 Soft Stream Processor User's Guide (Exercise 6)

– Other useful sources:

• Xtensa Instruction Set Architecture (ISA) Reference Manual

• Xtensa Hardware User's Guide



Available Processor Configurations

• Creating your own processor core

– Generate a basic configuration file

– Customize it

– Upload the configuration

– Processors will be built and verified on the Tensilica servers (takes 30-60 minutes)

– Download & install the provided package

• Pre-built processor configurations:

Name     GPR regs  Mult    Pipeline depth  I$             I$/Mem. interface width  Coproc. type  Exercise
Lacerta  16        None    5               None           32/64                    None          1/3
Corvus   32        32 bit  5               2 KB (direct)  32/32                    None          0/1/2/4
Cygnus   32        32 bit  5               None           64/32 (2*LSU)            BSP           5
Pavo     64        16 bit  7               32 KB (2-way)  64/128 (2*LSU)           SSP           6


Exercise 1 – Simple TIE Programming 1/3

a) HelloWorld

• Set HelloWorld as the active project and select the “Corvus” processor core

• Run, Debug, and Profile the application

• Console should output “helloworld”

b) TIE-Example 1

• Review the TIE source tie_example1.tie

• Attach the provided TIE source to the Corvus core

• Run the application


tie_example1:

1. Sum three vectors

– Use the TIE function ADD3 to sum three elements in one instruction!

– ADD3 is included in the header tie_example1.h

– The definition is in tie_example1.tie

2. Divide each element by two

#include <stdio.h>
#include <xtensa/tie/tie_example1.h>

int main(void)
{
    unsigned int a[10] = {0,0,0,0,0,0,0,0,0,0};
    unsigned int b[10] = {0,1,2,3,4,5,6,7,8,9};
    unsigned int c[10] = {9,8,7,6,5,4,3,2,1,0};
    int i;

    for (i = 0; i < 10; i++) {
        ADD3(a[i], b[i], c[i]);   /* a[i] = a[i] + b[i] + c[i] */
        a[i] = a[i] >> 1;         /* divide each element by two */
    }

    return 0;
}

tie_example1.c
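To check the expected values of a[] on a host compiler without the Xtensa tools, the behaviour of ADD3 can be modelled in plain C. The following is only a sketch under assumptions: add3_ref() and sum3_halve_ref() are hypothetical names that do not exist in the exercise workspace, and the model mirrors nothing more than what the slide states about ADD3 (res = res + in0 + in1).

```c
#include <stddef.h>

/* Hypothetical host-side model of the ADD3 instruction:
 * the inout operand res becomes res + in0 + in1 (unsigned wrap-around). */
static void add3_ref(unsigned int *res, unsigned int in0, unsigned int in1)
{
    *res = *res + in0 + in1;
}

/* Mirrors the loop in tie_example1.c: sum three vectors, then halve. */
static void sum3_halve_ref(unsigned int a[], const unsigned int b[],
                           const unsigned int c[], size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        add3_ref(&a[i], b[i], c[i]);
        a[i] = a[i] >> 1;
    }
}
```

With the input vectors of tie_example1.c, every b[i] + c[i] equals 9, so each element of a[] ends up as 9 >> 1 = 4.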



c) TIE-Example 2

• Attach the TIE source, compile it and run the application with the custom instruction

operation ADD3 {inout AR res, in AR in0, in AR in1} {} {
    assign res = res + in0 + in1;
}

tie_example1.tie

Custom instruction ADD3:

• New instruction name: ADD3

• Argument list:

– Inputs: res, in0, in1

– Outputs: res

• Argument direction: in, out, inout

• Argument type: AR (normal processor register)

• The operation body uses a Verilog-like syntax

operation ADD3_DIV2 {inout AR res, in AR in0, in AR in1} {} {
    wire [31:0] tmp = res + in0 + in1;
    assign res = {1'b0, tmp[31:1]};
}

tie_example2.tie

Custom instruction ADD3_DIV2:

• Sums up the three inputs res, in0, in1

• Divides tmp by 2

• Writes the result back to register res
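The 32-bit wire tmp has a consequence that is easy to miss: a carry out of bit 31 is dropped before the division. The sketch below models ADD3_DIV2 in portable C to make this visible; add3_div2_ref() is a hypothetical name used only for illustration, not part of the exercise sources.

```c
#include <stdint.h>

/* Hypothetical model of ADD3_DIV2: the 32-bit wire tmp drops any carry
 * out of bit 31, and {1'b0, tmp[31:1]} is a logical right shift by one. */
static uint32_t add3_div2_ref(uint32_t res, uint32_t in0, uint32_t in1)
{
    uint32_t tmp = res + in0 + in1;   /* wraps modulo 2^32, like the wire */

    return tmp >> 1;                  /* a zero is shifted into bit 31 */
}
```

For small operands the result matches (res + in0 + in1) / 2. If the sum exceeds 32 bits, for example 0xFFFFFFFF + 1 + 0, the model returns 0 rather than 0x80000000.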



d) ByteSwap

• Swaps bytes randomly using a conventional C-function and a custom instruction

• Attach the TIE source, and profile the application!


state COUNT 32 add_read_write

operation BYTESWAP {out AR outR, in AR inpR} {inout COUNT}
{
    assign outR = {inpR[7:0], inpR[15:8], inpR[23:16], inpR[31:24]};
    assign COUNT = COUNT + 1;
}

byteswap.tie

New TIE element: state

• A state is not stored in the processor's register file, but in the SFU itself.

• The keyword add_read_write automatically generates functions to read from and write to the state.

• The state COUNT can be accessed in C with RCOUNT() and WCOUNT(value).

• The state COUNT is incremented each time BYTESWAP() is called.

C code:

• Compare the golden model with the TIE function.

• Read the final value of state COUNT with RCOUNT() and print it.

…
for (i = 0; i < NUM; i++) {
    s = data[i % N];
    if (GOLDEN_BYTESWAP(s) != BYTESWAP(s))
        fail++;
}
printf("State COUNT=%d\n", RCOUNT());
…

byteswap.c
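For reference, the concatenation in byteswap.tie reverses the byte order of a 32-bit word. A plain C model of this behaviour could look like the sketch below. The names byteswap_ref() and count_model are hypothetical; the actual GOLDEN_BYTESWAP in byteswap.c may be written differently.

```c
#include <stdint.h>

/* Hypothetical software counter standing in for the TIE state COUNT. */
static uint32_t count_model = 0;

/* Reverses the byte order of a 32-bit word, like the concatenation
 * {inpR[7:0], inpR[15:8], inpR[23:16], inpR[31:24]} in byteswap.tie. */
static uint32_t byteswap_ref(uint32_t x)
{
    count_model = count_model + 1;    /* mirrors COUNT = COUNT + 1 */

    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x & 0x00FF0000u) >> 8)  |
           ((x & 0xFF000000u) >> 24);
}
```

The C version needs several shift, mask and OR operations per word, whereas the TIE operation is pure wiring and completes in a single instruction.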


Exercise 2 – Greatest Common Divisor

• Find the greatest common divisor of two integers a and b.

• The function golden_gcd() returns the gcd.

• Accelerate this function by introducing a new instruction GCD().

• Task 1: Study the golden_gcd() function.

Define a new TIE function GCD() which accelerates the C code inside the while loop.

Add the TIE source to the “Corvus” processor core and compile it.

Call the defined TIE function inside xtensa_gcd().

Profile the application and find the speedup!

…
int golden_gcd(int a, int b)
{
    while (a != b)
    {
        if (a > b)
            a = a - b;
        else
            b = b - a;
    }
    return a;
}
…
int xtensa_gcd(int a, int b)
{
    while (a != b)
    {
        GCD(a, b);
    }
    return a;
}
…

gcd.c
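Before writing the TIE source, it can help to pin down what a single GCD(a, b) call has to do so that the loop in xtensa_gcd() terminates. The sketch below is one possible software model, assuming both operands are inout and both inputs are positive; gcd_step_ref() and gcd_ref() are hypothetical names, and this is not the sample solution.

```c
/* Hypothetical model of one GCD() step with two inout operands:
 * the larger value is reduced by the smaller one, as in golden_gcd(). */
static void gcd_step_ref(int *a, int *b)
{
    if (*a > *b)
        *a = *a - *b;
    else if (*b > *a)
        *b = *b - *a;
    /* a == b: both operands stay unchanged */
}

/* Same loop structure as xtensa_gcd(); valid for a > 0 and b > 0 only. */
static int gcd_ref(int a, int b)
{
    while (a != b)
        gcd_step_ref(&a, &b);
    return a;
}
```

In this model the compare, the select and the subtract all belong to one step, which is the work a custom instruction can collapse into a single cycle.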


TIE Modules



Exercise 3 – Matrix Transform 1/7

• We want to compute the following matrix transformation: A = B * M

• golden_matrixtransform() computes A in pure C.

• Task 1: In matrixtransform.tie: define a MAC unit for each element of A.

(Use the TIEmac() module to build a MAC).

Since the result is only required at the end, use a state for the accumulation register.

Add matrixtransform.tie to the Lacerta core and compile it.

Initialize the accumulators at the beginning.

Read the accumulators at the end and assign them to A.

The read/write functions for the states are defined in matrixtransform.h.

Run the application.

…
int golden_matrixtransform(int A[],
                           short B[], short M[])
{
    int i;
    /* the accumulators must start at zero */
    int a0 = 0, a1 = 0, a2 = 0, a3 = 0;

    for (i = 0; i < 4; i++)
    {
        a0 += B[i] * M[i];
        a1 += B[i] * M[i+4];
        a2 += B[i] * M[i+8];
        a3 += B[i] * M[i+12];
    }
    A[0] = a0;
    A[1] = a1;
    A[2] = a2;
    A[3] = a3;
}
…

matrixtransform.c

Matrix transformation A = B * M (A and B are row vectors):

[a0 a1 a2 a3] = [b0 b1 b2 b3] * | M0  M4  M8   M12 |
                                | M1  M5  M9   M13 |
                                | M2  M6  M10  M14 |
                                | M3  M7  M11  M15 |



Check the additional area of the TIE instruction:


4 mac units

4 acc states

Get area estimates of a compiled TIE source

The four units are never used concurrently => This solution is not optimal!



Can we share the MAC units?

Yes, with TIE functions!

TIE function mac16:

• 32 bit return value

• 32 bit accumulator

• 16 bit inputs

• shared keyword

• Task 2: Use a TIE function, and call it in each MAC unit.

Add the matrixtransform_basicfunc.tie to the Lacerta core and compile it.

Initialize the accumulators in the beginning.

Read the accumulators in the end and assign them to A.

Set #define BASICFUNC to run this test.


• Check the size of the new TIE source!


…
function [31:0] mac16 ([31:0] accumulator, [15:0] multiplier, [15:0] multiplicand) shared
{
    assign mac16 = TIEmac(multiplier, multiplicand, accumulator, 1'b1, 1'b0);
}

matrixtransform_basic.tie
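A host-side model of mac16 can serve as a reference when comparing the state-based and function-based variants. The sketch below assumes that the last two TIEmac arguments select a signed multiplication (1'b1) without negating the product (1'b0); check the TIE documentation to confirm this. mac16_ref() and matrixtransform_ref() are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical model of mac16: signed 16x16 multiply added to a 32-bit
 * accumulator (assumes the TIEmac flags mean signed = 1, negate = 0). */
static int32_t mac16_ref(int32_t accumulator, int16_t multiplier,
                         int16_t multiplicand)
{
    int32_t product = (int32_t)multiplier * (int32_t)multiplicand;

    /* add in unsigned arithmetic so overflow wraps like 32-bit hardware */
    return (int32_t)((uint32_t)accumulator + (uint32_t)product);
}

/* Matrix transform from matrixtransform.c, using four accumulators
 * that all go through the same (shared) mac16 model. */
static void matrixtransform_ref(int32_t A[4], const int16_t B[4],
                                const int16_t M[16])
{
    int32_t acc[4] = {0, 0, 0, 0};
    int i, j;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            acc[j] = mac16_ref(acc[j], B[i], M[i + 4 * j]);

    for (j = 0; j < 4; j++)
        A[j] = acc[j];
}
```

All four accumulators call the same function one after the other, which is the software analogue of sharing one MAC unit between the four operations.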




• The compiler generates load, store and move instructions to access the register file.

• accum registers can be declared like integers, shorts, etc.: accum acc0;

• The registers can be accessed with a pointer: int *p_acc0 = (int*)&acc0;

• Task 3: Define a register file in matrixtransform_rf.tie instead of a state

Add the matrixtransform_rf.tie to the Lacerta core and compile it.

Declare each register of the register file.

Initialize the register file at the beginning.

Read the register file at the end and assign its values to A.

Set #define RF to run this test.


• Compare the size and number of instructions

regfile accum 32 4 ac

operation mac.accum {in AR oper0, in AR oper1, inout accum accumulator} {}
{
    assign accumulator = TIEmac(oper0[15:0], oper1[15:0], accumulator, 1'b1, 1'b0);
}

matrixtransform_rf.tie

A separate register file can be used for the accumulator register!

TIE syntax: regfile <name> <width> <depth> <short_name>




Single Instruction, Multiple Data (SIMD) of the Matrix Transformation

[a0 a1 a2 a3] = [b0 b1 b2 b3] * | M0  M4  M8   M12 |
                                | M1  M5  M9   M13 |
                                | M2  M6  M10  M14 |
                                | M3  M7  M11  M15 |

…
for (i = 0; i < 4; i++)
{
    a0 += B[i] * M[i];
    a1 += B[i] * M[i+4];
    a2 += B[i] * M[i+8];
    a3 += B[i] * M[i+12];
}
…

matrixtransform.c

…
A[0] = B[0]*M[0]  + B[1]*M[1]  + B[2]*M[2]  + B[3]*M[3];
A[1] = B[0]*M[4]  + B[1]*M[5]  + B[2]*M[6]  + B[3]*M[7];
A[2] = B[0]*M[8]  + B[1]*M[9]  + B[2]*M[10] + B[3]*M[11];
A[3] = B[0]*M[12] + B[1]*M[13] + B[2]*M[14] + B[3]*M[15];
…

matrixtransform.c

SISD: 16 instructions, SIMD: 4 instructions

Define a new instruction which computes the dot product!

=> The instruction requires 8 shorts (8*16 bit) as input and produces a 32-bit output.
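The dot-product instruction described above can be modelled in C by packing four 16-bit values into one 64-bit word per operand. This is a sketch under assumptions: it treats each operand as 4 x 16 bit (following from the 8 shorts stated above), places lane 0 in bits [15:0], and uses the hypothetical names lane16(), pack16x4() and dotprod_ref(). The lane order in your TIE source may differ.

```c
#include <stdint.h>

/* Extracts signed 16-bit lane k (0..3) from a 64-bit vector.
 * Lane 0 in bits [15:0] is an assumption of this sketch. */
static int16_t lane16(uint64_t v, int k)
{
    return (int16_t)(uint16_t)(v >> (16 * k));
}

/* Packs four shorts into one 64-bit vector, lane 0 first. */
static uint64_t pack16x4(const int16_t s[4])
{
    uint64_t v = 0;
    int k;

    for (k = 0; k < 4; k++)
        v |= (uint64_t)(uint16_t)s[k] << (16 * k);
    return v;
}

/* Hypothetical model of dotprod: four signed 16x16 products summed
 * into one 32-bit result (sum wraps modulo 2^32). */
static int32_t dotprod_ref(uint64_t vect, uint64_t mat)
{
    uint32_t acc = 0;
    int k;

    for (k = 0; k < 4; k++)
        acc += (uint32_t)((int32_t)lane16(vect, k) * (int32_t)lane16(mat, k));
    return (int32_t)acc;
}
```

One call to dotprod_ref() replaces the four multiply-accumulate steps of one row in the unrolled code, which is where the reduction from 16 to 4 instructions comes from.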



• Task 4a:

Complete the TIE source

Change the C code to use the new dotprod() instruction

Compare the number of cycles with the previous implementations!

regfile vec16x4 WIDTH DEPTH vec

operation dotprod {out AR acc, in vec16x4 vect, in vec16x4 mat} {}
{
    // 4-way SIMD multiplication
    wire [31:0] prod0 = TIEmul(vect[..], mat[..], 1'b1);
    wire ...

    // fused accumulation
    assign acc = TIEaddn(….);
}

matrixtransform_SIMD.tie

• The SIMD-like instruction makes it possible to speed up the matrix transformation by a factor of ~4.

• The above TIE instruction computes the sum of 4 multiplications in one cycle!

=> We have to expect a negative impact on our timing constraints!

• Solution: multi-cycle instruction!

=> Define a schedule for this instruction.

schedule <schedule_name> {operation-list} {stage_assignments}

Explicit scheduling: define the cycle of each assignment.

Automatic scheduling: define the number of cycles the instruction can use.



• Task 4b:

Let the dotprod take 3 cycles.

During synthesis, Synopsys Design Compiler will automatically retime the dotprod function.

…
schedule dotprod_sched {dotprod} {}
{
    def acc 3;
}

matrixtransform_SIMD.tie (automatic scheduling)

• Task 5:

Use explicit scheduling and share the multiplication units with a shared function (as with the MAC unit earlier in this exercise)!

Make sure that the multiplication unit is only used once in each cycle!

…
schedule dotprod_sched {dotprod} {}
{
    def prod0 1;
    ….
    def acc ?;
}

matrixtransform_SIMD.tie (explicit scheduling)

• Can you save a lot of resources when we share the multiplication?

• What is the price we are paying?


Exercise 4 – ECDSA 1/2

• ECDSA signature verification

– Elliptic Curve Digital Signature Algorithm (ECDSA)

– We have seen that multiplications and squaring take a very long time in this field.

We want to speed up those operations!


• Task 1:

Attach the tie file “BinaryFieldMultiplier.tie” to the Corvus core

Change the defines in multi_precision.h to include the multiplier

What speed up do you measure with the profiler?


Exercise 4 – ECDSA 2/2

• Binary Squaring:

– Look at the square function “mp_bin_square_only” in multi_precision.c

• What is this operation doing?

operation BinSqLower16 {out AR result, in AR input} {}

{

…

}

operation BinSqUpper16 {out AR result, in AR input} {}

{

…

}

BinarySquare.tie

• Task 2:

Add two tie operations “BinSqLower16” and “BinSqUpper16” to speed up the

squaring

Attach TIE files to the Corvus core

Since you have multiple TIE files, make sure the output name matches the include name (“ecdsa_tie”) in multi_precision.c

• Again, what is the speed up?
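As background for the squaring task: squaring a polynomial over GF(2) moves bit i of the input to bit 2*i of the result and leaves all odd bit positions at zero. The sketch below models this for 16-bit halves of a 32-bit word. It is an assumption that mp_bin_square_only works this way and that the two operations split the input as their names suggest, so verify both against multi_precision.c. All function names here are hypothetical.

```c
#include <stdint.h>

/* Squares a binary polynomial of degree < 16 over GF(2):
 * bit i of the input moves to bit 2*i, all odd bits become zero. */
static uint32_t bin_square16_ref(uint16_t x)
{
    uint32_t r = 0;
    int i;

    for (i = 0; i < 16; i++)
        if ((x >> i) & 1u)
            r |= (uint32_t)1 << (2 * i);
    return r;
}

/* Assumed split: one operation squares the lower input half ... */
static uint32_t bin_sq_lower16_ref(uint32_t in)
{
    return bin_square16_ref((uint16_t)(in & 0xFFFFu));
}

/* ... and the other one squares the upper input half. */
static uint32_t bin_sq_upper16_ref(uint32_t in)
{
    return bin_square16_ref((uint16_t)(in >> 16));
}
```

In software this takes a loop or a lookup table per word, while in hardware the bit spreading is only wiring, which is why a custom instruction is attractive here.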


Exercise 5 – Bit Stream Processor: CRC 1/2

• In this exercise we will experience the advantage of the Bit Stream Processor using CRC computation as an example.

– 3 Coprocessors

– FLIX(Flexible Instruction eXtension) instruction format (3 slots)

– (2*32) bit memory/ 64 bit instruction interface

– Read the introduction of BSP User’s Guide (Chapter 1)



Exercise 5 – Bit Stream Processor: CRC 2/2

• Task 1: Run the CRC example code with the Corvus core (without the BSP

extension)

• Task 2: Run the CRC example with the Cygnus core (which has a BSP

coprocessor)

Check in the code what kind of BSP-instructions have been used

More details about the instruction format can be found under:

file:///scratch/soc_XX/xtensa/install/builds/RF-2015.2-linux/Cygnus/html/ISA/ISAhtml/index.html

• Task 3: Compare the runtime of the two versions.

Also have a look at the power/area/timing estimates of the two configuration options!
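To see why a CRC benefits from bit-level hardware support, consider a straightforward bit-serial software implementation. The sketch below uses the common CRC-32 polynomial (IEEE 802.3, reflected form 0xEDB88320) purely as an assumption; the CRC example in the workspace may use a different polynomial, width or bit order, and crc32_bitwise() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-serial CRC-32 (reflected polynomial 0xEDB88320, IEEE 802.3).
 * The polynomial is an assumption; the exercise code may use another. */
static uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;       /* initial value */
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++) {
            /* one shift and one conditional XOR per input bit */
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;         /* final inversion */
}
```

The inner loop runs eight times per input byte on a general-purpose core. That per-bit work is the part a bit stream coprocessor can take over.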



Exercise 6 – Soft Stream Processor: LTE Viterbi Dec.


Soft Stream Processor

• 64 bit instruction interface

• (2*128) bit memory interface

16 way SIMD support!

16*160 bit vector register file

• Optional Viterbi Decoder

• Task:

Read the introduction of SSP User’s Guide (Chapter 1)

Run the Viterbi decoder main function with the Pavo core

Compare the reference design with the SSP assisted

implementation.


Summary

• Xtensa CoreGen is an easy-to-use tool that makes it possible to build SoCs in little time!

– Useful when time matters more than performance

• Sample solutions under:

– /home/soc_master/5_asip/asip_ex_2016_solution.xws

• RTL access currently not available

– (In discussion with partners)

• Possible Mini-project:

– Further optimize the ECDSA algorithm by replacing other functions

• Possible Semester project:

– Design your own processor core with Xtensa Software and do a first tape out
