Integrated Systems Laboratory
Exercise: ASIP Programming
Introduction to Xtensa Xplorer
Michael Gautschi
17.05.2016
Getting Started
• Copy data from master account:
$ mkdir 5_asip_ex
$ cp /home/soc_master/5_asip/asip_ex_2016_notools.xws 5_asip_ex/.
• Start Xtensa Xplorer (use this version!)
$ cd 5_asip_ex
$ xtensa-2015.2 xplorer &
• Select the created directory ‘5_asip_ex’ as workspace
• Now we have to adjust the installation directories and locate the Xtensa tools (which are installed on the soc_master account)
• Go to the following dialog to adjust the installation directories: Window -> Preferences -> Directories & Tools -> Directories
– Change directories according to the figure:
/home/soc_master/xtensa/downloads
/home/soc_XX/5_asip_ex/XtensaRegistry
/scratch/soc_XX/xtensa/install/builds
/home/soc_master/xtensa/XtDevTools/install/tools
Manage Xtensa Tools => next slide
• Relocate the Xtensa Tools package (the slide figure marks the field with “Enter this path!”)
• Install the workspace: File -> Import -> Xtensa Xplorer -> Import Xtensa Xplorer Workspace
– Select the asip_ex_2016_notools.xws we copied to the 5_asip_ex folder
– Import everything (select all), click Finish, and wait until everything is imported.
• Return to the workbench; we are ready to start!
Exercises
• 1. Simple TIE programming
– Learn how to extend a processor with custom instructions
• 2. GCD (Greatest Common Divisor)
– Accelerate the GCD computation using your own instruction!
• 3. Matrix Transform
– Explore different ways to program instruction set extensions
• 4. ECDSA Example
– Speed up the squaring in the finite field
• 5. CRC
– Exploit the benefits of a bit stream coprocessor
• 6. Viterbi Decoder
– Using a Soft Stream Processor with a Viterbi Decoder
Getting to Know Xtensa Xplorer 1/4
[Figure: annotated Xtensa Xplorer main window. Labels: Debug, Benchmark, and C perspectives; Project Explorer; console output; selectors for the active project and the active processor configuration; available processor configurations; available TIE sources; edit and view the “Lacerta” processor configuration; get power/area/timing estimates of the “Lacerta” processor configuration (assuming a 28 nm technology).]
Getting to Know Xtensa Xplorer 2/4
Adding a Tensilica Instruction Extension (TIE) source to a processor:
1. Select Core Configuration
2. Add TIE Source to Processor Configuration
3. Select TIE source
4. Compile TIE Source
5. Observe Console Output
6. Compiled TIE source
Getting to Know Xtensa Xplorer 3/4
How to debug an application:
1. Switch to the Debug Perspective
2. Add breakpoints
3. Start the debugger
4. Step through the code
The Debug Perspective also lets you observe variables and the disassembled code.
Getting to Know Xtensa Xplorer 4/4
How to profile an application:
1. Switch to the Benchmark Perspective
2. Start the profiler (the view shows the profiling results, including the number of cycles used by the main function)
3. Select a line to see the number of cycles spent to complete this statement
4. Select an instruction of the disassembled code
5. See the status of the pipeline when executing the selected instruction
Xtensa Xplorer Documentation
• Tensilica Xtensa Xplorer is very well documented!
– Documentation can be accessed in the following two ways:
• Help -> Welcome -> Help/Docs
• Using a browser:
file:///usr/pack/xtensa-2015.2-bt/XtDevTools/downloads/RF-2015.2/docs/index.html
– Throughout this exercise session we will use:
• Tensilica Instruction Extension (TIE) Language User's Guide (Exercises 1-4)
• ConnX BSP3 Bit Stream Processor 3 User's Guide (Exercise 5)
• ConnX SSP16 Soft Stream Processor User's Guide (Exercise 6)
– Other useful sources:
• Xtensa Instruction Set Architecture (ISA) Reference Manual
• Xtensa Hardware User's Guide
Available Processor Configurations
• Creating your own processor core
– Generate a basic configuration file
– Customize it
– Upload the configuration
– Processors will be built and verified on the Tensilica servers (takes 30-60 minutes)
– Download & install the provided package
• Pre-built processor configurations:
Name     GPR regs  Mult    Pipeline depth  I$             I$/Mem. interface width  Coproc. type  Exercise
Lacerta  16        None    5               None           32/64                    None          1/3
Corvus   32        32 bit  5               2 KB (direct)  32/32                    None          0/1/2/4
Cygnus   32        32 bit  5               None           64/32 (2*LSU)            BSP           5
Pavo     64        16 bit  7               32 KB (2way)   64/128 (2*LSU)           SSP           6
Exercise 1 – Simple TIE Programming 1/3
a) HelloWorld
• Set HelloWorld as the active project and select the “Corvus” processor core
• Run, debug, and profile the application
• The console should output “helloworld”
b) TIE-Example 1
• Review the TIE source tie_example1.tie
• Attach the provided TIE source to the Corvus core
• Run the application
tie_example1:
1. Sum three vectors
– Use the TIE function ADD3 to sum three elements in one instruction!
– ADD3 is declared in the header tie_example1.h
– The definition is in tie_example1.tie
2. Divide each element by two
#include <stdio.h>
#include <xtensa/tie/tie_example1.h>

int main(void)
{
    unsigned int a[10] = {0,0,0,0,0,0,0,0,0,0};
    unsigned int b[10] = {0,1,2,3,4,5,6,7,8,9};
    unsigned int c[10] = {9,8,7,6,5,4,3,2,1,0};
    int i;

    for (i = 0; i < 10; i++) {
        ADD3(a[i], b[i], c[i]);   /* a[i] = a[i] + b[i] + c[i] in one instruction */
        a[i] = a[i] >> 1;         /* divide by two */
    }
    return 0;
}
tie_example1.c
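To check the result of tie_example1 without the Xtensa toolchain, the loop can be mirrored in plain C. The following is only a sketch: add3_model() and sum_and_halve() are hypothetical names, and the model assumes that ADD3 computes res = res + in0 + in1 on 32-bit registers, as the TIE definition suggests.

```c
#include <stdint.h>

/* Hypothetical software stand-in for the ADD3 instruction (assumption):
 * res is both an input and the output (inout). */
void add3_model(uint32_t *res, uint32_t in0, uint32_t in1)
{
    *res = *res + in0 + in1;
}

/* Same loop as in tie_example1.c, using the model; n elements per array. */
void sum_and_halve(uint32_t *a, const uint32_t *b, const uint32_t *c, int n)
{
    for (int i = 0; i < n; i++) {
        add3_model(&a[i], b[i], c[i]);  /* a[i] = a[i] + b[i] + c[i] */
        a[i] >>= 1;                     /* divide by two */
    }
}
```

With the arrays from tie_example1.c, every b[i] + c[i] equals 9, so each a[i] should end up as 4.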
Exercise 1 – Simple TIE Programming 2/3
c) TIE-Example 2
• Attach the TIE source, compile it, and run the application with the custom instruction
operation ADD3_DIV2 {inout AR res, in AR in0, in AR in1} {} {
    wire [31:0] tmp = res + in0 + in1;
    assign res = {1'b0, tmp[31:1]};
}
tie_example2.tie
operation ADD3 {inout AR res, in AR in0, in AR in1} {} {
    assign res = res + in0 + in1;
}
tie_example1.tie
Custom instruction ADD3:
• New instruction name: ADD3
• Argument list: inputs res, in0, in1; output res
• Argument direction: in, out, or inout
• Argument type: AR (normal processor register)
• The operation body uses a Verilog-like syntax
Custom instruction ADD3_DIV2:
• Sums up the three inputs res, in0, in1
• Divides tmp by 2
• Writes the result back to register res
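A plain C model can make the bit-level behavior of ADD3_DIV2 concrete. This is a sketch under assumptions: add3_div2_model() is a hypothetical name, and it assumes the 32-bit wire tmp truncates the sum modulo 2^32, so that {1'b0, tmp[31:1]} acts as a logical right shift by one.

```c
#include <stdint.h>

/* Hypothetical C model of ADD3_DIV2 (assumption: the sum wraps to 32 bits
 * before the shift, because tmp is declared as a 32-bit wire). */
uint32_t add3_div2_model(uint32_t res, uint32_t in0, uint32_t in1)
{
    uint32_t tmp = res + in0 + in1;  /* wire [31:0] tmp */
    return tmp >> 1;                 /* {1'b0, tmp[31:1]} */
}
```

Under this assumption the carry out of bit 31 is lost, so the result differs from a true "sum divided by two" whenever the three inputs add up to 2^32 or more.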
Exercise 1 – Simple TIE Programming 3/3
d) ByteSwap
• Swaps bytes randomly using a conventional C function and a custom instruction
• Attach the TIE source and profile the application!
state COUNT 32 add_read_write
operation BYTESWAP{out AR outR, in AR inpR} {inout COUNT}
{
assign outR = {inpR[7:0], inpR[15:8], inpR[23:16], inpR[31:24]};
assign COUNT = COUNT + 1;
}
byteswap.tie
New TIE element: state
• A state is not stored in the processor's register file but in the SFU itself.
• The keyword add_read_write automatically generates functions to read from and write to the state.
• The state COUNT can be accessed in C with RCOUNT() and WCOUNT(value).
• The state COUNT is incremented each time BYTESWAP() is called.
C Code:
• Compare golden model with TIE function
• Read final value of state COUNT with
RCOUNT() and print it.
…
for (i = 0; i < NUM; i++) {
    s = data[i % N];
    if (GOLDEN_BYTESWAP(s) != BYTESWAP(s))
        fail++;
}
printf("State COUNT=%d\n", RCOUNT());
…
byteswap.c
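The comparison in byteswap.c can be reproduced on a host machine with a small software model. The sketch below is not the exercise's golden model: all function names and the count_state variable are hypothetical, and the model only assumes what byteswap.tie shows, namely a byte-order reversal plus an increment of COUNT per call.

```c
#include <stdint.h>

/* Hypothetical stand-in for the TIE state COUNT. */
static uint32_t count_state = 0;

/* Reverse the four bytes of a 32-bit word in plain C. */
uint32_t golden_byteswap(uint32_t x)
{
    return (x << 24) | ((x & 0x0000FF00u) << 8) |
           ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

/* Model of BYTESWAP: same result, plus the side effect on COUNT. */
uint32_t byteswap_model(uint32_t x)
{
    count_state += 1;
    return golden_byteswap(x);
}

/* Models of the generated accessors RCOUNT() and WCOUNT(value). */
uint32_t rcount_model(void) { return count_state; }
void wcount_model(uint32_t value) { count_state = value; }
```

After resetting the counter with wcount_model(0), the value read back by rcount_model() should equal the number of byteswap_model() calls, which mirrors what the printf in byteswap.c reports.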
Exercise 2 – Greatest Common Divisor
• Find the greatest common divisor of two integers a and b.
• The function golden_gcd() returns the gcd.
• Accelerate this function by introducing a new instruction GCD().
• Task 1: Study the golden_gcd() function.
Define a new TIE function GCD() that accelerates the C code inside the while loop.
Add the TIE source to the “Corvus” processor core and compile it.
Call the defined TIE function inside xtensa_gcd().
Profile the application and find the speedup!
…
int golden_gcd(int a, int b)
{
    while (a != b)
    {
        if (a > b)
            a = a - b;
        else
            b = b - a;
    }
    return a;
}
…
int xtensa_gcd(int a, int b)
{
    while (a != b)
    {
        GCD(a, b);
    }
    return a;
}
…
gcd.c
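One way to think about the GCD() instruction is as a single iteration of the loop body in golden_gcd(). The following C sketch models that step in software; gcd_step_model() and gcd_model() are hypothetical names, and treating both operands as inout is an assumption about how the instruction could be defined.

```c
/* One iteration of the loop body in golden_gcd(): subtract the smaller
 * operand from the larger one. A GCD() instruction with two inout
 * operands could perform this step in hardware (assumption). */
void gcd_step_model(int *a, int *b)
{
    if (*a > *b)
        *a = *a - *b;
    else
        *b = *b - *a;
}

/* Same structure as xtensa_gcd(), with the model in place of GCD(a, b).
 * Like golden_gcd(), it only terminates for positive inputs. */
int gcd_model(int a, int b)
{
    while (a != b)
        gcd_step_model(&a, &b);
    return a;
}
```

For example, gcd_model(48, 18) walks through (30, 18), (12, 18), (12, 6), (6, 6) and returns 6.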
TIE Modules
Exercise 3 – Matrix Transform 1/7
• We want to compute the following matrix
transformation:
• golden_matrixtransform() computes A in pure C.
• Task 1: In matrixtransform.tie, define a MAC unit for each element of A (use the TIEmac() module to build a MAC).
Since the result is only required at the end, use a state for the accumulation register.
Add matrixtransform.tie to the Lacerta core and compile it.
Initialize the accumulators at the beginning.
Read the accumulators at the end and assign them to A.
The read/write functions for the states are declared in matrixtransform.h.
Run the application.
…
int golden_matrixtransform(int A[], short B[], short M[])
{
    int i;
    int a0 = 0, a1 = 0, a2 = 0, a3 = 0;   /* accumulators must start at zero */

    for (i = 0; i < 4; i++)
    {
        a0 += B[i] * M[i];
        a1 += B[i] * M[i+4];
        a2 += B[i] * M[i+8];
        a3 += B[i] * M[i+12];
    }
    A[0] = a0;
    A[1] = a1;
    A[2] = a2;
    A[3] = a3;
    return 0;                             /* function is declared int */
}
…
matrixtransform.c
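Matrix transformation A = B · M:
[a0 a1 a2 a3] = [b0 b1 b2 b3] · | M0 M4 M8  M12 |
                                | M1 M5 M9  M13 |
                                | M2 M6 M10 M14 |
                                | M3 M7 M11 M15 |
i.e. a_j = b0*M[4j] + b1*M[4j+1] + b2*M[4j+2] + b3*M[4j+3] for j = 0..3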
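The transformation can be written as a compact reference routine for checking results on a host machine. This is a sketch, not the exercise's golden model: matrixtransform_ref() is a hypothetical name, and it assumes the indexing used in matrixtransform.c, where element j of A uses M[4*j] to M[4*j+3].

```c
/* Reference computation of A = B * M, with the four coefficients for
 * A[j] stored in M[4*j] .. M[4*j + 3] (as in matrixtransform.c). */
void matrixtransform_ref(int A[4], const short B[4], const short M[16])
{
    for (int j = 0; j < 4; j++) {
        int acc = 0;                       /* accumulator starts at zero */
        for (int i = 0; i < 4; i++)
            acc += B[i] * M[i + 4 * j];    /* one multiply-accumulate */
        A[j] = acc;
    }
}
```

With B = {1, 2, 3, 4} and M[k] = k, the routine should produce A = {20, 60, 100, 140}; each element takes four multiply-accumulate steps, 16 in total.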
Exercise 3 – Matrix Transform 2/7
Check the additional area of the TIE instruction (get the area estimates of the compiled TIE source):
• 4 MAC units
• 4 accumulator states
The four units are never used concurrently => this solution is not optimal!
Exercise 3 – Matrix Transform 3/7
Can we share the MAC units?
Yes, with TIE functions!
TIE function mac16:
• 32 bit return value
• 32 bit accumulator
• 16 bit inputs
• shared keyword
• Task 2: Use a TIE function, and call it in each MAC unit.
Add the matrixtransform_basicfunc.tie to the Lacerta core and compile it.
Initialize the accumulators in the beginning.
Read the accumulators in the end and assign them to A.
Set #define BASICFUNC to run this test.
Run the Application.
• Check the size of the new TIE source!
…
function [31:0] mac16 ([31:0] accumulator, [15:0] multiplier, [15:0] multiplicand) shared
{
    assign mac16 = TIEmac(multiplier, multiplicand, accumulator, 1'b1, 1'b0);
}
matrixtransform_basic.tie
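For testing on a host machine, the arithmetic of mac16 can be modeled in C. This sketch rests on an assumption about the TIEmac arguments: that the fourth argument (1'b1) selects signed operands and the fifth (1'b0) selects addition rather than subtraction, so the function returns accumulator + multiplier * multiplicand. Please check the TIE Language User's Guide before relying on it; mac16_model() is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical C model of the shared TIE function mac16.
 * Assumption: TIEmac(a, b, c, 1'b1, 1'b0) means c + a*b with signed a, b. */
int32_t mac16_model(int32_t accumulator, int16_t multiplier, int16_t multiplicand)
{
    int32_t product = (int32_t)multiplier * (int32_t)multiplicand;
    /* Add in unsigned arithmetic so overflow wraps like a 32-bit register. */
    return (int32_t)((uint32_t)accumulator + (uint32_t)product);
}
```

Because the function is declared shared, all four MAC operations would map onto one hardware instance; the C model only captures the arithmetic, not the sharing.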
Exercise 3 – Matrix Transform 4/7
• The compiler generates load, store, and move instructions to access the register file.
• Variables of type accum can be declared like ints, shorts, etc.: accum acc0;
• The registers can be accessed with a pointer: int *p_acc0 = (int*)&acc0;
• Task 3: Define a register file in matrixtransform_rf.tie instead of a state
Add the matrixtransform_rf.tie to the Lacerta core and compile it.
Declare each register of the register file.
Initialize the register file at the beginning.
Read the registers at the end and assign them to A.
Set #define RF to run this test.
Run the Application.
• Compare the size and number of instructions
regfile accum 32 4 ac
operation mac.accum {in AR oper0, in AR oper1, inout accum accumulator} {}
{
    assign accumulator = TIEmac(oper0[15:0], oper1[15:0], accumulator, 1'b1, 1'b0);
}
matrixtransform_rf.tie
A separate register file can be used for the accumulator register!
TIE syntax: regfile <name> <width> <depth> <short_name>
Exercise 3 – Matrix Transform 5/7
Single Instruction, Multiple Data (SIMD) version of the matrix transformation
[a0 a1 a2 a3] = [b0 b1 b2 b3] · | M0 M4 M8  M12 |
                                | M1 M5 M9  M13 |
                                | M2 M6 M10 M14 |
                                | M3 M7 M11 M15 |
…
for (i=0; i<4; i++)
{
a0 += B[i] * M[i];
a1 += B[i] * M[i+4];
a2 += B[i] * M[i+8];
a3 += B[i] * M[i+12];
}
…
matrixtransform.c
…
A[0] = B[0]*M[0] + B[1]*M[1] + B[2]*M[2] + B[3]*M[3];
A[1] = B[0]*M[4] + B[1]*M[5] + B[2]*M[6] + B[3]*M[7];
A[2] = B[0]*M[8] + B[1]*M[9] + B[2]*M[10] + B[3]*M[11];
A[3] = B[0]*M[12] + B[1]*M[13] + B[2]*M[14] + B[3]*M[15];
…
matrixtransform.c
SISD: 16 instructions; SIMD: 4 instructions
Define a new instruction which computes the dot product!
=> The instruction requires 8 shorts (8 * 16 bit) as input and produces a 32 bit output.
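The intended behavior of such a dot-product instruction can be sketched in C before writing any TIE code. The names dotprod_model() and matrixtransform_simd_model() are hypothetical, and the model assumes signed 16-bit inputs and a 32-bit result without saturation.

```c
#include <stdint.h>

/* Hypothetical C model of a 4-way dot product: eight 16-bit inputs
 * (two vectors of four shorts) produce one 32-bit result. */
int32_t dotprod_model(const int16_t vect[4], const int16_t mat[4])
{
    int32_t acc = 0;
    for (int i = 0; i < 4; i++)
        acc += (int32_t)vect[i] * (int32_t)mat[i];  /* four products, one sum */
    return acc;
}

/* The matrix transformation then needs four dot products,
 * one per element of A. */
void matrixtransform_simd_model(int32_t A[4], const int16_t B[4],
                                const int16_t M[16])
{
    for (int j = 0; j < 4; j++)
        A[j] = dotprod_model(B, &M[4 * j]);
}
```

In hardware, the four multiplications inside dotprod_model() would run in parallel, which is where the reduction from 16 to 4 instructions comes from.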
Exercise 3 – Matrix Transform 6/7
• Task 4a:
Complete the TIE source
Change the C code to use the new dotprod() instruction
Compare the number of cycles with the previous implementations!
regfile vec16x4 WIDTH DEPTH vec
operation dotprod {out AR acc, in vec16x4 vect, in vec16x4 mat} {}
{
    // 4-way SIMD multiplication
    wire [31:0] prod0 = TIEmul(vect[..], mat[..], 1'b1);
    wire ...
    // fused accumulation
    assign acc = TIEaddn(….);
}
matrixtransform_SIMD.tie
• The SIMD-like instruction speeds up the matrix transformation by a factor of ~4.
• The above TIE instruction computes the sum of 4 multiplications in one cycle!
=> We have to expect a negative impact on our timing constraints!
• Solution: a multi-cycle instruction!
=> Define a schedule for this instruction:
schedule <schedule_name> {operation-list} {stage_assignments}
Explicit scheduling: assign each intermediate result to a specific cycle.
Automatic scheduling: define the number of cycles the instruction may use.
Exercise 3 – Matrix Transform 7/7
• Task 4b:
Let the dotprod take 3 cycles.
During synthesis, Synopsys Design Compiler will automatically retime the dotprod function.
…
schedule dotprod_sched {dotprod} {}
{
def acc 3;
}
matrixtransform_SIMD.tie (automatic scheduling)
• Task 5:
Use explicit scheduling and share the multiplication units with a shared function (as with the MAC unit earlier in this exercise)!
Make sure that the multiplication unit is only used once in each cycle!
…
schedule dotprod_sched {dotprod} {}
{
def prod0 1;
….
def acc ?;
}
matrixtransform_SIMD.tie (explicit scheduling)
• Do we save a lot of resources by sharing the multiplier?
• What is the price we are paying?
Exercise 4 – ECDSA 1/2
• ECDSA signature verification
– Elliptic Curve Digital Signature Algorithm (ECDSA)
– We have seen that multiplications and squaring take a very long time in this field.
We want to speed up those operations!
• Task 1:
Attach the tie file “BinaryFieldMultiplier.tie” to the Corvus core
Change the defines in multi_precision.h to include the multiplier
What speed up do you measure with the profiler?
Exercise 4 – ECDSA 2/2
• Binary squaring:
– Look at the square function “mp_bin_square_only” in multi_precision.c
• What is this operation doing?
operation BinSqLower16 {out AR result, in AR input} {}
{
…
}
operation BinSqUpper16 {out AR result, in AR input} {}
{
…
}
BinarySquare.tie
• Task 2:
Add two TIE operations “BinSqLower16” and “BinSqUpper16” to speed up the squaring
Attach the TIE files to the Corvus core
Since you have multiple TIE files, make sure the output name matches the include name (“ecdsa_tie”) in multi_precision.c
• Again, what is the speed up?
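As background for the squaring task: in a binary field, squaring a polynomial moves coefficient i to position 2i, so the bits of the input are interleaved with zeros (before any modular reduction). The C sketch below illustrates only this bit spreading. The function names are hypothetical, and the split into a lower and an upper 16-bit half is an assumption suggested by the operation names, not taken from BinarySquare.tie or multi_precision.c.

```c
#include <stdint.h>

/* Spread the 16 bits of x to the even bit positions of a 32-bit word:
 * bit i of x moves to bit 2*i, all odd bits are zero. */
uint32_t spread16(uint16_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; i++)
        r |= (uint32_t)((x >> i) & 1u) << (2 * i);
    return r;
}

/* Hypothetical split of one 32-bit input word into two 32-bit result
 * words: one for the lower and one for the upper 16 input bits. */
uint32_t bin_sq_lower16_model(uint32_t in)
{
    return spread16((uint16_t)(in & 0xFFFFu));
}

uint32_t bin_sq_upper16_model(uint32_t in)
{
    return spread16((uint16_t)(in >> 16));
}
```

The loop costs 16 iterations per half word in software, whereas in hardware the same mapping is pure wiring, which is why a custom instruction is attractive here.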
Exercise 5 – Bit Stream Processor: CRC 1/2
• In this exercise we will experience the advantage of the Bit Stream Processor (BSP), using CRC computation as an example.
– 3 coprocessors
– FLIX (Flexible Length Instruction eXtensions) instruction format (3 slots)
– (2*32) bit memory / 64 bit instruction interface
– Read the introduction of the BSP User's Guide (Chapter 1)
Exercise 5 – Bit Stream Processor: CRC 2/2
• Task 1: Run the CRC example code with the Corvus core (without the BSP extension)
• Task 2: Run the CRC example with the Cygnus core (which has a BSP coprocessor)
Check in the code which BSP instructions are used
More details about the instruction format can be found at:
file:///scratch/soc_XX/xtensa/install/builds/RF-2015.2-linux/Cygnus/html/ISA/ISAhtml/index.html
• Task 3: Compare the runtime of the two versions.
Also have a look at the power/area/timing estimates of the two configuration options!
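To see why CRC benefits from bit-level hardware support, it helps to look at a bit-serial software implementation. The sketch below uses the common CRC-32 with reflected polynomial 0xEDB88320; this choice is an assumption for illustration, and the exercise code may use a different polynomial or width. crc32_bitwise() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-serial CRC-32 (reflected polynomial 0xEDB88320, initial value and
 * final XOR of 0xFFFFFFFF). Assumption: illustrative parameters only. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {   /* one shift/XOR step per bit */
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return ~crc;
}
```

A general-purpose core spends several instructions on every input bit in the inner loop, which is the kind of work a bit stream coprocessor is meant to take over.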
Exercise 6 – Soft Stream Processor: LTE Viterbi Dec.
Soft Stream Processor
• 64 bit instruction interface
• (2*128) bit memory interface
• 16 way SIMD support!
• 16*160 bit vector register file
• Optional Viterbi decoder
• Task:
Read the introduction of the SSP User's Guide (Chapter 1)
Run the Viterbi decoder main function with the Pavo core
Compare the reference design with the SSP-assisted implementation.
Summary
• Xtensa CoreGen is an easy-to-use tool that allows building SoCs in little time!
– Useful when time matters more than performance
• Sample solutions under:
– /home/soc_master/5_asip/asip_ex_2016_solution.xws
• RTL access currently not available
– (In discussion with partners)
• Possible Mini-project:
– Further optimize the ECDSA algorithm by replacing other functions
• Possible Semester project:
– Design your own processor core with Xtensa Software and do a first tape out