Micro Genetic Algorithm (mGA) Group Optimization Methods for … · Zero riscy 32b Ariane 64b DMA...

1 Department of Electrical, Electronic and Information Engineering

2 Integrated Systems Laboratory

PULP Overview

2nd International Workshop on RISC-V Research 28.02.2019

Frank K. Gürkaynak And the PULP development group

http://pulp-platform.org

▪ Project started in 2013 by Luca Benini

▪ A collaboration between University of Bologna and ETH Zürich

▪ Large team. In total we are about 60 people, not all are working on PULP

▪ Key goal is: how to get the most BANG for the ENERGY consumed in a computing system

▪ We were able to start with a clean slate, no need to remain compatible with legacy systems.

Parallel Ultra Low Power (PULP)

▪ Our research was not developing processors…

▪ … but we needed good processors for systems we build for research

▪ Initially (2013) our options were

▪ Build our own (support for SW and tools)

▪ Use a commercial processor (licensing, collaboration issues)

▪ Use what is openly available (OpenRISC, ...)

▪ We started with OpenRISC

▪ First chips until mid-2016 were all using OpenRISC cores

▪ We spent time improving the microarchitecture

▪ Moved to RISC-V later

▪ Larger community, more momentum

▪ Transition was relatively simple (new decoder)

How we started with open source processors

RISC-V Cores

We have developed several optimized RISC-V cores: RI5CY (32b), Micro-riscy (32b), Zero-riscy (32b) and Ariane (64b)

Only processing cores are not enough, we need more

▪ RISC-V Cores: RI5CY (32b), Micro-riscy (32b), Zero-riscy (32b), Ariane (64b)

▪ Peripherals: DMA, GPIO, I2S, UART, SPI, JTAG

▪ Interconnect: AXI4 interconnect, APB peripheral bus, logarithmic interconnect

▪ Accelerators: Neurostream (ML), HWCrypt (crypto), PULPO (1st order opt), HWCE (convolution)

All these components are combined into platforms

Platforms, from IoT to HPC:

Single Core

• PULPino

• PULPissimo

Multi-core

• Fulmine

• Mr. Wolf

Multi-cluster

• Hero

▪ Started by UC-Berkeley in 2010

▪ RISC-V is an open standard

governed by RISC-V foundation

▪ ETHZ is a founding member of the

foundation

▪ Necessary for the continuity

▪ Extensions are still being developed

▪ Defines 32, 64 and 128 bit ISA

▪ No implementation, just the ISA

▪ Different RISC-V implementations (both open and closed source) are available

▪ The PULP project specializes in

efficient implementations of RISC-V

cores and peripherals

Spec separated into “extensions”

RISC-V Instruction Set Architecture

I Integer instructions

E Reduced number of registers

M Multiplication and Division

A Atomic instructions

F Single-Precision Floating-Point

D Double-Precision Floating-Point

C Compressed Instructions

X Non Standard Extensions

▪ No or partial work done yet on those

extensions

▪ Possible to contribute as a foundation

member in task-groups

▪ Dedicated task-groups

▪ Formal specification

▪ Memory Model

▪ Marketing

▪ External Debug Specification

▪ For Bit-manipulation we provide our

own solution → part of the task group

Extensions still being worked on by RISC-V foundation

Q Quad-Precision Floating-Point

L Decimal Floating-Point (IEEE 754-2008)

B Bit-Manipulation

T Transactional Memory

P Packed-SIMD

J Dynamically Translated Languages

V Vector Operations

N User-Level Interrupts


Our RISC-V family explained

32 bit

▪ Low Cost Core (cf. ARM Cortex-M0+)

  ▪ Zero-riscy: RV32-ICM

  ▪ Micro-riscy: RV32-EC

▪ Core with DSP enhancements (cf. ARM Cortex-M4)

  ▪ RI5CY: RV32-ICMX (SIMD, HW loops, bit manipulation, fixed point)

▪ Floating-point capable Core (cf. ARM Cortex-M4F)

  ▪ RI5CY+FPU: RV32-ICMFX

64 bit

▪ Linux capable Core (cf. ARM Cortex-A55)

  ▪ Ariane: RV64-IMAFDCX, full privilege specification

▪ 4-stage pipeline, optimized for energy efficiency

▪ 40 kGE, 30 logic levels, 3.19 CoreMark/MHz

▪ Includes various extensions (Xpulp) to RISC-V for DSP applications

RI5CY – Our workhorse 32-bit core

▪ The ‘X’ extension can be used by everyone freely

▪ Offers great flexibility

▪ Of course these custom extensions are not automatically supported by tools

▪ You have to add patches / new tools so that these can be utilized

▪ Even if your tools do not support extensions, the cores will work

▪ The tools will just not generate code that takes advantage of the extensions

▪ But the cores with extensions will remain compatible to standard RISC-V

▪ The goal is to work so that ‘good’ extensions become standard

▪ Requires being active in the RISC-V foundation task groups

▪ ETH Zürich is actively involved in V, P and B at the moment

It is possible to enhance RISC-V with custom extensions

RISC-V: “To support development of proprietary custom extensions, portions of the encoding

space are guaranteed to never be used by standard extensions.”

For 8-bit values the following can be executed in a single cycle (pv.dotup.b)

Z = D1 × K1 + D2 × K2 + D3 × K3 + D4 × K4
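The sum of products above can be written down as a portable reference model in C. This is only a sketch for checking results on a host machine: the helper names `pack4` and `dotup_b` are made up for illustration, and the assumption is that the four 8-bit values of each operand are packed into one 32-bit register and treated as unsigned, as the `pv.dotup.b` mnemonic suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 8-bit values into one 32-bit word, lane 0 in the low byte.
 * Hypothetical helper, not part of any PULP toolchain. */
static uint32_t pack4(unsigned b0, unsigned b1, unsigned b2, unsigned b3)
{
    assert(b0 <= 255 && b1 <= 255 && b2 <= 255 && b3 <= 255);
    return (uint32_t)b0 | ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* Reference model: Z = D1*K1 + D2*K2 + D3*K3 + D4*K4 on unsigned bytes.
 * On RI5CY the slide states this takes a single cycle; here it is a loop. */
static uint32_t dotup_b(uint32_t d, uint32_t k)
{
    uint32_t z = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t di = (d >> (8 * lane)) & 0xFFu;
        uint32_t ki = (k >> (8 * lane)) & 0xFFu;
        z += di * ki;
    }
    return z;
}
```

The largest possible result is 4 × 255 × 255 = 260100, so the sum always fits in a 32-bit destination register.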

Our extensions to RI5CY (with additions to GCC)

▪ Post–incrementing load/store instructions

▪ Hardware Loops (lp.start, lp.end, lp.count)

▪ ALU instructions

▪ Bit manipulation (count, set, clear, leading bit detection)

▪ Fused operations: (add/sub-shift)

▪ Immediate branch instructions

▪ Multiply Accumulate (32x32 bit and 16x16 bit)

▪ SIMD instructions (2x16 bit or 4x8 bit) with scalar replication option

▪ add, min/max, dotproduct, shuffle, pack (copy), vector comparison

RI5CY – ISA Extensions improve performance

C source:

    for (i = 0; i < 100; i++)
        d[i] = a[i] + b[i];

Baseline: 11 cycles/output

    mv   x5, 0
    mv   x4, 100
    Lstart:
    lb   x2, 0(x10)
    lb   x3, 0(x11)
    addi x10, x10, 1
    addi x11, x11, 1
    add  x2, x3, x2
    sb   x2, 0(x12)
    addi x4, x4, -1
    addi x12, x12, 1
    bne  x4, x5, Lstart

Auto-incr load/store: 8 cycles/output

    mv   x5, 0
    mv   x4, 100
    Lstart:
    lb   x2, 0(x10!)
    lb   x3, 0(x11!)
    addi x4, x4, -1
    add  x2, x3, x2
    sb   x2, 0(x12!)
    bne  x4, x5, Lstart

HW Loop: 5 cycles/output

    lp.setupi 100, Lend
    lb   x2, 0(x10!)
    lb   x3, 0(x11!)
    add  x2, x3, x2
    Lend: sb x2, 0(x12!)

Packed-SIMD: 1.25 cycles/output

    lp.setupi 25, Lend
    lw   x2, 0(x10!)
    lw   x3, 0(x11!)
    pv.add.b x2, x3, x2
    Lend: sw x2, 0(x12!)
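The packed-SIMD variant processes four byte outputs per loop iteration, which is where the 1.25 cycles/output figure comes from (5 cycles per iteration, as in the HW loop version, divided by 4 outputs). A minimal host-side sketch of the same idea in portable C is shown below. The function names are invented for illustration, and the bit-masking trick merely stands in for what `pv.add.b` does in hardware: a lane-wise 8-bit add where carries do not spill into the neighbouring byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lane-wise add of four packed bytes (wraps modulo 256 in each lane). */
static uint32_t add4x8(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; the carry lands in bit 7 of the same lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold in the top bit of each lane with XOR so nothing carries out. */
    return low7 ^ ((a ^ b) & 0x80808080u);
}

/* d[i] = a[i] + b[i] for n bytes, four outputs per iteration.
 * n must be a multiple of 4, like the 100-element example (25 iterations). */
static void vec_add_bytes(uint8_t *d, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t wa, wb, wd;
        memcpy(&wa, a + i, 4);   /* one word load replaces four byte loads */
        memcpy(&wb, b + i, 4);
        wd = add4x8(wa, wb);
        memcpy(d + i, &wd, 4);   /* one word store replaces four byte stores */
    }
}
```

Because every lane is handled identically, the result does not depend on the byte order of the host.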

▪ RI5CY was built for energy efficiency for DSP applications

▪ Ideally all parts of the core are running all the time doing something useful

▪ This does not always mean it is low-power

▪ The core is rather large (> 40 kGE without FPU)

▪ People asked us about a simple and small core

▪ Not all processor cores are used for DSP applications

▪ The DSP extensions are mostly idle for control applications

▪ Zero-Riscy was designed as a simple and efficient core.

▪ Some people wanted the smallest possible RISC-V core

▪ It is possible to further reduce area by using 16 registers instead of 32 (E)

▪ Also the multiplier can be removed saving a bit more

▪ Micro-Riscy is a parametrized variation of Zero-Riscy with minimal area

Why we designed other 32-bit cores after RI5CY?

▪ Only 2-stage pipeline, simplified register file

▪ Zero-Riscy (RV32-ICM), 19kGE, 2.44 Coremark/MHz

▪ Micro-Riscy (RV32-EC), 12kGE, 0.91 Coremark/MHz

▪ Used as SoC level controller in newer PULP systems

Zero/Micro-riscy, small area core for control applications

Different 32-bit cores with different area requirements

RI5CY Zero-riscy Micro-riscy

▪ For the first 4 years of the PULP project we used only 32bit cores

▪ Luca once famously said “We will never build a 64bit core”.

▪ Most IoT applications work well with 32bit cores.

▪ A typical 64bit core is much more than 2x the size of a 32bit core.

▪ But times change:

▪ Using a 64bit Linux capable core allows you to share the same address space as

main stream processors.

▪ We are involved in several projects where we (are planning to) use this capability

▪ There is a lot of interest in the security community for working on a contemporary

open source 64bit core.

▪ Open research questions on how to build systems with multiple cores.

Finally the step into 64-bit cores

ARIANE: Our Linux Capable 64-bit core

▪ Tuned for high frequency, 6 stage pipeline, integrated cache

▪ In order issue, out-of-order write-back, in-order-commit

▪ Supports privilege spec 1.11, M, S and U modes

▪ Hardware Page Table Walker

▪ Implemented in GF22nm (Poseidon, Kosmodrom), and UMC65 (Scarabaeus)

▪ In 22nm: ~1 GHz worst case conditions (SSG, 125/-40C, 0.72V)

▪ 8-way 32kByte Data cache and

4-way 32kByte Instruction Cache

▪ Core area: 175 kGE

Main properties of Ariane

[Figure: pie chart of Ariane core area by unit (PC Gen, IF, ID, Issue, Ex, Reg File, CSR); slice values as extracted: 7%, 8%, 3%, 21%, 44%, 9%, 8%]

Ariane booting Linux on a Digilent Genesys 2 board

Packed-SIMD support for all formats

Unified FP/Integer register file

▪ Not standard

▪ up to 15 % better performance

▪ Re-use integer load/stores (post

incrementing ld/st)

▪ Less area overhead

▪ Useful if pressure on register file is not

very high (true for a lot of applications)


What About Floating Point Support?

▪ F (single precision) and

D (double precision) extension in RISC-V

▪ Uses separate floating point register file

▪ specialized float loads (also compressed)

▪ float moves from/to integer register file

▪ Fully IEEE compliant

▪ RI5CY support for F

▪ Ariane for F and D

▪ Alternative FP Format support (<32 bit)

[Figure: one 64-bit register holds 1x FP64, 2x FP32, 4x FP16 or 8x FP8]

▪ Main FP operation groups

▪ MUL/ADD: Add/Subtract, Multiply, FMA

▪ CMP/SGNJ: Comparisons, Min/Max etc.

▪ CAST: FP-FP casts, Int-FP / FP-Int casts

▪ Parametrizable

▪ Number & Encoding of Formats

▪ Packed-SIMD Vectors

▪ # Pipeline Stages (per Op and Format)

▪ Implementation (per Op and Format)

▪ PARALLEL for best Speed

▪ MERGED (or Iterative) for best Area

▪ Special Functions for Transprecision

▪ Cast-and-Pack 2 FP Values to Vector

▪ Casts amongst FP Vectors + Repacking

▪ Expanding FMA (e.g. FP32 += FP16*FP16)

Parametric Floating-Point Unit for Transprecision

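[Figure: FPU block diagram. Distribution and arbitration logic surround the operation groups (MUL/ADD, DIV/SQRT, CMP/SGNJ, CAST, …). Each group is built either PARALLEL (separate FP64, FP32, FP16 slices) or MERGED (one FP64/FP32/FP16/... slice); CAST converts between any 2 formats.]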

▪ Example Area and Timing*

▪ FP64 + 2xFP32 (SIMD)

▪ ~80 kGE, ~750 MHz, 1 Pipeline Stage

* GF22FDX LVT, 0.72V SSG, post-synthesis results

Physical Memory Protection (PMP)

▪ Protect the physical memory when the

core runs in U or S privilege level

▪ Up to 16 entries for address filtering

▪ Configuration held in 4 CSRs pmpcfg[0-3]

▪ Whether Store (W), Load (R) and Fetch

(X) is allowed

▪ Address matching modes:

▪ Naturally aligned power-of-2 regions

(NAPOT) or aligned 4 Byte (NA4)

▪ Boundaries >, < (TOR)

▪ Implemented in RI5CY
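To make the NAPOT address matching mode more concrete, the sketch below encodes and decodes a naturally aligned power-of-2 region the way the RISC-V privileged specification describes it: a `pmpaddr` register holds address bits 33:2, and the number of trailing one bits selects the region size (no trailing ones means 8 bytes, each further one doubles it). The type and function names are illustrative only and do not come from the RI5CY sources.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t base;  /* first byte address of the region */
    uint64_t size;  /* region size in bytes */
} pmp_region;

/* Encode a NAPOT region (power-of-2 size >= 8, naturally aligned) as pmpaddr. */
static uint32_t pmp_napot_encode(uint64_t base, uint64_t size)
{
    assert(size >= 8 && (size & (size - 1)) == 0);  /* power of two, >= 8 bytes */
    assert((base & (size - 1)) == 0);               /* naturally aligned */
    return (uint32_t)((base | (size / 2 - 1)) >> 2);
}

/* Decode a pmpaddr value that is configured for NAPOT matching. */
static pmp_region pmp_napot_decode(uint32_t pmpaddr)
{
    unsigned ones = 0;
    while (ones < 32 && ((pmpaddr >> ones) & 1u))
        ones++;                                     /* count trailing one bits */

    pmp_region r;
    r.size = 1ULL << (ones + 3);
    r.base = ((uint64_t)pmpaddr & ~((1ULL << (ones + 1)) - 1)) << 2;
    return r;
}

/* An access matches the entry if it falls inside [base, base + size). */
static int pmp_napot_match(uint32_t pmpaddr, uint64_t addr)
{
    pmp_region r = pmp_napot_decode(pmpaddr);
    return addr >= r.base && addr - r.base < r.size;
}
```

For example, a 4 KiB region at 0x80000000 encodes to 0x200001FF: nine trailing ones, so the size is 2^(9+3) = 4096 bytes. Whether a matching access is then allowed depends on the R/W/X bits held in the corresponding pmpcfg field.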

Supervisor Memory Translation and

Protection (for Linux-like systems)

▪ Effectively needs TLBs

▪ Register to configure base page number (satp)

▪ Translation Mode

(32, 39, 48 virtual addressing)

▪ Address Space Identifier (ASID)

▪ Implemented in Ariane
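As a small illustration of what the translation hardware has to work with, the sketch below splits a 39-bit (Sv39) virtual address into the three page-table indices a page table walker would use, and decodes the mode, ASID and root page number fields of an RV64 `satp` value. The field positions follow the RISC-V privileged specification; the struct and function names are hypothetical and are not taken from the Ariane sources.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned vpn[3];   /* vpn[2] indexes the root page table */
    unsigned offset;   /* byte offset inside the 4 KiB page */
} sv39_va;

/* Sv39 requires bits 63..39 to be copies of bit 38. */
static int sv39_is_canonical(uint64_t va)
{
    uint64_t upper = va >> 38;            /* bits 63..38, 26 bits in total */
    return upper == 0 || upper == 0x3FFFFFFu;
}

/* Split a virtual address into 9 + 9 + 9 index bits and a 12-bit offset. */
static sv39_va sv39_split(uint64_t va)
{
    sv39_va f;
    assert(sv39_is_canonical(va));
    f.offset = (unsigned)(va & 0xFFFu);
    f.vpn[0] = (unsigned)((va >> 12) & 0x1FFu);
    f.vpn[1] = (unsigned)((va >> 21) & 0x1FFu);
    f.vpn[2] = (unsigned)((va >> 30) & 0x1FFu);
    return f;
}

typedef struct {
    unsigned mode;   /* 0 = bare, 8 = Sv39, 9 = Sv48 */
    unsigned asid;   /* address space identifier */
    uint64_t ppn;    /* physical page number of the root page table */
} satp_fields;

/* RV64 satp layout: MODE in bits 63:60, ASID in 59:44, PPN in 43:0. */
static satp_fields satp_decode(uint64_t satp)
{
    satp_fields s;
    s.mode = (unsigned)(satp >> 60);
    s.asid = (unsigned)((satp >> 44) & 0xFFFFu);
    s.ppn  = satp & ((1ULL << 44) - 1);
    return s;
}
```

On a TLB miss, a walker would start at the page table located at `ppn << 12` and use `vpn[2]`, `vpn[1]` and `vpn[0]` in turn as indices of 8-byte entries.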

Memory Protection (PMP and MMU)

[Figure: MMU lookup; a TLB hit returns the translation, a TLB miss goes to the page table walker (PTW)]

The PULP platforms put everything together

Single Core

• PULPino

• PULPissimo

▪ Simple design

▪ Meant as a quick release

▪ Separate Data and

Instruction memory

▪ Makes it easy in HW

▪ Not meant as a Harvard arch.

▪ Can be configured to work

with all our 32bit cores

▪ RI5CY, Zero/Micro-Riscy

▪ Peripherals copied from

its larger brothers

▪ Any AXI and APB peripherals

could be used

PULPino: our first single core platform

[Figure: PULPino block diagram: RISC-V core with I$, instruction memory, data memory, boot ROM, bus adapter, AXI interconnect, APB interconnect, and peripherals (GPIO, SPI master, SPI slave, UART, I2C)]

▪ Shared memory

▪ Unified Data/Instruction Memory

▪ Uses the multi-core infrastructure

▪ Support for Accelerators

▪ Direct shared memory access

▪ Programmed through APB bus

▪ Number of TCDM access ports

determines max. throughput

▪ uDMA for I/O subsystem

▪ Can copy data directly from I/O to

memory without involving the core

▪ Used as a fabric controller

in larger PULP systems

PULPissimo: the improved single core platform

[Figure: PULPissimo block diagram: RI5CY with instruction buffer / I$ and separate instr and data ports, event unit, tightly coupled data memory interconnect with memory banks, uDMA with I/O interfaces (UART, SPI, I2S, I2C, SDIO, CPI, JTAG), APB / peripheral interconnect, clock / reset generator, FLLs, debug unit, hardware accelerator extension, Coreplex]

▪ Still work in progress

▪ Current version is very

simple

▪ Useful for in-house testing

▪ A more advanced version

will likely be developed

soon

Kerbin: the single core support structure for Ariane

[Figure: Kerbin block diagram: Ariane with I$ and D$, AXI interconnect, APB interconnect, bus adapter, AXI2Per bridge, boot ROM, timer, debug, FLL, and peripherals (GPIO, SPI master, SPI slave, UART, I2C)]

▪ OpenPiton

▪ Developed by Princeton

▪ Originally OpenSPARC T1

▪ Scalable NoC with

coherent LLC

▪ Tiled Architecture

▪ Still work in progress

▪ Bare-metal released in

Dec ’18

▪ Update with support for

SMP Linux will be

released soon

OpenPiton and Ariane together, the many-core system

[Figure: tiled many-core system. Each tile (Tile 00, 01, 10, 11, …) contains an Ariane core with I$ and D$, an L1.5 cache, a slice of the shared L2 and a NoC connection. The chipset provides UART, AXI-Lite, SD, debug, timer, IRQ, DRAM and a bridge.]

OpenPiton+Ariane mapped to FPGA


Xilinx VCU 118

▪ Core: 100 MHz

▪ Up to 16 cores

▪ 32 GiB DDR4

▪ (Available soon)

Xilinx VC707

▪ Core: 60 MHz

▪ Up to 4 cores

▪ 8 GiB DDR3

▪ (Available soon)

Digilent Genesys2

▪ Core: 66 MHz

▪ Up to 2 cores

▪ 8 GiB DDR3

▪ 1 core config:

▪ 85k LUT (42%)

▪ 67 BRAM (15%)

And Luca said “We will never do a vector processor”

[Figure: Ariane (1 GHz, 2 DP GFLOPS, 8 GB/s, I$ and D$, 64b instruction and data ports) coupled to the ARA vector unit (1 GHz, 8 DP GFLOPS, 8 GB/s) through an instruction queue with ACK/TRAP signalling; both attach to the interconnect, with an MMU on the path.]

[Figure: inside ARA: a vector register file made of eight 256b wide banks, a VRF arbitration unit, four 64bit FP FMA units, a load/store unit and writeback.]

The main PULP systems we develop are cluster based

Multi-core

• Fulmine

• Mr. Wolf

▪ Multiple RISC-V cores

▪ Individual cores can be started/stopped with little overhead

▪ DSP extensions in cores

▪ Multi-banked scratchpad memory (TCDM)

▪ Not a cache, there is no L1 data cache in our systems

▪ Logarithmic Interconnect allowing all cores to access all banks

▪ Cores will be stalled during contention, includes arbitration

▪ DMA engine to copy data to and from TCDM

▪ Data in TCDM managed by software

▪ Multiple channels, allows pipelined operation

▪ Hardware accelerators with direct access to TCDM

▪ No data copies necessary between cores and accelerators.
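The interaction between the multi-banked TCDM and the stalling behaviour during contention can be sketched with a toy model. The model below assumes word-level interleaving (consecutive 32-bit words go to consecutive banks) and that each bank serves one request per cycle while the other requesters to that bank stall; both are simplifying assumptions for illustration, and the real interconnect's arbitration policy is not described in these slides. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { WORD_BYTES = 4 };

/* Bank selected by a byte address under word-level interleaving (assumed). */
static unsigned tcdm_bank(uint32_t addr, unsigned num_banks)
{
    assert(num_banks > 0);
    return (addr / WORD_BYTES) % num_banks;
}

/* Number of cores that stall in one cycle, given one request address per core.
 * A core stalls if a lower-numbered core already targets the same bank, so
 * the result equals (number of cores) - (number of distinct banks hit). */
static unsigned tcdm_stalls(const uint32_t *addrs, unsigned num_cores,
                            unsigned num_banks)
{
    unsigned stalls = 0;
    for (unsigned i = 0; i < num_cores; i++) {
        for (unsigned j = 0; j < i; j++) {
            if (tcdm_bank(addrs[i], num_banks) == tcdm_bank(addrs[j], num_banks)) {
                stalls++;
                break;
            }
        }
    }
    return stalls;
}
```

In this model, four cores reading consecutive words never collide, while cores striding through memory by a multiple of `num_banks` words all hit the same bank. This is why software usually lays out per-core data so that simultaneous accesses fall into different banks.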

The main components of a PULP cluster

[Figure sequence: the cluster block diagram is built up one step at a time]

▪ PULP cluster contains multiple RISC-V cores

▪ Tightly Coupled Data Memory: all cores can access all memory banks in the cluster

▪ Data is copied from a higher level (L2 memory) through DMA

▪ There is a (shared) instruction cache that fetches from L2

▪ Hardware Accelerators can be added to the cluster

▪ Event unit to manage resources (fast sleep/wakeup)

▪ An additional microcontroller system (PULPissimo) for I/O, with its own RISC-V core, memory controller and external memory

Finally multi-cluster PULP systems for HPC applications

Multi-cluster

• Hero

▪ These systems are meant to work as accelerators for larger systems

▪ Optimized for data processing

▪ Three projects in this field:

▪ HERO: Heterogeneous Acceleration with ARM (Ariane) + PULP clusters

▪ PowerPULP: CAPI Interface between IBM Power8/9 and PULP clusters

▪ Ariane + Open Piton: Princeton group OpenPiton running with Ariane

▪ Both ETH Zürich and University of Bologna involved in EPI

▪ Main contributions in Accelerator stream

▪ Close collaboration with several groups

▪ The idea is not necessarily to use PULP

▪ Leverage what we have developed so far

Multi-cluster systems for HPC applications

▪ First released in 2018

▪ Allows a PULP cluster to be connected to a host system

Heterogeneous Research Platform

HERO: FPGA Platforms

05.03.2019, Andreas Kurth

HERO: Roadmap

▪ Standard peripheral talking over AXI/APB

▪ Common in practice, nothing new or special

▪ Instruction set extensions

▪ Possible in RISC-V, RI5CY has many new DSP instructions

▪ Goal is to get ‘good instructions’ into standard extensions at some point

▪ Shared functional units

▪ Amortizes expensive extensions (FPU/DIV) between multiple units

▪ Additional pipeline stages

▪ Work in Patronus for Control flow Integrity

▪ Shared memory accelerators

▪ Our bread and butter, PULPopen, NTX

▪ Cluster as an accelerator

▪ HERO, BigPULP, etc

How to accelerate processing in PULP systems

What Kind of Acceleration: ISA Extensions

▪ High Flexibility, relatively small performance boost

▪ Integrated in the Pipeline of Processors (ID Stage, EX Stage, WB Stage)

▪ Suffer from Register File Bandwidth Bottleneck (only 2 operands…)

▪ Require Adapting Compiler and Binutils

▪ Auxiliary Processing Units (APU), interface available in the RI5CY

▪ Examples: Dot product (already implemented), bit-reverse, butterfly…

Programming model

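    /* Wrap the compiler builtin that accumulates a packed dot product into c */
    #define SumDotp(a, b, c) __builtin_pulp_sdotsp2(a, b, c)

    /* Each vector element packs two values, hence N/2 iterations */
    for (int k = 0; k < (N>>1); k++) {
        VA = VectInA[k];
        VB = VectInB[k];
        S = SumDotp(VA, VB, S);
    }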

▪ The same as ISA extensions, but one unit can be shared among multiple cores

▪ Useful to save area for low-utilization instructions (i.e. utilization below 1/#cores)

▪ Examples: Shared FPU (SQRT, DIV…)

What Kind of Acceleration: Shared functional units

What Kind of Acceleration: Additional pipeline stages

▪ Sponge-based Control Flow Protection (joint work with TU-Graz)

▪ Encrypted code is decrypted on-the-fly on an additional pipeline stage (CFI)

▪ Supports both standard RISC-V code and encrypted code

▪ 35% area overhead, 25% power overhead (core level)

▪ 5% area overhead, 13% power overhead (system level) →PULPino

25.7.2018

What Kind of Acceleration: Shared memory accelerators

Coarse-Grained Shared-Memory Accelerators

▪ DFGs mapped In Hardware (ILP + DLP) →Highest Efficiency, Low Flexibility

▪ Sharing data memory with processor for fast communication → low overhead

▪ Controlled through a memory-mapped interface

▪ Typically one/two accelerators shared by multiple cores


Fulmine: a HW-Accelerated Secure IoT System-on-Chip


F. Conti et al., An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics, IEEE TCAS-I 2017

▪ UMC 65nm technology

▪ 6.86 mm²

▪ 4 cores, 2 accelerators

▪ HWCE for 3D conv layers

▪ DSP-optimized cores

▪ HWCRYPT for AES

▪ 64 kB of L1, 192 kB of L2

▪ First version of uDMA for

I/O with no SW intervention

▪ QSPI master/slave

▪ I2C

▪ I2S

▪ UART

The Quicklogic eFPGA integration in PULPissimo

[Figure: PULPissimo block diagram (RI5CY with instruction buffer / I$, event unit, tightly coupled data memory interconnect with memory banks, uDMA with UART, SPI, I2S, I2C, GPIO, CPI and JTAG interfaces, APB / peripheral interconnect, clock / reset generator, timer, power controller, debug unit, FLLs, always-on domain) extended with an eFPGA block. The eFPGA has its own GPIO, MAC units, memory banks and dual-port memories, and connects through config/control logic and a buffered TCDM adapter. Blocks not drawn to scale.]

▪ APB port for

configuration,

programming and

control

▪ Direct 4x TCDM

access for eFPGA

▪ 128 bits/cycle

▪ 4 independent R/W

▪ Possibility to use

uDMA


What Kind of Acceleration: Shared memory accelerators 2

▪ Fine-Grained Shared-Memory Accelerators

▪ Sharing Data Memory with processors for fast communication

▪ Typically one processor controls more than one accelerator

▪ More flexible than coarse-grain accelerators

▪ The functionalities integrated are typically simpler and more “general purpose”

▪ Controlled through a Memory Mapped Interface (but these are also suitable for ISA Extensions → low overhead)


NTX: Boosting HWPEs for Deep Learning

The Neural Training Accelerator (NTX) [4] is built around a float32 fused multiply-accumulate core specialized for deep learning applications (ReLU, masking…)

[4] F. Schuiki et al., A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets, under revision

NTX Architecture

Programming

Model

What Kind of Acceleration: Cluster as an accelerator

[Figure: SoC with a cluster used as an accelerator. Cluster: eight RI5CY cores in two groups of four, each group with a shared FPU; shared instruction cache; logarithmic interconnect to the tightly coupled data memory banks; DMA, event unit, timer, MMU, cluster bus and peripheral interconnect. SoC domain (PULPissimo): Zero-riscy core with event unit, L2 banks behind a logarithmic interconnect, ROM, µDMA with SPI master, CAMIF, I2C, UART and HYPER interfaces, debug, and an APB bus with clock, timer, power, debug, PWM and GPIO blocks. Power domains: always-on (power control, ROM, RTC), SoC and cluster.]

We have designed more than 25 ASICs based on PULP

ASICs meant to go on IC Tester

▪ Mainly characterization

▪ Not so many peripherals

ASICs meant for applications

▪ More peripherals (SPI, Camera)

▪ More on-chip memory

You can buy development boards with PULP technology

GAPUINO from Greenwaves

▪ PULP cluster system with

Nine RI5CY cores

VEGA board from open-isa.org

▪ Micro-controller board with

RI5CY and zero-riscy

▪ All are 28 FDSOI technology, RVT, LVT and RVT flavor

▪ Uses OpenRISC cores

▪ Chips designed in collaboration with STM, EPFL, CEA/LETI

▪ PULPv3 has ABB control

Brief illustrated history of selected ASICs

PULP PULPv2 PULPv3

▪ First multi-core systems that were designed to work on development

boards. Each have several peripherals (SPI, I2C, GPIO)

▪ Mia Wallace and Fulmine (UMC65) use OpenRISC cores

▪ Honey Bunny (GF28 SLP) uses RISC-V cores

▪ All chips also have our own FLL designs.

The first system chips, meant for applications

Mia Wallace, Honey Bunny, Fulmine

▪ Designed in collaboration with the Analog group of Prof. Huang

▪ All chips with SMIC130 (because of analog IPs)

▪ First three with OpenRISC, VivoSoC3 with RISC-V

Combining PULP with analog front-end for Biomedical apps

VivoSoC, VivoSoC v2.0, VivoSoC v2.001, VivoSoC v3

▪ System chips in TSMC40 (Mr. Wolf) and UMC65

▪ Mr. Wolf: IoT Processor with 9 RISC-V cores (Zero-riscy + 8x RI5CY)

▪ Atomario: Multi cluster PULP (2x clusters with 4x RI5CY cores each)

▪ Scarabaeus: Ariane based microcontroller

The new generation chips from 2018

Mr. Wolf Atomario Scarabaeus

▪ All are 22nm Globalfoundries FDX, around 10 sqmm, 50-100 Mtrans

▪ Poseidon: PULPissimo (RI5CY) + Ariane

▪ Kosmodrom: 2x Ariane + NTX (due this week)

▪ Arnold: PULPissimo (RI5CY) + Quicklogic eFPGA

The large system chips from 2018

Poseidon Kosmodrom Arnold

We firmly believe in Open Source movement

First launched in

February 2016

(Github)

All our

development is

on open

repositories

Contributions

from many

groups

▪ The way we design ICs has changed, big part is now infrastructure

▪ Processors, peripherals, memory subsystems are now considered infrastructure

▪ Very few (if any) groups design complete IC from scratch

▪ High quality building blocks (IP) needed

▪ We need an easy and fast way to collaborate with people

▪ Currently complicated agreements have to be made between all partners

▪ In many cases, too difficult for academia and SMEs

▪ Hardware is critical for security, we need to ensure it is secure

▪ Being able to see what is really inside will improve security

▪ Having a way to design open HW, will not prevent people from keeping secrets.

Open Hardware is a necessity, not an ideological crusade

▪ Similar to Apache/BSD, adapted specifically for Hardware

▪ Allows you to:

▪ Use

▪ Modify

▪ Make products and sell them

without restrictions.

▪ Note the difference to GPL

▪ Systems that include PULP do not have to be open source (Copyright not Copyleft)

▪ They can be released commercially

▪ LGPL may not work as you think for HW

We provide PULP with SOLDER Pad License

http://www.solderpad.org/licenses/

▪ The following are ok:

▪ RTL code written in HDL, or a high-level language for HLS flow

▪ Testbenches in HDL and associated makefiles, golden models

▪ How about support scripts for different tools?

▪ Synthesis scripts, tool startup files, configurations

▪ And these are currently no go :

▪ Netlists mapped to standard cell libraries

▪ Placement information (DEF)

▪ Actual Physical Layout (GDSII)

At the moment, open HW can (mostly/only) be HDL code

▪ The PDK of manufacturers is under NDA

▪ Can not release spice netlists, GDS layout, timing extraction data

▪ No real reason why PDK is under NDA, expect it to go away at some point

▪ Reliance on libraries (std. cells, memories, I/O cells, analog IP)

▪ Once the PDK is open this can easily also be open source hardware

▪ EDA tool manufacturers

▪ If you have a hard macro, no need to buy licenses for tools to make them

▪ Arguable: more open source hardware, more companies that do ASICs, more need for

EDA

▪ If open source becomes widespread, also open source tools will increase

▪ Not an issue for companies with commercial licenses of EDA tools

▪ More problematic for universities, smaller companies

▪ FPGA vendors just want to sell their FPGAs

What is stopping the release of GDS and Netlists?

▪ Many companies (we know of) are actively using PULP

▪ They value that it is silicon proven

▪ They like that it uses a permissive open source license

Silicon and Open Hardware fuel PULP success

June 11-14 Zürich, SWITZERLAND

▪ Official RISC-V Workshop (June 11-12)

▪ RISC-V foundation member meetings (June 13)

▪ Eurolab4HPC, Open Source Innovation Camp (June 13)

▪ Licensing and IP rights for Open source HW (June 13)

▪ FOSSI: Path to high quality IP, Open source EDA tools (June 14)

▪ Tutorials, demos, hackathons

http://pulp-platform/wosh

Tracks: RISC-V FOUNDATION, WOSH SESSIONS, WOSH TUTORIALS

▪ Tuesday, 11th of June: RISC-V Day 1

▪ Wednesday, 12th of June: RISC-V Day 2

▪ Thursday, 13th of June: Member meetings, Licensing, Innovation with OSH, PULP, Open Piton

▪ Friday, 14th of June: High Quality IP in OSH, Open Source EDA tools, MyriadRF, OpenISA.org, Greenwaves, Hero

Week of Open Source Hardware, June 11-14, Zurich

Paid

Registration

OPEN

Now

FREE

Registration

(expect early March)

Working on Program

Now

If you have ideas

Contact me

▪ List your open source HW/SW on the Eurolab4HPC www site and get

▪ 3000€ for winner & 1000€ for two runner ups

▪ Register by 1st of March

▪ https://www.eurolab4hpc.eu/open-source/call/

▪ Summer of Code activity on Transprecision Computing

▪ Up to 10x projects will be supported: 6000€

▪ Register by 1st of March

▪ Has to be open source

▪ http://oprecomp.eu/open-source

Who says Open Source does not pay?

https://www.eurolab4hpc.eu/open-source/call/

http://oprecomp.eu/open-source

▪ Working on a non-profit organization to:

▪ Manage the distribution of PULP

▪ Governance

▪ Technical Support

▪ Continuity

▪ ETH Zürich and University of Bologna will be contributors

▪ We can concentrate on our research, energy efficient processor systems

▪ Continue using PULP in our work

▪ Become a regular contributor

▪ Hope to announce during the WOSH in Zurich

▪ There is some work towards this goal, nothing official yet

Future of PULP, what we expect

PULP @ ETH Zürich

QUESTIONS?

@pulp_platform

http://pulp-platform.org

How do we work (the PULPissimo + cluster diagram is repeated for each step):

▪ Initiate a DMA transfer

▪ Data copied from L2 into TCDM

▪ Once data is transferred, event unit notifies cores/accel

▪ Cores can work on the data transferred

▪ Accelerators can work on the same data

▪ Once our work is done, DMA copies data back

During normal operation all of these occur concurrently
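One common way to let transfers and computation overlap is double buffering: while the cores work on one tile in TCDM, the DMA already fetches the next tile from L2 into a second buffer. The sketch below shows only the control flow. `dma_copy` is a stand-in implemented with `memcpy` and is therefore synchronous; on a real cluster the transfers would be asynchronous and the code would wait for an event before touching a buffer. The tile size, buffer names and the doubling kernel are arbitrary choices for illustration, not PULP SDK functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TILE = 64 };   /* elements per tile, chosen arbitrarily */

/* Stand-in for a DMA transfer between L2 and TCDM (hypothetical helper). */
static void dma_copy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Placeholder kernel: the work the cores (or an accelerator) do on one tile. */
static void compute_tile(int32_t *tile, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tile[i] *= 2;
}

/* Process an L2 buffer in place, tile by tile, using two TCDM buffers. */
static void process_l2_buffer(int32_t *l2, size_t n)
{
    static int32_t tcdm[2][TILE];          /* models two buffers in TCDM */
    assert(n % TILE == 0);
    size_t tiles = n / TILE;
    if (tiles == 0)
        return;

    dma_copy(tcdm[0], l2, sizeof tcdm[0]);             /* fetch the first tile */
    for (size_t t = 0; t < tiles; t++) {
        int cur = (int)(t & 1);
        if (t + 1 < tiles)                             /* start fetching the next tile */
            dma_copy(tcdm[cur ^ 1], l2 + (t + 1) * TILE, sizeof tcdm[0]);
        compute_tile(tcdm[cur], TILE);                 /* work on the current tile */
        dma_copy(l2 + t * TILE, tcdm[cur], sizeof tcdm[0]);  /* copy results back */
    }
}
```

With asynchronous transfers, the write-back of a buffer would also have to complete before that buffer is refilled two iterations later; the synchronous stand-in hides that dependency.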
