Nano-Tera Poster 2014 2nd PrizeObeSense

1
Hardware / Software Optimizations for Efficient Embedded Digital Signal Processing in Wireless Body Sensor Nodes ObeSense RTD 2013 Rubén Braojos, Giovanni Ansaloni, David Atienza Embedded Systems Laboratory, EPFL Overhead 3L-MF 3LMMD RPCLASS Code size 2,6 % 0,9 % 0,7 % Run-time 1,7 % 1,0 % 0,6 % Experimental Results 3L-MF SC MC 0 25 50 75 100 3L-MMD RP-CLASS Power consumption (μW) Clock Tree D-Crossbar I-Crossbar Cores & logic Data Mem. Prog. Mem. SC MC SC MC Mul$core WBSN are more energy efficient than singlecore equivalents if they are properly synchronized !Up to 40% less power consump:on (3LMF) !Very low overhead (< 3%) due to synchroniza$on Synchroniza$on + Broadcas:ng = Memory energy efficiency 40% less accesses to IM and 3.7% less accesses to DM SIMD + Workload division Lower clock constraint VFS ! 15% Reduc:on of System V DD Wireless Body Sensor Nodes (WBSNs) in Healthcare Input Bio-signals Output Diagnosis Fusion Analysis Filter Consecutive phases Parallelism? Multiple inputs Parallelism? WBSNs are miniaturized devices able to acquire, process and transmit biosignals (ECG, EMG, blood pressure movements…) Wearable, unobtrusive, inexpensive Limited resources and autonomy Onboard Digital Signal Processing (DSP) is employed to improve energy efficiency. Can algorithmic parallelism be exploited? New Trend: Mul:core WBSN for Efficient DSP Multi-core WBSN Inst. memory (IM) Data memory (DM) Core Core Core Core Core Core Peripherals Application partition Parallelism Parallelism Workload is divided into subtasks by pipelining phases and performing SIMD execu$on over several biosignals Each subtask is executed in a core Voltage frequency scaling (VFS) Challenges Datadependent branches (SIMD) Data and control flow among phases Lack of efficient mechanism for Resynchroniza$on to maximize SIMD Producerconsumer no$fica$on REFERENCES: [1] Dogan, et al., “Low power processor architecture explora$on for online biomedical signal analysis,” IETCDS 2012. [2] Rahimi, et al. “A fullysynthesizable singlecycle interconnec$on network for SharedL1 processor clusters” DATE’11 [3] Y. Sun et al., “ECG signal condi$oning by morphological filtering,” Computers in Biology and Medicine, 2002. [4] F. Rincon et al., “Development and evalua$on of mul$lead waveletbased ECG delinea$on algorithms for embedded wireless sensor nodes,” Informa$on Tech. in Biomedicine, 2011. [5] Braojos et al., “A methodology for embedded classifica$on of heartbeats using random projec$ons,” in DATE‘13 Conclusions Two considered systems: Synchronized mul$core WBSN (MC) Equivalent singlecore WBSN (SC) Evalua$on Framework RTL implementa$on SystemC cycleaccurate simulator Custom compila$on toolchain Power model derived from postlayout simula$ons Proposed Solu:on: Lightweigth HW / SW Mechanism for Code Synchroniza:on Cores INTERCONNECT INTERCONNECT #0 #1 #2 #3 #4 #N IM DM Private #0 Private #1 Private #2 Private #3 Private #4 Shared Synchronizer Steps to adapt exis$ng bio medical applica$ons 1. Par::oning: Iden$fy algorithmic phases (producer consumer) and poten$al parallel computa$ons (SIMD) 2. Instruc:on inser:on: Implement resynchroniza$on and producerconsumer rela$ons with the specific ISE 3. Code mapping: Cores performing SIMD share the shame IM bank Synchronizer unit : Stalls and wakes-up cores to orchestrate execution Guaranties re-synchronization Synchronization points : Store run-time information for each synchronization event Reserved space in shared memory Instruction Set Extension (ISE) : SLEEP, SINC, SDEC and SNOP Mark synchronization events and modify synchronization points Mul$core WBSN (based on [1]) RISC ultralow power processors Mul:banked IM and DM + shared and private DM Minimize memory conflicts Logarithmic combina$onal interconnect [3] with arbitra$on capabili$es Broadcas:ng: Simultaneous request of the same address are merged into a single memory access Memory energy efficiency filtering filtering filtering filt. filt. filt. Combina$on Delinea$on filt. classifica$on Mul:lead ECG filtering (3LMF) [3] Remove unwanted ar$facts (perspira$on, muscular ac$vity,…) from mul$ple ECG signals Mul:lead ECG delinea:on (3LMMD) [4] Combines mul$ple filtered signals and finds onset, peak and end of the main ECG waves Heartbeat classifier (RPCLASS) [5] Performs mul$lead delinea$on only in the presence of abnormali$es Benchmarks: Embedded Electrocardiogram (ECG) Applica:ons for WBSNs Only SIMD execution SIMD execution + Producer-consumer SIMD execution + Producer- consumer + complex control activation

Transcript of Nano-Tera Poster 2014 2nd PrizeObeSense

Page 1: Nano-Tera Poster 2014 2nd PrizeObeSense

Hardware / Software Optimizations for Efficient Embedded Digital Signal Processing

in Wireless Body Sensor Nodes

ObeSense RTD 2013

Rubén Braojos, Giovanni Ansaloni, David Atienza Embedded Systems Laboratory, EPFL

Overhead 3L-MF 3LMMD RPCLASS Code size 2,6 % 0,9 % 0,7 % Run-time 1,7 % 1,0 % 0,6 %

Experimental  Results  3L-MF

SC MC 0

25

50

75

100 3L-MMD RP-CLASS

Pow

er c

onsu

mpt

ion

(µW

)

Clock Tree D-Crossbar I-Crossbar Cores & logic Data Mem. Prog. Mem.

SC MC SC MC

•  Mul$-­‐core  WBSN  are  more  energy  efficient  than  single-­‐core  equivalents  if  they  are  properly  synchronized    !Up  to  40%  less  power  consump:on  (3L-­‐MF)    !Very  low  overhead  (<  3%)  due  to  synchroniza$on  

 

•  Synchroniza$on  +  Broadcas:ng  =  Memory  energy  efficiency    à  40%  less  accesses  to  IM  and  3.7%    less  accesses  to  DM  

 

•  SIMD  +  Workload  division  à  Lower  clock  constraint  à  VFS    !  15%  Reduc:on  of  System  VDD  

Wireless  Body  Sensor  Nodes  (WBSNs)  in  Healthcare  

Input Bio-signals

Output Diagnosis Fusion   Analysis  Filter  

Consecutive phasesà Parallelism?

Multiple inputs à Parallelism?

•  WBSNs  are  miniaturized  devices  able  to  acquire,  process  and  transmit  bio-­‐signals  (ECG,  EMG,  blood  pressure  movements…)  é  Wearable,  unobtrusive,  inexpensive  ê  Limited  resources  and  autonomy  

•  On-­‐board  Digital  Signal  Processing  (DSP)  is  employed  to  improve  energy  efficiency.  

•  Can  algorithmic  parallelism  be  exploited?  

New  Trend:  Mul:-­‐core  WBSN  for  Efficient  DSP  

Multi-core WBSN

Inst.  memory  (IM)  

Data  memory  (DM)  

Core   Core  

Core   Core  

Core   Core  

Peripherals  

Application partition

Parallelism

Parallelism

•  Workload  is  divided  into  subtasks  by  pipelining  phases  and  performing        SIMD  execu$on  over  several  bio-­‐signals    à  Each  subtask  is  executed  in  a  core    

       à  Voltage  frequency  scaling  (VFS)  •  Challenges  

•  Data-­‐dependent  branches  (SIMD)  •  Data  and  control  flow  among  phases  

•  Lack  of  efficient  mechanism  for  •  Re-­‐synchroniza$on  to  maximize    SIMD  •   Producer-­‐consumer  no$fica$on  

REFERENCES:    [1]  Dogan,  et  al.,  “Low-­‐  power  processor  architecture  explora$on  for  online  biomedical  signal  analysis,”  IET-­‐CDS  2012.  [2]  Rahimi,  et  al.  “A  fully-­‐synthesizable  single-­‐cycle  interconnec$on  network  for  Shared-­‐L1  processor  clusters”  DATE’11  [3]  Y.  Sun  et  al.,  “ECG  signal  condi$oning  by  morphological  filtering,”  Computers  in  Biology  and  Medicine,  2002.  [4]  F.  Rincon  et  al.,  “Development  and  evalua$on  of  mul$lead  wavelet-­‐based  ECG  delinea$on  algorithms  for  embedded  wireless  sensor  nodes,”  Informa$on  Tech.  in  Biomedicine,  2011.    [5]  Braojos  et  al.,  “A  methodology  for  embedded  classifica$on  of  heartbeats  using  random  projec$ons,”  in  DATE‘13  

 

Conclusions  Two  considered  systems:  •  Synchronized  mul$-­‐core  

WBSN  (MC)  •  Equivalent  single-­‐core  

WBSN  (SC)  

Evalua$on  Framework    •  RTL  implementa$on  •  SystemC  cycle-­‐accurate  

simulator  •  Custom  compila$on              tool-­‐chain  •  Power  model  derived  from  

post-­‐layout  simula$ons  

Proposed  Solu:on:  Lightweigth  HW  /  SW  Mechanism  for  Code  Synchroniza:on  

Cores

INTERC

ONNEC

T  

INTERC

ONNEC

T  

#0  

#1  

#2  

#3  

#4  

#N  

IM DM

Private  #0  

Private  #1  

Private  #2  

Private  #3  

Private  #4  

Shared  

Synchronizer  Steps  to  adapt  exis$ng  bio-­‐medical  applica$ons    

1.   Par::oning:  Iden$fy  algorithmic  phases  (producer-­‐consumer)    and  poten$al  parallel  computa$ons  (SIMD)  

2.   Instruc:on  inser:on:  Implement  re-­‐synchroniza$on  and  producer-­‐consumer  rela$ons  with  the  specific  ISE  

3.   Code  mapping:  Cores  performing  SIMD  share  the  shame  IM  bank  

Synchronizer unit: •  Stalls and wakes-up cores to

orchestrate execution •  Guaranties re-synchronization

Synchronization points: •  Store run-time information for each

synchronization event •  Reserved space in shared memory

Instruction Set Extension (ISE): •  SLEEP, SINC, SDEC and SNOP •  Mark synchronization events and

modify synchronization points

Mul$-­‐core  WBSN  (based  on  [1])  •  RISC  ultra-­‐low  power  

processors  

•  Mul:-­‐banked  IM  and  DM  +  shared  and  private  DM    

       à  Minimize  memory  conflicts  

•  Logarithmic  combina$onal  interconnect  [3]  with  arbitra$on  capabili$es  

•  Broadcas:ng:  Simultaneous  request  of  the  same  address  are  merged  into  a  single  memory  access    

     à    Memory  energy  efficiency  

filtering  

filtering  

filtering  

filt.  

filt.  

filt.  

Combina$on   Delinea$on   filt.  

filt.  

filt.  

Combina$on   Delinea$on  

classifica$on  

Mul:-­‐lead  ECG  filtering  (3L-­‐MF)    [3]  •  Remove  unwanted  ar$facts  (perspira$on,  

muscular  ac$vity,…)  from  mul$ple  ECG  signals  

Mul:-­‐lead  ECG  delinea:on  (3L-­‐MMD)  [4]  •  Combines  mul$ple  filtered  signals  and  finds  

onset,  peak  and  end  of  the  main  ECG  waves  

Heart-­‐beat  classifier  (RP-­‐CLASS)  [5]  •  Performs  mul$-­‐lead  delinea$on  only  in  

the  presence  of  abnormali$es  

Benchmarks:  Embedded  Electrocardiogram  (ECG)  Applica:ons  for  WBSNs  

Only SIMD execution

SIMD execution + Producer-consumer

SIMD execution + Producer-consumer + complex control

activation