Precision Timed Embedded Systems Using TickPAD Memory. Matthew M Y Kuo*, Partha S Roop*, Sidharta Andalam†, Nitish Patel*. *University of Auckland, New Zealand; †TUM CREATE, Singapore.

  • Slide 1
  • Precision Timed Embedded Systems Using TickPAD Memory. Matthew M Y Kuo*, Partha S Roop*, Sidharta Andalam†, Nitish Patel*. *University of Auckland, New Zealand; †TUM CREATE, Singapore.
  • Slide 2
  • Introduction. Hard real-time systems need to meet real-time deadlines; catastrophic events may occur when a deadline is missed. The synchronous execution approach is good for hard real-time systems: it is deterministic and reactive, and it aids static timing analysis because programs are well bounded (no unbounded loops or recursion).
  • Slide 3
  • Synchronous Languages. Execute in logical time (ticks): sample inputs, compute, emit outputs. Synchronous hypothesis: ticks are instantaneous. This assumes the system executes infinitely fast, i.e. the system is faster than the environment's response. Worst-case reaction time (WCRT): the time between two logical ticks. Languages: Esterel, Scade, PRET-C (an extension to C).
  • Slide 5
  • PRET-C. Light-weight multithreading in C. Provides thread-safe memory access. The C extension is implemented as C macros. Statements and their meanings:
      • ReactiveInput I: declares I as a reactive environment input.
      • ReactiveOutput O: declares O as a reactive environment output.
      • PAR(T1, ..., Tn): synchronously executes n threads in parallel, where thread Ti has a higher priority than Ti+1.
      • EOT: marks the end of a tick.
      • [weak] abort P when C: preempts P when C is true.
  • Slide 6
  • Introduction. Practical systems require larger memories: not all applications fit in on-chip memory, so a memory hierarchy is required. Processor-memory gap [1]. [1] Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. San Francisco, CA: Morgan Kaufmann, 2011.
  • Slide 7
  • Introduction. Traditional approaches: caches and scratchpads. However, there is scant research on memory architectures tailored for synchronous execution and concurrency.
  • Slide 8
  • Caches. (Diagram: CPU, Main Memory.)
  • Slide 9
  • Caches, traditionally. A small, fast piece of memory that exploits temporal locality and spatial locality. Hardware controlled, with a replacement policy. (Diagram: CPU, Cache, Main Memory.)
  • Slide 10
  • Caches in hard real-time systems. The architecture needs to be modelled in order to compute the WCRT. Cache models trade off the length of the computation time against tightness; a very tight worst-case estimate is not scalable. (Diagram: CPU, Cache, Main Memory.)
  • Slide 11
  • Scratchpad memory (SPM). Software controlled. Statically allocated; statically or dynamically loaded. Requires an allocation algorithm, e.g. ILP or greedy. (Diagram: CPU, SPM, Main Memory.)
  • Slide 12
  • Scratchpad in hard real-time systems. Easy to compute a tight WCRT. Reduces the worst-case performance. Requires a balance between the number of reload points and their overheads. May perform worse than a cache in worst-case performance. (Diagram: CPU, SPM, Main Memory.)
  • Slide 13
  • TickPAD. Cache: good at overall performance, hardware controlled. SPM: good at worst-case performance, easy for fast and tight static analysis. (Diagram: CPU, Cache, SPM, Main Memory.)
  • Slide 14
  • TickPAD. Cache: good at overall performance, hardware controlled. SPM: good at worst-case performance, easy for fast and tight static analysis. (Diagram: CPU, Cache, SPM, TPM, Main Memory.)
  • Slide 15
  • TickPAD. TPM: TickPAD Memory. TickPAD stands for Tick Precise Allocation Device. A memory controller that is a hybrid between caches and scratchpads: hardware-controlled features with static software allocation. Tailored for synchronous languages. Instruction memory. (Diagram: CPU, TPM, Main Memory.)
  • Slide 16
  • TickPAD Design flow
  • Slide 17
  • PRET-C. int main() { init(); PAR(t1,t2,t3); ... } void thread t1() { compute; EOT; compute; EOT; } (Diagram: main, t1, t2, t3.)
  • Slide 18
  • PRET-C (same code and diagram). Highlighted: computation.
  • Slide 19
  • PRET-C (same code and diagram). Highlighted: spawn children threads.
  • Slide 20
  • PRET-C (same code and diagram). Highlighted: end of tick, synchronization boundaries.
  • Slide 21
  • PRET-C (same code and diagram). Highlighted: child thread terminates.
  • Slide 22
  • PRET-C (same code and diagram). Highlighted: main thread resumes.
  • Slide 23
  • PRET-C Execution. (Timeline diagram: main, t1, t2, t3 against time.) Sample inputs.
  • Slide 24
  • PRET-C Execution. (Timeline diagram.) Executed so far: main.
  • Slide 25
  • PRET-C Execution. (Timeline diagram.) Executed so far: main, t1.
  • Slide 26
  • PRET-C Execution. (Timeline diagram.) Executed so far: main, t1, t2.
  • Slide 27
  • PRET-C Execution. (Timeline diagram.) Executed so far: main, t1, t2.
  • Slide 28
  • PRET-C Execution. (Timeline diagram.) Executed so far: main, t1, t2. Emit outputs.
  • Slide 29
  • PRET-C Execution. (Timeline diagram.) Executed so far: main, t1, t2. Annotation: 1 tick (reaction time).
  • Slide 30
  • PRET-C Execution. (Timeline diagram.) Executed so far: main, t1, t2. Annotation: local tick.
  • Slide 31
  • Assumptions. One cache line holds 4 instructions (addresses 0x00, 0x04, 0x08, 0x0C) and takes 1 burst transfer from main memory. A cache miss takes 38 clock cycles [2]. Each instruction takes 2 cycles to execute. Buffers are 1 cache line in size. [2] J. Whitham and N. Audsley. The Scratchpad Memory Management Unit for Microblaze: Implementation, Testing, and Case Study. Technical Report YCS-2009-439, University of York, 2009.
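The assumptions above can be turned into a few constants to make the later arithmetic easier to follow. This is only a minimal sketch: the three numbers come from the slide, while the names and the helper `unbuffered_cycles` (which assumes fetching and executing a line never overlap) are hypothetical and not part of the presented design.

```c
#include <assert.h>

enum {
    INSTRS_PER_LINE   = 4,  /* from the slide: 4 instructions per cache line */
    CYCLES_PER_INSTR  = 2,  /* from the slide: 2 cycles per instruction */
    LINE_FETCH_CYCLES = 38  /* from the slide: one burst transfer on a miss */
};

/* Cycles needed to execute every instruction of one line that is
 * already held in a buffer. */
int line_exec_cycles(void)
{
    return INSTRS_PER_LINE * CYCLES_PER_INSTR;
}

/* Hypothetical baseline: each of n_lines is fetched from main memory and
 * then executed, with no overlap between the fetch and the execution. */
int unbuffered_cycles(int n_lines)
{
    assert(n_lines >= 0);
    return n_lines * (LINE_FETCH_CYCLES + line_exec_cycles());
}
```

Under these numbers one line costs 8 cycles to execute, which is the amount a prefetch of the following line could hide.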
  • Slide 32
  • TickPAD - Overview.
  • Slide 33
  • TickPAD - Overview. Spatial memory pipeline: accelerates linear code.
  • Slide 34
  • TickPAD - Overview. Associative loop memory: for predictable temporal locality; statically allocated and dynamically loaded.
  • Slide 35
  • TickPAD - Overview. Tick address queue: stores the resumption addresses of active threads.
  • Slide 36
  • TickPAD - Overview. Tick instruction buffer: stores the instructions at the resumption point of the next active thread, to reduce context-switching overhead at state/tick boundaries.
  • Slide 37
  • TickPAD - Overview. Command table: stores a set of commands to be executed by the TickPAD controller.
  • Slide 38
  • TickPAD - Overview. Command buffer: a buffer to store operands fetched from main memory, for commands requiring 2+ operands.
  • Slide 39
  • Spatial Memory Pipeline. Cache, on a miss: fetches the line from main memory into the cache; the first instruction misses and subsequent instructions on that line hit; timing analysis requires the cache history. Scratchpad, unallocated code: executes from main memory; every instruction pays the miss cost; timing analysis is simple.
  • Slide 40
  • Spatial Memory Pipeline. The memory controller has a single line buffer. Analysis is simple: analyse the previous instruction. The first instruction misses; subsequent instructions on that line hit. (Diagram: CPU, Main Memory.)
  • Slide 41
  • Spatial Memory Pipeline. Computation requires many lines of instructions. To exploit spatial locality, predictably prefetch the next line of instructions by adding another buffer.
  • Slide 42
  • Spatial Memory Pipeline. To preserve determinism, the prefetch is only active if there is no branch.
  • Slide 43
  • Spatial Memory Pipeline
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Spatial Memory Pipeline: timing analysis. Simple to analyse: analyse the next instruction line. If there is a branch, the next target line will miss, e.g. 38 clock cycles. Otherwise it will be prefetched, e.g. 38 - 8 = 30 clock cycles.
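The timing rule on this slide can be written as a small function. The sketch below assumes that the 8 cycles subtracted in the prefetch case are the execution time of the current line (4 instructions at 2 cycles each), and that the first line of a path always misses; both are readings of the slides rather than confirmed details, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    SMP_MISS_CYCLES      = 38, /* from the slides: full line fetch on a miss */
    SMP_LINE_EXEC_CYCLES = 8   /* assumption: 4 instructions x 2 cycles */
};

/* Memory cost charged when execution moves on from the current line.
 * A branch disables the prefetch, so the target line is a full miss;
 * otherwise the prefetch overlaps with executing the current line. */
int smp_next_line_cost(bool current_line_has_branch)
{
    if (current_line_has_branch)
        return SMP_MISS_CYCLES;
    return SMP_MISS_CYCLES - SMP_LINE_EXEC_CYCLES;
}

/* Memory cost of a path of n_lines lines, where has_branch[i] says whether
 * line i contains a branch. Assumption: the first line is always a miss. */
int smp_path_cost(const bool *has_branch, int n_lines)
{
    assert(n_lines >= 0);
    if (n_lines == 0)
        return 0;
    int total = SMP_MISS_CYCLES;
    for (int i = 0; i + 1 < n_lines; i++)
        total += smp_next_line_cost(has_branch[i]);
    return total;
}
```

The analysis only needs to know whether each line contains a branch, which is why no cache history is required.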
  • Slide 56
  • Tick Address Queue and Tick Instruction Buffer. Reduce the cost of context switching. The queue maintains a priority queue of the thread execution order. The buffer prefetches instructions from the next thread, making context-switching points appear as linear code. Paired with the spatial memory pipeline.
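One way to picture the tick address queue is as a small ring buffer of resumption addresses. This is a software sketch only: a FIFO stands in for the priority queue named on the slide, on the assumption that addresses are loaded in thread priority order, and the capacity and all identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAQ_CAPACITY 8 /* hypothetical queue depth */

typedef struct {
    uint32_t addr[TAQ_CAPACITY]; /* resumption addresses in execution order */
    int head;                    /* index of the next address to pop */
    int count;                   /* number of stored addresses */
} tick_addr_queue;

void taq_init(tick_addr_queue *q)
{
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
}

/* Append a thread's resumption address. Returns 0, or -1 if the queue is full. */
int taq_push(tick_addr_queue *q, uint32_t resume_addr)
{
    assert(q != NULL);
    if (q->count == TAQ_CAPACITY)
        return -1;
    q->addr[(q->head + q->count) % TAQ_CAPACITY] = resume_addr;
    q->count++;
    return 0;
}

/* Remove the address of the next thread to run and store it in *out.
 * Returns 0, or -1 if the queue is empty. */
int taq_pop(tick_addr_queue *q, uint32_t *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0)
        return -1;
    *out = q->addr[q->head];
    q->head = (q->head + 1) % TAQ_CAPACITY;
    q->count--;
    return 0;
}
```

The popped address is what a tick instruction buffer would use to prefetch the next thread's first line before the context switch happens.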
  • Slide 57
  • Tick Address Queue Tick Instruction Buffer
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Context switching memory cost same as linear code
  • Slide 63
  • Tick Address Queue Tick Instruction Buffer
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Timing analysis. Allocated context-switching points are analysed in the same way as prefetched lines.
  • Slide 68
  • Associative Loop Memory. Statically allocated with a greedy algorithm that allocates the innermost loop first. Fetches the loop before executing it. Predictable: easy and tight to model. Exploits temporal locality.
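A greedy, innermost-first allocation could look roughly like the sketch below. The slide only states the ordering rule, so everything else is an assumption: the `loop_info` type, measuring loops in cache lines, ignoring overlap between a nested loop and its parent, and skipping loops that no longer fit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    int depth;      /* nesting depth: a larger value is more deeply nested */
    int lines;      /* loop size in cache lines (hypothetical unit) */
    bool allocated; /* set by alm_allocate */
} loop_info;

/* qsort comparator: deepest (innermost) loops first. */
static int by_depth_desc(const void *a, const void *b)
{
    const loop_info *x = a;
    const loop_info *y = b;
    return y->depth - x->depth;
}

/* Sort the loops innermost-first, then allocate each loop that still fits
 * in the remaining capacity. Returns the number of lines used. */
int alm_allocate(loop_info *loops, size_t n, int capacity_lines)
{
    assert(capacity_lines >= 0);
    qsort(loops, n, sizeof *loops, by_depth_desc);
    int used = 0;
    for (size_t i = 0; i < n; i++) {
        loops[i].allocated = loops[i].lines <= capacity_lines - used;
        if (loops[i].allocated)
            used += loops[i].lines;
    }
    return used;
}
```

Favouring the innermost loops puts the code that iterates most often into the loop memory first, which is where temporal locality pays off.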
  • Slide 69
  • Command Table. Statically allocated. A look-up table used to dynamically load the Tick Instruction Buffer, the Tick Address Queue, and the Associative Loop Memory. Commands are executed when the PC matches the address stored in the command. Allows the TickPAD to function without modification to source code, e.g. libraries and proprietary programs.
  • Slide 70
  • Command Table. Three fields:
      • Address: the PC address at which to execute the command.
      • Command: Discard Loop Associative Memory, Store Loop Associative Memory, Fill Tick Instruction Buffer, or Load Tick Address Queue.
      • Operand: data used by the command.
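The three fields map naturally onto a small record, and the PC match onto a table look-up. The sketch below is a software illustration under stated assumptions: the type and function names are hypothetical, the field widths are guesses, and a linear search stands in for whatever matching the hardware actually performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The four commands listed on the slide. */
typedef enum {
    CMD_DISCARD_LOOP_MEMORY,
    CMD_STORE_LOOP_MEMORY,
    CMD_FILL_TICK_INSTR_BUFFER,
    CMD_LOAD_TICK_ADDR_QUEUE
} tpm_command;

typedef struct {
    uint32_t address;    /* PC value that triggers the command */
    tpm_command command; /* what the controller should do */
    uint32_t operand;    /* data used by the command */
} tpm_entry;

/* Return the first entry whose address equals pc, or NULL if none matches. */
const tpm_entry *tpm_lookup(const tpm_entry *table, size_t n, uint32_t pc)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (table[i].address == pc)
            return &table[i];
    }
    return NULL;
}
```

Because the table is keyed on the PC, the program binary itself does not need to change for the commands to fire.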
  • Slide 71
  • Command Table Allocation (node: commands; address):
      • FORK: Load Tick Address Queue x N, Fill Tick Instruction Buffer; address of FORK.
      • EOT: Load Tick Address Queue, Fill Tick Instruction Buffer; address of EOT.
      • KILL: Fill Tick Instruction Buffer; address of KILL.
      • Loops: Discard Loop Associative Memory, Store Loop Associative Memory; address at start of loop.
  • Slide 76
  • Results
  • Slide 77
  • Results. WCRT reduction: 8.5% versus locked SPMs, 12.3% versus thread-multiplexed SPM, and 13.4% versus direct-mapped caches.
  • Slide 78
  • Results
  • Slide 79
  • Results - Synthesis
  • Slide 80
  • Conclusion. Presented a new memory architecture tailored for synchronous programs. It has better worst-case performance, and its analysis time is scalable, lying between scratchpad analysis and abstract cache analysis. The presented architecture is also suitable for other synchronous languages. Future work: a data TickPAD, and TickPAD on multicores.
  • Slide 81
  • Thank You