EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit...

35
Energy Efficient GPU Transac2onal Memory via SpaceTime Op2miza2ons Wilson W. L. Fung Tor M. Aamodt

Transcript of EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit...

Page 1: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

                                         

Energy  Efficient    GPU  Transac2onal  Memory    via  Space-­‐Time  Op2miza2ons  

Wilson  W.  L.  Fung  Tor  M.  Aamodt    

Page 2: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Why  TM  for  GPU?    •  Simple  Irregular  Parallelism  on  GPU  

2  

nBody  5M  Bodies   1640s   5.2s  

Regular   Irregular  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Other  Applica2ons?  

Page 3: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Why  TM  for  GPU?  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.   3  

TM  on  GPU  

• Predictable  Dev  Time  • No  Deadlock!  • Maintainable  Code  

Wilson  Fung  

Page 4: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

TM  for  GPU:  Energy  Overhead  

•  TM  =  Specula2ve  Execu2on  =    

•  Kilo  TM:  First  Hardware  TM  for  GPU  – Simple  Design  for  Scalability  – 1000s  of  Concurrent  TransacRons  – Scalar  Transac2on  Management  – Value-­‐Based  Conflict  Detec2on  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.   4  Wilson  Fung  

GPU Memory

Page 5: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Warp-­‐Level    Transac2on  Management  

Transaction Transaction Transaction Transaction

Memory

Temporal    Conflict    Detec2on  

Last Written Time

0x00000000

0xFFFFFFFF

2Xà1.3X Energy Usage

65% Speedup

Wilson  Fung   5  Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  

Space   Time  

Page 6: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Background:  Kilo  TM  

•  1000s  of  Concurrent  TransacRons  – Value-­‐based  conflict  detecRon:  Global  Metadata              

•  Special  HW  to  boost  validaRon  and  commit  parallelism  6  

Transac2on  

Write  Log  

Global  Memory  

Commit  Transac2o

n  Write  Log  

Global  Memory  

Valida2on  

= Read  Log  

Read  Log  

Write  Log  

Global  Memory  

Execu2on    Transac2o

n  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 7: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Memory  Par22on  Memory  Par22on  Memory  Par22on  Memory  Par22on  Memory  Par22on  SIMT  Core  SIMT  Core  

Kilo  TM  Implementa2on  

7  

SIMT  Core   Memory  Par22on  TM-­‐Aware  SIMT  Stack  

TX-­‐Log  Unit  

L1  Data  Cache  

Commit  Unit   L2  Cache  

DRAM  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Commit  Protocol  

Page 8: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Efficiency  Concerns  

•  Scalar  TransacRon  Management    – Scalar  TransacRon  fits  SIMT  Model  – Simple  Design  – Poor  Use  of  SIMD  Memory  Subsystem  

•  Rereading  every  memory  locaRon  – Memory  access  takes  energy  

8  

2X    Energy  Usage  

128X    Speedup    over    

CG-­‐Locks  

40%    FG-­‐Locks  Performance  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 9: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Inefficiency  from    Scalar  Transac2on  Management  

9  

•  Kilo  TM  ignores  GPU  thread  hierarchy  – Excessive  Control  Message  Traffic  

– Scalar  ValidaRon  and  Commit    à  Poor  L2  Bandwidth  URlizaRon  

•  Simplify  HW  Design,  but  Cost  Energy  

SIMT  Core  

CU  

CU  

Send-­‐Log  

SIMT  Core  

CU  

CU  

CU-­‐Pass/Fail  

SIMT  Core  

CU  

CU  

TX-­‐Outcome  

SIMT  Core  

CU  

CU  

Commit  Done  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Commit  Unit  

Last  Level    Cache  32

 B  Port  

4B  4B   4B  4B  

Page 10: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Warp  Level  Transac2on  Management  

•  Key  Idea:    – Manage  transacRons  within  a  warp  as  a  whole    

•  Enables  opRmizaRons  that  exploit    spaRal  locality:  – Aggregate  Control  Messages  – ValidaRon  and  Commit  Coalescing  

•  Challenge:  Intra-­‐Warp  Conflicts  10  Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Transaction Transaction Transaction Transaction Memory

Page 11: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Warp  Level  Transac2on  Management:  Aggregate  Control  Messages  

11  

SIMT  Core  Commit  Unit  

Commit  Unit  

Commit  Unit  

TX1  TX2  TX3  TX4  

Scalar  Messages  12  Messages  

SIMT  Core  Commit  Unit  

Commit  Unit  

Commit  Unit  

TX1  TX2  TX3  TX4  

Aggregated  Messages  3  Messages  

Contributes  up  to  40%  of  Interconnec2on  Traffic  Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 12: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Warp  Level  Transac2on  Management:    Valida2on  and  Commit  Coalescing  

12  

TX1  TX2  TX3  TX4  

Global  Memory  (L2  cache/DRAM)  

4B   32B  Port  Read  and  Write  Logs  Without  Coalescing  

TX1  TX2  TX3  TX4  

Global  Memory  (L2  cache/DRAM)  

Coalescing  Logic  

32B  Port  32/64/128B  Read  and  Write  Logs  With  Coalescing  

Reduce  40%  of  Requests  to  L2  Cache  

Max  U2lity  =  4/32  =  12.5%  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 13: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Intra-­‐Warp  Conflict  

•  PotenRal  existence  of  intra-­‐warp  conflict  introduces  complex  corner  cases:    

13  

TX3  TX1  

Z=7  X=9  

W=7  Y=9  

TX2   TX4  

Y=8   W=6  

Z=8   X=6  

Read  Set  

Write  Set  

Global  Memory  X  =  9  Y  =  8  Z  =  7  W  =  6  

@  Valida2on  

Global  Memory  X  =  6  Y  =  9  Z  =  8  W  =  7  

All  Commijed  (Wrong)  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  

Global  Memory  X  =  9  Y  =  9  Z  =  7  W  =  7  

Global  Memory  X  =  6  Y  =  8  Z  =  8  W  =  6  

Correct  Outcomes  

OR  

Wilson  Fung  

Page 14: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Intra-­‐Warp  Conflict  Resolu2on  

•  Kilo  TM  stores  read-­‐set  and  write-­‐set  in  logs  – Compact,  fits  in  caches  –  Inefficient  for  search  

•  Naive,  pair-­‐wise  resoluRon  too  slow    – T  threads/warp,  R+W  words/thread  – O(T2  x  (R+W)2),  T  ≥  32  

14  

TX3  TX1   TX2   TX4  

O((R+W)2)    Comparisons  Each  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  

Execu2on   Valida2on   Commit  Intra-­‐Warp  Conflict  

Resolu2on  

Wilson  Fung  

Page 15: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Intra-­‐Warp  Conflict  Resolu2on:  2-­‐Phase  Parallel  Conflict  Resolu2on  

•  Insight:  Fixed  priority  for  conflict  resoluRon  enables  parallel  resoluRon  

•  O(R+W)  •  Two  Phases  

– Ownership  Table  ConstrucRon  – Parallel  Match  

15  Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 16: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Intra-­‐Warp  Conflict  Resolu2on:  2-­‐Phase  Parallel  Conflict  Resolu2on  

•  Insight:  Fixed  priority  for  conflict  resoluRon  enables  parallel  resoluRon  

16  

Ownership  Table  

TX3  WLog  

Ownership  Table    Construc2on  

TX1   TX2   TX4  WLog   WLog   WLog  

Ownership  Table  

Addr   H ID  of  Highest  Prio.  TX  Wrijen  to  H(Addr)    

Stored  in  Shared  Memory  (On-­‐Chip  Per-­‐Core  Scratchpad)  High   Low  Priority  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 17: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Intra-­‐Warp  Conflict  Resolu2on:  2-­‐Phase  Parallel  Conflict  Resolu2on  

17  O(R+W)  

Parallel  Match  

TX1   TX2   TX4  

WLog  RLog  

WLog  RLog  

WLog  RLog  

Ownership  Table  

TX3  

WLog  RLog  

•  Insight:  Fixed  priority  for  conflict  resoluRon  enables  parallel  resoluRon  

O(W)  

Ownership  Table  

TX3  WLog  

Ownership  Table    Construc2on  

TX1   TX2   TX4  WLog   WLog   WLog  

Read-­‐Log:  Owner  ID  <  My  ID  à  Abort  (E.g.  Owner  ID  =  2  à  Abort)  

Write-­‐Log:  OwnerID  !=  My  ID  à  Abort  (E.g.  Owner  ID  =  3  à  Pass)  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 18: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Warp  Level  Transac2on  Management  Made  Prac2cal  

   

•  Enables  opRmizaRons  that  exploit    spaRal  locality:  – Aggregate  Control  Messages  – ValidaRon  and  Commit  Coalescing  

•  Challenge:  Intra-­‐Warp  Conflicts  

18  Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Transaction Transaction Transaction Transaction Memory

Page 19: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Temporal  Conflict  Detec2on  

•  MoRvaRon:    Skip  value-­‐based  conflict  detecRon  for    conflict-­‐free  read-­‐only  transacRons  

– 40%  and  85%  of  the  transacRons  in  two  of  our  workloads.    

19  

TX1  if  (C  ==  0)      B  =  B  +  1;  

TX2  int  K;    K  =  X  +  Y;  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  

Data-­‐Dependent  Control  Flow  

Consistent  View  of    Memory  

Wilson  Fung  

Page 20: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Temporal  Conflict  Detec2on  

•  If  LastWrihenTime(X)  <  StartTime,  Pass  •  Otherwise,  Conflict  Detected  

20  

Transac2on  Start  Time  

Global  Memory  (L2  cache/DRAM)  

Last  Wrihen  Time  

Load  [X]  

Data  +  LastWrihenTime(X)  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Global  Memory  (L2  cache/DRAM)  

Last  Wrihen  Time  

Store  [X]  

LastWrihenTime(X)  

Globally  Synchronous    On-­‐Chip  Timer  

Page 21: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Life  Time  of  [A]  loaded  by  TX1  

Temporal  Conflict  Detec2on  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.   21  

TX1  LD  [A];  LD  [B];  

ST  [B]   Time  

Life  Time  of  [B]  loaded  by  TX1  

ST  [A]  ST  [A]  TX1  LD  [A]  

TX1  Starts  

TX1  LD  [B]  

Effec2ve  instantaneous  execu2on  2me  for  TX1  w.r.t.  other  threads  

Wilson  Fung  

Page 22: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

[B]  loaded  by  TX2  

Temporal  Conflict  Detec2on  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.   22  

TX1  LD  [A];  LD  [B];  

ST  [B]   Time  Life  Time  of  [A]  loaded  by  TX1  

ST  [A]  ST  [A]  TX1  LD  [A]  

TX1  Starts  

TX1  LD  [B]  

Value  loaded  by  LD  [A]  and  value  loaded  by  LD  [B]  cannot  coexists  at  any  point  of  2me  –  a  detected  conflict.    

Wilson  Fung  

Page 23: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Temporal  Conflict  Detec2on  Implementa2on  

Wilson  Fung   Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.   23  

Memory  Par22on  Memory  Par22on  Memory  Par22on  Memory  Par22on  Memory  Par22on  SIMT  Core  SIMT  Core  SIMT  Core   Memory  Par22on  

Start  Time  Table  

Last  Wrijen  Time  Table  

Time  Addr   H

16kB  Recency  Bloom  Filter  •   Approximate  but  Conserva2ve  •   Aliasing  two  very  old  store  is  OK  

Page 24: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Evalua2on  

24  

•  GPGPU-­‐Sim  3.2.1  – Detailed:  IPC  CorrelaRon  of  0.90  vs.  Fermi  GPU  

•  Model  Energy  Overhead  of  Kilo  TM    –  Extra  Hardware    

•  CACTI  for  access  energy  of  major  SRAM  arrays    –  Extra  AcRvity  via  GPUWahch  –  Increased  ExecuRon  Time  (More  Leakage)  

•  GPU  TM  ApplicaRons  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  

HT-[H/M/L] – Hash Table Construction ATM – Bank Transactions BH-[H/L] – Barnes Huts (N-Body) CL/CLto – Cloth Simulation CC – Maxflow/Mincut Graph AP – Data Mining

Wilson  Fung  

Page 25: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Results  

25  

0   1   2   3  

WarpTM+TCD  WarpTM  

TCD  KiloTM-­‐Base  

Execu2on  Time  Normalized  to  FGLock  

0   1   2   3  

WarpTM+TCD  WarpTM  

TCD  KiloTM-­‐Base  

Energy  Usage  Normalized  to  FGLock  

2Xà1.3X  Energy  Usage  

40%à66%  FG-­‐Lock  

Performance  

Low  Conten2on  Workload:    Kilo  TM  w/  SW  Op2miza2ons  on  par  with  FG  Lock    

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 26: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Summary  

•  Two  Enhancements  for  Kilo  TM  – Warp  Level  TransacRon  Management    

•  Exploit  SpaRal  Locality  in  Thread  Hierarchy  – Temporal  Conflict  DetecRon    

•  Silent  Commit  of  Read-­‐Only  TransacRon      

•  Reduce  Performance  and  Energy  Overhead  of  Kilo  TM    

•  Low  ContenRon  Workload:    Kilo  TM  w/  OpRmizaRons  on  par  with  FG  Lock    

26  Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Ques2ons?  

Page 27: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

BACKUP  SLIDES  

Wilson  Fung   Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.   27  

Page 28: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Normalized  Performance  

28  

0  

1  

2  

3  

4  

5  

6  

HT-­‐H   HT-­‐M   HT-­‐L   ATM   CL   CLto   BH-­‐H   BH-­‐L   CC   AP   AVG  

Exec.  Tim

e  Normalized

 to  FGLock  

KiloTM-­‐Base   TCD   WarpTM   WarpTM+TCD  

40%à66%  FG-­‐Locks  Performance  

Low  Conten2on  Workload:    Kilo  TM  w/  SW  Op2miza2ons  on  par  with  FG  Lock    

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 29: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Normalized  Energy  Usage  

29  

0  

1  

2  

FGLock  

KiloTM

-­‐Base  

WarpT

M+TCD

 

FGLock  

KiloTM

-­‐Base  

WarpT

M+TCD

 

FGLock  

KiloTM

-­‐Base  

WarpT

M+TCD

 

FGLock  

KiloTM

-­‐Base  

WarpT

M+TCD

 

FGLock  

KiloTM

-­‐Base  

WarpT

M+TCD

 

FGLock  

KiloTM

-­‐Base  

WarpT

M+TCD

 

FGLock  

KiloTM

-­‐Base  

WarpT

M+TCD

 

FGLock  

KiloTM

-­‐Base  

WarpT

M+TCD

 

FGLock  

KiloTM

-­‐Base  

WarpT

M+TCD

 

FGLock  

KiloTM

-­‐Base  

WarpT

M+TCD

 

HT-­‐H   HT-­‐M   HT-­‐L   ATM   CL   CLto   BH-­‐H   BH-­‐L   CC   AP  

Energy  Usage  Normalized

 to  FGLock  

Core   L1Cache   SMem   NOC   L2Cache   DRAM   KiloTM   Idle   Leakage  4.3X   2.9X  2.6X  2.5X  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 30: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Intra-­‐Warp  Conflict  Resolu2on:  2-­‐Phase  Parallel  Conflict  Resolu2on  

30  Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 31: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

2PCR  vs.  SCR  

31  

0  

1  

2  

3  

4  

5  

6  

HT-­‐H   HT-­‐M   HT-­‐L   ATM   CL   CLto   BH-­‐H   BH-­‐L   CC   AP  

Exec.  Tim

e  Normalized

 to  FGLock  

KiloTM-­‐Base   WarpTM+2PCR  WarpTM+2PCR(NoOverhead)   WarpTM+SCR  WarpTM+SCR(NoOverhead)  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 32: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Spa2al  Locality  among  Transac2ons  

32  

0%  

10%  

20%  

30%  

40%  

50%  

60%  

70%  

HT-­‐H  HT-­‐M   HT-­‐L   ATM   CL   CLto   BH-­‐H   BH-­‐L   CC   AP   AVG  

ReadAccess   WriteAccess  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung  

Page 33: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.   33  

ABA  Problem?  

•  Classic  Example:  Linked  List  Based  Stack  

•  Thread  0  –  pop():  while (true) { t = top; Next = t->Next;

// thread 2: pop A, pop B, push A if (atomicCAS(&top, t, next) == t) break; // succeeds! }

top   A  Next  

B  Next  

C  Next   Null  

top   A  Next  

C  Next   Null  

top   B  Next  

C  Next   Null  

top   B  Next  

C  Next   Null  

Next   B  t   A  

top   C  Next   Null  

Wilson  Fung  

Page 34: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.   34  

ABA  Problem?  

•  atomicCAS  protects  only  a  single  word  – Only  part  of  the  data  structure  

•  Value-­‐based  conflict  detec2on  protects    all  relevant  parts  of  the  data  structure  

top   A  Next  

B  Next  

C  Next   Null  

while (true) { t = top; Next = t->Next; if (atomicCAS(&top, t, next) == t) break; // succeeds! }

Wilson  Fung  

Page 35: EnergyEfficient GPU’Transac2onal’Memory ...SIMT’Stack TXLog Unit L1!Data Cache! Commit’ Unit L2Cache’ DRAM Wilson!Fung! Energy!EfficientGPU!TMviaSpaceFTime!Opt.! Commit’

ABA  Problem?  

•  If  every  memory  input  value  is  idenRcal,    the  transacRon  code  should  generate  the  same  output.    –  No  point  to  re-­‐execute  transacRon  for  ABA  event.    

35  

Time  

TX1  if  (C  ==  0)      B  =  B  +  1;  

See  Tech.  Report:  hhp://www.ece.ubc.ca/~aamodt/papers/wwlfung.tr2012.pdf  

TX1  LD  [C]  

TX1  LD  [B]  

B  =  3   B  =  3   B  =  4  

TX2  Commit  

B  =  1  

TX2  ...  B  =  B  -­‐  2;  

TX3  ...    B  =  B  +  2;  

TX3  Commit  

TX1  Validate  Pass  

TX1  Commit  

Energy  Efficient  GPU  TM  via  Space-­‐Time  Opt.  Wilson  Fung