Transactional Coherence and Consistency
Presenter: Muhammad Mohsin Butt (g201103010)
COE-502 Paper Presentation 2
Outline
1. Introduction
2. Current Hardware
3. TCC in Hardware
4. TCC in Software
5. Performance Evaluation
6. Conclusion
• Transactional Coherence and Consistency (TCC) provides a lock-free transactional model that simplifies parallel hardware and software.
• Transactions, defined by the programmer, are the basic unit of parallel work.
• Memory coherence, communication, and memory consistency are implicit within a transaction.
Introduction
• Provide the illusion of a single shared memory to all processors.
• The problem is divided into parallel tasks that work on shared data in shared memory.
• Complex cache coherence protocols are required.
• Memory consistency models are also required to ensure program correctness.
• Locks are used to prevent data races and provide sequential access.
• Too much locking overhead can degrade performance.
Current Hardware
TCC in Hardware
• Processors execute speculative transactions in a continuous cycle.
• A transaction is a sequence of instructions, marked by software, that is guaranteed to execute and complete atomically.
• Provides an "All Transactions, All the Time" model that simplifies parallel hardware and software.
TCC in Hardware
• While a transaction executes, its writes are collected in a local buffer.
• After completing the transaction, the hardware arbitrates system-wide for permission to commit it.
• After acquiring permission, the node broadcasts all of the transaction's writes as a single packet.
• Transmission as a single packet reduces the number of interprocessor messages and arbitrations.
• Other processors snoop on these write packets to detect dependence violations.
TCC in Hardware
• TCC simplifies cache design.
• Processors hold data in unmodified or speculatively modified form.
• During snooping, a line is invalidated if the commit packet contains its address only.
• A line is updated if the commit packet contains both address and data.
• Protection against data dependencies:
• If a processor has read from any address in the commit packet, its transaction is re-executed.
TCC in Hardware
• Current CMPs need features that provide speculative buffering of memory references and commit arbitration control.
• A mechanism is required for gathering all modified cache lines from each transaction into a single packet:
• A write buffer completely separate from the cache.
• An address buffer containing a list of tags for lines holding data to be committed.
TCC in Hardware
• Read bits
• Set on a speculative read during a transaction.
• The current transaction is violated and restarted if the snoop protocol sees a commit packet containing the address of a location whose read bit is set.
• Modified bits
• Set to 1 by stores during a transaction.
• On a violation, lines with the modified bit set to 1 are invalidated.
TCC in Software
• Programming with TCC is a three-step process:
• Divide the program into transactions.
• Specify the transaction order.
• Can be relaxed if not required.
• Tune performance.
• TCC provides feedback on where in the program violations occur frequently.
Loop Based Parallelization
• Consider a histogram calculation over 1000 integer percentages.
/* input */
int *in = load_data();
int i, buckets[101] = { 0 };  /* 101 bins: 0..100 percent */
for (i = 0; i < 1000; i++) {
    buckets[in[i]]++;
}
/* output */
print_buckets(buckets);
Loop Based Parallelization
• Can be parallelized by changing the loop to:
t_for (i = 0; i < 1000; i++)
• Each loop body becomes a separate transaction.
• When two parallel iterations try to update the same histogram bucket, the TCC hardware causes the later transaction to violate, forcing it to re-execute.
• A conventional shared-memory model would require locks to protect the histogram bins.
• Can be further optimized using t_for_unordered(), since the iterations need not commit in order.
Fork Based Parallelization
• t_fork() forces the parent transaction to commit and creates two completely new transactions:
• One continues executing the remaining code.
• The other starts executing the function passed as a parameter, e.g.:
/* Initial setup */
int PC = INITIAL_PC;
int opcode = i_fetch(PC);
while (opcode != END_CODE) {
    t_fork(execute, &opcode, 1, 1, 1);
    increment_PC(opcode, &PC);
    opcode = i_fetch(PC);
}
Explicit transaction commit ordering
• Provides partial ordering.
• Done by assigning two parameters to each transaction:
• A sequence number and a phase number.
• Transactions with the same sequence number commit in an order defined by the programmer.
• Transactions with different sequence numbers are independent.
• Ordering among transactions with the same sequence number is achieved through the phase number.
• The transaction with the lowest phase number commits first.
Performance Evaluation
Performance Evaluation
• Maximize parallelism.
• Create as many parallel transactions as possible.
• Minimize violations.
• Keep transactions small to reduce the amount of work lost on a violation.
• Minimize transaction overhead.
• Do not make transactions too small.
• Avoid buffer overflow.
• It can result in excessive serialization.
Performance Evaluation
• Base case.
• Simple parallelization without any optimization.
• Unordered.
• Finding loops that can be unordered.
• Reduction.
• Finding areas that can exploit reduction operations.
• Privatization.
• Privatizing, per transaction, the variables that cause violations.
• Using t_commit().
• Breaks large transactions into small ones that still execute on the same processor. Reduces the work lost on violations and prevents buffer overflow.
• Loop adjustments.
• Using the various loop-adjustment optimizations provided by the compiler.
Performance Evaluation
• Privatization and t_commit() improve performance.
• Inner loops had too many violations; applying loop_adjust to the outer loop improved results.
Performance Evaluation
• CMP performance is close to ideal TCC for a small number of processors.
Conclusions
• Bandwidth limitation is still a problem for scaling TCC to more processors.
• No support for nested parallel loops.
• Dynamic optimization techniques are still required to automate performance tuning on TCC.