Branchpredictorscomputerarchitecture5thmay2014 140610214235 Phpapp02 (1)

The University of Texas at Dallas

Department of Electrical Engineering

EECE/CS 6304: COMPUTER ARCHITECTURE

PROJECT #2

“ANALYSIS OF DIFFERENT TYPES OF BRANCH PREDICTORS”

Submitted by,

Bharat Biyani (2021152193)

Shree Viswa Shamanthan L D (2021180127)

INTRODUCTION

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch will go before this is known for certain (i.e., before the branch is executed). Its purpose is to improve the flow of the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86.

In this project, we analyze the behavior of different branch predictor configurations on three well-recognized benchmarks, namely GCC, ANAGRAM and GO. We used the SimpleScalar sim-outorder simulator, which models all the execution aspects of the Alpha 21264. The simulations report CPI values, which we use to compare the configurations across the benchmarks.

We have used three types of hardware-based branch prediction strategies:

1) Bimodal Predictor: A simple predictor that uses 2-bit saturating counters to predict whether a given branch is likely to be taken or not.

2) Two Level Predictor: A two-level adaptive predictor with an n-bit history can predict any repetitive sequence with any period, provided all n-bit sub-sequences are different. Its advantage is that it can quickly learn to predict an arbitrary repetitive pattern.

3) Combined Predictor: A hybrid predictor, also called a combined predictor, implements more than one prediction mechanism. The final prediction is based either on a meta-predictor that remembers which of the predictors has made the best predictions in the past, or on a majority vote among an odd number of different predictors.
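The 2-bit saturating counter behind the bimodal predictor can be sketched as follows. This is a minimal illustration and not SimpleScalar's implementation: the names bimod_sketch_t, bimod_index, bimod_predict and bimod_update are hypothetical, and the index function and the initial counter value are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BIMOD_ENTRIES 256   /* table size, as in -bpred:bimod 256; power of two */

/* Hypothetical predictor state: one 2-bit counter (0..3) per table entry. */
typedef struct {
    uint8_t counter[BIMOD_ENTRIES];
} bimod_sketch_t;

/* Assumption: every counter starts at 2 ("weakly taken"). */
void bimod_init(bimod_sketch_t *p) {
    memset(p->counter, 2, sizeof p->counter);
}

/* Assumed index function: drop 2 alignment bits, wrap to the table size. */
unsigned bimod_index(uint32_t pc) {
    return (pc >> 2) & (BIMOD_ENTRIES - 1);
}

/* Predict taken when the counter is in its upper half (2 or 3). */
int bimod_predict(const bimod_sketch_t *p, uint32_t pc) {
    return p->counter[bimod_index(pc)] >= 2;
}

/* Move the counter one step toward the actual outcome, saturating at 0 and 3. */
void bimod_update(bimod_sketch_t *p, uint32_t pc, int taken) {
    uint8_t *c = &p->counter[bimod_index(pc)];
    assert(*c <= 3);
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

Because the counter saturates, a branch that has been taken several times in a row is still predicted taken after a single not-taken outcome; two consecutive not-taken outcomes are needed to flip the prediction.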

Part 1: Performance analysis of different types of branch predictors

The simulation is done for different configurations of the Return Address Stack (RAS) and different types of branch predictor.

Baseline: Bimodal predictor with the default RAS size of 8.
    -bpred bimod -bpred:bimod 256 -bpred:ras 8 -bpred:btb 64 2

2 Level Predictor: Two-level adaptive predictor with one first-level history register, a 256-entry second-level table and a 4-bit history, without XOR.
    -bpred 2lev -bpred:2lev 1 256 4 0 -bpred:ras 8 -bpred:btb 64 2

Comb: Combines a two-level predictor and a bimodal predictor.
    -bpred comb -bpred:comb 256 -bpred:bimod 256 -bpred:2lev 1 256 4 0 -bpred:ras 8 -bpred:btb 64 2

RAS 4: Baseline with the return address stack (RAS) size changed to 4.
    -bpred bimod -bpred:bimod 256 -bpred:ras 4 -bpred:btb 64 2

RAS 16: Baseline with the return address stack (RAS) size changed to 16.
    -bpred bimod -bpred:bimod 256 -bpred:ras 16 -bpred:btb 64 2
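The RAS configurations above differ only in the depth of the return address stack. As a rough illustration of why depth can matter, the sketch below models the RAS as a small circular buffer that overwrites its oldest entry when call nesting exceeds its size. It is a simplified model under stated assumptions, not SimpleScalar's code; ras_sketch_t, ras_init, ras_push and ras_pop are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define RAS_MAX 16   /* largest depth used in this report */

/* Hypothetical RAS model: a circular buffer of return addresses. */
typedef struct {
    uint32_t entry[RAS_MAX];
    unsigned size;   /* configured depth, e.g. 4, 8 or 16 */
    unsigned top;    /* slot of the most recently pushed entry */
} ras_sketch_t;

void ras_init(ras_sketch_t *r, unsigned size) {
    assert(size > 0 && size <= RAS_MAX);
    r->size = size;
    r->top = 0;
    for (unsigned i = 0; i < RAS_MAX; i++)
        r->entry[i] = 0;   /* popping an empty stack yields 0 in this sketch */
}

/* On a call: store the return address, overwriting the oldest entry when full. */
void ras_push(ras_sketch_t *r, uint32_t ret_addr) {
    r->top = (r->top + 1) % r->size;
    r->entry[r->top] = ret_addr;
}

/* On a return: use the top entry as the predicted target and step back. */
uint32_t ras_pop(ras_sketch_t *r) {
    uint32_t target = r->entry[r->top];
    r->top = (r->top + r->size - 1) % r->size;
    return target;
}
```

With six nested calls, a 4-entry stack in this model predicts the four innermost returns correctly and then mispredicts, whereas an 8-entry stack predicts all six. Programs whose call depth rarely exceeds the smallest stack would see little difference between the RAS sizes.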

Performance Analysis based on CPI

Sr. No.  Configuration      GCC     ANAGRAM  GO
1        Baseline           0.95    0.4674   0.7571
2        2 Level Predictor  0.9822  0.4605   0.7893
3        Comb               0.8678  0.4546   0.7516
4        Bimod: RAS 4       0.9538  0.4678   0.7574
5        Bimod: RAS 16      0.9498  0.4674   0.7571

Graphical Representation with above CPI

[Figure: bar chart of CPI (y-axis, 0 to 1.2) for the Baseline, 2 Level Predictor, Comb, RAS 4 and RAS 16 configurations, with one series per benchmark (ANAGRAM, GO, GCC).]

The graph above displays the CPI of each branch predictor configuration for the three benchmarks.

Analysis: Benchmark – GCC vs BP Configurations

The GCC benchmark has a higher CPI than the other two benchmarks. Its CPI is lowest for the combination of the two-level and bimodal predictors (Comb, 0.8678) and highest for the two-level predictor (0.9822).

Analysis: Benchmark – ANAGRAM vs BP Configurations

From the graph above, we can infer that the ANAGRAM benchmark has a lower CPI than the other two benchmarks. Its CPI is fairly constant across all the branch predictor configurations and is lowest for the combination of the two-level and bimodal predictors (Comb, 0.4546).

Analysis: Benchmark – GO vs BP Configurations

The graph above shows that the GO benchmark has a lower CPI than the GCC benchmark. Its CPI is almost constant across all the branch predictor configurations and is lowest for the combination of the two-level and bimodal predictors (Comb, 0.7516). Among the bimodal configurations, increasing the return address stack (RAS) size from 4 to 16 lowers the CPI only from 0.7574 to 0.7571, so RAS size does not have much impact on CPI.
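To put the CPI differences on a common scale, the relative reduction in CPI against the baseline can be computed as (CPI_base - CPI_new) / CPI_base. The helper below is a minimal sketch of that arithmetic; cpi_reduction_percent is a hypothetical name and is not part of SimpleScalar.

```c
#include <assert.h>

/* Percentage reduction in CPI when moving from cpi_base to cpi_new.
   A positive result means the new configuration needs fewer cycles
   per instruction; a negative result means it needs more. */
double cpi_reduction_percent(double cpi_base, double cpi_new) {
    assert(cpi_base > 0.0);
    return (cpi_base - cpi_new) / cpi_base * 100.0;
}
```

Applied to the values in the CPI table, Comb reduces CPI relative to the baseline by roughly 8.7% for GCC (0.95 to 0.8678), 2.7% for ANAGRAM (0.4674 to 0.4546) and 0.7% for GO (0.7571 to 0.7516).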

Performance Analysis based on Address Hit Rates



Graphical Representation with above Address Hit Rates

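[Figure: bar chart of address hit rate (y-axis, 0 to 1.2) for the Baseline, 2 Level Predictor, Comb, Bimod: RAS 4 and Bimod: RAS 16 configurations, with one series per benchmark (GCC, GO, ANAGRAM).]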

The graph above shows the address hit rate of each branch predictor configuration for the three benchmarks.

For the ANAGRAM benchmark, the address hit rates are high for every configuration except the bimodal predictor with a Return Address Stack (RAS) size of 4.

For the GO benchmark, the address hit rates are high for every configuration except the two-level predictor.

For the GCC benchmark, the address hit rates are high for every configuration except the two-level predictor.

Performance Analysis based on Direction Hit Rates



The graph for the Direction Hit Rates with respect to every benchmark will provide us more information on the effect of branch prediction configurations on different benchmarks.

Graphical Representation with above Direction Hit Rates

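[Figure: bar chart of direction hit rate (y-axis, 0 to 1.2) for the Baseline, 2 Level Predictor, Comb, Bimod: RAS 4 and Bimod: RAS 16 configurations, with one series per benchmark (GCC, GO, ANAGRAM).]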

The direction hit rates stay fairly constant across the configurations for each benchmark. The ANAGRAM benchmark has higher direction hit rates than the other two benchmarks. The two-level predictor gives the lowest direction hit rate, while the bimodal configurations in which the Return Address Stack size is changed from 8 to 16 or from 8 to 4 perform better.

Part 2: Modification of the code to accommodate address misses

We carried out modifications in the following two SimpleScalar files:

1) bpred.h
2) bpred.c

1) Changes in file bpred.h:

----------------
/* branch predictor def */
struct bpred_t {
  ------
  } dirpred;
  struct {
    --------
  } retstack;

  /* stats */
  counter_t addr_hits;    /* num correct addr-predictions */
  counter_t dir_hits;     /* num correct dir-predictions (incl addr) */
  counter_t addr_misses;  /* num addr_misses */
  counter_t used_ras;     /* num RAS predictions used */
  counter_t used_bimod;   /* num bimodal predictions used (BPredComb) */
  -----------
};

2) Changes in file bpred.c:

-----------
sprintf(buf, "%s.dir_hits", name);
stat_reg_counter(sdb, buf,
                 "total number of direction-predicted hits "
                 "(includes addr-hits)",
                 &pred->dir_hits, 0, NULL);
sprintf(buf, "%s.addr_misses", name);
stat_reg_counter(sdb, buf, "total number of addr-misses",
                 &pred->addr_misses, 0, NULL);
-----------
if (bpred == NULL)
  return;

bpred->dir_hits = 0;
bpred->addr_misses = 0;
-----------
/* Have a branch here */
if (correct)
  pred->addr_hits++;

if (!!pred_taken == !!taken)
  pred->dir_hits++;
else
  pred->misses++;

pred->addr_misses = (pred->misses + pred->dir_hits - pred->addr_hits);
-----------
-----------
}
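The update rule above relies on every recorded branch incrementing exactly one of dir_hits or misses, so their sum is the number of recorded branches and subtracting addr_hits leaves the address misses. The stand-alone sketch below reproduces that bookkeeping outside the simulator; bpred_stats_sketch_t and record_branch are hypothetical names, and the struct is a stand-in for the relevant counters of bpred_t rather than the real type.

```c
#include <assert.h>

/* Stand-in for the counters used by the modification (hypothetical type). */
typedef struct {
    unsigned long long addr_hits;
    unsigned long long dir_hits;
    unsigned long long misses;
    unsigned long long addr_misses;
} bpred_stats_sketch_t;

/* Record one resolved branch, mirroring the excerpt above:
   correct    - predicted target address matched the actual target
   pred_taken - predicted direction
   taken      - actual direction */
void record_branch(bpred_stats_sketch_t *s, int correct, int pred_taken, int taken) {
    if (correct)
        s->addr_hits++;

    if (!!pred_taken == !!taken)
        s->dir_hits++;
    else
        s->misses++;

    /* dir_hits + misses counts every recorded branch. */
    s->addr_misses = s->misses + s->dir_hits - s->addr_hits;
}
```

Under this rule, addr_hits + addr_misses always equals the number of recorded branches. The Part 3 results are consistent with that: for GCC, 2235498 + 1084176 and 2095859 + 1223815 both give 3319674.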

Part 3: Comparison of BTB Performance

The simulation is done for the following configurations of Branch Target Buffer:

Baseline BTB configuration: 64 sets, 2-way associativity
    -bpred bimod -bpred:bimod 256 -bpred:btb 64 2

Showing the effect of the number of sets in the BTB with the following options:
    -bpred bimod -bpred:bimod 256 -bpred:btb 32 2
    -bpred bimod -bpred:bimod 256 -bpred:btb 128 2

Showing the effect of associativity when the total size of the BTB is fixed with the following options:
    -bpred bimod -bpred:bimod 256 -bpred:btb 32 4
    -bpred bimod -bpred:bimod 256 -bpred:btb 128 1

Performance Analysis based on addr_hits


Sr. No.  BTB configuration  GCC      ANAGRAM  GO
1        64 sets/2 way      2235498  2771048  1934760
2        32 sets/2 way      2095859  2746365  1832302
3        128 sets/2 way     2389785  2777415  2008597
4        32 sets/4 way      2260256  2775372  1936745
5        128 sets/1 way     2197498  2759944  1893595

Graphical Representation with above addr_hits

[Figure: bar chart of addr_hits (y-axis, 0 to 3000000) for the 64 sets/2 way, 32 sets/2 way, 128 sets/2 way, 32 sets/4 way and 128 sets/1 way BTB configurations, with one series per benchmark (GO, GCC, ANAGRAM).]

The graph above shows the address hits of the various Branch Target Buffer (BTB) configurations for the three benchmarks. ANAGRAM has the highest address hits, GCC has moderate address hits, and GO has the lowest. For all three benchmarks, the address hits are lowest for the BTB with 32 sets and 2-way set associativity.

Comparison of BTB Performance based on addr_misses


Sr. No.  BTB configuration  GCC      ANAGRAM  GO
1        64 sets/2 way      1084176  127541   801464
2        32 sets/2 way      1223815  152224   903922
3        128 sets/2 way     929889   121174   727627
4        32 sets/4 way      1059418  123217   799479
5        128 sets/1 way     1122176  138645   842629

Graphical Representation with above addr_misses


[Figure: bar chart of addr_misses (y-axis, up to 1400000) for each BTB configuration, with one series per benchmark (ANAGRAM, GO, GCC).]

As expected, the graph above shows that ANAGRAM has the fewest address misses and GCC the most. Decreasing the number of sets from 64 to 32 increases the address misses, and increasing it from 64 to 128 decreases them, because a larger number of sets reduces capacity misses. In the 32 sets/4 way configuration, the address misses decrease even though the number of sets drops from 64 to 32, because the higher associativity reduces conflict misses. In the 128 sets/1 way configuration, the address misses increase relative to the baseline despite the larger number of sets, because direct mapping causes more conflict misses.
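The interaction between sets and associativity described above can be illustrated with a small set-associative BTB model. This is a sketch under stated assumptions and not SimpleScalar's BTB code: the names btb_sketch_t, btb_init, btb_set_index, btb_lookup and btb_insert are hypothetical, the index function is assumed, and round-robin replacement is used for brevity where a real BTB may use another policy such as LRU.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BTB_MAX_SETS 128
#define BTB_MAX_WAYS 4

/* Hypothetical BTB model: sets x ways entries, each tagged by branch address. */
typedef struct {
    uint32_t branch_pc[BTB_MAX_SETS][BTB_MAX_WAYS];
    uint32_t target[BTB_MAX_SETS][BTB_MAX_WAYS];
    uint8_t  valid[BTB_MAX_SETS][BTB_MAX_WAYS];
    unsigned victim[BTB_MAX_SETS];   /* next way to replace (round-robin) */
    unsigned sets, ways;
} btb_sketch_t;

void btb_init(btb_sketch_t *b, unsigned sets, unsigned ways) {
    assert(sets > 0 && sets <= BTB_MAX_SETS && (sets & (sets - 1)) == 0);
    assert(ways > 0 && ways <= BTB_MAX_WAYS);
    memset(b, 0, sizeof *b);
    b->sets = sets;
    b->ways = ways;
}

/* Assumed index function: drop 2 alignment bits, keep log2(sets) bits. */
unsigned btb_set_index(const btb_sketch_t *b, uint32_t pc) {
    return (pc >> 2) & (b->sets - 1);
}

/* Returns 1 and writes *target on a hit, 0 on a miss. */
int btb_lookup(const btb_sketch_t *b, uint32_t pc, uint32_t *target) {
    unsigned s = btb_set_index(b, pc);
    for (unsigned w = 0; w < b->ways; w++) {
        if (b->valid[s][w] && b->branch_pc[s][w] == pc) {
            *target = b->target[s][w];
            return 1;
        }
    }
    return 0;
}

/* Record a branch target, replacing the round-robin victim if pc is absent. */
void btb_insert(btb_sketch_t *b, uint32_t pc, uint32_t target) {
    unsigned s = btb_set_index(b, pc);
    unsigned w;
    for (w = 0; w < b->ways; w++)
        if (b->valid[s][w] && b->branch_pc[s][w] == pc)
            break;
    if (w == b->ways) {              /* not present: take the victim way */
        w = b->victim[s];
        b->victim[s] = (w + 1) % b->ways;
    }
    b->branch_pc[s][w] = pc;
    b->target[s][w] = target;
    b->valid[s][w] = 1;
}
```

In this model, branches at 0x1000 and 0x1200 map to the same set in both the 128 sets/1 way and the 32 sets/4 way organizations. In the direct-mapped case the second branch evicts the first, producing a conflict miss on the next lookup, while in the 4-way case both entries coexist.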

Comparison of BTB Performance based on CPI


Sr. No.  BTB configuration  GCC     ANAGRAM  GO
1        64 sets/2 way      0.9500  0.4674   0.7571
2        32 sets/2 way      0.9664  0.4711   0.7645
3        128 sets/2 way     0.9304  0.4664   0.7496
4        32 sets/4 way      0.9491  0.4670   0.7575
5        128 sets/1 way     0.9528  0.4686   0.7583

Graphical Representation with above CPI


[Figure: bar chart of CPI (y-axis, up to 1.2) for each BTB configuration, with one series per benchmark (GCC, ANAGRAM, GO).]

The graph above shows that CPI remains fairly constant across the BTB configurations for every benchmark. ANAGRAM has the lowest CPI and GCC the highest for all the BTB configurations. CPI is higher for the 32 sets/2 way configuration than for 64 sets/2 way, which has twice as many sets. For the 32 sets/4 way and 128 sets/1 way configurations, which hold the same total number of entries as the baseline, the CPI is almost equal to that of 64 sets/2 way. The 128 sets/2 way configuration gives the lowest CPI of all the configurations.

Comparison of BTB Performance based on Branch Predictor Hit Rates


Sr. No.  BTB configuration  GCC     ANAGRAM  GO
1        64 sets/2 way      0.6779  0.9546   0.6926
2        32 sets/2 way      0.636   0.9476   0.6527
3        128 sets/2 way     0.7221  0.9557   0.7225
4        32 sets/4 way      0.6852  0.9573   0.6931
5        128 sets/1 way     0.665   0.9518   0.6775

Graphical Representation with above Branch Predictor Hit Rates

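[Figure: bar chart of branch predictor hit rate (y-axis, 0 to 1.2) for the 64 sets/2 way, 32 sets/2 way, 128 sets/2 way, 32 sets/4 way and 128 sets/1 way BTB configurations, with one series per benchmark (GCC, ANAGRAM, GO).]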

The graph above shows that the branch predictor hit rate drops for all the benchmarks when the number of sets in the BTB decreases. Of the configurations tested, the BTB with 32 sets and 2-way set associativity has the lowest hit rate for every benchmark.

CONCLUSION

For an optimal branch predictor, a BTB with a larger number of sets is recommended, but the tradeoff between cost and performance should be taken into consideration.

For high address hit rates and direction hit rates, the simulation results suggest that the combination of the two-level and bimodal predictors (Comb) is the better configuration.
