REPORT

A Comparative Analysis of Branch Prediction Schemes

Zhendong Su and Min Zhou

Computer Science Division
University of California at Berkeley
Berkeley, CA 94720

Abstract

Conditional branches are a major obstacle to higher performance in high-performance CPUs. Accurate branch prediction is required to overcome this limitation and is the key to many techniques for enhancing and exploiting instruction-level parallelism (ILP). Many different branch prediction schemes have been proposed. Most of this work has been based on benchmark programs including SPEC89 and SPEC92. In this report, we present a comparative analysis of a few well-known branch prediction schemes on the SPARC architecture, based on a partially new collection of benchmark programs including SPECint95-beta and SPECfp92. Compared with previous work, we have several interesting findings. We first show the performance of several well-known dynamic branch prediction schemes. From the results obtained, we conclude that the selective predictor achieves the lowest misprediction rate for a given branch prediction buffer size. We observe that static predictors without code expansion cannot compete with dynamic predictors. One SPECint95 program experiences very low prediction accuracy under these common schemes. We also observe that misleading data and conclusions may result from either tracing only a few test programs or tracing just a small portion of a program. Finally, we notice that context switching has little impact on branch prediction with today's fast CPUs, and that complex schemes need a longer time to warm up than simple schemes.


Introduction

Today's fast CPUs allow very deep pipelines and wide issue rates, which are two of the most effective ways of improving processor performance. Branches impede machine performance because a conditional branch is not resolved until its condition is evaluated and its target address is calculated, and an unconditional branch is not resolved until its target address is calculated. As pipelines get deeper or issue rates get higher, the penalty imposed by branches gets larger. One way to reduce this penalty is to predict the direction of a conditional branch and to prefetch, decode, and execute the instructions at the predicted target. A large amount of speculative work has to be thrown away after a branch misprediction. This results in a higher misprediction penalty as the memory hierarchy gets more complex. Extremely accurate branch prediction is thus the key to reducing this penalty. Many schemes have been proposed to reduce the misprediction rate.

Branch prediction schemes can be classified into static schemes and dynamic schemes according to how the prediction is made. Static prediction schemes can be simple. The most straightforward one predicts every branch to be taken, based on the observation that the majority of branches are taken. As reported by Lee and Smith [LS92], this simple strategy predicts correctly 68% of the time. In our study, 65% of the conditional branches among the dynamic instructions traced are taken. Our traces also indicate that this simple approach may result in less than 50% correct prediction for some integer programs. Static schemes can also be based on branches' opcodes. Another simple method uses the direction of the branch to make a prediction. If the branch is backward, i.e., the target address is smaller than the PC of the branch instruction, it is predicted to be taken. Otherwise, if the branch is forward, it is predicted to be not taken. This strategy tries to take advantage of loops in the program. It works well for programs with many looping structures, but not for programs with many irregular branches. Profiling is another static strategy. It uses previous runs of a program to collect information on the tendency of a given branch to be taken or not taken, and presets a static prediction bit in the opcode of that branch. Later runs of the program can use this information to make predictions. This strategy suffers from the fact that runs of a program with different input data sets usually result in different branch behavior. Recently, C. Young and M. Smith proposed static correlated branch prediction (SCBP), which trades increased code size for increased prediction accuracy. At this time, we do not know whether this approach will yield any performance improvement. For more information, refer to [YS94]. In this project, we studied the limit of the static approach without code expansion. 
Our results indicate that the static schemes without code expansion are not comparable to dynamic approaches.
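
The backward-taken, forward-not-taken rule described above can be sketched in a few lines. This is only an illustration: the function names, the trace format of (branch pc, target pc, taken) tuples, and the toy addresses are assumptions made for the example, not part of our simulator.

```python
def predict_btfnt(branch_pc, target_pc):
    """Static direction rule: predict taken iff the target lies before the branch."""
    return target_pc < branch_pc


def static_accuracy(trace):
    """Fraction of correct predictions over (branch_pc, target_pc, taken) tuples."""
    correct = 0
    total = 0
    for branch_pc, target_pc, taken in trace:
        if predict_btfnt(branch_pc, target_pc) == taken:
            correct += 1
        total += 1
    return correct / total if total else 0.0


# Hypothetical toy trace: a loop-closing backward branch that is taken nine
# times and then falls through, plus a forward branch taken once out of twice.
toy_trace = (
    [(0x120, 0x100, True)] * 9
    + [(0x120, 0x100, False)]
    + [(0x110, 0x140, True), (0x110, 0x140, False)]
)
```

On this toy trace the rule is right on the nine loop iterations and on the untaken forward branch, and wrong on the loop exit and the taken forward branch, which mirrors the strength on loops and the weakness on irregular branches noted above.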

Dynamic schemes differ from static ones in that they use the run-time behavior of branches to make predictions. J. Smith [S81] gave a survey of early simple static and dynamic schemes. The best scheme in his paper uses 2-bit saturating up-down counters to collect history information, which is then used to make predictions. This is perhaps the best-known technique. McFarling [M93] referred to it as bimodal branch prediction. There are several variations in the design of the 2-bit counter. Yeh and Patt [YP91] discussed these variations. In many programs with intensive control flow, the direction of a branch is often affected by the behavior of other branches. Observing this fact, Pan, So, and Rahmeh [PSR92] and Yeh and Patt [YP91] independently proposed correlated branch prediction schemes, also called two-level adaptive branch prediction schemes. This new approach improved prediction accuracy by a large factor. Yeh and Patt [YP93] classified the variations of dynamic schemes that use two levels of branch history. McFarling [M93] explored the possibility of combining branch predictors to achieve even higher prediction accuracy.

Computer technology is advancing rapidly. Advanced VLSI technology makes larger branch prediction tables and more complicated schemes possible. Advances in programming languages also make larger and more complicated programs possible, and more complicated procedure calls allow more cross-references between branches. Multiprocessing and threading have become important with the rise of Multiple Instruction streams, Multiple Data streams (MIMD) machines. Therefore, it is important to look at the effect of these advances on branch prediction.

In this project, we look at several of these issues. The literature contains much branch prediction research based on the MIPS, Alpha, HP-PA, and Power architectures. We are interested in the impact of a different architecture, SPARC, on branch prediction. Our results show little, if any, impact of architecture on branch prediction: we obtain results similar to those of previous research. Taking advantage of the fast simulation speed of Shade, we are able to trace much larger programs, try out many different schemes, and experiment with different parameters of the schemes in a reasonable amount of time. We notice that the number of instructions traced clearly affects the observed branch behavior and prediction accuracy, and that the selection and size of the test program set also affect the comparison between schemes. Since conditional branches behave differently across applications and programming languages, it is important to have a set of benchmark programs that faithfully represents the average workload and complexity of the programs people run. We use a partially new collection of programs, which includes 8 SPECint95 beta benchmarks and 13 SPECfp92 benchmarks, to see how well the well-known schemes work on these new programs. We observe that one new SPEC95 program, go, has branch behavior very different from that of previous SPEC programs. In this paper, we first show the performance of several well-known dynamic branch prediction schemes. From the results, we conclude that the selective predictor achieves the highest prediction accuracy for a given branch prediction buffer size. We also observe that conventional static predictors cannot compete with dynamic predictors, and that context switching has little impact on branch prediction with today's fast CPUs. In addition, complex schemes require a longer time to warm up than simple schemes do.

The rest of the report is organized as follows. The related work section gives references to previous work on branch prediction. The design methodology section discusses the methodology used in this study: how the simulated prediction models and test programs are selected, and how the simulated prediction schemes are designed and implemented. The result analysis section discusses our findings from traces of the benchmark programs. The future work section presents some work that may be interesting to explore. The last section summarizes the report.


Related Work

Branch prediction performance issues have been studied extensively. J. Smith [S81] gave a survey of early simple static and dynamic schemes. The best scheme in his paper uses 2-bit saturating up/down counters to collect history information, which is then used to make predictions. This is perhaps the best-known technique. McFarling [M93] referred to it as bimodal branch prediction. It was also referred to as one-level branch prediction in Yeh and Patt's paper [YP91]. We will discuss this scheme in more detail in later sections. Lee and Smith [LS92] evaluated several branch prediction schemes. In addition, they addressed how to use branch target buffers to reduce the delay due to target address calculation. McFarling and Hennessy [MH86] compared various hardware and software approaches to reducing branch cost, including the use of profiling information. Fisher and Freudenberger [FF92] studied the stability of profile information across separate runs of a program. In many programs with intensive control flow, the direction of a branch is often affected by the behavior of other branches. Observing this fact, Pan, So, and Rahmeh [PSR92] and Yeh and Patt [YP91] independently proposed correlated branch prediction schemes, also called two-level adaptive branch prediction schemes in Yeh and Patt's paper. Correlation schemes use both the history of the individual conditional branch and the global branch history. Pan, So, and Rahmeh [PSR92] described how both global history and branch address information can be used in one predictor. This new approach improved prediction accuracy by a large factor. There are several variations of this kind of dynamic scheme, which use different indexing methods and buffer organizations. Yeh and Patt [YP91] gave a comparison of these approaches. Several variations also exist in the design of the 2-bit counter used in many of the dynamic schemes. Yeh and Patt [YP93] discussed these variations. 
McFarling [M93] explored the possibility of combining branch predictors to achieve even higher prediction accuracy. He also presented a sharing index scheme, referred to as gshare, and a new scheme using combined predictors. Ball and Larus [BL93] described several techniques for guessing the most common branch directions at compile time using static information. Young and M. Smith [YM94] [YM95] introduced the notion of static correlated branch prediction (SCBP). In a recent paper [GSY], Gloy, M. Smith, and Young addressed performance issues of this approach. They claimed better performance in comparison to some dynamic approaches. Several studies [JW89] [W91] have looked at the implications of branches for the available instruction-level parallelism (ILP). These studies show that the branch misprediction rate is a crucial parameter in determining the amount of parallelism that can be exploited.


Design Methodology

First, we briefly describe the experimental system and test benchmark programs we use. Second, we describe the schemes that we implemented and tested. Third, we discuss how we chose the buffer organization and related parameters. Finally, we discuss the issues regarding the number of instructions being traced.

1. Experimental System

Our experiment is conducted on a SPARC system 10 with two SuperSparc/60 V8 microprocessors. We compiled the C benchmark programs using SUNSoft "cc" version 3.0.1, and the Fortran programs using SUNSoft "f77" version 3.0.1.

Our data is obtained using "Shade" [SHADE] version 5.15 analyzing programs. Shade is a dynamic code tracer. It links instruction set simulation and trace generation with custom trace analysis. The first advantage of a Shade-based simulator over static trace-based simulators, such as the pixie-based approach, is its fast running speed. Shade tends to run fast mainly because the executable being traced, the Shade trace generator, and the Shade analyzer are all in one single process. The second advantage is that it combines trace generation with the trace-based analyzer/simulator, thereby avoiding awkward trace file manipulations.

The following figure illustrates the code structure of our shade-based branch prediction simulators.

Figure 1. Shade-Based Branch Prediction Scheme Simulator

The main function is analyze() which is invoked for each traced instruction specified in shade_main(). Provided with the pc - program counter, ea - effective address, and taken/untaken information, we implemented different branch prediction scheme simulators. In analyze(), we also implemented a simple profiler and an execution controller. The profiler gathers the program branch behavior information. The execution controller controls the dynamic execution size. In addition, it controls the time to flush branch history tables when simulating context switching.
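
The division of labor among the predictor, the profiler, and the execution controller can be sketched as a plain trace-driven loop. The sketch below does not use the Shade interface. The names `run_trace`, `AlwaysTaken`, `max_branches`, and `flush_interval`, and the (pc, taken) trace format, are hypothetical stand-ins chosen for illustration.

```python
class AlwaysTaken:
    """Trivial stand-in predictor, used only to exercise the driver loop."""

    def predict(self, pc):
        return True

    def update(self, pc, taken):
        pass

    def reset(self):
        pass


def run_trace(trace, predictor, max_branches=None, flush_interval=None):
    """Feed (pc, taken) records to a predictor and collect simple statistics.

    max_branches   -- execution controller: stop after this many branches.
    flush_interval -- simulated context switch: reset the predictor's
                      history tables every flush_interval branches.
    """
    stats = {"branches": 0, "taken": 0, "correct": 0}
    for pc, taken in trace:
        seen = stats["branches"]
        if max_branches is not None and seen >= max_branches:
            break
        if flush_interval and seen and seen % flush_interval == 0:
            predictor.reset()
        if predictor.predict(pc) == taken:
            stats["correct"] += 1
        predictor.update(pc, taken)
        stats["branches"] += 1
        if taken:
            stats["taken"] += 1  # profiler: taken-branch count
    return stats
```

Any predictor object that offers `predict`, `update`, and `reset` could be plugged into such a loop, which is one way to compare several schemes on the same branch stream.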

2. Benchmark Programs

In different applications and programming languages, conditional branches behave differently. It is important to have a set of benchmark programs that gives a good approximation of the average workload and complexity of the programs that users run. Previous work in this area has been done by tracing the execution of benchmark programs. In this project, we also use instruction tracing data to measure the performance of different branch prediction schemes. Eight benchmark programs from the beta version of the SPEC95 integer program suite and thirteen benchmarks from the SPEC92 floating point program suite are used in this study. Tables 1 and 2 list the benchmark programs, the abbreviations that we use for them, and the input data sets used in our experiment.

SPEC95 Integer Program Beta
Benchmark / Input	Dynamic Inst.	Dynamic Cond. Branch	Program / Input size	Static Cond. Branch
gcc / 1amptjp.i	1297M	221M	1697K / 222K	19598
gcc / 1c-decl-s.i	1297M	221M	1697K / 222K	19603
gcc / 1dbxout.i	1664M	28M	1697K / 42K	15455
gcc / 1reload1.i	992M	173M	1697K / 148K	19673
gcc / cccp.i	1298M	223M	1697K / 162K	19514
gcc / insn-emit.i	147M	23M	1697K / 48K	10815
gcc / stmt-protoize.i	986M	165M	1697K / 185K	19746
ghost / convolution.ps-color	1400M	238M	584K / 218K	4262
ghost / convolution.ps-mono	1342M	229M	584K / 218K	4312
ghost / convolution.ps-tiff	1315M	222M	584K / 218K	4330
go / restart.in*	4535M	500M	390K /	5761
go / neardone.in	733M	78M	390K /	4874
go / null.in*	4531M	500M	390K /	5742
m88ksim / dcrand.in*	7007M	1000M	389K / 66K	824
numi / numi.in*	7507M	1000M	31K /	1064
li / li-input.lsp*	12355M	1000M	299K /	1412
perl / jumble.perl*	3471M	500M	400K /	2523
perl / primes.perl*	4762M	500M	400K /	2218
vortex / vortex.in*	7195M	1000M	867K /	7602

Table 1. SPEC95 Integer Program Beta and Input Data Description.

( All programs are listed in alphabetical order. Entries with * denote programs that were interrupted in the middle of tracing. The number of static conditional branches is the number of different static conditional branches traced. Only numi is a Fortran program. All the other programs are written in C. The input data sizes of gcc, ghost and m88ksim are related to the tracing sizes.)

SPEC92 Floating Point Program
Benchmark	Dynamic Inst.	Dynamic Cond. Branch	Program size	Static Cond. Branch
alvinn	6792M	480M	9612	1032
doduc	1644M	87M	247K	2330
ear	14506M	705M	59K	1238
fpppp	8463M	106M	138K	1332
hydro2d	6627M	680M	111K	2356
mdljdp2	4206M	309M	79K	1458
mdljsp2	3011M	338M	98K	1520
nasa7	11104M	217M	91K	1889
ora	2029M	158M	24K	1153
su2cor	8055M	165M	150K	1863
swm256	9862M	66M	62K	1335
tomcatv	1261M	31M	21K	1036
wave5	4331M	286M	401K	1956

Table 2. SPEC92 Floating Point Program Description.

( All programs are listed in alphabetical order. The number of static conditional branches is the number of different static conditional branches traced. Alvinn and ear are C programs. All the other programs are written in Fortran. )

3. Branch Prediction Scheme Design

Branch History Based Prediction -- Dynamic Branch Prediction
Branch prediction schemes using small buffers of branch history take advantage of repetitive taken/untaken branch behavior, thereby achieving better prediction accuracy than the simple static prediction schemes. For each conditional branch, an appropriate counter is incremented or decremented. The most significant bit of the counter determines the prediction. J. Smith [S81] observed that a 2-bit counter empirically provides an appropriate amount of damping to changes in branch direction. A 1-bit counter simply records the direction of the last executed branch. In addition, counters of 3 or more bits do not appear to offer large cost/benefit advantages over 2-bit counters. We will further discuss the design of the 2-bit counter in a later subsection on predictor tuning.
Bimodal Branch Prediction
Bimodal branch prediction is the simplest dynamic prediction scheme based on 2-bit counters. The branch history table is indexed by the low-order address bits of the program counter. The following figure illustrates the design of the bimodal prediction scheme.

Figure 2, 3. Bimodal Predictor | Correlation Based Predictor
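
As a minimal sketch of the bimodal scheme, the class below keeps one table of 2-bit saturating up-down counters indexed by low-order pc bits. The class and parameter names are illustrative. Two details are assumptions rather than facts taken from our simulator: the two always-zero low bits of the 4-byte-aligned SPARC instruction address are dropped before indexing (`word_shift`), and every counter starts in the weakly-not-taken state.

```python
class BimodalPredictor:
    """One table of 2-bit saturating counters indexed by low-order pc bits."""

    def __init__(self, index_bits=12, word_shift=2):
        # 2**12 counters of 2 bits each correspond to an 8K-bit buffer.
        self.index_mask = (1 << index_bits) - 1
        self.word_shift = word_shift          # assumption: skip alignment bits
        self.counters = [1] * (1 << index_bits)  # assumption: start at state 1

    def _index(self, pc):
        return (pc >> self.word_shift) & self.index_mask

    def predict(self, pc):
        # The most significant bit of the counter gives the prediction.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        idx = self._index(pc)
        count = self.counters[idx]
        # Saturating up-down counter: clamp to the range 0..3.
        self.counters[idx] = min(3, count + 1) if taken else max(0, count - 1)
```

Once a counter has saturated in the taken direction, a single not-taken outcome costs one misprediction but does not flip the next prediction. This is the damping effect attributed to the 2-bit counter.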

Correlated Branch Prediction Schemes
Correlated branch prediction schemes include common-correlation, gselect, global and local. Since the bimodal scheme takes advantage of the bimodal distribution of branch behavior, it does not perform well when branches have strong dynamic behavior. Correlated prediction schemes are designed to take advantage of the relationship between different branch instructions -- certain repetitive patterns in the outcomes of several consecutive branches. One correlation-based predictor uses two branch history tables. The first table records the history of recent branches -- the global history. Each entry is implemented as a shift register. The second table records the branch history for each branch. It is organized as a matrix with rows and columns. Each entry is a 2-bit counter. The pc determines which shift register in the first table and which row of 2-bit counters in the second table should be used. The chosen global shift register indexes the appropriate counter within the selected row of counters. The prediction is made based on the selected counter. The selected shift register and the 2-bit counter are updated accordingly afterwards. Figure 3 above illustrates the design of correlated schemes.
There are many ways of using pc to index the first and the second tables. Yeh and Patt [YP93] classified these methods into per_address which uses the low order bits of pc and per_set which uses high or middle range bits of pc. They claimed that per_address method and per_set method have similar performance, and the latter has higher implementation cost. We use per_address in our study.
The well-known common-correlation scheme is a correlated scheme that uses a single 2-bit shift register as the global branch history table, and four 2-bit counters for each row of the second table. The 2-bit shift register approach exploits only the correlation between two consecutive branches. Another similar correlated scheme design uses j > 2 bits for the global branch history register and 2^j 2-bit counters for each row of the second table. We adopt the name used by McFarling and refer to it as the gselect scheme. If i equals 1, i.e. there is just a single row in the second table, the scheme is also referred to as the global scheme. The global scheme devotes its entire buffer to recording correlation information while ignoring the individual behavior of each branch. In most cases, it does not perform as well as other correlated schemes.
A more complicated design uses multiple shift registers in the first table. Each register records the branch history of different branches. McFarling referred to it as the local scheme.
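
A gselect-style predictor with a single global history register can be sketched as follows: the pc selects a row of 2-bit counters and the global history selects the counter within that row. The class and parameter names are illustrative. The defaults (7 pc bits, 5 history bits) are one way to fill a 4K-counter table with 5 global history bits and are an assumption of this sketch, as are the dropped alignment bits and the initial counter state.

```python
class GselectPredictor:
    """Rows selected by pc bits, columns selected by global branch history."""

    def __init__(self, pc_bits=7, history_bits=5, word_shift=2):
        self.pc_mask = (1 << pc_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.word_shift = word_shift  # assumption: skip alignment bits
        self.history = 0              # global shift register of recent outcomes
        # 2**pc_bits rows, each with 2**history_bits 2-bit counters.
        self.table = [[1] * (1 << history_bits) for _ in range(1 << pc_bits)]

    def _slot(self, pc):
        row = (pc >> self.word_shift) & self.pc_mask
        return row, self.history

    def predict(self, pc):
        row, col = self._slot(pc)
        return self.table[row][col] >= 2

    def update(self, pc, taken):
        row, col = self._slot(pc)
        count = self.table[row][col]
        self.table[row][col] = min(3, count + 1) if taken else max(0, count - 1)
        # Shift the newest outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask
```

A branch that strictly alternates between taken and not taken defeats a single 2-bit counter, but here the two history patterns map to two different counters, so after a short warm-up each pattern is predicted correctly.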
Sharing Index Branch Prediction Scheme
The sharing index scheme, proposed by McFarling, is referred to as gshare. This scheme is similar to the bimodal scheme. It XORs a j-bit global history shift register with i bits of the pc before indexing the counter table. Figure 4 illustrates the design of the gshare scheme.

Figure 4, 5. Index Sharing Predictor | Selective Predictor
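
The gshare indexing step can be sketched as below. The class and parameter names are illustrative. Aligning the history with the low-order index bits, dropping the two alignment bits of the pc, and the initial counter state are assumptions of this sketch. The defaults correspond to a 4K-counter table with 2 global history bits.

```python
class GsharePredictor:
    """One counter table indexed by (pc bits XOR global branch history)."""

    def __init__(self, index_bits=12, history_bits=2, word_shift=2):
        # history_bits should not exceed index_bits, or history bits are lost.
        self.index_mask = (1 << index_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.word_shift = word_shift  # assumption: skip alignment bits
        self.history = 0
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc):
        return ((pc >> self.word_shift) ^ self.history) & self.index_mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        idx = self._index(pc)
        count = self.counters[idx]
        self.counters[idx] = min(3, count + 1) if taken else max(0, count - 1)
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask
```

Compared with concatenating pc and history bits, the XOR lets the full index width carry pc information while still separating different history patterns of the same branch.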

Selective Branch Prediction Scheme
Different dynamic schemes use different branch history information. Many schemes work well on one type of programs and do not work well on another type of programs. The selective scheme uses two different predictors. Each of two predictors makes prediction independently. A third table is used to track the performance of the two subpredictors and arbitrates which prediction should be used as the final prediction. Selective scheme can perform well on different types of programs. Figure 5 illustrates the design of the selective prediction scheme.
The implementation cost of the selective prediction scheme is three times that of other prediction schemes, because two predictors and one selector are used.
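
A selective predictor built from a bimodal and a gselect sub-predictor can be sketched as follows. The class and parameter names are illustrative. The selector update rule used here (move the selector counter toward whichever sub-predictor was right, and only when the two disagree) follows the combining idea described by McFarling and is an assumption about the details, as are the dropped alignment bits and the initial counter states. The default sizes are one possible split: 1K bimodal counters, 2K gselect counters with 3 history bits, and 1K selector counters.

```python
def bump(count, up):
    """2-bit saturating up-down counter step."""
    return min(3, count + 1) if up else max(0, count - 1)


class SelectivePredictor:
    def __init__(self, bimodal_bits=10, gselect_pc_bits=8, history_bits=3,
                 selector_bits=10, word_shift=2):
        self.word_shift = word_shift  # assumption: skip alignment bits
        self.history_bits = history_bits
        self.hist_mask = (1 << history_bits) - 1
        self.gpc_mask = (1 << gselect_pc_bits) - 1
        self.history = 0
        self.bim = [1] * (1 << bimodal_bits)
        self.gsel = [1] * (1 << (gselect_pc_bits + history_bits))
        self.sel = [1] * (1 << selector_bits)  # >= 2 means "use gselect"

    def _indices(self, pc):
        word = pc >> self.word_shift
        b = word & (len(self.bim) - 1)
        g = ((word & self.gpc_mask) << self.history_bits) | self.history
        s = word & (len(self.sel) - 1)
        return b, g, s

    def predict(self, pc):
        b, g, s = self._indices(pc)
        chosen = self.gsel[g] if self.sel[s] >= 2 else self.bim[b]
        return chosen >= 2

    def update(self, pc, taken):
        b, g, s = self._indices(pc)
        bim_ok = (self.bim[b] >= 2) == taken
        gsel_ok = (self.gsel[g] >= 2) == taken
        if bim_ok != gsel_ok:
            # Train the selector only when exactly one sub-predictor is right.
            self.sel[s] = bump(self.sel[s], gsel_ok)
        self.bim[b] = bump(self.bim[b], taken)
        self.gsel[g] = bump(self.gsel[g], taken)
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask
```

Both sub-predictors are always trained, so the selector can switch to the other one for a given branch without that predictor having to warm up from scratch.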

Implementation Cost of Different Dynamic Branch Prediction Scheme

The implementation costs of the different schemes are shown in the following table. In the table, i is the number of pc bits used to index the counter table row, j is the number of pc bits used to index the shift register table, and k is the number of bits in the shift register.

scheme name	i	j	k	buffer size
bimodal	variable	n/a	n/a	2*2^i
correlation	variable	1	2	1 + 2*4*2^i
gselect	variable	1	variable	k + 2*2^(i+k)
global	1	1	variable	k + 2*2^k
local	variable	variable	variable	k*2^j + 2*2^(i+k)
gshare	variable	n/a	variable	k + 2*2^i

Table 3. Dynamic Predictor Implementation Cost
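
The buffer-size expressions for the bimodal, gselect, global, local, and gshare rows of Table 3 can be evaluated with a few helper functions. The function names are illustrative, and the parameter choices in the usage note (for example 7 pc bits with 5 history bits for gselect) are derived here from a 4K-counter budget rather than quoted from the simulator.

```python
def bimodal_bits(i):
    """2-bit counters, 2**i of them."""
    return 2 * 2**i


def gselect_bits(i, k):
    """One k-bit history register plus 2**(i + k) 2-bit counters."""
    return k + 2 * 2**(i + k)


def global_bits(k):
    """One k-bit history register plus 2**k 2-bit counters."""
    return k + 2 * 2**k


def local_bits(i, j, k):
    """2**j history registers of k bits plus 2**(i + k) 2-bit counters."""
    return k * 2**j + 2 * 2**(i + k)


def gshare_bits(i, k):
    """One k-bit history register plus 2**i 2-bit counters."""
    return k + 2 * 2**i
```

For a 4K-entry counter table, `bimodal_bits(12)` gives 8192 bits, while `gselect_bits(7, 5)` and `gshare_bits(12, 2)` add only the few bits of the history register on top of that.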

4. Branch Prediction Scheme Tuning

Before gathering data for all the benchmark programs, we used a small set of programs to find the best parameters for each scheme. We tested different 2-bit counter designs, different correlation depths for the gselect and local schemes, and different numbers of global bits for gshare.

Prediction History Buffer Size
To have a fair comparison, we choose 8K bits branch prediction buffer, i.e. 4K 2-bit counter entries, for all of the prediction schemes. We will discuss the effect of buffer size on prediction accuracy further in the result analysis section.
2-bit Counter Design
There are many variations in the design of the 2-bit counter state transition automaton. Figure 6 shows four common automaton designs. The two best known are automaton1 and automaton2. Assuming that automaton1, discussed in Patterson and Hennessy's book, has better performance, we used automaton1 first. However, according to our experimental results, automaton2-based schemes produced about 0.5% better prediction accuracy than those based on automaton1 when an 8K-bit buffer was used. Used by many prediction schemes, automaton2 is also referred to as a saturating up-down counter. We choose automaton2-based schemes in our final comparison analysis. Automaton3 and automaton4 are similar to automaton2, but their state transitions are biased more toward the taken direction.

Figure 6, 7. 2bit Counter State Diagram Design | Correlation Depth vs. Prediction Accuracy

Correlation Depth Selection for Gselect Scheme
From Figure 7 above, 5 to 6 global bits is the best choice when the branch prediction buffer is 8K bits. We use 5 bits in our comparison analysis because this tuning was done on gcc, and we expect floating point programs to favor fewer global bits.
Local Scheme Correlation Depth Selection
From Figure 8, we observe that the best choice is to use 3 global bits when the buffer is 8K bits.

Figure 8, 9. Local Scheme Correlation Depth vs. Prediction Accuracy | Gshare Scheme Global Branch History Bit Adoption vs. Prediction Accuracy

Sharing Index Scheme Global Bit Selection
From Figure 9 above, 2 global bits is the best choice for an 8K-bit buffer.
Selective Prediction Sub-Predictor Selection and Buffer Design
There are many variations in the design of selective prediction schemes. The two sub-predictors are chosen from different schemes. Given a fixed total prediction history buffer size, the three tables may be allocated different shares of it. McFarling used a gshare predictor and a bimodal predictor as the two sub-predictors. We use gselect and bimodal. Considering that the gselect predictor benefits more from a larger buffer, we use a 2K-bit buffer for the bimodal predictor, a 4K-bit buffer for the gselect predictor, and a 2K-bit buffer for the selector. We use 3 global history bits for the gselect sub-predictor.

5. The Number of Branch Instruction Being Traced

Given the fast running speed of Shade, we can explore more schemes and run more programs. More importantly, we can trace many more instructions for each program. In our study, we traced most benchmark programs to the end. A typical benchmark run executes more than several hundred million instructions, and this number is growing. Tracing so many instructions may be beyond the capability of trace-file-based approaches, most of which can feasibly trace and simulate at most several tens of millions of instructions. Since some branch behavior of a program does not appear until the middle of its run, tracing several hundred million to several billion instructions for each program is expected to provide more reliable results and more complete information about these benchmark programs.

Figure 10, 11. Prediction Accuracies vs. Branch Instructions Traced | Branch Taken Percentage vs. Branch Instructions Traced

The above two figures show the variation in branch behavior and in the performance of different schemes over the course of tracing the SPEC program gcc/cccp.i. The difference between the lowest point in the range of 1M~10M branches and the highest point in the range of 10M~100M branches is 10% for the taken percentage and 5% for the accuracy.


Result Analysis and Discussion

The simulation results presented in this section were run with

the bimodal scheme,
the gshare scheme,
the common correlation scheme,
the gselect scheme,
the local scheme,
the selective scheme (with bimodal and gselect as the two sub-predictors), and
the static scheme based on profiling without code expansion.

All the schemes except the static predictor use 8K bits, i.e. 4K entries of 2-bit counters. The static scheme that we study here is an assumed perfect profiling-based static scheme without code expansion. For the gshare, gselect, local, and selective predictors, we varied the number of address bits and the history depth and empirically selected the configuration with the best performance from each group.

To evaluate the performance of the seven different schemes, we first studied the branch behavior of the 21 benchmark programs that we use for this project. From this information, we are able to calculate the accuracy limit on the performance of a static predictor. Second, we give a comprehensive comparison of the seven schemes. Third we look at the effect of varying the buffer size on prediction accuracy for some dynamic schemes. Fourth, the effect of context switching on branch prediction is discussed. Finally, we look at an interesting benchmark program from SPECint95-beta.

1. Benchmark Program Branch Behavior

We first looked at the branch behavior of the 21 benchmark programs used in this study. Table 4 summarizes the information for the 8 integer programs from SPECint95-beta. Table 5 does the same for the 13 floating point (FP) programs from SPECfp92. The second and third columns of these two tables list the number of dynamic instructions traced and the number of dynamic conditional branches traced, respectively. The last two columns list the number of taken conditional branches and their percentage of the total number of conditional branches traced given in the third column. We traced 16 of the 21 benchmark programs to the end.

Program	Traced_Inst#	Dynamic_Branch#	Taken_Branch#	Taken_Branch%
gcc	6186539354	1057556947	566093312	53.53
ghost	4057743776	690986805	339710533	49.16
go	9800796393	1078696639	504611777	46.78
li+	7007645073	1000000000	551847194	55.18
m88ksim+	7507989139	1000000000	479341651	47.93
numi+	12355029197	1000000000	639221685	63.92
perl+	8234202623	1000000000	474993437	47.50
vortex+	7195515374	1000000000	425973672	42.60

( + stands for the analyzer was interrupted in the middle of tracing.)

Table 4. Branch behavior of benchmark programs from SPECint95-beta

Program	Traced_Inst#	Dynamic_Branch#	Taken_Branch#	Taken_Branch%
alvinn	6792027933	480143481	469154917	97.71
doduc	1644410091	87092576	48388199	55.56
ear	14506557248	705032704	466983346	66.24
fpppp	8463667939	106307360	67686215	63.67
hydro2d	6627612755	680214394	515386882	75.77
mdljdp2	4206065214	309377515	215543933	69.67
mdljsp2	3011635408	338499291	195972810	57.89
nasa7	11104431137	217326271	186014771	85.59
ora	2029511987	158386472	82174861	51.88
su2cor	8055850151	165611544	140751098	84.99
swm256	9862718926	66039940	61056566	92.45
tomcatv	1261279753	31605162	30958962	97.96
wave5	4331716191	286632343	194798779	67.96

Table 5. Branch behavior of benchmark programs from SPECfp92
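
The derived columns can be recomputed from the raw counts. The short sketch below does this for the gcc row of Table 4 and the tomcatv row of Table 5; the function and variable names are illustrative.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# (traced instructions, dynamic conditional branches, taken branches)
gcc = (6186539354, 1057556947, 566093312)
tomcatv = (1261279753, 31605162, 30958962)

gcc_taken_pct = percent(gcc[2], gcc[1])            # Taken_Branch% column
tomcatv_taken_pct = percent(tomcatv[2], tomcatv[1])
gcc_branch_freq = percent(gcc[1], gcc[0])          # branches per 100 instructions

print(round(gcc_taken_pct, 2), round(tomcatv_taken_pct, 2))
```

The same `percent` helper applied to the second and third columns gives the conditional branch frequency of each program, which for gcc comes to roughly 17 branches per 100 traced instructions.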

Percentage of conditional branches
Conditional branch frequencies dramatically affect machine pipeline performance. Therefore, it is important to look at the percentage of conditional branches. Figure 12 shows the conditional branch frequencies for the 21 benchmark programs used in this study. The x-axis lists the 21 benchmark programs, starting with the 8 integer programs. The y-axis shows the percentage of conditional branches. The integer programs show conditional branch frequencies of 8% to 17%, with numi having the lowest. As mentioned in the previous section, among all integer programs, numi is the only one written in Fortran, and it has the highest FP instruction percentage -- 10.2%. All the other integer programs are written in C. The FP programs have a lower percentage of conditional branches than the integer programs. They show frequencies between 1% and 11%. The average frequency is about 14% for the 8 integer programs and 5% for the 13 FP programs.

Figure 12 The Frequencies of Conditional Branches

Percentage of taken branches
In Figure 13, the frequencies of taken conditional branches are given. As in Figure 12, the x-axis lists the 21 benchmark programs, and the y-axis gives the percentage of taken conditional branches. The first 8 programs are integer programs. The horizontal dashed line represents the average frequency for all 21 programs. All 8 integer programs are below this average; numi comes close to it, for the reason given in the previous paragraph. Notice that some of the FP programs, such as alvinn and tomcatv, have almost 100% taken frequency. The average is about 50% for the integer programs and 75% for the FP programs. As shown later, this frequency is directly related to the prediction performance of all the predictors. Generally speaking, the higher the taken frequency, the higher the prediction accuracy. This is also shown in the earlier discussion of the number of instructions traced: the prediction accuracy curves in Figure 10 have a shape similar to the branch taken percentage curve in Figure 11. FP programs are easier to predict than integer programs, which have a lower frequency of taken branches.

Figure 13 Percentage of conditional branches that are taken

Branch profiling information
The bias of a branch describes how strongly the branch tends to be taken or not taken. Profile-based static predictors work well because most of the dynamic branches executed are strongly biased. Figure 14 shows the bias over all integer benchmarks, over all FP benchmarks, and over all 21 benchmarks, using the average of each group. The x-axis shows, for a particular branch, the percentage of its executions that are taken; branches within the same 5% interval are grouped together. The y-axis shows the percentage of branches that have a given bias. The fewer branches fall in the middle of the range, the better the performance of the perfect static predictor. We notice that the FP programs are more biased than the integer programs.

Figure 14 Branch bias, weighted by execution frequency.
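A bias distribution of the kind plotted in Figure 14 could be computed from a branch trace roughly as follows. This is a minimal sketch, not the tooling used in this study: the trace format (a non-empty list of (branch address, taken) pairs) and all function names are assumptions made for illustration. It groups branches into 5% taken-percentage intervals, weights each group by execution count, and also computes the accuracy of a perfect static predictor that always guesses each branch's majority direction.

```python
from collections import defaultdict


def per_branch_counts(trace):
    """Return {address: [not_taken_count, taken_count]}.

    `trace` is an assumed format: an iterable of (address, taken) pairs.
    """
    counts = defaultdict(lambda: [0, 0])
    for address, taken in trace:
        counts[address][1 if taken else 0] += 1
    return counts


def bias_histogram(trace, bucket_width=5):
    """Fraction of executed branches that fall in each taken-percentage bucket."""
    counts = per_branch_counts(trace)
    total = sum(nt + t for nt, t in counts.values())
    n_buckets = 100 // bucket_width
    histogram = [0.0] * n_buckets
    for not_taken, taken in counts.values():
        executed = not_taken + taken
        percent_taken = 100 * taken // executed  # integer percent, 0..100
        # A branch that is taken 100% of the time goes in the last bucket.
        bucket = min(percent_taken // bucket_width, n_buckets - 1)
        histogram[bucket] += executed / total
    return histogram


def perfect_static_accuracy(trace):
    """Accuracy when every branch is predicted in its majority direction."""
    counts = per_branch_counts(trace)
    total = sum(nt + t for nt, t in counts.values())
    correct = sum(max(nt, t) for nt, t in counts.values())
    return correct / total
```

With this sketch, a U-shaped histogram (most weight in the first and last buckets) corresponds to a high perfect-static accuracy, while weight in the middle buckets lowers it.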

2. Scheme Comparison

Figure 15 shows the prediction accuracies of the seven prediction schemes for each of the 21 benchmarks. All the schemes except the static scheme use a branch prediction buffer of 8K bits. For most of the 21 benchmarks, the static predictor's prediction accuracy is the lowest of the seven schemes measured. In contrast, the selective predictor achieves the highest prediction accuracy on most of the benchmarks. Of the six dynamic predictors, the bimodal scheme, which is the simplest dynamic scheme, performs the worst. gshare and common correlation are slightly better. The prediction accuracies of local and gselect are slightly lower than that of selective.



Figure 15 Seven schemes performance on the 21 benchmark programs

Figure 16 Seven schemes average prediction accuracy over the 21 benchmark programs

Benchmarks	Schemes ordered by performance (from worst to best)
INT	static	bimodal	gshare	correlation	local	gselect	selective
INT	89.8%	89.8%	90.3%	90.8%	91.3%	91.8%	92.7%
FP	static	bimodal	correlation	gshare	gselect	selective	local
FP	93.3%	94.4%	94.7%	94.7%	95.3%	95.5%	95.6%
ALL	static	bimodal	gshare	correlation	local	gselect	selective
ALL	92.0%	92.6%	93.0%	93.2%	93.9%	93.9%	94.4%

Table 6. Performance summary of the seven schemes

Figure 16 shows the average prediction accuracies of the seven prediction schemes for the integer programs, for the FP programs, and for all 21 programs. Table 6 summarizes the relative performance of the seven schemes.

We observe the following:

Replacing the single global history register with multiple registers improves the prediction accuracy of FP programs but worsens that of integer programs.
Although it uses 1K more bits for its history shift registers, the local scheme on average offers no performance win over cheaper schemes such as gselect, which uses only one global history register. Unlike FP programs, which have many long looping structures, integer programs have branches that are more correlated with one another. A single history register exploits the correlation between different branches better.
It is a good idea to combine predictors.
From Figures 15 and 16 and Table 6, we notice that the selective predictor has the highest prediction accuracy. Different programs, and different parts of the same program, have different branch behaviors. The selective predictor can dynamically adapt to these behaviors to achieve the best prediction. As for choosing the two subpredictors to be combined, the discussion of correlation depth suggests combining a predictor that works well for highly correlated programs with one that works well for less correlated programs.
Sharing index can reduce the effect of aliasing.
Many of the dynamic schemes suffer from aliasing, which potentially makes the branch prediction table sparse. gshare is a scheme that tries to reduce aliasing. It XORs the branch address with the global history register to distribute the 2-bit counters more evenly. Although this can reduce aliasing, the net effect on branch prediction accuracy is not clear.
The seven schemes have curves of similar shape. The differences between their prediction accuracies are somewhat predictable.
From Figure 15, we can observe that if one scheme performs well on a program, the others can be expected to perform well also. Since the performance of the static predictor on a program is solely determined by the program's branch behavior, we conclude that the branch behavior of a program is the most important parameter in determining the prediction accuracy of each scheme.
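The indexing differences between the bimodal, gselect and gshare schemes, and the idea of a selective predictor, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the simulator used in this study: the class and function names are hypothetical, addresses are assumed to be word indices (alignment bits already dropped), counters start at weakly not-taken, and the chooser is assumed to be a table of 2-bit counters indexed by branch address.

```python
class CounterTable:
    """Table of 2-bit saturating counters, started at weakly not-taken (1)."""

    def __init__(self, index_bits, index_fn):
        self.index_bits = index_bits
        self.index_fn = index_fn  # (address, history, index_bits) -> index
        self.counters = [1] * (1 << index_bits)

    def predict(self, address, history):
        i = self.index_fn(address, history, self.index_bits)
        return self.counters[i] >= 2  # 2 and 3 mean "predict taken"

    def update(self, address, history, taken):
        i = self.index_fn(address, history, self.index_bits)
        step = 1 if taken else -1
        self.counters[i] = max(0, min(3, self.counters[i] + step))


def bimodal_index(address, history, index_bits):
    # Low-order address bits only; the global history is ignored.
    return address & ((1 << index_bits) - 1)


def make_gselect_index(history_bits):
    # Concatenate low address bits with the most recent history bits.
    def index(address, history, index_bits):
        address_bits = index_bits - history_bits
        low_address = address & ((1 << address_bits) - 1)
        low_history = history & ((1 << history_bits) - 1)
        return (low_address << history_bits) | low_history
    return index


def gshare_index(address, history, index_bits):
    # XOR address and history so both contribute to every index bit.
    return (address ^ history) & ((1 << index_bits) - 1)


class SelectivePredictor:
    """Per-address chooser counters select one of two component predictors."""

    def __init__(self, first, second, chooser_bits):
        self.first, self.second = first, second
        self.mask = (1 << chooser_bits) - 1
        self.chooser = [1] * (1 << chooser_bits)  # >= 2 means "use second"

    def predict(self, address, history):
        use_second = self.chooser[address & self.mask] >= 2
        chosen = self.second if use_second else self.first
        return chosen.predict(address, history)

    def update(self, address, history, taken):
        first_ok = self.first.predict(address, history) == taken
        second_ok = self.second.predict(address, history) == taken
        i = address & self.mask
        # Move the chooser only when exactly one component was right.
        if second_ok and not first_ok:
            self.chooser[i] = min(3, self.chooser[i] + 1)
        elif first_ok and not second_ok:
            self.chooser[i] = max(0, self.chooser[i] - 1)
        self.first.update(address, history, taken)
        self.second.update(address, history, taken)


def accuracy(predictor, trace, history_bits=12):
    """Fraction of correct predictions over (address, taken) pairs."""
    history = correct = 0
    for address, taken in trace:
        correct += predictor.predict(address, history) == taken
        predictor.update(address, history, taken)
        history = ((history << 1) | int(taken)) & ((1 << history_bits) - 1)
    return correct / len(trace)
```

As a usage example, a single branch that strictly alternates between taken and not taken defeats `CounterTable(10, bimodal_index)` but is learned by the history-based tables after a short warm-up, and a `SelectivePredictor` built from the two shifts its chooser toward the history-based component.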

3. Effects of Changing Buffer Size

Advancing VLSI technology will make larger branch prediction tables possible in the near future. We look at the effect of varying the buffer size on three dynamic schemes: bimodal, common-correlation, and gselect. The benchmark program that we tested on is gcc with cccp.i as the input file.

The results are shown in Figure 17, with one curve for each of the three dynamic schemes. The curve for the bimodal scheme goes flat when the number of 2-bit counters gets above about 5000, which agrees with results from previous research. Therefore, for the bimodal scheme, buffer size is not a limiting factor when it is more than 8K bits (4K 2-bit counters). The bimodal scheme behaves this way because 5000 distinct branches is a large number for a typical program run. As shown in Table 1 and Table 2, all benchmarks studied except gcc and vortex have only between 1000 and 5000 static conditional branches. The common correlation scheme still improves noticeably with increased buffer size up to 20K bits, or 10K 2-bit counters. Even when the buffer size reaches 200K bits, the gselect scheme still shows significant improvement if the correlation depth is also increased accordingly.

Figure 17 Effects of varying branch prediction buffer size
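The flattening of the bimodal curve is consistent with a simple capacity argument: once every static branch has its own counter, more counters cannot help. The sketch below illustrates this under assumptions that are not taken from the study: counters are indexed by the address modulo the table size, and the example addresses are contiguous word indices, which real programs are not.

```python
def counters_for(buffer_bits, bits_per_counter=2):
    """Number of counters that fit in a buffer of the given size in bits."""
    return buffer_bits // bits_per_counter


def aliased_branches(addresses, n_counters):
    """Count static branches that share a counter with another branch.

    Assumes simple modulo indexing, as in a bimodal table.
    """
    slots = {}
    for address in set(addresses):
        slots.setdefault(address % n_counters, []).append(address)
    return sum(len(group) for group in slots.values() if len(group) > 1)


# Hypothetical program with 5000 static branches at word indices 0..4999.
branches = range(5000)
print(counters_for(8 * 1024))                        # 4096 counters in 8K bits
print(aliased_branches(branches, counters_for(8 * 1024)))
print(aliased_branches(branches, counters_for(16 * 1024)))
```

In this idealized case some branches still collide with 4096 counters, and none collide once the table holds 8192 counters, after which further growth cannot change the bimodal result.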

4. Effects of Context Switching

Context switches occur frequently in multitasking computer systems. After a context switch, the branch prediction tables are normally invalidated or flushed. Thus context switches have a detrimental effect on the branch prediction rate. To observe this effect, we flushed the prediction table after a fixed number of instructions. The number of instructions between context switches was set to 5K, 10K, ..., and 2.6M. We used three dynamic predictors, each using 8K bits: bimodal, common-correlation and gselect. The benchmark program that we tested on is gcc from SPECint95-beta with cccp.i as the input file.
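The flushing experiment described above can be sketched for the bimodal scheme as follows. This is an illustrative sketch, not the Shade-based analyzer used in the study: the trace format (tuples of instruction count so far, branch address, and outcome) and the function name are assumptions, and the table is reset to weakly not-taken counters at each simulated context switch.

```python
def accuracy_with_flushes(trace, index_bits, instructions_between_switches):
    """Bimodal accuracy when the counter table is flushed at fixed intervals.

    `trace` is an assumed format: (instruction_count, address, taken) tuples,
    where instruction_count is the number of instructions executed so far.
    """
    size = 1 << index_bits
    counters = [1] * size
    interval = instructions_between_switches
    next_flush = interval
    correct = total = 0
    for instruction_count, address, taken in trace:
        if instruction_count >= next_flush:
            counters = [1] * size  # simulated context switch: flush the table
            next_flush = (instruction_count // interval + 1) * interval
        i = address % size
        correct += (counters[i] >= 2) == taken
        step = 1 if taken else -1
        counters[i] = max(0, min(3, counters[i] + step))
        total += 1
    return correct / total
```

Sweeping `instructions_between_switches` over a range of values and plotting the result against the no-flush accuracy gives curves of the kind this experiment reports.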

The results are shown in Figure 18, with curves for bimodal, correlation, and gselect. The three horizontal lines in the figure show the prediction accuracies for each scheme in the case of no context switch. For all three schemes, we observe that the effect of context switches on prediction rate decreases as the number of instructions between context switches gets larger. We see very little effect after the number of instructions is over 1 million.

With the increased speed of CPUs, the number of instructions between context switches is getting larger. For a 50-MIPS machine, it is about 3 million instructions, assuming the commonly used figure of 16 context switches per second on a UNIX system. Therefore, conventional context switching has little effect on branch prediction accuracy. However, threads and multiprocessing are increasing in importance. The number of instructions between context switches for these lightweight processes is sometimes much smaller than for conventional context switches. For this reason, the effect of context switches should not be overlooked.
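The 3 million figure follows directly from the two numbers quoted above. A short check, with a hypothetical helper name:

```python
def instructions_between_switches(mips, switches_per_second):
    """Average instructions executed between two context switches."""
    return mips * 1_000_000 / switches_per_second


# 50 MIPS at 16 context switches per second.
print(instructions_between_switches(50, 16))  # 3125000.0
```

This is well above the 1 million instruction point beyond which the measured effect becomes very small.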

We also notice that for less complicated schemes such as bimodal, the effect of context switches is not as evident as for more complicated schemes such as gselect. This indicates that complicated schemes need a longer time to warm up than simple schemes.

Figure 18 Effects of context switching on prediction accuracy

5. A Special Case Study: go

In analyzing the simulation results, we observed that every scheme performed poorly on go, a new benchmark program from SPECint95-beta. To find out why, we studied this benchmark further.

What is peculiar about go?
go is a special version of the "Go" program, "The Many Faces of Go", developed for use as a part of the SPEC benchmark suite. As described in the description file with the benchmark, go is an artificial intelligence game program. It is a computation-bound integer benchmark and uses a very small amount of FP only during initialization to set up an array of scaled integers. It uses almost no divides and few multiplies. Most data is stored in single-dimensioned arrays specifically to avoid multiplies. This program has been extensively optimized using gprof to tune for maximum performance. Some inner loops have been unrolled in C. It features many small loops and lots of control flow (if-then-else). This feature of go causes all the branch prediction schemes that take advantage of long looping structures to perform poorly. It also raises the question of whether complicated compiler optimization changes a program's branch behavior and therefore its prediction accuracy.
Figure 19 shows the bias of go in comparison with gcc and with the average for the 8 integer programs. The curves for both gcc and the integer average show a clear U-shape. However, the curve for go is quite flat. There are about the same number of branches for each 5% interval of un-takenness. This distribution of branches causes the various predictors to perform poorly. This result also indicates that branch behavior is the most important parameter in determining the prediction accuracy of each scheme.

Figure 19 Profiling information for go
Can we get better results?
We look at the effect of increasing buffer size on prediction accuracy using go as the test benchmark. For this part of the study, we used the gselect scheme. We varied the branch prediction buffer from 256 bytes to 64K bytes. For each buffer size, the history depth also needs to be varied. Without increasing the correlation depth, increasing the buffer size does not yield much improvement.
Figure 20 shows the results from this study. The x-axis shows the number of global history bits used (history depth). The y-axis shows the prediction accuracies. There is a curve associated with each buffer size. For each buffer size, there is one history depth with the best prediction rate, and we use another curve to connect these best points. For the 256-byte buffer, the best prediction accuracy is about 76%. With a 64K-byte buffer, the best prediction accuracy rises to about 88%, which is a 50% reduction in the misprediction rate.
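The 50% figure refers to the misprediction rate rather than to accuracy: moving from 76% to 88% accuracy halves the fraction of mispredicted branches, from 24% to 12%. A small check, with a hypothetical helper name:

```python
def miss_rate_reduction(accuracy_before, accuracy_after):
    """Relative reduction in misprediction rate between two accuracies."""
    miss_before = 1.0 - accuracy_before
    miss_after = 1.0 - accuracy_after
    return (miss_before - miss_after) / miss_before


# 76% -> 88% accuracy: misprediction rate falls from 24% to 12%.
print(round(miss_rate_reduction(0.76, 0.88), 3))  # 0.5
```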
From the figure, we notice that for go, a highly correlated program, increasing the number of global history bits helps to improve the prediction accuracy. Since there are lots of if-then-else control structures in the benchmark, the direction of a branch usually depends on the outcome of others. Therefore, the predictor needs a high correlation depth, larger than 15 as indicated by the 64KB curve.
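The link between buffer size and usable history depth can be made concrete by counting index bits. The sketch below assumes, as an illustration only, that the whole buffer holds 2-bit counters, that the counter count is a power of two, and that a gselect index is split between history bits and address bits; the function names are hypothetical.

```python
def gselect_index_bits(buffer_bytes, bits_per_counter=2):
    """Index width for a buffer filled with counters (power-of-two sizes)."""
    counters = buffer_bytes * 8 // bits_per_counter
    return counters.bit_length() - 1  # log2 of a power of two


def gselect_splits(buffer_bytes):
    """All (history_bits, address_bits) splits available for this buffer."""
    index_bits = gselect_index_bits(buffer_bytes)
    return [(h, index_bits - h) for h in range(index_bits + 1)]


print(gselect_index_bits(256))        # 10 index bits for a 256-byte buffer
print(gselect_index_bits(64 * 1024))  # 18 index bits for a 64K-byte buffer
```

Under these assumptions, a 256-byte buffer cannot accommodate a history depth above 10, whereas a 64K-byte buffer leaves room for depths above 15.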

Figure 20 Effects of increasing branch prediction buffer size on go using gselect

Future Work

There are a number of ways that this study can be extended.

First, a large number of parameters were not fully explored here. These parameters include the sizes of the branch prediction tables, the number of address bits or history bits used, the organization of the branch history tables, and, in the case of the selective prediction scheme, the choice of the two predictors. For example, we have mentioned that the correlation between branches is very high for some programs, such as the integer programs and especially go from SPECint95-beta, and not as high for others, such as the floating point programs. To work well in both cases, the predictor should be able to adapt to a specific program's branch behavior. It is a good idea to use a predictor that works well for both highly correlated and less correlated programs, since for general-purpose applications the branch behavior of different programs may vary dramatically.
Second, other sources of information, such as the target address and the opcode of the branch instruction, might usefully be added to increase prediction accuracy. Some types of conditional branch instructions are likely to have higher taken tendencies than others.
Third, we observe that with increased branch buffer size, it is beneficial to increase the history depth, i.e. the number of bits used for global branch history. However, branch history tables are often sparse when the number of rows in a branch prediction table is large. It would be helpful to have a way either to compress the branch history tables or to map the branches more uniformly into the branch tables, for example with a good hashing function, which could potentially increase the effective size of the branch buffer.
Fourth, branch prediction accuracy alone is not a sufficient metric. The time to make a prediction should also be considered. Thus it would be interesting to study the timing cost of each prediction scheme.
Fifth, information from a compiler with profile support might be used in combination with some of the dynamic predictors to yield higher prediction accuracy.
Finally, one design question is whether we should tune system performance to one category of programs or balance it for more general cases. Current branch prediction research is closely tied to popular benchmark programs and to simplified environment assumptions such as the single-process assumption. We have shown the impact of different application programs and of the operating system. It would be interesting to explore further how branch prediction accuracy is affected by the interaction between the architecture and the higher levels of the computer system, including the operating system and application programs.

Conclusion

In this report, we have analyzed several of the most well-known and effective branch prediction schemes with respect to their prediction performance and cost-effectiveness. The schemes that we have looked at are static, bimodal, common correlation, gshare, local, gselect, and selective. Thanks to the speed of the Shade analyzers, we were able to trace hundreds of times more instructions than previous research, and so to obtain more accurate results.

The following findings may be of interest for future research in branch prediction and new architecture design.

Selective predictors perform better than the other schemes when using the same size of branch prediction table. This kind of predictor can adapt to the branch behavior of the running program to achieve high prediction accuracy. Because we used a large set of benchmark programs and traced most of them to completion, we believe this conclusion is more convincing than the contradicting results of some other research, which was based on either fewer test programs or a much smaller traced portion of each program. We anticipate that this kind of predictor will be popular in future branch predictor designs.
Gselect predictors perform especially well on highly correlated programs such as integer programs, which contain many if-then-else statements. Local predictors perform especially well on floating point programs, because these programs have many looping structures. Local predictors keep a history register for each branch address, and thus reduce the interference between different branches. However, for the same reason, they do not perform as well as gselect predictors on integer programs.
Gshare is a set of predictors designed to reduce aliasing. Many of the dynamic schemes suffer from aliasing, which potentially makes the branch prediction table sparse. Gshare XORs the branch address with the global history register to distribute the 2-bit counters more evenly. Although reduced aliasing should improve prediction accuracy, we did not observe any net performance win from this approach.
With respect to cost-effectiveness, gselect and selective seem to be the clear winners. Gselect, local, and selective have about the same performance. However, local takes about 20% more space than the other two schemes and has higher implementation complexity than gselect.
We also looked at the effects of changing buffer size and of context switches on branch prediction. For simple schemes, increasing the buffer size has a less evident effect on prediction than it does for more complex schemes. For example, while the bimodal scheme shows no performance improvement once the buffer size is over 8K bits, the gselect scheme still shows clear improvement even if the buffer is over 2M bits.
As to context switches, we observe that their effect on the prediction rate decreases as the number of instructions between context switches gets larger. We see very little effect once the number of instructions is over 1 million. This is exactly what we expected, since the larger the number of instructions between context switches, the less frequently the prediction tables need to be flushed. We have also noticed that for less complicated schemes such as bimodal, the effect of context switches is not as evident as for more complicated schemes such as gselect. This indicates that complicated schemes need a longer warm-up time than simple schemes.
The branch behavior of a program is a very important factor, if not the most important, in determining the performance of any branch prediction scheme. The behavior of the integer program go from SPECint95-beta strongly supports this. This program is not biased at all, with about the same number of branches for each 5% interval of takenness. All seven schemes perform poorly on this program if only an 8K-bit branch buffer is used. With increased buffer size, we show that some of the schemes improve substantially in prediction accuracy.

References

[BL93] T. Ball and J. Larus, "Branch Prediction for Free," Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, 1993.

[CG94] B. Calder and D. Grunwald, "Fast & Accurate Instruction Fetch and Branch Prediction," Intl. Symp. on Computer Architecture, Apr. 1994.

[FF92] J. Fisher and S. Freudenberger, "Predicting Conditional Branch Directions From Previous Runs of a Program," Proc. 5th Annual Intl. Conf. on Architectural Support for Prog. Lang. and Operating Systems, Oct. 1992.

[GSM95] N. Gloy, M. Smith, and C. Young, "Performance Issues in Correlated Branch Prediction Schemes," to appear in the Proc. 28th Annual IEEE/ACM Intl. Symp. on Microarchitecture, Nov. 1995.

[JW89] N. Jouppi and D. Wall, "Available Instruction-level Parallelism for Superscalar and Superpipelined Machines," Proceedings of ASPLOS III, April 1989.

[LS92] J. Lee and A. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," Computer 17:1, Jan. 1984.

[M93] S. McFarling, "Combining Branch Predictors," TR, Digital Western Research Laboratory, Jun. 1993.

[MH86] S. McFarling and J. Hennessy, "Reducing the Cost of Branches," Proc. of 13th Annual Intl. Symp. on Computer Architecture, Jun. 1986.

[PH95] D. Patterson and J. Hennessy, "Computer Architecture: A Quantitative Approach, 2nd Edition," Morgan Kaufmann Publishers, Inc., 1995.

[PSR] S. Pan, K. So, and J. Rahmeh, "Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation," Proc. 5th Annual Intl. Conf. on Architectural Support for Prog. Lang. and Operating Systems, Oct. 1992.

[S81] J. Smith, "A Study of Branch Prediction Strategies," Proc. 8th Annual Intl. Symp. on Computer Architecture, May 1981.

[SHADE] Sun Microsystems, "Shade Manual."

[YP93] T. Yeh and Y. Patt, "A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History," Proc. 20th Annual Intl. Symp. on Computer Architecture, May 1993.

[YP91] T. Yeh and Y. Patt, "Two-Level Adaptive Training Branch Prediction," Proc. 24th Annual ACM/IEEE Intl. Symp. and Workshop on Microarchitecture, Nov. 1991.

[YS95] C. Young and M. Smith, "A Comparative Analysis of Schemes for Correlated Branch Prediction," Proc. 22nd Annual Intl. Symp. on Computer Architecture, June 1995.

[YS94] C. Young and M. Smith, "Improving the Accuracy of Static Branch Prediction Using Branch Correlation," Proc. 6th Intl. Conf. on Architectural Support for Prog. Lang. and Operating Systems, October 1994.

[W91] D. Wall, "Limits of Instruction-level Parallelism", Proceedings of ASPLOS IV, April 1991.
