
Control Flow Prediction Schemes for Wide-Issue Superscalar Processors

Simonjit Dutta and Manoj Franklin

Abstract: In order to achieve high performance, wide-issue superscalar processors have to fetch a large number of instructions per cycle. Conditional branches are the primary impediment to increasing the fetch bandwidth because they can potentially alter the flow of control and are very frequent. To overcome this problem, these processors need to predict the outcome of multiple branches in a cycle.

This paper investigates two control flow prediction schemes that predict the effective outcome of multiple branches with the help of a single prediction. Instead of considering branches as the basic units of prediction, these schemes consider subgraphs of the control flow graph of the executed program as the basic units of prediction and predict the target of an entire subgraph at a time, thereby allowing the superscalar fetch mechanism to go past multiple branches in a cycle. The first control flow prediction scheme investigated considers sequential block-like subgraphs and the second scheme considers tree-like subgraphs to make the control flow predictions.

Both schemes do a 1-out-of-4 prediction as opposed to the 1-out-of-2 prediction done by branch-level prediction schemes. These two schemes are evaluated using a MIPS ISA-based 12-way superscalar microarchitecture. An improvement in effective fetch size of approximately 25 percent and 50 percent, respectively, is observed over identical microprocessors that use branch-level prediction. No appreciable difference in the prediction accuracy was observed, although the control flow prediction schemes predicted 1-out-of-4 outcomes.

Index Terms: Block-level prediction, instruction-level parallelism (ILP), multiple-issue processors, multiple-branch prediction, speculative execution, trace cache, tree-level prediction.


1 INTRODUCTION

SUPERSCALAR processors have become very popular. Almost all of the recent microprocessor releases are superscalar processors [15]. As more and more transistors are integrated on a single chip, the issue width of these processors also keeps increasing steadily. To sustain a high issue rate, a superscalar processor fetches and decodes multiple instructions per cycle from a sequential instruction stream. Needless to say, the sustained instruction issue rate of a superscalar processor will never exceed the average number of useful instructions it is able to bring into the execution unit in a cycle.

Conditional branches are the primary impediment to feeding the execution unit with a large number of instructions per cycle because they are frequent (averaging about one in every five to six instructions for nonnumeric code) and conditionally alter the flow of control. Computer architects realized this problem fairly early on and adopted branch prediction techniques to overcome it [14]. That is, when a conditional branch is encountered, the processor predicts the outcome of the branch and executes instructions from the predicted path in a speculative manner. If a prediction is later found to be incorrect, the hardware discards the instructions executed after the mispredicted branch and continues execution along the correct path. Branch prediction techniques have constantly improved over the years and current predictors provide prediction accuracies averaging 92-96 percent for nonnumeric programs [9], [10], [18].

Although high-accuracy predictors permit the processor to get past unresolved branches, the high frequency of branches necessitates dealing with multiple branches per cycle. If the processor predicts the outcome of at most a single branch per cycle, then it can fetch only about five to six instructions in a cycle, on the average. This severely limits the amount of instruction-level parallelism (ILP) that can be exploited by wide-issue superscalar processors. To overcome this bottleneck, many schemes have been proposed, which can be classified into two broad categories:

- ISA-visible schemes: those that involve modifications at the ISA (Instruction Set Architecture) level, along with additional support from the compiler, and

- ISA-invisible schemes: those that involve no modifications at the ISA level and rely primarily on the hardware.

1.1 ISA-Visible Schemes

The first category includes techniques such as predicated instructions [7], VLIW tree instructions [2], and block-structured ISAs [8]. Predicated instructions allow the compiler to eliminate short branches by converting their control dependences into data dependences. The branch instruction is converted to a predicate-setting instruction and the branch's control-dependent instructions are converted to predicated instructions. Predication reduces the number of branches encountered by the hardware and enlarges the size of basic blocks.


Compiler techniques such as superblock scheduling [5] and hyperblock scheduling [6] take this approach one step further by converting an entire subgraph of the control flow graph to predicated instructions and doing appropriate scheduling within the subgraph.

VLIW tree-like instructions group operations from multiple paths into a tree-like structure. Each of these tree-like instructions has the potential for a multiway branch with multiple branch targets. All operations and branches in a tree-like instruction are data-independent of each other and can execute in parallel as a single VLIW instruction. On the basis of the evaluation of the multiway branch, a single path within a tree instruction is selected at execution time as the taken path. Notice that no control speculation is involved here. Because the operations within a VLIW instruction must be data-independent, it is critical that the compiler be able to find enough data-independent operations to fill each VLIW instruction.

Block-structured ISAs take a group of adjacent basic blocks of the program's control flow graph and define it as an architectural atomic unit. These atomic units, called atomic blocks, are specified by the compiler. Unlike a VLIW instruction, the operations within an atomic block need not be data-independent. An atomic block can encompass multiple branches. All of these branches, except the last one in an atomic block (if it is the last instruction in the atomic block), have their outcomes (indirectly) predicted by the compiler. If the processor finds any of these (static) predictions to be incorrect, it discards the entire atomic block and continues by executing the correct atomic block. Each atomic block can have more than two successor atomic blocks, from which a successor is selected dynamically by means of a multiway prediction.

1.2 ISA-Invisible Schemes

The schemes in the second category involve no modifications at the ISA level and consist primarily of hardware techniques to predict the outcome of multiple branches in a cycle. Although this paper focuses on this category, the two approaches are not mutually exclusive. The approach considered in [11], called control flow prediction, is to consider subgraphs of the executed program's control flow graph as the basic units of prediction. The targets of a subgraph are recorded in a buffer and the hardware predicts one of the targets of the current subgraph. By predicting a target, the control flow predictor specifies the next subgraph to be speculatively executed, thereby going past one subgraph a cycle. In the type of control flow prediction considered in [11], a subgraph can encompass an arbitrary number of branches and can, therefore, have arbitrary structure, length, and number of targets. Such a control flow prediction scheme, with unrestricted subgraph structures, is more apt for execution models that pursue multiple flows of control, such as the multiscalar processor [3], [16].

For processors that follow the superscalar model of execution (single flow of control), the subgraph structure needs to be restricted so as to facilitate quick identification of a path through the subgraph along which instruction execution can proceed. This paper explores two restricted subgraph structures that are suitable for control flow prediction in superscalar processors: 1) sequential block-like subgraphs and 2) tree-like subgraphs. A sequential block-like subgraph includes instructions only from a contiguous portion of memory and offers the advantage that the instruction fetch mechanism need fetch only a contiguous block of instructions from the instruction cache every cycle.

A tree-like subgraph, although more complex than a sequential block-like subgraph, still has a well-defined structure which permits the easy selection of a dynamic path through it. In addition, it allows the superscalar fetch mechanism to fetch a large number of useful instructions in a cycle, despite the presence of taken branches in the instruction stream.

One of the prediction techniques discussed in [7], called Btc, performs subgraph-level prediction using sequential blocks. A cache block is used as the basis for making control flow predictions by recording the address to which control flowed the last time the block was executed and by predicting the same address as the future target. No other history of the block's behavior is recorded for future predictions.

The technique discussed in [19] uses the tree-like subgraph structure, but selects a path through the tree by performing multiple branch predictions in a cycle, using a single, high-accuracy, two-level branch predictor [18]. Given a subgraph address, the prediction mechanism performs these predictions before even determining the addresses of the intermediate conditional branches contained within the subgraph. Based on the prediction values, a Branch Address Cache provides the selected path's attributes, such as the intermediate fetch addresses and the node lengths, which are used by the fetch unit to fetch instructions from the instruction cache. Because the intermediate branch addresses are not known when the predictions are made, this scheme uses the condensed history of all branches to make its decisions, which results in reduced prediction accuracy. Furthermore, the need to predict multiple branches per cycle makes the prediction mechanism multiported, thereby increasing its complexity. The technique proposed in [13] attempts to overcome the multiportedness problem by using two separate predictors, which are accessed in parallel. The individual predictions given by the two predictors are used to decide the next two blocks to be speculatively executed by the processor.

This paper investigates prediction schemes that select a dynamic path through a (block-like or tree-like) subgraph by means of a single prediction. That is, instead of predicting the outcome of each conditional branch in a subgraph, a single prediction is made to select a path through the subgraph, and instructions are fetched and executed from the selected path. The advantage of this approach is that the prediction mechanism need not be much more complex than an ordinary branch predictor. Furthermore, the prediction mechanism is not restricted to storing the condensed history of all branches, and this gives better prediction accuracies.

1.3 Paper Organization

This paper is organized into six sections. The introduction has highlighted the need to go beyond predictions at the branch level and to predict the outcome of multiple branches per cycle. Section 2 details the operation of block-level prediction and Section 3 details the operation of tree-level prediction. Section 4 estimates the hardware cost of both schemes and discusses ways to reduce the hardware cost. It also compares the hardware cost to that of contemporary schemes. Section 5 presents a detailed experimental evaluation of block-level prediction and tree-level prediction using the MIPS architecture and a collection of SPEC92 and SPEC95 integer programs. It also presents detailed sensitivity studies of the two schemes. Section 6 presents a summary and the conclusions of this paper.

2 BLOCK-LEVEL PREDICTION

The first control flow prediction scheme considered in this paper uses a sequential block-like subgraph as the basic unit of prediction and is called block-level prediction. The main idea is to predict the control flow outcome of a sequential block of instructions using a single prediction, instead of separately predicting the outcome of each branch in the block.

2.1 General Operation

Fig. 1 shows the structure of a sequential block-like subgraph of depth 4 that encompasses three conditional branches. The block is addressed by the address of the first instruction in the block, A0. Each node of the sequential block is a straightline piece of code. The nodes at depths 0, 1, and 2 are terminated by control changing instructions (CCIs). The node at depth 3 is terminated by a single-target instruction (STI), such as an unconditional branch, procedure call, or non-control-changing instruction. Such a definition allows up to 3 conditional branches (or 4 nodes) to be encompassed in a block. The number of control flow outcomes of a block is one more than the number of conditional branches encompassed in the block. Associated with each possible outcome, there is a target. Multiple outcomes can have the same target. Notice that it is possible to have control flows into the middle of a block from outside the block.

Fig. 1. Generic structure of a block-like subgraph.

Fig. 2. Block diagram of a superscalar processor with block-level prediction.

Fig. 2 gives a block diagram of a superscalar processor that incorporates subgraph-level prediction. The fetch unit sends the current subgraph address (CSA) to the subgraph predictor. The subgraph predictor predicts a control flow outcome for this subgraph, retrieves the attributes (in this case, the path length and the target) for the predicted outcome from its storage, and sends them to the fetch unit. The fetch unit fetches this sequential path from the i-cache and forwards the fetched instructions to the decoder. Notice that the fetched path is a straight-line piece of code; thus, even if the predicted target falls within the block, instructions from the target are not utilized in the same cycle.

The decoder decodes the instructions fetched in the previous cycle and sends them to the hardware window structure, which serves as a platform for performing dynamic scheduling of instructions. Multiple functional units are provided to execute multiple instructions per cycle. Among these, the branch units carry out the execution of CCIs. When a CCI is executed, a check is made to see if the actual outcome is in line with the earlier subgraph prediction; if there is a discrepancy, the hardware window discards the instructions following the point of discrepancy. It also sends the misprediction information to the fetch unit, which then starts fetching from the correct target.

2.2 Size of a Block

Before proceeding to study exactly how subgraph predictions are made, let us pause briefly to examine some restrictions that need to be imposed on the size of the blocks used for making control flow predictions. Two parameters are important when discussing the size of a sequential block: 1) the number of control flow outcomes a block can have and 2) the number of instructions in a block.

The first parameter is decided by the number of branches encompassed in a block. If a block has only two outcomes, then only one branch is contained in a block and only one branch is effectively predicted per cycle. If a single branch is effectively predicted per cycle, the average number of useful instructions fetched per cycle can never exceed the average number of instructions between consecutive branches, which is only about five for typical nonnumeric programs. On the other hand, if a block has a large number of possible outcomes, then the prediction mechanism has to choose one outcome from many, potentially resulting in lower prediction accuracies. Each time a misprediction is detected, the instructions fetched after the mispredicted branch have to be discarded. Thus, although a larger block is fetched in a cycle, the poor prediction accuracy could potentially result in reduced performance compared to the case when a block has fewer outcomes. As a trade-off, the studies in this paper allow up to four outcomes per block, which allows up to three conditional branches to be encompassed within a block.

Now, consider the second parameter, namely the number of instructions in a block. Ideally, the fetch mechanism should be able to fetch all of the instructions in a block in one cycle, enabling a new block-level prediction to be carried out every cycle. For this reason, it is better to restrict the length of a block to f instructions (by trimming the last node in the block), where f is the maximum fetch size of the superscalar fetch mechanism.

2.3 Prediction Mechanism

Let us now return to the discussion on the methodology of carrying out block-level predictions. The predictions are made with the help of a two-level subgraph predictor,1 similar in spirit to the two-level branch predictor described in [18]. Fig. 3 shows the block diagram of a two-level subgraph predictor. The two main structures in this predictor are the Subgraph History Table (SHT) and the Pattern History Table (PHT). The SHT is the first-level table and the PHT is the second-level table. The SHT stores information relevant to the subgraphs that have been encountered in the recent past. Each SHT entry has three fields: Tag, Path Attributes, and Subgraph History Pattern. The Tag field is for uniquely identifying the subgraph currently mapped to that entry.2 The Path Attributes field stores the node lengths and targets of the block, and the Subgraph History Pattern field stores the last p control flow outcomes (p = 6 in the figure) of the subgraph as a 2p-bit pattern. Because there are four possible outcomes for a subgraph, 2 bits are required to store each outcome. For each possible 2p-bit pattern, a condensed history of the previous outcomes corresponding to the pattern is recorded in the PHT by means of four independent up/down counters {C0, C1, C2, C3}. The PHT has 2^(2p) entries.

When the instruction fetch unit requests a prediction for subgraph S, the appropriate SHT entry is selected and its Tag field is checked to see if S is currently mapped to it. If so, the entry's Subgraph History Pattern is used to select the appropriate PHT entry. The four count values in the PHT entry are supplied to a Path Selector, which selects the path corresponding to the maximum count value. (If there is a tie between two or more counter values for the maximum position, the selector could choose the last outcome of that pattern as the next prediction, provided the last outcome is also stored in each PHT entry.) The Path Attributes field of the selected SHT entry contains the path lengths and targets of S. The path length and target corresponding to the predicted path are selected using a 4:1 MUX and sent to the fetch unit. Compared to a two-level branch-level predictor that also stores the branch targets, the additional steps in the time-critical path are the following: 1) the time required to select one of the four counter values and 2) the time delay involved in the 4:1 MUX (as opposed to a 2:1 MUX for the branch-level predictor to select the next instruction address).

1. It is not mandatory to use a two-level prediction scheme for implementing subgraph prediction. An implementation using the two-level approach is described in this paper because it tends to give high prediction accuracies.

2. Because the number of SHT entries is smaller than the instruction address space, multiple subgraphs can map to the same SHT entry. For making accurate predictions, it is not very important to have a Tag field to identify the subgraph currently mapped to an SHT entry. However, the information in the Path Attributes field relates only to the subgraph currently mapped to the SHT entry, and the Tag field helps in identifying this. Notice that a branch-level predictor that stores branch targets (a Branch Target Buffer) also needs to have a similar Tag field.

After making a prediction for S, the predictor performs a speculative update of S's history. The selected PHT entry's counter values are updated in the following manner. The count value corresponding to the predicted outcome is incremented by 3 (or less if the counter saturates) and all other count values are decremented by 1 (if they are not already 0).3 The Subgraph History Pattern field of the selected SHT entry is updated by shifting it left by 2 bit positions and, then, entering the latest prediction in the two rightmost bit positions (i.e., the bit positions left vacant by the left shift). Notice that these updates to the SHT and PHT entries, done immediately after making a prediction, are done in a speculative manner. By doing so, up-to-date history is made available for making future predictions. (Because the accuracy of two-level prediction schemes is very high, using the speculatively updated history will result in better prediction accuracy than using obsolete history.) If the processor detects a misprediction, it discards the subgraph paths fetched after the point of misprediction. When such a situation arises, the speculative updates done to the SHT and PHT entries corresponding to the discarded subgraphs are undone.
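To make the lookup-and-update sequence above concrete, the following C sketch models the predictor as described: an SHT indexed by the block address and tagged, a 2p-bit history pattern selecting a PHT entry of four counters, max-counter path selection, and the +3/-1 speculative update followed by the history shift. It is a minimal illustrative model, not the authors' implementation; the structure and field names are assumptions, ties are broken toward the lowest-numbered outcome, and the undoing of speculative updates on a misprediction is omitted.

#include <stdint.h>

#define P          6                    /* history length p (outcomes)        */
#define SHT_SIZE   1024                 /* first-level (SHT) entries          */
#define PHT_SIZE   (1u << (2 * P))      /* 2^(2p) second-level (PHT) entries  */
#define CTR_MAX    15                   /* 4-bit saturating counters          */

typedef struct {
    uint32_t tag;                 /* identifies the block mapped to this entry */
    uint32_t target[4];           /* Path Attributes: one target per outcome   */
    uint8_t  path_len[4];         /* Path Attributes: path length per outcome  */
    uint32_t history;             /* last p outcomes, 2 bits each              */
} sht_entry_t;

typedef struct {
    uint8_t c[4];                 /* counters C0..C3, one per outcome          */
} pht_entry_t;

static sht_entry_t sht[SHT_SIZE];
static pht_entry_t pht[PHT_SIZE];

/* Predict a 1-out-of-4 outcome for the block starting at block_addr.
 * Returns the predicted outcome (0..3), or -1 on an SHT miss
 * (Prediction Valid deasserted). */
int predict_block(uint32_t block_addr, uint32_t *target, uint8_t *path_len)
{
    sht_entry_t *s = &sht[block_addr % SHT_SIZE];
    if (s->tag != block_addr)
        return -1;

    pht_entry_t *ph = &pht[s->history & (PHT_SIZE - 1)];

    /* Path Selector: choose the outcome with the maximum counter value. */
    int best = 0;
    for (int i = 1; i < 4; i++)
        if (ph->c[i] > ph->c[best])
            best = i;

    *target   = s->target[best];
    *path_len = s->path_len[best];

    /* Speculative update: +3 (saturating) for the predicted outcome,
     * -1 for the others, then shift the new outcome into the history. */
    for (int i = 0; i < 4; i++) {
        if (i == best)
            ph->c[i] = (ph->c[i] + 3 > CTR_MAX) ? CTR_MAX : (uint8_t)(ph->c[i] + 3);
        else if (ph->c[i] > 0)
            ph->c[i]--;
    }
    s->history = ((s->history << 2) | (uint32_t)best) & (PHT_SIZE - 1);
    return best;
}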

2.3.1 Handling New Blocks

When a block B is encountered by the fetch unit for the first time, or if its entry in the SHT has been replaced by another block, then the SHT will not have an entry for B. If the fetch unit requests a prediction for B in such a situation, the predictor deactivates the Prediction Valid signal. The fetch unit then requests the i-cache for a sequential block of f instructions, where f is the maximum fetch size. When the path lengths and targets of this block become known, they are sent to the SHT for storage.

2.3.2 Determination of New Path at Times of Recovery

When a control flow outcome (path) is predicted for a block, the outcomes of the conditional branches contained in that path also get predicted indirectly. When the predicted outcome of a conditional branch is later found to be incorrect, the processor initiates recovery actions. These actions include discarding from the hardware window all of the instructions that have been fetched beyond the mispredicted branch. The recovery process also involves predicting a new control flow outcome (path) for the block that contains the mispredicted branch. If the mispredicted branch was predicted to be not taken (but ends up being taken), then the new prediction value for the block is automatically determined. On the other hand, if the mispredicted branch was predicted to be taken (but ends up being not taken), then there might be a choice involved in selecting the new prediction value because of the presence of conditional branches in the fall-through path of the branch. In such a situation, the recovery mechanism conveys the misprediction information to the predictor, which then selects a new prediction value from the available choices.

Fig. 3. Block diagram of two-level subgraph predictor.

3. The decision to update the counters in this manner was based on experimental results. When the count value corresponding to the predicted outcome is incremented by 1 and the others are decremented by 1, most of the count values tend to be zero or close to zero most of the time because, in every update, the overall decrement amount is more than the increment amount.

3 TREE-LEVEL PREDICTION

A potential limitation of block-level prediction is that the effective fetch size (i.e., the number of useful instructions entered into the hardware window from a block) can be somewhat lower than the block size because of branches that are predicted to be taken.4 The second scheme that we describe, called tree-level prediction, attempts to overcome this problem by considering a tree-like subgraph as the unit of prediction. A tree encompasses the fall-through paths as well as the taken paths of conditional branches.

3.1 General Operation

Fig. 4 shows the generic structure of a tree-like subgraph of depth 3. Each node of the tree is a straightline piece of code. The nodes at depth 0 and depth 1 can be terminated by any control-changing instruction (CCI), such as conditional branches, unconditional branches, and procedure calls. The nodes at depth 2 can be terminated only by single-target instructions (STI), such as unconditional branches, procedure calls, and non-control-changing instructions. Such a definition allows up to three conditional branches, seven nodes, and four tree-paths to be encompassed in a tree.5 The tree nodes start at addresses A0...A6 and have lengths L0...L6. The tree has four targets {T0, T1, T2, T3}, one corresponding to each path. The four tree-paths are given the binary encoding {00, 01, 10, 11}. Each path has three attributes: {node addresses, node lengths, target}. For instance, the attributes of path 0 are {{A0, A1, A2}, {L0, L1, L2}, T0}. The first two attributes of a tree-path aid in fetching the path from the i-cache and the last attribute gives the address of the next tree. The methodology of integrating the tree-level subgraph predictor into the CPU is the same as that of integrating the block-level predictor.
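As a concrete picture of what the predictor must record per tree, the following C sketch lists the per-tree path attributes implied by Fig. 4; the struct and field names are illustrative assumptions, not the paper's notation.

#include <stdint.h>

typedef struct {
    uint32_t node_addr[7];   /* A0..A6: start address of each tree node         */
    uint8_t  node_len[7];    /* L0..L6: number of instructions in each node     */
    uint32_t target[4];      /* T0..T3: address of the next tree, one per path  */
} tree_path_attributes_t;

/* Tree-path 0 (encoding 00) is fetched from nodes {A0, A1, A2} with lengths
 * {L0, L1, L2} and continues at target T0; tree-path 1 (encoding 01) shares
 * nodes A0 and A1 but ends with node A3 and continues at target T1. */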

3.2 Size of a Tree

As with block-like subgraphs, some restrictions need to be imposed on the size of trees, again on the same two parameters, namely 1) the number of control flow outcomes (or paths) a tree can have and 2) the number of instructions in each path of the tree.

The first parameter is decided by the number of branches encompassed in a tree path. If a tree has only two outcomes, then only one branch is effectively predicted per cycle and this severely limits the average number of instructions fetched per cycle. On the other hand, if a tree has a large number of possible outcomes, then the prediction mechanism has to choose one outcome from many possibilities, potentially resulting in lower prediction accuracies. Again, as a trade-off, this paper allows up to four outcomes per tree so as to encompass up to two conditional branches in any single path through a tree; eight-outcome trees are investigated in [1], and unbalanced trees are investigated in [17]. The tree is extended beyond the second conditional branch by one more level to further increase the tree size without increasing the number of paths in the tree. Although the nodes at this last level cannot have any conditional branches, they can have up to one unconditional control-changing instruction each.

Now, consider the second parameter, namely the number of instructions in each tree-path. Ideally, the fetch mechanism should be able to fetch all instructions from a selected path in one cycle. This will allow one subgraph prediction to be carried out every cycle. Thus, the length of any path through a subgraph is limited to f (by trimming the last node in that path), where f is the maximum fetch size of the superscalar fetch mechanism.

3.3 Prediction Mechanism

The structure of the subgraph predictor for tree-like subgraphs is the same as that for block-like subgraphs, except that the Path Attributes field of the SHT stores the intermediate node addresses and node lengths as well. The working of the predictor is also similar to that of block-like subgraphs.

3.3.1 Handling New Trees

When a tree T is encountered by the fetch unit for the first time, or if its entry in the SHT has been replaced by another tree, then the SHT will not have an entry for T. If the fetch unit requests a prediction for T in such a situation, the predictor deactivates the Prediction Valid signal. The fetch unit then requests the i-cache for a sequential block of f instructions, where f is the maximum fetch size. When these instructions are decoded, the decoder sends the correct attributes of tree-path 0 to the SHT for storage. The attributes of the remaining paths of T are collected by the decoder when control flows through them for the first time. Thus, the trees are built up dynamically during program execution.

3.3.2 Determination of New Path at Times of Recovery

When a path prediction is made for a tree, the outcomes of the conditional branches contained in the selected tree-path also get predicted indirectly. When the predicted outcome of a conditional branch is later found to be incorrect, the processor initiates recovery actions as before by discarding the instructions fetched after the misprediction point. Concomitant to this is the undoing of the speculative updates done to the SHT and PHT entries corresponding to the affected subgraphs. The prediction mechanism then predicts a new path for the tree that contains the mispredicted branch. If the mispredicted branch is the second branch in its tree-path, then there is no choice involved in determining the new prediction value (and, hence, no need to access the PHT); the new prediction value is obtained by complementing the second bit of the previous 2-bit prediction value. On the other hand, if the mispredicted branch is the first branch in its tree-path, then there is a choice involved in selecting the new prediction value and the appropriate PHT entry is consulted to select a new prediction value. If the branch is now taken, then the predictor selects the tree-path corresponding to the maximum count value among {C2, C3}, and if the branch is now not taken, then the predictor selects the tree-path corresponding to the maximum count value among {C0, C1}.
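A small C sketch of this recovery-time re-selection, under the same assumptions as the earlier predictor sketch (ties broken toward the lower-numbered path; names are illustrative):

#include <stdint.h>

/* old_pred : previous 2-bit path prediction for the tree
 * depth    : 0 if the mispredicted branch was the first branch in its
 *            tree-path, 1 if it was the second
 * taken    : the branch's actual direction (1 = taken)
 * c[]      : counters C0..C3 of the PHT entry for the affected tree */
unsigned reselect_tree_path(unsigned old_pred, int depth, int taken,
                            const uint8_t c[4])
{
    if (depth == 1)                       /* second branch: no choice,    */
        return old_pred ^ 1u;             /* complement its outcome bit   */

    /* First branch mispredicted: consult the two counters of the paths
     * that agree with the now-known branch direction. */
    if (taken)
        return (c[2] >= c[3]) ? 2u : 3u;  /* maximum of {C2, C3} */
    return (c[0] >= c[1]) ? 0u : 1u;      /* maximum of {C0, C1} */
}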

4. Part of this problem can be overcome by letting the compiler modify the branch opcodes or perform instruction scheduling in such a manner that the conditional branches at the beginning of frequently executed blocks most likely take the fall-through path (as described in [4] for the IBM RS/6000). This will cause most of the instructions of those blocks to be executed, thereby resulting in larger effective fetch sizes. Techniques for doing this are beyond the scope of this paper.

5. It is important to point out that if a node at depth 0 or 1 encounters an instruction that has more than two possible targets, such as a subroutine return or an indirect branch, then that path of the tree gets pruned at the instruction following the multitarget instruction. The target of such a pruned path is given a special value so that the instruction fetch mechanism can use a more suitable predictor (such as a return address predictor or an indirect branch target predictor) to predict the actual value of the target.

3.4 Fetching Tree-Paths

Tree-like subgraphs differ from block-like subgraphs in that, except for path 0, the instructions contained in the remaining tree-paths are not from contiguous portions of memory. Thus, if the predictor predicts a tree-path other than path 0, then the nodes within that path may reside in different i-cache blocks. One way of fetching nodes from multiple i-cache blocks is to make the i-cache multiported, possibly by interleaving. A better alternative, however, is to use the trace cache approach described in [12]. The trace cache is a special instruction cache for capturing dynamic instruction sequences. Each line in the trace cache of [12] stores a dynamic code sequence, which may contain one or more taken branches. For application in a processor that performs tree-level prediction, each line in the trace cache can store the instructions of a tree-path in a contiguous manner. Most of the time, an instruction fetch request hits in the trace cache and the trace cache supplies an entire tree-path in a cycle. If the fetch request misses in the trace cache, then the i-cache or the next level of the memory hierarchy supplies the instructions of the tree-path, which are then stored in the trace cache for future accesses. Notice that, when a trace cache is used, the Path Attributes field of the SHT need not store the internal node addresses of the tree-paths.

3.5 Comparison of Block-Level Prediction and Tree-Level Prediction

Both block-level prediction and tree-level prediction consider subgraphs of the executed program's control flow graph as the basic units of control flow prediction. Whereas block-level prediction uses sequential block-like subgraphs, tree-level prediction uses tree-like subgraphs. By considering tree-like subgraphs, the latter allows the fetch unit to fetch multiple instructions in a cycle despite the presence of taken branches in the instruction stream. Both allow up to four control flow outcomes per subgraph. Block-level prediction allows up to three conditional branches in a (straightline) path, whereas tree-level prediction allows up to two in a (possibly nonstraightline) path. By restricting the parallelly fetched instructions to be from a contiguous portion of memory, block-level prediction allows for a simpler fetch unit and i-cache structure. With the emergence of trace caches that store a noncontiguous dynamic instruction sequence in a single cache block, tree-level prediction can be expected to become popular for the wide-issue superscalar processors of the future.

4 HARDWARE COST FOR TWO-LEVEL SUBGRAPH PREDICTORS

This section calculates the hardware cost for implementing the two-level subgraph predictor for block-level prediction and tree-level prediction and compares the cost against the costs of contemporary prediction schemes.

4.1 Hardware Cost of Block-Level Predictor

The two hardware-intensive components of the block-level predictor are the SHT and the PHT, and their costs are calculated below.

The Tag field of an SHT entry has as many bits as are needed to uniquely identify a block address, which is at most w, where w is the number of bits required to specify an address. The Path Attributes field of an SHT entry stores up to four node lengths (each requiring 4 bits) and up to four target addresses (each requiring w bits). Lastly, the Subgraph History Pattern field of an SHT entry requires 2p bits to store the last p outcomes of the block. Thus, the maximum number of bits required to implement an n-entry SHT is n × (5w + 2p + 16).

Next, consider the number of bits required for the PHT. Because each Subgraph History Pattern has 2p bits, the PHT has 2^(2p) entries. Each PHT entry stores 4 count values. The total number of bits for implementing the PHT is, therefore, 4b × 2^(2p), where b is the number of bits in a PHT counter. In addition, the block-level predictor requires an n-input decoder, a 2^(2p)-input decoder, and a 4-input multiplexer.

4.2 Hardware Cost of Tree-Level Predictor

For the tree-level predictor, the hardware cost of the PHT remains the same. The hardware cost of the SHT increases because the Path Attributes field has to store more information. The Path Attributes field, in this case, stores four target addresses (each requiring w bits), six intermediate fetch addresses (each requiring w bits), and seven subgraph node lengths (each requiring 4 bits, allowing each node to have a length of up to 15). Then, the maximum number of bits required to implement an n-entry SHT is n × (11w + 28 + 2p). In addition, the tree-level predictor requires an n-input decoder, a 2^(2p)-input decoder, and a 4-input multiplexer.
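As a rough illustration, under the default simulation parameters used later in the paper (n = 1024 entries, p = 6, 4-bit PHT counters) and an assumed address width of w = 32 bits, these expressions work out to approximately:

\[
\begin{aligned}
\text{Block-level SHT:}\quad & n(5w + 2p + 16) = 1024 \times 188~\text{bits} \approx 23.5~\text{Kbytes},\\
\text{Tree-level SHT:}\quad  & n(11w + 28 + 2p) = 1024 \times 392~\text{bits} \approx 49~\text{Kbytes},\\
\text{PHT (either scheme):}\quad & 4b \cdot 2^{2p} = 4 \times 4 \times 2^{12}~\text{bits} = 8~\text{Kbytes}.
\end{aligned}
\]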

4.2.1 Reducing the Hardware Cost

The cost analysis given above shows that two-level subgraph prediction, especially the one used for tree-level prediction, has a high hardware cost. Fortunately, there are ways to reduce the hardware cost.

Fig. 4. Generic structure of a tree-like subgraph.

Reducing the Hardware Cost of the SHT: Eliminate Redundant Information. The reason for the significant hardware cost of the SHT is the storage required for the Path Attributes field, which contains redundant information. Referring to Fig. 4, for tree-path 0, it is sufficient to store the subgraph address A0, the combined length L0 + L1 + L2, and the target T0. Similarly, for tree-path 1, it is sufficient to store, in addition, the intermediate address A3, the lengths L0 + L1 and L3, and the target T1. By storing information in this manner, the SHT cost can be reduced to n × (8w + 28 + 2p) bits. Notice that most processors have restrictions on the instruction address space, effectively reducing the value of w by 25 percent or more, resulting in an additional 25 percent savings in the SHT hardware cost. Notice also that n can be reduced with negligible impact on the SHT hit rate by using set-associative mapping for the SHT.
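Under the same illustrative parameters as before (n = 1024, p = 6, assumed w = 32), this brings the tree-level SHT down from roughly 49 Kbytes to about 37 Kbytes:

\[
n(8w + 28 + 2p) = 1024 \times 296~\text{bits} \approx 37~\text{Kbytes}.
\]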

Reducing the Hardware Cost of the PHT: Split Pattern Predictor. The reason for the significant hardware cost of the PHT is the factor 2^(2p) in its cost formula. Notice that, of the two bits used to store a previous outcome in the SHT, the first bit encodes the outcome of the first branch in the tree and the second bit encodes the outcome of the second branch in the taken path in the tree. By treating these two branch outcomes independently, the hardware cost of the PHT can be significantly reduced, as described below.

When a prediction is requested for tree T, the SHT supplies a 2p-bit history pattern of previous outcomes of T. Split this 2p-bit pattern into two p-bit patterns, as shown in Fig. 5. Use one of the patterns to consult a PHT to predict the next outcome of the first branch in T and use the other pattern to consult a different PHT to predict the next outcome of the second branch (in the predicted path) in T. Combining the two single-bit prediction values gives the next 2-bit prediction for T. With this scheme, the number of bits required for the PHTs is reduced dramatically to 2b × 2^p.
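A minimal C sketch of this split-pattern lookup follows; the names are illustrative, and the exact bit interleaving of the split is defined by Fig. 5 (not reproduced in this text), so the sketch simply assumes that the high bit of each stored outcome describes the first branch and the low bit the second.

#include <stdint.h>

#define P               6
#define SPLIT_PHT_SIZE  (1u << P)           /* 2^p entries per PHT            */

/* One 3-bit saturating counter per entry; a value of 4 or more is taken
 * to mean "predict taken". */
static uint8_t pht_first[SPLIT_PHT_SIZE];   /* first branch in the tree       */
static uint8_t pht_second[SPLIT_PHT_SIZE];  /* second branch on chosen path   */

/* 'history' is the 2p-bit Subgraph History Pattern supplied by the SHT. */
unsigned split_pattern_predict(uint32_t history)
{
    uint32_t pat_first = 0, pat_second = 0;

    /* Split the 2p-bit pattern into two p-bit patterns. */
    for (int i = 0; i < P; i++) {
        pat_first  |= ((history >> (2 * i + 1)) & 1u) << i;
        pat_second |= ((history >> (2 * i)) & 1u) << i;
    }

    unsigned bit_first  = pht_first[pat_first]   >= 4;  /* 1-bit prediction */
    unsigned bit_second = pht_second[pat_second] >= 4;  /* 1-bit prediction */

    return (bit_first << 1) | bit_second;   /* combined 2-bit path prediction */
}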

4.2.2 Comparison to Other Schemes

Let us compare the hardware cost of the split-pattern two-level subgraph predictor to that of 1) the two-level branch predictor (PAg scheme) [18] and 2) the two-level multiple branch predictor (GAg scheme) [19].

Two-Level Branch Predictor. The two-level branch predictor consists of a branch history table (BHT) and a PHT. The BHT contains n entries, each having a Tag field (requiring at most w bits), a w-bit Target field to avoid pipeline bubbles due to branches that are predicted to be taken, and a p-bit Branch History Pattern field. The number of bits required for the BHT is n × (2w + p). This is approximately 25 percent of that required for the SHT in tree-level prediction.

The number of bits required for the PHT is b × 2^p bits, which is half of that required for the PHTs in the split-pattern subgraph predictor. Overall, the hardware cost of the split-pattern subgraph predictor is about four times that of the two-level branch predictor, if both use the same number of entries in the first-level table. It is important to note, however, that the number of entries in the SHT can be less than that in the BHT because the number of trees in the dynamic instruction stream is potentially smaller than the number of branches in the dynamic instruction stream.

Multiple Branch Predictor. The multiple branch predictor uses a two-level branch predictor to carry out multiple branch predictions per cycle. It uses an additional Branch Address Cache (BAC) for storing the intermediate fetch addresses, basic block lengths, and targets of the subgraphs. The number of bits required for an n-entry BAC is n × (11w + 28), which can be reduced to n × (8w + 28) using the technique given in Section 4.2.1. This hardware cost is very close to that of the SHT used in tree-level prediction. Because of the need to carry out predictions without knowing the intermediate addresses, for practical purposes, the multiple branch predictor is restricted to using a p-bit global history register in place of the BHT [19]. A major implication of using a global history register is that a much larger pattern size (typically 14 or more) has to be used to get reasonable prediction accuracies, as depicted in Table 2 and in [18], [19]. If p = 14, then the number of bits required for the PHT works out to be 6 Kbytes (for b = 3), which is significantly higher than that for the PHTs in the split-pattern subgraph predictor. Thus, the hardware cost of the split-pattern subgraph predictor is lower than that of the two-level multiple branch predictor.
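To make the contrast explicit (using the default p = 6 and 3-bit counters for the split-pattern PHTs):

\[
b \cdot 2^{14} = 3 \times 16{,}384~\text{bits} = 6~\text{Kbytes}
\qquad\text{versus}\qquad
2b \cdot 2^{6} = 384~\text{bits}.
\]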

5 EXPERIMENTAL EVALUATION

The previous sections described block-level prediction and tree-level prediction, their implementation, and their hardware costs. This section presents the results of an empirical study of their performance. For comparison purposes, the section also presents performance results for the two-level branch predictor [18] (augmented with a branch target buffer to avoid the one-cycle penalty due to branches that are predicted to be taken) and the two-level multiple branch predictor [19].

5.1 Experimental Setup

5.1.1 Simulation Tool and Benchmarks

The performance studies are conducted using the MIPS-I instruction set architecture (ISA), a representative of the class of streamlined ISAs that have emerged recently. All data are gathered with a simulator that accepts programs compiled for a MIPS-based DECstation and simulates their execution, keeping track of relevant information on a cycle-by-cycle basis. The simulator models speculative execution and is not trace-driven. System calls made by the simulated benchmark programs are handled with the help of traps to the operating system. The collected results, therefore, exclude the code executed during system calls, but include all other code portions, including the library routines.

Fig. 5. Splitting a 2p-bit pattern into two.

For benchmarks, five of the SPEC92 integer programs and three of the SPEC95 integer programs are used. The benchmark programs are compiled using the MIPS C compiler with the optimization flags distributed in the SPEC benchmark makefiles. The benchmarks are simulated to completion or up to 100 million instructions, whichever occurred first.

5.1.2 Performance Metrics

For measuring performance, execution time is the sole metric that can accurately measure the performance of an integrated software-hardware system. However, this metric requires many implementation assumptions, including the exact hardware configuration, functional unit latencies, etc.

Other metrics, such as instruction issue rate and number of instructions in the hardware window, also require implementation assumptions that limit the utility of the results. Further, these metrics may not give much insight into the performance of the fetch mechanism, which is the focus of this study. Therefore, for this study, we use the following indirect metrics, which provide good insight into the fetch mechanism's performance: 1) effective fetch size (EFS), 2) subgraph prediction accuracy (SPA), and 3) branch prediction accuracy (BPA).

TABLE 1
Effective Fetch Size for f = 12

TABLE 2
Subgraph Prediction Accuracy (SPA) and Branch Prediction Accuracy (BPA) for f = 12

The first metric indicates the average number of useful instructions fetched in each fetch attempt of the fetch unit. In the case of block-level prediction, instructions that are fetched, but discarded because of a conditional branch being predicted to be taken, are not counted in this metric. Furthermore, useless instructions that are discarded due to incorrect speculative execution are not counted while calculating this metric. Notice also that, if a block or tree had to be fetched in two (or three) attempts because of mispredictions, then it is counted as two (or three) fetch attempts. Thus, EFS takes into account the effect of the inaccuracies in subgraph prediction. This is an important metric because increasing its value is the primary goal of subgraph prediction. Fetching more instructions per cycle gives the superscalar hardware window more opportunities to look for and exploit instruction-level parallelism.
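In other words, under the accounting conventions just described, EFS can be summarized as

\[
\text{EFS} = \frac{\text{number of useful instructions entered into the hardware window}}{\text{number of fetch attempts}},
\]

with discarded instructions excluded from the numerator and each re-fetch of a partially mispredicted block or tree counted as a separate attempt in the denominator.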

The second metric (SPA) indicates the fraction of times a subgraph prediction gives the correct path through a subgraph. Although EFS captures the effect of subgraph prediction inaccuracies, SPA throws more light on why the performance of the predictor is better or worse. We are also interested in knowing if making a four-way subgraph prediction instead of a two-way branch prediction results in a poor prediction accuracy. The third metric (BPA) indicates the fraction of times a branch (indirectly) gets correctly predicted when subgraph prediction is performed. BPA is highly dependent on SPA, but is helpful in comparing with the BPAs obtained with other prediction techniques.

5.1.3 Caveat

It is important to make note of a caveat in this evaluation (which is true for any evaluation of a similar nature). The performance of any computer system is highly dependent on the compiler and the type of static scheduling done. All of these studies are carried out with code compiled for a single-issue processor. The major implication of this is that no subgraph prediction-specific optimizations were performed on the code. In that sense, the performance results presented in this paper for block-level prediction should be viewed as pessimistic. Further doctoring of the code, especially to make the branches take the fall-through path more often than the taken path, as in [4], will certainly improve the performance of block-level prediction.

5.1.4 Default Parameters for the Study

We have attempted to do a reasonable study by varying the important parameters independently. When a parameter is varied, the rest of the parameters are kept fixed at the default values given below:

- Maximum Fetch Size (f): The default value is 12 instructions.

- Branch Predictor: This scheme is simulated with a PAg two-level branch predictor [18], with a pattern size (p) of 6. The BHT has 1,024 entries and is direct mapped. The PHT entries consist of 3-bit up/down saturating counters. The predictions are done based on the start address of each fetch block. The fetch unit fetches a sequential block of f instructions in each attempt; if the first conditional branch within this block is predicted to fall through, the subsequent instructions up to the next conditional branch in the block are also entered into the hardware window. Therefore, strictly speaking, this scheme performs block-level prediction, with the number of control flow outcomes per block restricted to 2.

- Multiple Branch Predictor: This scheme is simulated with a GAg two-level branch predictor. The Branch Address Cache has 1,024 entries and is direct mapped. The PHT entries consist of 3-bit up/down saturating counters. Two pattern sizes (p = 7 and p = 14) are used. Notice that the scheme with pattern size 7 has roughly the same hardware cost as the default tree-level predictor.

- Subgraph Predictor: The block-level prediction scheme and the tree-level prediction scheme are simulated with PAg two-level subgraph predictors. The SHT is also direct mapped and the default number of SHT entries is 1,024. The default pattern size is 6. Each PHT entry has four 4-bit up/down saturating counters. In the split-pattern scheme, each PHT entry has a single 3-bit saturating counter.

Fig. 6. Effect of f on EFS for block-level prediction scheme.

Fig. 7. Effect of p on EFS for block-level prediction scheme.

5.2 Experimental Results

5.2.1 Effective Fetch Size

Fig. 8. Effect of n on SHT hit rate for block-level prediction scheme.

Fig. 9. Effect of f on EFS for split-pattern tree-level prediction scheme.

Table 1 shows the EFS obtained with the default branch prediction scheme, the multiple branch prediction scheme [19], the default block-level prediction scheme, and the default split-pattern subgraph prediction scheme. The first column of numbers presents the EFS obtained with the branch prediction scheme. The next two columns present the EFS obtained with multiple branch predictions per cycle for p = 7 and p = 14, respectively. The fourth column of numbers presents the EFS obtained with the block-level predictor, and the subsequent column presents the percentage improvement in EFS for the block-level predictor over the branch-level predictor. The last two columns present the EFS obtained with the split-pattern tree-level predictor and the percentage improvement in EFS for the tree-level predictor over the branch-level predictor.

Let us consider the results of this table in some detail. The percentage improvement in EFS for block-level prediction varies from 0 percent (for ijpeg) to 40.7 percent (for eqntott). This improvement arises because branch-level prediction is fundamentally limited by the basic block size and, therefore, does not show marked improvement when f is larger than the average basic block size.

Next, consider the results for the tree-level predictor. It can be seen that the EFS is significantly higher for the tree-level predictor compared to that for the branch-level predictor (varying from 41.8 percent for gcc to 130.8 percent for eqntott). Furthermore, the EFS for the multiple branch predictor with p = 7 (which has roughly the same hardware cost as the default split-pattern subgraph predictor) is low. This is because the multiple branch predictor uses a global history register, which results in poor prediction accuracy. When p is increased to 14, its performance improves and it performs only slightly worse than the tree-level predictor (with p = 6).

5.2.2 Prediction Accuracy

Next, let us consider the prediction accuracies obtained. Table 2 presents the prediction accuracies obtained for f = 12. The first column of numbers gives the BPA obtained with the branch-level predictor. The next group of four columns gives the SPA and BPA results for the multiple branch predictor for pattern sizes (p) of 7 and 14. The first two columns in this group give the SPA and BPA for p = 7, and the next two columns give the SPA and BPA for p = 14. The sixth and seventh columns of numbers give the SPA and BPA obtained with the default block-level predictor. The last group of four columns gives the SPA and BPA results obtained with the two tree-level predictors. The first two columns in this group give the SPA and BPA obtained with the non-split-pattern predictor, and the last two columns give the results for the split-pattern predictor.

First, let us consider the results for the multiple branch predictor. The prediction accuracy results obtained with p = 7 are somewhat low. For the multiple branch predictor to perform reasonably well, the pattern size has to be much bigger (14 or more), resulting in high hardware overheads. On the other hand, the BPA results for the block-level predictor and tree-level predictors are very close to the BPA results obtained for the branch-level predictor. This result is very encouraging because it shows that predicting 1-out-of-4 outcomes has provided more or less the same accuracy as predicting 1-out-of-2 outcomes. An important observation to make is that the prediction accuracies with the split-pattern scheme are close to those without the split-pattern scheme. This is also very encouraging because the split-pattern scheme has low hardware overhead.

5.2.3 Sensitivity Studies of Block-Level Prediction

Next, let us vary some of the parameters that were fixed so far in order to study their effect on the performance of block-level predictors. The parameters that are varied for these sensitivity studies are: 1) the maximum fetch size (f), 2) the pattern size (p), and 3) the number of entries in the SHT (n). Each of these parameters is varied while keeping the rest of the parameters at their default values specified in Section 5.1.4.

Fig. 6 plots the EFS obtained for three different values of f, namely {8, 12, 16}. It can be seen that, as f is increased, the EFS also increases. However, the increase is sublinear and tends to taper off because taken branches cause control to flow out of the middle of sequential blocks.

Fig. 10. Effect of p on EFS for split-pattern tree-level prediction schemes.

Fig. 7 plots the EFS obtained for the benchmarks for p = {4, 6, 8}. There is an increase in EFS as p is increased from 4 to 6, but no significant change as p is increased further. These results suggest that a pattern size of 6 is quite adequate for block-level predictors.

Fig. 8 plots the SHT hit rates obtained for the benchmarks for three different values of n. Except for gcc and go, all of the other benchmarks stand to benefit little from a value of n greater than 1,024. For programs that have poor SHT hit rates, such as gcc and go, better hit rates can be achieved by using set-associative mapping to select the SHT entries.

5.2.4 Sensitivity Studies of Tree-Level Prediction

This section presents the results of sensitivity studies of tree-level prediction. As in the case of block-level prediction, the parameters that are varied are: 1) the maximum fetch size (f), 2) the pattern size (p), and 3) the number of entries in the SHT (n). Again, when each of these parameters is varied, the rest of the parameters are maintained at their default values specified in Section 5.1.4. The sensitivity studies are conducted on the low hardware cost, split-pattern tree-level predictor.

Fig. 9 plots the EFS obtained for three different values of f, namely 8, 12, and 16. As with block-level prediction, the effective fetch size begins to taper off as f is increased to 16. To get a substantial increase for higher values of f, the tree structure needs to be extended to include additional nodes in each tree-path.

Fig. 10 plots the EFS obtained for p = 4, 6, and 8. The figure shows that, as with the block-level predictor, a pattern size of 6 is quite adequate for the split-pattern tree-level predictor.

Last, Fig. 11 plots the SHT hit rates obtained with different numbers of entries in the SHT. It can be seen that many of the benchmarks have good hit rates even with a 256-entry SHT. Again, gcc and go have low hit rates even with 1,024-entry SHTs. To improve the hit rates for these benchmarks, set associativity needs to be employed.

6 SUMMARY AND CONCLUSIONS

This paper investigated control flow prediction techniques for wide-issue superscalar processors. The essence of control flow prediction is to do predictions at a level higher than that of branches (in particular, at the level of subgraphs of the executed program's control flow graph) so as to predict the outcome of multiple branches quickly. That is, the predictor determines where control is most likely to go after the execution of the subgraph. This effectively predicts the outcome of multiple branches per cycle, thereby enabling the fetch unit and the decode unit to bring more instructions per cycle into the superscalar execution unit.

Two control flow prediction schemes were investigated. The first scheme considered sequential block-like subgraphs and the second scheme considered tree-like subgraphs. The paper described the hardware cost of carrying out subgraph-level prediction and also described ways to reduce the hardware cost. The cost was found to be less than that of alternative schemes for predicting the outcome of multiple branches per cycle.

The paper also presented results from a thorough simulation study conducted to verify the potential of this prediction strategy. The results of these experiments with the MIPS architecture show block-level prediction and tree-level prediction to increase the effective fetch size by about 25 percent and 50 percent, respectively, over that of branch-level prediction. The effective branch prediction accuracies for the subgraph-level prediction schemes were close to those obtained for branch-level prediction.

Fig. 11. Effect of n on SHT hit rate for split-pattern tree-level prediction scheme.

Subgraph-level prediction has several advantages. First, it predicts the outcome of multiple branches per cycle. This is essential for improving the performance of wide-issue superscalar processors on nonnumerical programs. Second, only a single prediction is made per cycle. Thus, the prediction mechanism need not be much more complex than an ordinary branch prediction mechanism. Last, tree-level prediction allows the instruction fetch mechanism to fetch a large number of instructions per cycle despite taken branches in the instruction stream.

ACKNOWLEDGMENTS

This work was financially supported by U.S. National Science Foundation Research Initiation Award CCR-9410706 and by the Department of Electrical and Computer Engineering of Clemson University. We thank the reviewers for their many helpful comments.

REFERENCES

[1] B. Cyril and M. Franklin, "A Study of Tree-Based Control Flow Prediction Schemes," Proc. Int'l Conf. High Performance Computing, pp. 28-33, 1997.

[2] K. Ebcioglu, "Some Design Ideas for a VLIW Architecture for Sequential Natured Software," Parallel Processing (Proc. IFIP WG 10.3 Working Conf. Parallel Processing, Pisa, Italy), pp. 3-21, 1988.

[3] M. Franklin, "The Multiscalar Architecture," PhD thesis, Technical Report TR 1196, Computer Sciences Dept., Univ. of Wisconsin-Madison, 1993.

[4] M.C. Golumbic and V. Rainish, "Instruction Scheduling Beyond Basic Blocks," IBM J. Research and Development, vol. 34, no. 1, pp. 93-97, Jan. 1990.

[5] W.W. Hwu et al., "The Superblock: An Effective Technique for VLIW and Superscalar Compilation," J. Supercomputing, Jan. 1993.

[6] W.W. Hwu, R.E. Hank, D.M. Gallagher, S.A. Mahlke, D.M. Lavery, G.E. Haab, J.C. Gyllenhaal, and D.I. August, "Compiler Technology for Future Microprocessors," Proc. IEEE, vol. 83, no. 12, Dec. 1995.

[7] S.A. Mahlke et al., "Characterizing the Impact of Predicated Execution on Branch Prediction," Proc. 27th Int'l Symp. Microarchitecture (MICRO-27), pp. 217-227, 1994.

[8] S. Melvin and Y.N. Patt, "Enhancing Instruction Scheduling with a Block-Structured ISA," Int'l J. Parallel Programming, vol. 23, no. 3, 1995.

[9] R. Nair, "Dynamic Path-Based Branch Correlation," Proc. 28th Int'l Symp. Microarchitecture (MICRO-28), 1995.

[10] S-T. Pan, K. So, and J.T. Rahmeh, "Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation," Proc. Fifth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), pp. 76-84, 1992.

[11] D.N. Pnevmatikatos, M. Franklin, and G.S. Sohi, "Control Flow Prediction for Dynamic ILP Processors," Proc. 26th Int'l Symp. Microarchitecture (MICRO-26), pp. 153-163, 1993.

[12] E. Rotenberg, S. Bennett, and J.E. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," Proc. 29th Int'l Symp. Microarchitecture (MICRO-29), 1996.

[13] A. Seznec, S. Jourdan, P. Sainrat, and P. Michaud, "Multiple-Block Ahead Branch Predictors," Proc. Seventh Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), 1996.

[14] J.E. Smith, "A Study of Branch Prediction Strategies," Proc. Eighth Ann. Int'l Symp. Computer Architecture, pp. 135-148, 1981.

[15] J.E. Smith and G.S. Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, vol. 83, no. 12, pp. 1,609-1,624, Dec. 1995.

[16] G.S. Sohi, S.E. Breach, and T.N. Vijaykumar, "Multiscalar Processors," Proc. 22nd Ann. Int'l Symp. Computer Architecture, 1995.

[17] B.R. Toone and M. Franklin, "Control Flow Prediction with Unbalanced Tree-like Subgraphs," Proc. Int'l Conf. High Performance Computing, pp. 221-227, 1998.

[18] T.-Y. Yeh and Y.N. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction," Proc. 19th Ann. Int'l Symp. Computer Architecture, pp. 124-134, 1992.

[19] T.-Y. Yeh, D.T. Marr, and Y.N. Patt, "Increasing Instruction Fetch Rate via Multiple Branch Predictions and a Branch Address Cache," Proc. Seventh ACM Int'l Conf. Supercomputing, pp. 67-76, July 1993.

Simonjit Dutta received the BTech degree in computer engineering from Regional Engineering College, Suratkal, in 1993, and the MS degree in computer engineering from Clemson University, Clemson, South Carolina, in 1995. His MS thesis work involved developing control flow prediction techniques for wide-issue superscalar processors. Prior to joining Intel, he was with Texas Instruments Inc., Dallas, from 1995 to 1996. He is a computer architect at Intel Inc., Santa Clara, California, where he investigates power and performance trade-offs in the mobile platform. His research interests are in computer architecture, instruction-level parallel processing, branch prediction, and system/workload profiling for low power design.

Manoj Franklin received the BSc degree in electronics and communications engineering from the University of Kerala, India, in 1984, and the MS and PhD degrees in computer sciences from the University of Wisconsin-Madison, in 1990 and 1993, respectively. He is an assistant professor in the Electrical Engineering Department of the University of Maryland, College Park, where he holds a joint appointment with the University of Maryland Institute for Advanced Computer Studies (UMIACS). He worked during the summer of 1990 at Cray Research Inc., Chippewa Falls, Wisconsin, and the summer of 1991 at the IBM T.J. Watson Research Center, New York. Prior to joining the University of Maryland, he was an assistant professor at Clemson University from 1993-1998. He received the IBM Graduate Fellowship in the years 1990-1993, the U.S. National Science Foundation (NSF) Research Initiation Award in 1994, and the NSF CAREER award in 1997. His primary research interests are in computer architecture, instruction-level parallel processing, processor design, ILP compilation, and fault-tolerant computing.
