
Control Flow Prediction Schemes for Wide-Issue Superscalar Processors

Simonjit Dutta and Manoj Franklin

Abstract: In order to achieve high performance, wide-issue superscalar processors have to fetch a large number of instructions per cycle. Conditional branches are the primary impediment to increasing the fetch bandwidth because they can potentially alter the flow of control and are very frequent. To overcome this problem, these processors need to predict the outcome of multiple branches in a cycle.

This paper investigates two control flow prediction schemes that predict the effective outcome of multiple branches with the help of a single prediction. Instead of considering branches as the basic units of prediction, these schemes consider subgraphs of the control flow graph of the executed program as the basic units of prediction and predict the target of an entire subgraph at a time, thereby allowing the superscalar fetch mechanism to go past multiple branches in a cycle. The first control flow prediction scheme investigated considers sequential block-like subgraphs and the second scheme considers tree-like subgraphs to make the control flow predictions.

Both schemes do a 1-out-of-4 prediction as opposed to the 1-out-of-2 prediction done by branch-level prediction schemes. These two schemes are evaluated using a MIPS ISA-based 12-way superscalar microarchitecture. An improvement in effective fetch size of approximately 25 percent and 50 percent, respectively, is observed over identical microprocessors that use branch-level prediction. No appreciable difference in the prediction accuracy was observed, although the control flow prediction schemes predicted 1-out-of-4 outcomes.

Index Terms: Block-level prediction, instruction-level parallelism (ILP), multiple-issue processors, multiple-branch prediction, speculative execution, trace cache, tree-level prediction.


1 INTRODUCTION

SUPERSCALAR processors have become very popular. Almost all of the recent microprocessor releases are superscalar processors [15]. As more and more transistors are integrated on a single chip, the issue width of these processors also keeps increasing steadily. To sustain a high issue rate, a superscalar processor fetches and decodes multiple instructions per cycle from a sequential instruction stream. Needless to say, the sustained instruction issue rate of a superscalar processor will never exceed the average number of useful instructions it is able to bring into the execution unit in a cycle.

Conditional branches are the primary impediment to feeding the execution unit with a large number of instructions per cycle because they are frequent (averaging about one in every five to six instructions for nonnumeric code) and conditionally alter the flow of control. Computer architects realized this problem fairly early on and adopted branch prediction techniques to overcome it [14]. That is, when a conditional branch is encountered, the processor predicts the outcome of the branch and executes instructions from the predicted path in a speculative manner. If a prediction is later found to be incorrect, the hardware discards the instructions executed after the mispredicted branch and continues execution along the correct path. Branch prediction techniques have constantly improved over the years and current predictors provide prediction accuracies averaging 92-96 percent for nonnumeric programs [9], [10], [18].

Although high-accuracy predictors permit the processor to get past unresolved branches, the high frequency of branches necessitates dealing with multiple branches per cycle. If the processor predicts the outcome of at most a single branch per cycle, then it can fetch only about five to six instructions in a cycle, on the average. This severely limits the amount of instruction-level parallelism (ILP) that can be exploited by wide-issue superscalar processors. To overcome this bottleneck, many schemes have been proposed, which can be classified into two broad categories:

- ISA-visible schemes: those that involve modifications at the ISA (Instruction Set Architecture) level, along with additional support from the compiler, and

- ISA-invisible schemes: those that involve no modifications at the ISA level and rely primarily on the hardware.

1.1 ISA-Visible Schemes

The first category includes techniques such as predicated instructions [7], VLIW tree instructions [2], and block-structured ISAs [8]. Predicated instructions allow the compiler to eliminate short branches by converting their control dependences into data dependences. The branch instruction is converted to a predicate-setting instruction and the branch's control-dependent instructions are converted to predicated instructions. Predication reduces the number of branches encountered by the hardware and enlarges the size of basic blocks.


Compiler techniques such as superblock scheduling [5] and hyperblock scheduling [6] take this approach one step further by converting an entire subgraph of the control flow graph to predicated instructions and doing appropriate scheduling within the subgraph.

VLIW tree-like instructions group operations from multiple paths into a tree-like structure. Each of these tree-like instructions has the potential for a multiway branch with multiple branch targets. All operations and branches in a tree-like instruction are data-independent of each other and can execute in parallel as a single VLIW instruction. On the basis of the evaluation of the multiway branch, a single path within a tree instruction is selected at execution time as the taken path. Notice that no control speculation is involved here. Because the operations within a VLIW instruction must be data-independent, it is critical that the compiler be able to find enough data-independent operations to fill each VLIW instruction.

Block-structured ISAs take a group of adjacent basic blocks of the program's control flow graph and define it as an architectural atomic unit. These atomic units, called atomic blocks, are specified by the compiler. Unlike a VLIW instruction, the operations within an atomic block need not be data-independent. An atomic block can encompass multiple branches. All of these branches, except the last one in an atomic block (if it is the last instruction in the atomic block), have their outcomes (indirectly) predicted by the compiler. If the processor finds any of these (static) predictions to be incorrect, it discards the entire atomic block and continues by executing the correct atomic block. Each atomic block can have more than two successor atomic blocks, from which a successor is selected dynamically by means of a multiway prediction.

1.2 ISA-Invisible Schemes

The schemes in the second category involve no modifications at the ISA level and consist primarily of hardware techniques to predict the outcome of multiple branches in a cycle. Although this paper focuses on this category, the two approaches are not mutually exclusive. The approach considered in [11], called control flow prediction, is to consider subgraphs of the executed program's control flow graph as the basic units of prediction. The targets of a subgraph are recorded in a buffer and the hardware predicts one of the targets of the current subgraph. By predicting a target, the control flow predictor specifies the next subgraph to be speculatively executed, thereby going past one subgraph a cycle. In the type of control flow prediction considered in [11], a subgraph can encompass an arbitrary number of branches and can, therefore, have arbitrary structure, length, and number of targets. Such a control flow prediction scheme, with unrestricted subgraph structures, is more apt for execution models that pursue multiple flows of control, such as the multiscalar processor [3], [16].

For processors that follow the superscalar model of execution (single flow of control), the subgraph structure needs to be restricted so as to facilitate quick identification of a path through the subgraph along which instruction execution can proceed. This paper explores two restricted subgraph structures that are suitable for control flow prediction in superscalar processors: 1) sequential block-like subgraphs and 2) tree-like subgraphs. A sequential block-like subgraph includes instructions only from a contiguous portion of memory and offers the advantage that the instruction fetch mechanism need fetch only a contiguous block of instructions from the instruction cache every cycle.

A tree-like subgraph, although more complex than a sequential block-like subgraph, still has a well-defined structure which permits the easy selection of a dynamic path through it. In addition, it allows the superscalar fetch mechanism to fetch a large number of useful instructions in a cycle, despite the presence of taken branches in the instruction stream.

One of the prediction techniques discussed in [7], called Btc, performs subgraph-level prediction using sequential blocks. A cache block is used as the basis for making control flow predictions by recording the address to which control flowed the last time the block was executed and by predicting the same address as the future target. No other history of the block's behavior is recorded for future predictions.

The technique discussed in [19] uses the tree-like subgraph structure, but selects a path through the tree by performing multiple branch predictions in a cycle, using a single, high-accuracy, two-level branch predictor [18]. Given a subgraph address, the prediction mechanism performs these predictions before even determining the addresses of the intermediate conditional branches contained within the subgraph. Based on the prediction values, a Branch Address Cache provides the selected path's attributes, such as the intermediate fetch addresses and the node lengths, which are used by the fetch unit to fetch instructions from the instruction cache. Because the intermediate branch addresses are not known when the predictions are made, this scheme uses the condensed history of all branches to make its decisions, which results in reduced prediction accuracy. Furthermore, the need to predict multiple branches per cycle makes the prediction mechanism multiported, thereby increasing its complexity. The technique proposed in [13] attempts to overcome the multiportedness problem by using two separate predictors, which are accessed in parallel. The individual predictions given by the two predictors are used to decide the next two blocks to be speculatively executed by the processor.

This paper investigates prediction schemes that select a dynamic path through a (block-like or tree-like) subgraph by means of a single prediction. That is, instead of predicting the outcome of each conditional branch in a subgraph, a single prediction is made to select a path through the subgraph, and instructions are fetched and executed from the selected path. The advantage of this approach is that the prediction mechanism need not be much more complex than an ordinary branch predictor. Furthermore, the prediction mechanism is not restricted to storing the condensed history of all branches, and this gives better prediction accuracies.

1.3 Paper Organization

This paper is organized into six sections. The introduction has highlighted the need to go beyond predictions at the branch level and to predict the outcome of multiple branches per cycle. Section 2 details the operation of block-level prediction and Section 3 details the operation of tree-level prediction. Section 4 estimates the hardware cost of both schemes and discusses ways to reduce the hardware cost. It also compares the hardware cost to that of contemporary schemes. Section 5 presents a detailed experimental evaluation of block-level prediction and tree-level prediction using the MIPS architecture and a collection of SPEC92 and SPEC95 integer programs. It also presents detailed sensitivity studies of the two schemes. Section 6 presents a summary and the conclusions of this paper.

2 BLOCK-LEVEL PREDICTION

The first control flow prediction scheme considered in this paper uses a sequential block-like subgraph as the basic unit of prediction and is called block-level prediction. The main idea is to predict the control flow outcome of a sequential block of instructions using a single prediction, instead of separately predicting the outcome of each branch in the block.

2.1 General Operation

Fig. 1 shows the structure of a sequential block-like subgraph of depth 4 that encompasses three conditional branches. The block is addressed by the address of the first instruction in the block, A0. Each node of the sequential block is a straightline piece of code. The nodes at depths 0, 1, and 2 are terminated by control changing instructions (CCIs). The node at depth 3 is terminated by a single-target instruction (STI), such as an unconditional branch, procedure call, or non-control-changing instruction. Such a definition allows up to 3 conditional branches (or 4 nodes) to be encompassed in a block. The number of control flow outcomes of a block is one more than the number of conditional branches encompassed in the block. Associated with each possible outcome, there is a target. Multiple outcomes can have the same target. Notice that it is possible to have control flows into the middle of a block from outside the block.

Fig. 1. Generic structure of a block-like subgraph.

Fig. 2. Block diagram of a superscalar processor with block-level prediction.

Fig. 2 gives a block diagram of a superscalar processor that incorporates subgraph-level prediction. The fetch unit sends the current subgraph address (CSA) to the subgraph predictor. The subgraph predictor predicts a control flow outcome for this subgraph, retrieves the attributes (in this case, the path length and the target) for the predicted outcome from its storage, and sends them to the fetch unit. The fetch unit fetches this sequential path from the i-cache and forwards the fetched instructions to the decoder. Notice that the fetched path is a straight-line piece of code; thus, even if the predicted target falls within the block, instructions from the target are not utilized in the same cycle.

The decoder decodes the instructions fetched in the previous cycle and sends them to the hardware window structure, which serves as a platform for performing dynamic scheduling of instructions. Multiple functional units are provided to execute multiple instructions per cycle. Among these, the branch units carry out the execution of CCIs. When a CCI is executed, a check is made to see if the actual outcome is in line with the earlier subgraph prediction; if there is a discrepancy, the hardware window discards the instructions following the point of discrepancy. It also sends the misprediction information to the fetch unit, which then starts fetching from the correct target.

2.2 Size of a Block

Before proceeding to study exactly how subgraph predictions are made, let us pause briefly to examine some restrictions that need to be imposed on the size of the blocks used for making control flow predictions. Two parameters are important when discussing the size of a sequential block: 1) the number of control flow outcomes a block can have and 2) the number of instructions in a block.

The first parameter is decided by the number of branches encompassed in a block. If a block has only two outcomes, then only one branch is contained in a block and only one branch is effectively predicted per cycle. If a single branch is effectively predicted per cycle, the average number of useful instructions fetched per cycle can never exceed the average number of instructions between consecutive branches, which is only about five for typical nonnumeric programs. On the other hand, if a block has a large number of possible outcomes, then the prediction mechanism has to choose one outcome from many, potentially resulting in lower prediction accuracies. Each time a misprediction is detected, the instructions fetched after the mispredicted branch have to be discarded. Thus, although a larger block is fetched in a cycle, the poor prediction accuracy could potentially result in reduced performance compared to the case when a block has fewer outcomes. As a trade-off, the studies in this paper allow up to four outcomes per block, which allows up to three conditional branches to be encompassed within a block.

Now, consider the second parameter, namely the number of instructions in a block. Ideally, the fetch mechanism should be able to fetch all of the instructions in a block in one cycle, enabling a new block-level prediction to be carried out every cycle. For this reason, it is better to restrict the length of a block to f instructions (by trimming the last node in the block), where f is the maximum fetch size of the superscalar fetch mechanism.

2.3 Prediction Mechanism

Let us now return to the discussion on the methodology of carrying out block-level predictions. The predictions are made with the help of a two-level subgraph predictor,1 similar in spirit to the two-level branch predictor described in [18]. Fig. 3 shows the block diagram of a two-level subgraph predictor. The two main structures in this predictor are the Subgraph History Table (SHT) and the Pattern History Table (PHT). The SHT is the first-level table and the PHT is the second-level table. The SHT stores information relevant to the subgraphs that have been encountered in the recent past. Each SHT entry has three fields: Tag, Path Attributes, and Subgraph History Pattern. The Tag field is for uniquely identifying the subgraph currently mapped to that entry.2 The Path Attributes field stores the node lengths and targets of the block, and the Subgraph History Pattern field stores the last p control flow outcomes (p = 6 in the figure) of the subgraph as a 2p-bit pattern. Because there are four possible outcomes for a subgraph, 2 bits are required to store each outcome. For each possible 2p-bit pattern, a condensed history of the previous outcomes corresponding to the pattern is recorded in the PHT by means of four independent up/down counters {C0, C1, C2, C3}. The PHT has 2^(2p) entries.

When the instruction fetch unit requests a prediction for subgraph S, the appropriate SHT entry is selected and its Tag field is checked to see if S is currently mapped to it. If so, the entry's Subgraph History Pattern is used to select the appropriate PHT entry. The four count values in the PHT entry are supplied to a Path Selector, which selects the path corresponding to the maximum count value. (If there is a tie between two or more counter values for the maximum position, the selector could choose the last outcome of that pattern as the next prediction, provided the last outcome is also stored in each PHT entry.) The Path Attributes field of the selected SHT entry contains the path lengths and targets of S. The path length and target corresponding to the predicted path are selected using a 4:1 MUX and sent to the fetch unit. Compared to a two-level branch-level predictor that also stores the branch targets, the additional steps in the time-critical path are the following: 1) the time required to select one of the four counter values and 2) the time delay involved in the 4:1 MUX (as opposed to a 2:1 MUX for the branch-level predictor to select the next instruction address).

1. It is not mandatory to use a two-level prediction scheme for implementing subgraph prediction. An implementation using the two-level approach is described in this paper because it tends to give high prediction accuracies.

2. Because the number of SHT entries is smaller than the instruction address space, multiple subgraphs can map to the same SHT entry. For making accurate predictions, it is not very important to have a Tag field to identify the subgraph currently mapped to an SHT entry. However, the information in the Path Attributes field relates only to the subgraph currently mapped to the SHT entry, and the Tag field helps in identifying this. Notice that a branch-level predictor that stores branch targets (a Branch Target Buffer) also needs to have a similar Tag field.

After making a prediction for S, the predictor performs a speculative update of S's history. The selected PHT entry's counter values are updated in the following manner. The count value corresponding to the predicted outcome is incremented by 3 (or less if the counter saturates) and all other count values are decremented by 1 (if they are not already 0).3 The Subgraph History Pattern field of the selected SHT entry is updated by shifting it left by 2 bit positions and, then, entering the latest prediction in the two rightmost bit positions (i.e., the bit positions left vacant by the left shift). Notice that these updates to the SHT and PHT entries, done immediately after making a prediction, are done in a speculative manner. By doing so, up-to-date history is made available for making future predictions. (Because the accuracy of two-level prediction schemes is very high, using the speculatively updated history will result in better prediction accuracy than using obsolete history.) If the processor detects a misprediction, it discards the subgraph paths fetched after the point of misprediction. When such a situation arises, the speculative updates done to the SHT and PHT entries corresponding to the discarded subgraphs are undone.
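To make the lookup-and-update sequence above concrete, the following C sketch models the predictor as described: an SHT indexed by the block address and tagged, a 2p-bit history pattern selecting a PHT entry of four counters, max-counter path selection, and the +3/-1 speculative update followed by the history shift. It is a minimal illustrative model, not the authors' implementation; the structure and field names are assumptions, ties are broken toward the lowest-numbered outcome, and the undoing of speculative updates on a misprediction is omitted.

#include <stdint.h>

#define P          6                    /* history length p (outcomes)        */
#define SHT_SIZE   1024                 /* first-level (SHT) entries          */
#define PHT_SIZE   (1u << (2 * P))      /* 2^(2p) second-level (PHT) entries  */
#define CTR_MAX    15                   /* 4-bit saturating counters          */

typedef struct {
    uint32_t tag;                 /* identifies the block mapped to this entry */
    uint32_t target[4];           /* Path Attributes: one target per outcome   */
    uint8_t  path_len[4];         /* Path Attributes: path length per outcome  */
    uint32_t history;             /* last p outcomes, 2 bits each              */
} sht_entry_t;

typedef struct {
    uint8_t c[4];                 /* counters C0..C3, one per outcome          */
} pht_entry_t;

static sht_entry_t sht[SHT_SIZE];
static pht_entry_t pht[PHT_SIZE];

/* Predict a 1-out-of-4 outcome for the block starting at block_addr.
 * Returns the predicted outcome (0..3), or -1 on an SHT miss
 * (Prediction Valid deasserted). */
int predict_block(uint32_t block_addr, uint32_t *target, uint8_t *path_len)
{
    sht_entry_t *s = &sht[block_addr % SHT_SIZE];
    if (s->tag != block_addr)
        return -1;

    pht_entry_t *ph = &pht[s->history & (PHT_SIZE - 1)];

    /* Path Selector: choose the outcome with the maximum counter value. */
    int best = 0;
    for (int i = 1; i < 4; i++)
        if (ph->c[i] > ph->c[best])
            best = i;

    *target   = s->target[best];
    *path_len = s->path_len[best];

    /* Speculative update: +3 (saturating) for the predicted outcome,
     * -1 for the others, then shift the new outcome into the history. */
    for (int i = 0; i < 4; i++) {
        if (i == best)
            ph->c[i] = (ph->c[i] + 3 > CTR_MAX) ? CTR_MAX : (uint8_t)(ph->c[i] + 3);
        else if (ph->c[i] > 0)
            ph->c[i]--;
    }
    s->history = ((s->history << 2) | (uint32_t)best) & (PHT_SIZE - 1);
    return best;
}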

2.3.1 Handling New Blocks

When a block B is encountered by the fetch unit for the first time, or if its entry in the SHT has been replaced by another block, then the SHT will not have an entry for B. If the fetch unit requests a prediction for B in such a situation, the predictor deactivates the Prediction Valid signal. The fetch unit then requests the i-cache for a sequential block of f instructions, where f is the maximum fetch size. When the path lengths and targets of this block become known, they are sent to the SHT for storage.

2.3.2 Determination of New Path at Times of Recovery

When a control flow outcome (path) is predicted for a block, the outcomes of the conditional branches contained in that path also get predicted indirectly. When the predicted outcome of a conditional branch is later found to be incorrect, the processor initiates recovery actions. These actions include discarding from the hardware window all of the instructions that have been fetched beyond the mispredicted branch. The recovery process also involves predicting a new control flow outcome (path) for the block that contains the mispredicted branch. If the mispredicted branch was predicted to be not taken (but ends up being taken), then the new prediction value for the block is automatically determined. On the other hand, if the mispredicted branch was predicted to be taken (but ends up being not taken), then there might be a choice involved in selecting the new prediction value because of the presence of conditional branches in the fall-through path of the branch. In such a situation, the recovery mechanism conveys the misprediction information to the predictor, which then selects a new prediction value from the available choices.

Fig. 3. Block diagram of two-level subgraph predictor.

3. The decision to update the counters in this manner was based on experimental results. When the count value corresponding to the predicted outcome is incremented by 1 and the others are decremented by 1, most of the count values tend to be zero or close to zero most of the time because, in every update, the overall decrement amount is more than the increment amount.

3 TREE-LEVEL PREDICTION

A potential limitation of block-level prediction is that the effective fetch size (i.e., the number of useful instructions entered into the hardware window from a block) can be somewhat lower than the block size because of branches that are predicted to be taken.4 The second scheme that we describe, called tree-level prediction, attempts to overcome this problem by considering a tree-like subgraph as the unit of prediction. A tree encompasses the fall-through paths as well as the taken paths of conditional branches.

3.1 General Operation

Fig. 4 shows the generic structure of a tree-like subgraph of depth 3. Each node of the tree is a straightline piece of code. The nodes at depth 0 and depth 1 can be terminated by any control-changing instruction (CCI), such as conditional branches, unconditional branches, and procedure calls. The nodes at depth 2 can be terminated only by single-target instructions (STI), such as unconditional branches, procedure calls, and non-control-changing instructions. Such a definition allows up to three conditional branches, seven nodes, and four tree-paths to be encompassed in a tree.5 The tree nodes start at addresses A0...A6 and have lengths L0...L6. The tree has four targets {T0, T1, T2, T3}, one corresponding to each path. The four tree-paths are given the binary encoding {00, 01, 10, 11}. Each path has three attributes: {node addresses, node lengths, target}. For instance, the attributes of path 0 are {{A0, A1, A2}, {L0, L1, L2}, T0}. The first two attributes of a tree-path aid in fetching the path from the i-cache and the last attribute gives the address of the next tree. The methodology of integrating the tree-level subgraph predictor into the CPU is the same as that of integrating the block-level predictor.
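As a concrete picture of what the predictor must record per tree, the following C sketch lists the per-tree path attributes implied by Fig. 4; the struct and field names are illustrative assumptions, not the paper's notation.

#include <stdint.h>

typedef struct {
    uint32_t node_addr[7];   /* A0..A6: start address of each tree node         */
    uint8_t  node_len[7];    /* L0..L6: number of instructions in each node     */
    uint32_t target[4];      /* T0..T3: address of the next tree, one per path  */
} tree_path_attributes_t;

/* Tree-path 0 (encoding 00) is fetched from nodes {A0, A1, A2} with lengths
 * {L0, L1, L2} and continues at target T0; tree-path 1 (encoding 01) shares
 * nodes A0 and A1 but ends with node A3 and continues at target T1. */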

3.2 Size of a Tree

As with block-like subgraphs, some restrictions need to be imposed on the size of trees, again on the same two parameters, namely 1) the number of control flow outcomes (or paths) a tree can have and 2) the number of instructions in each path of the tree.

The first parameter is decided by the number of branches encompassed in a tree path. If a tree has only two outcomes, then only one branch is effectively predicted per cycle and this severely limits the average number of instructions fetched per cycle. On the other hand, if a tree has a large number of possible outcomes, then the prediction mechanism has to choose one outcome from many possibilities, potentially resulting in lower prediction accuracies. Again, as a trade-off, this paper allows up to four outcomes per tree so as to encompass up to two conditional branches in any single path through a tree; eight-outcome trees are investigated in [1], and unbalanced trees are investigated in [17]. The tree is extended beyond the second conditional branch by one more level to further increase the tree size without increasing the number of paths in the tree. Although the nodes at this last level cannot have any conditional branches, they can have up to one unconditional control-changing instruction each.

Now, consider the second parameter, namely the number of instructions in each tree-path. Ideally, the fetch mechanism should be able to fetch all instructions from a selected path in one cycle. This will allow one subgraph prediction to be carried out every cycle. Thus, the length of any path through a subgraph is limited to f (by trimming the last node in that path), where f is the maximum fetch size of the superscalar fetch mechanism.

3.3 Prediction Mechanism

The structure of the subgraph predictor for tree-like subgraphs is the same as that for block-like subgraphs, except that the Path Attributes field of the SHT stores the intermediate node addresses and node lengths as well. The working of the predictor is also similar to that of block-like subgraphs.

3.3.1 Handling New Trees

When a tree T is encountered by the fetch unit for the first time, or if its entry in the SHT has been replaced by another tree, then the SHT will not have an entry for T. If the fetch unit requests a prediction for T in such a situation, the predictor deactivates the Prediction Valid signal. The fetch unit then requests the i-cache for a sequential block of f instructions, where f is the maximum fetch size. When these instructions are decoded, the decoder sends the correct attributes of tree-path 0 to the SHT for storage. The attributes of the remaining paths of T are collected by the decoder when control flows through them for the first time. Thus, the trees are built up dynamically during program execution.

3.3.2 Determination of New Path at Times of Recovery

When a path prediction is made for a tree, the outcomes of the conditional branches contained in the selected tree-path also get predicted indirectly. When the predicted outcome of a conditional branch is later found to be incorrect, the processor initiates recovery actions as before by discarding the instructions fetched after the misprediction point. Concomitant to this is the undoing of the speculative updates done to the SHT and PHT entries corresponding to the affected subgraphs. The prediction mechanism then predicts a new path for the tree that contains the mispredicted branch. If the mispredicted branch is the second branch in its tree-path, then there is no choice involved in determining the new prediction value (and, hence, no need to access the PHT); the new prediction value is obtained by complementing the second bit of the previous 2-bit prediction value. On the other hand, if the mispredicted branch is the first branch in its tree-path, then there is a choice involved in selecting the new prediction value and the appropriate PHT entry is consulted to select a new prediction value. If the branch is now taken, then the predictor selects the tree-path corresponding to the maximum count value among {C2, C3}, and if the branch is now not taken, then the predictor selects the tree-path corresponding to the maximum count value among {C0, C1}.
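A small C sketch of this recovery-time re-selection, under the same assumptions as the earlier predictor sketch (ties broken toward the lower-numbered path; names are illustrative):

#include <stdint.h>

/* old_pred : previous 2-bit path prediction for the tree
 * depth    : 0 if the mispredicted branch was the first branch in its
 *            tree-path, 1 if it was the second
 * taken    : the branch's actual direction (1 = taken)
 * c[]      : counters C0..C3 of the PHT entry for the affected tree */
unsigned reselect_tree_path(unsigned old_pred, int depth, int taken,
                            const uint8_t c[4])
{
    if (depth == 1)                       /* second branch: no choice,    */
        return old_pred ^ 1u;             /* complement its outcome bit   */

    /* First branch mispredicted: consult the two counters of the paths
     * that agree with the now-known branch direction. */
    if (taken)
        return (c[2] >= c[3]) ? 2u : 3u;  /* maximum of {C2, C3} */
    return (c[0] >= c[1]) ? 0u : 1u;      /* maximum of {C0, C1} */
}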

4. Part of this problem can be overcome by letting the compiler modify the branch opcodes or perform instruction scheduling in such a manner that the conditional branches at the beginning of frequently executed blocks most likely take the fall-through path (as described in [4] for the IBM RS/6000). This will cause most of the instructions of those blocks to be executed, thereby resulting in larger effective fetch sizes. Techniques for doing this are beyond the scope of this paper.

5. It is important to point out that if a node at depth 0 or 1 encounters an instruction that has more than two possible targets, such as a subroutine return or an indirect branch, then that path of the tree gets pruned at the instruction following the multitarget instruction. The target of such a pruned path is given a special value so that the instruction fetch mechanism can use a more suitable predictor (such as a return address predictor or an indirect branch target predictor) to predict the actual value of the target.

3.4 Fetching Tree-Paths

Tree-like subgraphs differ from block-like subgraphs in that, except for path 0, the instructions contained in the remaining tree-paths are not from contiguous portions of memory. Thus, if the predictor predicts a tree-path other than path 0, then the nodes within that path may reside in different i-cache blocks. One way of fetching nodes from multiple i-cache blocks is to make the i-cache multiported, possibly by interleaving. A better alternative, however, is to use the trace cache approach described in [12]. The trace cache is a special instruction cache for capturing dynamic instruction sequences. Each line in the trace cache of [12] stores a dynamic code sequence, which may contain one or more taken branches. For application in a processor that performs tree-level prediction, each line in the trace cache can store the instructions of a tree-path in a contiguous manner. Most of the time, an instruction fetch request hits in the trace cache and the trace cache supplies an entire tree-path in a cycle. If the fetch request misses in the trace cache, then the i-cache or the next level of the memory hierarchy supplies the instructions of the tree-path, which are then stored in the trace cache for future accesses. Notice that, when a trace cache is used, the Path Attributes field of the SHT need not store the internal node addresses of the tree-paths.

3.5 Comparison of Block-Level Prediction and Tree-Level Prediction

Both block-level prediction and tree-level prediction consider subgraphs of the executed program's control flow graph as the basic units of control flow prediction. Whereas block-level prediction uses sequential block-like subgraphs, tree-level prediction uses tree-like subgraphs. By considering tree-like subgraphs, the latter allows the fetch unit to fetch multiple instructions in a cycle despite the presence of taken branches in the instruction stream. Both allow up to four control flow outcomes per subgraph. Block-level prediction allows up to three conditional branches in a (straightline) path, whereas tree-level prediction allows up to two in a (possibly nonstraightline) path. By restricting the parallelly fetched instructions to be from a contiguous portion of memory, block-level prediction allows for a simpler fetch unit and i-cache structure. With the emergence of trace caches that store a noncontiguous dynamic instruction sequence in a single cache block, tree-level prediction can be expected to become popular for the wide-issue superscalar processors of the future.

4 HARDWARE COST FOR TWO-LEVEL SUBGRAPH PREDICTORS

This section calculates the hardware cost for implementing the two-level subgraph predictor for block-level prediction and tree-level prediction and compares the cost against the costs of contemporary prediction schemes.

4.1 Hardware Cost of Block-Level Predictor

The two hardware-intensive components of the block-level predictor are the SHT and the PHT, and their costs are calculated below.

The Tag field of an SHT entry has as many bits as are needed to uniquely identify a block address, which is at most w, where w is the number of bits required to specify an address. The Path Attributes field of an SHT entry stores up to four node lengths (each requiring 4 bits) and up to four target addresses (each requiring w bits). Lastly, the Subgraph History Pattern field of an SHT entry requires 2p bits to store the last p outcomes of the block. Thus, the maximum number of bits required to implement an n-entry SHT is n × (5w + 2p + 16).

Next, consider the number of bits required for the PHT. Because each Subgraph History Pattern has 2p bits, the PHT has 2^(2p) entries. Each PHT entry stores 4 count values. The total number of bits for implementing the PHT is, therefore, 4b × 2^(2p), where b is the number of bits in a PHT counter. In addition, the block-level predictor requires an n-input decoder, a 2^(2p)-input decoder, and a 4-input multiplexer.

4.2 Hardware Cost of Tree-Level Predictor

For the tree-level predictor, the hardware cost of the PHT remains the same. The hardware cost of the SHT increases because the Path Attributes field has to store more information. The Path Attributes field, in this case, stores four target addresses (each requiring w bits), six intermediate fetch addresses (each requiring w bits), and seven subgraph node lengths (each requiring 4 bits, allowing each node to have a length of up to 15). Then, the maximum number of bits required to implement an n-entry SHT is n × (11w + 28 + 2p). In addition, the tree-level predictor requires an n-input decoder, a 2^(2p)-input decoder, and a 4-input multiplexer.
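As a rough illustration, under the default simulation parameters used later in the paper (n = 1024 entries, p = 6, 4-bit PHT counters) and an assumed address width of w = 32 bits, these expressions work out to approximately:

\[
\begin{aligned}
\text{Block-level SHT:}\quad & n(5w + 2p + 16) = 1024 \times 188~\text{bits} \approx 23.5~\text{Kbytes},\\
\text{Tree-level SHT:}\quad  & n(11w + 28 + 2p) = 1024 \times 392~\text{bits} \approx 49~\text{Kbytes},\\
\text{PHT (either scheme):}\quad & 4b \cdot 2^{2p} = 4 \times 4 \times 2^{12}~\text{bits} = 8~\text{Kbytes}.
\end{aligned}
\]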

4.2.1 Reducing the Hardware Cost

The cost analysis given above shows that two-level subgraph prediction, especially the one used for tree-level prediction, has a high hardware cost. Fortunately, there are ways to reduce the hardware cost.

Fig. 4. Generic structure of a tree-like subgraph.

Reducing the Hardware Cost of the SHT: Eliminate Redundant Information. The reason for the significant hardware cost of the SHT is the storage required for the Path Attributes field, which contains redundant information. Referring to Fig. 4, for tree-path 0, it is sufficient to store the subgraph address A0, the combined length L0 + L1 + L2, and the target T0. Similarly, for tree-path 1, it is sufficient to store, in addition, the intermediate address A3, the lengths L0 + L1 and L3, and the target T1. By storing information in this manner, the SHT cost can be reduced to n × (8w + 28 + 2p) bits. Notice that most processors have restrictions on the instruction address space, effectively reducing the value of w by 25 percent or more, resulting in an additional 25 percent savings in the SHT hardware cost. Notice also that n can be reduced with negligible impact on the SHT hit rate by using set-associative mapping for the SHT.
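Under the same illustrative parameters as before (n = 1024, p = 6, assumed w = 32), this brings the tree-level SHT down from roughly 49 Kbytes to about 37 Kbytes:

\[
n(8w + 28 + 2p) = 1024 \times 296~\text{bits} \approx 37~\text{Kbytes}.
\]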

Reducing the Hardware Cost of the PHT: Split Pattern Predictor. The reason for the significant hardware cost of the PHT is the factor 2^(2p) in its cost formula. Notice that, of the two bits used to store a previous outcome in the SHT, the first bit encodes the outcome of the first branch in the tree and the second bit encodes the outcome of the second branch in the taken path in the tree. By treating these two branch outcomes independently, the hardware cost of the PHT can be significantly reduced, as described below.

When a prediction is requested for tree T, the SHT supplies a 2p-bit history pattern of previous outcomes of T. Split this 2p-bit pattern into two p-bit patterns, as shown in Fig. 5. Use one of the patterns to consult a PHT to predict the next outcome of the first branch in T and use the other pattern to consult a different PHT to predict the next outcome of the second branch (in the predicted path) in T. Combining the two single-bit prediction values gives the next 2-bit prediction for T. With this scheme, the number of bits required for the PHTs is reduced dramatically to 2b × 2^p.
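A minimal C sketch of this split-pattern lookup follows; the names are illustrative, and the exact bit interleaving of the split is defined by Fig. 5 (not reproduced in this text), so the sketch simply assumes that the high bit of each stored outcome describes the first branch and the low bit the second.

#include <stdint.h>

#define P               6
#define SPLIT_PHT_SIZE  (1u << P)           /* 2^p entries per PHT            */

/* One 3-bit saturating counter per entry; a value of 4 or more is taken
 * to mean "predict taken". */
static uint8_t pht_first[SPLIT_PHT_SIZE];   /* first branch in the tree       */
static uint8_t pht_second[SPLIT_PHT_SIZE];  /* second branch on chosen path   */

/* 'history' is the 2p-bit Subgraph History Pattern supplied by the SHT. */
unsigned split_pattern_predict(uint32_t history)
{
    uint32_t pat_first = 0, pat_second = 0;

    /* Split the 2p-bit pattern into two p-bit patterns. */
    for (int i = 0; i < P; i++) {
        pat_first  |= ((history >> (2 * i + 1)) & 1u) << i;
        pat_second |= ((history >> (2 * i)) & 1u) << i;
    }

    unsigned bit_first  = pht_first[pat_first]   >= 4;  /* 1-bit prediction */
    unsigned bit_second = pht_second[pat_second] >= 4;  /* 1-bit prediction */

    return (bit_first << 1) | bit_second;   /* combined 2-bit path prediction */
}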

4.2.2 Comparison to Other Schemes

Let us compare the hardware cost of the split-pattern two-level subgraph predictor to that of 1) the two-level branch predictor (PAg scheme) [18] and 2) the two-level multiple branch predictor (GAg scheme) [19].

Two-Level Branch Predictor. The two-level branch predictor consists of a branch history table (BHT) and a PHT. The BHT contains n entries, each having a Tag field (requiring at most w bits), a w-bit Target field to avoid pipeline bubbles due to branches that are predicted to be taken, and a p-bit Branch History Pattern field. The number of bits required for the BHT is n × (2w + p). This is approximately 25 percent of that required for the SHT in tree-level prediction.

The number of bits required for the PHT is b × 2^p bits, which is half of that required for the PHTs in the split-pattern subgraph predictor. Overall, the hardware cost of the split-pattern subgraph predictor is about four times that of the two-level branch predictor, if both use the same number of entries in the first-level table. It is important to note, however, that the number of entries in the SHT can be less than that in the BHT because the number of trees in the dynamic instruction stream is potentially smaller than the number of branches in the dynamic instruction stream.

Multiple Branch Predictor. The multiple branch predictor uses a two-level branch predictor to carry out multiple branch predictions per cycle. It uses an additional Branch Address Cache (BAC) for storing the intermediate fetch addresses, basic block lengths, and targets of the subgraphs. The number of bits required for an n-entry BAC is n × (11w + 28), which can be reduced to n × (8w + 28) using the technique given in Section 4.2.1. This hardware cost is very close to that of the SHT used in tree-level prediction. Because of the need to carry out predictions without knowing the intermediate addresses, for practical purposes, the multiple branch predictor is restricted to using a p-bit global history register in place of the BHT [19]. A major implication of using a global history register is that a much larger pattern size (typically 14 or more) has to be used to get reasonable prediction accuracies, as depicted in Table 2 and in [18], [19]. If p = 14, then the number of bits required for the PHT works out to be 6 Kbytes (for b = 3), which is significantly higher than that for the PHTs in the split-pattern subgraph predictor. Thus, the hardware cost of the split-pattern subgraph predictor is lower than that of the two-level multiple branch predictor.
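To make the contrast explicit (using the default p = 6 and 3-bit counters for the split-pattern PHTs):

\[
b \cdot 2^{14} = 3 \times 16{,}384~\text{bits} = 6~\text{Kbytes}
\qquad\text{versus}\qquad
2b \cdot 2^{6} = 384~\text{bits}.
\]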

5 EXPERIMENTAL EVALUATION

The previous sections described block-level prediction and tree-level prediction, their implementation, and their hardware costs. This section presents the results of an empirical study of their performance. For comparison purposes, the section also presents performance results for the two-level branch predictor [18] (augmented with a branch target buffer to avoid the one-cycle penalty due to branches that are predicted to be taken) and the two-level multiple branch predictor [19].

5.1 Experimental Setup

5.1.1 Simulation Tool and Benchmarks

The performance studies are conducted using the MIPS-I instruction set architecture (ISA), a representative of the class of streamlined ISAs that have emerged recently. All data are gathered with a simulator that accepts programs compiled for a MIPS-based DECstation and simulates their execution, keeping track of relevant information on a cycle-by-cycle basis. The simulator models speculative execution and is not trace-driven. System calls made by the simulated benchmark programs are handled with the help of traps to the operating system. The collected results, therefore, exclude the code executed during system calls, but include all other code portions, including the library routines.

Fig. 5. Splitting a 2p-bit pattern into two.

For benchmarks, five of the SPEC92 integer programs and three of the SPEC95 integer programs are used. The benchmark programs are compiled using the MIPS C compiler with the optimization flags distributed in the SPEC benchmark makefiles. The benchmarks are simulated to completion or up to 100 million instructions, whichever occurred first.

5.1.2 Performance Metrics

For measuring performance, execution time is the sole metric that can accurately measure the performance of an integrated software-hardware system. However, this metric requires many implementation assumptions, including the exact hardware configuration, functional unit latencies, etc.

Other metrics, such as instruction issue rate and number of instructions in the hardware window, also require implementation assumptions that limit the utility of the results. Further, these metrics may not give much insight into the performance of the fetch mechanism, which is the focus of this study. Therefore, for this study, we use the following indirect metrics, which provide good insight into the fetch mechanism's performance: 1) effective fetch size (EFS), 2) subgraph prediction accuracy (SPA), and 3) branch prediction accuracy (BPA).

TABLE 1
Effective Fetch Size for f = 12

TABLE 2
Subgraph Prediction Accuracy (SPA) and Branch Prediction Accuracy (BPA) for f = 12

The first metric indicates the average number of useful instructions fetched in each fetch attempt of the fetch unit. In the case of block-level prediction, instructions that are fetched, but discarded because of a conditional branch being predicted to be taken, are not counted in this metric. Furthermore, useless instructions that are discarded due to incorrect speculative execution are not counted while calculating this metric. Notice also that, if a block or tree had to be fetched in two (or three) attempts because of mispredictions, then it is counted as two (or three) fetch attempts. Thus, EFS takes into account the effect of the inaccuracies in subgraph prediction. This is an important metric because increasing its value is the primary goal of subgraph prediction. Fetching more instructions per cycle gives the superscalar hardware window more opportunities to look for and exploit instruction-level parallelism.
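In other words, under the accounting conventions just described, EFS can be summarized as

\[
\text{EFS} = \frac{\text{number of useful instructions entered into the hardware window}}{\text{number of fetch attempts}},
\]

with discarded instructions excluded from the numerator and each re-fetch of a partially mispredicted block or tree counted as a separate attempt in the denominator.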

The second metric (SPA) indicates the fraction of times a subgraph prediction gives the correct path through a subgraph. Although EFS captures the effect of subgraph prediction inaccuracies, SPA throws more light on why the performance of the predictor is better or worse. We are also interested in knowing if making a four-way subgraph prediction instead of a two-way branch prediction results in a poor prediction accuracy. The third metric (BPA) indicates the fraction of times a branch (indirectly) gets correctly predicted when subgraph prediction is performed. BPA is highly dependent on SPA, but is helpful in comparing with the BPAs obtained with other prediction techniques.

5.1.3 Caveat

It is important to make note of a caveat in this evaluation (which is true for any evaluation of a similar nature). The performance of any computer system is highly dependent on the compiler and the type of static scheduling done. All of these studies are carried out with code compiled for a single-issue processor. The major implication of this is that no subgraph prediction-specific optimizations were performed on the code. In that sense, the performance results presented in this paper for block-level prediction should be viewed as pessimistic. Further doctoring of the code, especially to make the branches take the fall-through path more often than the taken path, as in [4], will certainly improve the performance of block-level prediction.

5.1.4 Default Parameters for the Study

We have attempted to do a reasonable study by varying the important parameters independently. When a parameter is varied, the rest of the parameters are kept fixed at the default values given below:

- Maximum Fetch Size (f): The default value is 12 instructions.

- Branch Predictor: This scheme is simulated with a PAg two-level branch predictor [18], with a pattern size (p) of 6. The BHT has 1,024 entries and is direct mapped. The PHT entries consist of 3-bit up/down saturating counters. The predictions are done based on the start address of each fetch block. The fetch unit fetches a sequential block of f instructions in each attempt; if the first conditional branch within this block is predicted to fall through, the subsequent instructions up to the next conditional branch in the block are also entered into the hardware window. Therefore, strictly speaking, this scheme performs block-level prediction, with the number of control flow outcomes per block restricted to 2.

- Multiple Branch Predictor: This scheme is simulated with a GAg two-level branch predictor. The Branch Address Cache has 1,024 entries and is direct mapped. The PHT entries consist of 3-bit up/down saturating counters. Two pattern sizes (p = 7 and p = 14) are used. Notice that the scheme with pattern size 7 has roughly the same hardware cost as the default tree-level predictor.

- Subgraph Predictor: The block-level prediction scheme and the tree-level prediction scheme are simulated with PAg two-level subgraph predictors. The SHT is also direct mapped and the default number of SHT entries is 1,024. The default pattern size is 6. Each PHT entry has four 4-bit up/down saturating counters. In the split-pattern scheme, each PHT entry has a single 3-bit saturating counter.

Fig. 6. Effect of f on EFS for block-level prediction scheme.

Fig. 7. Effect of p on EFS for block-level prediction scheme.

5.2 Experimental Results

5.2.1 Effective Fetch Size

Fig. 8. Effect of n on SHT hit rate for block-level prediction scheme.

Fig. 9. Effect of f on EFS for split-pattern tree-level prediction scheme.

Table 1 shows the EFS obtained with the default branch prediction scheme, the multiple branch prediction scheme [19], the default block-level prediction scheme, and the default split-pattern subgraph prediction scheme. The first column of numbers presents the EFS obtained with the branch prediction scheme. The next two columns present the EFS obtained with multiple branch predictions per cycle for p = 7 and p = 14, respectively. The fourth column of numbers presents the EFS obtained with the block-level predictor, and the subsequent column presents the percentage improvement in EFS for the block-level predictor over the branch-level predictor. The last two columns present the EFS obtained with the split-pattern tree-level predictor and the percentage improvement in EFS for the tree-level predictor over the branch-level predictor.

Let us consider the results of this table in some detail. The percentage improvement in EFS for block-level prediction varies from 0 percent (for ijpeg) to 40.7 percent (for eqntott). This improvement arises because branch-level prediction is fundamentally limited by the basic block size and, therefore, does not show marked improvement when f is larger than the average basic block size.

Next, consider the results for the tree-level predictor. It can be seen that the EFS is significantly higher for the tree-level predictor compared to that for the branch-level predictor (varying from 41.8 percent for gcc to 130.8 percent for eqntott). Furthermore, the EFS for the multiple branch predictor with p = 7 (which has roughly the same hardware cost as the default split-pattern subgraph predictor) is low. This is because the multiple branch predictor uses a global history register, which results in poor prediction accuracy. When p is increased to 14, its performance improves and it performs only slightly worse than the tree-level predictor (with p = 6).

5.2.2 Prediction Accuracy

Next, let us consider the prediction accuracies obtained. Table 2 presents the prediction accuracies obtained for f = 12. The first column of numbers gives the BPA obtained with the branch-level predictor. The next group of four columns gives the SPA and BPA results for the multiple branch predictor for pattern sizes (p) of 7 and 14. The first two columns in this group give the SPA and BPA for p = 7, and the next two columns give the SPA and BPA for p = 14. The sixth and seventh columns of numbers give the SPA and BPA obtained with the default block-level predictor. The last group of four columns gives the SPA and BPA results obtained with the two tree-level predictors. The first two columns in this group give the SPA and BPA obtained with the non-split-pattern predictor, and the last two columns give the results for the split-pattern predictor.

First, let us consider the results for the multiple branch predictor. The prediction accuracy results obtained with p = 7 are somewhat low. For the multiple branch predictor to perform reasonably well, the pattern size has to be much bigger (14 or more), resulting in high hardware overheads. On the other hand, the BPA results for the block-level predictor and tree-level predictors are very close to the BPA results obtained for the branch-level predictor. This result is very encouraging because it shows that predicting 1-out-of-4 outcomes has provided more or less the same accuracy as predicting 1-out-of-2 outcomes. An important observation to make is that the prediction accuracies with the split-pattern scheme are close to those without the split-pattern scheme. This is also very encouraging because the split-pattern scheme has low hardware overhead.

5.2.3 Sensitivity Studies of Block-Level Prediction

Next, let us vary some of the parameters that were fixed so far in order to study their effect on the performance of block-level predictors. The parameters that are varied for these sensitivity studies are: 1) the maximum fetch size (f), 2) the pattern size (p), and 3) the number of entries in the SHT (n). Each of these parameters is varied while keeping the rest of the parameters at their default values specified in Section 5.1.4.

Fig. 6 plots the EFS obtained for three different values of f, namely {8, 12, 16}. It can be seen that, as f is increased, the EFS also increases. However, the increase is sublinear and tends to taper off because taken branches cause control to flow out of the middle of sequential blocks.

Fig. 10. Effect of p on EFS for split-pattern tree-level prediction schemes.

Fig. 7 plots the EFS obtained for the benchmarks for p = {4, 6, 8}. There is an increase in EFS as p is increased from 4 to 6, but no significant change as p is increased further. These results suggest that a pattern size of 6 is quite adequate for block-level predictors.

Fig. 8 plots the SHT hit rates obtained for the benchmarks for three different values of n. Except for gcc and go, all of the other benchmarks stand to benefit little from a value of n greater than 1,024. For programs that have poor SHT hit rates, such as gcc and go, better hit rates can be achieved by using set-associative mapping to select the SHT entries.

5.2.4 Sensitivity Studies of Tree-Level Prediction

This section presents the results of sensitivity studies of tree-level prediction. As in the case of block-level prediction, the parameters that are varied are: 1) the maximum fetch size (f), 2) the pattern size (p), and 3) the number of entries in the SHT (n). Again, when each of these parameters is varied, the rest of the parameters are maintained at their default values specified in Section 5.1.4. The sensitivity studies are conducted on the low hardware cost, split-pattern tree-level predictor.

Fig. 9 plots the EFS obtained for three different values of f, namely 8, 12, and 16. As with block-level prediction, the effective fetch size begins to taper off as f is increased to 16. To get a substantial increase for higher values of f, the tree structure needs to be extended to include additional nodes in each tree-path.

Fig. 10 plots the EFS obtained for p = 4, 6, and 8. The figure shows that, as with the block-level predictor, a pattern size of 6 is quite adequate for the split-pattern tree-level predictor.

Last, Fig. 11 plots the SHT hit rates obtained with different numbers of entries in the SHT. It can be seen that many of the benchmarks have good hit rates even with a 256-entry SHT. Again, gcc and go have low hit rates even with 1,024-entry SHTs. To improve the hit rates for these benchmarks, set associativity needs to be employed.

6 SUMMARY AND CONCLUSIONS

This paper investigated control flow prediction techniques for wide-issue superscalar processors. The essence of control flow prediction is to do predictions at a level higher than that of branches (in particular, at the level of subgraphs of the executed program's control flow graph) so as to predict the outcome of multiple branches quickly. That is, the predictor determines where control is most likely to go after the execution of the subgraph. This effectively predicts the outcome of multiple branches per cycle, thereby enabling the fetch unit and the decode unit to bring more instructions per cycle into the superscalar execution unit.

Two control flow prediction schemes were investigated. The first scheme considered sequential block-like subgraphs and the second scheme considered tree-like subgraphs. The paper described the hardware cost of carrying out subgraph-level prediction and also described ways to reduce the hardware cost. The cost was found to be less than that of alternative schemes for predicting the outcome of multiple branches per cycle.

The paper also presented results from a thorough simulation study conducted to verify the potential of this prediction strategy. The results of these experiments with the MIPS architecture show block-level prediction and tree-level prediction to increase the effective fetch size by about 25 percent and 50 percent, respectively, over that of branch-level prediction. The effective branch prediction accuracies for the subgraph-level prediction schemes were close to those obtained for branch-level prediction.

Fig. 11. Effect of n on SHT hit rate for split-pattern tree-level prediction scheme.

Subgraph-level prediction has several advantages. First, it predicts the outcome of multiple branches per cycle. This is essential for improving the performance of wide-issue superscalar processors on nonnumerical programs. Second, only a single prediction is made per cycle. Thus, the prediction mechanism need not be much more complex than an ordinary branch prediction mechanism. Last, tree-level prediction allows the instruction fetch mechanism to fetch a large number of instructions per cycle despite taken branches in the instruction stream.

ACKNOWLEDGMENTS

This work was financially supported by U.S. National Science Foundation Research Initiation Award CCR-9410706 and by the Department of Electrical and Computer Engineering of Clemson University. We thank the reviewers for their many helpful comments.

REFERENCES

[1] B. Cyril and M. Franklin, "A Study of Tree-Based Control Flow Prediction Schemes," Proc. Int'l Conf. High Performance Computing, pp. 28-33, 1997.

[2] K. Ebcioglu, "Some Design Ideas for a VLIW Architecture for Sequential Natured Software," Parallel Processing (Proc. IFIP WG 10.3 Working Conf. Parallel Processing, Pisa, Italy), pp. 3-21, 1988.

[3] M. Franklin, "The Multiscalar Architecture," PhD thesis, Technical Report TR 1196, Computer Sciences Dept., Univ. of Wisconsin-Madison, 1993.

[4] M.C. Golumbic and V. Rainish, "Instruction Scheduling Beyond Basic Blocks," IBM J. Research and Development, vol. 34, no. 1, pp. 93-97, Jan. 1990.

[5] W.W. Hwu et al., "The Superblock: An Effective Technique for VLIW and Superscalar Compilation," J. Supercomputing, Jan. 1993.

[6] W.W. Hwu, R.E. Hank, D.M. Gallagher, S.A. Mahlke, D.M. Lavery, G.E. Haab, J.C. Gyllenhaal, and D.I. August, "Compiler Technology for Future Microprocessors," Proc. IEEE, vol. 83, no. 12, Dec. 1995.

[7] S.A. Mahlke et al., "Characterizing the Impact of Predicated Execution on Branch Prediction," Proc. 27th Int'l Symp. Microarchitecture (MICRO-27), pp. 217-227, 1994.

[8] S. Melvin and Y.N. Patt, "Enhancing Instruction Scheduling with a Block-Structured ISA," Int'l J. Parallel Programming, vol. 23, no. 3, 1995.

[9] R. Nair, "Dynamic Path-Based Branch Correlation," Proc. 28th Int'l Symp. Microarchitecture (MICRO-28), 1995.

[10] S-T. Pan, K. So, and J.T. Rahmeh, "Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation," Proc. Fifth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), pp. 76-84, 1992.

[11] D.N. Pnevmatikatos, M. Franklin, and G.S. Sohi, "Control Flow Prediction for Dynamic ILP Processors," Proc. 26th Int'l Symp. Microarchitecture (MICRO-26), pp. 153-163, 1993.

[12] E. Rotenberg, S. Bennett, and J.E. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," Proc. 29th Int'l Symp. Microarchitecture (MICRO-29), 1996.

[13] A. Seznec, S. Jourdan, P. Sainrat, and P. Michaud, "Multiple-Block Ahead Branch Predictors," Proc. Seventh Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), 1996.

[14] J.E. Smith, "A Study of Branch Prediction Strategies," Proc. Eighth Ann. Int'l Symp. Computer Architecture, pp. 135-148, 1981.

[15] J.E. Smith and G.S. Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, vol. 83, no. 12, pp. 1,609-1,624, Dec. 1995.

[16] G.S. Sohi, S.E. Breach, and T.N. Vijaykumar, "Multiscalar Processors," Proc. 22nd Ann. Int'l Symp. Computer Architecture, 1995.

[17] B.R. Toone and M. Franklin, "Control Flow Prediction with Unbalanced Tree-like Subgraphs," Proc. Int'l Conf. High Performance Computing, pp. 221-227, 1998.

[18] T.-Y. Yeh and Y.N. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction," Proc. 19th Ann. Int'l Symp. Computer Architecture, pp. 124-134, 1992.

[19] T.-Y. Yeh, D.T. Marr, and Y.N. Patt, "Increasing Instruction Fetch Rate via Multiple Branch Predictions and a Branch Address Cache," Proc. Seventh ACM Int'l Conf. Supercomputing, pp. 67-76, July 1993.

Simonjit Dutta received the BTech degree in computer engineering from Regional Engineering College, Suratkal, in 1993, and the MS degree in computer engineering from Clemson University, Clemson, South Carolina, in 1995. His MS thesis work involved developing control flow prediction techniques for wide-issue superscalar processors. Prior to joining Intel, he was with Texas Instruments Inc., Dallas, from 1995 to 1996. He is a computer architect at Intel Inc., Santa Clara, California, where he investigates power and performance trade-offs in the mobile platform. His research interests are in computer architecture, instruction-level parallel processing, branch prediction, and system/workload profiling for low power design.

Manoj Franklin received the BSc degree in electronics and communications engineering from the University of Kerala, India, in 1984, and the MS and PhD degrees in computer sciences from the University of Wisconsin-Madison, in 1990 and 1993, respectively. He is an assistant professor in the Electrical Engineering Department of the University of Maryland, College Park, where he holds a joint appointment with the University of Maryland Institute for Advanced Computer Studies (UMIACS). He worked during the summer of 1990 at Cray Research Inc., Chippewa Falls, Wisconsin, and the summer of 1991 at the IBM T.J. Watson Research Center, New York. Prior to joining the University of Maryland, he was an assistant professor at Clemson University from 1993-1998. He received the IBM Graduate Fellowship in the years 1990-1993, the U.S. National Science Foundation (NSF) Research Initiation Award in 1994, and the NSF CAREER award in 1997. His primary research interests are in computer architecture, instruction-level parallel processing, processor design, ILP compilation, and fault-tolerant computing.
