
Priority-Driven Active Data Prefetching


Ming Zhu, Harsha Narravula, Constantine Katsinis, Diana Hecht


Dept. of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104

Abstract—Data cache misses reduce the performance of wide-issue processors by stalling the data supply to the processor. This is even worse in the DSM environment. Prefetching data for the critical data address misses is one way to tolerate cache miss latencies. But current applications with irregular access patterns make it difficult to prefetch data sufficiently early to mask large cache miss latencies, especially in multithreaded applications. To complement prefetching in a multithreaded environment, this paper proposes an approach to prefetch data addresses by a priority-driven method. The method introduced in this paper is a novel approach for dynamically identifying and precomputing the data addresses of the instructions marked as being on a higher-priority critical path of an application. The critical path can be identified at compile time or run time. A separate engine calculates the data addresses of the identified instructions in the critical path and prefetches, early enough, the data that will be used by the next critical instruction.

Preliminary results show that priority-driven prefetching is useful: it reduces the completion time of an application significantly. In our simulator, the approach improved the overall performance over traditional prefetching in three experiments conducted with Active Prefetching, especially in the Matrix-Matrix multiplication.

Index Terms—Parallel Computing, Priority-based Data Cache, Data Prefetching, Critical Path.

I. INTRODUCTION

Processor clock speeds are continuing to outpace memory access speeds, resulting in larger cache miss latencies (measured in processor cycles). Meanwhile, major application domains, including games, multimedia, and web services, continue to have large memory footprints and do not use processor caches effectively, resulting in a higher cache miss rate. Together these cause significant performance degradation.

To bridge the gap between processor and memory performance and reduce the high miss rates of these irregular applications, a variety of data prefetching techniques have been proposed. The general goal of data prefetching is to hide the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to perform a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. Accurate data prefetching can introduce a major improvement in the performance of parallel applications [1]. The techniques can be classified into two groups: hardware-initiated data prefetching [2] and software-initiated data prefetching [3]. Hardware-initiated data prefetching focuses on sequential prefetching that exploits the spatial locality of data addresses; examples include the prefetch-on-miss algorithm, the tagged prefetch algorithm, and prefetching with arbitrary strides [4]. In software-initiated data prefetching, the compiler plays the chief role in prefetch scheduling and software pipelining (prolog/epilog) [5]. However, most existing data prefetch schemes are not effective in prefetching addresses for irregular applications, where it is difficult to issue accurate prefetches in a timely fashion.

This paper explores a priority-aware method to prefetch addresses. It introduces a novel approach for dynamically identifying and precomputing the data addresses of the instructions in the critical path that are marked with a higher priority. This significantly reduces the completion time of an application. The critical path can be identified at compile time or run time. A separate engine calculates the data addresses of the identified instructions in the critical path and prefetches, early enough, the data that will be used by the next critical instruction.

When a load/store instruction marked as on the critical path is fetched from the I-cache into the Instruction Fetch Queue (IFQ) and is likely to cause a data cache miss, a separate Prefetch Address Engine (PAE) generates a timely data prefetch. The issued prefetch does not affect the architected state of the main processor. Since precomputation of the address is speculative, it runs ahead of the normal computation by avoiding the fetch queue and reorder buffer delays experienced by the normal computation due to the in-order decode and the instruction scheduling priorities of the main processor pipeline. Sound research has been done on prioritizing the thread along the critical path of an application [6], and priority scheduling of threads is assumed in this paper. We further extend these priorities down to the instructions in the critical path of an application.

In particular, under a full or heavily loaded data cache condition, data prefetched in the critical path of an application may experience a significant delay, since it has to wait and may not have space to be stored in the data cache, where it is handled in a First-Come First-Served (FCFS) manner. To address this scenario, a priority-driven data replacement policy is introduced in the data cache. The replacement algorithm in our approach uses a non-preemptive priority policy: low-priority data in the data cache is replaced first. Data that is identified as high priority in an application is also prefetched into the data cache with a higher priority than other data, so that its stalling time is minimized. The reduced stalling time for high-priority data results in faster execution of the critical paths in the application, and consequently the overall execution time of the application is reduced.
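As a concrete illustration of this policy, the sketch below shows one possible victim-selection routine for a set-associative data cache in which each line is tagged with the priority of the data it holds. The names (CacheLine, choose_victim) and the LRU tie-breaking are illustrative assumptions; the paper itself only specifies that low-priority data is evicted first.

    from dataclasses import dataclass

    @dataclass
    class CacheLine:
        tag: int
        priority: int      # 0 = low (background data), 1 = high (critical-path data)
        last_use: int      # timestamp of the most recent access
        valid: bool = True

    def choose_victim(cache_set: list) -> int:
        """Pick the index of the line to evict from one cache set.

        Priority-driven, non-preemptive replacement: invalid lines first,
        then the lowest-priority line; among equal priorities, the least
        recently used line is chosen.
        """
        for i, line in enumerate(cache_set):
            if not line.valid:                     # an invalid line costs nothing to replace
                return i
        return min(range(len(cache_set)),
                   key=lambda i: (cache_set[i].priority, cache_set[i].last_use))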

Preliminary results show that priority-driven prefetching is useful: it reduces the completion time of an application significantly. In our simulator, the approach improved the overall performance over traditional prefetching in three experiments conducted with Active Prefetching, especially in the Matrix-Matrix multiplication.

The rest of this paper is organized as follows. The next section describes in detail the proposed approach and a hardware implementation to precompute and prefetch data addresses. A performance analysis of our approach is given in Section 3, which describes the simulation environment and preliminary analysis. Section 4 compares it with the most relevant prior work in the area of data prefetching techniques by highlighting the applicability and constraints of the prior schemes. Conclusions and some future directions are presented in Section 5.

II. PRIORITY-DRIVEN ACTIVE DATA PREFETCHING

The workflow overview of the approach is shown in Figure 1. First, the critical path of an application is identified by a compiler or a runtime system. The path is assigned a higher priority than the other paths. Next, using a priority-based mechanism [6], the OS executes the application by scheduling the processes or threads with their priority information. Thus, at the instruction level, the instructions of the processes or threads are entered in the IFQ; finally, active data prefetching is performed by a priority-driven method to prefetch the data.

The basic idea of the approach is to prioritize the data prefetching for the instructions on the critical paths of an application. Figure 2 presents an overview of the active data prefetch pipeline. The pre-decoded instructions in the IFQ are scanned in program order. Whenever the scan finds a load/store instruction marked for high-priority prefetching, it initiates the process of precomputation. Precomputed instructions never modify the architected state of the main processor. Thus, although an incorrect calculation would result in generating a wrong prefetch address, it does not corrupt the architected state of the system. Figure 3 highlights the additional hardware added. The IFQ is a FIFO queue of instructions coming from the instruction cache. The priority of each of the instructions is fed to a Priority Circuit, which picks the highest-priority, oldest instruction in the queue. The Priority Circuit then controls a multiplexer that can remove the instruction from the IFQ and place it into a Data Prefetch Queue. These instructions are then decoded by the PAE to obtain the address referenced, and a prefetch is issued.
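A minimal software model of this selection logic is sketched below, purely to make the data flow concrete. The field names (is_mem_op, marked_critical, priority) and the PAE interface are assumptions for illustration, not the actual hardware signals, and the marked instruction is copied rather than moved so the main pipeline still sees it.

    import heapq
    from collections import deque

    class PriorityCircuit:
        """Picks the highest-priority, oldest marked load/store from the IFQ
        and feeds it to the Data Prefetch Queue in front of the PAE."""

        def __init__(self):
            self.data_prefetch_queue = []   # min-heap ordered by (-priority, age)
            self._seq = 0                   # monotonically increasing age counter

        def scan_ifq(self, ifq: deque):
            # Scan the pre-decoded instructions in program order.
            for inst in list(ifq):
                if inst.is_mem_op and inst.marked_critical:
                    # Higher priority first; ties broken by program order (age).
                    heapq.heappush(self.data_prefetch_queue,
                                   (-inst.priority, self._seq, inst))
                    self._seq += 1

        def issue_to_pae(self, pae):
            if self.data_prefetch_queue:
                _, _, inst = heapq.heappop(self.data_prefetch_queue)
                addr = pae.compute_address(inst)   # speculative address precomputation
                pae.prefetch(addr)                 # squashed by L1 if the block is already present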

Fig. 2. Pipeline of Active Data Prefetch (Fetch, Instruction Cache, IFQ, Data Prefetch Queue, Decode, Prefetch Address Engine).

Fig. 3. The Hardware of the Prioritized Prefetch Engine.

The PAE executes instructions speculatively. The results generated by the PAE are used for prefetching data (as mentioned previously, they never update the architected state of the processor). When executing a marked instruction, the PAE generates the load/store address and prefetches the data into the L1 cache without loading a value into the main processor's register (for a load) and without storing a value at the prefetched address (for a store). Since the PAE is prevented from modifying the architected state of the main processor, it can speculatively run ahead of the actual computation without having to guarantee the correctness of the precomputation.
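The sketch below illustrates the kind of address precomputation the PAE performs for a marked MIPS-style load or store: the effective address is formed from a snapshot of the base register plus the immediate offset, a prefetch is issued for the containing cache block, and no register or memory state of the main processor is changed. The opcode list, the block size, and the issue_prefetch hook are illustrative assumptions, not taken from the paper.

    CACHE_BLOCK_SIZE = 32  # bytes; an assumed value

    def pae_precompute(inst, reg_snapshot, issue_prefetch):
        """Speculatively compute the effective address of a marked load/store
        and issue a prefetch for its cache block.

        reg_snapshot is a read-only copy of the register file, so the
        architected state of the main processor is never modified.
        """
        if inst.opcode not in ("lw", "sw", "lb", "sb"):   # only marked memory ops
            return None
        base = reg_snapshot[inst.base_reg]                # contents of the base register
        addr = base + inst.offset                         # effective address
        block_addr = addr - (addr % CACHE_BLOCK_SIZE)     # align to the cache block
        issue_prefetch(block_addr)                        # L1 squashes it if already present
        return addr                                       # used only for prefetching, never written back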

Fig. 1. Workflow Overview (software layers).

The PAE is a simple single-issue, in-order execution engine. As soon as it finishes one calculation it proceeds to the next buffered one, if any. After the address is generated, the data is prefetched into the L1 cache of the processor. The prefetch is squashed by L1 if the data is already present.

A. Software Support for Active Data Prefetching

Task graphs are used to abstract parallel applications. A task graph consists of a set of paths, and every path consists of a set of tasks and edges. A task weight is the computation time required by the task, and an edge weight is the amount of data transmitted between tasks. The critical path in a task graph determines the completion time of an application. Naturally, the tasks along the critical path should be executed with a high priority. When a task executes on a node, without loss of generality, we model the case as a task queue (as shown in Figure 4) where the node is a priority queueing system with the CPU as the server.

Fig. 4. Task Queue Model (task queue containing a critical task).

The OS executes the application by scheduling the processes or threads with their priority information. At the instruction level, the related instructions should provide these priority parameters. The underlying hardware then utilizes the parameters to reduce the processing time of the critical path.

B. Hardware Support for Active Data Prefetching

A processor is usually structured as a pipeline. The outline of the data prefetch hardware proposed in this paper is shown in Figures 2 and 3. In particular, we modify the IFQ by adding a new parameter, the priority level. The proposed engine is modeled as an M/D/1 queueing system with a priority policy. An instruction of a task on the critical path is marked with a higher priority level, so the proposed engine enqueues its data prefetches as high-priority ones. Consequently, the data required by the instruction is prefetched earlier than other data.

For non-preemptive priority policies, the average total delays per data item are given in Table I, assuming that there are two priority levels [7]. Here $\lambda_1$ is the arrival rate of the level 1 (high priority) data and $\lambda_2$ is the arrival rate of the level 2 (low priority) data. Usually, when the data prefetching demand is very high (i.e., the total load $(\lambda_1+\lambda_2)/\mu$ approaches 1), a preemptive-resume policy gives better performance for high-priority data than a non-preemptive policy.
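For reference, the standard two-class non-preemptive priority delay expressions for such a queue (following the analysis in [7], specialized to the deterministic service time $1/\mu$ of the M/D/1 model) take the form below; this is a sketch of the quantities Table I tabulates, not a reproduction of the table itself.

    R   = (\lambda_1 + \lambda_2) / (2\mu^2),        \rho_i = \lambda_i / \mu
    W_1 = R / (1 - \rho_1)                           (queueing delay, high priority)
    W_2 = R / ((1 - \rho_1)(1 - \rho_1 - \rho_2))    (queueing delay, low priority)
    T_i = 1/\mu + W_i                                (average total delay per data item)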

The unit of data being processed in the technique proposed in this paper is an instruction. Since the instructions of each thread need to be executed in order, and because it usually takes only a few clock cycles (except for read/write misses) for a processor to execute an instruction, a preemptive-resume priority policy is not required. This is also true for multi-threaded applications, and implementing preemptive-resume priority would actually increase the overhead, since a context-switching-equivalent mechanism would be needed. Hence, the preemptive-resume priority policy is not adopted in our approach. Furthermore, the replacement policy of the data cache is enhanced for the prefetched data; a detailed discussion is outside the scope of this paper and will be addressed in a follow-up paper. In summary, at the instruction level, the instructions of the threads are entered in the IFQ, and active data prefetching is performed by a priority-driven method to prefetch the data for those instructions.

III. SIMULATION ANALYSIS

An M/D/1 queue is used in the analysis of our data prefetching technique. The service times are identical for all arrivals in the queue, with mean service time $1/\mu$ (i.e., service rate $\mu$), and data arrivals are modeled by a Poisson process. If the FCFS queueing policy is used and the data arrival rate is $\lambda$, the average total delay per packet is $T = 1/\mu + \lambda/(2\mu(\mu-\lambda))$. For further analysis, a simulation study was carried out.

In the simulation, task graphs were used to abstract applications. In a task graph, nodes represent tasks. A directed edge from task $i$ to task $j$ specifies that task $i$ must be completed before task $j$ can begin. The weight of a node specifies the execution cost of the task, and the weight of an edge specifies the inter-task communication cost between the tasks.

It is assumed that the tasks in the task graph are numbered from $1$ to $n$, where $n$ is the total number of tasks in the graph. An edge between task $i$ and task $j$ is represented by an ordered tuple $(i, j)$, and a path in the task graph is represented by a list of edges. For example, $\{(i, j), (j, k), (k, l)\}$ is a path that starts from task $i$ and ends at task $l$.

The simulated task graph is stored in a connection matrix TG. Each element TG[i,j] of the matrix has three attributes: DataSize, Weight, and Flag. If $(i, j)$ is an edge in the task graph, TG[i,j].DataSize stores the number of packets sent from task $i$ to task $j$, TG[i,j].Weight holds the communication weight of the edge, and TG[i,j].Flag indicates whether or not the communication on the edge is prioritized. If $(i, j)$ is not an edge in the task graph, TG[i,j].DataSize is set to infinity. Note that the matrix is symmetric and that TG[i,i].Weight holds the node weight of task $i$.
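A direct software rendering of this structure is sketched below; the attribute names mirror the description above, while the use of a dictionary-of-records instead of a dense matrix is an implementation convenience assumed here, not something the paper prescribes.

    from dataclasses import dataclass
    import math

    @dataclass
    class TGEntry:
        DataSize: float = math.inf   # packets sent i -> j; infinity if (i, j) is not an edge
        Weight: float = 0.0          # communication weight of edge (i, j); node weight when i == j
        Flag: bool = False           # True if communication on this edge is prioritized

    def make_task_graph(n_tasks, edges, node_weights):
        """Build the symmetric connection matrix TG described in the text.

        edges: dict mapping (i, j) -> (data_size, comm_weight, prioritized)
        node_weights: dict mapping i -> task (node) weight
        """
        TG = {(i, j): TGEntry() for i in range(1, n_tasks + 1)
                                for j in range(1, n_tasks + 1)}
        for i, w in node_weights.items():
            TG[(i, i)].Weight = w                       # TG[i,i].Weight holds the node weight
        for (i, j), (size, weight, prio) in edges.items():
            for key in ((i, j), (j, i)):                # the matrix is kept symmetric
                TG[key].DataSize = size
                TG[key].Weight = weight
                TG[key].Flag = prio
        return TG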

The critical path CP is defined as the path in the set of paths PS of the task graph (a DAG) such that

    \mathrm{PathCompTime}(CP) = \max_{P \in PS} \mathrm{PathCompTime}(P),

where PathCompTime(P) represents the summation of the weights of all the tasks and edges in path P. For a path P with $m$ edges $e_1, \dots, e_m$ (and hence $m+1$ tasks $t_1, \dots, t_{m+1}$), the completion time is

    \mathrm{PathCompTime}(P) = \sum_{i=1}^{m+1} w(t_i) + \sum_{i=1}^{m} c(e_i),

where $e_i$ is the $i$-th edge in P, $w(t_i)$ is the task weight of task $t_i$, and $c(e_i)$ is the communication weight of edge $e_i$. The critical path completion time is then $\mathrm{PathCompTime}(CP)$.
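Given that representation, the critical path can be found by taking the path with the largest PathCompTime. The brute-force sketch below is only illustrative (a real scheduler would use a longest-path pass over the DAG) and assumes the TG structure from the previous sketch.

    def path_comp_time(path, TG):
        """PathCompTime(P): sum of the task weights and edge weights along P."""
        total = 0.0
        for (i, j) in path:
            total += TG[(i, i)].Weight      # w(t_i): weight of the task the edge leaves
            total += TG[(i, j)].Weight      # c(e_i): communication weight of the edge
        if path:
            total += TG[(path[-1][1], path[-1][1])].Weight   # weight of the final task
        return total

    def critical_path(paths, TG):
        """Return the path with the maximum completion time among the path set PS."""
        return max(paths, key=lambda p: path_comp_time(p, TG))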

In particular, list scheduling is adopted in the simulation. List scheduling is a class of scheduling heuristics in which tasks are assigned priorities and placed in a list ordered by decreasing priority. In this simulation, each task in a task graph was assigned a priority, and the tasks on the critical path of the task graph were assigned the highest priority level. A priority queue of ready tasks was built by inserting every task that has no immediate predecessors, with the tasks sorted in decreasing order of priority. As long as the priority queue was not empty, the task with the highest priority was assigned to an idle processor, maximizing the average system utilization; a rough sketch of this loop is given below. The purpose of this type of heuristic is to reduce the completion time of a task graph by reducing the completion time of the critical path. The structure of the scheduling algorithm is shown in Figure 5.
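In the sketch, the ready-task test and the idle-processor bookkeeping are simplified assumptions; the actual simulator also accounts for communication weights when a predecessor ran on a different processor.

    import heapq

    def list_schedule(tasks, succs, preds, priority, num_procs):
        """Simplified list scheduling: always dispatch the highest-priority ready
        task (critical-path tasks carry the highest priority) to the next idle
        processor.  tasks maps task -> computation weight."""
        ready = [(-priority[t], t) for t in tasks if not preds[t]]   # no predecessors
        heapq.heapify(ready)
        proc_free = [0.0] * num_procs          # time at which each processor becomes idle
        finish = {}
        remaining_preds = {t: len(preds[t]) for t in tasks}

        while ready:
            _, t = heapq.heappop(ready)
            p = min(range(num_procs), key=lambda i: proc_free[i])     # earliest-idle processor
            start = max(proc_free[p], max((finish[u] for u in preds[t]), default=0.0))
            finish[t] = start + tasks[t]
            proc_free[p] = finish[t]
            for s in succs[t]:                                        # release newly ready successors
                remaining_preds[s] -= 1
                if remaining_preds[s] == 0:
                    heapq.heappush(ready, (-priority[s], s))
        return max(finish.values())                                   # completion time of the graph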

In the first step, the critical path of a randomly generated task graph is identified. Then, the list scheduling policy schedules the task graph. After that, the CPU is simulated and the IFQ is linked with the PAE, which uses the data prefetching priority queue.

The priority queue is simulated with arrival rates ($\lambda_1$, $\lambda_2$) and service rate ($\mu$), where $\lambda_1$ is the arrival rate of the high-priority data prefetches and $\lambda_2$ that of the low-priority ones. The experimental applications were written in UPC [8] and run in a simulated DSM environment [9]. The UPC compiler generates assembly code for parallel execution. The critical path is identified and marked at the instruction level. The instruction code embedded with the priority information can then be executed in the simulated DSM environment. The simulator implements the MIPS IV instruction set, and the simulation results were reported by the underlying MIPS DSM simulator. Table II shows the definitions used in the explanation of the results. Results of the experiments are given in terms of completion time (measured in processor clock cycles). The following applications were simulated to verify the performance gain:

Matrix-Vector multiplication

Matrix-Matrix multiplication

LU Block Decomposition

A. Matrix-Vector Multiplication

The computation is $Y = A \cdot X$, where the matrix A is distributed over the processors row-wise, with each processor holding N/P rows of the array. The vector X is similarly distributed, with N/P elements on each processor. Each processor needs access to the whole of vector X to calculate its part of the result. It accesses the elements of the vector using a simple for loop, with prefetches inserted for blocks that will be accessed in future iterations. Table III reports the number of instructions executed and the completion times (in processor clock cycles) of this experiment under three cases.
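In outline, each processor's loop looks like the sketch below: it walks through vector X in cache-block-sized chunks and issues a prefetch for a block it will need in a future iteration. The prefetch() hook, the block size, and the lookahead distance are illustrative stand-ins for the prefetch instructions inserted in the actual UPC code.

    BLOCK = 4              # elements of X per cache block (assumed)
    PREFETCH_DISTANCE = 1  # how many blocks ahead to prefetch (assumed)

    def local_matvec(A_rows, X, prefetch):
        """One processor's share of Y = A * X: A_rows holds its N/P rows of A."""
        N = len(X)
        Y = [0.0] * len(A_rows)
        for r, row in enumerate(A_rows):
            for j in range(N):
                if j % BLOCK == 0:
                    # Prefetch the block of X needed in a future iteration;
                    # on real hardware this is the inserted prefetch instruction.
                    ahead = j + PREFETCH_DISTANCE * BLOCK
                    if ahead < N:
                        prefetch(X, ahead)
                Y[r] += row[j] * X[j]
        return Y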

TABLE II
DEFINITIONS USED IN THE RESULTS (N = 16, P = 4, S, cache block size, cache size, number of cache lines)

TABLE III
RESULTS FOR MATRIX-VECTOR MULTIPLICATION (test case vs. completion time, in processor clock cycles: 1011, 1171, 1171)

B. Matrix-Matrix Multiplication

The computation is $C = A \cdot B$, where the matrices A and B are distributed block-wise, with each processor holding a square block of the array; a square processor grid is assumed for simplicity. The algorithm is again the traditional one in that each processor accesses the whole block of elements at the appropriate remote processor to calculate its part of the result. As before, prefetch instructions are inserted for blocks that will be accessed in future iterations. Table IV summarizes the results of this experiment.
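The per-processor kernel follows the same pattern as the matrix-vector case, except that whole remote blocks are fetched: before multiplying the current pair of blocks, the processor issues a prefetch for the remote blocks needed in the next iteration. The block-indexing helpers below are illustrative assumptions and presume a square processor grid.

    def local_block_matmul(my_row, my_col, grid, get_block, prefetch_block, nb):
        """One processor's share of C = A * B with block-wise distribution.

        get_block(name, i, j) returns an nb x nb block; prefetch_block(name, i, j)
        issues a (possibly remote) prefetch for it; grid is sqrt(P).
        """
        C = [[0.0] * nb for _ in range(nb)]
        for k in range(grid):
            if k + 1 < grid:
                # Prefetch the blocks needed in the next iteration while computing this one.
                prefetch_block("A", my_row, k + 1)
                prefetch_block("B", k + 1, my_col)
            A_blk = get_block("A", my_row, k)
            B_blk = get_block("B", k, my_col)
            for i in range(nb):
                for j in range(nb):
                    C[i][j] += sum(A_blk[i][m] * B_blk[m][j] for m in range(nb))
        return C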

C. LU Block Decomposition

The matrix to be decomposed is distributed over the processors in a block-wise fashion, with each processor holding a block of the array. The traditional algorithm is broken down and the individual steps are optimized for the DSM environment under study [9]. The following summarizes the algorithm; a more detailed analysis and implementation of the algorithm is given in [10].

Local maxima of the elements in the major column are calculated by the processors in the major column, and the pivot processor (the processor holding the current diagonal element) calculates the global maximum. The pivot processor accesses the maxima at the major-column processors sequentially. Prefetches are issued for all the maxima before the processor accesses the values, so the processor might incur one remote miss for the very first access, but after that the rest of the accesses are converted to hits.

If the global maximum is not in the current row, the rows are swapped in parallel by all the processors in the major row and the row holding the global maximum. This step is skipped if the global maximum is in the current row. Again, prefetches are inserted for future accesses while swapping.

The processors holding the elements of the remaining (not yet reduced) submatrix then proceed in parallel to do the transformation. Again, prefetches are inserted for block accesses in future iterations. The algorithm is iterative, with the above steps repeated for all the rows in the array, and barriers are inserted where necessary. Table V summarizes the results of the experiment.
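The first step, prefetching all the per-processor maxima before the pivot processor reads them, can be sketched as follows. The remote_max array, the prefetch() hook, and the barrier are simplified placeholders for the shared-array accesses used in the UPC experiments.

    def pivot_select(remote_max, prefetch, barrier):
        """Pivot processor: gather the local column maxima of the other processors.

        All maxima are prefetched first, so at most the first read misses
        remotely and the remaining reads hit in the local cache.
        """
        barrier()                      # local maxima must be published before they are read
        for p in range(len(remote_max)):
            prefetch(remote_max, p)    # issue prefetches for every processor's maximum
        best_proc, best_val = 0, abs(remote_max[0])
        for p in range(1, len(remote_max)):
            v = abs(remote_max[p])     # these accesses are now (mostly) cache hits
            if v > best_val:
                best_proc, best_val = p, v
        return best_proc               # owner of the global maximum, i.e. the pivot row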

D. Summary

Prefetching requires additional instructions to be inserted into the code, hence the additional instruction overhead for the program with enhancements. The instruction overhead for the Matrix-Matrix multiplication with prefetching (2091) is noticeably higher than for the Matrix-Vector multiplication (1091). A prefetch instruction is usually associated with a conditional if statement, and the conditional needs to be evaluated at every iteration of the loop, irrespective of its outcome. This conditional is more complicated in the Matrix-Matrix multiplication due to its memory access pattern, hence the higher instruction overhead of the Matrix-Matrix multiplication program over the Matrix-Vector multiplication. The same is the case for LU Block Decomposition.

The critical path in each of the experiments is run at a higher priority, with a background application keeping the queues and the controllers busy. Normal prefetching indicates traditional prefetching, wherein prefetch instructions are in the code and the processor issues the prefetches every time it encounters a prefetch instruction. The code for Active Prefetching is the same, but the hardware is different in that the prefetches are issued based on the priority technique proposed in this paper. We notice a significant performance gain in each of the experiments by prioritizing the prefetches of the critical path. In the Matrix-Vector multiplication, the completion time with normal prefetch is 2544 cycles and with Active Prefetching it is 2036 cycles. In the Matrix-Matrix multiplication case, the completion time without prefetching is 4166 and with Active Prefetching it is 3544. The gain is higher in the cases with more regular memory access patterns (the Matrix-Vector and Matrix-Matrix multiplications). A more irregular memory access pattern, and the fact that each of the steps can only be optimized individually and not as a whole, contribute to a less significant performance gain in the LU Block Decomposition experiment. This is a limitation of the algorithm, because the processor has to finish one step before it moves on to another.

Also, the cache size is large compared to the size of the arrays used in the experiments. This might portray a more significant performance gain than is realistic.

A more detailed analysis of the effect of cache size on these experiments is being considered for future work.

IV. RELATED WORK

Rather than waiting for a cache miss to perform a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. This prefetch proceeds in parallel with processor computation, allowing the memory system time to transfer the desired data from main memory to the cache. Ideally, the prefetch will complete just in time for the processor to access the needed data in the cache without stalling the processor.

TABLE IV
RESULTS FOR MATRIX-MATRIX MULTIPLICATION
Without Prefetching: 4650; With Normal Prefetching: 4166; With Active Prefetching: 3544.

TABLE V
RESULTS FOR LU BLOCK DECOMPOSITION (test case vs. completion time, in processor clock cycles: 4249, 6720, 6720)

Prefetching fetches data or instructions into the cache before they are actually referenced. Due to the strongly sequential and spatial nature of instruction streams, next-sequential-block prefetching is found to be quite accurate and useful [11]. Being driven by hardware, such schemes can be implemented in various ways: prefetch-on-miss [17], always prefetching [17], tagged prefetching [6], thread prefetching [11], and multi-path prefetching [7]. While the first three techniques focus on when the next block prefetch should be triggered, the last two mechanisms emphasize prefetching the instructions in the next dynamic block of instructions. The prefetch accuracy achieved by these mechanisms is high. Although they were successful in improving performance, none of these systems considered prioritizing the critical path with respect to prefetching.

Kale et al. proposed to use the priority-based process scheduler provided by the operating system to speed up the processing of parallel tasks [6]. Although the high-level scheduler adds a priority mechanism to the tasks of a parallel application, the underlying data prefetching in these tasks is not prioritized. In particular, most previous work in task graph scheduling has focused on algorithmic research for task mapping [12], and little research has been conducted on efficient support for data prefetching while executing task schedules. The objective of our optimization is to shorten the execution time of the critical paths in a parallel application by prefetching data in a prioritized scheme.

V. CONCLUSIONS AND FUTURE WORK

This paper proposes a priority-driven data prefetching method based on the critical path of an application. Many application domains exhibit irregular access patterns that result in a significant number of cache misses. Furthermore, processor clock speeds are continuing to outpace memory access speeds, resulting in longer cache miss latencies. Thus even aggressive out-of-order processors suffer significant performance degradation when executing these applications, due to their frequent cache misses and increasingly long cache miss latencies. Existing data prefetch techniques do not effectively prefetch addresses for irregular applications in a timely fashion. In this paper we have explored a method to prefetch addresses, namely precomputing them.

Our approach is a novel one for accurately and dynamically identifying and precomputing the critical prefetch addresses determined by load/store instructions that are deemed to be responsible for most data cache misses. A separate PAE generates the prefetch addresses, and a priority scheme is applied to these addresses based on whether they come from a critical path. The results generated by the PAE are used only for prefetching data; since they do not update the architected state of the main processor, the PAE can speculatively run ahead of the actual computation without having to guarantee the correctness of the precomputation.

The results of the simulation show that the approach is effective in shortening the completion time of an application. In addition, the approach is a general one for high-performance systems, because it can be built on many existing systems by providing a prioritizing mechanism in their data prefetching schemes.

In the proposed approach, the identification of priorities can follow a static or a dynamic policy. To reduce the overhead of the dynamic policy, the static one will be utilized first; only if a special case arises, such as a heavily unbalanced load, would the approach switch to the dynamic policy. The integration of dynamic load-balancing policies will be studied later. Also, as a general approach, it can be applied to neighborhood prefetching and SNARFing; priority-aware neighborhood prefetching and SNARFing will be studied as future work.

REFERENCES

[1] Gutierrez E., Plata O., and Zapata E. L., "On improving the performance of data partitioning oriented parallel irregular reductions," in 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing, 2002, pp. 445–452.
[2] Wei Fen Lin, Reinhardt S. K., and Burger D., "Designing a modern memory hierarchy with hardware prefetching," IEEE Transactions on Computers, vol. 50, no. 11, pp. 1202–1218, Nov. 2001.
[3] Solihin Y., Jaejin Lee, and Torrellas J., "Using a user-level memory thread for correlation prefetching," in 29th Annual International Symposium on Computer Architecture, 2002, pp. 171–182.
[4] Tuah N. J., Kumar M., Venkatesh S., and Das S. K., "Performance optimization problem in speculative prefetching," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 5, pp. 471–484, May 2002.
[5] Ortega D., Ayguade E., Baer J. L., and Valero M., "Cost effective compiler directed memory prefetching and bypassing," in International Conference on Parallel Architectures and Compilation Techniques, 2002, pp. 189–198.
[6] L. V. Kale, B. Ramkumar, V. Saletore, and A. B. Sinha, "Prioritization in parallel symbolic computing," in Lecture Notes in Computer Science, T. Ito and R. Halstead, Eds., vol. 748, pp. 12–41, Springer-Verlag, 1993.
[7] Dimitri Bertsekas and Robert Gallager, Data Networks, Prentice-Hall International Editions, second edition, 1992.
[8] "UPC," available from the UPC project website.
[9] Katsinis Constantine, "Performance analysis of the simultaneous optical multiprocessor exchange bus," Parallel Computing, vol. 27, no. 8, pp. 1079–1115, July 2001.
[10] Harsha Narravula, Performance of Parallel Algorithms on a Broadcast-Based Architecture, Ph.D. thesis, Drexel University, Philadelphia, PA 19104, USA, Nov. 2003.
[11] T. Sherwood, S. Sair, and B. Calder, "Predictor-directed stream buffers," in 33rd International Symposium on Microarchitecture, Dec. 2000.
[12] M. Y. Wu and D. Gajski, "Hypertool: A programming aid for message-passing systems," IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 3, pp. 330–343, 1990.
