
Microthreading and its Programming Model


T. Bernard, K. Bousias, L. Guang, C.R. Jesshope, P.M.W. Knijnenburg, M. Lankamp, A. Pimentel, M. van Tol, T.D. Vu, L. Zhang

Computer Systems Architecture, Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands

5. The μTC programming language

Microthreaded C (μTC) captures the microthreaded programming model and is intended as a compiler target language for user languages such as C, C++, etc.

It captures the ISA extensions of the microthreaded microprocessor:

• thread, family and place (processor sets) are captured as new types
• index captures the index of a thread automatically
• variables can be designated shared between threads (see the sketch below)
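A minimal sketch of these constructs, assuming the syntax of the sum-of-squares example shown later under Section 7 (the bounds and the empty thread body are illustrative):

    family fid;                 /* handle naming a family of threads       */
    place pid;                  /* the set of processors to run it on      */

    create (fid; pid; 0; 99) {  /* create threads with indices 0..99       */
        index i;                /* this thread's index, set automatically  */
        shared float s = 0;     /* scalar shared along the chain of threads */
        /* thread body */
    }
    sync (fid);                 /* barrier: wait for the family to finish  */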


1. Motivation and goals

Silicon system trends: device density continues to increase, but locality, power dissipation and reliability will become the major limitations. Chips will become asynchronous, distributed and multi-core!

Goals: the goals are the same as those of parallel and distributed computing, but they must be approached from the microprocessor architecture upwards:

• Scalable and distributed architecture
• Program multi-cores from sequential code
• Binary-code compatibility across an arbitrary number of cores
• Support concurrency and resource management dynamically

Solution: a scale-invariant programming model implemented at the level of the ISA.

2. Programming model

The DRISC ISA implements blocking threads based on parameterised families, which capture loops of all kinds as well as asynchronous task creation, by adding just these instructions:

• create, with a corresponding barrier sync, on named families
• break, to terminate an infinite family (a hedged sketch appears after this list)
• kill and squeeze (concurrent pre-emption) on named families

Synchronised communication: synchronised scalar variables support dependencies between threads within a single family, implemented in distributed shared registers.
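A minimal sketch of break, reusing the create/sync syntax of the Section 7 example; μTC's exact form for a family without a fixed upper bound is not shown in this poster, so the open-ended range and the names input and consume are assumptions:

    family fid;
    place pid;

    create (fid; pid; 0; ) {    /* assumed syntax: no upper bound    */
        index i;
        if (input[i] == 0)      /* input: hypothetical data array    */
            break;              /* terminates the whole family       */
        consume(input[i]);      /* hypothetical per-item work        */
    }
    sync (fid);                 /* completes once break is executed  */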

3. Microthreaded or DRISC microprocessor

Synchronising memory is a large register file that controls instruction execution. The processor tolerates large latencies through parallel slackness (threads waiting in the active queue), and it can be powered down when it has no active threads, saving power.
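A hedged software sketch in C of the synchronisation rule (a simplified model of one synchronising register, not the hardware implementation): a read of an empty register suspends the thread, and a write makes it runnable again.

    /* Model of one synchronising register */
    typedef struct {
        int  full;      /* has data been written yet?                 */
        long value;     /* the data, valid only when full             */
        int  waiting;   /* id of a thread suspended here, -1 if none  */
    } sreg;

    /* Read attempt by a thread: succeeds if the register is full;    */
    /* otherwise the thread is recorded here and suspends.            */
    int sreg_read(sreg *r, int thread_id, long *out) {
        if (r->full) {
            *out = r->value;
            return 1;            /* data available: continue executing */
        }
        r->waiting = thread_id;  /* suspend on the empty register      */
        return 0;
    }

    /* Write: fills the register; the caller re-queues any waiting    */
    /* thread so it re-executes the blocked instruction.              */
    int sreg_write(sreg *r, long v) {
        r->value = v;
        r->full  = 1;
        return r->waiting;       /* -1 if no thread was suspended      */
    }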

6. The Microthreaded Tool Chain

We are developing a set of tools, based on simulators, compilers and translators, for targeting and evaluating a multi-core architecture programmed from sequential, data-parallel and streaming languages.

Figure 3. Relation between C and μTC

Tools being developed within the project

Tasks and collaborations:

1. Core compiler from μTC to DRISC binary, using gcc.
2. Parallelising C compiler using the CoSy framework; collaborators: ACE and the University of Ioannina.
3. Data-parallel compiler using the SAC compiler; collaborator: University of Hertfordshire.
4. Coordination-language (Snet) compiler; collaborator: University of Hertfordshire.

7. Simulation results

    /* Sequential C: the loop carries a dependency on sum */
    float a[100];
    float b[100];
    int i;
    float sum = 0;
    ...
    for (i = 0; i < 100; i++) {
        sum += a[i] * a[i] + b[i] * b[i];
    }

    /* μTC equivalent: a family of threads replaces the loop */
    float a[100], b[100], sum = 0;
    family fid;
    place pid;
    ...
    create (fid; pid; 0; 99) {
        index i;
        shared float s = sum;   /* shared scalar carries the dependency */
        s = s + a[i] * a[i] + b[i] * b[i];
    }
    sync (fid);
    /* Here sum is the sum of squares */

The shared variable s takes over the loop-carried dependency on sum: each thread's read of s synchronises with its predecessor's write.

• Concurrent composition: programs are built concurrently; nodes represent threads, and leaf nodes do the computation
– branching at a node represents the creation of concurrent subordinate threads
• Threads at different levels run concurrently
• The events are:
– A creates Bi for all i in some set
– dependencies are defined between threads
– A continues until a barrier sync
• Threads may have dependencies, but only between nodes at one level (a hedged sketch of this composition follows the figure below)

[Figure: concurrency tree. Thread A creates the family B0 … Bn; a dependency chain links the Bi; A continues to the barrier sync.]
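A hedged μTC sketch of this hierarchical composition; nesting a create inside a thread body is assumed to follow the same syntax as top-level creation, and n, m and leaf_compute are hypothetical:

    family fa;
    place p;

    create (fa; p; 0; n-1) {      /* A creates the family B0..Bn-1       */
        index i;
        family fb;
        create (fb; p; 0; m-1) {  /* each Bi creates subordinate threads */
            index j;
            leaf_compute(i, j);   /* hypothetical leaf computation       */
        }
        sync (fb);                /* Bi waits for its subordinate family */
    }
    sync (fa);                    /* A continues until this barrier sync */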

[Figure: the DRISC processor pipeline. A queue of active threads feeds a thread instruction buffer; instructions read the synchronising memory (register file) and are dispatched to fixed-delay operations or variable-delay operations (e.g. loads).]

1. Threads are created dynamically, with a context in synchronising memory.
2. Instructions issued from the head of the active queue read synchronising memory.
3. If data is available it is sent for processing; otherwise the thread suspends on the empty register.
4. Suspended threads are rescheduled when data is written, to re-execute the blocked instruction.

[Figure: a microgrid. Clusters of μT processors, allocated by the SEP, are connected to an on-chip COMA memory.]

4. Microgrid

The SEP allocates configured rings of processors to threads, which delegate work (subtrees):

• dependency rings are circuit switched,
• the delegation grid is packet switched.

A create instruction distributes a family of threads over a ring of processors; the number of processors is set dynamically. The on-chip COMA cache organisation provides flexibility in partitioning the on-chip memory.

[Figure: the microthreaded tool chain, targeted from sequential, data-parallel and streaming languages. Components, marked as existing today, in development or to be developed: C to μTC, SAC to μTC and Snet to μTC translators; a μTC to ISA Alpha + ISA μT compiler; an ISA Alpha + ISA μT assembler/loader; a microthreaded CMP simulator/emulator; hand-assembled kernels; an FPGA microthreaded processor.]

Execution of hand-compiled kernels on our simulator shows that:

• for kernels with no inter-loop dependencies, performance scales almost linearly;
• for kernels with inter-loop dependencies, speedup can still be achieved, although it is limited by the number of independent instructions within the thread body.

[Figure: microthreaded CMP performance. Speedup versus number of processors (1 to 256) for Livermore Kernel 7 (equation of state fragment) and matrix-vector multiplication.]
