Microthreading and its Programming Model
T. Bernard, K. Bousias, L. Guang, C.R. Jesshope, P.M.W. Knijnenburg, M.
Lankamp, A. Pimentel, M. van Tol, T.D. Vu, L. Zhang
Computer Systems Architecture, Informatics Institute, University of Amsterdam,
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
5. The μTC programming language
Microthreaded C (μTC) captures the microthreaded programming model and is intended as a compiler target language for user languages such as C and C++.
- It captures the ISA extensions from the microthreaded microprocessor
- thread, family and place (processor sets) are captured as new types
- index captures the index of a thread automatically
- Variables can be designated as shared between threads
1. Motivation and goals
Silicon system trends - device density continues to increase, but locality, power dissipation and reliability will become the major limitations:
- chips will become asynchronous, distributed and multi-core!
Goals - the goals are exactly the same as those in parallel and distributed computing, but they need to be approached from the microprocessor architecture upwards:
- scalable and distributed architecture
- program multi-cores from sequential code
- binary code compatibility across an arbitrary number of cores
- support concurrency and resource management dynamically
Solution - a scale-invariant programming model implemented at the level of the ISA
2. Programming model
The DRISC ISA implements blocking threads based on parameterised families, which capture loops of all kinds and asynchronous task creation - just add these instructions:
- create and a corresponding barrier sync on named families
- break to terminate an infinite family
- kill and squeeze (concurrent pre-emption) on named families
Synchronised communication - synchronised scalar variables support dependencies between threads within a single family, implemented in distributed-shared registers.
3. Microthreaded or DRISC microprocessor
Synchronising memory is a large register file that controls instruction execution. The processor tolerates large latencies using parallel slackness (threads in the active queue), and it can be powered down when it has no active threads, saving power.
6. The Microthreaded Tool Chain
We are developing a set of tools - simulators, compilers and translators - for targeting and evaluating a multi-core architecture programmed from sequential, data-parallel and streaming languages.
Figure 3. Relation between C and μTC
Tools being developed within the project
Tasks and collaborations:
1. Core compiler: μTC to DRISC binary, using gcc.
2. Parallelising C compiler using the CoSy framework; collaborators: ACE and University of Ioannina.
3. Data-parallel compiler using the SAC compiler; collaborators: University of Hertfordshire.
4. Coordination-language compiler using the S-Net compiler; collaborators: University of Hertfordshire.
7. Simulation results
/* Sequential C */
float a[100];
float b[100];
int i;
float sum = 0;
…
for (i = 0; i < 100; i++) {
    sum += a[i] * a[i] + b[i] * b[i];
}

/* Equivalent μTC */
float a[100], b[100], sum = 0;
family fid;
place pid;
…
create (fid; pid; 0; 99) {
    index i;
    shared float s = sum;
    s = s + a[i] * a[i] + b[i] * b[i];
}
sync (fid);
/* Here sum is the sum of squares */
Concurrent composition - build programs concurrently:
- nodes represent threads; leaf nodes do computation
- branching at a node represents the creation of concurrent subordinate threads
- threads at different levels run concurrently
The events are:
- A creates Bi for all i in some set
- dependencies are defined between threads
- A continues until a barrier sync
- there may be dependencies, but only between nodes at one level
[Figure: concurrency tree - thread A creates a family B0..Bn; a dependency chain links the B threads; A waits at the barrier sync]
1. Threads are created dynamically with a context in synchronising memory
2. Instructions issued from the head of the active queue read synchronising memory
3. If data is available it is sent for processing… otherwise the thread suspends on the empty register
4. Suspended threads are rescheduled when data is written, to re-execute the blocked instruction
[Figure: DRISC pipeline - thread instruction buffer, queue of active threads, synchronising memory (register file), fixed-delay operations and variable-delay operations (e.g. loads), with instruction and data paths]
[Figure: Microgrid - an SEP allocates clusters/rings of μT processors, connected to an on-chip COMA memory]
4. Microgrid
The SEP allocates configured rings of processors to threads, which delegate work (sub-trees):
- dependency rings are circuit switched; the delegation grid is packet switched.
A create instruction distributes a family of threads to a ring of processors:
- the number of processors is set dynamically.
The on-chip COMA cache organisation provides flexibility in partitioning the on-chip memory.
[Figure: tool chain - front ends (C to μTC, SAC to μTC, S-Net to μTC) for sequential, data-parallel and streaming languages; a μTC-to-ISA core compiler targeting the Alpha ISA + μT ISA; a microthreaded CMP simulator/emulator with assembler/loader and hand-assembled kernels; an FPGA microthreaded processor; components variously exist today, are in development, or are to be developed]
Execution of hand-compiled kernels on our simulator shows that:
- for kernels with no inter-loop dependencies, performance scales almost linearly;
- for kernels with inter-loop dependencies, speedup can still be achieved, although it is limited by the number of independent instructions within the thread body.
[Figure: Microthreaded CMP performance - speedup vs. number of processors (up to 256) for Livermore Kernel 7 (equation of state fragment) and matrix-vector multiplication]