HPC
Architectures
Introduction
History of Machines
Parallel Performance
EPCC, University of Edinburgh
Overview
• The HPC Architecture course aims to cover:
– History of HPC systems
– Basic components of HPC systems
– Different types of HPC systems
– System software and compilers
– CPU components
– Memory and caches
– Multicore and GPGPU hardware
– Interconnects
– Specific examples of HPC architectures
• Practicals throughout
• Timetable:
– Monday 14:00-14:50 – lecture
– Thursday 14:00-14:50 – lecture
– Friday 11:10-12:00 – practical
• Assessed through exam
HPC Architectures2
Background
• Edinburgh/EPCC has been at the leading edge of parallel computing since the early 1980s
• This talk is an overview of key HPC systems associated with Edinburgh during that time
• It also introduces parallel performance concepts
Performance Trend
[Figure: supercomputer performance trend over time, on a FLOPS axis running from Kilo (10^3) through Mega (10^6), Giga (10^9), Tera (10^12), Peta (10^15) and Exa (10^18) towards Zetta (10^21) and Yotta (10^24). Graph borrowed from Wikipedia, © Lucas Wilkins.]
Quantifying Performance
• Serial computing is concerned with complexity
– how execution time varies with problem size N
– adding two arrays (or vectors) is O(N)
– matrix times vector is O(N^2), matrix-matrix is O(N^3)
• Look for clever algorithms
– naïve sort is O(N^2)
– divide-and-conquer approaches are O(N log N)
• Parallel computing is also concerned with scaling
– how time varies with number of processors P
– different algorithms can have different scaling behaviour
– but always remember that we are interested in minimum time!
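To make the complexity classes above concrete, here is a small sketch of my own (not from the course notes) that counts basic operations instead of timing code; the function names and cost models are illustrative assumptions:

```python
import math

# Hypothetical operation-count models for the complexities quoted above.

def vector_add_ops(n):
    # adding two length-n vectors: one addition per element -> O(N)
    return n

def matvec_ops(n):
    # n x n matrix times a vector: n multiply-adds per row -> O(N^2)
    return n * n

def naive_sort_ops(n):
    # naive (selection) sort compares every remaining pair -> O(N^2)
    return n * (n - 1) // 2

def merge_sort_ops(n):
    # divide-and-conquer sort: roughly n * log2(n) comparisons -> O(N log N)
    return int(n * math.log2(n)) if n > 1 else 0

for n in (1000, 2000):
    print(n, vector_add_ops(n), matvec_ops(n), naive_sort_ops(n), merge_sort_ops(n))
```

Doubling N from 1000 to 2000 doubles the vector-add count but quadruples the matrix-vector and naïve-sort counts, which is why clever algorithms matter as N grows.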
Performance Measures
• T(N,P) is the time for size N on P processors
• Speedup: S(N,P) = T(N,1) / T(N,P)
– typically S(N,P) < P
• Parallel Efficiency: E(N,P) = S(N,P) / P
– typically E(N,P) < 1
• Serial Efficiency: E(N) = T_best(N) / T(N,1), where T_best is the time of the best serial algorithm
– typically E(N) <= 1
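A minimal sketch of the definitions above (function names are my own), with speedup S(N,P) = T(N,1)/T(N,P) and parallel efficiency E(N,P) = S(N,P)/P:

```python
def speedup(t1, tp):
    # S(N,P) = T(N,1) / T(N,P)
    return t1 / tp

def efficiency(t1, tp, p):
    # E(N,P) = S(N,P) / P
    return speedup(t1, tp) / p

# e.g. a code taking 100 s on 1 processor and 16 s on 8 processors:
print(speedup(100.0, 16.0))        # 6.25 (< P = 8, as is typical)
print(efficiency(100.0, 16.0, 8))  # 0.78125 (< 1, as is typical)
```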
The Serial Component
• Amdahl’s law:
“the performance improvement to be gained by parallelisation is limited by the proportion of the code which is serial”
– Gene Amdahl, 1967
Amdahl’s law
• Assume a fraction α is completely serial
– time is the sum of serial and potentially parallel parts:
T(N,1) = α T(N,1) + (1 - α) T(N,1)
• Parallel time
– parallel part 100% efficient:
T(N,P) = α T(N,1) + (1 - α) T(N,1) / P
• Parallel speedup
S(N,P) = T(N,1) / T(N,P) = 1 / (α + (1 - α)/P) = P / (α P + 1 - α)
– for α = 0, S = P as expected (i.e. E = 100%)
– otherwise, speedup is limited by 1/α for any P
– impossible to effectively utilise large parallel machines?
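The 1/α limit is easy to see numerically; here is a sketch (assuming Amdahl's speedup S = 1/(α + (1 - α)/P), consistent with the bullets above):

```python
def amdahl_speedup(alpha, p):
    # serial fraction alpha is unaffected by P; the rest parallelises perfectly
    return 1.0 / (alpha + (1.0 - alpha) / p)

print(amdahl_speedup(0.0, 64))     # 64.0: S = P when nothing is serial
print(amdahl_speedup(0.05, 64))    # ~15.4: well below P even with only 5% serial code
print(amdahl_speedup(0.05, 10**6)) # ~20.0: approaches the 1/alpha limit
```

Even a 5% serial fraction caps the speedup at 20, no matter how many processors are added.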
Gustafson’s Law
• Need larger problems for larger numbers of CPUs
Utilising Large Parallel Machines
• Assume the parallel part is O(N) and the serial part is O(1)
– time: T(N,P) = α T(1,1) + (1 - α) N T(1,1) / P
– speedup: S(N,P) = (α + (1 - α) N) / (α + (1 - α) N / P)
• Scale the problem size with the number of CPUs, i.e. set N = P
– speedup: S(P,P) = α + (1 - α) P
– efficiency: E(P,P) = α/P + (1 - α)
• This maintains constant efficiency (1 - α) for large P
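A sketch of this scaled (weak-scaling) case, using Gustafson's result that with N = P the speedup is S = α + (1 - α)P:

```python
def gustafson_speedup(alpha, p):
    # scaled speedup with problem size N = P
    return alpha + (1.0 - alpha) * p

def gustafson_efficiency(alpha, p):
    # E = S / P = alpha/P + (1 - alpha)
    return gustafson_speedup(alpha, p) / p

for p in (16, 256, 4096):
    print(p, gustafson_speedup(0.05, p), gustafson_efficiency(0.05, p))
```

Unlike the fixed-size Amdahl case, the efficiency settles at 1 - α (here 0.95) instead of collapsing as P grows.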
Performance Summary
?Useful definitions
–Speed-up
–Efficiency
?Amdahl’s Law –“the performance improvement to be gained by parallelisation is limited by the proportion of the code
which is serial”
?Gustafson’s Law –to maintain constant efficiency we need to scale the problem size with the number of CPUs.
Parallel Computers at Edinburgh
–1981 ICL DAP (SIMD; 4K processors)
–1986 Meiko T800 CS (MIMD-DM; 400 processors)
–1988 AMT DAP608 (SIMD; 1K processors)
–1990 Meiko i860 CS (MIMD-DM; 64 processors)
–1991 TMC CM-200 (SIMD; 16K processors)
–1992 Meiko i860 CS (MIMD-DM; 16 processors)
–1994 Cray T3D (MIMD-NUMA; 512 processors), Cray Y-MP (Vector)
–1995 Meiko CS-2 (MIMD-DM)
–1996 Cray J90 (Vector)
–1997 Cray T3E (MIMD-NUMA; 344 processors)
–1998 Hitachi SR2201 (MIMD-DM)
–2000 Sun UltraSPARC III Cluster (SMP Cluster; 66 processors)
–2001 Sun Fire 15K (MIMD-SMP; 52 processors)
–2002 IBM p690 cluster (SMP cluster; 1280 processors)
–2004 QCDOC (MIMD-DM; ~14,000 processors)
–2005 IBM BlueGene/L (MIMD-DM; 2048 processors)
–2006 IBM p575 cluster (SMP cluster; 2560 processors)
–2007 Cray XT4 (MIMD-DM; 11,328 processors)
ICL DAPs
ICL DAPs
• Facts and Figures
– Lifetime: 1981/1982–1988
– Processors: 4096 single-bit processors (in each of 2 systems)
– Peak Performance: 0.03 GFlops
– Architecture: SIMD
– Memory: 2 MBytes
– Programming: Data Parallel (DAP Fortran)
• Notes
– one of the earliest production parallel computers
– ICL made ~10 before AMT took the DAP technology forward
– EPCC had a significant upgrade in 1988 to AMT DAP (1024 processors with 4 MBytes and a peak of ~60 MFlops)
Meiko CS-1
Meiko CS-1
• Facts and Figures
– Lifetime: 1986–1994
– Processors: 400 x T800 Transputers
– Peak Performance: 0.4 GFlops
– Architecture: MIMD-DM
– Memory: 400 MBytes
– Programming: OCCAM special-purpose language/OS
• Notes
– the T800 was the first processor to have a peak of 1 MFlops, and it also had built-in support for passing messages
– focus for the Edinburgh Concurrent Supercomputer Project, which was the precursor to EPCC
Birth of EPCC
Early 1990s
Meiko i860
Meiko i860
• Facts and Figures
– Lifetime: 1990–1995
– Processors: 64 x 80 MHz i860 (+ T800s for communication)
– Peak Performance: 5.1 GFlops
– Architecture: MIMD-DM
– Memory: 1 GByte
– Programming: Message Passing (CSTools)
• Notes
– split between the QCD and Materials Grand Challenges
– the QCD code sustained more than 1 GFlops, making it one of the fastest application codes in the world!
TMC CM-200
TMC CM-200
• Facts and Figures
– Lifetime: 1991–1996
– Processors: 16,384 single-bit processors + 512 FPUs
– Peak Performance: 5 GFlops
– Architecture: SIMD
– Memory: 512 MBytes
– Programming: Data Parallel (CM Fortran, C*)
• Notes
– largest SIMD machine in Europe
– state-of-the-art Data Vault with 10 GBytes of storage
Cray T3D
Cray T3D
• Facts and Figures
– Lifetime: 1994–1999
– Processors: 512 x 150 MHz EV4 Alphas
– Peak Performance: 76 GFlops
– Architecture: MIMD-NUMA
– Memory: 32 GBytes
– Programming: Message Passing (PVM/MPI), Work Sharing, Data Parallel (CRAFT)
• Notes
– first UK national parallel computing service
– at various times, this was the largest T3D in Europe
– choice of programming paradigms
Cray J90
• Facts and Figures
– Lifetime: 1996–2002
– Processors: 10 x 100 MHz vector processors
– Peak Performance: 2 GFlops
– Architecture: MIMD-SMP/MISD?
– Memory: 2 GBytes
– Programming: serial/vector
• Notes
– EPCC primarily had vector facilities in support of the Cray HPC systems
Cray T3E
Cray T3E
• Facts and Figures
– Lifetime: 1997–2002
– Processors: 344 x 450 MHz EV56 Alphas
– Peak Performance: 310 GFlops
– Architecture: MIMD-NUMA
– Memory: ~40 GBytes
– Programming: Message Passing (MPI), Data Parallel (CRAFT)
• Notes
– supported multiple services for various communities
– different processors had 64, 128 or 256 MBytes of memory
Sun E15K (Lomond)
Sun E15K (Lomond)
• Facts and Figures
– Lifetime: 2001–2007
– Processors: 52 x 900 MHz UltraSPARC IIIs
– Peak Performance: 94 GFlops
– Architecture: MIMD-SMP
– Memory: 52 GBytes
– Programming: Shared Variables (OpenMP), Message Passing (MPI)
• Notes
– culmination of a series of Sun SMP clusters, purchased with a single JREI grant, providing an HPC service for UoE
IBM p690 Cluster (HPCx Phase 1)
IBM p690 Cluster (HPCx Phase 1)
• Facts and Figures
– Lifetime: 12/2002–7/2004
– Processors: 1280 x 1.3 GHz Power 4s
– Peak Performance: 6.7 TFlops
– Architecture: SMP cluster
– Memory: 1280 GBytes
– Programming: Message Passing (MPI), Mixed Mode (OpenMP+MPI)
• Notes
– UK national HPC service run by UoE/EPCC, DL and IBM
– EPCC was the lead partner, although the system was located at DL
– focus on capability computing
– upgrades to Power 4+ (2004) and then Power 5 (2005/06)
QCDoC
• Quantum ChromoDynamics On a Chip
QCDoC
• Facts and Figures
– Lifetime: 10/2004–
– Processors: >14,000 x 400 MHz special-purpose chips
– Peak Performance: ~11 TFlops
– Architecture: MIMD-DM
– Memory: ~1750 GBytes
– Programming: Non-standard Message Passing (nearest-neighbour communications plus collectives)
• Notes
– multiple systems (the largest had 12K processors)
– designed by IBM, the University of Edinburgh and Columbia University
– QCD sustains up to 4 TFlops
IBM BlueGene (Blue Sky)
IBM BlueGene (Blue Sky)
• Facts and Figures
– Lifetime: 1/2005–
– Processors: 2048 x 700 MHz PowerPCs
– Peak Performance: 5.6 TFlops
– Architecture: MIMD-DM
– Memory: 512 GBytes
– Programming: Message Passing (MPI)
• Notes
– first BlueGene system in Europe
– low power requirements and high density of processors
– capable of scaling to extremely large systems
– many BlueGene systems of >100 TFlops exist
IBM p575 Cluster (HPCx Phase 3)
IBM p575 Cluster (HPCx Phase 3)
• Facts and Figures
– Lifetime: 7/2006–2010
– Processors: 2560 x 1.5 GHz Power 5s
– Peak Performance: 15.3 TFlops
– Architecture: SMP cluster
– Memory: 5120 GBytes
– Programming: Message Passing (MPI), Mixed Mode (OpenMP+MPI)
• Notes
– double the memory per processor of Phase 1/2
– significantly improved interconnect
– focussed on Complementary Capability Computing
– closes in January 2010
Cray XT4 (HECToR Phase 1)
Cray XT4 (HECToR Phase 1)
• Facts and Figures
– Lifetime: 9/2007–6/2009
– Processors: 5664 x 2.8 GHz dual-core Opterons, i.e. 11,328 cores
– Peak Performance: 63.4 TFlops
– Architecture: MIMD-DM
– Memory: 34 TBytes
– Programming: Message Passing (MPI)
• Notes
– UK’s current national HPC facility
– regular upgrades to 2013+
– initially one of the Top 20 systems worldwide
– supplemented by a Cray X2 (BlackWidow) vector system
Cray XT4 (HECToR Phase 2a)
• Facts and Figures
– Lifetime: 6/2009–
– Processors: 5664 x 2.3 GHz quad-core Opterons, i.e. 22,656 cores
– Peak Performance: 208 TFlops
– Architecture: MIMD-DM
– Memory: 45.3 TBytes
– Programming: Message Passing (MPI)
• Notes
– reduced in size from 60 to 33 cabinets in 2Q10 to allow for Phase 2b
Cray XT6 (HECToR Phase 2b)
• Facts and Figures
– Lifetime: 6/2010–
– Processors: 3712 x 2.1 GHz 12-core (Magny-Cours) Opterons, i.e. 44,544 cores
– Peak Performance: 374 TFlops
– Architecture: MIMD-DM
– Memory: 59.4 TBytes
– Programming: Message Passing (MPI)
• Notes
– 20 XT6 cabinets
– network upgrade by 1/2011
– Phase 3 upgrade expected later in 2011