HPC
Architectures
Introduction
History of Machines
Parallel Performance
EPCC, University of Edinburgh
Overview
• The HPC Architecture course aims to cover:
– History of HPC systems
– Basic components of HPC systems
– Different types of HPC systems
– System software and compilers
– CPU components
– Memory and caches
– Multicore and GPGPU hardware
– Interconnects
– Specific examples of HPC architectures
• Practicals throughout
• Timetable:
– Monday 14:00-14:50 – lecture
– Thursday 14:00-14:50 – lecture
– Friday 11:10-12:00 – practical
• Assessed through exam
HPC Architectures2
Background
• Edinburgh/EPCC has been at the leading edge of parallel computing since the early 1980s
• This talk is an overview of key HPC systems associated with Edinburgh during that time
• It also introduces parallel performance concepts
Performance Trend
[Figure: supercomputer performance trend over time, on a FLOPS axis running from Kilo (10^3) through Mega (10^6), Giga (10^9), Tera (10^12), Peta (10^15) and Exa (10^18) towards Zetta (10^21) and Yotta (10^24). Graph borrowed from Wikipedia, © Lucas Wilkins.]
Quantifying Performance
• Serial computing is concerned with complexity
– how execution time varies with problem size N
– adding two arrays (or vectors) is O(N)
– matrix times vector is O(N^2), matrix-matrix is O(N^3)
• Look for clever algorithms
– naïve sort is O(N^2)
– divide-and-conquer approaches are O(N log N)
• Parallel computing is also concerned with scaling
– how time varies with number of processors P
– different algorithms can have different scaling behaviour
– but always remember that we are interested in minimum time!
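To make the complexity classes above concrete, here is a small sketch of my own (not from the course notes) that counts basic operations instead of timing code; the function names and cost models are illustrative assumptions:

```python
import math

# Hypothetical operation-count models for the complexities quoted above.

def vector_add_ops(n):
    # adding two length-n vectors: one addition per element -> O(N)
    return n

def matvec_ops(n):
    # n x n matrix times a vector: n multiply-adds per row -> O(N^2)
    return n * n

def naive_sort_ops(n):
    # naive (selection) sort compares every remaining pair -> O(N^2)
    return n * (n - 1) // 2

def merge_sort_ops(n):
    # divide-and-conquer sort: roughly n * log2(n) comparisons -> O(N log N)
    return int(n * math.log2(n)) if n > 1 else 0

for n in (1000, 2000):
    print(n, vector_add_ops(n), matvec_ops(n), naive_sort_ops(n), merge_sort_ops(n))
```

Doubling N from 1000 to 2000 doubles the vector-add count but quadruples the matrix-vector and naïve-sort counts, which is why clever algorithms matter as N grows.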
Performance Measures
• T(N,P) is the time for size N on P processors
• Speedup: S(N,P) = T(N,1) / T(N,P)
– typically S(N,P) < P
• Parallel Efficiency: E(N,P) = S(N,P) / P
– typically E(N,P) < 1
• Serial Efficiency: E(N) = T_best(N) / T(N,1), where T_best is the time of the best serial algorithm
– typically E(N) <= 1
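A minimal sketch of the definitions above (function names are my own), with speedup S(N,P) = T(N,1)/T(N,P) and parallel efficiency E(N,P) = S(N,P)/P:

```python
def speedup(t1, tp):
    # S(N,P) = T(N,1) / T(N,P)
    return t1 / tp

def efficiency(t1, tp, p):
    # E(N,P) = S(N,P) / P
    return speedup(t1, tp) / p

# e.g. a code taking 100 s on 1 processor and 16 s on 8 processors:
print(speedup(100.0, 16.0))        # 6.25 (< P = 8, as is typical)
print(efficiency(100.0, 16.0, 8))  # 0.78125 (< 1, as is typical)
```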
The Serial Component
• Amdahl’s law:
“the performance improvement to be gained by parallelisation is limited by the proportion of the code which is serial”
– Gene Amdahl, 1967
Amdahl’s law
• Assume a fraction α is completely serial
– time is the sum of serial and potentially parallel parts:
T(N,1) = α T(N,1) + (1 - α) T(N,1)
• Parallel time
– parallel part 100% efficient:
T(N,P) = α T(N,1) + (1 - α) T(N,1) / P
• Parallel speedup
S(N,P) = T(N,1) / T(N,P) = 1 / (α + (1 - α)/P) = P / (α P + 1 - α)
– for α = 0, S = P as expected (i.e. E = 100%)
– otherwise, speedup is limited by 1/α for any P
– impossible to effectively utilise large parallel machines?
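The 1/α limit is easy to see numerically; here is a sketch (assuming Amdahl's speedup S = 1/(α + (1 - α)/P), consistent with the bullets above):

```python
def amdahl_speedup(alpha, p):
    # serial fraction alpha is unaffected by P; the rest parallelises perfectly
    return 1.0 / (alpha + (1.0 - alpha) / p)

print(amdahl_speedup(0.0, 64))     # 64.0: S = P when nothing is serial
print(amdahl_speedup(0.05, 64))    # ~15.4: well below P even with only 5% serial code
print(amdahl_speedup(0.05, 10**6)) # ~20.0: approaches the 1/alpha limit
```

Even a 5% serial fraction caps the speedup at 20, no matter how many processors are added.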
Gustafson’s Law
• Need larger problems for larger numbers of CPUs
Utilising Large Parallel Machines
• Assume the parallel part is O(N) and the serial part is O(1)
– time: T(N,P) = α T(1,1) + (1 - α) N T(1,1) / P
– speedup: S(N,P) = (α + (1 - α) N) / (α + (1 - α) N / P)
• Scale the problem size with the number of CPUs, i.e. set N = P
– speedup: S(P,P) = α + (1 - α) P
– efficiency: E(P,P) = α/P + (1 - α)
• This maintains constant efficiency (1 - α) for large P
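A sketch of this scaled (weak-scaling) case, using Gustafson's result that with N = P the speedup is S = α + (1 - α)P:

```python
def gustafson_speedup(alpha, p):
    # scaled speedup with problem size N = P
    return alpha + (1.0 - alpha) * p

def gustafson_efficiency(alpha, p):
    # E = S / P = alpha/P + (1 - alpha)
    return gustafson_speedup(alpha, p) / p

for p in (16, 256, 4096):
    print(p, gustafson_speedup(0.05, p), gustafson_efficiency(0.05, p))
```

Unlike the fixed-size Amdahl case, the efficiency settles at 1 - α (here 0.95) instead of collapsing as P grows.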
Performance Summary
?Useful definitions
–Speed-up
–Efficiency
?Amdahl’s Law –“the performance improvement to be gained by parallelisation is limited by the proportion of the code
which is serial”
?Gustafson’s Law –to maintain constant efficiency we need to scale the problem size with the number of CPUs.
Parallel Computers at Edinburgh
–1981 ICL DAP (SIMD; 4K processors)
–1986 Meiko T800 CS (MIMD-DM; 400 processors)
–1988 AMT DAP608 (SIMD; 1K processors)
–1990 Meiko i860 CS (MIMD-DM; 64 processors)
–1991 TMC CM-200 (SIMD; 16K processors)
–1992 Meiko i860 CS (MIMD-DM; 16 processors)
–1994 Cray T3D (MIMD-NUMA; 512 processors), Cray Y-MP (Vector)
–1995 Meiko CS-2 (MIMD-DM)
–1996 Cray J90 (Vector)
–1997 Cray T3E (MIMD-NUMA; 344 processors)
–1998 Hitachi SR2201 (MIMD-DM)
–2000 Sun UltraSPARC III Cluster (SMP Cluster; 66 processors)
–2001 Sun Fire 15K (MIMD-SMP; 52 processors)
–2002 IBM p690 cluster (SMP cluster; 1280 processors)
–2004 QCDOC (MIMD-DM; ~14,000 processors)
–2005 IBM BlueGene/L (MIMD-DM; 2048 processors)
–2006 IBM p575 cluster (SMP cluster; 2560 processors)
–2007 Cray XT4 (MIMD-DM; 11,328 processors)
ICL DAPs
ICL DAPs
• Facts and Figures
– Lifetime: 1981/1982–1988
– Processors: 4096 single-bit processors (in each of 2 systems)
– Peak Performance: 0.03 GFlops
– Architecture: SIMD
– Memory: 2 MBytes
– Programming: Data Parallel (DAP Fortran)
• Notes
– one of the earliest production parallel computers
– ICL made ~10 before AMT took the DAP technology forward
– EPCC had a significant upgrade in 1988 to AMT DAP (1024 processors with 4 MBytes and a peak of ~60 MFlops)
Meiko CS-1
Meiko CS-1
• Facts and Figures
– Lifetime: 1986–1994
– Processors: 400 x T800 Transputers
– Peak Performance: 0.4 GFlops
– Architecture: MIMD-DM
– Memory: 400 MBytes
– Programming: OCCAM special-purpose language/OS
• Notes
– the T800 was the first processor to have a peak of 1 MFlops, and it also had built-in support for passing messages
– focus for the Edinburgh Concurrent Supercomputer Project, which was the precursor to EPCC
Birth of EPCC
Early 1990s
Meiko i860
Meiko i860
• Facts and Figures
– Lifetime: 1990–1995
– Processors: 64 x 80 MHz i860 (+ T800s for communication)
– Peak Performance: 5.1 GFlops
– Architecture: MIMD-DM
– Memory: 1 GByte
– Programming: Message Passing (CSTools)
• Notes
– split between the QCD and Materials Grand Challenges
– the QCD code sustained more than 1 GFlops, making it one of the fastest application codes in the world!
TMC CM-200
TMC CM-200
• Facts and Figures
– Lifetime: 1991–1996
– Processors: 16,384 single-bit processors + 512 FPUs
– Peak Performance: 5 GFlops
– Architecture: SIMD
– Memory: 512 MBytes
– Programming: Data Parallel (CM Fortran, C*)
• Notes
– largest SIMD machine in Europe
– state-of-the-art Data Vault with 10 GBytes of storage
Cray T3D
Cray T3D
• Facts and Figures
– Lifetime: 1994–1999
– Processors: 512 x 150 MHz EV4 Alphas
– Peak Performance: 76 GFlops
– Architecture: MIMD-NUMA
– Memory: 32 GBytes
– Programming: Message Passing (PVM/MPI), Work Sharing, Data Parallel (CRAFT)
• Notes
– first UK national parallel computing service
– at various times, this was the largest T3D in Europe
– choice of programming paradigms
Cray J90
• Facts and Figures
– Lifetime: 1996–2002
– Processors: 10 x 100 MHz vector processors
– Peak Performance: 2 GFlops
– Architecture: MIMD-SMP/MISD?
– Memory: 2 GBytes
– Programming: serial/vector
• Notes
– EPCC primarily had vector facilities in support of the Cray HPC systems
Cray T3E
Cray T3E
• Facts and Figures
– Lifetime: 1997–2002
– Processors: 344 x 450 MHz EV56 Alphas
– Peak Performance: 310 GFlops
– Architecture: MIMD-NUMA
– Memory: ~40 GBytes
– Programming: Message Passing (MPI), Data Parallel (CRAFT)
• Notes
– supported multiple services for various communities
– different processors had 64, 128 or 256 MBytes of memory
Sun E15K (Lomond)
Sun E15K (Lomond)
• Facts and Figures
– Lifetime: 2001–2007
– Processors: 52 x 900 MHz UltraSPARC IIIs
– Peak Performance: 94 GFlops
– Architecture: MIMD-SMP
– Memory: 52 GBytes
– Programming: Shared Variables (OpenMP), Message Passing (MPI)
• Notes
– culmination of a series of Sun SMP clusters, purchased with a single JREI grant, providing an HPC service for UoE
IBM p690 Cluster (HPCx Phase 1)
IBM p690 Cluster (HPCx Phase 1)
• Facts and Figures
– Lifetime: 12/2002–7/2004
– Processors: 1280 x 1.3 GHz Power 4s
– Peak Performance: 6.7 TFlops
– Architecture: SMP cluster
– Memory: 1280 GBytes
– Programming: Message Passing (MPI), Mixed Mode (OpenMP+MPI)
• Notes
– UK national HPC service run by UoE/EPCC, DL and IBM
– EPCC was the lead partner, although the system was located at DL
– focus on capability computing
– upgrades to Power 4+ (2004) and then Power 5 (2005/06)
QCDoC
• Quantum ChromoDynamics On a Chip
QCDoC
• Facts and Figures
– Lifetime: 10/2004–
– Processors: >14,000 x 400 MHz special-purpose chips
– Peak Performance: ~11 TFlops
– Architecture: MIMD-DM
– Memory: ~1750 GBytes
– Programming: Non-standard Message Passing (nearest-neighbour communications plus collectives)
• Notes
– multiple systems (the largest had 12K processors)
– designed by IBM, the University of Edinburgh and Columbia University
– QCD sustains up to 4 TFlops
IBM BlueGene (Blue Sky)
IBM BlueGene (Blue Sky)
• Facts and Figures
– Lifetime: 1/2005–
– Processors: 2048 x 700 MHz PowerPCs
– Peak Performance: 5.6 TFlops
– Architecture: MIMD-DM
– Memory: 512 GBytes
– Programming: Message Passing (MPI)
• Notes
– first BlueGene system in Europe
– low power requirements and high density of processors
– capable of scaling to extremely large systems
– many BlueGene systems of >100 TFlops exist
IBM p575 Cluster (HPCx Phase 3)
IBM p575 Cluster (HPCx Phase 3)
• Facts and Figures
– Lifetime: 7/2006–2010
– Processors: 2560 x 1.5 GHz Power 5s
– Peak Performance: 15.3 TFlops
– Architecture: SMP cluster
– Memory: 5120 GBytes
– Programming: Message Passing (MPI), Mixed Mode (OpenMP+MPI)
• Notes
– double the memory per processor of Phase 1/2
– significantly improved interconnect
– focussed on Complementary Capability Computing
– closes in January 2010
Cray XT4 (HECToR Phase 1)
Cray XT4 (HECToR Phase 1)
• Facts and Figures
– Lifetime: 9/2007–6/2009
– Processors: 5664 x 2.8 GHz dual-core Opterons, i.e. 11,328 cores
– Peak Performance: 63.4 TFlops
– Architecture: MIMD-DM
– Memory: 34 TBytes
– Programming: Message Passing (MPI)
• Notes
– UK’s current national HPC facility
– regular upgrades to 2013+
– initially one of the Top 20 systems worldwide
– supplemented by a Cray X2 (BlackWidow) vector system
Cray XT4 (HECToR Phase 2a)
• Facts and Figures
– Lifetime: 6/2009–
– Processors: 5664 x 2.3 GHz quad-core Opterons, i.e. 22,656 cores
– Peak Performance: 208 TFlops
– Architecture: MIMD-DM
– Memory: 45.3 TBytes
– Programming: Message Passing (MPI)
• Notes
– reduced in size from 60 to 33 cabinets in 2Q10 to allow for Phase 2b
Cray XT6 (HECToR Phase 2b)
• Facts and Figures
– Lifetime: 6/2010–
– Processors: 3712 x 2.1 GHz 12-core (Magny-Cours) Opterons, i.e. 44,544 cores
– Peak Performance: 374 TFlops
– Architecture: MIMD-DM
– Memory: 59.4 TBytes
– Programming: Message Passing (MPI)
• Notes
– 20 XT6 cabinets
– network upgrade by 1/2011
– Phase 3 upgrade expected later in 2011