

### THE MODULAR SUPERCOMPUTING ARCHITECTURE

Hardware composability for application diversity

ZIH Colloquium | Estela Suarez (JSC)



Member of the Helmholtz Association

### OUTLINE

- Evolution of HPC architectures
  - Global historical evolution
  - Dual architecture at JSC
  - Cluster-Booster
  - Modular Supercomputing Architecture
- Software
  - Software stack
  - Network bridging
  - Programming environment
  - Scheduling and resource management
- Application experience
- Conclusions and Next steps





# **Historical evolution in HPC Architectures**

- 1940–1950: the first computers are supercomputers
  - Specialized, very expensive
- 1960–1980: general-purpose computers appear
  - Special machines are still needed to solve very complex problems
    - → Supercomputing (High Performance Computing, HPC)
      - Focus: floating-point operations (linear algebra)
      - Special-purpose technologies (fast vector processors, parallel architectures)
      - Only a few machines produced
- **1990–2000**: standard processors are integrated
  - Many "computers" connected through a fast network
    - Distributed memory → MPI
  - Both in proprietary Massively Parallel Processing (MPP) and in Cluster Computing
- 2010–today: heterogeneous cluster systems
  - Use accelerator technologies (GPU, many-core)



Cray-1 (Source: Wikipedia)





JURECA (Source: FZJ)





### **Cluster vs. MPP**

**Example-systems at JSC** 

#### General purpose systems

- +: Highly flexible
- -: Relatively large energy consumption
- +: Preferred by many applications
  - Some code parts could profit from a massively parallel system

#### Highly scalable systems (MPP)

- +: Highly energy efficient
- -: Only a few (highly parallelizable) codes can fully exploit them

Can one combine the best of these two worlds into a single system?






# **HOMOGENEOUS CLUSTER**

General purpose CPUs attached to a high-speed network



- +: Easy to use
- +: Very flexible
- -: Power hungry

**CN**: Cluster Node (general purpose processor)



# **Traditional HETEROGENEOUS CLUSTER**

Attach accelerators (e.g. GPUs) to each CPU



**CN**: Cluster Node (general purpose processor)

**GPU**: Graphics Processing Unit (or any other accelerator)

- +: Energy efficient
- +: Easy management
- -: Static assignment of accelerators to CPUs
- -: Expensive scale-up



# **CLUSTER-BOOSTER concept**

#### System-level heterogeneity



- +: Energy efficient
- +: Better scalability
- +: High flexibility
- +: Dynamic resource assignment

**CN**: Cluster Node (general purpose processor)

**BN**: Booster Node (autonomous accelerator)

N. Eicker, Th. Lippert, Th. Moschny, and **E. Suarez**, "The DEEP Project - An alternative approach to heterogeneous cluster-computing in the many-core era", Concurrency and Computation: Practice and Experience, Vol. 28, pp. 2394–2411 (2016), doi: 10.1002/cpe.3562.





# **THE DEEP PROJECTS**

- **DEEP** (2011–2015)
  - Introduced the **Cluster-Booster** architecture
- **DEEP-ER** (2013–2017)
  - Added I/O and resiliency functionalities
- **DEEP-EST** (2017–2021)
  - Extends the concept to a Modular Supercomputer Architecture

- 27 partners (>80 people)
- EU funding: 30 M€
- Budget: 45 M€



## THE DEEP PROTOTYPE





|                 | Cluster                   | Booster            |
|-----------------|---------------------------|--------------------|
| Node count      | 128                       | 384                |
| Processor       | Intel Xeon (Sandy Bridge) | Xeon Phi (**KNC**) |
| Cores / node    | 8 (×2 sockets)            | 61                 |
| Threads / node  | 32                        | 244                |
| Frequency       | 2.7 GHz                   | 1.2 GHz            |
| Memory (GB RAM) | 32                        | 16                 |
| Interconnect    | InfiniBand (QDR)          | EXTOLL (FPGA)      |
| Bandwidth       | 32 Gbit/s                 | 20 Gbit/s          |
| Peak perf.      | 45 TFlop/s                | 500 TFlop/s        |

#### **Decommissioned in Summer 2018**
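The peak-performance row of the prototype table can be roughly cross-checked from the other rows. The sketch below multiplies node count, cores per node, clock frequency and floating-point operations per cycle. The function name `peak_tflops` is hypothetical, and the flops-per-cycle values (8 double-precision flops per cycle per Sandy Bridge core with AVX, 16 per KNC core) are assumptions that do not appear on the slide.

```cpp
#include <cassert>

// Theoretical peak in TFlop/s:
//   nodes x cores per node x clock (GHz) x flops per cycle per core / 1000.
// flops_per_cycle is an assumption about the vector unit, not a slide value.
double peak_tflops(int nodes, int cores_per_node, double ghz, int flops_per_cycle) {
    assert(nodes > 0 && cores_per_node > 0 && ghz > 0.0 && flops_per_cycle > 0);
    const double gflops =
        static_cast<double>(nodes) * cores_per_node * ghz * flops_per_cycle;
    return gflops / 1000.0;
}
```

Under these assumptions, `peak_tflops(128, 16, 2.7, 8)` lands close to the 45 TFlop/s quoted for the Cluster, while `peak_tflops(384, 61, 1.2, 16)` comes out somewhat below the rounded 500 TFlop/s quoted for the Booster.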



### **CLUSTER-BOOSTER in Production**

#### a) JURECA Cluster



#### b) JURECA Booster



|                     | Cluster               | Booster        |
|---------------------|-----------------------|----------------|
| Processor           | Intel Xeon (Haswell)  | Xeon Phi (KNL) |
| Interconnect        | InfiniBand EDR        | OmniPath       |
| Node count          | 1,872                 | 1,640          |
| Peak Perf. (PFlops) | 1.8 (CPU) + 0.4 (GPU) | 5              |

Suarez – 2019





# **MODULAR SUPERCOMPUTING**

**Composability** of heterogeneous resources

- Cost-efficient scaling
- Effective resource-sharing
- Fit application diversity
  - Large-scale simulations
  - Data analytics
  - Machine and Deep Learning
  - Artificial Intelligence




## **DEEP-EST** prototype



#### Prototype co-designed with Software and Applications





Early-Access program in 2020!








## **SOFTWARE ENVIRONMENT**




- **Low-level SW**: Inter-network bridging
- **Scheduler**: Torque/Maui → Slurm
- **Filesystem**: BeeGFS
- **Compilers**: Intel, gcc, PGI
- **Debuggers**: Intel Inspector, TotalView
- **Programming**: ParaStation MPI (MPICH-based), OpenMP, OmpSs
- **Performance analysis tools**: Scalasca, Extrae/Paraver, Intel Advisor, VTune, ...
- **Benchmarking tools**: JUBE
- **Libraries**: SIONlib, SCR, HDF5, ...



# **NETWORK BRIDGING**

- Classical Gateway approach
  - Just one additional hop
- Forwarder daemons translate between module-networks





- Eicker et al., Bridging the DEEP Gap - Implementation of an Efficient Forwarding Protocol, Intel European Exascale Labs - Report 2013, pp. 34-41 (2014)


# **PROGRAMMING ENVIRONMENT**



- One application can run:
  - Using only Cluster nodes
  - Using only Booster nodes
  - Distributed over Cluster and Booster
    - In this case two executables are created
    - Collective offload process
- ParaStation Global MPI
  - Enables distributing code
  - Uses `MPI_Comm_spawn()`
    - Collectively spawns groups of processes from Cluster to Booster (or vice versa)
  - Inter-communicator
    - Connects the two `MPI_COMM_WORLD`s
- One can also start the two parts of a code separately and connect them via `MPI_Comm_connect()`
- Or have one single common `MPI_COMM_WORLD` and split it into sub-communicators via `MPI_Comm_split()`
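The last option, splitting one common world into sub-communicators, can be illustrated without an MPI installation. The sketch below is a plain C++ model of the ranking rule of `MPI_Comm_split()`: processes passing the same color end up in the same sub-communicator, ordered by key, with ties broken by their old rank. The names `SplitResult` and `model_comm_split` are hypothetical, the color assignment (0 for Cluster, 1 for Booster) is an assumption for illustration, and `MPI_UNDEFINED` colors are not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Where one process ends up after the split.
struct SplitResult {
    int color;     // sub-communicator the process joins
    int new_rank;  // its rank inside that sub-communicator
};

// Plain C++ model (no MPI calls) of the MPI_Comm_split ranking rule.
// Index i of colors/keys is the old rank of process i.
std::vector<SplitResult> model_comm_split(const std::vector<int>& colors,
                                          const std::vector<int>& keys) {
    assert(colors.size() == keys.size());
    const std::size_t n = colors.size();

    // Order old ranks by key; stable sort keeps old-rank order on equal keys.
    std::vector<std::size_t> order(n);
    for (std::size_t r = 0; r < n; ++r) order[r] = r;
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

    // Hand out consecutive new ranks separately for each color.
    std::vector<SplitResult> result(n);
    std::map<int, int> next_rank;  // next free rank per color, starts at 0
    for (std::size_t old_rank : order) {
        const int color = colors[old_rank];
        result[old_rank] = {color, next_rank[color]++};
    }
    return result;
}
```

For example, with two Cluster processes (color 0) followed by three Booster processes (color 1) and equal keys, the Booster processes receive new ranks 0, 1 and 2 inside their own sub-communicator.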





- Clauss et al., Dynamic Process Management with Allocation-internal Co-Scheduling towards Interactive Supercomputing, COSH@HiPEAC (2016)

# **COMPILE AND RUN**

#### Compilation

- Creates two executables
  - One for `__CLUSTER__` code
  - One for `__BOOSTER__` code

#### Batch system

- Reserves required resources

#### Execution

- Script starts Booster code
- This code calls `MPI_Comm_spawn()` with the name of the Cluster executable

#### Runtime + Scheduler + FS

- Detect ParaStation MPI calls
- Distribute child binaries



Allocation and launch commands shown on the slide:

`salloc --partition=cluster -N 4 : --partition=booster -N 12`

`srun --pack-group=0 -N 4 -n 8 ./hi_booster`

Spawn call shown inside `int main(int argc, char *argv[])`:

`MPI_Comm_spawn("./xPic.Cluster", &argv[1], nproc, MPI_INFO_NULL, 0, GRID_COMM_WORLD, INTERCOMM, MPI_ERRCODES_IGNORE);`

Figure labels: Compiler, Booster Executable, Cluster Executable, Booster MPI, Cluster MPI, ParaStation Global MPI, DEEP Runtime.



# **IMPROVED WORKFLOW SUPPORT**

- Simple workflows can be realized using dependent jobs
  - Costly data buffering to secondary memory
- Goal: overlapping job execution
  - Currently not supported by Slurm
    - The whole job pack is either accepted or rejected
    - All jobs are allocated and run in parallel
    - All jobs wait for allocation if any one of them cannot be allocated at the moment
- New parameter `--delay` introduced in the `sbatch` command for job packs
  - Amount of time the next job should wait after the start of the first job in a job pack
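The effect of such a delay can be sketched with a small scheduling model. This is not Slurm code: the function names are hypothetical, and the model assumes that every delay is measured from the start of the first job in the pack and that a delayed job's nodes would otherwise sit idle for exactly that long.

```cpp
#include <cassert>
#include <vector>

// Start times (seconds) of all jobs in a pack.
// Assumption: delays_after_first[k] is the delay of job k+1, measured from
// the start of the first job. A delay of 0 reproduces "all jobs run in parallel".
std::vector<long> pack_start_times(long first_start,
                                   const std::vector<long>& delays_after_first) {
    std::vector<long> starts{first_start};
    for (long delay : delays_after_first) {
        assert(delay >= 0);
        starts.push_back(first_start + delay);
    }
    return starts;
}

// Node-seconds not held idle when a job on `nodes` nodes starts `delay`
// seconds later instead of being allocated together with the first job.
long idle_node_seconds_avoided(long nodes, long delay) {
    assert(nodes >= 0 && delay >= 0);
    return nodes * delay;
}
```

With a first job starting at t = 100 s and a second job delayed by 600 s, the second job starts at t = 700 s; if it uses 12 nodes, 7200 node-seconds of idle allocation are avoided under this model.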












# **Application-driven HW+SW developments**





### **Architecture Use-Modes**





Cluster-Booster use modes (figure labels): Code partition, Workflow, I/O forward

- Kreuzer et al., Application Performance on a Cluster-Booster System. IPDPSW HCW (2018) [10.1109/IPDPSW.2018.00019]
- Kreuzer et al., The DEEP-ER project: I/O and resiliency extensions for the Cluster-Booster architecture. HPCC'18 proceedings (2018) [10.1109/HPCC/SmartCity/DSS.2018.00046]
- Wolf et al., PIC algorithms on DEEP: The iPiC3D case study. PARS-Mitteilungen 32, 38-48 (2015)
- Christou et al., EMAC on DEEP. Geoscientific Model Development (2016) [10.5194/gmd-9-3483-2016]
- Kumbhar et al., Leveraging a Cluster-Booster Architecture for Brain-Scale Simulations. Lecture Notes in Computer Science 9697 (2016) [10.1007/978-3-319-41321-1\_19]
- Leger et al., Adapting a Finite-Element Type Solver for Bioelectromagnetics to the DEEP-ER Platform. ParCo 2015, Advances in Parallel Computing, 27 (2016) [10.3233/978-1-61499-621-7-349]



# **Application use case: xPic**

- Space weather simulation
  - Simulates plasma produced in solar eruptions and its interaction with the Earth's magnetosphere
  - Particle-in-Cell (PIC) code
  - Authors: KU Leuven
- Two solvers:
  - Field solver: computes electromagnetic (EM) field evolution
    - Limited code scalability
    - Frequent, global communication
  - Particle solver: calculates motion of charged particles in EM fields
    - Highly parallel
    - Billions of particles
    - Long-range communication




# **xPic – ORIGINAL CONFIGURATION**







# **xPic – CODE PARTITION**



```
#ifdef __CLUSTER__
// Cluster side: field solver
for (auto i = beg + 1; i <= end; i++) {
    fld.solver->calculateE();
    fld.cpyToArr_F();
    ClusterToBooster();        // start exchange Cluster -> Booster
    // Auxiliary computations
    ClusterWait();             // wait for that exchange to complete

    BoosterToCluster();        // start exchange Booster -> Cluster
    BoosterWait();             // wait for that exchange to complete
    fld.solver->calculateB();
    fld.cpyFromArr_M();
}
#endif
```

```
#ifdef __BOOSTER__
// Booster side: particle solver
for (auto i = beg + 1; i <= end; i++) {
    ClusterToBooster();        // start exchange Cluster -> Booster
    ClusterWait();             // wait for that exchange to complete
    pcl.cpyFromArr_F();
    for (auto is = 0; is < nspec; is++) {
        pcl.species[is].ParticlesMove();
        pcl.species[is].ParticleMoments();
    }
    pcl.cpyToArr_M();
    BoosterToCluster();        // start exchange Booster -> Cluster
    // I/O and auxiliary computations
    BoosterWait();             // wait for that exchange to complete
}
#endif
```

# **xPic –** (1-NODE) PERFORMANCE RESULTS

Figure: bar chart (vertical axis 0 to 45) comparing Cluster, Booster and C+B for Fields, Particles and Total.

| #cells per node     | 4096                                               |
|---------------------|----------------------------------------------------|
| #particles per cell | 2048                                               |
| Compilation flags   | -openmp, -mavx (Cluster)<br>-xMIC-AVX512 (Booster) |

- **Field solver**: 6× faster on Cluster
- **Particle solver**: 1.35× faster on Booster
- Overall performance gain:
  - 1 node: 28% gain compared to Cluster alone, 21% gain compared to Booster alone
  - 8 nodes: 38% gain compared to Cluster only, 34% gain compared to Booster only
- 3%-4% overhead per solver for C+B communication (point to point)

**A. Kreuzer et al.**, "*Application Performance on a Cluster-Booster System*", 2018 IEEE IPDPS Workshops (IPDPSW), Vancouver, Canada, pp. 69-78 (2018) [10.1109/IPDPSW.2018.00019]



# xPic – STRONG SCALING on JURECA



Figure: variable-ratio modular strong scaling (4 Cluster nodes); horizontal axis: number of Booster nodes.

| #cells per node         | 36864                                             |
|-------------------------|---------------------------------------------------|
| #particles per cell     | 1024                                              |
| #blocks per MPI process | 12, 32 or 64                                      |
| Compilation flags       | -mavx (Cluster)<br>-openmp, -xMIC-AVX512 (Booster) |

- Code portions can be scaled up independently
  - Particles scale almost linearly on the Booster
  - Fields are kept constant on the Cluster (4 CNs)
- A configuration is reached where the same time is spent on Cluster and Booster
  - An additional 2× time saving is enabled via overlapping
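The 2× figure follows from a simple timing argument: when both parts take the same time per step, running them one after the other costs twice as long as running them fully overlapped. The toy model below illustrates this. All function names and any numbers fed into it are hypothetical; it assumes ideal linear particle scaling on the Booster and ignores communication cost, so it is a sketch of the reasoning rather than a model of the measured xPic results.

```cpp
#include <algorithm>
#include <cassert>

// Time per step when the Cluster part and the Booster part run one after the other.
double sequential_time(double t_cluster, double t_booster) {
    return t_cluster + t_booster;
}

// Idealised time per step when both parts overlap completely
// (communication cost ignored).
double overlapped_time(double t_cluster, double t_booster) {
    return std::max(t_cluster, t_booster);
}

// Smallest Booster node count for which the particle part is no slower than
// the field part, assuming ideal scaling:
//   t_booster(n) = t_particles_one_node / n.
int balanced_booster_nodes(double t_fields_cluster, double t_particles_one_node) {
    assert(t_fields_cluster > 0.0 && t_particles_one_node > 0.0);
    int n = 1;
    while (t_particles_one_node / n > t_fields_cluster) ++n;
    return n;
}
```

For instance, if the field part takes 10 time units on the fixed Cluster nodes and the particle part takes 80 units on one Booster node, the model balances at 8 Booster nodes, where `sequential_time(10, 10) / overlapped_time(10, 10)` equals 2.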

J. De Amicis\*, **E. Suarez**\*, J. Amaya, N. Eicker, G. Lapenta, Th. Lippert, *"Assessing the scalability of the xPic code on a large-scale modular supercomputer"*, in preparation




# xPic workflow – mapping on DEEP-EST





DLMOS → xPic ↔ GMM








## CONCLUSIONS

- The Modular Supercomputing Architecture (MSA)
  - Orchestrates heterogeneity at system level
  - Allows scaling hardware in an economical way (Booster → Exascale)
  - Serves very diverse application profiles
    - Maximum flexibility for users, without taking anything away (individual modules can still be used)
- Distributing applications on the MSA gives each code part suitable hardware
  - Straightforward implementation for workflows
  - Partition at MPI level is interesting for multi-physics / multi-scale codes
  - Monolithic codes do not need to be divided
- Current / upcoming implementations of MSA
  - DEEP prototypes, JURECA, JUWELS (in 2020)
  - MELUXINA (Luxembourg EuroHPC Petascale system)
  - Tianhe-3 (heterogeneous flexible architecture)
    - https://www.r-ccs.riken.jp/R-CCS-Symposium/2019/slides/Wang.pdf



# **NEXT STEPS**

#### Hardware deployments

- DEEP-EST Booster (January 2020)
- JUWELS Booster (mid 2020)
  - Later integrating JUNIQ (quantum annealer)
- And if everything goes well, then... Exascale!

#### Software development

- Develop tools to map applications to hardware
- Improve scheduling of heterogeneous jobs/workflows
- Facilitate exploitation of new memory technologies
- Modularize more codes


### **THANK YOU!**



The DEEP projects have received funding from the European Union's Seventh Framework Programme (FP7) for research, technological development and demonstration and the Horizon 2020 (H2020) funding framework under grant agreements no. FP7-ICT-287530 (DEEP), FP7-ICT-610476 (DEEP-ER) and H2020-FETHPC-754304 (DEEP-EST).



