Excerpts from LA-UR-08-6246 & LA-UR-08-2778 & LA-UR-07-7405



HPC User Forum April 20, 2009

#### Ken Koch

Roadrunner Technical Manager, Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory

Work presented was performed by a large team of Roadrunner project staff!





#### The messages this talk will convey are:

- Why Roadrunner? Why Cell?
  - A bold but important step toward the future
- What does Roadrunner look like?
  - Cluster-of-clusters with node-attached Cell blades
- Concepts for Programming Roadrunner
  - MPI, Opteron+Cell, "local-store" memory & DMA transfers
- Status and plans for Roadrunner
  - Timeline
  - Applications to date





### A Roadrunner is born



### using Cell processors as accelerators





#### **Microprocessor trends have changed**

- Moore's law still holds, but is now being realized differently
  - More cores per chip and not all cores need be the same
  - Decreased memory bandwidth and capacity per core
  - Key findings of Jan. 2007 IDC Study: "Next Phase in HPC"
    - new ways of dealing with parallelism will be required
    - must focus more heavily on bandwidth (flow of data) and less on processor



From Burton Smith, LASCI-06 keynote, with permission





# The Cell processor is an (8+1)-way heterogeneous parallel processor





#### **IBM created PowerXCell 8i**





#### Los Alamos has a history in hybrid & petascale




Operated by the Los Alamos National Security, LLC for the DOE/NNSA


#### **Roadrunner was delivered to LANL in Summer 2008**



# IBM built hybrid nodes in Rochester, MN and assembled the system in Poughkeepsie, NY





#### **Fully Assembled Roadrunner**






## Roadrunner broke the 1 Petaflop/s mark on May 26<sup>th</sup>, 2008



Only 4 days after the full machine was finally assembled!



#### **Roadrunner is a TOP performer!**

| Rank | Site                                       | System                                         | Cores   | Linpack (TF/s) | Power (MW) |
|---|----------------------------------------------|------------------------------------------------|---------|--------|------|
| 1 | DOE/NNSA/LANL<br>United States               | Roadrunner, QS22/LS21/IB<br>PowerXCell 8i, IBM | 129600* | 1105   | 2.48 |
| 2 | DOE/ORNL<br>United States                    | Jaguar, XT5,<br>Opteron-QC, Cray               | 150152  | 1059   | 6.95 |
| 3 | NASA Ames Research Center<br>United States   | Pleiades, Altix ICE & IB,<br>Xeon-QC SGI       | 51200   | 487    | 2.09 |
| 4 | DOE/NNSA/LLNL<br>United States               | BGL, Blue Gene/L,<br>PowerPC, IBM              | 212992  | 478    | 2.33 |
| 5 | Argonne National Laboratory<br>United States | Intrepid, Blue Gene/P,<br>PowerPC, IBM         | 163840  | 450    | 1.26 |
| 6 | Texas Adv. Comp. Center<br>United States     | Ranger, SunBlade & IB<br>Opteron-QC, Sun       | 62976   | 433    | 2.00 |

#### #1 on the TOP500 (Nov. 2008)

\* Roadrunner's core count includes the 8 SPEs in each Cell; counting only the Opteron and Cell PPE cores gives 25920
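The footnote and table can be cross-checked with a little arithmetic. The sketch below uses only numbers from the table and footnote above; the function names are illustrative, not from the talk.

```c
/* Figures copied from the TOP500 table and its footnote. */
#define TOTAL_CORES            129600
#define OPTERON_PLUS_PPE_CORES 25920
#define SPES_PER_CELL          8

/* SPE cores are what remains after removing Opteron and PPE cores. */
int spe_cores(void) { return TOTAL_CORES - OPTERON_PLUS_PPE_CORES; }

/* Each Cell chip contributes 8 SPEs, so this is the implied chip count. */
int cell_chips(void) { return spe_cores() / SPES_PER_CELL; }

/* Linpack efficiency in MF/s per watt.
 * TF/s divided by MW is (1e12 flop/s) / (1e6 W) = 1e6 flop/s per W,
 * so the ratio of the two table columns is already in MF/W. */
double mflops_per_watt(double linpack_tflops, double power_mw)
{
    return linpack_tflops / power_mw;
}
```

With the table values, `mflops_per_watt(1105.0, 2.48)` comes to roughly 446 MF/W for Roadrunner, against roughly 152 MF/W for Jaguar (`1059.0, 6.95`).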





### Roadrunner System Configuration

# See the LANL Roadrunner web site listed at the end for more details





## Roadrunner Phase 3 is Cell-accelerated, not a cluster of Cells







### A Roadrunner TriBlade node integrates Cell and Opteron blades








- QS22 is an IBM Cell blade containing two new enhanced double-precision (eDP/PowerXCell<sup>TM</sup>) Cell chips
- Expansion blade connects two QS22 via four PCI-e x8 links to LS21 & provides the node's Mellanox ConnectX IB 4X DDR cluster attachment
- LS21 is an IBM dual-socket Opteron blade
- 4-wide IBM BladeCenter packaging
- Roadrunner TriBlades are completely diskless and run from RAM disks, with NFS & Panasas connected only to the LS21
- Node design points:
  - One Cell chip per Opteron core
  - ~400 GF/s double-precision & ~800 GF/s single-precision
  - 16 GB Opteron memory PLUS 16 GB Cell memory








#### A Connected Unit (CU) forms a building block



#### A Connected Unit (CU) is a powerful cluster

**Connected Unit Specifications:**

- Compute: 180 TriBlade compute nodes (1 LS21 + 2 QS22)
  - 360 1.8 GHz dual-core Opterons
  - 720 PowerXCell chips
  - 2.59 TF DP peak Opteron
  - 73.7 TF DP peak Cell
  - 2.88 TB Opteron memory
  - 2.88 TB Cell memory
  - 18.4 TB/s Cell memory BW
- Interconnect: Voltaire 288-port IB 4x DDR switch
  - 192 cluster nodes (180 compute + 12 I/O)
  - 192 IB 4X DDR cluster links
  - 768 GB/s aggregate BW (bi-dir)
  - 384 GB/s bi-section BW (bi-dir)
  - 96 2<sup>nd</sup>-stage links
- I/O: 12 IBM x3655 I/O nodes (dual-socket dual-core, dual 10 GigE each)
  - 24 2.6 GHz dual-core Opterons
  - 24 10 GigE I/O links on 12 I/O nodes, to Panasas filesystem
  - 24 GB/s aggregate I/O BW (uni-dir) in I/O nodes (IB limited)
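Most of the Connected Unit figures follow from the per-node design. The sketch below shows that arithmetic; the per-core Opteron flop rate, the per-chip Cell rate (SPEs only), and the per-link InfiniBand data rate are assumptions chosen to match the quoted totals, not values given in the talk.

```c
/* From the slides. */
#define TRIBLADES_PER_CU         180
#define OPTERON_SOCKETS_PER_NODE 2     /* LS21 is dual-socket */
#define OPTERON_CORES_PER_SOCKET 2     /* dual-core */
#define OPTERON_CLOCK_GHZ        1.8
#define CELLS_PER_NODE           4     /* two QS22 blades, two chips each */
#define NODE_MEMORY_GB           16.0  /* per processor type */
#define IB_LINKS_PER_CU          192

/* Assumptions (not stated in the talk). */
#define OPTERON_DP_FLOPS_PER_CLK 2.0   /* per core */
#define CELL_SPE_DP_GFLOPS       102.4 /* 8 SPEs per chip, PPE excluded */
#define IB_LINK_GBS_PER_DIR      2.0   /* IB 4X DDR data rate, one direction */

int cu_opteron_sockets(void) { return TRIBLADES_PER_CU * OPTERON_SOCKETS_PER_NODE; }
int cu_cell_chips(void)      { return TRIBLADES_PER_CU * CELLS_PER_NODE; }

double cu_opteron_peak_tflops(void)
{
    return cu_opteron_sockets() * OPTERON_CORES_PER_SOCKET
         * OPTERON_CLOCK_GHZ * OPTERON_DP_FLOPS_PER_CLK / 1000.0;
}

double cu_cell_peak_tflops(void) { return cu_cell_chips() * CELL_SPE_DP_GFLOPS / 1000.0; }

/* Memory per processor type (Opteron or Cell), in TB. */
double cu_memory_tb(void) { return TRIBLADES_PER_CU * NODE_MEMORY_GB / 1000.0; }

/* Every link counted in both directions. */
double cu_ib_aggregate_gbs(void) { return IB_LINKS_PER_CU * IB_LINK_GBS_PER_DIR * 2.0; }

/* Full bisection: half of the links cross any even split of the nodes. */
double cu_ib_bisection_gbs(void) { return cu_ib_aggregate_gbs() / 2.0; }
```

These evaluate to 360 Opterons, 720 Cell chips, about 2.59 TF and 73.7 TF of peak, 2.88 TB of memory per processor type, and 768 GB/s and 384 GB/s of InfiniBand bandwidth.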


#### Now build a cluster-of-clusters...



Extra 2<sup>nd</sup>-stage switch ports allow expansion up to 24 CUs





#### **Roadrunner System Networks**




#### **Roadrunner is a petascale system in 2008**



#### **Roadrunner at a glance**

- Cluster of 17 Connected Units (CU)
  - 12,240 IBM PowerXCell 8i chips
  - 1.33 Petaflop/s DP peak (Cell)
  - 1.026 PF sustained Linpack (DP)
  - 6,120 (+408) AMD dual-core Opterons
  - 44.1 (+4.4) Teraflop/s peak (Opteron)
- InfiniBand 4x DDR fabric
  - 3264 total nodes; 2-stage fat-tree; all-optical cables
  - Full bi-section BW within each CU
    - 384 GB/s (bi-directional)
  - Half bi-section BW among CUs
    - 3.26 TB/s (bi-directional)
- ~100 TB aggregate memory
  - 49 TB Opteron (compute nodes)
  - 49 TB Cell
- ~200 GB/s sustained file system I/O:
  - 204x2 10GE Ethernets to Panasas

- Fedora Linux
  - On LS21 & QS22 blades & I/O & service nodes
- SDK for Multicore Acceleration
  - Cell compilers, libraries, tools
- xCAT Cluster Management
  - System-wide GigE network
- 2.35 MW Power:
  - 0.437 GF/Watt
- Area:
  - 280 racks
  - 5200 ft<sup>2</sup>
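The system totals on this slide are the Connected Unit figures scaled by 17, plus one efficiency ratio. A short sketch of that scaling, using only numbers quoted in the slides (the function names are illustrative):

```c
/* Per-CU and system figures copied from the slides. */
#define NUM_CUS            17
#define CELLS_PER_CU       720
#define OPTERONS_PER_CU    360
#define IO_OPTERONS_PER_CU 24
#define NODES_PER_CU       192
#define CU_BISECTION_GBS   384.0  /* full bi-section within a CU, bi-dir */
#define CU_MEMORY_TB       2.88   /* Opteron memory; Cell memory is the same */
#define LINPACK_PFLOPS     1.026
#define POWER_MW           2.35

int system_cells(void)       { return NUM_CUS * CELLS_PER_CU; }
int system_opterons(void)    { return NUM_CUS * OPTERONS_PER_CU; }
int system_io_opterons(void) { return NUM_CUS * IO_OPTERONS_PER_CU; }
int system_nodes(void)       { return NUM_CUS * NODES_PER_CU; }

/* "Half bi-section among CUs": each CU contributes half of its own
 * bisection bandwidth to the second-stage fabric. Result in TB/s. */
double inter_cu_bisection_tbs(void)
{
    return NUM_CUS * (CU_BISECTION_GBS / 2.0) / 1000.0;
}

/* Memory per processor type across the machine, in TB. */
double system_memory_tb(void) { return NUM_CUS * CU_MEMORY_TB; }

/* PF/s divided by MW is (1e15 flop/s) / (1e6 W) = 1e9 flop/s per W = GF/W. */
double gflops_per_watt(void) { return LINPACK_PFLOPS / POWER_MW; }
```

This reproduces 12,240 Cell chips, 6,120 (+408) Opterons, 3264 nodes, about 3.26 TB/s between CUs, about 49 TB of memory per processor type, and about 0.437 GF/Watt.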





## **Programming Concepts**





#### **Programming Approaches for Roadrunner**



Host Centric view



#### Three types of processors work together





#### **Roadrunner nodes have a memory hierarchy**



#### How do you keep the 256 KB SPEs busy?

- Break the work into a stream of pieces
  - The problem domain of a Cell processor is split into grid tiles or particle bundles (can include ghost zones)
- Data chunks stream in & out of the 8 SPEs using asynch DMAs and triple-buffering
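Triple-buffering puts a hard limit on how large each streamed piece can be, since three buffers plus the SPE program must fit in the 256 KB local store. The sketch below estimates that limit. The 128-byte alignment and the idea of a fixed reservation for code and stack are assumptions for illustration, and a real kernel would also have to split transfers that exceed the hardware's per-command DMA size limit.

```c
#include <stddef.h>

#define LOCAL_STORE_BYTES (256u * 1024u) /* from the slide */
#define NUM_BUFFERS       3u             /* triple-buffering */
#define DMA_ALIGN         128u           /* assumed buffer alignment */

/* Largest aligned chunk that fits three times into the local store once
 * reserved_bytes (code, stack, other data; hypothetical) are set aside. */
size_t max_chunk_bytes(size_t reserved_bytes)
{
    if (reserved_bytes >= LOCAL_STORE_BYTES)
        return 0;
    size_t per_buffer = (LOCAL_STORE_BYTES - reserved_bytes) / NUM_BUFFERS;
    return per_buffer - (per_buffer % DMA_ALIGN);
}

/* Number of pieces needed to stream total_bytes through one SPE. */
size_t num_chunks(size_t total_bytes, size_t chunk_bytes)
{
    if (chunk_bytes == 0)
        return 0;
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

For example, reserving 64 KB for code and stack leaves 64 KB per buffer, so a 1 MB tile would stream through an SPE in 16 pieces.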




### Put it all together: MPI+DaCS+DMA+SIMD



- DMAs are simply block memory transfers
  - HW asynchronous (no SPE stalls)
  - DDR2 memory latency and BW performance

DMA Get: `mfc_get(LS_addr, Mem_addr, size, tag, 0, 0);`

DMA Put: `mfc_put(LS_addr, Mem_addr, size, tag, 0, 0);`

DMA Wait: `mfc_write_tag_mask(1 << tag); mfc_read_tag_status_all();`
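A minimal sketch of how these three calls can combine into a triple-buffered streaming loop on one SPE: while chunk `i` is being computed, chunk `i+1` is arriving and chunk `i-1` is draining back to main memory. The kernel (`stream_scale`), the chunk size, and the one-tag-per-buffer scheme are illustrative choices, not the talk's code. In the SDK's `spu_mfcio.h`, both `mfc_get` and `mfc_put` take the local-store address first. Off the SPU, the block substitutes synchronous `memcpy` stand-ins so the control flow can be exercised on any host.

```c
#include <stdint.h>
#include <string.h>

#ifdef __SPU__
#include <spu_mfcio.h>
#else
/* Hypothetical host-side stand-ins: same argument order as the SDK calls,
 * but the copy completes immediately, so waits are no-ops. */
static void mfc_get(volatile void *ls, uint64_t ea, uint32_t size,
                    uint32_t tag, uint32_t tid, uint32_t rid)
{
    (void)tag; (void)tid; (void)rid;
    memcpy((void *)ls, (const void *)(uintptr_t)ea, size);
}
static void mfc_put(volatile void *ls, uint64_t ea, uint32_t size,
                    uint32_t tag, uint32_t tid, uint32_t rid)
{
    (void)tag; (void)tid; (void)rid;
    memcpy((void *)(uintptr_t)ea, (const void *)ls, size);
}
static void mfc_write_tag_mask(uint32_t mask) { (void)mask; }
static uint32_t mfc_read_tag_status_all(void) { return 0; }
#endif

#define NBUF         3     /* triple-buffering */
#define CHUNK_FLOATS 1024  /* 4 KB per buffer (hypothetical size) */

/* GCC-style alignment attribute; one local-store buffer per DMA tag. */
static float buf[NBUF][CHUNK_FLOATS] __attribute__((aligned(128)));

/* Block until every DMA issued with this tag has completed. */
static void wait_tag(uint32_t tag)
{
    mfc_write_tag_mask(1u << tag);
    mfc_read_tag_status_all();
}

/* Scale nchunks * CHUNK_FLOATS floats at effective address ea, in place. */
void stream_scale(uint64_t ea, uint32_t nchunks, float factor)
{
    const uint32_t bytes = CHUNK_FLOATS * sizeof(float);
    if (nchunks == 0)
        return;
    mfc_get(buf[0], ea, bytes, 0, 0, 0);  /* prefetch the first chunk */
    for (uint32_t i = 0; i < nchunks; i++) {
        uint32_t cur = i % NBUF, nxt = (i + 1) % NBUF;
        if (i + 1 < nchunks) {
            wait_tag(nxt);  /* buffer nxt last held chunk i-2; its put must be done */
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * bytes, bytes, nxt, 0, 0);
        }
        wait_tag(cur);      /* chunk i has arrived */
        for (uint32_t j = 0; j < CHUNK_FLOATS; j++)
            buf[cur][j] *= factor;
        mfc_put(buf[cur], ea + (uint64_t)i * bytes, bytes, cur, 0, 0);
    }
    mfc_write_tag_mask((1u << NBUF) - 1u); /* drain all outstanding puts */
    mfc_read_tag_status_all();
}
```

Using the buffer index as the DMA tag means a single wait covers both the outgoing put and the incoming get for that buffer, which keeps the loop short.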



#### **IBM-ALF** is a simple work-queue approach for abstracting parallelism directly to SPEs



# The programming approach has now been demonstrated and is tractable

- Two levels of parallelism:
  - node-to-node: MPI & DaCS-MPI-DaCS relay
  - within-Cell: threads, pipelined DMAs, & SIMD
- Large-grain computationally intense portions of code are split off for Cell acceleration within a node process
  - Usually an entire tree of subroutines
  - This is equivalent to "function offload" of entire large algorithms
- Threaded fine-grained parallelism introduced within the Cell itself
  - Create many-way parallel pipelined work units for the 8 SPEs
  - Good for both multicore/manycore chips and heterogeneous chip trends with dwindling memory bandwidth
- Communications during Cell computation are possible between Cells via DaCS-MPI-DaCS relay approach
- Considerable flexibility and opportunities exist beyond this approach
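One simple way to create work for the 8 SPEs is a static block partition of the offloaded work units. The talk does not say how units are assigned, so the sketch below is only one possibility (a dynamic work queue is another), and `spe_range` and `spe_work_range` are hypothetical names.

```c
#define NUM_SPES 8

/* Half-open range [first, first + count) of work units for one SPE. */
typedef struct {
    unsigned first;
    unsigned count;
} spe_range;

/* Split n_units as evenly as possible: the first (n_units % 8) SPEs
 * each take one extra unit, and the ranges are contiguous. */
spe_range spe_work_range(unsigned n_units, unsigned spe_id)
{
    unsigned base  = n_units / NUM_SPES;
    unsigned extra = n_units % NUM_SPES;
    spe_range r;
    r.count = base + (spe_id < extra ? 1u : 0u);
    r.first = spe_id * base + (spe_id < extra ? spe_id : extra);
    return r;
}
```

Each SPE would then stream its own range through local store independently, so no coordination is needed until the results are gathered.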





### Five Waves of Roadrunner Application Codes





#### **Recent Roadrunner History**



#### **Five Waves of Roadrunner Application Codes**

1. Assessment Codes (Oct. 2006 to Oct. 2007)
   - Proof of Cell & hybrid programming capability: 4 codes
   - Prototype hardware: old Cell/QS20 blades & very first eDP Cell
2. Full-System Pre-Acceptance Testing (June to Nov. 2008)
   - Gordon Bell finalists: VPIC & SPaSM
   - PetaVision (sustained 1+ single-precision PF!)
   - PPM (Paul Woodward)
3. Roadrunner Open Science (Oct. 2008 to July 2009)
   - 8 codes & 10 projects
4. Institutional Computing (starting May 2009 on Cerrillos, 2 CUs)
   - 19 new projects to start
5. Classified ASC Use (starting Oct. 2009)







#### **Roadrunner Open Science and Institutional Computing**





# Exciting opportunities among the 10 selected proposals for Roadrunner Open Science

| Proposal | Code |
|----------|------|
| Kinetic Thermonuclear Burn Studies with VPIC on Roadrunner | VPIC |
| Multibillion-Atom Molecular Dynamics Simulations of Ejecta Production and Transport using Roadrunner | SPaSM |
| New frontiers in viral phylogenetics | ML |
| Three-Dimensional Dynamics of Magnetic Reconnection in Space and Laboratory Plasmas | VPIC |
| The Roadrunner Universe | MC<sup>3</sup> |
| Implicit Monte Carlo Calculations of Supernova Light-Curves | IMC + Rage |
| Instabilities-Driven Reacting Compressible Turbulence | CFDNS |
| Cellulosomes in Action: Peta-Scale Atomistic Bioenergy Simulations | GROMACS |
| Parallel-replica dynamics study of tip-surface and tip-tip interactions in atomic force microscopy and the formation and mechanical properties of metallic nanowires | PAR-REP + CellMD |
| Saturation of Backward Stimulated Scattering of Laser In The Collisional Regime | VPIC |










### http://www.lanl.gov/roadrunner/

- Roadrunner architecture
- Early applications efforts
- Upcoming Open Science efforts
- Cell & hybrid programming
- Computing trends
- Related Internet links



