Fast Matrix Multiplication on Cell (SMP) Systems
This site describes a fast matrix multiplication code for Cell BE processors. It has been developed as part of a seminar paper at the Center for Information Services and High Performance Computing. The program is freely available under the GNU GPL.
Please use this address to reference this page: http://www.tu-dresden.de/zih/cell/matmul
[1. Introduction]
[2. Program Code]
[3. Program Usage]
[4. Benchmark Results]
[5. Comparison with IBM SDK Reference Code]
[6. Download and License]
[7. Contact]
[8. Acknowledgments]
1. Introduction
This code demonstrates how near peak performance can be achieved on Cell BE processors. It is intended to work 'out of the box' without any special compiler settings, and it does not need a special 'huge pages' setup either. Furthermore, the example is designed to scale almost perfectly up to 16 SPEs on an SMP system with two Cell BE processors (like the IBM BladeCenter QS20).
This work has been done as part of the seminar paper "Einsatz und Leistungsanalyse der Cell Broadband Engine" (Application and Performance Analysis of the Cell Broadband Engine). It has been published by the Center for Information Services and High Performance Computing and can be requested there (ZIH-R-0701, German language only).
2. Program Code
The program calculates C = C + A * B where A, B, and C are square matrices with their dimension being a multiple of 128. It uses a common block partitioning algorithm with a block size of 64x64 elements. Either block data layout or row major layout can be used to store A, B and C in main memory.
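The following simplified, scalar C sketch illustrates the block partitioning scheme for row major storage. It is only an outline of how C = C + A * B is decomposed into 64x64 blocks, not the optimized SPU code from the archive:

    #include <stddef.h>

    #define BS 64  /* block size used by the SPU kernel */

    /* Simplified scalar sketch: C = C + A * B with square matrices of
     * dimension n (a multiple of BS), all stored in row major layout.
     * The real code ships the 64x64 blocks to the SPEs via DMA and
     * multiplies them with the assembly kernel. */
    void matmul_blocked(float *C, const float *A, const float *B, size_t n)
    {
        for (size_t bi = 0; bi < n; bi += BS)
            for (size_t bj = 0; bj < n; bj += BS)
                for (size_t bk = 0; bk < n; bk += BS)
                    /* multiply one 64x64 block pair, accumulate into C */
                    for (size_t i = bi; i < bi + BS; i++)
                        for (size_t k = bk; k < bk + BS; k++) {
                            float a = A[i * n + k];
                            for (size_t j = bj; j < bj + BS; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }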
The code archive consists of 5 files:
- matmul.h includes several definitions like the data layout to be used.
- matmul_ppu.c is the PPU code that allocates and randomly initializes the matrices, starts the SPE threads, verifies the results, and does a global performance measurement.
- matmul_spu.c includes fast code to generate DMA lists and a loop that fetches the matrix blocks from main memory, starts the matrix multiplication on the blocks in the local store, and writes the results back to main memory. It uses double buffering for the blocks of matrices A and B and quadruple buffering for the blocks of matrix C (a simplified double buffering sketch follows after this list). The spu_decrementer is used to do a local performance measurement on each SPU.
- matmul_spu_simd.asm contains a heavily optimized function that performs a 64x64 matrix multiplication within 65909 cycles. This equals 99.43% of the theoretical peak: the 64*64*64 multiply-adds need at least 64^3 / 4 = 65536 cycles, since an SPE can issue one 4-way single precision fused multiply-add per cycle. The function is completely written in assembly and therefore quite hard to read.
- COMPILE.sh is a shell script that compiles the code. It should work with SDK 1 as well as with SDK 2.
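The double buffering mentioned for matmul_spu.c can be outlined as follows. This is only a minimal sketch under simplified assumptions: the buffer and helper names (matmul_64x64, next_ea_A, next_ea_B) are hypothetical, and the C blocks, DMA lists and quadruple buffering of the real code are omitted:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define BS 64
    #define BLOCK_BYTES (BS * BS * sizeof(float))

    /* Two local store buffers per input matrix: while the kernel works on
     * buffer 'cur', the MFC already fetches the next block into the other
     * buffer. */
    static float bufA[2][BS * BS] __attribute__((aligned(128)));
    static float bufB[2][BS * BS] __attribute__((aligned(128)));

    extern void matmul_64x64(float *c, const float *a, const float *b); /* hypothetical name for the assembly kernel */
    extern uint64_t next_ea_A(int step), next_ea_B(int step);           /* hypothetical effective address helpers   */

    void process_blocks(float *cblock, int nblocks)
    {
        int cur = 0;

        /* prefetch the first pair of blocks */
        mfc_get(bufA[cur], next_ea_A(0), BLOCK_BYTES, cur, 0, 0);
        mfc_get(bufB[cur], next_ea_B(0), BLOCK_BYTES, cur, 0, 0);

        for (int step = 0; step < nblocks; step++) {
            int nxt = cur ^ 1;

            if (step + 1 < nblocks) {
                /* start fetching the next blocks while computing on the current ones */
                mfc_get(bufA[nxt], next_ea_A(step + 1), BLOCK_BYTES, nxt, 0, 0);
                mfc_get(bufB[nxt], next_ea_B(step + 1), BLOCK_BYTES, nxt, 0, 0);
            }

            /* wait for the current blocks to arrive, then multiply */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();
            matmul_64x64(cblock, bufA[cur], bufB[cur]);

            cur = nxt;
        }
    }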
3. Program Usage
- By default, the matrices are stored in block data layout to remove any TLB impact. Comment out __USE_BDL__ in matmul.h if you want to use row major layout; this works fine with a page size of 64K, whereas with a page size of 4K the impact of TLB misses might limit the performance. A small sketch contrasting the two addressing schemes follows after this list.
- Use the COMPILE.sh script in order to compile the code. If this doesn't work you will most likely have to modify $CELL_BIN according to your SDK setup.
- Perform the matrix multiplication by running "./matmul". Valid options are:
- -m #: Specify matrix dimension. This has to be a multiple of 128.
- -s #: Run with # SPEs. 1, 2, 4, 6, 8 and 16 SPEs are possible. With matrices of size 128 you can use 1 SPE, with matrices of size 256 up to 4 SPEs, and with matrices of size 512 and larger up to 16 SPEs. The default is 1 SPE.
- -v: Verify results. The verification is done on the PPE and may take a very long time. For large matrices the limited single precision floating point accuracy might cause errors; in this case you can adjust VERIFY_ERROR in matmul.h.
- On dual Cell BE systems like the IBM QS20 you can use 16 SPEs. Although not necessary, 'numactl' can be used to set the memory allocation policy to 'interleaved'. This will distribute the matrices among the memories of both processors in order to use the additional memory bandwidth provided by the second memory interface. For example: numactl --interleave=0,1 ./matmul -m 6144 -s 16
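The sketch below (not taken from matmul.h) contrasts the element addressing of the two layouts and shows why block data layout removes most of the TLB impact: with __USE_BDL__ each 64x64 block occupies 16 KiB of contiguous memory, whereas in row major layout the 64 rows of a block are n floats apart and have to be gathered with a DMA list, touching many more pages when the page size is only 4K:

    #include <stddef.h>

    #define BS 64

    /* Row major: element (i, j) of the full n x n matrix. */
    static inline size_t idx_row_major(size_t i, size_t j, size_t n)
    {
        return i * n + j;
    }

    /* Block data layout (__USE_BDL__): the 64x64 blocks are stored one
     * after another, each one contiguous in memory. */
    static inline size_t idx_bdl(size_t i, size_t j, size_t n)
    {
        size_t bi = i / BS, bj = j / BS;   /* block coordinates        */
        size_t ii = i % BS, jj = j % BS;   /* offsets inside the block */
        return (bi * (n / BS) + bj) * BS * BS + ii * BS + jj;
    }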
4. Benchmark Results
[Figure: benchmark results measured on an IBM BladeCenter QS20 using a page size of 64K and row major data layout]
This is the program output of a measurement run on a QS20 with block data layout and 64K page size:
$ ./matmul -m 4096 -s 16
Fast matrix multiplications on Cell (SMP) systems.
Copyright (C) 2007 Daniel Hackenberg, ZIH, TU-Dresden
Running matrix multiplication of 4096x4096 matrices using 16 SPEs...
Initializing arrays with random numbers... done!
Starting SPE calculations... done!
Performance results:
Performance of SPE 0: 25.41 GFLOPS
Performance of SPE 1: 25.41 GFLOPS
Performance of SPE 2: 25.41 GFLOPS
Performance of SPE 3: 25.41 GFLOPS
Performance of SPE 4: 25.41 GFLOPS
Performance of SPE 5: 25.41 GFLOPS
Performance of SPE 6: 25.41 GFLOPS
Performance of SPE 7: 25.41 GFLOPS
Performance of SPE 8: 25.40 GFLOPS
Performance of SPE 9: 25.40 GFLOPS
Performance of SPE 10: 25.40 GFLOPS
Performance of SPE 11: 25.40 GFLOPS
Performance of SPE 12: 25.40 GFLOPS
Performance of SPE 13: 25.40 GFLOPS
Performance of SPE 14: 25.40 GFLOPS
Performance of SPE 15: 25.40 GFLOPS
Aggregated performance for all 16 SPEs: 406.51 GFLOPS.
PPE-measured performance of matrix multiplication using 16 SPEs: 406.04 GFLOPS.
(of 409.60 GFLOPS theoretical peak at 3200 MHz clock frequency)
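For reference, the numbers above can be checked with the usual flop count of 2*N^3 for C = C + A * B and the SPE peak rate of 8 single precision flops per cycle (one 4-way SIMD fused multiply-add). A small sketch of this arithmetic, not part of the archive:

    #include <stdio.h>

    int main(void)
    {
        const double n     = 4096.0;   /* matrix dimension      */
        const double clock = 3.2e9;    /* clock frequency in Hz */
        const int    spes  = 16;

        /* peak: one 4-way single precision FMA per cycle = 8 flops/cycle */
        double peak_gflops = spes * 8.0 * clock / 1e9;   /* 409.60 GFLOPS */

        /* flop count of C = C + A * B: n^3 multiply-adds = 2*n^3 flops */
        double flops = 2.0 * n * n * n;

        /* the 406.04 GFLOPS measured above thus correspond to roughly: */
        double seconds = flops / 406.04e9;

        printf("peak: %.2f GFLOPS, runtime: ~%.3f s\n", peak_gflops, seconds);
        return 0;
    }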
5. Comparison with IBM SDK Reference Code
This implementation has the following advantages compared to the reference code in the IBM SDK 2.0:
- optional use of the common row major data layout instead of block data layout, with very good performance using 64K pages,
- no setup of 'huge pages' necessary,
- calculation of C = C + A * B instead of C = A * B,
- more matrix sizes possible (multiple of 128 instead of power of two),
- better performance.
On an IBM BladeCenter QS20 with a clock frequency of 3.2 GHz and a page size of 4K the following performance can be achieved using block data layout and matrices of size 4096x4096:
- with one Cell BE processor (numactl --cpubind=0 --membind=0):
- IBM: 190 GFLOPS,
- this code: 202 GFLOPS
- with two Cell BE processors (numactl --interleave=0,1):
- IBM: 379 GFLOPS
- this code: 405 GFLOPS
When using the common row major data layout, this code needs 64K pages to achieve full performance. The IBM version doesn't support row major layout.
The "-n" switch of the IBM implementation doesn't improve performance compared to interleaved memory allocation using numactl. However, it increases memory usage.
6. Download and License
The matrix multiplication code (libspe2 version) can be downloaded here: matmul.tar.gz
The old libspe1 version is available here: matmul_old.tar.gz
Copyright (C) 2007 Daniel Hackenberg, ZIH, TU-Dresden
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
7. Contact
8. Acknowledgments
For this work, IBM Deutschland Entwicklung GmbH granted access to BladeCenter systems for several months. The Central Institute for Applied Mathematics of the Research Centre Juelich provided access to the Juelich Initiative Cell cluster. I would like to gratefully thank both institutions for their support and the resources provided.