This site describes a fast matrix multiplication code for Cell BE
processors. It has been developed as part of a seminar paper at the
Center for Information Services and High Performance Computing. The
program is freely available under the GNU GPL.
Please use this address to reference this page: http://www.tu-dresden.de/zih/cell/matmul
[1. Introduction]
[2. Program Code]
[3. Program Usage]
[4. Benchmark Results]
[5. Comparison with IBM SDK Reference Code]
[6. Download and License]
[7. Contact]
1. Introduction
This
code demonstrates how near peak performance can be achieved on Cell BE
processors. It is intended to work 'out of the box' without any
compiler affinity. It also doesn't need a special 'huge pages' setup.
Furthermore, the example is designed to scale almost perfectly up to 16
SPEs on a SMP system with two Cell BE processors (like the IBM
BladeCenter QS20).
This work has been done as part of the
seminar paper "Einsatz und Leistungsanalyse der Cell Broadband Engine"
(Application and Performance Analysis of the Cell Broadband Engine). It
has been published by the Center for Information Services and High
Performance Computing and can be requested there (ZIH-R-0701, German
language only).
2. Program Code
The
program calculates C = C + A * B where A, B, and C are square matrices
with their dimension being a multiple of 128. It uses a common block
partitioning algorithm with a block size of 64x64 elements. Either
block data layout or row major layout can be used to store A, B and C
in main memory.
The code archive consists of 5 files:
- matmul.h includes several definitions like the data layout to be used.
- matmul_ppu.c
is the PPU code that allocates and randomly initializes the matrices,
starts the SPE threads, verifies the results, and does a global
performance measurement.
- matmul_spu.c includes fast code to
generate DMA lists and a loop that fetches the block matrices from main
memory, starts the matrix multiplication on the blocks in the local
store, and writes back the results into main memory. It uses double
buffering for the blocks of matrix A and B, and quadruple buffering for
the blocks of matrix C. The spu_decrementer is used to do a local
performance measurement on each SPU.
- matmul_spu_simd.asm
contains a heavily optimized function that performs a 64x64 matrix
multiplication within 65909 cycles. This equals 99.43% of the
theoretical peak. It is completely written in assembly and therefore
quite hard to read.
- COMPILE.sh is a shell script that compiles the code. It should work with SDK 1 as well as with SDK 2.
3. Program Usage
- By default, the matrices are stored in block data layout to remove any TLB impact. Comment
__USE_BDL__ if you want to use row major layout. This works fine with page sizes of 64K. With a page size of 4K, the
impact of the TLB misses might limit the performance.
- Use the COMPILE.sh script in order to compile the code. If this doesn't
work you will most likely have to modify $CELL_BIN according to your
SDK setup.
- Perform the matrix multiplication by running "./matmul". Valid options are:
- -m #: Specify matrix dimension. This has to be a multiple of 128.
-
-s #: Run with # SPEs. 1, 2, 4, 6, 8 and 16 SPEs are possible. With
matrices of size 128 you can use 1 SPE, for matrices of size 256 you
can use 4 SPEs, and for matrices of size 512 and larger you can use up
to 16 SPEs. The default is 1 SPE.
- -v: Verify results. The
verification is done on the PPE and might take very long time. For
large matrices the limited single precision floating point accuracy
might cause errors. You can modify VERIFY_ERROR in matmul.h in this
case.
- On dual Cell BE systems like the IBM QS20 you
can use 16 SPEs. Although not necessary, 'numactl' can be used to set the
memory allocation policy to 'interleaved'. This will distribute the
matrices among the memories of both processors in order to use the
additional memory bandwidth provided by the second memory interface.
For example: numactl --interleave=0,1 ./matmul -m 6144 -s 16
4. Benchmark Results
This benchmark result is based on measurements on an IBM BladeCenter QS20 using a page size of 64K and row major data layout:

Matrix Multiplication on IBM QS20
This is the program output of a measurement run on a QS20 with block data layout and 64K page size
:$ ./matmul -m 4096 -s 16
Fast matrix multiplications on Cell (SMP) systems.
Copyright (C) 2007 Daniel Hackenberg, ZIH, TU-Dresden
Running matrix multiplication of 4096x4096 matrices using 16 SPEs...
Initializing arrays with random numbers... done!
Starting SPE calculations... done!
Performance results:
Performance of SPE 0: 25.41 GFLOPS
Performance of SPE 1: 25.41 GFLOPS
Performance of SPE 2: 25.41 GFLOPS
Performance of SPE 3: 25.41 GFLOPS
Performance of SPE 4: 25.41 GFLOPS
Performance of SPE 5: 25.41 GFLOPS
Performance of SPE 6: 25.41 GFLOPS
Performance of SPE 7: 25.41 GFLOPS
Performance of SPE 8: 25.40 GFLOPS
Performance of SPE 9: 25.40 GFLOPS
Performance of SPE 10: 25.40 GFLOPS
Performance of SPE 11: 25.40 GFLOPS
Performance of SPE 12: 25.40 GFLOPS
Performance of SPE 13: 25.40 GFLOPS
Performance of SPE 14: 25.40 GFLOPS
Performance of SPE 15: 25.40 GFLOPS
Aggregated performance for all 16 SPEs: 406.51 GFLOPS.
PPE-measured performance of matrix multiplication using 16 SPEs: 406.04 GFLOPS.
(of 409.60 GFLOPS theoretical peak at 3200 MHz clock frequency)
5. Comparison with IBM SDK Reference Code
This implementation has the following advantages compared to the reference code in the IBM SDK 2.0:
- usage of common row major data layout instead of block data layout possible, very good performance with 64K pages,
- no setup of 'huge pages' necessary,
- calculation of C = C + A * B instead of C = A * B,
- more matrix sizes possible (multiple of 128 instead of power of two),
- better performance.
On
an IBM BladeCenter QS20 with a clock frequency of 3.2 GHz and a page
size of 4K the following performance can be achieved using block data
layout and matrices of size 4096x4096:
- with one Cell BE processor (numactl --cpubind=0 --membind=0):
- IBM: 190 GFLOPS,
- this code: 202 GFLOPS
- with two Cell BE processors (numactl --interleave=0,1):
- IBM: 379 GFLOPS
- this code: 405 GFLOPS
When
using common row major data layout the performance of this code will
need 64K pages to achieve full performance. The IBM version doesn't
support row major layout.
The "-n" switch of the IBM
implementation doesn't improve performance compared to interleaved
memory allocation using numactl. However, it increases memory usage.
6. Download and License
The matrix multiplication code (libspe2 version) can be downloaded here:
matmul.tar.gzThe old libspe1 version is available here:
matmul_old.tar.gzCopyright (C) 2007 Daniel Hackenberg, ZIH, TU-Dresden
This
program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.
This program is distributed in the
hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU General Public License for more details.
7. Contact
You can contact the author at daniel.hackenberg@tu-dresden.de
8. Acknowledgments
For
this work, the IBM Deutschland Entwicklung GmbH has granted access to
BladeCenter systems for several months. The Central Institute for
Applied Mathematics of the Research Centre Juelich provided access to
the Juelich Initiative Cell cluster. I would like to gratefully thank
for the support and the provided resources.