This site describes a fast matrix multiplication code for
Cell BE processors. It has been developed as part of a seminar
paper at the Center for Information Services and High
Performance Computing. The program is freely available under
the GNU GPL.
Please use this address to reference this page: http://www.tu-dresden.de/zih/cell/matmul
[1. Introduction]
[2. Program Code]
[3. Program Usage]
[4. Benchmark
Results]
[5.
Comparison with IBM SDK Reference Code]
[6. Download and
License]
[7. Contact]
1. Introduction
This code demonstrates how near peak performance can be
achieved on Cell BE processors. It is intended to work 'out of
the box' without any compiler affinity. It also doesn't need a
special 'huge pages' setup. Furthermore, the example is
designed to scale almost perfectly up to 16 SPEs on a SMP
system with two Cell BE processors (like the IBM BladeCenter
QS20).
This work has been done as part of the seminar paper "Einsatz
und Leistungsanalyse der Cell Broadband Engine" (Application
and Performance Analysis of the Cell Broadband Engine). It has
been published by the Center for Information Services and High
Performance Computing and can be requested there (ZIH-R-0701,
German language only).
2. Program Code
The program calculates C = C + A * B where A, B, and C are
square matrices with their dimension being a multiple of 128.
It uses a common block partitioning algorithm with a block size
of 64x64 elements. Either block data layout or row major layout
can be used to store A, B and C in main memory.
The code archive consists of 5 files:
- matmul.h includes several definitions like the data layout
to be used.
- matmul_ppu.c is the PPU code that allocates and randomly
initializes the matrices, starts the SPE threads, verifies the
results, and does a global performance measurement.
- matmul_spu.c includes fast code to generate DMA lists and
a loop that fetches the block matrices from main memory,
starts the matrix multiplication on the blocks in the local
store, and writes back the results into main memory. It uses
double buffering for the blocks of matrix A and B, and
quadruple buffering for the blocks of matrix C. The
spu_decrementer is used to do a local performance measurement
on each SPU.
- matmul_spu_simd.asm contains a heavily optimized function
that performs a 64x64 matrix multiplication within 65909
cycles. This equals 99.43% of the theoretical peak. It is
completely written in assembly and therefore quite hard to
read.
- COMPILE.sh is a shell script that compiles the code. It
should work with SDK 1 as well as with SDK 2.
3. Program Usage
- By default, the matrices are stored in block data layout
to remove any TLB impact. Comment __USE_BDL__ if you want to
use row major layout. This works fine with page sizes of 64K.
With a page size of 4K, the impact of the TLB misses might
limit the performance.
- Use the COMPILE.sh script in order to compile the code. If
this doesn't work you will most likely have to modify
$CELL_BIN according to your SDK setup.
- Perform the matrix multiplication by running "./matmul".
Valid options are:
-
- -m #: Specify matrix dimension. This has to be a
multiple of 128.
- -s #: Run with # SPEs. 1, 2, 4, 6, 8 and 16 SPEs are
possible. With matrices of size 128 you can use 1 SPE, for
matrices of size 256 you can use 4 SPEs, and for matrices of
size 512 and larger you can use up to 16 SPEs. The default
is 1 SPE.
- -v: Verify results. The verification is done on the PPE
and might take very long time. For large matrices the
limited single precision floating point accuracy might cause
errors. You can modify VERIFY_ERROR in matmul.h in this
case.
- On dual Cell BE systems like the IBM QS20 you can use 16
SPEs. Although not necessary, 'numactl' can be used to set the
memory allocation policy to 'interleaved'. This will
distribute the matrices among the memories of both processors
in order to use the additional memory bandwidth provided by
the second memory interface. For example: numactl
--interleave=0,1 ./matmul -m 6144 -s 16
4. Benchmark Results
This benchmark
result is based on measurements on an IBM BladeCenter QS20
using a page size of 64K and row major data layout:

Matrix Multiplication
on IBM QS20
This is the program output of a measurement run on a QS20 with
block data layout and 64K page size:
$ ./matmul -m 4096 -s 16
Fast matrix multiplications on Cell (SMP) systems.
Copyright (C) 2007 Daniel Hackenberg, ZIH, TU-Dresden
Running matrix multiplication of 4096x4096 matrices using 16 SPEs...
Initializing arrays with random numbers... done!
Starting SPE calculations... done!
Performance results:
Performance of SPE 0: 25.41 GFLOPS
Performance of SPE 1: 25.41 GFLOPS
Performance of SPE 2: 25.41 GFLOPS
Performance of SPE 3: 25.41 GFLOPS
Performance of SPE 4: 25.41 GFLOPS
Performance of SPE 5: 25.41 GFLOPS
Performance of SPE 6: 25.41 GFLOPS
Performance of SPE 7: 25.41 GFLOPS
Performance of SPE 8: 25.40 GFLOPS
Performance of SPE 9: 25.40 GFLOPS
Performance of SPE 10: 25.40 GFLOPS
Performance of SPE 11: 25.40 GFLOPS
Performance of SPE 12: 25.40 GFLOPS
Performance of SPE 13: 25.40 GFLOPS
Performance of SPE 14: 25.40 GFLOPS
Performance of SPE 15: 25.40 GFLOPS
Aggregated performance for all 16 SPEs: 406.51 GFLOPS.
PPE-measured performance of matrix multiplication using 16 SPEs: 406.04 GFLOPS.
(of 409.60 GFLOPS theoretical peak at 3200 MHz clock frequency)
5. Comparison with IBM SDK Reference Code
This implementation has the following advantages compared to
the reference code in the IBM SDK 2.0:
- usage of common row major data layout instead of block
data layout possible, very good performance with 64K
pages,
- no setup of 'huge pages' necessary,
- calculation of C = C + A * B instead of C = A * B,
- more matrix sizes possible (multiple of 128 instead of
power of two),
- better performance.
On an IBM BladeCenter QS20 with a clock frequency of 3.2 GHz
and a page size of 4K the following performance can be achieved
using block data layout and matrices of size
4096x4096:
- with one Cell BE processor (numactl --cpubind=0
--membind=0):
-
- IBM: 190 GFLOPS,
- this code: 202 GFLOPS
- with two Cell BE processors (numactl
--interleave=0,1):
-
- IBM: 379 GFLOPS
- this code: 405 GFLOPS
When using common row major data layout the performance of
this code will need 64K pages to achieve full performance. The
IBM version doesn't support row major layout.
The "-n" switch of the IBM implementation doesn't improve
performance compared to interleaved memory allocation using
numactl. However, it increases memory usage.
6. Download and License
The matrix multiplication code (libspe2 version) can be
downloaded here: matmul.tar.gz
The old libspe1 version is available here: matmul_old.tar.gz
Copyright (C) 2007 Daniel Hackenberg, ZIH, TU-Dresden
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of
the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
the GNU General Public License for more details.
7. Contact
You can contact the author at
daniel.hackenberg@tu-dresden.de
8. Acknowledgments
For this work, the IBM Deutschland Entwicklung GmbH has granted
access to BladeCenter systems for several months. The Central
Institute for Applied Mathematics of the Research Centre
Juelich provided access to the Juelich Initiative Cell cluster.
I would like to gratefully thank for the support and the
provided resources.
