

Faculty of Computer Science Institute for Computer Engineering

# **Putting Queens in Carry Chains**

Thomas B. Preußer

Bernd Nägel Rainer G. Spallek

Πάφος, HIPEAC WRC'09



### Itinerary

- Problem and Complexity Overview
- Solution Approach
- Hardware Mapping and Optimization
- ullet Computation Status for N=26



### The N-Queens Puzzle

• Place N non-attacking Queens on a  $N \times N$  chessboard:







- Generic solution templates?
- → How many (fundamental) solutions?



### Known Solution Counts

| N  | Solutions | N  | Solutions        |  |  |
|----|-----------|----|------------------|--|--|
| 1  | 1         | 14 | 365596           |  |  |
| 2  | 0         | 15 | 2279184          |  |  |
| 3  | 0         | 16 | 14772512         |  |  |
| 4  | 2         | 17 | 95815104         |  |  |
| 5  | 10        | 18 | 666090624        |  |  |
| 6  | 4         | 19 | 4968057848       |  |  |
| 7  | 40        | 20 | 39029188884      |  |  |
| 8  | 92        | 21 | 314666222712     |  |  |
| 9  | 352       | 22 | 2691008701644    |  |  |
| 10 | 724       | 23 | 24233937684440   |  |  |
| 11 | 2680      | 24 | 227514171973736  |  |  |
| 12 | 14200     | 25 | 2207893435808352 |  |  |
| 13 | 73712     | 26 | ?                |  |  |

Exhaustive backtracking solution exploration scales  ${f overexponentially}$  with N.

Hard beyond N = 20.

#### N = 25:

- Java grid computation by INRIA, France.
- Runtime:

Real >6 Months Sequential >53 Years



### Tackling N = 26

- Embarrassingly parallel workload:
  - 1. Preplace  $L \ll N$  columns.
  - 2. Explore subboards independently.
  - 3. Collect and add up subtotals.
- Ideally suited for distributed computation:
  - Internet (BOINC) → NQueens@Home
  - FPGA! → Queens@TUD
    - Challenge the power of a world-wide distributed computation effort by an intelligent FPGA implementation.
    - Identified and reported an overflow bug on November 7, 2008.
    - Thereby, resolved an open dispute on the solution for N=24 without any own computation.



### Algorithmic Overview

Exhaustive backtracking solution exploration.



- 1. Mark explored.
- 2. Update blocking.
- 3. Advance to next column / Count solution.
- 1. Clear Markings.
- 2. Retreat to previous column / Done.

Calculated *Blocking Vectors* avoid frequent constraint validation.



## FPGA Mapping





### Carry Chain Structures





### Generic Carry-Chain Mapping through Addition

#### 1. Derive Carry / Token Propagation

| Case              | $c_{i+1}$ | Description                                     |
|-------------------|-----------|-------------------------------------------------|
| $k_i$ : Kill      | 0         | never a carry: holding no queen and not blocked |
| $p_i$ : Propagate | $c_i$     | pass a carry: holding no queen and blocked      |
| $g_i$ : Generate  | 1         | always a carry: holding current queen placement |

#### 2. Determine Addends

$$\left.egin{array}{ll} a_i &=& g_i + p_i \ b_i &=& g_i \end{array}
ight. 
ight.$$
 always valid, possibly suboptimal

3. Fix up Sum Bits of  $s \le a + b$ 

In equations dependent on the incoming carry/token use:

$$c_i = s_i \oplus p_i$$



## Partial Solutions Development





### Partial Solutions - Consequences

- Partition problem with  $L < \frac{N}{2}$ .
  - Otherwise the subproblems might outnumber the actual solutions, thus, producing a large communication overhead.
  - We chose L=6 for N=26.
- Early identification of stale search trees might be valuable.
  - Our attempts projecting the current blocking vectors onto a single or onto pairs of somewhat distant future columns only produced a disappointing cycle reduction.



### Pushing Performance

Optimization for a small size and a high clock frequency:

- Retain a single active column (without fast token handback).
- Keep placed columns within a plain array of shifted registers.
- Use global blocking vectors for all rows and diagonals setting and unsetting placements and retreats, respectively.
   Note that this would be too expensive in software!

#### Outcome:

- Cycle count to finish a subboard more than doubled.
- Size reduced to nearly a quarter.
  - $\rightarrow$  This also improved device utilization.
- Clock frequency grew by more than 50%.
- → Performance tripled.



## Project Infrastructure





### Contributing Devices

| Nmbr.                                     | Device              | Solvers | Clock   | Perf.    |
|-------------------------------------------|---------------------|---------|---------|----------|
| $4 \times$                                | Spartan-3 XC3S1000  | 15      | 92 MHz  | 13.8 SE  |
| $2\times$                                 | Spartan-3E XC3S500E | 9       | 100 MHz | 9.0 SE   |
| $1 \times$                                | Stratix EP1S80      | 74      | 120 MHz | 88.8 SE  |
| $1 \times$                                | Stratix II EP2SGX90 | 78      | 160 MHz | 124.8 SE |
| $4\times$                                 | Virtex-5 XC5VLX50T  | 33      | 163 MHz | 53.8 SE  |
| AMD Quad-Core Phenom(tm) 9850, Stepping 3 |                     | 3 4     | 2.5 GHz | 8.0 SE   |
| Intel Core2 Duo CPU, Stepping 11          |                     | 2       | 2.7 GHz | 3.6 SE   |

SE – Slice Equivalent measured as performance of a single 100 MHz slice.

• Even small FPGA devices easily outperform a workstation CPU.



### Computation Status

Constantly published on project website: http://queens.inf.tu-dresden.de

- Computational Progress (currently >21% of 25204802 subboards).
- Subtotals

# Thank you!