Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

Kurzfassung

Dissertation zur Erlangung des akademischen Grades

Doktor rerum naturalium (Dr. rer. nat.)

vorgelegt von

Name: Domke
Vorname: Jens
geboren am: 12.09.1984 in: Bad Muskau

Tag der Einreichung: 30.03.2017

Gutachter: Prof. Dr. rer. nat. Wolfgang E. Nagel, TU Dresden, Germany
Professor Tor Skeie, PhD, University of Oslo, Norway
1 Introduction

With Moore’s law slowly coming to an end, many information technology domains started to scale-out. A global trend visible from small many-core systems-on-chip, such as the 256-core Kalray MPPA-256 chip [7] or the 1024-core Epiphany-V developed by Adapteva [20], over large-scale data centers or data wearhouses [19], to the world’s most powerful supercomputers, such as the 40,960-node Sunway TaihuLight system hosting more than 10 million compute cores [9]. For high-performance computing (HPC) systems, it can be expected that the number of network endpoints will grow significantly [17] which emphasizes the role of the interconnection network as one of the most critical components in a supercomputer even more. These HPC networks, as well as other previously mentioned networks, are largely controlled by routing algorithms which determine how to forward packets. Routing algorithms have to balance multiple, partially conflicting, requirements. For example, they shall provide the best forwarding strategy to guarantee a certain quality of service (bandwidth, latency, etc.), which is an NP-hard problem in general, while minimizing the runtime of the routing in order to quickly react to failures of network components.

Unfortunately, as network topologies grow, failures of switches and connectors or cables become common. As opposed to network endpoints, the wiring complexity and infrastructural demands of cabling, e.g., arrangement of cable trays, make maintaining complex networks challenging and expensive. Thus, network structures are often kept in place over years and multiple generations of machines while other components such as CPU or main memory are upgraded. Today, many networks are based on the concept of over-provisioning the hardware, such as installing spare cables in the cable trays or having a spare parts inventory. Alternatively, they are operated in a deferred repair mode, in which a failed component will not be replaced instantaneously but within a reasonable time frame, such as a business day. Fail-in-place strategies are common in storage systems when maintenance costs exceed maintenance benefits, such as in large-scale data centers with millions of hard drives, e.g., IBM’s Flipstone [1]. Hence, a fail-in-place approach for interconnection networks could be similarly beneficial, if supported by the correct combination of network topology and applied routing algorithm.

Routing algorithms for HPC systems have been the topic of many studies ranging from topology-specific routing algorithms [25][35] through general deadlock-free algorithms [5][28], more advanced deadlock-free algorithms balancing the routes [8], to advanced path-caching for quick failover [33]. A good overview is provided by Flich et al. [11]. Many advanced approaches for application-specific [16][22] or topology-specific [14][23][31] routing and mapping assume idealized conditions such as a regular topology without faulty components, isolated bulk-synchronous applications communicating in synchronized phases, and the absence of system noise. Unfortunately, these assumptions are rarely true in practice. Presumably, integrating additional information about the current state of the supercomputer, such as state of the batch system (i.e., the current job mix) or application demands, can help guiding the network manager in assigning optimized routes on-demand which potentially increases utilization and achievable throughput.

Due to the previously addressed hardware failures and these on-demand software adjustments, topology-
aware routing algorithms become increasingly inapplicable in modern supercomputers. Hence, avoiding deadlock situations \cite{4,6}, a requirement for lossless Layer 2 networking, becomes much harder to achieve. Topology-aware routings usually avoid deadlocks algorithmically for many regular topologies, e.g., by restricting the routing to use only a subset of all available channel dependencies, as it is implemented by dimension-order routing (DOR) \cite{24}. In contrast, many topology-agnostic routing technique for HPC and on-chip networks (NoC) use virtual channels to break deadlocks in arbitrary topologies \cite{2,8,29,30}. Yet, all these concepts have limitations: (1) topology-aware routings assume perfect topologies and often do not support switch/link failures, see Section 2, (2) cycle-avoiding routings often cannot balance routes and thus limit global bandwidth \cite{26}, and (3) routings based on virtual channel isolation fail when the required number of virtual channels is not available \cite{11}.

This thesis primarily focuses on flow-oblivious, static, destination-based, unicast routing for supercomputers, and is advancing the research field of HPC interconnects with the following contributions:

1. Developing a network simulation and analysis framework which allows system administrators and designers to evaluate current, or plan future, fail-in-place networks and stipulate operation policies while taking failure rates into consideration;

2. Showing, based on an exhaustive study of multiple InfiniBand (IB) routing algorithms and several real-world topologies, that fail-in-place network design can be accomplished with an appropriate combination of topology and routing algorithm while assuming a performance degradation up to a pre-defined threshold can be tolerated;

3. Implementing a low-overhead scheduling-aware routing (SAR) for InfiniBand-based HPC systems, aimed at optimizing the effective edge forwarding indices and dark fiber percentage of realistic multi-user/multi-job HPC environments, and demonstrating its benefits over state-of-the-art routing mechanisms using simulations and communication benchmarks;

4. Extending the formalism of per-packet consistent network updates for in-order delivery and deadlock-freedom, which is needed due to the frequent path reconfigurations performed by SAR, and show how to achieve both conditions using current InfiniBand hardware;

5. Demonstrating the shortcomings of multi-step routing approaches with respect to creating deadlock-free forwarding rules and introducing a novel approach of routing on the complete channel dependency graph to overcome these issues; and

6. Developing and implementing the deadlock-free, oblivious, and destination-based Nue routing (based on the novel approach) and showing its effectiveness, both with respect to the runtime calculating the forwarding rules and the achievable throughput resulting from the excellent path balancing, and proving its broad applicability to current and future interconnection technology.
2 Fail-in-Place High-Performance Networks

As other high-performance computing hardware, networks are not immune to failures. According to previously published fault statistics, network failures constitute between 2% – 10% for the HPC systems at Los Alamos National Laboratory [27], over 20% for local area networks (LANs) of Windows NT based computers [15] and up to 40% for internet services [21] of the total number of failures. The failure data of the TSUBAME2 supercomputer [12] indicates an annual failure rate of \( \approx 1\% \) for InfiniBand links as well as 1% for the used switches, and shows the expected bathtub curve for hardware reliability [32]. Hence, assuming a constant annual failure rate of 1% is a fair representative failure rate to simplify further fail-in-place analysis and to extrapolate the achievable throughput degradation over a maximum system lifetime of eight years with the developed simulation framework shown in Figure 1. The fail-in-place property, in the context of interconnection networks, can be defined based on the differentiation between “critical” and “non-critical” network component failures. A critical component failure disconnects all paths between two hosts, whereas a non-critical failure only disconnects a subset of all paths between two hosts. The network fail-in-place strategy is to repair critical failures only, but continue operation by bypassing non-critical failures.

Conducted simulations show that two classes of deterministic routing algorithms are suitable to build fail-in-place interconnects: (1) fault-tolerant topology-aware routings, and (2) topology-agnostic routing algorithms. In terms of resiliency, topology-agnostic algorithms, such as (DF-)SSSP [8], are superior to topology-aware routing algorithms, such as fat-tree [36] or Up*/Down* [28], when the network

Figure 1: Toolchain to evaluate the network throughput of a fail-in-place network

(a) Balanced 16-ary 2-tree with 256 HCAs

(b) Different networks with \( \approx 2,200 \) HCAs

Figure 2: Left: whisker plots of consumption bandwidth for uniform random injection (three seeds); Right: histograms with error bars for all-to-all shift exchange pattern (ten seeds for fault injection)
suffers from a high percentage of failures, see Figure 2a. However, techniques to avoid deadlocks in topology-agnostic algorithms can limit their applicability for large networks, i.e., DFSSSP fails to route the Kautz(7,3) graph [18] with 2,352 attached terminals, see Figure 2b, when the number of virtual channels is limited to eight. This problem will be further addressed and solved in Section 4.

3 Utilization Improvement through Scheduling-Aware Routing

Idealized conditions, e.g., regular topology without faulty network components (Section 2), single application running on the whole system [14], or no system noise which is altering the message injection pattern, etc., usually do not apply to production supercomputers. An analysis of the job mix on two production supercomputers emphasizes the complexity of real-world installations. For example, approximately 66% and 50% of the compute jobs on Taurus [37] and TSUBAME2, respectively, are using multiple switches, while up to ≈100 jobs are executed simultaneously on these systems. Thus, the jobs potentially share the same network resources, meaning switches and inter-switch links, and therefore can influence each others communication performance. But at the same time, communications do generally not cross from one parallel applications to another, with the exception of management and I/O traffic targeted at special endpoints in the system, a fact ignored by state-of-the-art oblivious routings deployed in practice.

Performing a scheduling-aware routing (SAR) optimization for a supercomputer is only beneficial for parallel applications when their compute nodes are attached to multiple switches. Therefore, a filtering tool is needed, see Figure 3, which periodically polls SLURM [34] for the current state of the system, and filters out a list of multi-switch jobs which is then passed to the InfiniBand fabric manager. SAR routing is implemented as an extension for the topology-agnostic DFSSSP routing to include knowledge about batch job locations, which enables the algorithm to optimize the path calculation for intra-job paths. SAR’s effectiveness is determined by two new metrics: (1) the effective edge forwarding index (extending the established EFI [3]) to differentiate between “useful” intra-job paths and unused inter-job paths, and (2) the proportion of superfluous links (or “dark fiber”) to links usable for intra-job communication.

Figure 3: Flowchart of a filtering tool; Collect information about currently running multi-switch batch jobs and initiate recalculation of switches forwarding rules (linear forwarding tables) via SAR
The results for the maximum effective EFI across all simultaneously executed jobs are shown in the top plot of Figure 5. As one can see, the scheduling-aware routing (blue line) outperforms the other three routing algorithms on the TSUBAME2 supercomputer. On average, SAR improves the effective EFI between 26.8% and 38.5% compared to the three competitors. The observed maximum difference is 71.2% between SAR and fat-tree routing. Similarly, the SAR approach lowers the maximal EFI per job (2nd plot), which should accelerate the achievable throughput within the jobs. Furthermore, SAR increases the number of links available to all running batch jobs by up to 17.74% for TSUBAME2 in comparison to the default fat-tree routing on this system, while also increasing the number of intra-job links. The scheduling-aware routing is now deployed for more than one year on the petascale Taurus production supercomputer at the TU Dresden to improve the system’s utilization and fail-in-place properties.

4 Routing on the Channel Dependency Graph

A major problem for fail-in-place and other types of networks persists: No algorithmic approach is known to perform deadlock-free destination-based routing which is generally applicable to all technologies and topologies, and delivers better performance than Up*/Down*. Two general concepts exist for creating deadlock-free routing functions, both consisting out of two phases. The first concept is an analytical solution, also referred to as turn model, which basically restricts the usage of possible turns within the network as a first phase, either by knowledge of the underlying topology (e.g., dimension order routing [24]) or similarly to the Up*/Down* routing, before calculating the routes in the second phase. Unfortunately, this approach generally results in non-shortest paths and poor path balancing, and therefore
This novel routing algorithm, called Nue routing, overcomes all limitations (1) – (3) mentioned in Section 1. To demonstrate the Nue’s competitiveness, with respect to achievable global throughput, two real-world topologies, namely the multi-island fat-tree of Taurus and TSUBAME2’s 2nd rail fat-tree network connecting 1,408 compute node, are analyzed, see Figure 7a. Nue routing is compared for a given topology to the other usable algorithms, high latency and low throughput, respectively. In contrast, the current best practice of the second concept, e.g., as implemented by DFSSSP and LASH [30], is to decouple the two problems of (shortest-)path creation into phase 1 and deadlock-free assignment to virtual channels (VCs) into phase 2, because both problems require a different graph representation of the network and routes. Generally, this approach cannot be bound with respect to the number of virtual channels required to solve the deadlock problem, which is easily comprehensible by assuming a 5-node ring topology, e.g., similar to Figure 6a but without the shortcut between \( n_3 \) and \( n_5 \). Such a topology cannot be routed along shortest paths without at least two VCs. Hence, a two-phase approach is impractical for designing an universally applicable routing.

Assuming, it is possible to combine the information required to solve both problems within one graph, then this might allow to impose routing restrictions to the path creation on-demand, because the effects of a partial or full path on the channel dependency graph (CDG) can be checked simultaneously. Hence, it would be possible to avoid closing cycles in the CDG while calculating the paths instead of breaking the cycles later. This new graph, called complete CDG (see Figure 6b), includes all possible channel dependencies and represents one virtual layer [30]. A graph search algorithm, e.g., Dijkstra’s algorithm, can traverse the graph and construct routes from all nodes to all other nodes and these routes are deadlock-free within this layer, see Algorithm 4.1. Furthermore, assuming the used network technology supports an arbitrary, but fixed, number of virtual channels, \( k > 1 \), then individual destination nodes can be assigned to different virtual layers. Hence, the graph search algorithm within one layer is able to calculate deadlock-free routes for all source nodes to the subset of destination nodes assigned to this virtual layer. Therefore, all routes in all virtual layers are deadlock-free without exceeding the virtual channel constraint.

**Algorithm 4.1:** Nue routing calculates all paths within a network \( I = G(N, C) \) for a given number of virtual channels \( k \geq 1 \)

<table>
<thead>
<tr>
<th>Step</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Partition ( N ) into ( k ) disjoint subsets ( N_1, \ldots, N_k ) of destinations</td>
</tr>
<tr>
<td>2</td>
<td>foreach virtual layer ( L_i ) with ( i \in {1, \ldots, k} ) do</td>
</tr>
<tr>
<td>3</td>
<td>Create a convex subgraph ( H_i ) for ( N^e_i )</td>
</tr>
<tr>
<td>4</td>
<td>Identify central ( n_{r,i} ) of ( H_i ) via Brandes’ algorithm</td>
</tr>
<tr>
<td>5</td>
<td>Create a new complete channel dependency graph ( \mathcal{D}_i )</td>
</tr>
<tr>
<td>6</td>
<td>Define escape paths ( D^e_i ) for spanning tree root ( n_{r,i} )</td>
</tr>
<tr>
<td>7</td>
<td>foreach node ( n \in N^d_i ) do</td>
</tr>
<tr>
<td>8</td>
<td>Identify deadlock-free paths ( P_p ) within ( \mathcal{D}_i )</td>
</tr>
<tr>
<td>9</td>
<td>Store these paths, e.g., in forwarding tables</td>
</tr>
<tr>
<td>10</td>
<td>Update channel weights in ( \mathcal{D}_i ) for these paths</td>
</tr>
</tbody>
</table>
for every number of virtual channels $1 \leq k \leq 8$. Besides the simulated throughput for each topology and routing combination, the number of needed virtual channels is given atop of the bars in Figure 7a. Two trends are visible for all investigated topologies: First, an increase of used virtual channels for Nue also increases the throughput for the all-to-all communication. The outliers from this pattern, e.g., the decrease in throughput for TSUBAME2 and two virtual channels, correlate to a sudden increase in fall backs to the escape paths. The second trend is that Nue shows a slight variance in throughput after reaching a certain peak, usually for about $k \geq 5$ in all investigated examples, but this generally depends on topology type and size. This behavior can be accounted to a mismatch between the static routing and the execution order of the point-to-point communications, which assemble the all-to-all traffic pattern, causing temporary congestion in the network which slows down the entire communication process [13].

In general, Figure 7a and further conducted simulations for other topologies show that Nue routing is competitive to the best performing routing for each individual topology, i.e., offers between 83.5% for a 10-ary 3-tree (with 1,100 terminals) and 121.4% throughput for a 1,536-node Cascade network [10] in comparison. Furthermore, two main conclusions can be drawn from the results shown in Figure 7b: Nue routing calculates the forwarding tables faster than the topology-agnostic DFSSSP, which has the same time complexity of $O(|V|^2 \cdot \log |V|)$, and outperforms LASH routing for tori larger than 4x4x4 with 256 terminals attached. Only the topology-aware Torus-2QoS routing is on average 9x faster than Nue, which is as expected since Torus-2QoS is able to avoid deadlocks analytically, and therefore expensive cycle searches in the CDG are unnecessary. The second important result is an applicability of 100%, i.e., Nue routing scales with the topology size, while the other three deadlock-free algorithms fail.

5 Summary and Conclusion

The most important role of each supercomputer is to provide other research domains, such as material science, meteorology and climate research, biomedical informatics, or astrophysics, etc., with computational
resources otherwise not available. The high-speed interconnection network, connecting the nodes of the HPC system, plays a crucial role in achieving scalability and throughput of these scientific applications. Key factors of the interconnection network are low latency, high bandwidth, high message throughput, deadlock-freedom, resiliency, and availability, hence factors which depend not only on the networking hardware, but also depend on the software used for network management. This thesis advances the state-of-the-art of network management in three main areas of interest for the HPC community, by: (1) showing that HPC interconnection networks can be operated in fail-in-place mode and providing a simulation toolchain to plan future systems and to establish operation policies, (2) developing a new and low overhead scheduling-aware routing algorithm to improve the scientific throughput and network utilization of multi-user/multi-job HPC environments, and (3) introducing a novel concept for solving the deadlock-freedom of routing algorithms, which allows the calculation of deadlock-free and path-balanced forwarding rules for arbitrary topologies within the given virtual channel constraint.

Even so the resulting irregular topologies of fail-in-place networks pose a challenge to the used routing algorithms, a low failure rate of the network components supports a fail-in-place network design strategy, which bypasses non-critical failures during the system’s lifetime. However, both types of existing deterministic routing algorithms, i.e., fault-tolerant topology-aware and topology-agnostic, show limitations. The performance degradation using a topology-aware routing algorithm increases more with an increase of failures compared to topology-agnostic routings. In contrast, topology-agnostic routing algorithms are supposed to be ideal for fail-in-place networks, but the investigated algorithms either show weaknesses in terms of deadlocks, such as MinHop routing, or cannot be used beyond a certain network size. For example, the deadlock-avoidance via virtual channels by state-of-the-art routings can require more than the available number of virtual channels, e.g., in the case of LASH and DFSSSP for the 3-dimensional torus(7,7,7) with more than 2,000 terminal nodes, both exceed the limit of eight VCs.

While the fail-in-place network operation increases the resiliency and availability considerably, the overall achievable throughput decreases slightly when the supercomputer is used by a single scientific application. Nonetheless, this is rarely the case, since most HPC systems located at universities or national laboratories are used by multiple users simultaneously. The developed scheduling-aware routing (SAR) approach for the interconnection network is able to mitigate these problems. The conducted empirical study, based on the simulative replay of the historical job-to-node allocation of Taurus and TSUBAME2, reveals that SAR is able to improve the overall system utilization by up to 17.74%, and reduce the effective edge forwarding index of individual jobs by up to 71.2%. Conducted communication benchmarks on Taurus, while the system was simultaneously used by other researchers and scientific applications, showed throughput increases by up to 17.6% after SAR optimized the forwarding rules for the locality of the benchmark job.

Additionally, a fail-in-place network is also more prone to deadlocks, when it is based on lossless interconnection technology. However, the applicability of topology-aware and topology-agnostic routings for faulty regular topologies or arbitrary topologies is limited, as discussed before, and universally applicable routings, such as Up*/Down*, are scarce and perform poorly with respect to path length
and balancing. A model implementation of the new approach of routing within the complete channel dependency graph, called Nue routing, is tailored for the deadlock-free, oblivious, and destination-based routing needed in Converged Enhanced Ethernet (CEE) and InfiniBand and can directly be employed for both, e.g., using InfiniBand’s virtual lanes or using PFC together with Priority Code Point for CEE. Flit-level simulations indicate that Nue enables competitive or superior global throughput, i.e., between 83.5% and 121.4% throughput for an all-to-all traffic pattern, in comparison to the best performing state-of-the-art routing for the respective topology. Nue’s characteristics and advantages, combined with its low time complexity of $O(|N|^2 \cdot \log |N|)$ and memory complexity of $O(|N|^2)$, make Nue routing a suitable algorithm to route modern large-scale supercomputers, lossless data center networks, and NoC architectures.

References


