Topics for Student Theses
All topics in this list are available for Bachelor, Undergraduate, Diploma, and Master theses. We will discuss with you how to scale the topic appropriately. Your own topic proposals are also very welcome.
Scheduling
The Linux kernel contains the SCHED_DEADLINE CPU scheduler, which is based on the Earliest Deadline First (EDF) and Constant Bandwidth Server (CBS) algorithms. SCHED_DEADLINE allows applications to reserve a computation budget Q every period P, making this CPU scheduler suitable for mixed hard and soft real-time workloads.
The goal of this task is to analyse and evaluate the efficacy, effectiveness, and overhead of SCHED_DEADLINE. To this end, a selection of benchmarks from the PARSEC benchmark suite has to be ported to the SCHED_DEADLINE interface. The evaluation should include the latency/makespan of well-behaved tasks, as well as of tasks that overrun their budget temporarily or permanently. Additionally, the throughput of best-effort tasks scheduled by CFS can be set in relation to the latencies achieved by the real-time load to create a throughput-latency analysis.
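For orientation, the following minimal sketch shows how a thread could be admitted to SCHED_DEADLINE via the sched_setattr system call; the 10 ms budget and 100 ms period are illustrative values only, and a ported PARSEC benchmark would derive Q and P from its actual workload. Note that admission requires appropriate privileges (e.g. CAP_SYS_NICE) and may be rejected by the kernel's admission test.

    /* Minimal sketch: admit the calling thread to SCHED_DEADLINE with an
     * illustrative budget Q = 10 ms every period P = 100 ms. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    /* glibc ships no wrapper or struct definition, so declare them here. */
    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;   /* Q in nanoseconds */
        uint64_t sched_deadline;  /* relative deadline in nanoseconds */
        uint64_t sched_period;    /* P in nanoseconds */
    };

    int main(void)
    {
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  =  10 * 1000 * 1000,   /* Q = 10 ms */
            .sched_deadline = 100 * 1000 * 1000,   /* deadline = 100 ms */
            .sched_period   = 100 * 1000 * 1000,   /* P = 100 ms */
        };

        if (syscall(SYS_sched_setattr, 0 /* this thread */, &attr, 0) != 0) {
            perror("sched_setattr");
            return 1;
        }
        /* ... the benchmark's real-time work would run here ... */
        return 0;
    }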
Optionally, the same experiments can be conducted with the ATLAS scheduler, developed at the operating systems group, to highlight differences in the behaviour of these two approaches to scheduling real-time work.
Supervisors: Michael Roitzsch, Hannes Weisbach
The ATLAS scheduler uses metrics to predict task execution times. However, it is conceivable to use the same or a similar predictor to derive other application- and workload-related measures such as cache behaviour, memory accesses, or energy consumption.
The thesis should experiment with the ATLAS predictor to analyse its ability to predict these other measures and the accuracy it achieves.
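As a rough illustration of the underlying idea (not the actual ATLAS predictor code), metric-based prediction can be seen as fitting a model that maps an application metric to the measure of interest. The sketch below fits a single-metric linear model by ordinary least squares; the metric and measure values are hypothetical.

    /* Sketch: fit y = a*x + b by ordinary least squares, where x is an
     * application metric and y the measure to predict (e.g. cache misses,
     * memory accesses, or energy). Illustrative only, not ATLAS code. */
    #include <stddef.h>
    #include <stdio.h>

    struct linear_model { double a, b; };

    struct linear_model fit(const double *x, const double *y, size_t n)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (size_t i = 0; i < n; i++) {
            sx  += x[i];
            sy  += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        struct linear_model m;
        m.a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        m.b = (sy - m.a * sx) / n;
        return m;
    }

    double predict(struct linear_model m, double metric)
    {
        return m.a * metric + m.b;
    }

    int main(void)
    {
        /* Hypothetical samples: metric value vs. measured quantity. */
        double metric[]  = { 100, 200, 300, 400 };
        double measure[] = { 1100, 2050, 3150, 3980 };

        struct linear_model m = fit(metric, measure, 4);
        printf("predicted measure for metric 250: %.0f\n", predict(m, 250));
        return 0;
    }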
Supervisors: Michael Roitzsch, Hannes Weisbach
Execution time prediction within ATLAS is based on metrics. A metric is a value from the application domain that is required to correlate with the amount of work to be performed. Roitzsch proposed a set of 13 metrics for predicting the decoding times of H.264 frames. These metrics were embedded as metadata in H.264 streams; an ATLAS-enabled media player would extract this metadata and pass the metrics to the ATLAS runtime. Since the original implementation, FFmpeg has undergone substantial changes, so the original code cannot easily be re-used.
This programming-oriented thesis should implement the H.264 metrics handling in a recent version of FFmpeg and export the metrics to an ATLAS-enabled media player (FFplay, for example). The written part of the thesis should discuss tooling to analyse the FFmpeg code and how to extract the metrics.
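A possible starting point is the send/receive decoding API of current FFmpeg, sketched below. The function extract_h264_metrics() is hypothetical and only marks where the thesis would read the metric metadata carried with the stream and hand it to the ATLAS runtime before the frame is decoded.

    /* Sketch of a decode loop using FFmpeg's current send/receive API.
     * extract_h264_metrics() is hypothetical and stands for the metric
     * extraction and hand-over to ATLAS that this thesis would implement. */
    #include <libavcodec/avcodec.h>

    static void extract_h264_metrics(const AVPacket *pkt)
    {
        /* Hypothetical: parse the metric metadata carried with this access
         * unit (e.g. as SEI user data or packet side data) and submit the
         * resulting prediction input to the ATLAS runtime. */
        (void)pkt;
    }

    int decode_packet(AVCodecContext *dec, const AVPacket *pkt, AVFrame *frame)
    {
        int ret;

        extract_h264_metrics(pkt);      /* metrics are known before decoding */

        ret = avcodec_send_packet(dec, pkt);
        if (ret < 0)
            return ret;

        while ((ret = avcodec_receive_frame(dec, frame)) >= 0)
            av_frame_unref(frame);      /* a real player would display here */

        return ret == AVERROR(EAGAIN) ? 0 : ret;
    }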
Supervisors: Michael Roitzsch, Hannes Weisbach
The K2 I/O scheduler was developed at the operating systems group with the goal of isolating the disk accesses of low-latency applications from interference by high-bandwidth applications, specifically on modern NVMe drives. Modern NVMe drives achieve their high bandwidth by having multiple disk accesses in flight concurrently. As a downside, any request may be delayed because the drive's firmware deems that request inopportune and postpones its fulfilment, causing latency spikes for some requests.
K2 mitigates this drive-internal scheduling by limiting the number of concurrent requests issued to the drive. However, limiting the number of requests in flight impacts total bandwidth negatively. Currently, the number of requests that can be in flight is a static, tunable parameter of the K2 I/O scheduler.
The goal of this task is to devise a mechanism that allows this parameter to be adapted at runtime. Additionally, at least one policy for updating the queue-length value should be devised.
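One conceivable starting point for such a policy is a simple feedback loop in the style of additive-increase/multiplicative-decrease; the sketch below is purely illustrative and does not reflect the actual K2 code base. The latency target, bounds, and function name are assumptions.

    /* Illustrative AIMD-style policy (not K2 code): shrink the in-flight
     * limit quickly when the observed request latency exceeds a target,
     * and grow it slowly otherwise. */
    enum { MIN_INFLIGHT = 1, MAX_INFLIGHT = 32 };

    unsigned int adapt_inflight_limit(unsigned int limit,
                                      unsigned int observed_lat_us,
                                      unsigned int target_lat_us)
    {
        if (observed_lat_us > target_lat_us) {
            limit /= 2;                       /* multiplicative decrease */
            if (limit < MIN_INFLIGHT)
                limit = MIN_INFLIGHT;
        } else if (limit < MAX_INFLIGHT) {
            limit += 1;                       /* additive increase */
        }
        return limit;
    }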
Supervisors: Michael Roitzsch, Hannes Weisbach
The K2 I/O scheduler was developed at the operating systems group with the goal of isolating the disk accesses of low-latency applications from interference by high-bandwidth applications, specifically on modern NVMe drives. To this end, K2 prioritises requests from real-time applications, with the consequence that real-time applications can currently starve background workloads.
The goal of this task is to devise a mechanism to control fairness in K2. As a second step, at least one fairness policy should be implemented on top of this mechanism.
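To make the idea concrete, one conceivable (and deliberately simple) fairness policy is a dispatch budget: after a bounded number of consecutive real-time requests, one background request is served even if real-time requests are still pending. The sketch below is illustrative only and independent of the actual K2 implementation; the burst size and names are assumptions.

    /* Illustrative dispatch policy (not K2 code): after RT_BURST
     * consecutive real-time requests, one background request is served
     * even if real-time requests are still pending, so background
     * workloads cannot be starved indefinitely. */
    #include <stdbool.h>

    enum { RT_BURST = 8 };

    struct dispatch_state { unsigned int rt_in_a_row; };

    /* Returns true if the next request should come from the real-time
     * queue, false if it should come from the background queue. */
    bool pick_realtime(struct dispatch_state *s,
                       bool rt_pending, bool bg_pending)
    {
        if (rt_pending && (!bg_pending || s->rt_in_a_row < RT_BURST)) {
            s->rt_in_a_row++;
            return true;
        }
        s->rt_in_a_row = 0;
        return false;
    }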
Supervisors: Michael Roitzsch, Hannes Weisbach
The recently published K2 I/O scheduler was developed at our group to reduce access latency on NVMe storage devices for critical workloads. While the original work was published in a real-time context, we believe the benefits of K2 also apply to other areas of computing where a critical but relatively low-bandwidth application has to contend with a non-critical, high-bandwidth application for the data rate and access latency of local storage.
We imagine Burst Buffer access in HPC systems or workload co-location in Cloud settings to be such contexts. One use of Burst Buffers is the local caching of an application checkpoint before it is written to a parallel file system, while the application continues to run. Co-location of I/O-bound and CPU-intensive workloads on the same machine can be used to boost utilisation and hence increase cost-effectiveness.
Depending on the type of thesis (Bachelor, Master, Diploma), one or more scenarios should be selected, and typical workloads chosen, implemented, and evaluated.
Supervisors: Michael Roitzsch, Hannes Weisbach
Applications in high-performance computing (HPC) usually comprise many processes distributed across a cluster of compute nodes. From previous research and experience in Cloud computing, we know that concurrently running multiple such applications can improve not only overall performance but also system utilisation and, in turn, reduce energy consumption.
We want to employ gang scheduling, a concept well known in real-time systems, as a distributed system-level scheduler to ensure performance and efficient resource sharing at the same time. Gang scheduling caters to the close interaction among the processes comprising an application by running all of them at the same time.
For scaling to large clusters, gang scheduling needs to utilise the precise clocks available on each compute node: all clocks are synchronised, and the central gang scheduler broadcasts a list of application processes and the time slots in which they are to run. Node-local schedulers then enforce this schedule based on their local clocks.
The goal of this task is to implement such a distributed gang scheduler. As HPC clusters usually employ predefined system software, the gang scheduler should run entirely in userspace. An important aspect of the task is to investigate different mechanisms for enacting a scheduling decision, e.g. signals, process groups, or Linux control groups.
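As an illustration of the signal-based variant, the sketch below stops and continues a whole process group at absolute slot boundaries on the synchronised clock. The slot parameters and process-group ID are assumed to be delivered by the central gang scheduler; the other mechanisms (e.g. the cgroup freezer) would replace the two kill() calls.

    /* Sketch of node-local slot enforcement via signals: the gang's
     * processes share one process group, which is resumed at the start of
     * its slot and stopped again at the end. Slot times and the
     * process-group ID are assumed to come from the central scheduler. */
    #include <signal.h>
    #include <sys/types.h>
    #include <time.h>

    void sleep_until(time_t sec)
    {
        struct timespec t = { .tv_sec = sec, .tv_nsec = 0 };
        /* Absolute sleep on the (synchronised) realtime clock. */
        clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &t, NULL);
    }

    /* Run the gang in process group 'pgid' from 'start' for 'len' seconds. */
    void run_slot(pid_t pgid, time_t start, time_t len)
    {
        sleep_until(start);
        kill(-pgid, SIGCONT);   /* resume every process of the gang */
        sleep_until(start + len);
        kill(-pgid, SIGSTOP);   /* preempt the gang at the slot boundary */
    }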
Supervisor: Jan Bierbaum
Virtualization
M³ is a new hardware/operating-system co-design that targets heterogeneous platforms and provides strong isolation between hardware tiles and software components. On the hardware side, M³ introduces a new hardware component, called the trusted communication unit (TCU), into each tile to isolate tiles from each other while supporting cross-tile communication channels. On the software side, M³ uses a microkernel design that places components on different tiles by default.
To share a core among multiple applications, M³ uses a core-local multiplexer, called PEMux, that builds upon lightweight multiplexing features of the TCU. However, PEMux is built for general-purpose cores with virtual-memory support. This thesis should investigate how a multiplexer can be built for a RISC-V-based accelerator that has a local scratchpad memory and no virtual-memory support, so that multiple applications can use such an accelerator simultaneously.
Supervisor: Nils Asmussen
High-performance architectures
Traditionally, user-space applications access device drivers through OS-provided abstractions (files, sockets, etc.). In contrast, OS-bypass techniques pass a raw device directly to the user-space application, so that the device can be accessed without involving the OS kernel. OS-bypass enables much higher request and data processing rates than the traditional socket- or file-based approaches. Frameworks like DPDK (for network controllers) and SPDK (for NVMe controllers) are typical examples of the OS-bypass technique. Unfortunately, these approaches are less flexible than traditional system-call-based device usage. For instance, passing a device to one user-space application may imply that no other application can use the same device.
Today, OS-bypass frameworks increase flexibility through new features implemented by the devices themselves (see RDMA networks). Such specialised devices are expensive and remain rigid, as hardware is hard to change. This project aims to work around the limitations of specialised hardware by combining the flexibility of system calls with the performance of OS-bypass, developing new mechanisms for fast system calls. As a potential effect, OS-bypass may become ubiquitous across a wider range of devices. Moreover, specialised devices may gain new features without the need for hardware modifications.
The student will be presented with an architecture for a fast system-call infrastructure and will be expected to realise, study, and evaluate it. The project can be implemented for both a monolithic kernel (Linux) and a microkernel (L4). The student is expected to have a good command of C programming and UNIX-like systems, including familiarity with the command line, an understanding of OS architecture, and experience with build systems. Prior experience with kernel programming is not necessary. We offer guidance, especially with concept development, kernel programming, debugging, and scientific writing.
Supervisors: Maksym Planeta, Till Miemietz