Topics for Student Theses
All topics in this list are available for Bachelor, Undergraduate, Diploma, and Master theses. We will discuss with you how to scale the topic appropriately. Your own topic proposals are also very welcome.
High-performance Architectures
Systems in high-performance computing (HPC) need to cater to the demanding nature of applications in this field. Traditional requirements include low latency networking, cache-optimal memory placement, and timely coordination between compute nodes in the system. More recently, energy considerations have also become a concern. We study the implications these requirements and the specialised HPC hardware have for system-level software.
Scheduling
The scheduling of processes or threads in different environments (e.g. realtime, distributed, or high-performance systems) is governed by a diverse set of optimisation goals: throughput, timeliness, energy conservation, quality-of-service guarantees, to name but a few. Our work in this area tries to incorporate new approaches like machine learning or explores new areas like disaggregated systems.
Applications in high-performance computing (HPC) usually comprise many processes distributed across a cluster of compute nodes. From previous research and experience in Cloud computing, we learn that concurrently running multiple such applications can improve not only overall performance but also system utilisation and, in turn, energy consumption.
We want to employ gang scheduling1, a concept well-known in real-time systems, as a distributed system-level scheduler to ensure performance and efficient resource sharing at the same time. Gang scheduling caters to the close interactions among the processes comprising an application by running all of them at the same time.
For scaling to large clusters, gang scheduling needs to utilise the precise clocks available on each compute node2: All clocks are synchronised, and the central gang scheduler broadcasts a list of application processes and time slots when they are to run. Node-local schedulers then enforce this schedule based on their local clocks.
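The clock-driven coordination described above can be illustrated with a small sketch (all names and the even time-slicing policy are illustrative assumptions, not the actual prototype's API): the coordinator broadcasts one schedule, and every node-local scheduler independently derives the same decision from its synchronised clock.

```python
# Toy model of clock-driven gang scheduling. The central coordinator
# broadcasts a periodic schedule; node-local schedulers pick the gang
# to run using only their local (synchronised) clock.
from dataclasses import dataclass

@dataclass
class Slot:
    gang: str      # application whose processes run in this slot
    start: float   # slot start, seconds (synchronised wall clock)
    end: float     # slot end

def broadcast_schedule(period: float, gangs: list, t0: float) -> list:
    """Central coordinator: divide one period evenly among the gangs."""
    slot_len = period / len(gangs)
    return [Slot(g, t0 + i * slot_len, t0 + (i + 1) * slot_len)
            for i, g in enumerate(gangs)]

def gang_to_run(schedule: list, period: float, now: float) -> str:
    """Node-local scheduler: enforce the broadcast schedule locally,
    without any further communication with the coordinator."""
    t = (now - schedule[0].start) % period  # position within the period
    for slot in schedule:
        if slot.start - schedule[0].start <= t < slot.end - schedule[0].start:
            return slot.gang
    return schedule[-1].gang

schedule = broadcast_schedule(period=0.1, gangs=["A", "B"], t0=1000.0)
# Two nodes agree on the running gang without talking to each other:
assert gang_to_run(schedule, 0.1, now=1000.02) == "A"
assert gang_to_run(schedule, 0.1, now=1000.07) == "B"
```

Because every node evaluates the same schedule against the same clock, the gangs switch in lockstep across the cluster; clock skew directly translates into overlap between adjacent slots.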
Previous work on the node-local schedulers investigated different mechanisms for enacting a scheduling decision from user space. This approach is necessary for easy deployment as HPC clusters usually employ predefined system software. The goal of this task is to extend the existing work into a distributed gang scheduler by refining the node-local schedulers and adding a central coordinator.
Supervisor: Jan Bierbaum
Footnotes
One recent trend in data center architectures is the disaggregation of resources. Instead of having to equip each machine in a cluster with all types of devices, resource disaggregation aims at concentrating certain classes of devices in specialized nodes that do not execute applications themselves, but only serve as a broker for the resources they offer. Deploying a disaggregated architecture can increase resource utilization and the overall performance of a data center.
As with traditional architectures, isolating a high-priority task with tight latency requirements but low bandwidth utilization from other applications' I/O noise is an interesting challenge in disaggregated data centers. The goal of this thesis is to investigate how applications with real-time constraints for resource accesses are impacted by background applications in a disaggregated setup, particularly if the high-priority application and the background tasks use the same resource service. The analysis should be done on FractOS (formerly Caladan) [1], a research prototype for a disaggregated data center infrastructure.
Depending on the type of thesis, the applicability of work-constraining scheduling, as demonstrated with the K2 I/O scheduler [2], should be evaluated for a resource service in a disaggregated environment.
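The core idea of work-constraining scheduling can be sketched as follows (a toy model under assumed names, not the K2 implementation): the dispatcher deliberately caps the number of requests outstanding at the device, so a latency-sensitive request only ever waits behind a short queue.

```python
# Toy model of K2-style work-constraining dispatch: unlike a
# work-conserving scheduler, we keep the device queue short on purpose,
# trading some throughput for bounded waiting time.
import heapq

class WorkConstrainingQueue:
    def __init__(self, max_inflight: int):
        self.max_inflight = max_inflight  # cap on outstanding requests
        self.inflight = 0
        self.pending = []  # min-heap ordered by (priority, arrival)
        self.seq = 0

    def submit(self, priority: int, req: str):
        """Lower priority value = more urgent."""
        heapq.heappush(self.pending, (priority, self.seq, req))
        self.seq += 1

    def dispatch(self) -> list:
        """Send requests to the device while staying under the cap."""
        sent = []
        while self.pending and self.inflight < self.max_inflight:
            _, _, req = heapq.heappop(self.pending)
            self.inflight += 1
            sent.append(req)
        return sent

    def complete(self):
        """Device finished one request; a slot frees up."""
        self.inflight -= 1

q = WorkConstrainingQueue(max_inflight=2)
for i in range(4):
    q.submit(priority=1, req=f"bg{i}")  # background bulk I/O
q.submit(priority=0, req="rt")          # latency-sensitive request
first = q.dispatch()  # only two requests reach the device, 'rt' first
```

In a disaggregated setting, the open question is where such a cap should be enforced: at the client, at the resource service, or at both.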
Supervisors: Till Miemietz, Michael Roitzsch
References:
[1] Lluís Vilanova, Lina Maudlej, Matthias Hille, Michael Roitzsch, Mark Silberstein: Caladan - A Distributed Meta-OS for Data Center Disaggregation. SPMA 2020.
[2] Till Miemietz, Hannes Weisbach, Michael Roitzsch, Hermann Härtig: K2 - Work-Constraining Scheduling of NVMe-Attached Storage. RTSS 2019.
System Architecture
Serverless computing (also referred to as Function-as-a-Service, FaaS) has become a common programming paradigm in cloud computing systems. In this paradigm, the cloud provider offers an ecosystem in which the customer's functionality is encapsulated in functions. These functions are invoked via the network, similar to RPCs.
Today's serverless platforms are based on virtual machines (Firecracker [1]) or containers (KNative, OpenFaaS, OpenWhisk [2-4]) running on Linux or special-purpose hypervisors. Improving on these solutions has become a major interest in the systems research community [5-7].
The goal of this work is to investigate a microkernel-based design for a serverless computing environment. At first, the implementation of existing platforms should be examined. Next, the essential components of such an environment should be identified and incorporated into a microkernel-based FaaS design. Finally, a prototype runtime should be implemented on top of the Fiasco microkernel.
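The invocation path at the heart of any such runtime can be sketched in a few lines (a deliberately minimal model with invented names; in a microkernel-based design, registry and function workers would live in separate protection domains communicating via IPC):

```python
# Minimal model of a FaaS invocation path: functions are registered by
# name and invoked RPC-style, with serialised requests and replies
# standing in for the network boundary of a serverless platform.
import json

class FaasRuntime:
    def __init__(self):
        self.functions = {}  # name -> callable

    def register(self, name, fn):
        self.functions[name] = fn

    def invoke(self, name: str, payload: str) -> str:
        """RPC-style entry point: decode request, run function,
        encode reply."""
        args = json.loads(payload)
        result = self.functions[name](args)
        return json.dumps(result)

rt = FaasRuntime()
rt.register("add", lambda a: {"sum": a["x"] + a["y"]})
reply = rt.invoke("add", json.dumps({"x": 2, "y": 3}))
```

Identifying which of these pieces (dispatch, sandboxing, state, networking) map onto existing microkernel abstractions is exactly the design question of this topic.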
Supervisors: Matthias Hille, Till Miemietz, Adam Lackorzynski
Footnotes
[1] - https://www.usenix.org/system/files/nsdi20-paper-agache.pdf
[2] - https://knative.dev
[3] - https://www.openfaas.com/
[4] - https://openwhisk.apache.org/
[5] - Isolating Functions at the Hardware Limit with Virtines - https://arxiv.org/abs/2104.11324
[6] - SEUSS: skip redundant paths to make serverless fast - https://dl.acm.org/doi/10.1145/3342195.3392698
[7] - Faastlane: Accelerating Function-as-a-Service Workflows - https://www.usenix.org/conference/atc21/presentation/kotni
Trusted Computing
Virtualization
Fault Tolerance and Fault Injection
Computers are usually assumed to always work correctly and predictably, as defined in the program code. While this is generally true, there are internal and external influences that affect this assumption. Internal influences include undetected production errors, which become more likely as chip feature sizes shrink; external influences include radiation particles that pass through the chips and lead, for example, to ionization in the transistors. Especially if these errors occur only very rarely and the system architecture is based on the assumption that exactly what is defined in the code will happen, severe consequences can result.
To counter this, we study the possibilities of software-implemented hardware fault tolerance (SIHFT), in which faults are detected with the help of software. This approach can also be integrated after the hardware has been developed and produced and is therefore applicable to commercial off-the-shelf hardware. Since redundant hardware is expensive to develop, SIHFT has many potential application areas, for example in CubeSats or cars.
In particular, we evaluate SIHFT methods with the fault-injection framework FAIL*, which can imitate hardware faults caused by radiation effects and systematically examine a program regarding its fault tolerance.
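A systematic single-bit fault-injection campaign of the kind FAIL* runs can be sketched as a toy model (purely illustrative; FAIL* instruments real binaries, not Python, and the tiny checksum here stands in for a SIHFT detection mechanism):

```python
# Toy fault-injection campaign: flip every bit of every input word,
# run the protected workload, and classify the outcome.

def protected_sum(data, checksum):
    """Workload with a simple software check: recompute the checksum
    and signal detection on mismatch (a stand-in for SIHFT)."""
    s = sum(data) & 0xFF
    if s != checksum:
        raise ValueError("fault detected")
    return s

def run_campaign(data):
    golden = sum(data) & 0xFF
    outcomes = {"correct": 0, "detected": 0, "sdc": 0}
    for idx in range(len(data)):
        for bit in range(8):
            faulty = list(data)
            faulty[idx] ^= 1 << bit  # inject exactly one bit flip
            try:
                result = protected_sum(faulty, golden)
                if result == golden:
                    outcomes["correct"] += 1
                else:
                    outcomes["sdc"] += 1  # silent data corruption
            except ValueError:
                outcomes["detected"] += 1
    return outcomes

outcomes = run_campaign([3, 7, 11])
```

In this toy setup the checksum catches every single-bit data flip; for real workloads the interesting result of such a campaign is precisely the fraction of runs ending in silent data corruption.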
In order to detect and correct single-bit errors in main memory, error correction code (ECC) is often used as a directly implemented hardware measure.
However, since 20 to 50 percent of the occurring memory faults affect more than a single bit, further software measures can be useful in safety-critical applications.
To make it possible to include memory with ECC in our evaluations, the goal of this task is to emulate the characteristics of ECC in our fault-injection framework FAIL*. It shall be determined how the model of ECC memory handles faults that affect more than a single bit and what the resulting fault model looks like; this model can then be used to control the fault injections.
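A plausible starting point for such a model is the behaviour of a SEC-DED code (single-error-correct, double-error-detect), sketched below; this is an assumption about the model to be built, not existing FAIL* code:

```python
# Sketch of a SEC-DED ECC fault model as it could drive fault
# injection: only faults the ECC does not mask reach the software.

def ecc_outcome(bits_flipped: int) -> str:
    """Classify a multi-bit fault within one ECC word under SEC-DED."""
    if bits_flipped == 0:
        return "no fault"
    if bits_flipped == 1:
        return "corrected"   # transparent to the software
    if bits_flipped == 2:
        return "detected"    # e.g. raises a machine-check exception
    # Three or more flips can alias to a valid or seemingly
    # single-bit-faulty codeword, so the fault may be miscorrected
    # or pass entirely unnoticed.
    return "undetected or miscorrected"

assert ecc_outcome(1) == "corrected"
assert ecc_outcome(2) == "detected"
```

The resulting fault model for the injection campaign would then only contain the multi-bit faults that fall into the last two classes.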
Supervisors: Robin Thunig, Horst Schirmeier
Miscellaneous
This category comprises topics that don't fit any of the other categories.
Ohua is a source-to-source compiler developed at Barkhausen Institute to simplify the isolation of multiple components that are part of a single application. Ohua takes monolithic code as input and generates componentized code that can use different isolation mechanisms. So far, we have implemented compiler backends for MMU-based isolation, as used by code running in separate address spaces.
Intel MPK (Memory Protection Keys) however is a more lightweight isolation mechanism: It works by quickly enabling and disabling access to different parts of the same virtual address space. Other works like FlexOS or CubicleOS already use this isolation mechanism.
The goal of this thesis is to package Intel MPK so that it is digestible by Ohua as an isolation backend. This entails the implementation of compartment setup within an address space using MPK as well as communication primitives between these compartments. A performance and security comparison with MMU-based isolation can show the relative advantages and disadvantages of these isolation mechanisms.
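The mechanism MPK provides can be illustrated with a small model (illustrative names only, not the planned Ohua backend API): all compartments share one address space, and a domain switch merely changes which protection keys are enabled, mimicking a PKRU write instead of a page-table switch.

```python
# Toy model of MPK-style compartments. Real MPK tags pages via
# pkey_mprotect() and gates access through the per-thread PKRU
# register; here a dictionary and a set stand in for both.

class MpkModel:
    def __init__(self):
        self.memory = {}   # address -> (protection key, value)
        self.pkru = set()  # keys enabled for the running compartment

    def alloc(self, addr, pkey, value):
        """Analogue of pkey_mprotect(): tag memory with a key."""
        self.memory[addr] = (pkey, value)

    def switch_compartment(self, allowed_keys):
        """Analogue of writing PKRU: cheap, no address-space change."""
        self.pkru = set(allowed_keys)

    def load(self, addr):
        pkey, value = self.memory[addr]
        if pkey not in self.pkru:
            raise PermissionError(f"protection key {pkey} not enabled")
        return value

m = MpkModel()
m.alloc(0x1000, pkey=1, value="component state")
m.switch_compartment({1})
assert m.load(0x1000) == "component state"
m.switch_compartment({2})  # another compartment: access now faults
try:
    m.load(0x1000)
    blocked = False
except PermissionError:
    blocked = True
```

An Ohua backend for MPK would have to generate exactly these pieces: the key assignment per compartment, the switching code at compartment boundaries, and communication primitives that cross them safely.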
Advisor: Michael Roitzsch