Topics for Student Theses
All topics in this list are available for Bachelor, Undergraduate, Diploma, and Master theses. We will discuss the scope with you to scale the topic appropriately. Your own topic proposals are also very welcome.
High-performance architectures
Remote Direct Memory Access (RDMA) networks offer significant performance benefits by exposing the device directly to the user application. On the one hand, RDMA networks remove the operating system from the critical path of communication. On the other hand, the interface from the application to the OS and from the application to the device becomes significantly more complicated. Compared to a traditional sockets-based API, such a design increases the attack surface of the OS and the underlying network infrastructure.
With the increasing adoption of RDMA networks in cloud settings, attacker models consider user applications to be potentially malicious. Therefore, the interface between the application and the RDMA network must be secure. Existing work has already shown fundamental problems in state-of-the-art RDMA network architectures, but has not yet studied the safety of the API itself.
The goal of this project is to characterise the attack surface created by the RDMA communication API and to design a fuzzing framework for RDMA networks that uses the network link layer and the user-level RDMA API as possible attack vectors. The project must produce a practical tool for identifying bugs in kernel-level drivers and/or RDMA protocol implementations.
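A mutation-based fuzzer over the user-level API could, as a rough sketch, randomly corrupt fields of submitted work requests before they reach the driver. The request layout and field names below are hypothetical stand-ins for the real verbs structures:

```python
import random

# Hypothetical, simplified stand-in for an RDMA work request; real verbs
# structures (e.g. send work requests) have many more fields.
FIELDS = ["opcode", "local_addr", "length", "lkey", "remote_addr", "rkey"]

def make_request():
    return {"opcode": 0, "local_addr": 0x1000, "length": 64,
            "lkey": 42, "remote_addr": 0x2000, "rkey": 99}

def mutate(request, rng):
    """Corrupt one randomly chosen field with a boundary or random value."""
    victim = rng.choice(FIELDS)
    mutated = dict(request)
    mutated[victim] = rng.choice([0, 2**32 - 1, 2**64 - 1, rng.getrandbits(64)])
    return mutated

def fuzz(submit, iterations=1000, seed=0):
    """Feed mutated requests to `submit` and collect the ones that misbehave."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        req = mutate(make_request(), rng)
        try:
            submit(req)          # in a real tool: post to the device/driver
        except Exception:
            crashes.append(req)  # record the offending input for triage
    return crashes

# Toy target that rejects out-of-range lengths, mimicking a driver check.
def toy_submit(req):
    if req["length"] > 2**20:
        raise ValueError("length exceeds transfer limits")

found = fuzz(toy_submit)
```

A real framework would replace `toy_submit` with actual request submission and add crash triage, but the mutate-submit-record loop stays the same.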
Contact: Maksym Planeta
Scheduling
The ATLAS scheduler1 uses metrics to predict task execution time. However, it is conceivable to use the same or a similar predictor to derive other application- and workload-related measures such as cache behavior, memory accesses, or energy consumption.
The thesis should experiment with the ATLAS predictor and analyse its applicability and accuracy when predicting these other measures.
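ATLAS fits a linear model mapping workload metrics to execution time via least squares; in principle, the same machinery can be retargeted at another measure such as energy. A minimal sketch follows, where the metrics and numbers are purely illustrative:

```python
# Least-squares predictor in the spirit of ATLAS: execution time (or any
# other measure, e.g. energy) is modelled as a linear function of per-job
# workload metrics. Metric names and numbers are illustrative.

def fit(samples, targets):
    """Solve the normal equations X^T X w = X^T y via Gaussian elimination."""
    n = len(samples[0])
    xtx = [[sum(s[i] * s[j] for s in samples) for j in range(n)] for i in range(n)]
    xty = [sum(s[i] * t for s, t in zip(samples, targets)) for i in range(n)]
    for col in range(n):                       # forward elimination
        pivot = max(range(col, n), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for row in range(col + 1, n):
            f = xtx[row][col] / xtx[col][col]
            for k in range(col, n):
                xtx[row][k] -= f * xtx[col][k]
            xty[row] -= f * xty[col]
    w = [0.0] * n                              # back substitution
    for row in reversed(range(n)):
        w[row] = (xty[row] - sum(xtx[row][k] * w[k]
                                 for k in range(row + 1, n))) / xtx[row][row]
    return w

def predict(weights, metrics):
    return sum(wi * m for wi, m in zip(weights, metrics))

# Example: metrics = (input size, iteration count); target = energy in mJ.
history = [(100.0, 4.0), (200.0, 8.0), (150.0, 2.0)]
energy  = [12.0, 24.0, 13.0]
w = fit(history, energy)
```

The thesis would replace the made-up energy numbers with measurements (e.g. from RAPL counters) and evaluate the resulting prediction error.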
Supervisor: Michael Roitzsch, Till Smejkal, Jan Bierbaum
Applications in high-performance computing (HPC) usually comprise many processes distributed across a cluster of compute nodes. From previous research and experience in Cloud computing, we learn that concurrently running multiple such applications can improve overall performance and system utilisation and, in turn, reduce energy consumption.
We want to employ gang-scheduling1 , a concept well-known in real-time systems, as a distributed system-level scheduler to ensure performance and efficient resource sharing at the same time. Gang scheduling caters to the close interactions among the processes comprising an application by running all of them at the same time.
For scaling to large clusters, gang scheduling needs to utilise the precise clocks available on each compute node2: All clocks are synchronised, and the central gang scheduler broadcasts a list of application processes and the time slots in which they are to run. Node-local schedulers then enforce this schedule based on their local clocks.
Previous work on the node-local schedulers investigated different mechanisms for enacting a scheduling decision from user space. This approach is necessary for easy deployment as HPC clusters usually employ predefined system software. The goal of this task is to extend the existing work into a distributed gang scheduler by refining the node-local schedulers and adding a central coordinator.
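The central broadcast plus local enforcement scheme can be illustrated with a small model; the schedule format below is an assumption for this sketch, not the interface of the existing prototype:

```python
# Toy model of clock-driven gang scheduling: the coordinator broadcasts one
# table of time slots, and every node-local scheduler derives, purely from
# its (synchronised) local clock, which gang must run right now.
# The schedule format is illustrative, not the real prototype's interface.

SLOT_US = 1000  # slot length in microseconds

def build_schedule(gangs):
    """Round-robin table: slot i belongs to gang i mod len(gangs)."""
    return {"slot_us": SLOT_US, "gangs": list(gangs)}

def gang_at(schedule, local_clock_us):
    """Node-local decision: map the synchronised clock to the running gang."""
    slot = (local_clock_us // schedule["slot_us"]) % len(schedule["gangs"])
    return schedule["gangs"][slot]

sched = build_schedule(["lammps", "gromacs", "nas-bt"])
# Because all clocks agree, every node independently picks the same gang
# for the same instant, so the gang's processes run simultaneously.
```

The central coordinator only has to distribute `sched`; no per-slot messages are needed, which is what makes the approach scale.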
Supervisor: Jan Bierbaum
Applications in high-performance computing (HPC) usually comprise multiple processes that follow the bulk synchronous parallel (BSP)1 model: During its runtime, each process alternates between phases of independent, parallel computation and phases of communication and synchronisation. Due to the synchronisation phases, a single slow process (straggler) can delay the whole computation. BSP thus simplifies programming at the expense of overall performance and system utilisation.
The ATLAS scheduler2 uses metrics to predict the execution time of applications. A metric is any value from the application domain that correlates with the amount of work to be performed. Using this predictor with an HPC application may make it possible to identify stragglers in advance. With this information we can either assign more (performant) compute resources to these slow processes or save energy by slowing down the other processes.
In this thesis, the ATLAS predictor should be integrated with one or two simple HPC applications, e.g. from the NAS Parallel Benchmarks3. The student should identify suitable metrics within the application(s) and evaluate the prediction accuracy.
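Given per-process work metrics and a fitted time predictor, straggler detection could look roughly like this; the predictor, threshold, and numbers are placeholders:

```python
# Sketch of straggler detection in a BSP phase: predict each process's
# compute-phase duration from a workload metric and flag those expected
# to finish noticeably later than the rest. Numbers are illustrative.

def predict_time(metric):
    # Placeholder linear predictor, e.g. fitted ATLAS-style from history.
    return 0.5 * metric + 2.0

def find_stragglers(metrics_by_rank, slack=1.10):
    """Flag ranks predicted to exceed the median runtime by factor `slack`."""
    predicted = {rank: predict_time(m) for rank, m in metrics_by_rank.items()}
    times = sorted(predicted.values())
    median = times[len(times) // 2]
    return sorted(r for r, t in predicted.items() if t > slack * median)

# Rank 3 got a disproportionately large chunk of work this phase.
ranks = {0: 100, 1: 102, 2: 98, 3: 180}
stragglers = find_stragglers(ranks)
```

In the thesis, the metric would come from the benchmark itself (e.g. per-rank grid size in a NAS kernel) and the prediction accuracy would be evaluated against measured phase times.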
Supervisor: Jan Bierbaum
The ATLAS scheduler [1] uses metrics to predict task execution time. So far, ATLAS has been applied to local scheduling of workloads on individual machines. However, it is conceivable to apply the same ideas of metric- and deadline-based scheduling to distributed scenarios.
The thesis should apply the ATLAS scheduling concepts to an application running distributed across multiple machines. Especially interesting are interactive cloud applications, where a tight end-to-end deadline is associated with each user-initiated request. Breaking down these end-to-end deadlines into individual scheduling obligations for the participating machines is one of the interesting challenges.
FractOS (formerly Caladan) [2] is a research prototype for a distributed execution environment that can be used as a substrate for the experiments.
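One naive way to split an end-to-end deadline, sketched below, is to distribute it proportionally to each stage's predicted execution time; the stages and predictions are made up for this example:

```python
# Sketch: decompose an end-to-end request deadline into per-machine
# sub-deadlines, proportional to each stage's predicted execution time.
# Stage names and predicted times are illustrative.

def split_deadline(deadline_ms, predicted_ms):
    """Return absolute sub-deadlines as offsets from request arrival."""
    total = sum(predicted_ms.values())
    budgets = {stage: deadline_ms * t / total
               for stage, t in predicted_ms.items()}
    sub_deadlines, elapsed = {}, 0.0
    for stage, budget in budgets.items():  # stages run in pipeline order
        elapsed += budget
        sub_deadlines[stage] = elapsed     # each machine gets its own deadline
    return sub_deadlines

stages = {"frontend": 2.0, "database": 6.0, "render": 4.0}
subs = split_deadline(12.0, stages)
```

A more refined scheme might reserve slack for communication between machines or redistribute budget at runtime when a stage finishes early; exploring such policies is part of the challenge.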
Supervisor: Michael Roitzsch
References:
[1] Michael Roitzsch, Stefan Wächtler, Hermann Härtig: ATLAS – Look-Ahead Scheduling Using Workload Metrics. RTAS 2013.
[2] Lluís Vilanova, Lina Maudlej, Matthias Hille, Nils Asmussen, Michael Roitzsch, Mark Silberstein: Caladan – A Distributed Meta-OS for Data Center Disaggregation. SPMA 2020.
One recent trend in data center architectures is the disaggregation of resources. Instead of having to equip each machine in a cluster with all types of devices, resource disaggregation aims at concentrating certain classes of devices in specialized nodes that do not execute applications themselves, but only serve as brokers for the resources they offer. Deploying a disaggregated architecture can increase resource utilization and the overall performance of a data center.
As with traditional architectures, isolating a high-priority task with tight latency requirements but low bandwidth utilization from other applications' I/O noise is an interesting challenge in disaggregated data centers. The goal of this thesis is to investigate how applications with real-time constraints for resource accesses are impacted by background applications in a disaggregated setup, particularly if the high-priority application and the background tasks use the same resource service. The analysis should be done on FractOS (formerly Caladan) [1], a research prototype for a disaggregated data center infrastructure.
Depending on the type of thesis, the applicability of work-constraining scheduling, as demonstrated with the K2 I/O scheduler [2], should be evaluated for a resource service in a disaggregated environment.
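K2's core idea, bounding the number of in-flight background requests to protect latency-critical ones, can be sketched as a simple admission check; the depth limit and request model below are placeholders:

```python
# Sketch of work-constraining scheduling in the spirit of K2: background
# requests are only dispatched while the device's in-flight queue depth
# stays below a bound, keeping device queues short and thus latency low
# for high-priority requests. The limit and request model are illustrative.

class WorkConstrainingQueue:
    def __init__(self, depth_limit=8):
        self.depth_limit = depth_limit
        self.in_flight = 0

    def try_dispatch(self, high_priority):
        """High-priority requests always go; background only under the bound."""
        if high_priority or self.in_flight < self.depth_limit:
            self.in_flight += 1
            return True
        return False  # background request stays queued in software

    def complete(self):
        self.in_flight -= 1

q = WorkConstrainingQueue(depth_limit=2)
accepted_bg = [q.try_dispatch(False) for _ in range(4)]  # only two admitted
accepted_hp = q.try_dispatch(True)                       # always admitted
```

Deliberately leaving device capacity unused ("work-constraining" rather than work-conserving) trades some background throughput for bounded latency; whether this trade-off carries over to a shared resource service in a disaggregated setting is exactly what the thesis should evaluate.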
Supervisors: Till Miemietz, Michael Roitzsch
References:
[1] Lluís Vilanova, Lina Maudlej, Matthias Hille, Michael Roitzsch, Mark Silberstein: Caladan - A Distributed Meta-OS for Data Center Disaggregation. SPMA 2020.
[2] Till Miemietz, Hannes Weisbach, Michael Roitzsch, Hermann Härtig: K2 - Work-Constraining Scheduling of NVMe-Attached Storage. RTSS 2019.
System Architecture
Traditionally, user-space applications access device drivers through OS-provided abstractions (files, sockets, etc.). In contrast, the OS-bypass technique passes a raw device directly to the user-space application, such that the device can be accessed without involving the OS kernel. OS-bypass enables much higher request and data processing rates than the traditional socket- or file-based approaches. Frameworks like DPDK (for network controllers) and SPDK (for NVMe controllers) are typical examples of the OS-bypass technique. Unfortunately, these approaches are less flexible than traditional system-call-based device usage. For instance, passing a device to one user-space application may imply that no other application can use the same device.
Today, OS-bypass regains flexibility through new features implemented by the devices themselves (see RDMA networks). However, such specialized devices are expensive and remain rigid, as hardware is hard to change. This project aims to work around the limitations of specialized hardware by combining the flexibility of system calls with the performance of OS-bypass, developing new mechanisms for fast system calls. As a result, OS-bypass may become viable for a wider range of devices, and specialized devices may gain new features without the need for hardware modifications.
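The flavour of a fast submission path can be illustrated with a shared ring buffer through which an application posts requests without a classic per-request system call. This is a conceptual sketch of the general idea only, not the architecture the student will receive:

```python
# Conceptual sketch: a single-producer/single-consumer ring buffer as the
# application/kernel submission channel. The application enqueues requests
# without trapping per request; a kernel-side worker drains the ring.
# Illustrates the general idea only, not the project's actual design.

class SubmissionRing:
    def __init__(self, size=8):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.slots = [None] * size
        self.mask = size - 1
        self.head = 0  # next slot the consumer reads
        self.tail = 0  # next slot the producer writes

    def submit(self, request):
        """Producer side (application): non-blocking enqueue."""
        if self.tail - self.head == len(self.slots):
            return False  # ring full; a real design might now trap or spin
        self.slots[self.tail & self.mask] = request
        self.tail += 1
        return True

    def drain(self):
        """Consumer side (kernel worker): take all pending requests."""
        out = []
        while self.head != self.tail:
            out.append(self.slots[self.head & self.mask])
            self.head += 1
        return out

ring = SubmissionRing(size=4)
for i in range(5):
    ring.submit({"op": "write", "seq": i})  # the fifth submit is rejected
pending = ring.drain()
```

Real designs along these lines (e.g. Linux's io_uring) add memory barriers, doorbells, and completion queues; the point of the sketch is only that request submission need not cost a kernel entry each time.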
The student will be presented with an architecture for a fast system call infrastructure and is expected to realise, study, and evaluate it. The project can be implemented for both a monolithic kernel (Linux) and a microkernel (L4). The student is expected to have a good command of C programming and UNIX-like systems, including familiarity with the command line, an understanding of OS architecture, and experience with build systems. Prior experience with kernel programming is not necessary. We offer guidance, especially with concept development, kernel programming, debugging, and scientific writing.
Supervisors: Maksym Planeta, Till Miemietz
In recent years, CPU vendors have added several new CPU modes for specific purposes. Current CPUs therefore no longer have only an unprivileged and a privileged mode, but also a hypervisor mode and special modes for Intel SGX, AMD SEV, ARM TrustZone, etc. All these modes are defined in hardware and must be taken as a given by the operating system and applications. This development raises the question of whether such CPU modes could be defined in software instead. What if the operating system could freely define CPU modes and the restrictions that come with them? What if even applications could define new modes to sandbox parts of their code?
This thesis should start to explore this direction by designing and implementing a system with software-defined CPU modes. The implementation should be based on the gem5 hardware simulator and extend an existing ISA by such software-defined modes. The evaluation should study the performance implications of basic operations such as setting up modes and switching between them; if time permits, it should also include application-level benchmarks that take advantage of these features.
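Conceptually, a software-defined mode could be modelled as a named set of permissions plus explicit switch rules. The sketch below illustrates the concept only; the thesis would realise it in hardware semantics via gem5, and all names here are invented:

```python
# Conceptual model of software-defined CPU modes: each mode is a named set
# of permitted operations, and mode switches are only allowed along edges
# the defining software declared. Purely illustrative of the idea.

class ModeTable:
    def __init__(self):
        self.perms = {}     # mode -> set of allowed operations
        self.edges = set()  # (from_mode, to_mode) pairs allowed to switch

    def define(self, mode, permissions):
        self.perms[mode] = set(permissions)

    def allow_switch(self, src, dst):
        self.edges.add((src, dst))

    def may(self, mode, operation):
        return operation in self.perms.get(mode, set())

    def switch(self, current, target):
        if (current, target) not in self.edges:
            raise PermissionError(f"switch {current} -> {target} forbidden")
        return target

# An application sandboxes a parser: it may read memory but not issue I/O.
t = ModeTable()
t.define("app", {"read", "write", "io"})
t.define("sandbox", {"read"})
t.allow_switch("app", "sandbox")
t.allow_switch("sandbox", "app")
mode = t.switch("app", "sandbox")
```

In the envisioned system, `may` would correspond to a hardware permission check on every relevant operation, which is why the cost of mode setup and switching is the central evaluation question.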
Contact: Nils Asmussen, Michael Roitzsch
Trusted Computing
Virtualization
Others
Code analysis tools help programmers understand source code; for large projects such tools are indispensable. Unfortunately, existing tools either do not provide a visual overview of the relevant code structure, lack interactivity for live analysis, or sacrifice clarity by displaying too much information.
The goal of this project is to develop a tool that allows the programmer to introspect the code of a large repository. The key idea is to offer the programmer a flexible domain-specific language (DSL) for querying the source code. With such a language, the programmer can precisely specify which parts of the source code are shown and how the resulting code is presented.
Allowing the visualisation to deviate from the actual code structure is of key importance, because the logical organisation sometimes deviates from the source layout. Representing code literally may therefore unnecessarily clutter the visualisation and complicate code comprehension.
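As a rough illustration of the query idea, a tiny predicate-based "DSL" over Python's own syntax tree might select functions by property. The query vocabulary here is invented purely for this sketch:

```python
# Tiny illustration of a code-query DSL: predicates select functions from a
# parsed syntax tree, so the tool shows only what the query asks for.
# The query vocabulary is invented purely for this sketch.
import ast

SOURCE = '''
def fetch(url): ...
def parse(data): ...
def _internal_cache_flush(): ...
'''

def functions(tree):
    return [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]

def name_matches(prefix):
    return lambda fn: fn.name.startswith(prefix)

def public():
    return lambda fn: not fn.name.startswith("_")

def query(source, *predicates):
    """Return the names of all functions satisfying every predicate."""
    tree = ast.parse(source)
    return [fn.name for fn in functions(tree)
            if all(p(fn) for p in predicates)]

visible = query(SOURCE, public())
```

The actual project would target a richer code model (call graphs, types, multiple languages) and couple the query results to an interactive visualisation, but the query-then-render pipeline is the same.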
Short demo: Code analysis demo
Supervisor: Maksym Planeta