Jan 30, 2026; Defence
Echtzeit-AG: Analysis of Fault Tolerance in the Components of the L4Re Operating System
BBB-Link: https://bbb.tu-dresden.de/b/mat-xin-oyh-xzn
Presentation Language: German
Modern software development emphasizes minimizing bugs and mitigating malicious inputs. However, hardware is often assumed to function correctly and is rarely included in testing campaigns. In reality, environmental factors such as radiation can compromise hardware components like main memory. Specialized fault-tolerant hardware exists but is expensive and often sacrifices performance gains from semiconductor scaling. This motivates software-based approaches to hardware fault tolerance, where software detects and recovers from hardware faults. L4Re is a microkernel-based operating system framework designed for domains where security, safety, and predictability are critical. Safety-critical applications in automotive, space, and aviation demand high reliability to prevent undefined or unsafe behavior, making robust fault handling essential for L4Re. While prior work has analyzed the fault tolerance of the L4Re kernel, this study focuses on its runtime environment.
Specifically, I examine Moe, the initial user-space server in L4Re, which is responsible for memory allocation, logging, and access to the boot file system. The goal is to identify the data structures that account for most of Moe’s vulnerability to hardware faults, in order to guide future hardening efforts. To achieve this, I developed reusable tools and procedures that build on fault injection results from the FAIL* framework. Based on the results of the analysis, I hardened one key data structure to improve Moe’s resilience. This modification reduced Moe’s vulnerability by up to 50%, depending on the workload.