Sep 20, 2024; Defence
Echtzeit-AG
Comparing Multi-Bit EDAC in Hardware and Software
In this work, we add simulation of Error Correcting Codes (ECC) to an existing, successful fault injection tool. We also create tooling to sample and inject multi-bit faults efficiently. Together, these allow us to provide what is, to our knowledge, the first direct comparison between the hardening performance of ECC and that of a Software-Implemented Hardware Fault Tolerance (SIHFT) method. To support this comparison, we determine how to achieve comparability between different program and bit-fault configurations. The comparison is valuable despite the largely non-generalizable and outdated information available on error rates: we show that even under widely differing assumptions about fault probabilities, we can still draw meaningful conclusions about a method's effectiveness.
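To make concrete what simulating ECC under multi-bit faults involves, the following is a minimal sketch of a single-error-correcting, double-error-detecting (SECDED) code on a 4-bit toy word. It is an assumption for illustration only: the abstract does not state which code is simulated, real ECC memory typically protects much wider words, and the names `encode`, `decode`, and `flip` are hypothetical and not taken from the tool.

```python
# Toy SECDED code: Hamming(7,4) plus one overall parity bit.
# Assumption: illustrative only, not the code used in the thesis.
DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)    # Hamming positions that carry parity bits


def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of bits)."""
    code = [0] * 8  # index 0: overall parity, indices 1..7: Hamming(7,4)
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 8):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    for pos in range(1, 8):
        code[0] ^= code[pos]
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'detected'."""
    code = list(code)
    syndrome = 0
    overall = 0
    for pos in range(8):
        if code[pos]:
            overall ^= 1
            syndrome ^= pos  # position 0 does not change the syndrome
    if overall == 1:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 means the overall parity bit itself was hit.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even number of flips with a non-zero syndrome: uncorrectable.
        status = "detected"
    else:
        status = "ok"
    nibble = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions flipped."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out
```

In this sketch every single-bit flip is corrected and every double-bit flip is detected, but three flips can be silently miscorrected: `decode(flip(encode(0), [1, 2, 3]))` reports `"corrected"` while returning the wrong data value. This is the kind of behavior that makes the share of multi-bit faults matter for a comparison.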
Our fault model focuses on word-local (multi-)bit faults. That is, when a fault event happens, all affected bits lie within the boundary of one word, where a word is as large as the minimum access width of the system. With the direct comparison that our work enables, we find that common ECC memory performs better when more than 55% of fault events are word-local single-bit faults; under our fault model, the remaining fault events are word-local multi-bit faults. Below that threshold, the tested SIHFT method, a differential addition checksum, is preferable because it handles larger multi-bit faults better on average.
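The fault model and the checksum idea can be sketched as follows. This is a hedged illustration, not the thesis tooling: the 32-bit word width, the uniform choice of how many bits a multi-bit fault flips, and the names `sample_word_local_fault` and `ChecksummedArray` are all assumptions. The sketch shows detection only and does not model faults that hit the checksum itself.

```python
import random

WORD_BITS = 32                  # assumed minimum access width
MASK = (1 << WORD_BITS) - 1


def sample_word_local_fault(num_words, p_single, rng):
    """Sample one fault event as (word index, XOR mask).

    With probability p_single exactly one bit flips; otherwise between
    2 and WORD_BITS bits flip (uniform multiplicity is an assumption).
    All flipped bits stay inside the chosen word.
    """
    word = rng.randrange(num_words)
    flips = 1 if rng.random() < p_single else rng.randint(2, WORD_BITS)
    xor_mask = 0
    for bit in rng.sample(range(WORD_BITS), flips):
        xor_mask |= 1 << bit
    return word, xor_mask


class ChecksummedArray:
    """Words protected by an addition checksum modulo 2**WORD_BITS."""

    def __init__(self, values):
        self.words = [v & MASK for v in values]
        self.checksum = sum(self.words) & MASK

    def write(self, index, value):
        """Store a value and update the checksum differentially in O(1)."""
        value &= MASK
        self.checksum = (self.checksum + value - self.words[index]) & MASK
        self.words[index] = value

    def verify(self):
        """Recompute the sum over all words and compare with the checksum."""
        return (sum(self.words) & MASK) == self.checksum


# Usage: inject one sampled fault and check that it is noticed.
rng = random.Random(0)
array = ChecksummedArray(range(16))
word, xor_mask = sample_word_local_fault(len(array.words), 0.55, rng)
array.words[word] ^= xor_mask   # fault bypasses write(), as a bit flip would
fault_detected = not array.verify()
```

Under these assumptions a fault confined to one data word always changes the sum modulo 2**WORD_BITS, so `verify()` fails regardless of how many bits were flipped. The value 0.55 passed as `p_single` is just a parameter here, chosen to mirror the threshold reported above.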
(Master's Thesis Defense)