MOARD: Modeling Application Resilience to Transient Faults on Data Objects

02/13/2021
by   Luanzheng Guo, et al.
0

Understanding application resilience (or error tolerance) in the presence of hardware transient faults on data objects is critical to ensure computing integrity and enable efficient application-level fault tolerance mechanisms. However, we lack a method and a tool to quantify application resilience to transient faults on data objects. The traditional method, random fault injection, cannot help, because of losing data semantics and insufficient information on how and where errors are tolerated. In this paper, we introduce a method and a tool (called MOARD) to model and quantify application resilience to transient faults on data objects. Our method is based on systematically quantifying error masking events caused by application-inherent semantics and program constructs. We use MOARD to study how and why errors in data objects can be tolerated by the application. We demonstrate tangible benefits of using MOARD to direct a fault tolerance mechanism to protect data objects.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset