Fault Tolerance

Designers often overlook fault tolerance during the initial development of a system solution. As a result, solutions that were initially shown to be effective can become impractical once their resilience is taken into account. Researchers at KrakOS aim to account for fault tolerance in the solutions they propose. We have identified approaches for incorporating fault tolerance into each of the three aforementioned research axes.


Virtual Machines (fault tolerance, flexibility, and performance objectives). Observability is the practice of monitoring the execution of a system. It serves several purposes, including crash detection (e.g., mysqld_safe detecting that a MySQL server has stopped unexpectedly and restarting it), hang detection, intrusion detection, and performance degradation detection. Observation is generally performed by auxiliary tasks [Jing2022] that run in a different execution flow from the system being observed. In the following, we refer to the observing flow as the "Observer" and to the system being observed as the "Observed."
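To make the crash-detection case concrete, the following minimal Python sketch shows an Observer that runs the Observed as a child process and restarts it after each abnormal exit. It is only an illustration of the general pattern, not how mysqld_safe itself is implemented; the function name supervise, the max_restarts parameter, and the crashing test command are assumptions introduced here.

```python
import subprocess
import sys


def supervise(cmd, max_restarts=3):
    """Run cmd as the Observed and restart it after each abnormal exit.

    Returns the list of exit codes seen, one per launch.
    (Hypothetical helper, introduced for illustration only.)
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        # The Observed runs in its own process, so its crash does not
        # terminate the Observer (this loop).
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean exit: nothing to recover from
    return exit_codes


if __name__ == "__main__":
    # Hypothetical Observed: a child that always exits with status 1.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crashing, max_restarts=2))
```

The Observer here only sees the exit status of the Observed. Detecting hangs or performance degradation would require access to the internal state of the Observed, which is where the isolation dilemma discussed next arises.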


Creating Observers raises a significant dilemma. The Observer and the Observed must reside in distinct fault domains to prevent fault propagation. The Observer must be able to continue its execution even if the Observed experiences a fault or corruption, and vice versa. At the same time, the Observer must have easy access to the state of the Observed to carry out its observation task. This dilemma has been the focus of various research efforts in non-virtualized environments, where it was demonstrated that Process and Thread abstractions are not adequate for this purpose. In 2022, Jing et al. [Jing2022] introduced the Orbit abstraction to address these needs.
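The tension between isolation and state access can be illustrated with a small Python sketch built on ordinary processes. It is not the Orbit abstraction of [Jing2022]; it only shows that a process boundary keeps the Observer alive when the Observed crashes, at the price that the Observer sees only the state the Observed explicitly publishes. The sketch assumes a POSIX system where the "fork" start method is available; the names observed, observe, and progress are hypothetical.

```python
import multiprocessing as mp
import os

# Assumption: a POSIX system, where the "fork" start method exists.
ctx = mp.get_context("fork")


def observed(progress):
    """Hypothetical Observed: publishes its state, then crashes."""
    progress.value = 42  # state explicitly shared with the Observer
    os._exit(1)          # abrupt termination, no cleanup


def observe():
    """Observer: survives the crash and reads the last published state."""
    progress = ctx.Value("i", 0)  # integer placed in shared memory
    child = ctx.Process(target=observed, args=(progress,))
    child.start()
    child.join()
    # Only the explicitly shared value is visible here; the rest of the
    # address space of the Observed is out of reach for the Observer.
    return child.exitcode, progress.value


if __name__ == "__main__":
    print(observe())
```

Any state that the Observed does not place in the shared region stays invisible to the Observer, which is one face of the dilemma described above.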


The observability of VMs raises challenges that solutions proposed for non-virtualized environments cannot resolve, because VMs and applications have different characteristics. First, a VM is an execution unit that can host multiple applications. Embedding both Observers and Observed within the same VM, using the abstractions proposed in the literature for non-virtualized environments, would therefore be ineffective: a crash of the VM would bring down all Observers with it.


A recent paper published at DSN [Ding2023] proposes dedicating a VM to observation. Each VM is associated with a VMResponder, a VM responsible for observing it. While this approach meets the isolation requirement between Observer and Observed, it is potentially inefficient because observing the target VM relies on mechanisms with a high performance cost. Moreover, it wastes resources: each user VM requires a second VM for observation, which consumes CPU, memory, disk, network, etc.