Resilience goals for Oak
This page is an effort to clarify the concept of resilience and related terms, and to define goals for Oak in that respect.
Resilience
Resilience refers to the ability to withstand, contain and recover from failures.
A single failure refers to a single component failing at any given time, while multiple failures mean that more than one component may fail at the same time.
To withstand a failure means to stay operational and sufficiently responsive while the failure occurs.
To contain a failure means that its adverse effects do not spread beyond their initial scope, i.e. there is no collateral damage.
To recover from a failure means undoing the impact caused by the failure (either automatically or by manual intervention) and returning to normal operation.
The impact of a failure roughly falls into one of six levels, where each level is worse than its predecessor:
- (0) no impact at all,
- (1) temporary degradation with automatic recovery,
- (2) temporary degradation that needs manual intervention for recovery,
- (3) temporary outage with automatic recovery,
- (4) temporary outage that needs manual intervention for recovery,
- (5) complete outage that needs rebuilding from scratch.
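For illustration only, these levels can be made explicit in code. The following enum is a hypothetical sketch; the name `FailureImpact` and its members are not part of Oak's API.

```java
/**
 * Hypothetical enumeration of the failure impact levels described above.
 * Not part of Oak's API; shown only to make the ordering explicit.
 */
public enum FailureImpact {
    NONE,                       // (0) no impact at all
    DEGRADED_AUTO_RECOVERY,     // (1) temporary degradation, automatic recovery
    DEGRADED_MANUAL_RECOVERY,   // (2) temporary degradation, manual intervention needed
    OUTAGE_AUTO_RECOVERY,       // (3) temporary outage, automatic recovery
    OUTAGE_MANUAL_RECOVERY,     // (4) temporary outage, manual intervention needed
    COMPLETE_OUTAGE;            // (5) complete outage, rebuild from scratch

    /** Lower ordinal means lower (i.e. better) impact. */
    public boolean isWorseThan(FailureImpact other) {
        return ordinal() > other.ordinal();
    }
}
```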
Goals for Oak
Oak should be resilient against single failures such that complete outages (level 5) do not occur. Oak is not resilient against multiple failures, though; sufficient redundancy needs to be built into the system to cope with those.
Failures and their impact
- Temporary outage of database connection
  - For no more than a few seconds: impact <= 1. Automatic recovery once the database connection is back (see the retry sketch after this list).
  - For more than a few seconds: impact <= 3. Automatic recovery once the database connection is back.
- Resource exhaustion
- Out of disk space: impact <= 4. Providing more disk space should be sufficient for recovery.
- Out of memory: impact <= 4. Providing more memory should be sufficient for recovery.
- Network / disk bandwidth saturated: impact <= 2. Automatic recovery once sufficient bandwidth is provided.
- Small-scale data corruption (e.g. bit flip on disk, network, memory)
- On primary data (e.g. document, segment, data store, ...): impact <= 2. Repairing the corrupted data should be sufficient for recovery.
- On secondary (derived) data (e.g. index, ...): impact <= 1. Secondary data should automatically be recreated once corruption has been detected.
- Large-scale data corruption (e.g. corrupt data unit like file, document, index, ...)
- On primary data: impact <= 4. Repairing the corrupted data should be sufficient for recovery.
- On secondary data: impact <= 3. Secondary data should automatically be recreated once corruption has been detected.
- Hardware failure (e.g. disk, CPU, or memory breaking down): impact <= 4. Repairing the hardware, and restoring from backup in the case of a disk loss, should be sufficient for recovery.
- Software failure (database, Oak, JVM, OS, ... process crash): impact <= 4. Restarting the crashed process should be sufficient for recovery.
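As referenced in the first item of the list above, withstanding a brief outage of the database connection essentially means retrying the affected operation until the connection is back. The following is a minimal sketch of such a bounded retry with backoff; the class `RetryOnConnectionLoss`, the method `callWithRetry` and its parameters are hypothetical and not Oak's actual recovery logic, and a real implementation would catch only connection-related exceptions.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative sketch only: retry a repository operation a bounded number of
 * times while the database connection is down, so that a short outage shows
 * up as degraded latency (impact 1) rather than as an error to the caller.
 */
public final class RetryOnConnectionLoss {

    public static <T> T callWithRetry(Callable<T> operation, int maxAttempts, long backoffMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {          // in practice: catch the connection-related exception only
                if (attempt >= maxAttempts) {
                    throw e;                 // outage lasted too long: escalate to the caller
                }
                TimeUnit.MILLISECONDS.sleep(backoffMillis * attempt);  // linear backoff between attempts
            }
        }
    }
}
```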
Example Tests
- Out of disk space while writing: using a low disk quota, write until the quota is exceeded and the repository can no longer write, then kill the process, re-open the repository, and check that it works.
- Power failure while writing: kill a running Docker container while writing, and check if re-opening the repository works.
- Out of memory while writing: in a separate thread, slowly eat memory (one KB at a time, for example) until the repository stops working, then kill the process and try to re-open the repository (a sketch of such a memory eater is shown below).
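The memory-eating thread from the last test could look roughly as follows. This is a sketch under the assumptions stated in the test description (a separate thread allocating one KB at a time); the class `MemoryEater` and the 10 ms pacing are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch: slowly consume heap memory from a separate thread,
 * one kilobyte at a time, to drive the repository into an out-of-memory
 * situation for the test above.
 */
public final class MemoryEater implements Runnable {
    private final List<byte[]> hoard = new ArrayList<>();

    @Override
    public void run() {
        try {
            while (true) {
                hoard.add(new byte[1024]);   // eat one KB at a time
                Thread.sleep(10);            // slowly, so the repository fails mid-operation
            }
        } catch (InterruptedException | OutOfMemoryError e) {
            // stop eating; the test then kills the process and re-opens the repository
        }
    }
}

// Usage (hypothetical): new Thread(new MemoryEater()).start();
```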