In this guide, we walk you through how to extend Xinda to support a new fault injection method, a new benchmark, or a new distributed system. This guide can also serve as a checklist for future development.
```
🏠 xinda
 ┣ ...
 ┣ main.py (main entry of Xinda)
 ┣ 📈 data-analysis/ (data analysis scripts)
 ┣ 🔧 tools/ (a collection of binaries and utilities)
 ┗ xinda (source files of Xinda)
    ┣ 🤖️ systems
    ┃  ┣ container.yaml (container meta info)
    ┃  ┣ TestSystem.py (base class for distributed systems)
    ┃  ┣ hbase.py (hbase support based on TestSystem)
    ┃  ┣ crdb.py (crdb support based on TestSystem)
    ┃  ┗ ...
    ┗ 📒 configs (main Xinda configurations)
       ┣ benchmark.py (benchmark configs)
       ┣ logging.py (data collection configs)
       ┣ slow_fault.py (slow fault attributes)
       ┣ tool.py (path and configs for binaries and utilities)
       ┗ reslim.py (base class to enforce resource limits)
```
The tree above shows the main structure and modules of Xinda. Let's go through them one by one:
- `main.py` is the main entry of Xinda. All configurations, including the system under test, the benchmarks to be run, and slow-fault attributes, are passed through here.
- `data-analysis/` contains scripts to parse and analyze the runtime logs and stats collected by Xinda.
- `tools/` contains the required binaries and utilities.
- `container.yaml` records the legal container names of all supported distributed systems. It is used as a sanity check to ensure the user input of `--fault_location` is valid.
- `TestSystem` is the base class that implements the test pipeline for all distributed systems. It already provides basic functions (see the sketch after this list for how they compose into a run):
  - `docker_up()` and `docker_down()` to bring up/down the cluster
  - `blockade_up()` and `blockade_down()` to init/shut down Blockade (which is used to inject network-related slow faults)
  - `charybdefs_up()` and `charybdefs_down()` to init/shut down CharybdeFS (which is used to inject filesystem-related slow faults)
  - `inject()` to inject slow faults at a specific time, location, and severity level, wait for a duration, and then clear the fault
  - `info()` to log INFO-level messages
  - `cleanup()` to ensure a clean state for the next test (e.g., garbage-collecting Docker containers, Blockade/CharybdeFS instances, etc.)
- `Benchmark` and its subclasses are used to pass benchmark-related configurations to the system.
- `Logging` records the paths to the runtime logs and stats that we want to collect. System-specific logs are also implemented here.
- `SlowFault` is the base class that records slow-fault attributes, like fault type, location, start time, duration, etc.
- `Tool` records the paths to and configs of important binaries and utilities (stored in `tools/`) that Xinda uses, like the Blockade binary, the YCSB binary, etc.
- `ResourceLimit` is the base class that defines CPU/memory limits for each container instance. It can be used as a reference if you want to control other conditions in the future.
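To make the pipeline concrete, here is a minimal sketch of how these pieces compose into one test run. The class name `Hbase`, the constructor arguments, and the import path are assumptions based on the tree above, not the actual Xinda API; consult `main.py` and `xinda/systems/hbase.py` for the real interfaces.

```python
# Hypothetical end-to-end run composing the TestSystem primitives above.
# Class name, import path, and constructor arguments are assumptions;
# see main.py and xinda/systems/hbase.py for the real signatures.
from xinda.systems.hbase import Hbase  # assumed module layout per the tree

system = Hbase(
    benchmark=...,   # a Benchmark (sub)class instance, built from main.py flags
    slow_fault=...,  # a SlowFault instance: fault type, location, start time, duration
    logging=...,     # a Logging instance: where runtime logs and stats go
)

system.docker_up()       # bring up the cluster containers
system.blockade_up()     # init Blockade for a network-related slow fault
system.inject()          # inject at the configured time/location/severity,
                         # wait for the duration, then clear the fault
system.blockade_down()   # shut down Blockade
system.docker_down()     # bring down the cluster
system.cleanup()         # ensure a clean state for the next test
```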
Suppose we want to support a new fault injection tool. The following modules should be adapted:
- `TestSystem`: shell commands to initialize the tool, inject faults, clear faults, and gracefully shut down the tool should be added. This means we should implement at least two new functions, `tool_up()` and `tool_down()`, for initialization and shutdown. We also need to modify the `inject()` function to invoke the proper `cmd_inject` and `cmd_clear` commands (see the sketch after this list).
- (Optional) `Logging`: if needed, log paths of the new tool should be added. However, fault injection tools usually do not generate useful logs for our analysis.
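Below is a minimal sketch of the two new hooks and the `inject()` change, shown as a self-contained class rather than a patch to the real `TestSystem`. The tool CLI (`newtool`), the attribute names (`self.slowfault`, `self.cmd_inject`, `self.cmd_clear`), and the timing logic are all placeholders.

```python
import subprocess
import time

class TestSystem:
    # ... existing docker_up()/docker_down(), blockade_up()/blockade_down(), etc.

    def tool_up(self):
        """Initialize the new fault injection tool ("newtool" is a placeholder CLI)."""
        subprocess.run(["newtool", "start"], check=True)

    def tool_down(self):
        """Gracefully shut down the tool."""
        subprocess.run(["newtool", "stop"], check=True)

    def inject(self):
        """Inject a slow fault, wait for its duration, then clear it.

        Attribute names below are assumptions standing in for the real
        SlowFault fields and the tool-specific command strings.
        """
        time.sleep(self.slowfault.start_time)                     # wait until injection time
        subprocess.run(self.cmd_inject, shell=True, check=True)   # e.g. "newtool slow ..."
        time.sleep(self.slowfault.duration)                       # keep the fault active
        subprocess.run(self.cmd_clear, shell=True, check=True)    # restore normal operation
```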
Suppose we want to support a new benchmark in system `DummySys`. At least the following modules should be adapted:
- `main.py` and `Benchmark`: add new benchmark flags and configurations.
- `Logging`: add a new path to record the runtime logs of the benchmark.
- `tools/` and `Tool`: update the binaries and utilities of the benchmark.
- `DummySys.py`: in the `DummySys` class, we need to add function support to bring up the new benchmark, run it with different parameters, collect its logs, and shut it down (see the sketch after this list). Let's take YCSB in etcd as an example, where we have implemented the following functions:
  - `_load_ycsb()` and `_run_ycsb()` to initialize and run YCSB workloads. Benchmark configs are passed here, benchmark binaries or utilities from `Tool` and `tools/` are invoked here, and runtime logs are redirected to the path in `Logging`.
  - (Optional) `_wait_till_ycsb_ends()` to wait until the benchmark ends. Some benchmarks can be configured to run for a specific duration and thus do not need this function.
- `main.py`: new benchmark flags and configurations should be added to the argument parser.
- `data-analysis`: a new benchmark log parser should be implemented here. The parser is mostly a collection of regular expressions to extract useful information like timestamp, throughput, latency, etc.
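Here is a minimal sketch of the two pieces above: a benchmark runner inside `DummySys` and a log-parsing regular expression for `data-analysis`. Every name in it (`_run_dummybench()`, `self.tool.bench_bin`, `self.log.bench_log`, the import path, and the sample log format) is a hypothetical stand-in for the real YCSB-style helpers, not Xinda's actual code.

```python
import re
import subprocess

from xinda.systems.TestSystem import TestSystem  # assumed import path per the tree

class DummySys(TestSystem):
    def _run_dummybench(self):
        """Run the benchmark binary and redirect its output to the Logging path."""
        with open(self.log.bench_log, "w") as log:                    # path assumed from Logging
            subprocess.run(
                [self.tool.bench_bin, "run", "--threads", "16"],      # binary assumed from Tool
                stdout=log, stderr=subprocess.STDOUT, check=True,
            )

# data-analysis: extract (timestamp, throughput) pairs from a hypothetical
# log line such as "2024-01-01 00:00:10 ops/sec: 1234.5".
STAT_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*ops/sec:\s*(?P<tput>[\d.]+)"
)

def parse_bench_log(path):
    with open(path) as f:
        return [(m["ts"], float(m["tput"]))
                for line in f if (m := STAT_LINE.search(line))]
```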
Suppose we want to support a new system named `DummySys`. At least the following modules should be adapted:
- `DummySys.py`: we should create a new `DummySys` class that inherits from `TestSystem`. In this class, we need to implement functions to initialize the system and the fault injection tool, load and then run the benchmark, inject faults, collect logs, and gracefully shut everything down (see the skeleton after this list). We have provided a few examples in the implementations of existing systems, including Cassandra, HBase, CRDB, etcd, Hadoop, and Kafka.
- `tools/` and `Tool`: update the binaries (if needed) of the new system. We also need to add a workable `docker-compose` file under `tools/docker-DummySys`, similar to what we have done for docker-hbase, docker-etcd, etc.
- `main.py`: create an instance of the new system with all configuration flags in the main function.
- `container.yaml`: add the new system's container names to the container list for sanity checks.
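To tie these steps together, here is a skeleton of what `DummySys.py` could look like. The helper names (`_load_bench()`, `_run_bench()`, `_collect_logs()`), the `test()` driver, and the import path are illustrative assumptions; the existing `hbase.py` and `crdb.py` are the authoritative templates.

```python
from xinda.systems.TestSystem import TestSystem  # assumed import path per the tree above

class DummySys(TestSystem):
    """Skeleton for a new system; bodies and helper names are placeholders."""

    def _load_bench(self):
        ...  # load the benchmark dataset (binaries come from Tool / tools/)

    def _run_bench(self):
        ...  # run the workload; redirect output to the path in Logging

    def _collect_logs(self):
        ...  # copy system-specific logs to the Logging destinations

    def test(self):
        """One end-to-end run: bring up, load, inject, run, collect, tear down."""
        self.docker_up()        # inherited: start the docker-DummySys compose cluster
        self.blockade_up()      # or charybdefs_up(), depending on the fault type
        self._load_bench()
        self.inject()           # inherited: inject at time/location/severity, then clear
        self._run_bench()
        self._collect_logs()
        self.blockade_down()
        self.docker_down()
        self.cleanup()          # inherited: ensure a clean state for the next test
```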