Development Guide

In this guide, we walk through how to extend Xinda to support a new fault injection method, a new benchmark, or a new distributed system. This guide can serve as a checklist for future development.

Basics

🏠 xinda
  ┣ ...
  ┣ main.py                 (main entry of Xinda)
  ┣ 📈 data-analysis/       (data analysis scripts)
  ┣ 🔧 tools/               (a collection of binaries and utilities)
  ┣ xinda                   (source files of Xinda)
    ┣ 🤖️ systems
      ┣ container.yaml      (container meta info)
      ┣ TestSystem.py       (base class for distributed systems)
      ┣ hbase.py            (hbase support based on TestSystem)
      ┣ crdb.py             (crdb support based on TestSystem)
      ┣ ...
    ┣ 📒 configs            (main Xinda configurations)
      ┣ benchmark.py        (benchmark configs)
      ┣ logging.py          (data collection configs)
      ┣ slow_fault.py       (slow fault attributes)
      ┣ tool.py             (path and configs for binaries and utilities)
      ┣ reslim.py           (base class to enforce resource limits)

The tree above shows the main structure and modules of Xinda. Let's go through them one by one:

  • main.py is the main entry of Xinda. All configurations, including the system under test, the benchmarks to be run, and slow-fault attributes, are passed through here

  • data-analysis/ contains scripts to parse and analyze the runtime logs and stats collected by Xinda

  • tools/ contains the required binaries and utilities

  • container.yaml records legal container names of all supported distributed systems. It is used as a sanity check to ensure the user input of --fault_location is valid

  • TestSystem is the base class to implement the test pipeline of all distributed systems. It has already implemented basic functions like:

    • docker_up() and docker_down() to bring up/down the cluster
    • blockade_up() and blockade_down() to init/shut down Blockade (which is used to inject network-related slow faults)
    • charybdefs_up() and charybdefs_down() to init/shut down CharybdeFS (which is used to inject filesystem-related slow faults)
    • inject() to inject slow faults at a specific time, location and severity level, wait for a duration, and then clear the fault
    • info() to log INFO-level messages
    • cleanup() to ensure a clean state for the next test (e.g., garbage-collecting docker containers, Blockade/CharybdeFS instances, etc.)
  • Benchmark and its subclasses are used to pass benchmark-related configurations to the system

  • Logging records the path to runtime logs and stats that we want to collect. System-specific logs are also implemented here

  • SlowFault is the base class to record slow-fault attributes, like fault type, location, start time, duration, etc.

  • Tool records the path to and configs of important binaries and utilities (stored in tools/) that are used by Xinda, like the Blockade binary, YCSB binary, etc.

  • ResourceLimit is the base class to define CPU/memory limits of each container instance. It can be used as a reference if you want to control other conditions in the future
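The modules above can be pictured with a minimal sketch of the TestSystem base class. The method names mirror those listed above, but the constructor signature, attributes, and internals are assumptions for illustration, not the actual Xinda implementation:

```python
# Hypothetical sketch of the TestSystem base class; the real class in
# xinda/systems/TestSystem.py will differ in signatures and details.
import subprocess
import time


class TestSystem:
    """Base class implementing the common test pipeline for all systems."""

    def __init__(self, name: str, compose_file: str):
        self.name = name                  # system name, e.g. "hbase"
        self.compose_file = compose_file  # docker-compose file under tools/

    def info(self, msg: str) -> None:
        # Log an INFO-level message.
        print(f"[INFO] [{self.name}] {msg}")

    def docker_up(self) -> None:
        # Bring up the cluster defined in the docker-compose file.
        subprocess.run(
            ["docker-compose", "-f", self.compose_file, "up", "-d"],
            check=True)

    def docker_down(self) -> None:
        # Tear the cluster down.
        subprocess.run(
            ["docker-compose", "-f", self.compose_file, "down"],
            check=True)

    def inject(self, cmd_inject: list, cmd_clear: list,
               start_time: float, duration: float) -> None:
        # Wait until the scheduled start time, inject the slow fault,
        # hold it for `duration` seconds, then clear it.
        time.sleep(start_time)
        subprocess.run(cmd_inject, check=True)
        time.sleep(duration)
        subprocess.run(cmd_clear, check=True)

    def cleanup(self) -> None:
        # Ensure a clean state for the next test.
        self.docker_down()
```

Subclasses such as hbase.py and crdb.py build on these primitives and add the system-specific pieces.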

1. Support a New Fault Injection Tool

The following modules should be adapted:

  • TestSystem: shell commands to initialize the tool, inject faults, clear faults, and gracefully shut down the tool should be added. This means we should implement at least two new functions, tool_up() and tool_down(), for initialization and shutdown. We also need to modify the inject() function to invoke the proper cmd_inject and cmd_clear commands

  • tools/ and Tool: add the new tool's binaries under tools/ and register their paths and configs in Tool

  • (Optional) Logging: if needed, log paths of the new tool should be added. However, fault injection tools usually do not generate useful logs for our analysis
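The steps above can be sketched as a small extension to TestSystem. Everything here is hypothetical: "slowtool", its CLI flags, and the path under tools/ are made-up placeholders standing in for whatever the new tool actually provides:

```python
# Hypothetical sketch of wiring a new fault-injection tool into TestSystem.
# "slowtool" and all of its flags are illustrative, not a real binary.
import subprocess


class SlowToolSupport:
    # In Xinda, this path would be registered in the Tool config instead.
    SLOWTOOL_BIN = "tools/slowtool/slowtool"

    def slowtool_up(self) -> None:
        # Initialize the tool before the workload starts.
        subprocess.run([self.SLOWTOOL_BIN, "init"], check=True)

    def slowtool_down(self) -> None:
        # Gracefully shut the tool down after the test.
        subprocess.run([self.SLOWTOOL_BIN, "shutdown"], check=True)

    def build_inject_cmds(self, location: str, severity: str):
        # Build the cmd_inject / cmd_clear pair that inject() would invoke.
        cmd_inject = [self.SLOWTOOL_BIN, "inject",
                      "--target", location, "--severity", severity]
        cmd_clear = [self.SLOWTOOL_BIN, "clear", "--target", location]
        return cmd_inject, cmd_clear
```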

2. Support a New Benchmark

Suppose we want to support a new benchmark in system DummySys. At least the following modules should be adapted:

  • main.py and Benchmark: add new benchmark flags and configurations

  • Logging: add new path to record runtime logs of the benchmark

  • tools/ and Tool: update binaries and utilities of the benchmark

  • DummySys.py: in the DummySys class, we need to add functions to bring up the new benchmark, run it with different parameters, collect its logs, and shut it down. Let's take YCSB in etcd as an example. We have implemented the following functions:

    • _load_ycsb() and _run_ycsb() to initialize and run YCSB workloads. Benchmark configs will be passed here. Benchmark binaries or utilities from Tool and tools/ will be invoked here. Runtime logs will be redirected to the path in Logging
    • (Optional) _wait_till_ycsb_ends() to wait until the benchmark ends. Some benchmarks can be configured to run for a specific duration and thus do not need this function
  • main.py: new benchmark flags and configurations should be added to the argument parser

  • data-analysis: a new benchmark log parser should be implemented here. The parser is mostly a collection of regular expressions to extract useful information like timestamp, throughput, latency, etc.
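For the data-analysis step, a parser is mostly a set of regular expressions over the benchmark's runtime log. The sketch below is hypothetical: the log line format shown is illustrative and does not claim to match any benchmark's real output:

```python
# Hypothetical benchmark log parser for data-analysis/. The regex assumes
# an illustrative status-line format like:
#   "10 sec: 1234 operations; 123.4 current ops/sec; [UPDATE: ..., Avg=456.7]"
import re

LINE_RE = re.compile(
    r"(?P<ts>\d+)\s+sec:.*?"             # elapsed time (seconds)
    r"(?P<tput>[\d.]+)\s+current ops/sec"  # instantaneous throughput
    r".*?\[UPDATE.*?Avg=(?P<lat>[\d.]+)"   # average update latency
)


def parse_log(text: str) -> list:
    """Extract (timestamp, throughput, latency) records from a log dump."""
    records = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            records.append({
                "timestamp": int(m.group("ts")),
                "throughput": float(m.group("tput")),
                "latency": float(m.group("lat")),
            })
    return records
```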

3. Support a New System

Suppose we want to support a new system named DummySys. At least the following modules should be adapted:

  • DummySys.py: we should create a new DummySys class inherited from TestSystem. In this class, we need to implement functions to initialize the system and the fault injection tool, load and then run the benchmark, inject faults, collect logs, and gracefully shut down everything. We have provided a few examples of implementing existing systems, including Cassandra, HBase, CRDB, etcd, Hadoop, and Kafka

  • tools/ and Tool: update binaries (if needed) of the new system. We also need to add a workable docker-compose file under tools/docker-DummySys, similar to what we have done for docker-hbase, docker-etcd, etc.

  • main.py: create an instance of the new system with all configuration flags in the main function

  • container.yaml: add the new system's container names to the container list used for sanity checks
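Putting it together, the pipeline a new system class implements looks roughly like the sketch below. In Xinda, DummySys would inherit TestSystem (which provides docker_up(), blockade_up(), inject(), cleanup(), etc.); here those helpers are replaced with stubs that record the call order, so the sketch is runnable in isolation. All names and the exact step order are illustrative assumptions:

```python
# Hypothetical end-to-end sketch of DummySys.py. In the real code base this
# class inherits TestSystem; the stub methods below stand in for the helpers
# that TestSystem and the benchmark support would provide.
class DummySys:
    def __init__(self):
        self.calls = []  # records the pipeline steps, for illustration

    # --- stubs standing in for TestSystem / benchmark helpers ---
    def docker_up(self):        self.calls.append("docker_up")
    def blockade_up(self):      self.calls.append("blockade_up")
    def inject(self):           self.calls.append("inject")
    def cleanup(self):          self.calls.append("cleanup")
    def _load_benchmark(self):  self.calls.append("load")
    def _run_benchmark(self):   self.calls.append("run")

    def test(self):
        # The typical pipeline every system class implements.
        self.docker_up()        # 1. bring up the cluster (docker-compose)
        self.blockade_up()      # 2. init the fault-injection tool
        self._load_benchmark()  # 3. load the workload data
        self._run_benchmark()   # 4. run the benchmark; logs go to Logging paths
        self.inject()           # 5. inject the slow fault, wait, clear it
        self.cleanup()          # 6. garbage-collect containers and tools
```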