The master provides an api to orchestrate fault injections to the chaos bots. Using the chaos API you have access to a number of possible fault injection and to an automatic failure recovery mechanism
- Start by defining a ‘steady state’.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Inject failures that reflect real world events.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
At this point we should add one more stage
- Recover fast to the ‘steady state’.
The master and bots focus on two of the stages of chaos, Injection of failures and recovery to a steady state
wget https://github.com/SotirisAlfonsos/chaos-master/releases/download/v0.0.2/chaos-master-0.0.2.linux-amd64.tar.gz
tar -xzf chaos-master-0.0.2.linux-amd64.tar.gz./chaos-master --config.file=path/to/config.ymlSee examples of the file in the config/example folder.
# Contain the configuration for the port and scheme of the api.
# The deafault values are port: 8080 and scheme: http
api_options:
port: 8090
scheme: http
# Contain the definition of all enabled failures.
# Each failure injection needs to be defined in a job together with the targets that are in scope
jobs:
# The unique name of the job. The character ',' is not allowed
- job_name: "docker failure injection"
# The type of the failure. Can be [Docker, Service, CPU, Server, Network]
type: "Docker"
# The name of the target component. Only applicable to Docker and Service failure types
component_name: "nginx"
# The list of targets for which is this failure can be applied
targets: ['host1:8081', 'host2:8081']
- job_name: "network injection"
type: "Network"
targets: ['host1:8081', 'host3:8081']
# Contains the tls configuration for the communication with the bots.
# If not specified will default to http
# If specified the traffic to the bots will be https
# You can only provide a peer token if the traffic is https
bots:
# CA certificate
ca_cert: "config/test/certs/ca-cert.pem"
# The pub cert for the connection with the bot
public_cert: "config/test/certs/server-cert.pem"
# peer token for authorization with the bot. A public cert needs to also be provided
peer_token: 30028dd6-a641-4ac3-91d8-1e214ac5e6f6
# Contains the configuration for the healthcheck towards the bots
health_check:
# If set to active the master with send a healthcheck request to the bots every 1 minute
active: false
# If set to active the status of the healthcheck will be reported in application log (stderr)
report: falseSee the api specification after starting the master at <host>/chaos/api/v1/swagger/index.html
- Define the scope of your experiments. Failure types are scoped to specific targets and components.
- For the example config above
thedockerfailure is scoped to thenginxcontainers in the targets'host1:8081', 'host2:8081'
thenetworkfailure is scoped to targets'host1:8081', 'host3:8081'
- For the example config above
- Start a chaos bots in each target specified in your jobs
- For the example config above
we would have to start 3 bots. one onhost1, one onhost2and one onhost3, all on port8081
- For the example config above
- [Optional] Ensure that you have monitoring and alerting in place. Add the recover endpoint as a webhook in case of an alert, to quickly revert all running failures
- Make the first API call to inject a failure
- For the example config above
curl -ss -X POST "http://127.0.0.1:8090/chaos/api/v1/docker?action=kill" \ -H "Content-Type: application/json" \ -d '{"job": "docker failure injection", "containerName": "nginx", "target": "host1:8081"}'
- For the example config above
| Chaos master | Chaos mesh | Chaos toolkit | Gremlin | |
|---|---|---|---|---|
| Run experiments as API calls | x | x | ||
| Run experiments as json | x | x | ||
| Automatic recovery | x | x | ||
| Steady state definition | x | |||
| Status checks | x | x | ||
| Plugable failures | x | |||
| Kubernetes failures | x | x | ||
| Container failures | x | x | x | |
| Service failures | x | |||
| Server failures | x | x | ||
| netem chaos | x | x | x | |
| CPU burn | x | x | x | |
| IO chaos | x | x | ||
| Memory burn | x | x | ||
| Kernel chaos | x | |||
| dns chaos | x | x | ||
| Experiment results | x | |||
| Open source | x | x | x | |
| Free | x | x | x |