Kif is a simple monitoring program that detects programs running amok and tries to correct them. It can currently scan for:
- Memory usage (MB, GB or % of total mem available)
- No. of open file descriptors
- No. of TCP connections open
- No. of LAN TCP connections open
- Age of process
- State of process (running, waiting, zombie etc)
and act accordingly, either by running a custom command (such as restarting a service) or by killing the process with a signal of your choice. It can also notify you of issues found and actions taken, via email or HipChat.
See kif.yaml for example configuration and features.
Requirements:
- python 3.6 or higher
- python-yaml
- python-psutil
- asfpy
Installation:
- Download Kif
- Make a kif.yaml configuration (see the example)
- Install the dependencies with:
  pip3 install -r requirements.txt
  (or use pipenv)
- Run as root (required both to read usage and to restart services).
- Enjoy!
To install on an infra node, add the following yaml snippet to it:
    pipservice:
      kif:
        tag: master

    rules:
        apache:
            description: 'sample apache process rule'
            # We can specify the exact cmdline and args to scan for:
            procid:
                - '/usr/sbin/apache2'
                - '-k'
                - 'start'
            # We'll use combine: true to combine the resource of multiple processes into one check.
            combine: true
            triggers:
                # Demand no more than 500 LAN connections
                maxlocalconns: 500
                # No more than 25,000 open connections in total
                maxconns: 25000
                # Require < 1GB memory used (could also be 10%, 512mb etc)
                maxmemory: 1gb
                # And finally, no more than 65,000 open file descriptors
                maxfds: 65000
            # If triggered, run this:
            runlist:
                - 'service apache2 restart'
        zombies:
            description: 'Any process caught in zombie mode'
            # use empty procid to catch all
            procid: ''
            triggers:
                # This can be any process state (zombie, sleeping, running, etc)
                state: 'zombie'
            # No runlist here, just kill it with signal 9
            kill: true
            killwith: 9
        puppet:
            description: 'kill -9 puppet agents that are hanging'
            procid: 'puppet agent'
            # Find all processes created more than 1 day ago.
            triggers:
                maxage: 1d
            # Ignore main process
            ignorepidfile: '/var/run/puppet/agent.pid'
            # Kill it with signal 9
            kill: true
            killwith: 9
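
The limits in the example rules mix plain numbers (500), sizes (1gb, 512mb), percentages (10%) and ages (1d, 30m). The sketch below shows one way such values could be normalised and compared against measurements. It is only an illustration: the helper names (parse_memory, parse_age, violated), the binary unit sizes, and the shape of the measured dict are all assumptions, and Kif's own parsing may differ.

```python
import re

# Hypothetical unit tables; whether Kif treats "gb" as 1024**3 or 1000**3 is an assumption.
_MEM_UNITS = {"kb": 1024, "mb": 1024**2, "gb": 1024**3}
_AGE_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_memory(limit, total_bytes):
    """Return a byte count for limits such as '1gb', '512mb' or '10%'."""
    text = str(limit).strip().lower()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(%|kb|mb|gb)", text)
    if not match:
        raise ValueError(f"unrecognised memory limit: {limit!r}")
    value, unit = float(match.group(1)), match.group(2)
    if unit == "%":
        # Percentages are relative to the total memory available.
        return int(total_bytes * value / 100)
    return int(value * _MEM_UNITS[unit])


def parse_age(limit):
    """Return a number of seconds for ages such as '30m' or '1d'."""
    match = re.fullmatch(r"(\d+)\s*([smhd])", str(limit).strip().lower())
    if not match:
        raise ValueError(f"unrecognised age limit: {limit!r}")
    return int(match.group(1)) * _AGE_UNITS[match.group(2)]


def violated(triggers, measured, total_bytes):
    """List the numeric max* triggers whose limits the measured values exceed."""
    hits = []
    for name, limit in triggers.items():
        if not name.startswith("max"):
            continue  # e.g. 'state' is a string comparison, not handled here
        if name == "maxmemory":
            threshold = parse_memory(limit, total_bytes)
        elif name == "maxage":
            threshold = parse_age(limit)
        else:
            threshold = int(limit)
        if measured.get(name, 0) > threshold:
            hits.append(name)
    return hits


# Triggers copied from the apache rule; the measurements are made up.
triggers = {"maxlocalconns": 500, "maxconns": 25000, "maxmemory": "1gb", "maxfds": 65000}
measured = {"maxlocalconns": 120, "maxconns": 300, "maxmemory": 2 * 1024**3, "maxfds": 900}
print(violated(triggers, measured, total_bytes=8 * 1024**3))  # prints ['maxmemory']
```

Converting every limit to a plain number first keeps the comparison itself trivial, which is why the sketch separates parsing from checking.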
To have a specific rule run only on certain nodes, add the rule to kif.yaml and use the host_must_match or host_must_not_match definitions to narrow down where the rule set runs, like so:
    zombies_on_gitbox:
        description: 'Any gitweb process caught in zombie mode'
        host_must_match: gitbox.apache.org
        procid: '/usr/bin/git'
        triggers:
            # This can be any process state (zombie, sleeping, running, etc)
            # Or a git process > 30 minutes old.
            state: 'zombie'
            maxage: 30m
        kill: true
        killwith: 9
    httpd_but_not_tlpserver:
        description: 'httpd too many backend connections (pool filling up?)'
        host_must_not_match: 'tlp-.+'
        procid: '/usr/sbin/apache2'
        # Use combine: true to combine the resource of multiple processes into one check.
        combine: true
        triggers:
            maxlocalconns: 1000
        runlist:
            - 'service apache2 restart'
Both host_must_match and host_must_not_match are regular expressions and must match the full hostname.
Be sure to use double escaping for keywords, for instance \\d instead of \d, or the YAML will break. The must/must-not definitions can also be combined to include some nodes and rule out others.
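
As a rough illustration of the full-hostname matching described here, the sketch below uses Python's re.fullmatch to decide whether a rule applies to a host. The function name rule_applies and the hostname tlp-us-east.apache.org are hypothetical, and Kif's real matching code may be written differently.

```python
import re


def rule_applies(rule, hostname):
    """Decide whether a rule should run on this host (hypothetical helper)."""
    must = rule.get("host_must_match")
    must_not = rule.get("host_must_not_match")
    # The pattern has to cover the entire hostname, not just a part of it.
    if must is not None and not re.fullmatch(must, hostname):
        return False
    if must_not is not None and re.fullmatch(must_not, hostname):
        return False
    return True


gitbox_only = {"host_must_match": "gitbox.apache.org"}
not_tlp = {"host_must_not_match": "tlp-.+"}
# Both definitions combined: any apache.org host except the tlp-* ones.
combined = {"host_must_match": r".+\.apache\.org", "host_must_not_match": "tlp-.+"}

print(rule_applies(gitbox_only, "gitbox.apache.org"))      # True
print(rule_applies(not_tlp, "tlp-us-east.apache.org"))     # False
print(rule_applies(combined, "gitbox.apache.org"))         # True
print(rule_applies(combined, "tlp-us-east.apache.org"))    # False
```

A rule with neither definition applies everywhere, which matches the earlier examples that carry no host restriction.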
--debug: Run in debug mode (detect issues, but don't try to fix them).
--config $filename: Path to the config file.
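
For readers who want to see how two such flags are typically wired up, here is a minimal argparse sketch. It is not Kif's actual source: the build_parser name, the default of kif.yaml, and the example path /etc/kif/kif.yaml are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the two documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="kif")
    parser.add_argument(
        "--debug",
        action="store_true",
        help="detect issues but don't try to fix them",
    )
    parser.add_argument(
        "--config",
        metavar="FILENAME",
        default="kif.yaml",  # assumed default, not confirmed by the README
        help="path to config file",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--debug", "--config", "/etc/kif/kif.yaml"])
print(args.debug, args.config)  # prints: True /etc/kif/kif.yaml
```

Running in debug mode first is a sensible way to check which processes a new rule would catch before letting it restart or kill anything.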