Millerderek/OpenClaw--Skill-CrashRepairKit

gateway-watchdog

An OpenClaw skill and systemd daemon that automatically detects and repairs gateway disconnections, unresolvable plugin entries, and LLM model mismatches — with a live CLI dashboard to watch it all in real time.

Built and tested on OpenClaw 2026.x running on Ubuntu with systemd.


What It Does

| Problem | How It's Handled |
|---|---|
| Gateway crashes | Restart=always brings it back via systemd |
| Plugin not found (e.g. retell-voice) | fix-plugins.sh detects via jq + openclaw doctor, removes entry, restarts |
| Deprecated or mismatched model string | check-models.sh validates against known-good list, rewrites if needed |
| Invalid or expired API key | check-models.sh probes each provider's /v1/models endpoint live |
| Broken escalation chain | Chain integrity checked across all fallback models |
| Agent disconnects mid-session | Watchdog polls every 30s, triggers repair if unhealthy |
| Network down at startup | Model probing skipped entirely, plugin fix still runs |
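
To make the "polls every 30s, triggers repair if unhealthy" row concrete, here is a minimal sketch of a poll-and-repair loop. It is not the contents of openclaw-watchdog.sh. The function names, the use of systemctl is-active as the health signal, and the choice of re-running the startup fix as the repair step are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a poll-and-repair loop; not the real openclaw-watchdog.sh.

POLL_INTERVAL="${POLL_INTERVAL:-30}"                      # seconds between checks
GATEWAY_UNIT="${GATEWAY_UNIT:-openclaw-gateway.service}"

# Assumption: "healthy" means systemd reports the gateway unit as active.
gateway_is_healthy() {
    systemctl is-active --quiet "$GATEWAY_UNIT"
}

# Assumption: "repair" means re-running the startup fix script.
run_repair() {
    bash /usr/local/bin/openclaw-startup-fix.sh
}

# watch_loop CHECK_CMD REPAIR_CMD [MAX_POLLS]
# Calls CHECK_CMD every POLL_INTERVAL seconds and REPAIR_CMD whenever it fails.
# MAX_POLLS=0 (the default) loops forever, as a daemon would.
watch_loop() {
    local check_cmd="$1" repair_cmd="$2" max_polls="${3:-0}" polls=0
    while :; do
        if ! "$check_cmd"; then
            echo "$(date '+%F %T') gateway unhealthy, running repair"
            "$repair_cmd" || echo "$(date '+%F %T') repair failed"
        fi
        polls=$((polls + 1))
        if [ "$max_polls" -gt 0 ] && [ "$polls" -ge "$max_polls" ]; then
            break
        fi
        sleep "$POLL_INTERVAL"
    done
}
```

A daemon entry point would call `watch_loop gateway_is_healthy run_repair`. Passing the check and repair commands as arguments keeps the loop easy to exercise with stand-in functions.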

File Structure

gateway-watchdog/
├── install.sh                    # One-command installer
├── openclaw-startup-fix.sh       # Runs on every gateway start via ExecStartPost
├── openclaw-watchdog.sh          # Systemd daemon, polls every 30s
├── openclaw-watchdog.service     # Watchdog systemd unit
├── openclaw-gateway.service      # Gateway systemd unit (reference / patch target)
├── openclaw-watch.sh             # Live CLI dashboard
└── skill/
    ├── SKILL.md                  # OpenClaw skill definition
    ├── fix-plugins.sh            # Plugin detection and removal
    └── check-models.sh           # Model string, API key, and provider health checks

Install

git clone <this repo>
cd gateway-watchdog
bash install.sh

Requires: jq, bash, systemd, Linux. jq will be installed automatically if missing.

After install, verify the ExecStartPost hook was applied to the gateway service:

systemctl cat openclaw-gateway.service | grep -E "ExecStart|Restart"

The output should include the gateway's ExecStart line, ExecStartPost=/usr/local/bin/openclaw-startup-fix.sh, and the Restart= policy.


Auto-Repair Flow

Every time the gateway starts (whether after a crash or manually), openclaw-startup-fix.sh fires automatically:

Gateway starts
  └─ ExecStartPost: openclaw-startup-fix.sh
       ├─ Backup config (timestamped, pruned after 7 days)
       ├─ Phase 1: Plugin fix       ← always runs, local only
       └─ Phase 2: Model health
            ├─ Network check        ← skip probing if offline
            ├─ Previous state check ← alert-only if last state was healthy
            └─ Auto-fix             ← only if state was already degraded

The watchdog daemon runs in parallel and catches issues that develop after startup — rate limits, keys expiring mid-session, provider outages.
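
The Phase 2 branch of the flow above reduces to a small decision function. This is a sketch only: the function name, the yes/no network argument, and the printed action words are hypothetical, and treating a missing or unknown previous state like a healthy one (alert, no changes) is a conservative assumption rather than documented behavior.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the Phase 2 decision; not taken from openclaw-startup-fix.sh.

# decide_model_action NETWORK_UP PREVIOUS_STATE
#   NETWORK_UP      "yes" if the connectivity check passed, anything else if not
#   PREVIOUS_STATE  last recorded watchdog state, e.g. "healthy" or "degraded"
# Prints one of: skip, alert, fix.
decide_model_action() {
    local network_up="$1" previous_state="$2"
    if [ "$network_up" != "yes" ]; then
        echo "skip"               # offline: do not probe providers at all
        return 0
    fi
    case "$previous_state" in
        degraded) echo "fix" ;;   # already degraded before the restart: auto-fix allowed
        *)        echo "alert" ;; # healthy (or unknown, by assumption): log only
    esac
}
```

For example, `decide_model_action yes healthy` prints `alert`, which corresponds to the alert-only branch in the diagram.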


Safety Mitigations

Three independent gates prevent incorrect config changes:

1. Network gate Before probing any provider endpoint, connectivity is verified with a 3-second ping. If the network is not up, the entire model-probing phase is skipped. The plugin fix still runs because it is purely local.

2. Previous state gate The last recorded watchdog state is checked before any model auto-fix. If the state was healthy, the models were working before the crash — the script alerts and logs but does not touch the config. A human decision is required to force a model fix in that case.

3. Backup before every write A timestamped backup is saved to ~/.openclaw/backups/ before any config change. Backups older than 7 days are pruned automatically. The worst case is always one manual restore:

cp ~/.openclaw/backups/openclaw.json.startup.<timestamp> ~/.openclaw/openclaw.json
systemctl restart openclaw-gateway.service
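
The network gate and the backup step might look roughly like the sketch below. The ping target (1.1.1.1), the timestamp format, and the function names are assumptions. Only the backup directory, the openclaw.json.startup.<timestamp> naming, the 3-second timeout, and the 7-day pruning come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of two of the safety gates; not the project's actual code.

# network_is_up [HOST]
# One ping with a 3-second reply timeout (-W is the Linux iputils flag).
# Assumption: 1.1.1.1 is used here only as an example target.
network_is_up() {
    local host="${1:-1.1.1.1}"
    ping -c 1 -W 3 "$host" >/dev/null 2>&1
}

# backup_config CONFIG_FILE BACKUP_DIR
# Copies CONFIG_FILE to BACKUP_DIR/<name>.startup.<timestamp>, prunes old
# backups, and prints the path of the new backup.
backup_config() {
    local config="$1" backup_dir="$2" name stamp dest
    [ -f "$config" ] || { echo "no config at $config" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1
    name="$(basename "$config")"
    stamp="$(date +%Y%m%d-%H%M%S)"          # assumed timestamp format
    dest="$backup_dir/$name.startup.$stamp"
    # Plain cp (not cp -p) so the backup's mtime is "now"; otherwise a config
    # untouched for over a week would be pruned immediately below.
    cp "$config" "$dest" || return 1
    # Remove backups last modified more than 7 days ago.
    find "$backup_dir" -type f -name "$name.startup.*" -mtime +7 -delete
    echo "$dest"
}
```

A caller would run something like `backup_config ~/.openclaw/openclaw.json ~/.openclaw/backups` before any write, and wrap model probing in `if network_is_up; then ... fi`.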

Live Dashboard

# Live refresh every 5s
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh --follow

# One-shot snapshot
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh

# Full model health report
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh --models

# Run repair with animated progress bar
bash ~/.openclaw/skills/gateway-watchdog/openclaw-watch.sh --repair

In --follow mode:

  • R + Enter — run auto-repair with live progress bar
  • M + Enter — full model health report
  • Q + Enter — quit

Manual Commands

# Plugin fix only
bash ~/.openclaw/skills/gateway-watchdog/fix-plugins.sh

# Model health check (report only)
bash ~/.openclaw/skills/gateway-watchdog/check-models.sh

# Model health check with auto-fix
bash ~/.openclaw/skills/gateway-watchdog/check-models.sh --fix

# Full startup fix (same as what runs automatically)
bash /usr/local/bin/openclaw-startup-fix.sh

# Watchdog service
systemctl status openclaw-watchdog
journalctl -u openclaw-watchdog -f

# Startup fix log
tail -f /var/log/openclaw-startup-fix.log

# Watchdog log
tail -f /var/log/openclaw-watchdog.log

Agent Slash Command

Once installed, type /gateway-watchdog in the OpenClaw TUI to trigger a full health check and repair cycle from within the agent itself.


How Plugin Detection Works

The script does not call openclaw gateway start to detect bad plugins — that command opens an interactive wizard and hangs. Instead it uses two methods:

  1. jq + filesystem scan: reads every key from plugins.entries in openclaw.json and checks whether a corresponding directory or Node module exists on disk. Instant, with no CLI calls.
  2. openclaw doctor with a 10-second timeout: supplements method 1 for broken installs where files exist but the plugin still fails to resolve. Stdin is redirected from /dev/null (</dev/null) so the command cannot block on an interactive prompt.

The "State dir migration skipped" warning that appears in every openclaw doctor run is filtered out, since it is cosmetic and not an error.
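
The two detection methods can be sketched as follows. This is an illustration under stated assumptions, not the contents of fix-plugins.sh: the function names are hypothetical, the directories to search are passed in by the caller because the real plugin locations are not specified here, and the grep filter assumes the warning text appears on a single line.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of plugin detection; not the real fix-plugins.sh.

# list_plugin_entries CONFIG_FILE: print each key under plugins.entries.
list_plugin_entries() {
    jq -r '(.plugins.entries // {}) | keys[]' "$1"
}

# find_missing_plugins CONFIG_FILE SEARCH_DIR...
# Prints every configured plugin with no directory SEARCH_DIR/<name>.
find_missing_plugins() {
    local config="$1" name dir found
    shift
    list_plugin_entries "$config" | while IFS= read -r name; do
        found=no
        for dir in "$@"; do
            if [ -d "$dir/$name" ]; then found=yes; break; fi
        done
        [ "$found" = yes ] || echo "$name"
    done
}

# remove_plugin_entry CONFIG_FILE NAME: delete one key from plugins.entries.
remove_plugin_entry() {
    local config="$1" name="$2" tmp
    tmp="$(mktemp "$config.XXXXXX")" || return 1
    if jq --arg n "$name" 'del(.plugins.entries[$n])' "$config" > "$tmp"; then
        mv "$tmp" "$config"
    else
        rm -f "$tmp"
        return 1
    fi
}

# filter_doctor_noise: drop the cosmetic warning from doctor output on stdin.
filter_doctor_noise() {
    grep -v 'State dir migration skipped' || true
}

# doctor_output: run openclaw doctor with a 10 s cap and no interactive stdin.
doctor_output() {
    timeout 10 openclaw doctor </dev/null 2>&1 | filter_doctor_noise
}
```

Writing the edited JSON to a temporary file next to the config and then renaming it means a failed jq run never leaves a half-written openclaw.json behind.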


Model Validation

check-models.sh checks:

  • Primary model string format (provider/model-name)
  • Known deprecated model strings with suggested replacements
  • Fallback / escalation chain integrity and provider diversity
  • API key presence (credentials file → env var → config file)
  • Live provider endpoint reachability (HTTP 200 = valid, 401/403 = bad key, 429 = rate limited)
  • Subagent and TTS summary model strings
  • Local providers (Ollama, LMStudio) via local endpoint ping

Supported providers: Anthropic, OpenAI, Google/Gemini, Moonshot/Kimi, Mistral, Groq, OpenRouter, Ollama, LMStudio.
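
Two of the checks above, the provider/model-name format and the HTTP status interpretation, are easy to show in isolation. The sketch below is not taken from check-models.sh. The allowed character set, the function names, the result labels, and the Bearer authorization header are assumptions (some providers expect a different header).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of two model checks; not the real check-models.sh.

# is_valid_model_string STRING
# Succeeds when STRING looks like provider/model-name.
# Assumption: letters, digits, "_", "." and "-" in the provider part, and the
# same plus ":" and "/" in the model part (for names with a nested vendor).
is_valid_model_string() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9_.-]+/[A-Za-z0-9_.:/-]+$'
}

# classify_http_status CODE
# Maps a probe's HTTP status to a label, following the rules listed above.
classify_http_status() {
    case "$1" in
        200)     echo "valid" ;;
        401|403) echo "bad_key" ;;
        429)     echo "rate_limited" ;;
        000)     echo "unreachable" ;;  # curl prints 000 when no response arrived
        *)       echo "unknown" ;;
    esac
}

# probe_status URL API_KEY
# Prints only the HTTP status code of a GET request, with a 5-second cap.
# Assumption: the provider accepts "Authorization: Bearer <key>".
probe_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        -H "Authorization: Bearer $2" "$1"
}
```

A caller could combine them as `classify_http_status "$(probe_status "$url" "$key")"`, where `$url` is the provider's /v1/models endpoint.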


After an OpenClaw Update

OpenClaw may overwrite its own systemd service file on update, removing the ExecStartPost hook. If that happens, re-run the installer:

bash install.sh

It is idempotent — safe to run multiple times.
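
For readers curious what idempotent patching of a unit file can look like, here is a minimal sketch. It is not the logic of install.sh: the function name is hypothetical, and the approach (add the ExecStartPost line directly under [Service] only if it is not already present) is one possible way to do it.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an idempotent unit-file patch; not the real install.sh.

# ensure_exec_start_post UNIT_FILE HOOK_PATH
# Adds "ExecStartPost=HOOK_PATH" under [Service] unless that exact line exists.
# Running it repeatedly leaves exactly one copy of the line.
ensure_exec_start_post() {
    local unit_file="$1" hook="$2" tmp
    # Already patched: nothing to do.
    grep -qxF "ExecStartPost=$hook" "$unit_file" && return 0
    # Refuse to guess where the line belongs if there is no [Service] section.
    grep -q '^\[Service\]' "$unit_file" || return 1
    tmp="$(mktemp "$unit_file.XXXXXX")" || return 1
    if awk -v line="ExecStartPost=$hook" '
            { print }
            /^\[Service\]/ && !added { print line; added = 1 }
        ' "$unit_file" > "$tmp"; then
        chmod 644 "$tmp"      # assumption: usual mode for unit files
        mv "$tmp" "$unit_file"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

After a change like this, systemd needs `systemctl daemon-reload` before the new line takes effect. The order of different directives inside [Service] does not matter, so placing the line right after the section header is safe.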


License

MIT. No warranties. Review before running on production systems.
