Languages: English | 中文说明 (Chinese)
A small CLI toolbox around OpenClaw for:
- Web scraping → normalized Markdown for knowledge bases
- Safe text patching → incremental edits to existing files
- Daily session log summarization → turn OpenClaw chat logs into outlines
This repo collects reusable, script‑friendly tools, exposed in two ways:
- A Python library (clawutils) you can import, and
- A unified CLI entrypoint (e.g. clawutils web scrape <URL>) so humans and agents can discover and call tools in a structured way.
Status: early but usable. Currently provides:
- A robust web scraping pipeline (clawutils web scrape)
- A safe text patcher for incremental edits (clawutils text patch)
- A daily logs summarizer for OpenClaw agents (clawutils logs daily)
clawutils is designed to run inside the OpenClaw runtime, but can also be used in a similar Python + Node.js environment.
- Python 3.10+
Install clawutils itself (editable/local install for development):
cd clawutils
python -m pip install -e .

The web scraper is implemented in Node.js and relies on:
- playwright
- @mozilla/readability
- jsdom
- turndown

Install them in the same environment where you will run clawutils:

npm install playwright @mozilla/readability jsdom turndown

Note: In an OpenClaw container, Node.js is already available; you only need to install these npm packages.
After installation, the main entrypoint is:
clawutils --help

Current top‑level commands:
- clawutils web ... – Web‑related utilities (scrapers, cleaners)
- clawutils text ... – Text utilities (patch, transform)
- clawutils logs ... – Logs/session utilities (daily summaries, inspections)
Use --help on each subcommand for details:
clawutils web --help
clawutils web scrape --help
clawutils text --help
clawutils text patch --help
clawutils logs --help
clawutils logs daily --help

The web scraper turns arbitrary web pages into clean, normalized Markdown that is suitable for ingest into Clawkb or other knowledge bases.
Fetch a page and save it as a Markdown file:
clawutils web scrape "https://example.com/article" > article.md

Typical usage with Clawkb:
# In Clawkb repo
export CLAWKB_SCRAPE_CMD="clawutils web scrape {url}"
python -m clawkb ingest --url "https://example.com/article"

See below for more details.
clawutils web scrape "https://example.com/article" > out.md

Output format:
--- METADATA ---
Title: ...
Author: ...
Site: ...
FinalURL: ...
Extraction: readability|fallback-container|body-innerText|github-raw-fast-path
FallbackSelector: ... # only when Extraction != readability
--- MARKDOWN ---
<markdown body>
- The METADATA block contains basic fields for downstream tools.
- The MARKDOWN section is the cleaned article body.
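A downstream tool can split this output on the two delimiters. The following is a minimal sketch (not clawutils' actual parsing code), assuming exactly the `--- METADATA ---` / `--- MARKDOWN ---` layout shown above:

```python
def parse_scrape_output(raw: str) -> tuple[dict, str]:
    """Split scraper output into a metadata dict and a Markdown body.

    Sketch only: assumes the delimiters and 'Key: value' lines shown above.
    """
    head, _, body = raw.partition("--- MARKDOWN ---")
    meta = {}
    for line in head.splitlines():
        line = line.strip()
        if not line or line.startswith("---"):
            continue  # skip the METADATA delimiter and blank lines
        key, _, value = line.partition(":")
        if key:
            meta[key.strip()] = value.strip()
    return meta, body.strip()
```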
- Readability‑based extraction: uses @mozilla/readability to locate the main article content and strip navigation/ads/noise.
- Fallback heuristics: if Readability fails, falls back to likely content containers (#js_content, article, main, etc.), then to document.body.innerText as a last resort.
- GitHub fast path: for github.com URLs, tries to fetch the raw README (e.g. README.md, README_zh.md) directly via HTTP without launching a browser, for speed and lower resource usage.
- Metadata & debugging:
  - Includes the Extraction mode and an optional FallbackSelector.
  - When content is too short, emits debug info and a screenshot (/tmp/scrape-fail-<ts>.png) to help diagnose issues.
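The GitHub fast path boils down to rewriting a repo URL into raw-README candidates. The actual scraper is Node.js; this Python sketch only illustrates the idea, and the branch names (`main`, `master`) are assumptions:

```python
from urllib.parse import urlparse

def github_raw_readme_candidates(url: str) -> list[str]:
    """Candidate raw-README URLs for a github.com repo page.

    Illustrative sketch; the real (Node.js) scraper may try other
    branches or filenames.
    """
    parts = urlparse(url)
    if parts.netloc != "github.com":
        return []
    segs = [s for s in parts.path.split("/") if s]
    if len(segs) < 2:
        return []  # not an owner/repo page
    owner, repo = segs[0], segs[1]
    candidates = []
    for branch in ("main", "master"):  # assumed default branch names
        for name in ("README.md", "README_zh.md"):
            candidates.append(
                f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{name}"
            )
    return candidates
```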
clawutils is designed to be a drop‑in scraper for Clawkb.
In your Clawkb project, set in .env:
CLAWKB_SCRAPE_CMD="clawutils web scrape {url}"

Clawkb's ingest --url will then:
- Call clawutils web scrape <URL>;
- Parse the METADATA and MARKDOWN sections;
- Store the article in SQLite + Markdown as usual.
You can still override --scrape-cmd per call if needed:
python -m clawkb ingest \
--url "https://example.com/article" \
  --scrape-cmd "clawutils web scrape {url}"

Append a line to the end of a file:
clawutils text patch --file README.md --mode append --text "\nUpdated by clawutils.\n"

Insert a block after a marker line:
clawutils text patch \
--file THESIS_PLAN.md \
--mode after \
--marker "## 3. Planned workstreams" \
  --text "\n- [ ] TODO: add more experiments.\n"

See below for modes and details.
The text patcher provides a small, safe tool for incremental edits to text files. It is designed to avoid whole‑file rewrites when only small changes are needed.
Supported modes:
- prepend – insert at the very beginning
- append – append at the end
- after – insert immediately after a marker substring
clawutils text patch \
  --file README.md \
  --mode prepend \
  --text "# My Project"

clawutils text patch \
  --file README.md \
  --mode append \
  --text "\n---\nUpdated via clawutils text patch"

clawutils text patch \
  --file README.md \
  --mode after \
  --marker "### Changelog" \
  --text "- Added clawutils text patch utility"

Safety behavior:
- If the file does not exist → prints ERROR: File not found and exits with code 1.
- If --mode=after and the marker is missing → prints a clear error and exits with code 1.
This tool is particularly useful when large files are managed by an AI agent:
let the agent generate only the small piece of new text, and let clawutils
handle the insertion on disk.
Summarize yesterday's logs for the default agent:

clawutils logs daily > logs_$(date -u -d yesterday +%F).md

Specify a date and agent directory explicitly:
clawutils logs daily \
--date 2026-03-02 \
--agent-dir ~/.openclaw/agents/main \
--cluster-threshold 0.6 \
--verbose \
  > logs_2026-03-02.md

The output is a semantic outline (topics, keywords, quotes) that can be used as a diary or as input to other tools.
The logs summarizer turns a day's worth of OpenClaw session logs into a structured outline or diary‑style summary.
The current implementation focuses on the main agent's session JSONL files under
~/.openclaw/agents/main/sessions.
# Summarize yesterday (UTC) for the main agent
clawutils logs daily
# Summarize a specific date
clawutils logs daily --date 2026-03-02
# Summarize using a custom agent directory
clawutils logs daily --date 2026-03-02 \
--agent-dir ~/.openclaw/agents/main \
  --verbose

Key options:
- --date YYYY-MM-DD – Target date (UTC). If omitted, defaults to yesterday.
- --agent-dir PATH – Agent directory; defaults to ~/.openclaw/agents/main.
- --verbose – Print detailed progress logs (phases, counts, clustering).
- --cluster-threshold FLOAT – Override the TF‑IDF cosine threshold when embeddings are not used. Lower values merge more segments into fewer, broader topics; higher values keep clusters more granular.
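The date handling described here (UTC "yesterday" default, day-boundary comparison against file mtimes) can be sketched as follows. This is a sketch of the documented behavior, not the tool's actual code:

```python
from datetime import datetime, timedelta, timezone

def default_target_date() -> str:
    """Yesterday (UTC) as YYYY-MM-DD, matching the documented --date default."""
    return (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%d")

def utc_day_bounds(date_str: str) -> tuple[float, float]:
    """[start, end) POSIX timestamps of the given UTC day.

    The start timestamp is the kind of value the session scanner can
    compare file mtimes against.
    """
    start = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return start.timestamp(), (start + timedelta(days=1)).timestamp()
```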
1. Session scanning
   - Scans sessions/*.jsonl under the chosen agent-dir.
   - Files are processed in reverse mtime order (newest first).
   - Uses file modification time to short‑circuit: once a file's mtime is strictly before the start of the target date, older files are skipped.
2. Message extraction & cleaning
   - Keeps only type == "message" entries with role in {user, assistant}.
   - Flattens content[] into plain text (type == "text").
   - Strips obvious system metadata blocks, i.e. "Conversation info (untrusted metadata)" and "Forwarded message context (untrusted metadata)" and their following JSON blocks.
   - Filters out short shell/log‑like noise (e.g. one‑liners containing pip, npm, git, ls, cd, docker, openclaw).
3. Segmentation
   - Messages are grouped into segments in chronological order.
   - A new segment starts when adding another message would exceed a size limit (currently ~2000 characters).
   - This avoids both extremely small and extremely large segments; topic grouping is handled by clustering.
4. Clustering (topics)
   - If embeddings are configured (EMBEDDING_* env vars) and available, segments are clustered via embedding‑based cosine similarity.
   - Otherwise, a TF‑IDF‑style bag‑of‑words cosine similarity is used as a fallback:
     - The default threshold is 0.6.
     - You can override it with --cluster-threshold.
   - The goal is to group semantically related segments into topic clusters, even if they are far apart in time.
5. Summarization modes
   - If SMALL_LLM_* env vars are configured and httpx is available:
     - Each topic cluster is summarized via a small LLM.
     - A final "daily diary" is generated from the per‑cluster summaries.
   - Otherwise, a deterministic outline mode is used: each topic is printed with keywords, stats, and a representative sample line.
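The bag‑of‑words cosine fallback can be sketched as a greedy single‑pass clusterer. This is only an illustration of the idea, assuming plain term counts rather than real TF‑IDF weights; the actual implementation may weight, normalize, or merge clusters differently:

```python
import math
import re
from collections import Counter

def _vec(text: str) -> Counter:
    """Bag-of-words term counts (a stand-in for real TF-IDF weighting)."""
    return Counter(re.findall(r"\w+", text.lower()))

def _cos(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_segments(segments: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Greedy clustering: join the most similar cluster above the threshold,
    otherwise start a new one."""
    clusters: list[tuple[Counter, list[str]]] = []
    for seg in segments:
        v = _vec(seg)
        best, best_sim = None, 0.0
        for c in clusters:
            sim = _cos(v, c[0])
            if sim >= threshold and sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append((v, [seg]))
        else:
            best[0].update(v)  # fold the segment into the cluster "centroid"
            best[1].append(seg)
    return [members for _, members in clusters]
```

Lowering the threshold makes a match easier, so more segments fold into existing clusters; raising it keeps clusters granular, exactly as described for --cluster-threshold.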
When no small LLM is configured, the output looks roughly like this:
Daily summary for 2026-03-02
## 1. [clawutils, daily, logs] (15 msgs | 42.3 min)
> A representative sentence from the logs…
## 2. [Clawkb, maintenance, delete] (8 msgs | 21.5 min)
> Another representative sentence...
Details:
- Topic numbering: topics are numbered 1., 2., ... to make them easy to reference in conversation.
- Keyword fingerprint: keywords are extracted per topic via jieba.analyse.textrank (when available), with a simple bag‑of‑words fallback. This makes each topic's core content visible at a glance.
- Compact stats: (<N> msgs | <M> min) shows the approximate message count and time span.
- Representative sample: a single quoted line (> ...) is chosen per topic using a rough signal‑to‑noise heuristic (token density), so that high‑information sentences are preferred.
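One way to read "token density" is the ratio of distinct word tokens to line length, which penalizes both repetitive filler and near-empty lines. The following is a rough stand-in for that heuristic, not clawutils' actual scoring:

```python
import re

def representative_line(lines: list[str]) -> str:
    """Pick the line with the highest distinct-token density.

    Sketch of the heuristic only; the real scoring may differ.
    """
    def density(line: str) -> float:
        tokens = re.findall(r"\w+", line)
        return len(set(tokens)) / max(len(line), 1)

    # Prefer lines with some substance over one-word fragments.
    candidates = [l for l in lines if len(l.split()) >= 3] or lines
    return max(candidates, key=density)
```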
This mode requires no external LLMs and is designed to be useful even in offline or "no‑API" environments.
When SMALL_LLM_BASE_URL, SMALL_LLM_MODEL, and SMALL_LLM_API_KEY are
set and reachable, clawutils logs daily will:
- Generate a short structured summary for each topic cluster via the small LLM.
- Ask the small LLM to merge those topic summaries into a single daily diary, using a first‑person style and emphasizing decisions, conclusions, and follow‑up TODOs.
If any LLM call fails, the command falls back to the outline mode described above.
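That degradation path is a simple try/except around the LLM call. The callables below are placeholders to show the pattern, not clawutils' actual API:

```python
def summarize_with_fallback(cluster_text: str, llm_summarize, outline_summarize) -> str:
    """Try the small-LLM path; on any failure, fall back to outline mode.

    llm_summarize / outline_summarize are hypothetical callables used
    only to illustrate the documented fallback behavior.
    """
    try:
        return llm_summarize(cluster_text)
    except Exception:
        # Any network/auth/model failure degrades gracefully to the
        # deterministic outline described above.
        return outline_summarize(cluster_text)
```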
clawutils ships with example skills for OpenClaw agents under skills/:
- skills/web-scrape/SKILL.md
- skills/text-patch/SKILL.md
You can symlink them into your main OpenClaw workspace if you want agents to call these tools as first‑class skills, for example:
# In your main OpenClaw workspace
cd /home/node/.openclaw/workspace
ln -s ./clawutils/skills/web-scrape ./skills/web-scrape
ln -s ./clawutils/skills/text-patch ./skills/text-patch

Each SKILL.md documents how to invoke the underlying CLI, typically using:
PYTHONPATH=src /opt/venv/bin/python -m clawutils.cli web scrape <URL>
PYTHONPATH=src /opt/venv/bin/python -m clawutils.cli text patch [...]

MIT © Ernest Yu