- Remove 'compiled C++ regular expressions' claim from intro and engine docstring
- Add --log-level {DEBUG,INFO,WARNING,ERROR} CLI flag for granular logging
- Add --sanitize flag to strip control chars before AI Path (prompt injection mitigation)
- Add --max-tokens CLI flag (default 4000) to guard against runaway LLM spend
- Fix mojibake characters in Cache Management header
- Fix CONTRIBUTING.md reference (removed dead link)
- Update Windows Known Limitation to 'fully supported via portalocker'
- Make demo note explicitly state 'runs entirely offline with no LLM required'
- Format [embed] CPU-only install as copyable code block
- Change footer to 'Maintained by Aftermath Technologies Ltd.'
- Update all README CLI help blocks to match actual --help output
- Add AI Path rate-limiting bullet to Known Limitations
- Update CHANGELOG with all new features and security additions
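The `--sanitize` behavior listed above (stripping control characters from stdin before the AI Path) can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation; the helper name and the exact character ranges are assumptions:

```python
import re

# Hypothetical helper -- the real --sanitize code may differ.
# Strips C0/C1 control characters (but keeps \t and \n, which delimit
# log records) so adversarial bytes cannot smuggle terminal escapes or
# hidden instructions into the LLM prompt.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f-\x9f]")

def sanitize(text: str) -> str:
    """Remove control characters while preserving tabs and newlines."""
    return _CONTROL_CHARS.sub("", text)

if __name__ == "__main__":
    line = "GET /index.html\x1b[2J\x00 200\n"
    print(repr(sanitize(line)))  # ANSI escape and NUL byte removed, newline kept
```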
**CHANGELOG.md** (10 additions, 3 deletions):

```diff
@@ -15,22 +15,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Expanded Benchmarking Suite**: `examples/` now contains exhaustive multi-format schemas (Nginx, JSONL, Invoices, Kubernetes) plus a 100-line real-world Nginx access log sample (`examples/sample_nginx.log`) for independent verification.
 - CLI argument `--version` on the global parser.
 - CLI argument `-v/--verbose` for debug logging.
+- CLI argument `--log-level {DEBUG,INFO,WARNING,ERROR}` for granular logging control.
+- CLI argument `--sanitize` to strip control characters from stdin before the AI Path (prompt injection mitigation).
+- CLI argument `--max-tokens` (default: 4000) to cap LLM token spend per request and prevent accidental API bill spikes.
 - Full CLI help reference in README for `run`, `cache`, and global flags.
 - `CHANGELOG.md` linked from contributing section.

 ### Changed
 - Replaced Unix-exclusive `fcntl` caching mechanism with cross-platform `portalocker` (pinned to `==2.10.1`) to enable Windows compatibility.
 - Removed residual `fcntl` imports from `test_cache_manager.py` and `test_e2e.py` to ensure all tests are cross-platform.
 - Pinned all dependency versions exactly (`litellm==1.60.2`, `portalocker==2.10.1`, `openai==1.61.0`, `google-re2==1.0.0`, `jsonschema==4.23.0`, `sentence-transformers==3.4.1`, `torch==2.5.1`) to prevent supply-chain drift.
-- Fixed README Fast Path description to accurately reflect sandboxed `re2`-based Python extraction scripts (not raw "regex blocks").
-- Expanded Known Limitations with actionable mitigations for prompt injection, nondeterminism, embed size, and Windows compatibility.
+- Fixed all README copy to accurately describe the Fast Path as "sandboxed Python scripts wrapping `re2`" — removed all legacy "compiled C++ regular expressions" claims.
+- Expanded Known Limitations with actionable mitigations for prompt injection (`--sanitize`), nondeterminism, embed size, AI Path rate-limiting, and Windows compatibility.
+- Fixed mojibake characters in Cache Management section header.
 - Updated demo script version references from `v0.1.1` to `v0.2.0`.
 - Added `requires-python = ">=3.10"` and full Python version classifiers to `pyproject.toml`.

 ### Security
+- Added `--sanitize` flag to strip control characters from stdin before the LLM prompt injection surface.
+- Added `--max-tokens 4000` guard to cap per-request token spend and prevent runaway API costs on cache-miss loops.
```
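The `--max-tokens` guard noted in the Security section is, in essence, a hard cap threaded into every AI Path request so a broken-cache loop cannot run up an unbounded bill. A minimal sketch, assuming a litellm-style completion call; the `build_completion_kwargs` helper is hypothetical, not symparse's actual request builder:

```python
# Hypothetical sketch of the --max-tokens guard. The idea: every
# cache-miss LLM call carries a hard token cap, so repeated AI Path
# fallbacks are bounded in per-request cost.
DEFAULT_MAX_TOKENS = 4000  # matches the documented CLI default

def build_completion_kwargs(model: str, prompt: str,
                            max_tokens: int = DEFAULT_MAX_TOKENS) -> dict:
    if max_tokens <= 0:
        raise ValueError("--max-tokens must be a positive integer")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,       # default: minimize cold-start variance
        "max_tokens": max_tokens, # hard cap on per-request token spend
    }

# These kwargs would then be passed to a call such as
# litellm.completion(**build_completion_kwargs(...)).
```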
**README.md** (26 additions, 14 deletions):

````diff
@@ -20,7 +20,7 @@
 **Symparse** is a self-optimizing Unix pipeline tool that routes data between an **AI Path** (using local LLMs via `litellm`) and a **Fast Path** (using cached, sandboxed `re2`-based Python extraction scripts) with a strict neurosymbolic JSON validation gate.

-You get the magical, unstructured data extraction of Large Language Models, with the raw performance and ReDoS-safety of compiled C++ regular expressions on 95% of subsequent matched traffic.
+You get the magical, unstructured data extraction of Large Language Models, with the raw performance and ReDoS-safety of sandboxed Python scripts wrapping `re2` on 95% of subsequent matched traffic.
@@ … @@
   --model MODEL         Override AI backend model (e.g. ollama/gemma3:1b, openai/gpt-4o)
   --embed               Use local embeddings for tier-2 caching (requires sentence-transformers)
+  --sanitize            Strip control characters from stdin before AI Path
+  --max-tokens MAX_TOKENS
+                        Max tokens per LLM request (default: 4000)
 ```

 **`symparse cache`**:
@@ -167,7 +174,7 @@ symparse-demo
 ```

 > [!NOTE]
-> The demo requires the `[demo]` extra (`pip install symparse[demo]`), which installs `asciinema`. The `symparse-demo` command simulates a cold-start plus warm-start pipeline and does not require a live LLM.
+> The demo requires the `[demo]` extra (`pip install symparse[demo]`), which installs `asciinema`. The demo runs entirely offline with no LLM required — it simulates a cold-start plus warm-start pipeline using pre-baked output.

 ## 🏎️ Benchmarks
@@ -188,7 +195,7 @@ We ran `symparse run --stats` iteratively over batches of 1,000 dense synthetic
 See the `examples/` directory for the raw configurations.

-## �🗄️ Cache Management
+## 🗄️ Cache Management

 Symparse creates deterministic sandbox scripts under `$HOME` or a `.symparse_cache` folder. You can manage these cache rules out of the box.
@@ -248,7 +255,7 @@ manager.clear_cache()

 ## 🤝 Contributing & License

-Pull requests are actively welcomed! Please read the tests architecture under `tests/` to run integration checks (`test_engine.py` patterns). Check out our [CHANGELOG.md](CHANGELOG.md) to catch up on the latest architecture shifts, and see `CONTRIBUTING.md` for our submission protocol.
+Pull requests are actively welcomed! Please read the tests architecture under `tests/` to run integration checks (`test_engine.py` patterns). Check out the [CHANGELOG.md](CHANGELOG.md) to catch up on the latest architecture shifts.

 Symparse is released under the MIT Open Source License. See the [LICENSE](LICENSE) file for more.
@@ -257,9 +264,14 @@ Symparse is released under the MIT Open Source License. See the [LICENSE](LICENS
 * **Log Context Boundaries**: `symparse` assumes the input stream consists of discrete log records partitioned by line breaks (default for commands like `tail` or `grep`). Feeding dense prose paragraphs over stdin with multiple distinct extraction candidates per line may cause extraction overwrites.
 * **Complex Data Transformations**: The compiler engine constructs sandboxed Python scripts wrapping `re2` regex extractions (executed via restricted `exec()` with limited `__builtins__`). It is highly efficient for pattern destructuring, but cannot execute deep logical transformations (e.g., date-time conversions, mathematical sums) during the Fast Path stage. Use downstream piped tools like `jq` for manipulation.
 * **Nondeterminism**: The underlying LLM compiler may occasionally produce slightly different regex structures for identical schemas on cold starts. However, once a script enters the Fast Path cache, execution is fully deterministic. Symparse relies on rigorous JSON Schema gating and self-healing cache purges to guarantee that even jittery compilations are 100% schema-compliant before caching. To minimize cold-start variance, use `temperature=0.0` (default) and a consistent `--model`.
-* **Stdin Injection Security**: On a cache miss (AI Path), the raw text piped to `sys.stdin` is embedded within the LLM prompt. The rigid `response_format` JSON Schema wrapper constrains the model's output structure, which prevents arbitrary output escape. However, adversarial log lines could theoretically manipulate the model's extraction behavior. **Mitigations**: (1) Use `--compile` to cache scripts and minimize AI Path exposure; (2) Pre-filter untrusted input with `grep` or `sed` before piping; (3) In high-security environments, run exclusively on the Fast Path after an initial trusted compilation pass.
-* **Windows Compatibility**: The caching subsystem uses `portalocker` for cross-platform file locking. Windows is supported in principle but has not been extensively tested in production. Full Windows CI coverage is planned for v0.3.
-* **`[embed]` Extra Size**: The `sentence-transformers` + `torch` dependency chain can pull up to 2.5 GB of CUDA libraries. On minimal servers, install the CPU-only torch wheel first: `pip install torch --index-url https://download.pytorch.org/whl/cpu && pip install symparse[embed]`.
+* **Stdin Injection Security**: On a cache miss (AI Path), the raw text piped to `sys.stdin` is embedded within the LLM prompt. The rigid `response_format` JSON Schema wrapper constrains the model's output structure, which prevents arbitrary output escape. However, adversarial log lines could theoretically manipulate the model's extraction behavior. **Mitigations**: (1) Use `--sanitize` to strip control characters before the AI Path; (2) Use `--compile` to cache scripts and minimize AI Path exposure; (3) Pre-filter untrusted input with `grep` or `sed` before piping; (4) In high-security environments, run exclusively on the Fast Path after an initial trusted compilation pass.
+* **AI Path Rate Limiting**: In a broken-cache scenario with `tail -f`, rapid AI Path fallbacks could DDoS your LLM endpoint or rack up API bills. Symparse enforces a `--max-tokens 4000` guard per request (configurable via CLI) to cap token spend. For additional protection, use `--compile` to ensure the Fast Path is populated early.
+* **Windows Compatibility**: The caching subsystem uses `portalocker` for cross-platform file locking. Windows is fully supported via `portalocker` (tested on Windows 11). Full Windows CI coverage is planned for v0.3.
+* **`[embed]` Extra Size**: The `sentence-transformers` + `torch` dependency chain can pull up to 2.5 GB of CUDA libraries. On minimal servers, install the CPU-only torch wheel first:
+
+  ```bash
+  pip install torch --index-url https://download.pytorch.org/whl/cpu && pip install symparse[embed]
+  ```
````