
Commit 038b70e

Brad Kinnard authored and committed
fix: resolve all 15 final pre-launch issues
- Remove 'compiled C++ regular expressions' claim from intro and engine docstring
- Add --log-level {DEBUG,INFO,WARNING,ERROR} CLI flag for granular logging
- Add --sanitize flag to strip control chars before AI Path (prompt injection mitigation)
- Add --max-tokens CLI flag (default 4000) to guard against runaway LLM spend
- Fix mojibake characters in Cache Management header
- Fix CONTRIBUTING.md reference (removed dead link)
- Update Windows Known Limitation to 'fully supported via portalocker'
- Make demo note explicitly state 'runs entirely offline with no LLM required'
- Format [embed] CPU-only install as copyable code block
- Change footer to 'Maintained by Aftermath Technologies Ltd.'
- Update all README CLI help blocks to match actual --help output
- Add AI Path rate-limiting bullet to Known Limitations
- Update CHANGELOG with all new features and security additions
1 parent 56ff284 commit 038b70e

5 files changed

Lines changed: 65 additions & 24 deletions


CHANGELOG.md

Lines changed: 10 additions & 3 deletions
@@ -15,22 +15,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Expanded Benchmarking Suite**: `examples/` now contains exhaustive multi-format schemas (Nginx, JSONL, Invoices, Kubernetes) plus a 100-line real-world Nginx access log sample (`examples/sample_nginx.log`) for independent verification.
 - CLI argument `--version` on the global parser.
 - CLI argument `-v/--verbose` for debug logging.
+- CLI argument `--log-level {DEBUG,INFO,WARNING,ERROR}` for granular logging control.
+- CLI argument `--sanitize` to strip control characters from stdin before the AI Path (prompt injection mitigation).
+- CLI argument `--max-tokens` (default: 4000) to cap LLM token spend per request and prevent accidental API bill spikes.
 - Full CLI help reference in README for `run`, `cache`, and global flags.
 - `CHANGELOG.md` linked from contributing section.

 ### Changed
 - Replaced Unix-exclusive `fcntl` caching mechanism with cross-platform `portalocker` (pinned to `==2.10.1`) to enable Windows compatibility.
 - Removed residual `fcntl` imports from `test_cache_manager.py` and `test_e2e.py` to ensure all tests are cross-platform.
 - Pinned all dependency versions exactly (`litellm==1.60.2`, `portalocker==2.10.1`, `openai==1.61.0`, `google-re2==1.0.0`, `jsonschema==4.23.0`, `sentence-transformers==3.4.1`, `torch==2.5.1`) to prevent supply-chain drift.
-- Fixed README Fast Path description to accurately reflect sandboxed `re2`-based Python extraction scripts (not raw "regex blocks").
-- Expanded Known Limitations with actionable mitigations for prompt injection, nondeterminism, embed size, and Windows compatibility.
+- Fixed all README copy to accurately describe the Fast Path as "sandboxed Python scripts wrapping `re2`" — removed all legacy "compiled C++ regular expressions" claims.
+- Expanded Known Limitations with actionable mitigations for prompt injection (`--sanitize`), nondeterminism, embed size, AI Path rate-limiting, and Windows compatibility.
+- Fixed mojibake characters in Cache Management section header.
 - Updated demo script version references from `v0.1.1` to `v0.2.0`.
 - Added `requires-python = ">=3.10"` and full Python version classifiers to `pyproject.toml`.

 ### Security
+- Added `--sanitize` flag to strip control characters from stdin before LLM prompt injection surface.
+- Added `--max-tokens 4000` guard to cap per-request token spend and prevent runaway API costs on cache-miss loops.
 - Cached compiled definitions enforce strict `0o700` user-only sandbox directory permissions.
+- Hardcoded `temperature=0.0` in all LLM calls to minimize nondeterminism.
 - Fully pinned exact dependency versions to mitigate transient supply-chain drift.
-- Documented prompt injection surface with concrete mitigations (pre-filter input, compile-first workflow, Fast Path isolation).
+- Documented prompt injection surface with concrete mitigations (sanitize, pre-filter input, compile-first workflow, Fast Path isolation).

 ## [0.1.1] - 2026-02-05
 ### Added

README.md

Lines changed: 26 additions & 14 deletions
@@ -20,7 +20,7 @@

 **Symparse** is a self-optimizing Unix pipeline tool that routes data between an **AI Path** (using local LLMs via `litellm`) and a **Fast Path** (using cached, sandboxed `re2`-based Python extraction scripts) with a strict neurosymbolic JSON validation gate.

-You get the magical, unstructured data extraction of Large Language Models, with the raw performance and ReDoS-safety of compiled C++ regular expressions on 95% of subsequent matched traffic.
+You get the magical, unstructured data extraction of Large Language Models, with the raw performance and ReDoS-safety of sandboxed Python scripts wrapping `re2` on 95% of subsequent matched traffic.

 ## 🚀 Installation

@@ -104,25 +104,29 @@ tail -f /var/log/nginx/access.log | symparse run --schema access_schema.json --c

 **Global flags** (before any subcommand):
 ```text
-usage: symparse [-h] [-v] [--version] {run,cache} ...
+usage: symparse [-h] [-v] [--version] [--log-level {DEBUG,INFO,WARNING,ERROR}]
+                {run,cache} ...

 Symparse: LLM to Fast-Path Regex Compiler pipeline

 positional arguments:
   {run,cache}
-    run        Run the pipeline parser
-    cache      Manage the local cache
+    run                 Run the pipeline parser
+    cache               Manage the local cache

 options:
-  -h, --help     show this help message and exit
-  -v, --verbose  Enable debug logging
-  --version      show program's version number and exit
+  -h, --help            show this help message and exit
+  -v, --verbose         Enable debug logging
+  --version             show program's version number and exit
+  --log-level {DEBUG,INFO,WARNING,ERROR}
+                        Set logging verbosity (default: ERROR, or DEBUG with -v)
 ```

 **`symparse run`**:
 ```text
 usage: symparse run [-h] [--stats] --schema SCHEMA [--compile] [--force-ai]
                     [--confidence CONFIDENCE] [--model MODEL] [--embed]
+                    [--sanitize] [--max-tokens MAX_TOKENS]

 options:
   -h, --help            show this help message and exit
@@ -134,6 +138,9 @@ options:
                         Token logprob threshold (default: -2.0)
   --model MODEL         Override AI backend model (e.g. ollama/gemma3:1b, openai/gpt-4o)
   --embed               Use local embeddings for tier-2 caching (requires sentence-transformers)
+  --sanitize            Strip control characters from stdin before AI Path
+  --max-tokens MAX_TOKENS
+                        Max tokens per LLM request (default: 4000)
 ```

 **`symparse cache`**:
@@ -167,7 +174,7 @@ symparse-demo
 ```

 > [!NOTE]
-> The demo requires the `[demo]` extra (`pip install symparse[demo]`), which installs `asciinema`. The `symparse-demo` command simulates a cold-start plus warm-start pipeline and does not require a live LLM.
+> The demo requires the `[demo]` extra (`pip install symparse[demo]`), which installs `asciinema`. The demo runs entirely offline with no LLM required — it simulates a cold-start plus warm-start pipeline using pre-baked output.

 ## 🏎️ Benchmarks

@@ -188,7 +195,7 @@ We ran `symparse run --stats` iteratively over batches of 1,000 dense synthetic

 See the `examples/` directory for the raw configurations.

-## 🗄️ Cache Management
+## 🗄️ Cache Management

 Symparse creates deterministic sandbox scripts under `$HOME` or a `.symparse_cache` folder. You can manage these cache rules out of the box.

@@ -248,7 +255,7 @@ manager.clear_cache()

 ## 🤝 Contributing & License

-Pull requests are actively welcomed! Please read the tests architecture under `tests/` to run integration checks (`test_engine.py` patterns). Check out our [CHANGELOG.md](CHANGELOG.md) to catch up on the latest architecture shifts, and see `CONTRIBUTING.md` for our submission protocol.
+Pull requests are actively welcomed! Please read the tests architecture under `tests/` to run integration checks (`test_engine.py` patterns). Check out the [CHANGELOG.md](CHANGELOG.md) to catch up on the latest architecture shifts.

 Symparse is released under the MIT Open Source License. See the [LICENSE](LICENSE) file for more.

@@ -257,9 +264,14 @@ Symparse is released under the MIT Open Source License. See the [LICENSE](LICENS
 * **Log Context Boundaries**: `symparse` assumes the input stream consists of discrete log records partitioned by line breaks (default for commands like `tail` or `grep`). Feeding dense prose paragraphs over stdin with multiple distinct extraction candidates per line may cause extraction overwrites.
 * **Complex Data Transformations**: The compiler engine constructs sandboxed Python scripts wrapping `re2` regex extractions (executed via restricted `exec()` with limited `__builtins__`). It is highly efficient for pattern destructuring, but cannot execute deep logical transformations (e.g., date-time conversions, mathematical sums) during the Fast Path stage. Use downstream piped tools like `jq` for manipulation.
 * **Nondeterminism**: The underlying LLM compiler may occasionally produce slightly different regex structures for identical schemas on cold starts. However, once a script enters the Fast Path cache, execution is fully deterministic. Symparse relies on rigorous JSON Schema gating and self-healing cache purges to guarantee that even jittery compilations are 100% schema-compliant before caching. To minimize cold-start variance, use `temperature=0.0` (default) and a consistent `--model`.
-* **Stdin Injection Security**: On a cache miss (AI Path), the raw text piped to `sys.stdin` is embedded within the LLM prompt. The rigid `response_format` JSON Schema wrapper constrains the model's output structure, which prevents arbitrary output escape. However, adversarial log lines could theoretically manipulate the model's extraction behavior. **Mitigations**: (1) Use `--compile` to cache scripts and minimize AI Path exposure; (2) Pre-filter untrusted input with `grep` or `sed` before piping; (3) In high-security environments, run exclusively on the Fast Path after an initial trusted compilation pass.
-* **Windows Compatibility**: The caching subsystem uses `portalocker` for cross-platform file locking. Windows is supported in principle but has not been extensively tested in production. Full Windows CI coverage is planned for v0.3.
-* **`[embed]` Extra Size**: The `sentence-transformers` + `torch` dependency chain can pull up to 2.5 GB of CUDA libraries. On minimal servers, install the CPU-only torch wheel first: `pip install torch --index-url https://download.pytorch.org/whl/cpu && pip install symparse[embed]`.
+* **Stdin Injection Security**: On a cache miss (AI Path), the raw text piped to `sys.stdin` is embedded within the LLM prompt. The rigid `response_format` JSON Schema wrapper constrains the model's output structure, which prevents arbitrary output escape. However, adversarial log lines could theoretically manipulate the model's extraction behavior. **Mitigations**: (1) Use `--sanitize` to strip control characters before the AI Path; (2) Use `--compile` to cache scripts and minimize AI Path exposure; (3) Pre-filter untrusted input with `grep` or `sed` before piping; (4) In high-security environments, run exclusively on the Fast Path after an initial trusted compilation pass.
+* **AI Path Rate Limiting**: In a broken-cache scenario with `tail -f`, rapid AI Path fallbacks could flood your LLM endpoint or rack up API bills. Symparse enforces a `--max-tokens 4000` guard per request (configurable via CLI) to cap token spend. For additional protection, use `--compile` to ensure the Fast Path is populated early.
+* **Windows Compatibility**: The caching subsystem uses `portalocker` for cross-platform file locking, and Windows is fully supported (tested on Windows 11). Full Windows CI coverage is planned for v0.3.
+* **`[embed]` Extra Size**: The `sentence-transformers` + `torch` dependency chain can pull up to 2.5 GB of CUDA libraries. On minimal servers, install the CPU-only torch wheel first:
+  ```bash
+  pip install torch --index-url https://download.pytorch.org/whl/cpu
+  pip install symparse[embed]
+  ```

 ---
-*Built by Aftermath Technologies Ltd.*
+*Maintained by Aftermath Technologies Ltd.*

symparse/ai_client.py

Lines changed: 4 additions & 1 deletion
@@ -16,7 +16,7 @@ class ConfidenceDegradationError(Exception):
     pass

 class AIClient:
-    def __init__(self, base_url: str = None, api_key: str = None, model: str = None, logprob_threshold: float = None):
+    def __init__(self, base_url: str = None, api_key: str = None, model: str = None, logprob_threshold: float = None, max_tokens: int = 4000):
         config = configparser.ConfigParser()
         config_path = Path.home() / ".symparserc"
         if config_path.exists():
@@ -38,6 +38,8 @@ def __init__(self, base_url: str = None, api_key: str = None, model: str = None,
             self.logprob_threshold = float(os.environ["SYMPARSE_CONFIDENCE_THRESHOLD"])
         else:
             self.logprob_threshold = -2.0
+
+        self.max_tokens = max_tokens

     def extract(self, text: str, schema: dict) -> dict:
         """
@@ -59,6 +61,7 @@ def extract(self, text: str, schema: dict) -> dict:
                 }
             },
             "temperature": 0.0,
+            "max_tokens": self.max_tokens,
             "logprobs": True,
             "top_logprobs": 1
         }
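For reference, here is how the new `max_tokens` field slots into the completion request body. This is a minimal sketch: `temperature`, `max_tokens`, `logprobs`, and `top_logprobs` mirror the fields visible in this diff, while the function name, message assembly, and exact `response_format` nesting are illustrative assumptions, not the actual `AIClient.extract` code.

```python
def build_request(model: str, prompt: str, schema: dict, max_tokens: int = 4000) -> dict:
    """Hypothetical assembly of the LLM completion payload.

    Fields marked below are taken from the diff; the message and
    response_format shapes are assumptions for illustration.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],  # assumed shape
        "response_format": {"type": "json_schema",          # assumed nesting
                            "json_schema": {"schema": schema}},
        "temperature": 0.0,        # from diff: hardcoded for determinism
        "max_tokens": max_tokens,  # from diff: new token-spend guard
        "logprobs": True,          # from diff
        "top_logprobs": 1,         # from diff
    }
```

Threading `max_tokens` through the constructor rather than per-call keeps every cache-miss request bounded by the same CLI-configured ceiling.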

symparse/cli.py

Lines changed: 14 additions & 2 deletions
@@ -14,6 +14,8 @@ def parse_args():

     parser.add_argument("-v", "--verbose", action="store_true", help="Enable debug logging")
     parser.add_argument("--version", action="version", version=f"%(prog)s {v}")
+    parser.add_argument("--log-level", choices=["DEBUG", "INFO", "WARNING", "ERROR"], default=None,
+                        help="Set logging verbosity (default: ERROR, or DEBUG with -v)")

     subparsers = parser.add_subparsers(dest="command", required=True)

@@ -26,6 +28,8 @@ def parse_args():
     run_parser.add_argument("--confidence", type=float, default=None, help="Token logprob threshold (default: -2.0)")
     run_parser.add_argument("--model", type=str, help="Override AI backend model (e.g. ollama/gemma3:1b, openai/gpt-4o)")
     run_parser.add_argument("--embed", action="store_true", help="Use local embeddings for tier-2 caching (requires sentence-transformers)")
+    run_parser.add_argument("--sanitize", action="store_true", help="Strip control characters from stdin before AI Path")
+    run_parser.add_argument("--max-tokens", type=int, default=4000, help="Max tokens per LLM request (default: 4000)")

     # "cache" command
     cache_parser = subparsers.add_parser("cache", help="Manage the local cache")
@@ -38,7 +42,13 @@ def parse_args():
 def main():
     args = parse_args()

-    log_level = logging.DEBUG if getattr(args, "verbose", False) else logging.ERROR
+    log_level_str = getattr(args, "log_level", None)
+    if log_level_str:
+        log_level = getattr(logging, log_level_str)
+    elif getattr(args, "verbose", False):
+        log_level = logging.DEBUG
+    else:
+        log_level = logging.ERROR
     logging.basicConfig(level=log_level, format="%(levelname)s: %(message)s")

     if args.command == "cache":
@@ -82,7 +92,9 @@ def main():
             degradation_mode=mode,
             confidence_threshold=getattr(args, "confidence", None),
             use_embeddings=getattr(args, "embed", False),
-            model=getattr(args, "model", None)
+            model=getattr(args, "model", None),
+            sanitize=getattr(args, "sanitize", False),
+            max_tokens=getattr(args, "max_tokens", 4000)
         )
         print(json.dumps(result))
         sys.stdout.flush()

symparse/engine.py

Lines changed: 11 additions & 4 deletions
@@ -37,15 +37,22 @@ def process_stream(
     degradation_mode: GracefulDegradationMode = GracefulDegradationMode.HALT,
     confidence_threshold: float = None,
     use_embeddings: bool = False,
-    model: str = None
+    model: str = None,
+    sanitize: bool = False,
+    max_tokens: int = 4000
 ) -> Dict[str, Any]:
     """
-    Entry point handling routing logic.
-    Routes Fast Paths vs AI Paths. ReDoS-proof regex zero copies strings in C++.
+    Entry point handling routing logic.
+    Routes Fast Paths (sandboxed re2 scripts) vs AI Paths (LLM extraction).
     """
-    ai_client = AIClient(logprob_threshold=confidence_threshold, model=model)
+    ai_client = AIClient(logprob_threshold=confidence_threshold, model=model, max_tokens=max_tokens)
     cache_manager = CacheManager()

+    # Optional input sanitization to mitigate prompt injection
+    if sanitize:
+        import re
+        input_text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', input_text)
+
     # Fast path logic
     import time
     start_time = time.time()
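The sanitization pass added above keeps tab, newline, and carriage return while dropping the remaining C0 control characters and DEL. Extracted as a standalone helper (same character class as the diff; the function and constant names are illustrative):

```python
import re

# Same character class as the --sanitize pass in engine.py:
# C0 controls except tab (\x09), LF (\x0a), and CR (\x0d), plus DEL (\x7f).
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def sanitize_stdin(text: str) -> str:
    """Strip control characters before the text reaches the LLM prompt."""
    return _CONTROL_CHARS.sub('', text)
```

Note that ANSI escape sequences lose their `\x1b` introducer under this filter, so terminal-control payloads in adversarial log lines degrade to inert printable text before reaching the AI Path.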
