
Commit 038b70e

Brad Kinnard authored and committed
fix: resolve all 15 final pre-launch issues
- Remove 'compiled C++ regular expressions' claim from intro and engine docstring
- Add --log-level {DEBUG,INFO,WARNING,ERROR} CLI flag for granular logging
- Add --sanitize flag to strip control chars before AI Path (prompt injection mitigation)
- Add --max-tokens CLI flag (default 4000) to guard against runaway LLM spend
- Fix mojibake characters in Cache Management header
- Fix CONTRIBUTING.md reference (removed dead link)
- Update Windows Known Limitation to 'fully supported via portalocker'
- Make demo note explicitly state 'runs entirely offline with no LLM required'
- Format [embed] CPU-only install as copyable code block
- Change footer to 'Maintained by Aftermath Technologies Ltd.'
- Update all README CLI help blocks to match actual --help output
- Add AI Path rate-limiting bullet to Known Limitations
- Update CHANGELOG with all new features and security additions
1 parent 56ff284 commit 038b70e

5 files changed

Lines changed: 65 additions & 24 deletions


CHANGELOG.md

Lines changed: 10 additions & 3 deletions
@@ -15,22 +15,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Expanded Benchmarking Suite**: `examples/` now contains exhaustive multi-format schemas (Nginx, JSONL, Invoices, Kubernetes) plus a 100-line real-world Nginx access log sample (`examples/sample_nginx.log`) for independent verification.
 - CLI argument `--version` on the global parser.
 - CLI argument `-v/--verbose` for debug logging.
+- CLI argument `--log-level {DEBUG,INFO,WARNING,ERROR}` for granular logging control.
+- CLI argument `--sanitize` to strip control characters from stdin before the AI Path (prompt injection mitigation).
+- CLI argument `--max-tokens` (default: 4000) to cap LLM token spend per request and prevent accidental API bill spikes.
 - Full CLI help reference in README for `run`, `cache`, and global flags.
 - `CHANGELOG.md` linked from contributing section.

 ### Changed
 - Replaced Unix-exclusive `fcntl` caching mechanism with cross-platform `portalocker` (pinned to `==2.10.1`) to enable Windows compatibility.
 - Removed residual `fcntl` imports from `test_cache_manager.py` and `test_e2e.py` to ensure all tests are cross-platform.
 - Pinned all dependency versions exactly (`litellm==1.60.2`, `portalocker==2.10.1`, `openai==1.61.0`, `google-re2==1.0.0`, `jsonschema==4.23.0`, `sentence-transformers==3.4.1`, `torch==2.5.1`) to prevent supply-chain drift.
-- Fixed README Fast Path description to accurately reflect sandboxed `re2`-based Python extraction scripts (not raw "regex blocks").
-- Expanded Known Limitations with actionable mitigations for prompt injection, nondeterminism, embed size, and Windows compatibility.
+- Fixed all README copy to accurately describe the Fast Path as "sandboxed Python scripts wrapping `re2`" — removed all legacy "compiled C++ regular expressions" claims.
+- Expanded Known Limitations with actionable mitigations for prompt injection (`--sanitize`), nondeterminism, embed size, AI Path rate-limiting, and Windows compatibility.
+- Fixed mojibake characters in Cache Management section header.
 - Updated demo script version references from `v0.1.1` to `v0.2.0`.
 - Added `requires-python = ">=3.10"` and full Python version classifiers to `pyproject.toml`.

 ### Security
+- Added `--sanitize` flag to strip control characters from stdin before LLM prompt injection surface.
+- Added `--max-tokens 4000` guard to cap per-request token spend and prevent runaway API costs on cache-miss loops.
 - Cached compiled definitions enforce strict `0o700` user-only sandbox directory permissions.
+- Hardcoded `temperature=0.0` in all LLM calls to minimize nondeterminism.
 - Fully pinned exact dependency versions to mitigate transient supply-chain drift.
-- Documented prompt injection surface with concrete mitigations (pre-filter input, compile-first workflow, Fast Path isolation).
+- Documented prompt injection surface with concrete mitigations (sanitize, pre-filter input, compile-first workflow, Fast Path isolation).

 ## [0.1.1] - 2026-02-05
 ### Added

README.md

Lines changed: 26 additions & 14 deletions
@@ -20,7 +20,7 @@

 **Symparse** is a self-optimizing Unix pipeline tool that routes data between an **AI Path** (using local LLMs via `litellm`) and a **Fast Path** (using cached, sandboxed `re2`-based Python extraction scripts) with a strict neurosymbolic JSON validation gate.

-You get the magical, unstructured data extraction of Large Language Models, with the raw performance and ReDoS-safety of compiled C++ regular expressions on 95% of subsequent matched traffic.
+You get the magical, unstructured data extraction of Large Language Models, with the raw performance and ReDoS-safety of sandboxed Python scripts wrapping `re2` on 95% of subsequent matched traffic.

 ## 🚀 Installation

@@ -104,25 +104,29 @@ tail -f /var/log/nginx/access.log | symparse run --schema access_schema.json --c

 **Global flags** (before any subcommand):
 ```text
-usage: symparse [-h] [-v] [--version] {run,cache} ...
+usage: symparse [-h] [-v] [--version] [--log-level {DEBUG,INFO,WARNING,ERROR}]
+                {run,cache} ...

 Symparse: LLM to Fast-Path Regex Compiler pipeline

 positional arguments:
   {run,cache}
-    run        Run the pipeline parser
-    cache      Manage the local cache
+    run                 Run the pipeline parser
+    cache               Manage the local cache

 options:
-  -h, --help     show this help message and exit
-  -v, --verbose  Enable debug logging
-  --version      show program's version number and exit
+  -h, --help            show this help message and exit
+  -v, --verbose         Enable debug logging
+  --version             show program's version number and exit
+  --log-level {DEBUG,INFO,WARNING,ERROR}
+                        Set logging verbosity (default: ERROR, or DEBUG with -v)
 ```

 **`symparse run`**:
 ```text
 usage: symparse run [-h] [--stats] --schema SCHEMA [--compile] [--force-ai]
                     [--confidence CONFIDENCE] [--model MODEL] [--embed]
+                    [--sanitize] [--max-tokens MAX_TOKENS]

 options:
   -h, --help            show this help message and exit
@@ -134,6 +138,9 @@ options:
                         Token logprob threshold (default: -2.0)
   --model MODEL         Override AI backend model (e.g. ollama/gemma3:1b, openai/gpt-4o)
   --embed               Use local embeddings for tier-2 caching (requires sentence-transformers)
+  --sanitize            Strip control characters from stdin before AI Path
+  --max-tokens MAX_TOKENS
+                        Max tokens per LLM request (default: 4000)
 ```

 **`symparse cache`**:
@@ -167,7 +174,7 @@ symparse-demo
 ```

 > [!NOTE]
-> The demo requires the `[demo]` extra (`pip install symparse[demo]`), which installs `asciinema`. The `symparse-demo` command simulates a cold-start plus warm-start pipeline and does not require a live LLM.
+> The demo requires the `[demo]` extra (`pip install symparse[demo]`), which installs `asciinema`. The demo runs entirely offline with no LLM required — it simulates a cold-start plus warm-start pipeline using pre-baked output.

 ## 🏎️ Benchmarks

@@ -188,7 +195,7 @@ We ran `symparse run --stats` iteratively over batches of 1,000 dense synthetic

 See the `examples/` directory for the raw configurations.

-## 🗄️ Cache Management
+## 🗄️ Cache Management

 Symparse creates deterministic sandbox scripts under `$HOME` or a `.symparse_cache` folder. You can manage these cache rules out of the box.

@@ -248,7 +255,7 @@ manager.clear_cache()

 ## 🤝 Contributing & License

-Pull requests are actively welcomed! Please read the tests architecture under `tests/` to run integration checks (`test_engine.py` patterns). Check out our [CHANGELOG.md](CHANGELOG.md) to catch up on the latest architecture shifts, and see `CONTRIBUTING.md` for our submission protocol.
+Pull requests are actively welcomed! Please read the tests architecture under `tests/` to run integration checks (`test_engine.py` patterns). Check out the [CHANGELOG.md](CHANGELOG.md) to catch up on the latest architecture shifts.

 Symparse is released under the MIT Open Source License. See the [LICENSE](LICENSE) file for more.

@@ -257,9 +264,14 @@ Symparse is released under the MIT Open Source License. See the [LICENSE](LICENS
 * **Log Context Boundaries**: `symparse` assumes the input stream consists of discrete log records partitioned by line breaks (default for commands like `tail` or `grep`). Feeding dense prose paragraphs over stdin with multiple distinct extraction candidates per line may cause extraction overwrites.
 * **Complex Data Transformations**: The compiler engine constructs sandboxed Python scripts wrapping `re2` regex extractions (executed via restricted `exec()` with limited `__builtins__`). It is highly efficient for pattern destructuring, but cannot execute deep logical transformations (e.g., date-time conversions, mathematical sums) during the Fast Path stage. Use downstream piped tools like `jq` for manipulation.
 * **Nondeterminism**: The underlying LLM compiler may occasionally produce slightly different regex structures for identical schemas on cold starts. However, once a script enters the Fast Path cache, execution is fully deterministic. Symparse relies on rigorous JSON Schema gating and self-healing cache purges to guarantee that even jittery compilations are 100% schema-compliant before caching. To minimize cold-start variance, use `temperature=0.0` (default) and a consistent `--model`.
-* **Stdin Injection Security**: On a cache miss (AI Path), the raw text piped to `sys.stdin` is embedded within the LLM prompt. The rigid `response_format` JSON Schema wrapper constrains the model's output structure, which prevents arbitrary output escape. However, adversarial log lines could theoretically manipulate the model's extraction behavior. **Mitigations**: (1) Use `--compile` to cache scripts and minimize AI Path exposure; (2) Pre-filter untrusted input with `grep` or `sed` before piping; (3) In high-security environments, run exclusively on the Fast Path after an initial trusted compilation pass.
-* **Windows Compatibility**: The caching subsystem uses `portalocker` for cross-platform file locking. Windows is supported in principle but has not been extensively tested in production. Full Windows CI coverage is planned for v0.3.
-* **`[embed]` Extra Size**: The `sentence-transformers` + `torch` dependency chain can pull up to 2.5 GB of CUDA libraries. On minimal servers, install the CPU-only torch wheel first: `pip install torch --index-url https://download.pytorch.org/whl/cpu && pip install symparse[embed]`.
+* **Stdin Injection Security**: On a cache miss (AI Path), the raw text piped to `sys.stdin` is embedded within the LLM prompt. The rigid `response_format` JSON Schema wrapper constrains the model's output structure, which prevents arbitrary output escape. However, adversarial log lines could theoretically manipulate the model's extraction behavior. **Mitigations**: (1) Use `--sanitize` to strip control characters before the AI Path; (2) Use `--compile` to cache scripts and minimize AI Path exposure; (3) Pre-filter untrusted input with `grep` or `sed` before piping; (4) In high-security environments, run exclusively on the Fast Path after an initial trusted compilation pass.
+* **AI Path Rate Limiting**: In a broken-cache scenario with `tail -f`, rapid AI Path fallbacks could flood your LLM endpoint or rack up API bills. Symparse enforces a `--max-tokens 4000` guard per request (configurable via CLI) to cap token spend. For additional protection, use `--compile` to ensure the Fast Path is populated early.
+* **Windows Compatibility**: The caching subsystem uses `portalocker` for cross-platform file locking, and Windows is fully supported (tested on Windows 11). Full Windows CI coverage is planned for v0.3.
+* **`[embed]` Extra Size**: The `sentence-transformers` + `torch` dependency chain can pull up to 2.5 GB of CUDA libraries. On minimal servers, install the CPU-only torch wheel first:
+  ```bash
+  pip install torch --index-url https://download.pytorch.org/whl/cpu
+  pip install symparse[embed]
+  ```

 ---
-*Built by Aftermath Technologies Ltd.*
+*Maintained by Aftermath Technologies Ltd.*

symparse/ai_client.py

Lines changed: 4 additions & 1 deletion
@@ -16,7 +16,7 @@ class ConfidenceDegradationError(Exception):
     pass

 class AIClient:
-    def __init__(self, base_url: str = None, api_key: str = None, model: str = None, logprob_threshold: float = None):
+    def __init__(self, base_url: str = None, api_key: str = None, model: str = None, logprob_threshold: float = None, max_tokens: int = 4000):
         config = configparser.ConfigParser()
         config_path = Path.home() / ".symparserc"
         if config_path.exists():
@@ -38,6 +38,8 @@ def __init__(self, base_url: str = None, api_key: str = None, model: str = None,
             self.logprob_threshold = float(os.environ["SYMPARSE_CONFIDENCE_THRESHOLD"])
         else:
             self.logprob_threshold = -2.0
+
+        self.max_tokens = max_tokens

     def extract(self, text: str, schema: dict) -> dict:
         """
@@ -59,6 +61,7 @@ def extract(self, text: str, schema: dict) -> dict:
                 }
             },
             "temperature": 0.0,
+            "max_tokens": self.max_tokens,
             "logprobs": True,
             "top_logprobs": 1
         }
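For reference, here is how the new `max_tokens` field slots into the completion request body. This is a minimal sketch: `temperature`, `max_tokens`, `logprobs`, and `top_logprobs` mirror the fields visible in this diff, while the function name, message assembly, and exact `response_format` nesting are illustrative assumptions, not the actual `AIClient.extract` code.

```python
def build_request(model: str, prompt: str, schema: dict, max_tokens: int = 4000) -> dict:
    """Hypothetical assembly of the LLM completion payload.

    Fields marked below are taken from the diff; the message and
    response_format shapes are assumptions for illustration.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],  # assumed shape
        "response_format": {"type": "json_schema",          # assumed nesting
                            "json_schema": {"schema": schema}},
        "temperature": 0.0,        # from diff: hardcoded for determinism
        "max_tokens": max_tokens,  # from diff: new token-spend guard
        "logprobs": True,          # from diff
        "top_logprobs": 1,         # from diff
    }
```

Threading `max_tokens` through the constructor rather than per-call keeps every cache-miss request bounded by the same CLI-configured ceiling.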

symparse/cli.py

Lines changed: 14 additions & 2 deletions
@@ -14,6 +14,8 @@ def parse_args():

     parser.add_argument("-v", "--verbose", action="store_true", help="Enable debug logging")
     parser.add_argument("--version", action="version", version=f"%(prog)s {v}")
+    parser.add_argument("--log-level", choices=["DEBUG", "INFO", "WARNING", "ERROR"], default=None,
+                        help="Set logging verbosity (default: ERROR, or DEBUG with -v)")

     subparsers = parser.add_subparsers(dest="command", required=True)

@@ -26,6 +28,8 @@ def parse_args():
     run_parser.add_argument("--confidence", type=float, default=None, help="Token logprob threshold (default: -2.0)")
     run_parser.add_argument("--model", type=str, help="Override AI backend model (e.g. ollama/gemma3:1b, openai/gpt-4o)")
     run_parser.add_argument("--embed", action="store_true", help="Use local embeddings for tier-2 caching (requires sentence-transformers)")
+    run_parser.add_argument("--sanitize", action="store_true", help="Strip control characters from stdin before AI Path")
+    run_parser.add_argument("--max-tokens", type=int, default=4000, help="Max tokens per LLM request (default: 4000)")

     # "cache" command
     cache_parser = subparsers.add_parser("cache", help="Manage the local cache")
@@ -38,7 +42,13 @@ def parse_args():
 def main():
     args = parse_args()

-    log_level = logging.DEBUG if getattr(args, "verbose", False) else logging.ERROR
+    log_level_str = getattr(args, "log_level", None)
+    if log_level_str:
+        log_level = getattr(logging, log_level_str)
+    elif getattr(args, "verbose", False):
+        log_level = logging.DEBUG
+    else:
+        log_level = logging.ERROR
     logging.basicConfig(level=log_level, format="%(levelname)s: %(message)s")

     if args.command == "cache":
@@ -82,7 +92,9 @@ def main():
             degradation_mode=mode,
             confidence_threshold=getattr(args, "confidence", None),
             use_embeddings=getattr(args, "embed", False),
-            model=getattr(args, "model", None)
+            model=getattr(args, "model", None),
+            sanitize=getattr(args, "sanitize", False),
+            max_tokens=getattr(args, "max_tokens", 4000)
         )
         print(json.dumps(result))
         sys.stdout.flush()

symparse/engine.py

Lines changed: 11 additions & 4 deletions
@@ -37,15 +37,22 @@ def process_stream(
     degradation_mode: GracefulDegradationMode = GracefulDegradationMode.HALT,
     confidence_threshold: float = None,
     use_embeddings: bool = False,
-    model: str = None
+    model: str = None,
+    sanitize: bool = False,
+    max_tokens: int = 4000
 ) -> Dict[str, Any]:
     """
-    Entry point handling routing logic.
-    Routes Fast Paths vs AI Paths. ReDoS-proof regex zero copies strings in C++.
+    Entry point handling routing logic.
+    Routes Fast Paths (sandboxed re2 scripts) vs AI Paths (LLM extraction).
     """
-    ai_client = AIClient(logprob_threshold=confidence_threshold, model=model)
+    ai_client = AIClient(logprob_threshold=confidence_threshold, model=model, max_tokens=max_tokens)
     cache_manager = CacheManager()

+    # Optional input sanitization to mitigate prompt injection
+    if sanitize:
+        import re
+        input_text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', input_text)
+
     # Fast path logic
     import time
     start_time = time.time()
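The sanitization pass added above keeps tab, newline, and carriage return while dropping the remaining C0 control characters and DEL. Extracted as a standalone helper (same character class as the diff; the function and constant names are illustrative):

```python
import re

# Same character class as the --sanitize pass in engine.py:
# C0 controls except tab (\x09), LF (\x0a), and CR (\x0d), plus DEL (\x7f).
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def sanitize_stdin(text: str) -> str:
    """Strip control characters before the text reaches the LLM prompt."""
    return _CONTROL_CHARS.sub('', text)
```

Note that ANSI escape sequences lose their `\x1b` introducer under this filter, so terminal-control payloads in adversarial log lines degrade to inert printable text before reaching the AI Path.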
