
Commit 56ff284

Brad Kinnard authored and committed
fix: resolve all 20 pre-release issues for v0.2.0
- Pin litellm==1.60.2 and portalocker==2.10.1 (exact versions)
- Remove fcntl imports from test_cache_manager.py and test_e2e.py (use portalocker)
- Fix README fast-path description: 'sandboxed re2-based Python extraction scripts'
- Add real-world sample_nginx.log (100 lines) to examples/
- Expand CLI Options in README with global, run, and cache help blocks
- Add Windows compatibility bullet to Known Limitations (portalocker, v0.3 roadmap)
- Add prompt injection mitigations to Known Limitations
- Clarify nondeterminism: fast path is deterministic once cached
- Add [embed] CPU-only pip install command to Known Limitations
- Fix demo reference from scripts/record_demo.py to symparse-demo entry point
- Update demo.py version strings from v0.1.1 to v0.2.0
- Fix duplicate Auto-Compiler heading in README
- Update CHANGELOG.md to reflect all 20 resolved issues
1 parent e266785 commit 56ff284

7 files changed

Lines changed: 162 additions & 28 deletions

File tree

CHANGELOG.md

Lines changed: 17 additions & 9 deletions
@@ -7,22 +7,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [0.2.0] - 2026-02-26
### Added
-- **Multi-Model Support**: Native integration with `litellm` allows seamless drop-in of OpenAI, Anthropic, vLLM, and Ollama backends via the `--model` flag and `~/.symparserc` files.
+- **Multi-Model Support**: Native integration with `litellm` allows seamless drop-in of OpenAI, Anthropic, vLLM, and Ollama backends via the `--model` flag and `~/.symparserc` config files.
- **Nested Schema Compilation**: The LLM compiler now dynamically writes sandboxed Python `def extract()` dict-builders instead of flat matching strings, expanding 95% execution coverage to deep nested JSON.
- **Semantic Tier-2 Caching**: Incorporated local Contrastive Collision checks using exact sentence transformer thresholding (enabled via `--embed` flag and `symparse[embed]`).
- **Telemetry & Streaming**: The `run` command now features true unbuffered `stdin` stream processing for commands like `tail -f`, alongside a robust `--stats` flag for cycle metrics and average latency tracking.
-- **Packaged Viral Demo**: Exported `scripts/record_demo.py` into a core executable `symparse-demo` (via `symparse[demo]`) to allow seamless community benchmarking videos.
-- **Expanded Benchmarking Suite**: `examples/` now contains exhaustive multi-format schemas (Nginx, JSONL, Invoices, Kubernetes) heavily tested for accuracy.
-- CLI argument `--version`.
+- **Packaged Demo**: `symparse-demo` entry point (via `symparse[demo]` extra) for recording terminal demos without a live LLM.
+- **Expanded Benchmarking Suite**: `examples/` now contains exhaustive multi-format schemas (Nginx, JSONL, Invoices, Kubernetes) plus a 100-line real-world Nginx access log sample (`examples/sample_nginx.log`) for independent verification.
+- CLI argument `--version` on the global parser.
+- CLI argument `-v/--verbose` for debug logging.
+- Full CLI help reference in README for `run`, `cache`, and global flags.
+- `CHANGELOG.md` linked from contributing section.

### Changed
-- Replaced Unix-exclusive `fcntl` caching mechanism with cross-platform `portalocker` to enable seamless Windows compatibility.
-- Transitioned cache lock modes from generic shared memory maps to strictly serialized JSON block definitions to avert concurrency overwrites.
-- Re-architected README with comprehensive hardware context variance and security disclaimers (Nondeterminism + Stdin Injection).
+- Replaced Unix-exclusive `fcntl` caching mechanism with cross-platform `portalocker` (pinned to `==2.10.1`) to enable Windows compatibility.
+- Removed residual `fcntl` imports from `test_cache_manager.py` and `test_e2e.py` to ensure all tests are cross-platform.
+- Pinned all dependency versions exactly (`litellm==1.60.2`, `portalocker==2.10.1`, `openai==1.61.0`, `google-re2==1.0.0`, `jsonschema==4.23.0`, `sentence-transformers==3.4.1`, `torch==2.5.1`) to prevent supply-chain drift.
+- Fixed README Fast Path description to accurately reflect sandboxed `re2`-based Python extraction scripts (not raw "regex blocks").
+- Expanded Known Limitations with actionable mitigations for prompt injection, nondeterminism, embed size, and Windows compatibility.
+- Updated demo script version references from `v0.1.1` to `v0.2.0`.
+- Added `requires-python = ">=3.10"` and full Python version classifiers to `pyproject.toml`.

### Security
-- Cached compiled definitions now enforce strict `0700` user-only sandbox directory permissions for system egress safety.
-- Fully pinned exact versions for `openai`, `google-re2`, `jsonschema`, `sentence-transformers`, and `torch` to completely mitigate transient dependency supply-chain drift.
+- Cached compiled definitions enforce strict `0o700` user-only sandbox directory permissions.
+- Fully pinned exact dependency versions to mitigate transient supply-chain drift.
+- Documented prompt injection surface with concrete mitigations (pre-filter input, compile-first workflow, Fast Path isolation).

## [0.1.1] - 2026-02-05
### Added

README.md

Lines changed: 41 additions & 11 deletions
@@ -18,7 +18,7 @@

---

-**Symparse** is a self-optimizing Unix pipeline tool that routes data between an **AI Path** (using local LLMs/Ollama) and a **Fast Path** (using cached `<re2>` regex blocks) with a strict neurosymbolic JSON validation gate.
+**Symparse** is a self-optimizing Unix pipeline tool that routes data between an **AI Path** (using local LLMs via `litellm`) and a **Fast Path** (using cached, sandboxed `re2`-based Python extraction scripts) with a strict neurosymbolic JSON validation gate.

You get the magical, unstructured data extraction of Large Language Models, with the raw performance and ReDoS-safety of compiled C++ regular expressions on 95% of subsequent matched traffic.
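As a concrete illustration of the validation-gate idea in the diff above, here is a minimal stdlib-only sketch. symparse itself gates with `jsonschema`; the flat key/type map below is purely illustrative and is not symparse's schema format.

```python
# Minimal, stdlib-only sketch of a neurosymbolic validation gate: a candidate
# extraction only enters the Fast Path cache if it satisfies the schema.
# NOTE: symparse uses `jsonschema`; this flat key/type map is illustrative only.
SCHEMA = {"ip": str, "status": int}

def gate(candidate):
    """Return True only for dicts carrying every required key with the right type."""
    if not isinstance(candidate, dict):
        return False
    return all(
        key in candidate and isinstance(candidate[key], typ)
        for key, typ in SCHEMA.items()
    )

assert gate({"ip": "203.0.113.7", "status": 200})
assert not gate({"ip": "203.0.113.7"})                   # missing required key
assert not gate({"ip": "203.0.113.7", "status": "200"})  # wrong type
```

The same check-before-cache shape applies regardless of which schema library enforces it: a compilation that fails the gate is discarded instead of cached.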

@@ -102,6 +102,24 @@ tail -f /var/log/nginx/access.log | symparse run --schema access_schema.json --c

### CLI Options

+**Global flags** (before any subcommand):
+```text
+usage: symparse [-h] [-v] [--version] {run,cache} ...
+
+Symparse: LLM to Fast-Path Regex Compiler pipeline
+
+positional arguments:
+  {run,cache}
+    run            Run the pipeline parser
+    cache          Manage the local cache
+
+options:
+  -h, --help       show this help message and exit
+  -v, --verbose    Enable debug logging
+  --version        show program's version number and exit
+```
+
+**`symparse run`**:
```text
usage: symparse run [-h] [--stats] --schema SCHEMA [--compile] [--force-ai]
                    [--confidence CONFIDENCE] [--model MODEL] [--embed]

@@ -118,7 +136,15 @@ options:
  --embed          Use local embeddings for tier-2 caching (requires sentence-transformers)
```

-Run `symparse run --help` or `symparse cache --help` to explore the subcommands.
+**`symparse cache`**:
+```text
+usage: symparse cache [-h] {list,clear} ...
+
+positional arguments:
+  {list,clear}
+    list           Display cached extraction scripts
+    clear          Wipe the local compilation directory
+```

## 🐳 Docker (Pre-Loaded Fast Start)
@@ -131,14 +157,18 @@ docker pull aftermath/symparse:latest
docker run -i --rm aftermath/symparse run --schema my_schema.json < logs.txt
```

-## 🎬 Viral 60-Second Demo
+## 🎬 Demo

-We have provided an automated terminal typing simulation script perfect for recording marketing GIFs (e.g. via Asciinema or QuickTime).
+Symparse ships with a terminal typing simulation for recording marketing GIFs or live demos (e.g. via Asciinema or QuickTime).

```bash
-python scripts/record_demo.py
+# Installed as an entry point via pip install symparse[demo]
+symparse-demo
```

+> [!NOTE]
+> The demo requires the `[demo]` extra (`pip install symparse[demo]`), which installs `asciinema`. The `symparse-demo` command simulates a cold-start plus warm-start pipeline and does not require a live LLM.
+
## 🏎️ Benchmarks

How fast is the exact same Unix pipe once the AI successfully compiles the cache?

@@ -154,7 +184,7 @@ We ran `symparse run --stats` iteratively over batches of 1,000 dense synthetic
| **JSONL Polishing** | Plucking sparse target keys from giant files | `1830.29ms ± 52.56ms` | `~546 ops/sec` |

> [!NOTE]
-> **Real-world Variance**: Throughput scales inversely with schema nesting depth, regex multiline complexity, and CPU hardware. The `2300+ ops/sec` figure assumes flat data on standard server hardware; extremely nested JSON builders pulling disjoint strings across megabytes of context will naturally trend toward `500-800 ops/sec`. However, this still represents a massive magnitude leap over performing 1000 synchronous 1-second native LLM iterations. Note: To guarantee rigorous reproducibility, `benchmarks/run_examples.py` strictly utilizes `random.seed(42)` on all generated text workloads.
+> **Reproducibility & Methodology**: All synthetic benchmarks use `random.seed(42)` for deterministic log generation (see `benchmarks/run_examples.py`). A real-world Nginx access log sample is provided at `examples/sample_nginx.log` (100 lines from a production-like workload) for independent verification. Throughput scales inversely with schema nesting depth, regex multiline complexity, and CPU hardware. The `2300+ ops/sec` figure assumes flat data on standard server hardware; deeply nested JSON builders trending toward `500-800 ops/sec` still represent a massive leap over 1000 synchronous LLM iterations.

See the `examples/` directory for the raw configurations.
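To illustrate why `random.seed(42)` makes the synthetic workloads reproducible, here is a toy generator sketch. The field names and value ranges are assumptions for illustration, not those of the real `benchmarks/run_examples.py`.

```python
import random

# Toy stand-in for a seeded benchmark workload generator: with a fixed seed,
# every run emits the identical sequence of synthetic access-log lines.
# NOTE: fields and value ranges here are illustrative assumptions.
def synthetic_log(n_lines, seed=42):
    rng = random.Random(seed)  # isolated RNG: the seed fully determines the output
    lines = []
    for _ in range(n_lines):
        ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
        status = rng.choice([200, 200, 200, 301, 404, 500])
        lines.append(f'{ip} "GET /page/{rng.randint(1, 99)}" {status}')
    return lines

batch = synthetic_log(1000)
```

Because the generator is deterministic, any latency difference between benchmark runs can be attributed to the pipeline, not to a shifting workload.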

@@ -204,8 +234,6 @@ result = process_stream(

### Auto-Compiler & Cache System

-### Auto-Compiler & Cache System
-
Symparse dynamically builds ReDoS-resistant extraction pipelines on the fly by generating sandboxed Python `dict`-builder functions surrounding `re2` matches. The output acts identical to strict LLM object extraction without needing `json.loads()`.

```python
@@ -227,9 +255,11 @@ Symparse is released under the MIT Open Source License. See the [LICENSE](LICENS

### ⚠️ Known Limitations & Risks

* **Log Context Boundaries**: `symparse` assumes the input stream consists of discrete log records partitioned by line breaks (default for commands like `tail` or `grep`). Feeding dense prose paragraphs over stdin with multiple distinct extraction candidates per line may cause extraction overwrites.
-* **Complex Data Transformations**: The compiler engine currently constructs regular expressions via safe sandboxed Python code. It is highly efficient for pattern destructuring, but cannot execute deep logical transformations (e.g., date-time conversions, mathematical sums) during the Fast Path stage. Use downstream piped tools like `jq` for manipulation.
-* **Nondeterminism**: The underlying LLM compiler may occasionally produce slightly different regular expression structures for identical schemas on cold starts. Symparse relies on rigorous JSON Schema gating and self-healing purges to guarantee that even jittery compilations are 100% compliant before entering the fast path.
-* **Stdin Injection Security**: The text piped directly to `sys.stdin` on a cold miss is embedded within the backend AI prompt structure. While the rigid `response_format` JSON wrapper prevents code execution or systemic prompt escape, the backend model is technically exposed to arbitrary parsed string manipulation if ingesting adversarial logs.
+* **Complex Data Transformations**: The compiler engine constructs sandboxed Python scripts wrapping `re2` regex extractions (executed via restricted `exec()` with limited `__builtins__`). It is highly efficient for pattern destructuring, but cannot execute deep logical transformations (e.g., date-time conversions, mathematical sums) during the Fast Path stage. Use downstream piped tools like `jq` for manipulation.
+* **Nondeterminism**: The underlying LLM compiler may occasionally produce slightly different regex structures for identical schemas on cold starts. However, once a script enters the Fast Path cache, execution is fully deterministic. Symparse relies on rigorous JSON Schema gating and self-healing cache purges to guarantee that even jittery compilations are 100% schema-compliant before caching. To minimize cold-start variance, use `temperature=0.0` (default) and a consistent `--model`.
+* **Stdin Injection Security**: On a cache miss (AI Path), the raw text piped to `sys.stdin` is embedded within the LLM prompt. The rigid `response_format` JSON Schema wrapper constrains the model's output structure, which prevents arbitrary output escape. However, adversarial log lines could theoretically manipulate the model's extraction behavior. **Mitigations**: (1) Use `--compile` to cache scripts and minimize AI Path exposure; (2) Pre-filter untrusted input with `grep` or `sed` before piping; (3) In high-security environments, run exclusively on the Fast Path after an initial trusted compilation pass.
+* **Windows Compatibility**: The caching subsystem uses `portalocker` for cross-platform file locking. Windows is supported in principle but has not been extensively tested in production. Full Windows CI coverage is planned for v0.3.
+* **`[embed]` Extra Size**: The `sentence-transformers` + `torch` dependency chain can pull up to 2.5 GB of CUDA libraries. On minimal servers, install the CPU-only torch wheel first: `pip install torch --index-url https://download.pytorch.org/whl/cpu && pip install symparse[embed]`.
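For readers curious what the restricted `exec()` pattern mentioned in the limitations above looks like, here is a condensed sketch. Stdlib `re` stands in for `google-re2`, and the script, pattern, and whitelist contents are illustrative, not symparse's actual compiler output.

```python
import re

# Illustrative compiled extraction script, shaped like the `def extract()`
# dict-builders the LLM compiler emits. NOTE: this script and pattern are
# examples for this sketch, not real symparse output.
SCRIPT = '''
def extract(line):
    m = PATTERN.match(line)
    if m is None:
        return None
    return {"ip": m.group("ip"), "status": int(m.group("status"))}
'''

def run_sandboxed(script, line):
    # Stdlib `re` stands in for `google-re2` in this sketch.
    pattern = re.compile(r"(?P<ip>\d+\.\d+\.\d+\.\d+) .* (?P<status>\d{3})$")
    # Restricted exec(): the script sees only an explicit whitelist, so it
    # cannot import modules, open files, or touch the surrounding process.
    sandbox = {"__builtins__": {"int": int}, "PATTERN": pattern}
    exec(script, sandbox)
    return sandbox["extract"](line)

record = run_sandboxed(SCRIPT, "203.0.113.7 GET /index.html 200")
# record == {"ip": "203.0.113.7", "status": 200}
```

The key design point is the trimmed `__builtins__` mapping: only explicitly whitelisted names are reachable, which is what keeps a jittery or adversarial compilation from doing more than pattern destructuring.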

---
*Built by Aftermath Technologies Ltd.*
