- Pin litellm==1.60.2 and portalocker==2.10.1 (exact versions)
- Remove fcntl imports from test_cache_manager.py and test_e2e.py (use portalocker)
- Fix README fast-path description: 'sandboxed re2-based Python extraction scripts'
- Add real-world sample_nginx.log (100 lines) to examples/
- Expand CLI Options in README with global, run, and cache help blocks
- Add Windows compatibility bullet to Known Limitations (portalocker, v0.3 roadmap)
- Add prompt injection mitigations to Known Limitations
- Clarify nondeterminism: fast path is deterministic once cached
- Add [embed] CPU-only pip install command to Known Limitations
- Fix demo reference from scripts/record_demo.py to symparse-demo entry point
- Update demo.py version strings from v0.1.1 to v0.2.0
- Fix duplicate Auto-Compiler heading in README
- Update CHANGELOG.md to reflect all 20 resolved issues
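
The `fcntl` → `portalocker` swap noted above can be sketched as follows. This is a minimal illustration, not symparse's actual cache code: `write_cache_entry` and the side-car `.lock` file layout are hypothetical, and the snippet degrades to an unlocked (but still atomic) write when `portalocker` is not installed.

```python
import json
import os
import tempfile

try:
    # Cross-platform file locking (pinned to ==2.10.1 in this release).
    import portalocker
    HAVE_PORTALOCKER = True
except ImportError:
    HAVE_PORTALOCKER = False


def write_cache_entry(cache_path, key, definition):
    """Serialize one compiled definition into a JSON cache under an exclusive lock."""
    lock = portalocker.Lock(cache_path + ".lock", timeout=5) if HAVE_PORTALOCKER else None
    if lock:
        lock.acquire()
    try:
        cache = {}
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                cache = json.load(f)
        cache[key] = definition
        # Write to a temp file, then atomically replace: readers never see a torn file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(cache_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(cache, f)
        os.replace(tmp, cache_path)
    finally:
        if lock:
            lock.release()


cache_file = os.path.join(tempfile.mkdtemp(), "compiled.json")
write_cache_entry(cache_file, "nginx_schema", {"pattern": r"(\d+\.\d+\.\d+\.\d+)"})
write_cache_entry(cache_file, "jsonl_schema", {"pattern": r"\{.*\}"})
with open(cache_file) as f:
    cache_contents = json.load(f)
```

Unlike `fcntl.flock`, this pattern runs unchanged on Windows, which is what motivated the swap.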
CHANGELOG.md (+17 −9)
@@ -7,22 +7,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [0.2.0] - 2026-02-26

 ### Added

-- **Multi-Model Support**: Native integration with `litellm` allows seamless drop-in of OpenAI, Anthropic, vLLM, and Ollama backends via the `--model` flag and `~/.symparserc` files.
+- **Multi-Model Support**: Native integration with `litellm` allows seamless drop-in of OpenAI, Anthropic, vLLM, and Ollama backends via the `--model` flag and `~/.symparserc` config files.
 - **Nested Schema Compilation**: The LLM compiler now dynamically writes sandboxed Python `def extract()` dict-builders instead of flat matching strings, expanding 95% execution coverage to deeply nested JSON.
 - **Semantic Tier-2 Caching**: Incorporated local contrastive collision checks using exact sentence-transformer thresholding (enabled via the `--embed` flag and `symparse[embed]`).
 - **Telemetry & Streaming**: The `run` command now features true unbuffered `stdin` stream processing for commands like `tail -f`, alongside a robust `--stats` flag for cycle metrics and average latency tracking.
-- **Packaged Viral Demo**: Exported `scripts/record_demo.py` into a core executable `symparse-demo` (via `symparse[demo]`) to allow seamless community benchmarking videos.
-- **Expanded Benchmarking Suite**: `examples/` now contains exhaustive multi-format schemas (Nginx, JSONL, Invoices, Kubernetes) heavily tested for accuracy.
-- CLI argument `--version`.
+- **Packaged Demo**: `symparse-demo` entry point (via the `symparse[demo]` extra) for recording terminal demos without a live LLM.
+- **Expanded Benchmarking Suite**: `examples/` now contains exhaustive multi-format schemas (Nginx, JSONL, Invoices, Kubernetes) plus a 100-line real-world Nginx access log sample (`examples/sample_nginx.log`) for independent verification.
+- CLI argument `--version` on the global parser.
+- CLI argument `-v/--verbose` for debug logging.
+- Full CLI help reference in README for `run`, `cache`, and global flags.
+- `CHANGELOG.md` linked from the contributing section.

 ### Changed

-- Replaced Unix-exclusive `fcntl` caching mechanism with cross-platform `portalocker` to enable seamless Windows compatibility.
-- Transitioned cache lock modes from generic shared memory maps to strictly serialized JSON block definitions to avert concurrency overwrites.
-- Re-architected README with comprehensive hardware context variance and security disclaimers (Nondeterminism + Stdin Injection).
+- Replaced Unix-exclusive `fcntl` caching mechanism with cross-platform `portalocker` (pinned to `==2.10.1`) to enable Windows compatibility.
+- Removed residual `fcntl` imports from `test_cache_manager.py` and `test_e2e.py` to ensure all tests are cross-platform.
+- Pinned all dependency versions exactly (`litellm==1.60.2`, `portalocker==2.10.1`, `openai==1.61.0`, `google-re2==1.0.0`, `jsonschema==4.23.0`, `sentence-transformers==3.4.1`, `torch==2.5.1`) to prevent supply-chain drift.
+- Fixed README Fast Path description to accurately reflect sandboxed `re2`-based Python extraction scripts (not raw "regex blocks").
+- Expanded Known Limitations with actionable mitigations for prompt injection, nondeterminism, embed size, and Windows compatibility.
+- Updated demo script version references from `v0.1.1` to `v0.2.0`.
+- Added `requires-python = ">=3.10"` and full Python version classifiers to `pyproject.toml`.

 ### Security

-- Cached compiled definitions now enforce strict `0700` user-only sandbox directory permissions for system egress safety.
-- Fully pinned exact versions for `openai`, `google-re2`, `jsonschema`, `sentence-transformers`, and `torch` to completely mitigate transient dependency supply-chain drift.
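
The `0700` cache-directory hardening mentioned under Security can be illustrated with a short sketch (`ensure_cache_dir` is a hypothetical helper, not symparse's real implementation). A subtlety worth capturing: `os.makedirs(mode=...)` is filtered by the process umask and ignored for pre-existing directories, so an explicit `os.chmod` is the reliable step.

```python
import os
import stat
import tempfile


def ensure_cache_dir(path: str) -> str:
    """Create a cache directory readable/writable by the owning user only."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    # mode= is masked by the umask (and skipped if the dir already exists),
    # so enforce user-only permissions explicitly.
    os.chmod(path, 0o700)
    return path


cache_dir = ensure_cache_dir(os.path.join(tempfile.mkdtemp(), "symparse"))
cache_mode = stat.S_IMODE(os.stat(cache_dir).st_mode)
```

On Windows, POSIX mode bits are largely advisory; there the cross-platform safety net is the `portalocker`-based locking, not directory permissions.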
README.md (+41 −11)
@@ -18,7 +18,7 @@

 ---

-**Symparse** is a self-optimizing Unix pipeline tool that routes data between an **AI Path** (using local LLMs/Ollama) and a **Fast Path** (using cached `<re2>` regex blocks) with a strict neurosymbolic JSON validation gate.
+**Symparse** is a self-optimizing Unix pipeline tool that routes data between an **AI Path** (using local LLMs via `litellm`) and a **Fast Path** (using cached, sandboxed `re2`-based Python extraction scripts) with a strict neurosymbolic JSON validation gate.

 You get the magical, unstructured data extraction of Large Language Models, with the raw performance and ReDoS-safety of compiled C++ regular expressions on 95% of subsequent matched traffic.
 docker run -i --rm aftermath/symparse run --schema my_schema.json < logs.txt
 ```

-## 🎬 Viral 60-Second Demo
+## 🎬 Demo

-We have provided an automated terminal typing simulation script perfect for recording marketing GIFs (e.g. via Asciinema or QuickTime).
+Symparse ships with a terminal typing simulation for recording marketing GIFs or live demos (e.g. via Asciinema or QuickTime).

 ```bash
-python scripts/record_demo.py
+# Installed as an entry point via pip install symparse[demo]
+symparse-demo
 ```

+> [!NOTE]
+> The demo requires the `[demo]` extra (`pip install symparse[demo]`), which installs `asciinema`. The `symparse-demo` command simulates a cold-start plus warm-start pipeline and does not require a live LLM.
+
 ## 🏎️ Benchmarks

 How fast is the exact same Unix pipe once the AI successfully compiles the cache?

@@ -154,7 +184,7 @@ We ran `symparse run --stats` iteratively over batches of 1,000 dense synthetic
-> **Real-world Variance**: Throughput scales inversely with schema nesting depth, regex multiline complexity, and CPU hardware. The `2300+ ops/sec` figure assumes flat data on standard server hardware; extremely nested JSON builders pulling disjoint strings across megabytes of context will naturally trend toward `500-800 ops/sec`. However, this still represents a massive magnitude leap over performing 1000 synchronous 1-second native LLM iterations. Note: To guarantee rigorous reproducibility, `benchmarks/run_examples.py` strictly utilizes `random.seed(42)` on all generated text workloads.
+> **Reproducibility & Methodology**: All synthetic benchmarks use `random.seed(42)` for deterministic log generation (see `benchmarks/run_examples.py`). A real-world Nginx access log sample is provided at `examples/sample_nginx.log` (100 lines from a production-like workload) for independent verification. Throughput scales inversely with schema nesting depth, regex multiline complexity, and CPU hardware. The `2300+ ops/sec` figure assumes flat data on standard server hardware; deeply nested JSON builders trending toward `500-800 ops/sec` still represent a massive leap over 1000 synchronous LLM iterations.

 See the `examples/` directory for the raw configurations.
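
The seeded-determinism claim can be reproduced with a sketch like the following (a hypothetical generator; the real one lives in `benchmarks/run_examples.py`). The point is that identical seeds yield byte-identical workloads, so throughput numbers are comparable across runs and machines.

```python
import random


def generate_synthetic_logs(n, seed=42):
    """Deterministically generate n fake nginx-style access log lines."""
    # Use a local Random instance so callers never disturb global RNG state.
    rng = random.Random(seed)
    methods = ["GET", "POST", "PUT"]
    lines = []
    for _ in range(n):
        ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
        status = rng.choice([200, 404, 500])
        lines.append(f'{ip} - - "{rng.choice(methods)} /item/{rng.randint(1, 999)}" {status}')
    return lines


batch_a = generate_synthetic_logs(1000)
batch_b = generate_synthetic_logs(1000)
```

Two calls with the default seed produce identical batches; a different seed produces a different (but equally reproducible) workload.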
@@ -204,8 +234,6 @@ result = process_stream(

 ### Auto-Compiler & Cache System

-### Auto-Compiler & Cache System
-
 Symparse dynamically builds ReDoS-resistant extraction pipelines on the fly by generating sandboxed Python `dict`-builder functions surrounding `re2` matches. The output acts identically to strict LLM object extraction without needing `json.loads()`.
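
The "sandboxed Python dict-builder" mechanism can be sketched roughly as below. This is an illustrative reconstruction, not symparse's actual loader: the stdlib `re` module stands in for `google-re2` (which exposes a compatible API), and the script source is hard-coded where symparse would receive it from the LLM compiler.

```python
import re  # stand-in for google-re2, which exposes a re-compatible API

# A compiled "Fast Path" script of the shape described above: a dict-builder
# wrapping a regex match. In symparse this source would be LLM-generated.
SCRIPT = '''
def extract(line):
    m = PATTERN.match(line)
    if m is None:
        return None
    return {"ip": m.group(1), "status": int(m.group(2))}
'''


def load_extractor(script_src, pattern):
    # Restricted exec(): no builtins beyond what the script explicitly needs,
    # and only the pre-compiled pattern is injected into its globals.
    env = {"__builtins__": {"int": int}, "PATTERN": re.compile(pattern)}
    exec(script_src, env)
    return env["extract"]


extract = load_extractor(SCRIPT, r"(\d+\.\d+\.\d+\.\d+) .* (\d{3})$")
row = extract('203.0.113.7 - - "GET /" 200')
```

Because the cached script returns a plain `dict`, the hot loop never round-trips through `json.loads()`.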
@@ -227,9 +255,11 @@ Symparse is released under the MIT Open Source License. See the [LICENSE](LICENS

 ### ⚠️ Known Limitations & Risks

 * **Log Context Boundaries**: `symparse` assumes the input stream consists of discrete log records partitioned by line breaks (default for commands like `tail` or `grep`). Feeding dense prose paragraphs over stdin with multiple distinct extraction candidates per line may cause extraction overwrites.
-* **Complex Data Transformations**: The compiler engine currently constructs regular expressions via safe sandboxed Python code. It is highly efficient for pattern destructuring, but cannot execute deep logical transformations (e.g., date-time conversions, mathematical sums) during the Fast Path stage. Use downstream piped tools like `jq` for manipulation.
-* **Nondeterminism**: The underlying LLM compiler may occasionally produce slightly different regular expression structures for identical schemas on cold starts. Symparse relies on rigorous JSON Schema gating and self-healing purges to guarantee that even jittery compilations are 100% compliant before entering the fast path.
-* **Stdin Injection Security**: The text piped directly to `sys.stdin` on a cold miss is embedded within the backend AI prompt structure. While the rigid `response_format` JSON wrapper prevents code execution or systemic prompt escape, the backend model is technically exposed to arbitrary parsed string manipulation if ingesting adversarial logs.
+* **Complex Data Transformations**: The compiler engine constructs sandboxed Python scripts wrapping `re2` regex extractions (executed via restricted `exec()` with limited `__builtins__`). It is highly efficient for pattern destructuring, but cannot execute deep logical transformations (e.g., date-time conversions, mathematical sums) during the Fast Path stage. Use downstream piped tools like `jq` for manipulation.
+* **Nondeterminism**: The underlying LLM compiler may occasionally produce slightly different regex structures for identical schemas on cold starts. However, once a script enters the Fast Path cache, execution is fully deterministic. Symparse relies on rigorous JSON Schema gating and self-healing cache purges to guarantee that even jittery compilations are 100% schema-compliant before caching. To minimize cold-start variance, use `temperature=0.0` (default) and a consistent `--model`.
+* **Stdin Injection Security**: On a cache miss (AI Path), the raw text piped to `sys.stdin` is embedded within the LLM prompt. The rigid `response_format` JSON Schema wrapper constrains the model's output structure, which prevents arbitrary output escape. However, adversarial log lines could theoretically manipulate the model's extraction behavior. **Mitigations**: (1) Use `--compile` to cache scripts and minimize AI Path exposure; (2) Pre-filter untrusted input with `grep` or `sed` before piping; (3) In high-security environments, run exclusively on the Fast Path after an initial trusted compilation pass.
+* **Windows Compatibility**: The caching subsystem uses `portalocker` for cross-platform file locking. Windows is supported in principle but has not been extensively tested in production. Full Windows CI coverage is planned for v0.3.
+* **`[embed]` Extra Size**: The `sentence-transformers` + `torch` dependency chain can pull up to 2.5 GB of CUDA libraries. On minimal servers, install the CPU-only torch wheel first: `pip install torch --index-url https://download.pytorch.org/whl/cpu && pip install symparse[embed]`.