Note on Stability: Symparse uses litellm to provide seamless multi-model routing to Ollama, OpenAI, vLLM, and Anthropic via the `--model` flag. All heavyweight dependencies, notably `torch` and `google-re2`, are pinned to exact versions to prevent ecosystem drift.
Symparse is a self-optimizing Unix pipeline tool that routes data between an AI Path (using local LLMs via litellm) and a Fast Path (using cached, sandboxed re2-based Python extraction scripts) with a strict neurosymbolic JSON validation gate.
You get the magical unstructured-data extraction of Large Language Models, combined with the raw performance and ReDoS safety of sandboxed, re2-wrapping Python scripts on 95% of subsequent matched traffic.
Requires Python 3.10+
Install Symparse from PyPI:
```bash
pip install symparse
```

Or with the optional but recommended Tier-2 local vector embeddings (uses `sentence-transformers` for vastly superior cache hit rates against polymorphic logs):

```bash
pip install symparse[embed]
```

> **Warning**
> The `[embed]` extra installs PyTorch. Depending on your environment, pip may resolve a massive ~2.5 GB CUDA payload. If you are installing on a minimal log server, install the CPU-only torch wheel first, then run `pip install symparse[embed]` to keep the footprint lightweight.
Or from source:
```bash
git clone https://github.com/Aftermath-Technologies-Ltd/symparse.git
cd symparse
pip install -e .
```

Symparse is built for Unix pipes. You stream standard text into stdin and provide a JSON Schema describing your desired format. Symparse enforces that structure and outputs valid JSON objects to stdout.
Parse a messy raw text log into a clean standard JSON string:
```bash
# First, create a schema describing the structure you want:
cat > login_schema.json << 'EOF'
{
  "type": "object",
  "properties": {
    "email": { "type": "string" },
    "ip_address": { "type": "string" }
  },
  "required": ["email", "ip_address"]
}
EOF

# Then pipe unstructured text through symparse:
echo "User alice@example.com logged in from 192.168.1.50 at 10:45 AM" | \
  symparse run --schema login_schema.json --compile
```

Output (stdout):
```json
{
  "email": "alice@example.com",
  "ip_address": "192.168.1.50"
}
```

Because we passed `--compile`, Symparse uses the LLM (AI Path) on this first pass to deduce a standard re2 regex for the schema. The next time a log matching this prototype arrives, Symparse hits the Fast Path cache and bypasses the LLM entirely.
The following results were captured from a live run on a CPU-only 22-core Linux workstation using ollama/gemma3:4b:
| Test | Path | Latency | Speedup |
|---|---|---|---|
| Login (flat schema) — 1st run | AI Path | 29,071ms | — |
| Login (flat schema) — 2nd run | Fast Path | 1.15ms | 25,279x |
| Nginx (nested schema) — 1st run | AI Path | 58,515ms | — |
| Nginx (nested schema) — 2nd run | Fast Path | 2.98ms | 19,635x |
| Multi-line streaming — 2 lines | Fast Path | 1.15ms avg | both cached |
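The Speedup column is simply the AI Path latency divided by the Fast Path latency:

```python
def speedup(ai_path_ms, fast_path_ms):
    """How many times faster the cached script is than the LLM round-trip."""
    return int(ai_path_ms / fast_path_ms)

speedup(29071, 1.15)  # -> 25279 (login schema)
speedup(58515, 2.98)  # -> 19635 (nginx schema)
```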
The compiler self-tests every generated script against the archetype data before caching. If the LLM produces a broken regex, symparse falls back to a deterministic template compiler that derives patterns directly from the extracted values.
Symparse is built to handle imperfect LLM generation on the fly:

- **Log 1 (Cold Start):** The LLM extracts correctly. The compiler self-tests the generated re2 script against the archetype data and caches it only if it reproduces the expected output.
- **Log 2 (Fast Path):** The cached script runs in ~1-3ms. If the regex output fails schema validation, Symparse purges the flawed script, falls back to the AI Path, and returns the correct result — all without crashing the active Unix pipeline.
- **Self-Healing:** On a cache miss or broken script, the pipeline seamlessly degrades to the LLM and re-compiles a fresh script for next time.
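The self-healing loop above can be sketched in a few lines. Every name here (`dispatch`, `validate`, the cache dict, the `compile_script` and `ai_extract` callables) is a hypothetical stand-in for illustration, not Symparse's internal API:

```python
def validate(result, schema):
    """Minimal schema gate: result exists and carries every required key."""
    return result is not None and all(k in result for k in schema.get("required", []))

def dispatch(line, schema, cache, compile_script, ai_extract):
    """Route one log line: healthy cached script -> Fast Path, else AI Path."""
    key = repr(sorted(schema.get("required", [])))  # stand-in for the schema signature
    script = cache.get(key)
    if script is not None:
        result = script(line)
        if validate(result, schema):
            return result, "fast"            # Fast Path hit (~1-3ms)
        cache.pop(key, None)                 # purge the flawed script
    result = ai_extract(line)                # gracefully degrade to the LLM
    cache[key] = compile_script(result)      # re-compile a fresh script for next time
    return result, "ai"
```

Seeding the cache with a deliberately broken script and calling `dispatch` twice shows the purge-fallback-recompile cycle: the first call answers via the AI Path, the second via the repopulated Fast Path.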
Symparse excels at processing live unstructured data feeds. For example, attaching it directly to an Apache or Nginx log tail stream:
```bash
tail -f /var/log/nginx/access.log | symparse run --schema access_schema.json --compile >> parsed_logs.jsonl
```

- `symparse run --schema <file>` — Run the extraction pipeline (required)
- `--compile` — Cache a fast-path script on success
- `--model <name>` — Override AI backend (e.g. `ollama/gemma3:1b`, `openai/gpt-4o`)
- `--embed` — Use local embeddings for tier-2 cache matching
- `--sanitize` — Strip control characters from stdin before the AI Path
- `--max-tokens N` — Cap tokens per LLM request (default: 4000)
- `--confidence N` — Token logprob threshold (default: -2.0)
- `--force-ai` — Bypass cache and force AI execution
- `--stats` — Print performance stats when finished
- `--log-level` — Set verbosity (`DEBUG`, `INFO`, `WARNING`, `ERROR`)
- `symparse cache list` / `symparse cache clear` — Manage the local compilation cache
Full --help output
Global flags (before any subcommand):
```
usage: symparse [-h] [-v] [--version] [--log-level {DEBUG,INFO,WARNING,ERROR}]
                {run,cache} ...

Symparse: LLM to Fast-Path Regex Compiler pipeline

positional arguments:
  {run,cache}
    run                 Run the pipeline parser
    cache               Manage the local cache

options:
  -h, --help            show this help message and exit
  -v, --verbose         Enable debug logging
  --version             show program's version number and exit
  --log-level {DEBUG,INFO,WARNING,ERROR}
                        Set logging verbosity (default: ERROR, or DEBUG with -v)
```
`symparse run`:

```
usage: symparse run [-h] [--stats] --schema SCHEMA [--compile] [--force-ai]
                    [--confidence CONFIDENCE] [--model MODEL] [--embed]
                    [--sanitize] [--max-tokens MAX_TOKENS]

options:
  -h, --help            show this help message and exit
  --stats               Print performance cache stats when finished
  --schema SCHEMA       Path to JSON schema file
  --compile             Compile a fast-path script on success
  --force-ai            Bypass local cache and force AI execution
  --confidence CONFIDENCE
                        Token logprob threshold (default: -2.0)
  --model MODEL         Override AI backend model (e.g. ollama/gemma3:1b, openai/gpt-4o)
  --embed               Use local embeddings for tier-2 caching (requires sentence-transformers)
  --sanitize            Strip control characters from stdin before AI Path
  --max-tokens MAX_TOKENS
                        Max tokens per LLM request (default: 4000)
```
`symparse cache`:

```
usage: symparse cache [-h] {list,clear} ...

positional arguments:
  {list,clear}
    list                Display cached extraction scripts
    clear               Wipe the local compilation directory
```
To completely eliminate the "Does it work on my machine?" factor and remove the need to configure a local Ollama daemon, you can run the pre-packaged Symparse container.
The standard image comes bundled with Ollama and the gemma3:1b model pre-downloaded.

```bash
docker pull aftermath/symparse:latest
docker run -i --rm aftermath/symparse run --schema my_schema.json < logs.txt
```

Symparse ships with a terminal typing simulation for recording marketing GIFs or live demos (e.g. via Asciinema or QuickTime).
```bash
# Installed as an entry point via pip install symparse[demo]
symparse-demo
```

> **Note**
> The demo requires the `[demo]` extra (`pip install symparse[demo]`), which installs asciinema. The demo runs entirely offline with no LLM required — it simulates a cold-start plus warm-start pipeline using pre-baked output.
How fast is the exact same Unix pipe once the AI successfully compiles the cache?
We ran `symparse run --stats` repeatedly over batches of 1,000 dense synthetic lines through the warmed Fast Path.
| Schema Type | Description | Avg Wall Time (1000 lines) | Throughput |
|---|---|---|---|
| Apache Basic | Flat regex string matching | 424.13ms ± 9.91ms | ~2,357 ops/sec |
| Nginx (raw .log file) | Nested dict-building (IP, Request obj, HTTP) | 1750.62ms ± 75.44ms | ~571 ops/sec |
| Kubernetes Audit | Deeply structured event parsing | 1837.01ms ± 97.43ms | ~544 ops/sec |
| Invoices | Heavy multiline text slicing/casting | 1794.73ms ± 70.50ms | ~557 ops/sec |
| JSONL Polishing | Plucking sparse target keys from giant files | 1830.29ms ± 52.56ms | ~546 ops/sec |
> **Note**
> **Reproducibility & Methodology:** All synthetic benchmarks use `random.seed(42)` for deterministic log generation (see `benchmarks/run_examples.py`). A real-world Nginx access log sample is provided at `examples/sample_nginx.log` (100 lines from a production-like workload) for independent verification. Throughput varies with schema nesting depth, multiline regex complexity, and CPU hardware. The 2,300+ ops/sec figure assumes flat data on standard server hardware; deeply nested JSON builders trending toward 500-800 ops/sec still represent a massive leap over 1,000 synchronous LLM calls.
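A deterministic generator in the spirit of `benchmarks/run_examples.py` (the real script may differ; this sketch only shows how a fixed seed makes benchmark input reproducible):

```python
import random

def make_lines(n, seed=42):
    """Deterministically generate n synthetic login lines from a fixed seed."""
    rng = random.Random(seed)  # isolated RNG; equivalent in effect to random.seed(42)
    lines = []
    for i in range(n):
        ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
        lines.append(f"User user{i}@example.com logged in from {ip}")
    return lines

# Identical seeds yield byte-identical benchmark input across runs.
assert make_lines(1000) == make_lines(1000)
```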
See the examples/ directory for the raw configurations.
Symparse creates deterministic sandbox scripts under `$HOME` or a local `.symparse_cache` folder. The cache directory is created with `0o700` permissions for security. You can manage the cache out of the box:
```bash
symparse cache list    # List all cached schema signatures and their compiled re2 regexes
symparse cache clear   # Purge the local compilation directory
```

Symparse exposes a reliable internal Python API for direct application integrations.
Enforce strict schema properties manually via validation logic:
```python
from symparse.validator import enforce_schema, SchemaViolationError

schema = {"type": "object", "properties": {"status": {"type": "string"}}, "required": ["status"]}
data = {"status": "success"}

# Fast. Returns True or raises SchemaViolationError.
enforce_schema(data, schema)
```

Run the neurosymbolic engine programmatically with graceful degradation.
```python
from symparse.engine import process_stream, GracefulDegradationMode

# Returns a dict or raises EngineFailure.
result = process_stream(
    "unstructured data chunk from memory",
    schema,
    compile=True,
    max_retries=3,
    degradation_mode=GracefulDegradationMode.PASSTHROUGH,
)
```

Symparse builds ReDoS-resistant extraction pipelines on the fly by generating sandboxed Python dict-builder functions around re2 matches. The output is identical to strict LLM object extraction, with no `json.loads()` step required.
```python
from symparse.compiler import generate_script, execute_script
from symparse.cache_manager import CacheManager

manager = CacheManager()
manager.save_script(schema, "example text", "def extract(txt): ...")  # Applies an IPC cross-platform lock
manager.fetch_script(schema, "example text")  # Uses hybrid two-tier collision detection
manager.clear_cache()
```

Pull requests are actively welcomed! Please read the test architecture under `tests/` to run integration checks (`test_engine.py` patterns). Check out `CHANGELOG.md` to catch up on the latest architecture shifts.
Symparse is released under the MIT License. See the LICENSE file for details.
- **Log Context Boundaries:** `symparse` assumes the input stream consists of discrete log records partitioned by line breaks (the default for commands like `tail` or `grep`). Feeding dense prose paragraphs over stdin with multiple distinct extraction candidates per line may cause extraction overwrites.
- **Complex Data Transformations:** The compiler engine constructs sandboxed Python scripts wrapping `re2` regex extractions, executed via a restricted `exec()` with a minimal `__builtins__` sandbox and no filesystem or network access. It is highly efficient for pattern destructuring, but cannot execute deep logical transformations (e.g. date-time conversions, mathematical sums) during the Fast Path stage. Use downstream piped tools like `jq` for manipulation.
- **Nondeterminism:** The underlying LLM compiler may occasionally produce slightly different regex structures for identical schemas on cold starts. Once a script enters the Fast Path cache, however, execution is fully deterministic. Symparse relies on rigorous JSON Schema gating and self-healing cache purges to guarantee that even jittery compilations are 100% schema-compliant before caching. To minimize cold-start variance, use `temperature=0.0` (the default) and a consistent `--model`.
- **Stdin Injection Security:** On a cache miss (AI Path), the raw text piped to `sys.stdin` is embedded within the LLM prompt. The rigid `response_format` JSON Schema wrapper constrains the model's output structure, which prevents arbitrary output escape; adversarial log lines could still theoretically manipulate the model's extraction behavior. Mitigations: (1) use `--sanitize` to strip control characters before the AI Path; (2) use `--compile` to cache scripts and minimize AI Path exposure; (3) pre-filter untrusted input with `grep` or `sed` before piping; (4) in high-security environments, run exclusively on the Fast Path after an initial trusted compilation pass.
- **AI Path Rate Limiting:** In a broken-cache scenario with `tail -f`, rapid AI Path fallbacks could DDoS your LLM endpoint or rack up API bills. Symparse enforces a `--max-tokens 4000` guard per request (configurable via the CLI) to cap token spend. For additional protection, use `--compile` to ensure the Fast Path is populated early.
- **Windows Compatibility:** The caching subsystem uses `portalocker` for cross-platform file locking; Windows is fully supported (tested on Windows 11). Full Windows CI coverage is planned for v0.3.
- **`[embed]` Extra Size:** The `sentence-transformers` + `torch` dependency chain can pull up to 2.5 GB of CUDA libraries. On minimal servers, install the CPU-only torch wheel first:

  ```bash
  pip install torch --index-url https://download.pytorch.org/whl/cpu
  pip install symparse[embed]
  ```
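Since the Fast Path destructures but never transforms values, derived fields belong in a downstream step. Below is a tiny Python stand-in for such a `jq`-style post-processor; the `time` field and helper name are illustrative, not something Symparse emits:

```python
import json

def add_minutes(json_line):
    """Derive a numeric field from an extracted HH:MM string, post-extraction."""
    rec = json.loads(json_line)
    hh, mm = rec["time"].split(":")
    rec["minutes_since_midnight"] = int(hh) * 60 + int(mm)
    return json.dumps(rec)

add_minutes('{"email": "alice@example.com", "time": "10:45"}')
```

In a real pipeline this role is played by `jq` or an equivalent tool reading Symparse's JSON output line by line.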
Maintained by Aftermath Technologies Ltd.