Vestigo is a collection of tools, scripts and services to automate the process of (1) producing cross-compiled test binaries, (2) statically and dynamically analyzing firmware/binaries, (3) extracting ML-ready features, and (4) producing datasets and inference results for cryptographic-function detection. The repo combines headless Ghidra-based extraction, Qiling-based dynamic tracing, a dataset generation pipeline (including optional LLM assisted labeling), and a small backend + frontend for web access.
This README gives a concise, practical overview and quickstart so you can get the pipeline running and contribute.
- Produce reproducible binary datasets (many architectures, compilers, optimizations)
- Extract function- and trace-level features suitable for ML
- Provide utilities for static (Ghidra) and dynamic (Qiling) analysis
- Offer scripts to build training CSVs and run inference
- Provide a backend API and frontend for file upload and analysis orchestration
- Languages: Python (main tooling & backend), some shell, TypeScript/React frontend
- Major folders:
ghidra_scripts,qiling_analysis,ml,backend,frontend,factory - Important entry points:
generate_dataset.py— create ML CSVs from Ghidra JSONs (optionally uses OpenAI)analyzer.py,bare_metal.py,main.py— orchestrate analysis flowsfactory/builder.py— cross-compile sources across arch/opt matrixqiling_analysis/— dynamic tracing & batch extraction pipelinebackend/— FastAPI backend with analysis endpoints
- Python 3.9+ (3.11 recommended)
- pip and virtualenv
- Ghidra (for static feature extraction using headless analyzer)
- Qiling (optional, for dynamic tracing features)
- Cross-compilers and QEMU for emulation (used by
factoryandqiling_analysis)
See setup.sh for an automated environment setup script and path hints.
-
Create and activate a virtualenv and install Python deps:
python3 -m venv .venv source .venv/bin/activate pip install -r requirements.txt -
(Optional) Install/point to Ghidra. Set env var
GHIDRA_INSTALL_DIRif not in/opt/ghidra. -
If you want to generate the ML dataset from existing Ghidra JSONs:
export OPENAI_API_KEY="sk-..." # optional; generate_dataset can use LLMs to help labeling python3 generate_dataset.py --input-dir ghidra_output --output dataset_output.csv --limit 10
-
To run the dynamic trace batch extractor (Qiling pipeline) on a small sample:
python3 qiling_analysis/batch_extract_features.py \ --dataset-dir ./dataset_binaries --output-dir ./batch_results --limit 5 --parallel 2 -
To run the backend API locally (development):
cd backend pip install -r requirements.txt # run with uvicorn uvicorn main:app --reload --host 127.0.0.1 --port 8000
-
Frontend:
frontend/contains a React app — see its own package.json and scripts.
- Cross-compile many algorithm sources (factory):
python3 factory/builder.py --source <file.c>(see options in the script)
- Run static Ghidra analysis for a binary:
python3 analyzer.py <binary>(scripts callghidra_headlessinternally)
- Generate ML dataset from Ghidra JSONs:
python3 generate_dataset.py --input-dir ghidra_output --output dataset.csv
- Run the Qiling-based full pipeline (traces → windowed features → inference):
- see
qiling_analysis/FULL_PIPELINE_README.mdandqiling_analysis/QUICKSTART_GUIDE.md
- see
factory/— tools to cross-compile C source set across architectures and optionsghidra_scripts/— Ghidra helper scripts used by headless analysisqiling_analysis/— dynamic tracing and batch extraction pipelineml/— dataset processing, labels, model helpers and evaluation scriptsbackend/— FastAPI backend and supporting servicesfrontend/— React UI for uploads and results (if present)dataset_binaries/,test_dataset_binaries/— sample compiled binariesghidra_json/,ghidra_output/— expected outputs from Ghidra headless runsfeatures.csv,features.txt— canonical feature columns used by dataset scripts
- Some scripts expect environment configuration (Ghidra path, OpenAI API key, rootfs for Qiling).
- Not all components are required to run every workflow; you can use only the static path (Ghidra → generate_dataset) or the dynamic path (Qiling tracing) independently.
- Several scripts are designed to be run inside CI or Docker with specific mounts; review
setup.shandContainerfilefor reproducible environments.
Please see CONTRIBUTING.md for the contribution process, coding style and testing guidelines.
This repository is licensed under the MIT License — see LICENSE.
qiling_analysis/QUICKSTART_GUIDE.md— dynamic tracing quick startREADME_DATASET.md— dataset generation details and column descriptionsIMPLEMENTATION_SUMMARY.md— helpers for converting JSON → CSV and matching columns
If anything is missing or you want a feature explained or automated, open an issue or follow the contribution guide.
- Cross-Compilation Matrix: Automatically builds C code for multiple architectures (x86_64, ARM, MIPS, RISC-V, AVR, Z80) and optimization levels using Docker.
- Headless Ghidra Analysis: Automates Ghidra to analyze binaries and export P-Code/instruction data to JSON.
- Feature Extraction: Extracts cryptographic indicators (S-Boxes, constants) and structural features (entropy, instruction histograms) from binaries and analysis results.
- Docker: Required for the cross-compilation environment.
- Python 3.x: For running the orchestration scripts.
- Ghidra: Required for
analyzer.pyto perform binary analysis.
-
Clone the repository:
git clone <repository-url> cd cross-compiler
-
Build the Docker image (required for
builder.py):python3 builder.py --source aes_128.c --build-image
Note: You only need to run with
--build-imageonce.
Compiles a source C file across all defined architectures and optimization levels.
python3 builder.py --source <source_file.c> [--output <output_dir>]Example:
python3 builder.py --source aes_128.c --output binThis will generate ELF/IHX binaries in the bin/ directory for x86_64, ARM, MIPS, RISC-V, AVR, and Z80.
Uses Ghidra's analyzeHeadless to process all binaries in the bin/ directory and export intermediate data.
Configuration:
Update the GHIDRA_HOME variable in analyzer.py to point to your Ghidra installation.
python3 analyzer.pyThis produces ghidra_output.json (or individual JSONs depending on script configuration) containing function and instruction data.
Extracts ML-ready features from a binary and its corresponding Ghidra analysis JSON.
Final feature vector for Machine Learning or rule-based crypto detection.
python3 extract_features.pyNote: Currently configured to run on a specific example in __main__. Modify the script to iterate over your dataset as needed.
Automates the emulation and instrumentation of firmware binaries to extract runtime secrets and detect security vulnerabilities.
Components:
emulator.py: Runs binaries using QEMU User Mode.instrumentation.py: Injects Frida hooks to capture crypto keys.log_monitor.py: Scans logs for Secure Boot failures and other leaks.
Usage:
python3 dynamic_analysis/dynamic_main.py <binary_path> <arch> <sysroot_path>Example:
python3 dynamic_analysis/dynamic_main.py ./busybox arm /tmp/extracted_fsThe builder.py script supports the following architectures via the provided Dockerfile:
- x86_64 (GCC, Clang)
- ARM (arm-linux-gnueabihf)
- MIPS (mips-linux-gnu)
- RISC-V (riscv64-linux-gnu)
- AVR (avr-gcc)
- Z80 (sdcc)
builder.py: Orchestrates the Docker-based cross-compilation.Dockerfile: Defines the build environment with all cross-compilers.analyzer.py: Wrapper for Ghidra headless analysis.ghidra_script.py: The Ghidra Python script executed byanalyzeHeadless.extract_features.py: Extracts static and structural features from binaries.aes_*.c: Example cryptographic source files.
- Python 3
- Podman (Docker can be used with script modification, but Podman is default)
- Binwalk (
sudo apt install binwalk) - System Tools:
objcopy(usually part of binutils)
-
Build the image:
podman build -t sasquatch_tool .
Run the Python script on your firmware file. The script handles conversion, Binwalk extraction, container mounting, and crypto scanning automatically.
python3 unpacker.py <firmware_filename>