Vestigo — Firmware analysis & crypto-detection pipeline

Vestigo is a collection of tools, scripts, and services that automates (1) building cross-compiled test binaries, (2) analyzing firmware and binaries statically and dynamically, (3) extracting ML-ready features, and (4) producing datasets and inference results for cryptographic-function detection. The repo combines headless Ghidra-based extraction, Qiling-based dynamic tracing, a dataset-generation pipeline (with optional LLM-assisted labeling), and a small backend and frontend for web access.

This README gives a concise, practical overview and quickstart so you can get the pipeline running and contribute.

Key project goals

Produce reproducible binary datasets (many architectures, compilers, optimizations)
Extract function- and trace-level features suitable for ML
Provide utilities for static (Ghidra) and dynamic (Qiling) analysis
Offer scripts to build training CSVs and run inference
Provide a backend API and frontend for file upload and analysis orchestration

Quick facts / highlights

Languages: Python (main tooling & backend), some shell, TypeScript/React frontend
Major folders: ghidra_scripts, qiling_analysis, ml, backend, frontend, factory
Important entry points:
  - generate_dataset.py — create ML CSVs from Ghidra JSONs (optionally uses OpenAI)
  - analyzer.py, bare_metal.py, main.py — orchestrate analysis flows
  - factory/builder.py — cross-compile sources across arch/opt matrix
  - qiling_analysis/ — dynamic tracing & batch extraction pipeline
  - backend/ — FastAPI backend with analysis endpoints

Minimum prerequisites

Python 3.9+ (3.11 recommended)
pip and virtualenv
Ghidra (for static feature extraction using headless analyzer)
Qiling (optional, for dynamic tracing features)
Cross-compilers and QEMU for emulation (used by factory and qiling_analysis)

See setup.sh for an automated environment setup script and path hints.

Quickstart (recommended test flow)

Create and activate a virtualenv and install Python deps:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

(Optional) Install Ghidra, or point the tooling at an existing install: set the GHIDRA_INSTALL_DIR environment variable if Ghidra is not in /opt/ghidra.

If you want to generate the ML dataset from existing Ghidra JSONs:

export OPENAI_API_KEY="sk-..."    # optional; generate_dataset can use LLMs to help labeling
python3 generate_dataset.py --input-dir ghidra_output --output dataset_output.csv --limit 10
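
As a rough illustration of the JSON-to-CSV step, the sketch below flattens per-function records into CSV rows. The JSON layout (a top-level `functions` list whose entries carry a `name` and numeric fields) and the column names are assumptions made for this example, not the schema generate_dataset.py actually uses; README_DATASET.md describes the real columns.

```python
import csv
import io
import json

def functions_to_rows(ghidra_json, binary_name):
    """Flatten an assumed {"functions": [...]} document into flat dict rows."""
    rows = []
    for func in ghidra_json.get("functions", []):
        row = {"binary": binary_name, "function": func.get("name", "")}
        # Keep only scalar numeric fields; nested data needs its own handling.
        for key, value in func.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                row[key] = value
        rows.append(row)
    return rows

def write_csv(rows, handle):
    """Write rows using the union of all keys as header; missing cells become 0."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    writer = csv.DictWriter(handle, fieldnames=columns, restval=0)
    writer.writeheader()
    writer.writerows(rows)
    return columns

# Hypothetical sample input, standing in for one Ghidra JSON export.
sample = json.loads(
    '{"functions": [{"name": "aes_encrypt", "xor_count": 42, "size": 512},'
    ' {"name": "main", "size": 64}]}'
)
rows = functions_to_rows(sample, "aes_x86_O3")
buffer = io.StringIO()
columns = write_csv(rows, buffer)
print(buffer.getvalue())
```

Filling missing cells with 0 keeps every row the same width, which most ML loaders expect; whether 0 is the right default for a given feature is a judgment call.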

To run the dynamic trace batch extractor (Qiling pipeline) on a small sample:

python3 qiling_analysis/batch_extract_features.py \
    --dataset-dir ./dataset_binaries --output-dir ./batch_results --limit 5 --parallel 2
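
The `--limit` and `--parallel` options suggest a "take the first N binaries, process them with K workers" pattern. A minimal sketch of that pattern follows; `extract_one` is a hypothetical stand-in for the real per-binary tracing, and the actual batch script may use processes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def select_binaries(dataset_dir, limit=None):
    """Return a sorted, optionally truncated list of regular files."""
    files = sorted(p for p in Path(dataset_dir).iterdir() if p.is_file())
    return files if limit is None else files[:limit]

def extract_one(path):
    """Hypothetical placeholder for per-binary tracing; records only the size."""
    return {"binary": path.name, "size": path.stat().st_size}

def run_batch(dataset_dir, limit=None, parallel=2):
    """Process up to `limit` binaries using `parallel` worker threads."""
    binaries = select_binaries(dataset_dir, limit)
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # map() preserves input order, so results line up with `binaries`.
        return list(pool.map(extract_one, binaries))
```

Sorting before truncating makes a small `--limit` run deterministic, which helps when comparing results across machines.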

To run the backend API locally (development):

cd backend
pip install -r requirements.txt
# run with uvicorn
uvicorn main:app --reload --host 127.0.0.1 --port 8000

Frontend: frontend/ contains a React app — see its own package.json and scripts.

Common workflows and commands

Cross-compile many algorithm sources (factory):
- python3 factory/builder.py --source <file.c> (see options in the script)
Run static Ghidra analysis for a binary:
- python3 analyzer.py <binary> (scripts call ghidra_headless internally)
Generate ML dataset from Ghidra JSONs:
- python3 generate_dataset.py --input-dir ghidra_output --output dataset.csv
Run the Qiling-based full pipeline (traces → windowed features → inference):
- see qiling_analysis/FULL_PIPELINE_README.md and qiling_analysis/QUICKSTART_GUIDE.md
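
To give a feel for the "traces → windowed features" stage, here is a small sketch that slides a fixed-size window over an instruction-mnemonic trace and computes simple per-window statistics. The window size, stride, and feature names are illustrative assumptions; the guides referenced above define what the real pipeline computes.

```python
from collections import Counter

def window_features(trace, window=4, stride=2):
    """Slide a window over a list of mnemonics and summarise each window.

    A trace shorter than `window` yields a single partial window.
    """
    if not trace:
        return []
    features = []
    for start in range(0, max(len(trace) - window + 1, 1), stride):
        chunk = trace[start:start + window]
        counts = Counter(chunk)
        features.append({
            "start": start,
            "length": len(chunk),
            "xor_ratio": counts["xor"] / len(chunk),
            "distinct_opcodes": len(counts),
        })
    return features

# Hypothetical trace; a real one would come from the emulator's hooks.
trace = ["mov", "xor", "xor", "rol", "add", "xor", "mov", "ret"]
for row in window_features(trace):
    print(row)
```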

Repo layout (high level)

factory/ — tools to cross-compile a set of C sources across architectures and options
ghidra_scripts/ — Ghidra helper scripts used by headless analysis
qiling_analysis/ — dynamic tracing and batch extraction pipeline
ml/ — dataset processing, labels, model helpers and evaluation scripts
backend/ — FastAPI backend and supporting services
frontend/ — React UI for uploads and results (if present)
dataset_binaries/, test_dataset_binaries/ — sample compiled binaries
ghidra_json/, ghidra_output/ — expected outputs from Ghidra headless runs
features.csv, features.txt — canonical feature columns used by dataset scripts

Notes, assumptions and safety

Some scripts expect environment configuration (Ghidra path, OpenAI API key, rootfs for Qiling).
Not all components are required to run every workflow; you can use only the static path (Ghidra → generate_dataset) or the dynamic path (Qiling tracing) independently.
Several scripts are designed to be run inside CI or Docker with specific mounts; review setup.sh and Containerfile for reproducible environments.

Contributing

Please see CONTRIBUTING.md for the contribution process, coding style and testing guidelines.

License

This repository is licensed under the MIT License — see LICENSE.

Where to go next

qiling_analysis/QUICKSTART_GUIDE.md — dynamic tracing quick start
README_DATASET.md — dataset generation details and column descriptions
IMPLEMENTATION_SUMMARY.md — helpers for converting JSON → CSV and matching columns

If anything is missing or you want a feature explained or automated, open an issue or follow the contribution guide.

Features

Cross-Compilation Matrix: Automatically builds C code for multiple architectures (x86_64, ARM, MIPS, RISC-V, AVR, Z80) and optimization levels using Docker.
Headless Ghidra Analysis: Automates Ghidra to analyze binaries and export P-Code/instruction data to JSON.
Feature Extraction: Extracts cryptographic indicators (S-Boxes, constants) and structural features (entropy, instruction histograms) from binaries and analysis results.
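
Two of the indicators named above, byte entropy and S-box constants, are easy to sketch. The example below computes Shannon entropy over a byte string and searches for the first 16 bytes of the AES forward S-box (from FIPS 197). It is an illustration of the idea, not the code in extract_features.py, and the function names are made up for this example.

```python
import math
from collections import Counter

# First 16 bytes of the AES forward S-box (FIPS 197).
AES_SBOX_PREFIX = bytes.fromhex("637c777bf26b6fc53001672bfed7ab76")

def shannon_entropy(data):
    """Entropy in bits per byte: 0.0 for empty input, at most 8.0."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())

def find_aes_sbox(data):
    """Return the offset of the S-box prefix in `data`, or -1 if absent."""
    return data.find(AES_SBOX_PREFIX)
```

A plain byte search only finds table-based AES implementations that store the S-box contiguously; bitsliced or hardware-accelerated code would need other indicators.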

Prerequisites

Docker: Required for the cross-compilation environment.
Python 3.x: For running the orchestration scripts.
Ghidra: Required for analyzer.py to perform binary analysis.

Installation

Clone the repository:

git clone <repository-url>
cd cross-compiler

Build the Docker image (required for builder.py):
```
python3 builder.py --source aes_128.c --build-image
```
Note: You only need to run with --build-image once.

Usage

1. Cross-Compilation (`builder.py`)

Compiles a C source file across all defined architectures and optimization levels.

python3 builder.py --source <source_file.c> [--output <output_dir>]

Example:

python3 builder.py --source aes_128.c --output bin

This will generate ELF/IHX binaries in the bin/ directory for x86_64, ARM, MIPS, RISC-V, AVR, and Z80.
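
Conceptually, the build matrix is the Cartesian product of target compilers and optimization levels. The sketch below enumerates such a matrix as command lists without running anything. The compiler executables are guessed from the toolchain prefixes listed under Supported Architectures, the set of optimization levels and the output naming scheme are assumptions, and AVR and Z80 are left out because avr-gcc and sdcc take different arguments. Check builder.py and the Dockerfile for the real values.

```python
from itertools import product

# Assumed compiler executables; verify against the Dockerfile.
COMPILERS = {
    "x86_64": "gcc",
    "arm": "arm-linux-gnueabihf-gcc",
    "mips": "mips-linux-gnu-gcc",
    "riscv": "riscv64-linux-gnu-gcc",
}
# Assumed optimization levels.
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]

def build_matrix(source, output_dir):
    """Return one compiler command (as an argument list) per arch/opt pair."""
    stem = source.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    jobs = []
    for (arch, compiler), opt in product(COMPILERS.items(), OPT_LEVELS):
        target = f"{output_dir}/{stem}_{arch}_{opt.lstrip('-')}.elf"
        jobs.append([compiler, opt, source, "-o", target])
    return jobs

for job in build_matrix("aes_128.c", "bin")[:3]:
    print(" ".join(job))
```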

2. Binary Analysis (`analyzer.py`)

Uses Ghidra's analyzeHeadless to process all binaries in the bin/ directory and export intermediate data.

Configuration: Update the GHIDRA_HOME variable in analyzer.py to point to your Ghidra installation.

python3 analyzer.py

This produces ghidra_output.json (or individual JSONs depending on script configuration) containing function and instruction data.
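
For orientation, a headless run boils down to one analyzeHeadless command per binary. The sketch below only builds that command; it does not claim to match what analyzer.py does. The project name `vestigo_tmp` is hypothetical, and reading the install path from GHIDRA_INSTALL_DIR with a /opt/ghidra fallback is an assumption borrowed from the quickstart rather than from analyzer.py itself.

```python
import os
from pathlib import Path

def headless_command(binary, project_dir, script="ghidra_script.py", script_dir="."):
    """Build an analyzeHeadless invocation that imports one binary and runs a post-script."""
    ghidra_home = Path(os.environ.get("GHIDRA_INSTALL_DIR", "/opt/ghidra"))
    return [
        str(ghidra_home / "support" / "analyzeHeadless"),
        str(project_dir),          # where the temporary Ghidra project lives
        "vestigo_tmp",             # hypothetical project name
        "-import", str(binary),
        "-scriptPath", str(script_dir),
        "-postScript", script,     # runs after auto-analysis completes
        "-deleteProject",          # discard the project once the script has run
    ]
```

Passing the returned list to `subprocess.run(cmd, check=True)` would execute it; keeping command construction separate makes it easy to log or dry-run.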

3. Feature Extraction (`extract_features.py`)

Extracts ML-ready features from a binary and its corresponding Ghidra analysis JSON.

Output: `4.features.json`, the final feature vector for machine-learning or rule-based crypto detection.

python3 extract_features.py

Note: The script currently runs on a single hard-coded example in its __main__ block. Modify it to iterate over your dataset as needed.
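
One way to iterate over a dataset is to pair each binary with its analysis JSON first and then loop over the pairs. The sketch below assumes the JSON for a binary is named `<binary file name>.json`; that naming convention and the function name are assumptions for illustration, so adjust them to match how your Ghidra output is actually named.

```python
from pathlib import Path

def pair_inputs(bin_dir, json_dir):
    """Return (binary, json) path pairs, skipping binaries with no matching JSON."""
    pairs = []
    for binary in sorted(Path(bin_dir).iterdir()):
        if not binary.is_file():
            continue
        # Assumed convention: bin/foo.elf -> <json_dir>/foo.elf.json
        candidate = Path(json_dir) / (binary.name + ".json")
        if candidate.is_file():
            pairs.append((binary, candidate))
    return pairs
```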

4. Dynamic Analysis (`dynamic_analysis/`)

Automates the emulation and instrumentation of firmware binaries to extract runtime secrets and detect security vulnerabilities.

Components:

emulator.py: Runs binaries using QEMU User Mode.
instrumentation.py: Injects Frida hooks to capture crypto keys.
log_monitor.py: Scans logs for Secure Boot failures and other leaks.
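
As an illustration of the log-scanning idea, the sketch below matches each log line against a small set of regular expressions. The patterns and labels are hypothetical examples, not the rules used by log_monitor.py; real bootloaders word their failures differently, so expect to tune them.

```python
import re

# Hypothetical patterns; tune them to the firmware under test.
PATTERNS = {
    "secure_boot_failure": re.compile(r"secure boot.*(fail|invalid|error)", re.IGNORECASE),
    "possible_key_leak": re.compile(r"\b(?:key|secret)\s*[=:]\s*[0-9a-f]{16,}", re.IGNORECASE),
}

def scan_log(lines):
    """Return (line_number, label, text) for every line matching a pattern."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label, line.strip()))
    return findings
```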

Usage:

python3 dynamic_analysis/dynamic_main.py <binary_path> <arch> <sysroot_path>

Example:

python3 dynamic_analysis/dynamic_main.py ./busybox arm /tmp/extracted_fs
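
Under the hood, running a foreign-architecture binary with QEMU user mode typically means choosing the matching `qemu-<arch>` executable and pointing it at the extracted root filesystem with `-L`. The sketch below builds such a command from the same three arguments. The architecture-to-executable table is an assumption, and emulator.py may construct its command differently.

```python
# Assumed mapping from the <arch> argument to QEMU user-mode executables.
QEMU_BINARIES = {
    "arm": "qemu-arm",
    "mips": "qemu-mips",
    "mipsel": "qemu-mipsel",
    "riscv64": "qemu-riscv64",
    "x86_64": "qemu-x86_64",
}

def qemu_command(binary_path, arch, sysroot_path, args=()):
    """Build a QEMU user-mode command line for one binary."""
    try:
        emulator = QEMU_BINARIES[arch]
    except KeyError:
        raise ValueError(f"unsupported architecture: {arch}") from None
    # -L points QEMU at the directory holding the target's loader and libraries.
    return [emulator, "-L", sysroot_path, binary_path, *args]

print(qemu_command("./busybox", "arm", "/tmp/extracted_fs"))
```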

Supported Architectures

The builder.py script supports the following architectures via the provided Dockerfile:

x86_64 (GCC, Clang)
ARM (arm-linux-gnueabihf)
MIPS (mips-linux-gnu)
RISC-V (riscv64-linux-gnu)
AVR (avr-gcc)
Z80 (sdcc)

Project Structure

builder.py: Orchestrates the Docker-based cross-compilation.
Dockerfile: Defines the build environment with all cross-compilers.
analyzer.py: Wrapper for Ghidra headless analysis.
ghidra_script.py: The Ghidra Python script executed by analyzeHeadless.
extract_features.py: Extracts static and structural features from binaries.
aes_*.c: Example cryptographic source files.

Firmware Extraction

Prerequisites

Python 3
Podman (Docker can be used with script modification, but Podman is default)
Binwalk (sudo apt install binwalk)
System Tools: objcopy (usually part of binutils)

Step 1: Build the Sasquatch Container

Build the image:
```
podman build -t sasquatch_tool .
```

Step 2: Run the Analyzer

Run the Python script on your firmware file. The script handles conversion, Binwalk extraction, container mounting, and crypto scanning automatically.

python3 unpacker.py <firmware_filename>
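
The conversion and extraction steps can be pictured as a short command plan. The sketch below guesses whether the input is an Intel HEX file, adds an objcopy conversion to raw binary if so, and then schedules a Binwalk extraction. The heuristic, the `.bin` output name, and the function names are assumptions for illustration; the actual script may detect formats and invoke the tools differently, and the container step is omitted here.

```python
def looks_like_intel_hex(data):
    """Heuristic: every non-empty line is ':' followed only by hex digits."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return False
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    hex_digits = set("0123456789abcdefABCDEF")
    # The shortest valid record (end-of-file) is 11 characters long.
    return bool(lines) and all(
        line[0] == ":" and len(line) >= 11 and set(line[1:]) <= hex_digits
        for line in lines
    )

def plan_commands(firmware, data):
    """Return the tool invocations (as argument lists) needed for one firmware file."""
    commands = []
    target = firmware
    if looks_like_intel_hex(data):
        target = firmware + ".bin"
        commands.append(["objcopy", "-I", "ihex", "-O", "binary", firmware, target])
    commands.append(["binwalk", "-e", target])
    return commands
```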


ShinichiShi/Vestigo

Folders and files

Latest commit

History

Repository files navigation

Vestigo — Firmware analysis & crypto-detection pipeline

Key project goals

Quick facts / highlights

Minimum prerequisites

Quickstart (recommended test flow)

Common workflows and commands

Repo layout (high level)

Notes, assumptions and safety

Contributing

License

Where to go next

Features

Prerequisites

Installation

Usage

1. Cross-Compilation (builder.py)

2. Binary Analysis (analyzer.py)

3. Feature Extraction (extract_features.py)

4.features.json

4. Dynamic Analysis (dynamic_analysis/)

Supported Architectures

Project Structure

Firmware Extraction

Prerequisites

Step 1: Build the Sasquatch Container

Step 2: Run the Analyzer

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

1. Cross-Compilation (`builder.py`)

2. Binary Analysis (`analyzer.py`)

3. Feature Extraction (`extract_features.py`)

`4.features.json`

4. Dynamic Analysis (`dynamic_analysis/`)

Packages