THIS PROJECT IS RETIRED
GitSleuth searches GitHub repositories for sensitive data. It provides both a command-line interface and a PyQt5 GUI with a dark theme.
- Predefined and custom search queries using templates in ADVANCED_QUERIES.md and SEARCH_QUERIES.md
- OAuth device flow authentication with secure token storage and rotation
- Optional session keep-alive after closing the GUI with automatic login restoration
- Dark-themed GUI and CLI with a keyword filter to narrow searches
- Searches include tokens for Vercel, Hugging Face, Supabase, Sentry, Rollbar, GitLab, Cloudflare, Vault and Pinecone
- Status bar shows rate limit pauses and tokens rotate automatically
- Export results to Excel or CSV
- Machine learning tab to label results and train a classifier using entropy and context features
- Optional integration with Yelp's detect-secrets or gitleaks for advanced scanning
- Results table includes rule descriptions
- Dictionary and format heuristics filter out common words, UUIDs or dates
- Pattern detection identifies environment variable names and token strings
- Snippets referencing environment variables (e.g. os.environ or process.env) are ignored
- Allowlist patterns skip known dummy secrets via ALLOWLIST_PATTERNS
- Placeholder filtering now detects values repeating the key name or wrapped in bold markup
- auto_resolve_conflicts.sh script merges branches while keeping incoming changes
- setup.sh allows installing dependencies before network access is disabled
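The format and placeholder heuristics listed above could look roughly like the following. This is a minimal sketch, not GitSleuth's actual code: every function name, the date formats, and the bold-markup pattern are assumptions chosen for illustration.

```python
import re
from datetime import datetime

# All names below are hypothetical; GitSleuth's real heuristics may differ.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
BOLD_RE = re.compile(r"(\*\*.+\*\*|<b>.+</b>|<strong>.+</strong>)")
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y")  # assumed formats


def looks_like_uuid(value: str) -> bool:
    """True for canonical hyphenated UUIDs only."""
    return UUID_RE.fullmatch(value) is not None


def looks_like_date(value: str) -> bool:
    """True if the value parses under one of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def _normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def is_placeholder(key: str, value: str) -> bool:
    """Flag values that repeat the key name, are wrapped in bold markup,
    or match a UUID or date format."""
    value = value.strip()
    if _normalize(value) == _normalize(key):
        return True
    if BOLD_RE.fullmatch(value):
        return True
    return looks_like_uuid(value) or looks_like_date(value)
```

With these assumptions, `is_placeholder("API_KEY", "api-key")` is flagged because the value only repeats the key name, while a random-looking token is passed on to later checks.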
- Python 3.x
- pip
- Git
- Clone the repository:

      git clone https://github.com/your-repository/GitSleuth.git
      cd GitSleuth

- Install dependencies. Run the setup script to ensure all requirements are installed while network access is available:

      ./setup.sh

Launch the GUI:

    python GitSleuth_GUI.py

Use the Keywords field to limit searches to specific domains or terms.
Each result row also displays the description of the rule that matched.
Use the Label column to mark each result as a True Positive or False Positive. Click Export Labels to save the selections to training_labels.csv for machine learning.
Open the ML tab and click Perform Machine Learning to train a simple text classifier on the saved labels. Any newly labeled rows are automatically appended to training_labels.csv before training begins to ensure no data is lost. Training progress is shown in the tab's output area.
The labeled data is stored in training_labels.csv. Models are currently kept in memory after training.
Example passwords for experimentation are provided in training_data.csv.
Training uses TF‑IDF text features combined with entropy and character composition metrics (length, numeric %, alphabetic %, special %) plus simple dictionary and pattern checks to help distinguish real secrets from placeholders. Additional features also encode the file type (config, source, log, other) and simple structural context such as assignments or secret-setting function calls.
Ensure your labeled dataset includes both True Positive and False Positive examples; otherwise model training will fail with a single-class error.
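As a rough illustration of the character composition and file-type features described above, the sketch below computes them with the standard library. The function names and the extension sets are assumptions for this example, not the project's actual feature extractor.

```python
from pathlib import PurePosixPath

# Hypothetical extension groups; the real mapping may differ.
CONFIG_SUFFIXES = {".json", ".yaml", ".yml", ".ini", ".toml"}
SOURCE_SUFFIXES = {".py", ".js", ".ts", ".go", ".java"}


def composition_features(value: str) -> dict:
    """Return length plus numeric, alphabetic and special percentages."""
    n = len(value)
    if n == 0:
        return {"length": 0, "numeric_pct": 0.0,
                "alpha_pct": 0.0, "special_pct": 0.0}
    digits = sum(ch.isdigit() for ch in value)
    letters = sum(ch.isalpha() for ch in value)
    special = n - digits - letters  # everything that is neither digit nor letter
    return {
        "length": n,
        "numeric_pct": 100.0 * digits / n,
        "alpha_pct": 100.0 * letters / n,
        "special_pct": 100.0 * special / n,
    }


def classify_file_type(path: str) -> str:
    """Bucket a path into config, source, log or other by its suffix."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in CONFIG_SUFFIXES:
        return "config"
    if suffix in SOURCE_SUFFIXES:
        return "source"
    if suffix == ".log":
        return "log"
    return "other"
```

These numeric features would sit alongside the TF‑IDF vector in the classifier input; how GitSleuth actually combines them is not shown here.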
Open the ML tab and click Test Example Model to evaluate a simple
classifier on testing_data.csv and display the accuracy in the output
area. Enter any phrase in the provided field and click Analyze Phrase
to highlight detected secrets, the preceding indicator, entropy score and
- After running a search, mark each result row as True Positive or False Positive using the Label column.
- Click Export Labels to save the selections. This writes them to training_labels.csv so the ML tab can load them.
- Switch to the ML tab and click Perform Machine Learning. The application reads training_labels.csv, extracts text and entropy features and trains a logistic regression classifier.
- When training completes, the tab displays how many samples were used and the model remains in memory for the current session.
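Because training fails when only one class is present, it can help to check the exported labels before the training step. The sketch below assumes a CSV with `snippet` and `label` columns; those column names are hypothetical, and the real layout of training_labels.csv may differ.

```python
import csv
import io

# Hypothetical column names; adjust to the actual training_labels.csv header.
TEXT_COLUMN = "snippet"
LABEL_COLUMN = "label"
REQUIRED_LABELS = {"True Positive", "False Positive"}


def load_labels(handle):
    """Read (text, label) pairs from an open CSV file or file-like object."""
    reader = csv.DictReader(handle)
    return [(row[TEXT_COLUMN], row[LABEL_COLUMN]) for row in reader]


def check_both_classes(rows):
    """Raise ValueError if either label class is missing; return sample count."""
    present = {label for _, label in rows}
    missing = REQUIRED_LABELS - present
    if missing:
        raise ValueError(f"missing label classes: {sorted(missing)}")
    return len(rows)


# In-memory sample standing in for training_labels.csv.
SAMPLE = io.StringIO(
    "snippet,label\n"
    "api_key = 'x9Qk2LmZ7pRt4VbN',True Positive\n"
    "password = 'changeme',False Positive\n"
)
rows = load_labels(SAMPLE)
print(check_both_classes(rows))
```

For a real file, `load_labels(open("training_labels.csv", newline=""))` would replace the in-memory sample.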
Run the command-line interface:

    python GitSleuth.py

When starting OAuth authentication, your default browser will automatically open to the GitHub device flow page so you can enter the provided code.
Edit config.json to adjust log level, ignored filenames, and path patterns
that should be skipped (e.g. tests/, examples/, or files containing .sample.).
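A minimal sketch of how such skip rules might be applied is shown below. The key names `LOG_LEVEL`, `IGNORED_FILENAMES` and `IGNORED_PATH_PATTERNS` are assumptions for illustration; check config.json for the names the application actually uses.

```python
import json
from pathlib import PurePosixPath

# Hypothetical config.json contents; real key names may differ.
EXAMPLE_CONFIG = json.loads("""
{
  "LOG_LEVEL": "INFO",
  "IGNORED_FILENAMES": ["package-lock.json"],
  "IGNORED_PATH_PATTERNS": ["tests/", "examples/", ".sample."]
}
""")


def should_skip(path: str, config: dict) -> bool:
    """Return True if the path matches an ignored filename or path pattern."""
    parsed = PurePosixPath(path)
    if parsed.name in config.get("IGNORED_FILENAMES", []):
        return True
    for pattern in config.get("IGNORED_PATH_PATTERNS", []):
        if pattern.endswith("/"):
            # Directory pattern: match a whole path component, so that
            # "tests/" does not accidentally match "contests/".
            if pattern.rstrip("/") in parsed.parts[:-1]:
                return True
        elif pattern in parsed.name:
            # Filename fragment such as ".sample."
            return True
    return False
```

Matching directory patterns against whole path components is a design choice of this sketch; a plain substring test would be simpler but would also skip unrelated directories whose names merely end in `tests`.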
You can also control whether placeholder terms are filtered from queries and results. When enabled, the
filter removes hits where environment variables have empty or placeholder
values, and ignores snippets that retrieve values from environment variables.
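The filter behaviour described above can be approximated with two regular expressions. This is a sketch under stated assumptions: the placeholder term list and the function name are invented for the example and are not taken from GitSleuth.

```python
import re

# Matches code that reads a value from the environment instead of embedding it.
ENV_LOOKUP_RE = re.compile(r"\bos\.environ\b|\bos\.getenv\s*\(|\bprocess\.env\b")
# Matches simple NAME=value assignments, optionally prefixed with "export".
ASSIGN_RE = re.compile(r"^\s*(?:export\s+)?([A-Z][A-Z0-9_]*)\s*=\s*(.*)$")
# Hypothetical placeholder terms; the empty string covers "NAME=".
PLACEHOLDER_TERMS = {"", "changeme", "your_token_here", "xxx", "todo"}


def is_env_noise(snippet: str) -> bool:
    """True for environment lookups and for empty or placeholder assignments."""
    if ENV_LOOKUP_RE.search(snippet):
        return True
    match = ASSIGN_RE.match(snippet)
    if match:
        value = match.group(2).strip().strip("'\"").lower()
        return value in PLACEHOLDER_TERMS
    return False
```

A line such as `API_KEY=sk_live_4f9a` is kept for further checks, while `API_KEY="changeme"` and `os.environ["GITHUB_TOKEN"]` are dropped.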
You can also define ALLOWLIST_PATTERNS with regex strings for
known dummy secrets so matching snippets are ignored.
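As a sketch of how ALLOWLIST_PATTERNS might be applied, the example below compiles the regex strings once and drops any snippet that matches. The sample patterns and helper names are illustrative assumptions; `AKIAIOSFODNN7EXAMPLE` is the example access key used in AWS documentation.

```python
import re

# Illustrative patterns only; supply your own known dummy secrets.
ALLOWLIST_PATTERNS = [
    r"AKIAIOSFODNN7EXAMPLE",
    r"(?i)dummy[_-]?secret",
]


def compile_allowlist(patterns):
    """Compile regex strings once so they can be reused for every snippet."""
    return [re.compile(p) for p in patterns]


def is_allowlisted(snippet: str, compiled) -> bool:
    """True if any allowlist pattern occurs anywhere in the snippet."""
    return any(p.search(snippet) for p in compiled)


COMPILED = compile_allowlist(ALLOWLIST_PATTERNS)
```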
Enable USE_DETECT_SECRETS to scan snippets with the detect-secrets
tool and set DETECT_SECRETS_BASELINE to a baseline file for allowlisted
secrets. Enable USE_GITLEAKS to perform an additional scan with
gitleaks and optionally provide GITLEAKS_CONFIG to specify a custom
configuration file.
Set ENTROPY_THRESHOLD (bits/char) to skip low-entropy values that
look like placeholders.
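For reference, per-character Shannon entropy can be computed as in the sketch below. The threshold value of 3.0 and the function names are assumptions for illustration; the project's default ENTROPY_THRESHOLD is not stated here.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 3.0  # hypothetical value, in bits per character


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def passes_entropy(value: str, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Keep a value only if its entropy reaches the threshold."""
    return shannon_entropy(value) >= threshold
```

Under this assumed threshold, a word like `password` (2.75 bits/char) is skipped, while a 16-character string with no repeated characters (4.0 bits/char) is kept.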
The application ships with a default GitHub OAuth client ID so it works out of
the box. Set GITHUB_OAUTH_CLIENT_ID to override it and define
GITHUB_OAUTH_CLIENT_SECRET if your OAuth app requires a secret.
By default the application requests the public_repo OAuth scope, which
limits access to public repositories. Override this by setting
GITHUB_OAUTH_SCOPE if you require additional permissions.
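The sketch below shows one way these settings could be read and turned into the first request of GitHub's device flow. It only builds the request and does not send it. The helper names and the placeholder default client ID are assumptions; the endpoint `https://github.com/login/device/code` and its `client_id` and `scope` parameters come from GitHub's device flow documentation.

```python
import os
import urllib.parse
import urllib.request

DEFAULT_CLIENT_ID = "<default-client-id>"  # placeholder, not the shipped ID
DEVICE_CODE_URL = "https://github.com/login/device/code"


def load_oauth_settings(env=None) -> dict:
    """Read OAuth settings from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    return {
        "client_id": env.get("GITHUB_OAUTH_CLIENT_ID", DEFAULT_CLIENT_ID),
        "client_secret": env.get("GITHUB_OAUTH_CLIENT_SECRET"),  # may be None
        "scope": env.get("GITHUB_OAUTH_SCOPE", "public_repo"),
    }


def build_device_code_request(settings: dict) -> urllib.request.Request:
    """Build (but do not send) the POST that starts the device flow."""
    body = urllib.parse.urlencode(
        {"client_id": settings["client_id"], "scope": settings["scope"]}
    ).encode("ascii")
    return urllib.request.Request(
        DEVICE_CODE_URL,
        data=body,
        headers={"Accept": "application/json"},
        method="POST",
    )
```

Passing a plain dictionary as `env` makes the function easy to exercise without touching the real environment, for example `load_oauth_settings({"GITHUB_OAUTH_SCOPE": "repo"})`.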
The repository includes auto_resolve_conflicts.sh to merge incoming
changes while prioritizing the new code. Run the script with the name
of the branch you want to merge:

    ./auto_resolve_conflicts.sh main

It fetches the branch from origin, merges it with the theirs strategy,
and verifies that no conflict markers remain.
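The final verification step can be illustrated in Python. This is only a sketch of a conflict-marker check, not the script itself, and it assumes the script's merge is equivalent to something like `git merge -X theirs origin/<branch>`; the function name is hypothetical.

```python
import re

# Git writes "<<<<<<< ours", "=======" and ">>>>>>> theirs" around conflicts.
MARKER_RE = re.compile(r"(<{7} .*|={7}|>{7} .*)")


def find_conflict_markers(text: str) -> list:
    """Return the 1-based line numbers that look like Git conflict markers."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if MARKER_RE.fullmatch(line)
    ]


EXAMPLE = """value = 1
<<<<<<< HEAD
value = 2
=======
value = 3
>>>>>>> origin/main
"""
print(find_conflict_markers(EXAMPLE))
```

A line consisting of exactly seven `=` characters can also appear legitimately, for instance as a heading underline in some text files, so a check like this may need an exclusion list for documentation.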
If you regularly pull from the remote repository, use ./auto_pull.sh
instead of running git pull directly. This wrapper invokes the
conflict resolution script automatically whenever a merge conflict
occurs, so the incoming version is always kept.
Before committing, ensure the main modules compile:

    python -m py_compile GitSleuth_GUI.py OAuth_Manager.py Token_Manager.py GitSleuth.py GitSleuth_API.py

Contributions are welcome. Please follow standard open-source practices.
Released under the MIT License. See LICENSE.
For support, open an issue on GitHub or contact david.mclaughlin@dsmcl.com.