feat: add legal and scam training datasets for safety model #54
Merged
aidan-diaz merged 1 commit into dev on Apr 14, 2026
Contributor
This looks good! One recommendation: add the training data to .gitignore once training is finished, since large files in the repository can slow down git operations. No app features are modified, so everything in the app still works exactly as before. Nice work!
Feature Summary
What does this PR change?
Adds two labeled training datasets for the Safety model under data/safety/.
Why was this change made?
The Safety model needs labeled examples of both legitimate legal documents and scam documents to learn how to differentiate between them for red flag detection.
What is the code meant to do?
Provides training data for the Safety model. Legitimate legal documents are labeled likely_legitimate and real fraud emails impersonating legal/government entities are labeled likely_scam. Both follow the schema defined in app/api/safety/system-prompt.md.
Feature Team / Lane
Team #: (1–5)
Team 2
DevOps Lane: (if applicable)
Type of Change
Testing
How was this tested?
Automated Testing:
N/A — data files only, no code changes
Manual Testing:
Files validated locally against safety schema
JSON structure verified
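A local check like the two manual steps above could be sketched as follows. This is only an illustration: the labels likely_legitimate and likely_scam come from this PR, but the field names "text" and "label" and the assumption that each file holds a JSON list of records are hypothetical, since the actual schema lives in app/api/safety/system-prompt.md and is not reproduced here.

```python
import json

# Labels are taken from the PR description. The field names below are
# assumptions for illustration, not the schema in system-prompt.md.
ALLOWED_LABELS = {"likely_legitimate", "likely_scam"}
REQUIRED_FIELDS = ("text", "label")  # hypothetical field names


def validate_records(raw_json: str) -> list[str]:
    """Return a list of problems found; an empty list means the data passed."""
    try:
        records = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]

    if not isinstance(records, list):
        return ["top-level value must be a list of records"]

    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"record {i}: missing field '{field}'")
        if "label" in record:
            label = record["label"]
            # Check the type first so unhashable values do not raise.
            if not isinstance(label, str) or label not in ALLOWED_LABELS:
                problems.append(f"record {i}: unknown label {label!r}")
    return problems
```

Reading each dataset file's contents and passing them to validate_records would report every malformed record at once rather than stopping at the first one, which tends to be more convenient when cleaning a large file.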
Screenshots (if UI changes)
Attach screenshots or screen recordings here if the PR includes UI changes.
Risks / Edge Cases
Scam dataset is email-based so some records may be shorter than typical legal documents
Category assignment is keyword-based and may need refinement once the model is tested
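To make the second risk concrete, here is a minimal sketch of what keyword-based category assignment can look like. The category names, keywords, and the assign_category helper are all hypothetical; the PR does not list the real categories or how they were assigned.

```python
import re

# Hypothetical categories and keywords, for illustration only.
CATEGORY_KEYWORDS = {
    "payment_demand": ("wire transfer", "gift card", "processing fee"),
    "government_impersonation": ("irs", "social security", "court order"),
}
DEFAULT_CATEGORY = "uncategorized"


def assign_category(text: str) -> str:
    """Return the first category with a keyword matching as a whole word or phrase."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        for keyword in keywords:
            # Word boundaries stop "irs" from matching inside "first".
            if re.search(rf"\b{re.escape(keyword)}\b", lowered):
                return category
    return DEFAULT_CATEGORY
```

In a scheme like this the first matching category wins, so a record that mentions both a gift card and the IRS is filed under whichever category is listed first. That ordering sensitivity, along with records that match no keyword at all, is the kind of thing that may need refinement after the model is evaluated.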
Environment Variables Added or Changed
None
Checklist