
feat: add legal and scam training datasets for safety model#54

Merged
aidan-diaz merged 1 commit into dev from feature/safety-training-data on Apr 14, 2026

Conversation

@Fadma1234

Feature Summary

What does this PR change?
Adds two labeled training datasets for the Safety model under data/safety/.

Why was this change made?
The Safety model needs labeled examples of both legitimate legal documents and scam documents to learn how to differentiate between them for red flag detection.
What is the code meant to do?
Provides training data for the Safety model. Legitimate legal documents are labeled likely_legitimate and real fraud emails impersonating legal/government entities are labeled likely_scam. Both follow the schema defined in app/api/safety/system-prompt.md.

Feature Team / Lane

Team #: (1–5)
Team 2
DevOps Lane: (if applicable)


Type of Change

  • [x] Feature
  • Bug fix
  • Refactor
  • Documentation
  • CI/CD
  • Other (please specify)

Testing

How was this tested?
Automated Testing:
N/A — data files only, no code changes

Automated Testing

  • Unit tests added or updated
  • Integration tests added or updated
  • Existing tests pass locally
  • CI pipeline passes

Manual Testing

Files validated locally against safety schema
JSON structure verified
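
A local check along these lines could be sketched as follows. This is a minimal sketch, not the script used for this PR: the field names `text` and `label` are assumptions, since the actual schema is defined in app/api/safety/system-prompt.md and may differ. Only the two label values come from the PR description.

```python
import json

# Labels named in the PR description.
ALLOWED_LABELS = {"likely_legitimate", "likely_scam"}

# Hypothetical field names: the real schema lives in
# app/api/safety/system-prompt.md and may differ.
REQUIRED_FIELDS = ("text", "label")


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
    if "label" in record:
        label = record["label"]
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            problems.append(f"unexpected label: {label!r}")
    if "text" in record:
        text = record["text"]
        if not isinstance(text, str) or not text.strip():
            problems.append("text is empty or not a string")
    return problems


def validate_dataset(raw_json):
    """Parse a JSON array of records and map record index -> problems."""
    records = json.loads(raw_json)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    report = {}
    for index, record in enumerate(records):
        problems = validate_record(record)
        if problems:
            report[index] = problems
    return report
```

An empty report means every record passed; `json.loads` raising an error means the file itself is not valid JSON.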


Screenshots (if UI changes)

Attach screenshots or screen recordings here if the PR includes UI changes.


Risks / Edge Cases

The scam dataset is email-based, so some records may be shorter than typical legal documents.
Category assignment is keyword-based and may need refinement once the model is tested.
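
To make the second risk concrete, keyword-based category assignment might look roughly like the sketch below. The category names and keywords are hypothetical and chosen only for illustration; the PR does not list the ones actually used.

```python
import re

# Hypothetical categories and keywords, for illustration only;
# the PR does not list the ones actually used.
CATEGORY_KEYWORDS = {
    "court_notice": ("court", "summons", "subpoena"),
    "payment_demand": ("wire transfer", "gift card", "processing fee"),
    "tax": ("irs", "tax refund"),
}
DEFAULT_CATEGORY = "other"


def assign_category(text):
    """Return the first category whose keyword occurs as a whole word or phrase."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        for keyword in keywords:
            # \b keeps "court" from matching inside "courtyard".
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                return category
    return DEFAULT_CATEGORY
```

Because the first matching category wins, a record that mentions both a court and a tax refund receives only one category. That kind of ambiguity is one reason such an assignment may need refinement later.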


Environment Variables Added or Changed

None


Checklist

  • [x] Lint passes
  • Type check passes
  • No console logs remain
  • Deployment preview verified

@vercel

vercel bot commented Apr 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| multilingual-ai-document-assistant | Ready | Preview, Comment | Apr 13, 2026 11:33pm |


@Fadma1234 Fadma1234 requested a review from rakimdevcraig April 13, 2026 23:35
@aidan-diaz aidan-diaz requested review from aidan-diaz and removed request for rakimdevcraig April 14, 2026 02:07
@aidan-diaz
Contributor

This looks good! One recommendation: add the training data to .gitignore once training is finished, since large files can slow down git operations. No actual features are being modified, so everything in the app still works exactly the same as before. Nice work!

@aidan-diaz aidan-diaz merged commit 1437fdd into dev Apr 14, 2026
3 checks passed


2 participants