This is a full-stack document classification system with three main components:
- Backend (Flask):
app.py- REST API orchestrating the ML pipeline - ML Pipeline:
aws_sagemaker/(classification) +aws_textract_project/(OCR/extraction) - Frontend (React):
awsfrontend/src/- Multi-language UI with camera/file upload
File → S3 upload → Textract (OCR) → SageMaker endpoint (classification) → Bedrock/Polly (enrichment) → Frontend display
The system classifies tax forms (W2 vs non-W2) and extracts their fields using AWS services.
- S3 bucket: Hardcoded as
'w2-datasets'across all files (centralize if moving) - Label mapping:
label_mapping.json- maps class indices {0: "w2", 1: "non-w2"} - Endpoint config:
endpoint_name.txt+label_mapping.jsonrequired at runtime (copied to root fromaws_sagemaker/) - Allowed file types:
{'png', 'jpg', 'jpeg', 'pdf'}intextract_for_sagemaker.py
POST /upload-and-process- Main pipeline entry: validates file → uploads to S3 → processes through Textract → calls SageMaker- Uses
CORSenabled (necessary for frontend communication) - Temporary files cleaned with
tempfilemodule; S3 input folder cleared before each processing
- Predictor initialized in
aws_sagemaker/predict.pywith JSON serialization - Confidence threshold check at
0.5inpredict_text()function - Training datasets: 13 tax form types in
train_model.py(1040_1, sch_a, sch_b, etc.) - Output stored to
output/result/result.jsonin S3 with structure:{"predicted_label": "...", "confidence": ...}
- W2 forms: Use key-value pair extraction (
textract_ocr_better.py→get_kv_map/relationship) - Non-W2 forms: Fall back to line-based OCR (
textract_ocr_line.py) - Router logic in
aws_textract_project/pipeline.py- checks predicted form type first, then branches extraction method - Output: JSON with text content for Bedrock processing
- Language selection stored in
localStorage.selectedLanguage(default: 'en') - Uses
i18nextfor translations; locales atpublic/locales/{en,es,zh}/translation.json - Two-page SPA:
HomePage(upload/capture) →ResultsPage(results display) - Webcam integration via
react-webcam; audio encoding withlamejs - Navigation via
react-router-domv6
# Backend
pip install -r aws_sagemaker/requirements.txt
python app.py # Listens on Flask default (5000)
# Frontend
cd awsfrontend
npm install
npm start # Runs on http://localhost:3000python aws_sagemaker/train_model.py --bucket-name w2-datasets --prefix datasetsExpects training data in S3 at s3://w2-datasets/datasets/{form_type}/
python aws_sagemaker/deploy_model.pyGenerates endpoint_name.txt and label_mapping.json (copy to root if using app.py locally)
- All AWS clients initialized without explicit credentials → relies on IAM role in deployment
- SageMaker execution role hardcoded:
arn:aws:iam::715841371006:role/SageMakerExecutionRole - Development likely uses AWS CLI credentials or EC2 instance role
- No custom exception classes; logging via
loggingmodule (INFO level in predict.py) - S3 operations wrapped in try-catch with app.logger calls
- File validation uses
werkzeug.utils.secure_filenamebefore S3 upload
- Form type detection happens in pipeline.py by downloading SageMaker prediction result first
- No explicit session/request ID tracking; relies on single-file-at-a-time processing
- Temporary files cleaned up via
os.remove()after reading
- Scikit-learn models (implied by
train_test_splitimport; likely sklearn.naive_bayes or similar) - No explicit feature engineering visible; assumes Textract text features sufficient
- Confidence thresholds and label mapping configured at inference time, not training time
Adding a new form type to classify:
- Add dataset name to
dataset_nameslist intrain_model.py - Upload training data to S3 at
s3://w2-datasets/datasets/{form_type}/ - Retrain and redeploy SageMaker model
- Update
label_mapping.jsonwith new class index - Pipeline logic in
pipeline.pywill branch onget_kv_mapvs line OCR based on predicted label
Changing language support:
- Add locale JSON at
awsfrontend/public/locales/{lang_code}/translation.json - Update language selector in
HomePage.jsto include new option - Video instructions: add
instructions_{lang_code}.mp4to public folder
Debugging ML predictions:
- Check form type prediction in S3:
s3://w2-datasets/output/result/result.json - If confidence < 0.5, check Textract OCR output quality and training data distribution
- Enable verbose logging by changing
logging.basicConfig(level=logging.DEBUG)in predict.py