Student: Jeffrey Dick
Date: May 28, 2025
Project: ML Engineering Capstone - AI-Powered Citation Verification
This document outlines the production deployment architecture for the Citation Verification System, a machine learning application that classifies scientific claims as SUPPORT, REFUTE, or NEI (Not Enough Information) based on provided evidence. The system leverages multiple approaches including a fine-tuned DeBERTa model deployed on HuggingFace Spaces with GPU acceleration, LLM-powered evidence retrieval, and domain-adaptive retraining strategies designed to handle diverse user populations while maintaining cost-effectiveness and scalability.
[Architecture Diagram - Figure 1: Citation Verification System Production Deployment Architecture showing the complete system flow from user interface through model inference, data storage, and continuous improvement cycles]
The production architecture comprises five primary layers that handle the complete lifecycle from user interaction to model improvement.
- The User Interface Layer serves as the primary entry point through the Gradio web application hosted on AI4citations, providing both interactive web access for researchers and programmatic REST API access for third-party integrations.
- The Application Layer orchestrates the core business logic through an inference pipeline that manages evidence preprocessing and model selection. A BM25S implementation extracts the most relevant sentences from uploaded PDFs (see the retrieval sketch after this list).
- The Model Inference Layer represents the core intelligence of the system, featuring a primary DeBERTa model fine-tuned on the SciFact and Citation-Integrity datasets. GPU inference runs on Nvidia T4 instances on HuggingFace Spaces.
- The Data Storage Layer maintains comprehensive version control and artifact management through the HuggingFace Hub for model artifacts. User interaction logs capture detailed performance metrics and feedback for RLHF data collection.
- The Monitoring and Operations Layer provides real-time visibility into system performance through comprehensive metrics tracking, user feedback collection, and statistical monitoring for data drift detection.
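A minimal sketch of the BM25S sentence-retrieval step, assuming the bm25s Python package and a pre-split list of sentences (PDF parsing omitted); the production pipeline's tokenization and ranking settings may differ.

```python
import bm25s

# Candidate evidence sentences extracted from an uploaded PDF (parsing step omitted).
sentences = [
    "The drug reduced tumor growth by 40% in the treatment group.",
    "Participants were recruited from three university hospitals.",
    "No significant difference was observed in the control arm.",
]
claim = "The drug slows tumor growth."

# Build a BM25 index over the candidate sentences.
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(sentences, stopwords="en"))

# Retrieve the top-k sentences most relevant to the claim.
indices, scores = retriever.retrieve(bm25s.tokenize(claim), k=2)
for idx, score in zip(indices[0], scores[0]):
    print(f"{score:.2f}  {sentences[idx]}")
```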
The fine-tuned DeBERTa model provides the primary classification capability, maintaining the 7 percentage point F1 improvement over baseline models.
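As an illustration of how the fine-tuned classifier is served, the sketch below scores a claim-evidence pair with the transformers library; the Hub repository ID is a placeholder, and the label names are read from the model config rather than assumed.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder Hub ID; substitute the actual fine-tuned checkpoint.
MODEL_ID = "<user>/deberta-v3-scifact-citation-integrity"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

claim = "The drug slows tumor growth."
evidence = "The drug reduced tumor growth by 40% in the treatment group."

# Encode the claim-evidence pair as a single NLI-style sequence.
inputs = tokenizer(claim, evidence, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# Report per-class probabilities using the label names stored in the model config.
for i, p in enumerate(probs):
    print(f"{model.config.id2label[i]}: {p:.3f}")
```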
LLM integration offers complementary capabilities for complex evidence retrieval and one-shot verification scenarios where traditional fine-tuned models may struggle. However, this approach introduces significant cost and latency tradeoffs. Running the test harness on 300 examples with LLM-based verification would cost approximately $7.20 per run (300 × 20,000 tokens × $1/1M input tokens + 300 × 1,000 tokens × $4/1M output tokens), making it suitable for high-value scenarios but impractical for routine testing.
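For reference, the per-run estimate can be reproduced directly from the quoted token counts and prices:

```python
# Reproduce the per-run cost estimate for the 300-example LLM test harness.
n_examples = 300
input_tokens, output_tokens = 20_000, 1_000   # tokens per example
input_price, output_price = 1.0, 4.0          # USD per 1M tokens

cost = (n_examples * input_tokens * input_price
        + n_examples * output_tokens * output_price) / 1_000_000
print(f"${cost:.2f} per run")  # $7.20
```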
HuggingFace Spaces remains the primary deployment platform due to its cost-effectiveness and integration benefits.
The system processes diverse input types to accommodate various user workflows and integration scenarios.
- Primary inputs include text pairs consisting of claims and evidence statements, with support for both direct text entry and PDF document upload for evidence extraction.
- Secondary inputs encompass user feedback for model improvement, configuration preferences for model selection, and domain-specific metadata that influences retraining decisions.
System outputs provide:
- classification predictions (SUPPORT/REFUTE/NEI) with confidence scores
- visualizations showing probability distributions across classes
- retrieved evidence sentences with relevance rankings
- per-request response times for latency monitoring
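Putting these outputs together, one illustrative (non-authoritative) shape for a single prediction response is sketched below; the field names are assumptions, not the actual API contract.

```python
# Hypothetical response shape; field names are illustrative only.
example_response = {
    "prediction": "SUPPORT",
    "probabilities": {"SUPPORT": 0.87, "NEI": 0.09, "REFUTE": 0.04},
    "evidence": [
        {"rank": 1, "sentence": "The drug reduced tumor growth by 40% in the treatment group."},
    ],
    "response_time_ms": 830,
}
```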
The request processing workflow has been designed to accommodate multiple interaction modes:
- Mode 1: The user submits claim-evidence pairs via the web interface or API endpoint.
- Mode 2: The user submits claim-PDF pairs, and the retrieval system extracts evidence statements from the PDF.
- Model selection chooses between the fine-tuned model and LLM-powered verification.
- Response formatting includes visualizations, evidence citations, and performance metadata.
- Interaction logging captures detailed metrics for monitoring and future training data.
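As a sketch of the programmatic path (Mode 1), a claim-evidence pair could be submitted through the Gradio Python client; the Space ID and endpoint name below are placeholders, so consult the auto-generated API docs on the AI4citations Space for the real signature.

```python
from gradio_client import Client

# Placeholder Space ID and endpoint; check the Space's API docs for the actual values.
client = Client("<user>/AI4citations")
result = client.predict(
    "The drug slows tumor growth.",                                  # claim
    "The drug reduced tumor growth by 40% in the treatment group.",  # evidence
    api_name="/predict",
)
print(result)
```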
Model versioning and deployment support domain-specific adaptations.
- Models are stored on HuggingFace Hub with versioning to track fine-tuning iterations.
- [TODO] Implement event-driven triggers for retraining:
- User domain analysis
- Performance degradation
- Accumulated feedback volume
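To pin an exact fine-tuning iteration from the Hub (per the versioning bullet above), a deployment can load the model by revision; the repository ID and tag below are placeholders.

```python
from transformers import AutoModelForSequenceClassification

# Load a specific fine-tuning iteration by Hub revision (tag, branch, or commit hash).
model = AutoModelForSequenceClassification.from_pretrained(
    "<user>/deberta-v3-scifact-citation-integrity",  # placeholder repo ID
    revision="v1.2.0",                               # placeholder version tag
)
```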
Deployment strategies vary by model type:
- Fine-tuned transformer models support rapid deployment and rollback.
- LLM integrations require careful cost monitoring and re-evaluation whenever the underlying model is updated.
- Blue-green deployment ensures zero-downtime updates.
- Automatic rollback is triggered if performance metrics fall below established thresholds during the validation window (a threshold sketch follows this list).
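A minimal sketch of the validation-window gate: the 0.60 floor echoes the project's success criterion (>60% macro F1), while the 0.02 regression margin is an assumed value.

```python
F1_FLOOR = 0.60            # from the success criterion (>60% macro F1)
REGRESSION_MARGIN = 0.02   # assumed tolerance versus the production model

def should_rollback(candidate_f1: float, production_f1: float) -> bool:
    """Roll back if the candidate falls below the floor or regresses too far."""
    return candidate_f1 < F1_FLOOR or candidate_f1 < production_f1 - REGRESSION_MARGIN

print(should_rollback(candidate_f1=0.61, production_f1=0.66))  # True: regression exceeds margin
```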
The evaluation framework has these components:
- F1 score as the primary performance metric, as it provides balanced assessment of precision and recall across the three-class classification problem.
- Test harnesses use held-out test sets from both the SciFact and Citation-Integrity datasets, providing consistent benchmarks across retraining iterations.
- Additional metrics, including inference latency, round out the performance assessment.
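For reference, macro F1 over the three classes can be computed with scikit-learn on a toy batch of predictions:

```python
from sklearn.metrics import f1_score

# Toy example: gold versus predicted labels for the three-class problem.
y_true = ["SUPPORT", "REFUTE", "NEI", "SUPPORT", "NEI", "REFUTE"]
y_pred = ["SUPPORT", "REFUTE", "NEI", "NEI",     "NEI", "SUPPORT"]

print(f1_score(y_true, y_pred, average="macro"))
```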
Retraining occurs through both statistical monitoring and usage-based triggers. Data drift detection analyzes input distribution changes that may indicate new user populations or evolving claim patterns. Performance degradation alerts trigger immediate investigation and potential retraining when accuracy drops below established thresholds. User feedback accumulation triggers domain-specific retraining when sufficient annotated examples become available from particular user communities.
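As a minimal sketch of the statistical monitoring described above, a two-sample Kolmogorov-Smirnov test can compare a reference window of one input feature against recent traffic; claim length is used here purely for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference_lengths = rng.normal(25, 8, size=1000)  # historical claim token lengths
recent_lengths = rng.normal(32, 8, size=1000)     # most recent traffic window

stat, p_value = ks_2samp(reference_lengths, recent_lengths)
if p_value < 0.01:
    print(f"Possible input drift (KS statistic={stat:.3f}, p={p_value:.2e})")
```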
Monthly operating costs reflect the multi-tiered approach with baseline infrastructure at $11/month for GPU instances and storage, periodic retraining averaging $15/month across quarterly cycles, and variable LLM costs ranging from $0-100/month depending on usage patterns. The total estimated range of $26-126/month provides flexibility to scale service levels based on user demand and performance requirements.
LLM integration introduces significant cost and performance considerations that must be carefully balanced against accuracy benefits. While LLM-powered verification can achieve higher accuracy on complex claims, the computational cost is approximately 10-15x higher than fine-tuned model inference. Latency increases from sub-second response times to 5-15 seconds for complex document processing, potentially impacting user experience for interactive applications.
The cost structure makes LLM verification most suitable for high-value use cases such as publication review, grant evaluation, or regulatory compliance, where accuracy justifies the added expense. For routine verification tasks, the fine-tuned LoRA models provide the best cost-performance balance while maintaining high accuracy.
The system architecture emphasizes procedural automation to ensure reproducible results and efficient debugging workflows. All training pipelines, evaluation procedures, and deployment processes are fully scripted and version-controlled, enabling complete result regeneration through simple command execution. This approach significantly reduces debugging time and ensures consistent results across development, testing, and production environments.
- Claim detection to automatically decompose complex input text into atomic claims for more granular verification
- Use LLM-powered prompts based on FActScore/SAFE studies.
- Expanded utility for document-level verification and comprehensive fact-checking workflows.
- Update the AI4citations interface to provide user-configurable options for reranking algorithms, model selection preferences, and LLM integration based on accuracy and cost requirements. This flexibility allows users to optimize the verification process for their specific needs and budget constraints.
- Incorporate advanced NLP augmentation techniques to improve model robustness. These augmentation strategies particularly benefit domain adaptation when transitioning between scientific fields, as they help models generalize across varying writing styles and technical vocabularies while maintaining semantic accuracy.
- Back-translation (round-trip translation) through multiple language pairs (see the sketch after this list).
- Context-aware synonym replacement.
- Acquire new training datasets for domain-specific retraining scenarios.
- Trigger automated retraining workflows:
- when sufficient user feedback is obtained
- when performance degradation is detected
- when new training data from other user domains are available
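As a sketch of the back-translation idea above, a claim can be round-tripped through a second language using public MarianMT checkpoints; the actual augmentation pipeline and language pairs may differ.

```python
from transformers import pipeline

# Round-trip EN -> FR -> EN to generate a paraphrased training example.
en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

claim = "The drug reduced tumor growth by 40% in the treatment group."
french = en_fr(claim)[0]["translation_text"]
paraphrase = fr_en(french)[0]["translation_text"]
print(paraphrase)
```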
The retraining strategy acknowledges the diverse user base and varying domain requirements through intelligent dataset selection and trigger mechanisms. When user analytics indicate increased traffic from specific domains (e.g., clinical research vs. general science), the system automatically initiates domain-specific retraining using relevant datasets from the expanded corpus.
Available datasets for domain-specific adaptation include ReferenceErrorDetection for general academic claims, CliniFact for clinical research verification, DisFact for disaster-related claims, SC20 for high-impact journal citations, and CSR23 for humanities quotation accuracy. This comprehensive dataset library enables targeted improvements that directly address user community needs while maintaining general performance.
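A schematic of how the domain-to-dataset routing and retraining triggers could be wired together; the thresholds and function are assumptions, and the dataset names simply echo the corpus list above.

```python
# Hypothetical routing table: user domain -> additional training dataset.
DOMAIN_DATASETS = {
    "general_academic": "ReferenceErrorDetection",
    "clinical": "CliniFact",
    "disaster": "DisFact",
    "high_impact_citations": "SC20",
    "humanities": "CSR23",
}

MIN_FEEDBACK_EXAMPLES = 500  # assumed feedback-volume trigger
MAX_F1_DROP = 0.05           # assumed performance-degradation trigger

def select_retraining_datasets(domain_traffic: dict[str, float],
                               feedback_count: int,
                               baseline_f1: float,
                               current_f1: float) -> list[str]:
    """Return datasets for the next retraining run, or an empty list if no trigger fires."""
    triggered = (feedback_count >= MIN_FEEDBACK_EXAMPLES
                 or baseline_f1 - current_f1 >= MAX_F1_DROP)
    if not triggered:
        return []
    # Prioritize domains that now account for a substantial share of traffic.
    return [DOMAIN_DATASETS[d] for d, share in domain_traffic.items()
            if d in DOMAIN_DATASETS and share >= 0.2]

print(select_retraining_datasets({"clinical": 0.45, "general_academic": 0.30},
                                 feedback_count=620, baseline_f1=0.66, current_f1=0.64))
```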
Phase 1 (Current): Initial Deployment
Next Steps:
- Upgrade app to HuggingFace Spaces with GPU and persistent storage for performance metrics and user feedback
- Activate monitoring and logging
- Begin user feedback collection
- Integrate reranker into AI4citations app
Phase 2 (Next 30 days): LLM integration for evidence retrieval and comprehensive test harness implementation
Phase 3 (Next 60 days): Domain-adaptive retraining pipeline, LoRA optimization, and expanded dataset integration
This comprehensive architecture provides a robust foundation for production deployment while maintaining the flexibility necessary for continuous improvement and domain adaptation. The design successfully balances cost-effectiveness with performance requirements, ensuring the system can serve diverse user communities while evolving through feedback and retraining cycles.
Project: ML Engineering Capstone - Scientific Citation Verification
Deployment Target: HuggingFace Spaces with GPU acceleration
Date: May 27, 2025
- Problem Statement: Automated validation of scientific citations and claims to classify as SUPPORT, REFUTE, or NEI
- Success Criteria: Achieve >60% macro F1 score across test datasets (Current: 66% average F1)
- Target Users: Researchers, students, authors, and academic institutions
- Use Case Validation: Manual verification is time-consuming; automated solution provides efficiency gains
- Functional Requirements:
- Classify claim-evidence pairs with confidence scores
- Handle both direct text input and PDF evidence extraction
- Provide visualization of prediction probabilities
- Performance Requirements:
- Response time <5 seconds per query
- Support for concurrent users (initial target: 10-50 concurrent)
- Reliability Requirements: 99% uptime target with graceful degradation
- Data Sources Identified: SciFact (1,409 claims) and Citation-Integrity (3,063 instances)
- Data Quality Assessment: Completed data wrangling notebooks documenting quality issues
- Label Consistency: Unified labeling scheme implemented in pyvers package
- Data Split Strategy: Train/validation/test splits preserved from original datasets
- Input Validation: Text preprocessing and length limits implemented
- Data Format Standardization: Consistent tokenization using DeBERTa tokenizer
- Evidence Retrieval: BM25S implementation for PDF processing
- Error Handling: Graceful handling of malformed inputs and edge cases
- Model Selection: DeBERTa v3 chosen based on NLI performance and domain fit
- Baseline Comparison: 7 percentage point improvement over MultiVerS baselines
- Architecture Documentation: Model card and documentation on HuggingFace Hub
- Hyperparameter Tuning: Completed systematic tuning using PyTorch Lightning
- Cross-Dataset Evaluation: Tested on both SciFact and Citation-Integrity test sets
- Performance Metrics: Macro F1, AUROC, and per-class metrics calculated
- Zero-Shot Baseline: Base model performance documented for comparison
- Edge Case Testing: Tested with short claims, long evidence, and ambiguous cases
- Model Versioning: Stored on HuggingFace Hub with version tags
- Reproducibility: Training pipeline documented in pyvers package
- Model Size: 400MB checkpoint is acceptable for the deployment target
- Inference Speed: <2 seconds per prediction on GPU instance
- Platform Selection: HuggingFace Spaces chosen for cost-effectiveness and simplicity
- Resource Requirements: Nvidia T4 GPU instance specified and tested
- Scaling Strategy: Automatic scaling provided by HF Spaces
- Backup Strategy: Model artifacts stored on HF Hub with version control
- Web Interface: Gradio app developed and tested (AI4citations)
- API Design: REST API endpoints documented and functional
- Load Balancing: Handled by HF Spaces infrastructure
- Error Handling: Comprehensive error handling and user feedback
- Data Privacy: User inputs not stored unless opted-in for feedback
- API Security: Rate limiting considerations documented
- Model Security: Access controls on model artifacts
- Compliance: No PII or sensitive data handling required
- Metrics Definition: Response time, GPU utilization, memory usage
- Alerting Strategy: Performance degradation thresholds defined
- Logging Implementation: User interaction logging for RLHF
- Dashboard Planning: Key metrics identified for monitoring dashboard
- Drift Detection Strategy: Statistical comparison of input distributions
- Feedback Collection: User rating system for RLHF implementation
- Data Validation: Input format and content validation rules
- Anomaly Detection: Unusual query patterns and response times
- Unit Tests: Core functions tested in pyvers package
- Integration Tests: End-to-end pipeline tested
- User Acceptance Testing: Manual testing of web interface
- Performance Testing: Load testing with concurrent requests
- Regression Testing: Model performance benchmarks established
- A/B Testing Framework: Comparison methodology defined
- Fallback Testing: Base model fallback mechanism tested
- Edge Case Validation: Boundary conditions and error cases tested
- Deployment Pipeline: GitHub integration with HF Spaces configured
- Rollback Strategy: Previous model versions available for quick reversion
- Blue-Green Deployment: Strategy defined for zero-downtime updates
- Release Documentation: Deployment procedures documented
- Retraining Schedule: Quarterly reviews and trigger-based retraining
- Model Updates: Process for incorporating new training data
- Dependency Management: Package versions pinned and documented
- Backup Procedures: Model and configuration backup strategy
- Infrastructure Costs: $6/month GPU + $5/month storage = $11/month base
- Training Costs: $10/month averaged for periodic retraining
- Monitoring Costs: $0-50/month for optional enhanced monitoring (MLFlow)
- Total Budget: $21-71/month ($252-852/year) approved and documented
- Resource Utilization: GPU usage optimized for inference-only workloads
- Storage Optimization: Log rotation and cleanup procedures defined
- Scaling Economics: Cost per user calculated and monitored
- Alternative Assessments: AWS/GCP comparison completed for future scaling
- API Documentation: Gradio auto-generated API docs available
- Architecture Documentation: System design documented in deployment plan
- Runbooks: Operational procedures documented
- Troubleshooting Guide: Common issues and solutions documented
- User Guide: Web interface usage instructions
- API Guide: Integration documentation for developers
- FAQ: Common questions and use cases addressed
- Performance Expectations: Response times and accuracy documented
- Environment Validation: Production environment tested and validated
- Performance Baseline: Initial metrics and benchmarks established
- Monitoring Active: All monitoring and alerting systems operational
- Support Process: Issue escalation and support procedures defined
- Success Metrics: User adoption and performance KPIs defined
- Feedback Collection: User feedback mechanisms active
- Iteration Planning: Enhancement roadmap for next 3-6 months
- Scale Planning: Growth scenarios and scaling triggers identified