ML Engineering Capstone Project

This project develops an NLP framework for automated validation of citations and claims, ensuring references accurately support stated information. We build on established datasets and models to classify citation accuracy as SUPPORT, REFUTE, or NEI (Not Enough Information).

MLE Capstone Project Diagram

Highlights

  • Reproduction of state-of-the-art scientific claim verification baselines (baselines)
    • Uses the MultiVerS model, which is based on Longformer
  • Development of a Python package for model training (pyvers)
    • Ingests multiple data sources with consistent labeling, from both data files and HuggingFace datasets
    • Uses HuggingFace models pretrained on natural language inference (NLI) datasets to support the claim verification task
    • Fine-tunes models with PyTorch Lightning for scalable training, evaluation, and reporting (a minimal training sketch follows this list)
  • Deployment of the final model to HuggingFace (fine-tuned model)
  • Web app for end users (AI4citations)
    • Enter a claim and evidence statements to get a prediction
    • Bar chart visualization of class probabilities
    • Choice of pretrained and fine-tuned models
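
As a rough illustration of the training setup (this is not the pyvers API), a LightningModule wrapping a HuggingFace sequence classifier could look like the sketch below; the checkpoint name, learning rate, and epoch count are placeholders.

```python
# Minimal sketch (not the pyvers implementation) of fine-tuning a
# transformer for 3-way claim verification with PyTorch Lightning.
import torch
import lightning.pytorch as pl
from transformers import AutoModelForSequenceClassification

class ClaimVerifier(pl.LightningModule):
    def __init__(self, checkpoint="microsoft/deberta-v3-base", lr=2e-5):
        super().__init__()
        # Three labels: SUPPORT, NEI, REFUTE
        self.model = AutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=3
        )
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch contains input_ids, attention_mask, and labels from the tokenizer
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# Example usage (train_loader is a DataLoader of tokenized claim/evidence pairs):
# trainer = pl.Trainer(max_epochs=5)
# trainer.fit(ClaimVerifier(), train_dataloaders=train_loader)
```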

The model generated in this project achieves a 7 percentage point increase in average F1 over the best baseline model fine-tuned on a single dataset:

Macro F1 on test split

Model                            SciFact  Citation-Integrity  Average
SciFact baseline [1]             0.81     0.15                0.48
Citation-Integrity baseline [2]  0.74     0.44                0.59
Fine-tuned DeBERTa [3]           0.84     0.47                0.66

Milestones

All steps of the project, from data exploration and processing to model training and deployment, are recorded in notebooks and blog posts.

Data Sources

The project utilizes two primary datasets, normalized with consistent labeling (an illustrative label-mapping sketch follows below):

  • SciFact
  • Citation-Integrity

For more details on the data format, see the MultiVerS data documentation.
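
As an illustration of the normalization step, a simple mapping onto the three classes might look like this; the source label strings shown here (e.g. CONTRADICT, NOT_ENOUGH_INFO) are assumptions for the example, not taken from the project code.

```python
# Illustrative sketch of label normalization across datasets.
# The dataset-specific label strings below are assumed, not verified.
LABEL_MAP = {
    "SUPPORT": "SUPPORT",
    "SUPPORTS": "SUPPORT",
    "CONTRADICT": "REFUTE",
    "REFUTE": "REFUTE",
    "REFUTES": "REFUTE",
    "NOT_ENOUGH_INFO": "NEI",
    "NEI": "NEI",
}

def normalize_label(raw_label: str) -> str:
    """Map a dataset-specific label onto SUPPORT / REFUTE / NEI."""
    return LABEL_MAP[raw_label.upper()]
```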

Managing Uncertainty

The best approach wasn't always obvious at the beginning.

  • The order of the sentences within a pair passed to the tokenizer matters
    • Model documentation and papers are not clear about how to order sentence pairs for natural language inference
    • Experiments with pretrained DeBERTa suggest that it was trained with the evidence before the claim
    • Maintaining the same order during fine-tuning and inference improves performance
    • The pyvers package uses this order consistently, improving the reliability of NLI classification (a tokenizer sketch follows this list)
      • See the zero-shot demonstration in the pyvers README
  • Overfitting deep networks isn't necessarily bad
    • Fine-tuning pretrained transformer models on small datasets begins to overfit after 1 or 2 epochs
    • Nevertheless, continued fine-tuning improves prediction accuracy on test data
    • The bias-variance tradeoff in classical ML should be rethought for models with large numbers of parameters
      • Empirical discovery in this project: blog post
      • "Benign overfitting" in the research literature: blog post

Looking Forward

TODOs for future development.

  • Handle class imbalance: MultiVerS implements reweighting in the loss function. Can we do the same with DeBERTa? (A sketch follows this list.)
  • Data augmentation: Use a library such as TextAttack, TextAugment, or nlpaug to add examples with synonyms or back-translations. This may improve generalizability.
  • Low-rank adaptation (LoRA): Could speed up fine-tuning and mitigate overfitting compared with optimizing all parameters in the model.
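
One possible way to mirror the MultiVerS reweighting with DeBERTa is to bypass the model's built-in loss and apply a class-weighted cross-entropy to the logits; the weights below are placeholders, not values from this project.

```python
# Sketch: class-weighted cross-entropy applied to DeBERTa logits.
# The weight values are placeholders (e.g. inverse class frequencies).
import torch
from torch.nn import CrossEntropyLoss

class_weights = torch.tensor([1.0, 2.0, 1.5])  # SUPPORT, NEI, REFUTE (example)
loss_fn = CrossEntropyLoss(weight=class_weights)

def weighted_loss(model, batch):
    # Compute logits ourselves, then apply the weighted loss instead of
    # the unweighted loss the model would compute when given labels.
    labels = batch.pop("labels")
    logits = model(**batch).logits  # shape: (batch_size, 3)
    return loss_fn(logits, labels)
```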

Acknowledgments

I wish to thank my mentor, Divya Vellanki, for giving valuable advice and encouragement throughout this project.

The Springboard MLE bootcamp curriculum provided the conceptual and practical foundations.

This project builds upon several significant contributions from the research community, including SciFact, Citation-Integrity, MultiVerS, Longformer, and DeBERTa.
