This project develops an NLP framework for automated validation of citations and claims, ensuring references accurately support stated information. We build on established datasets and models to classify citation accuracy as SUPPORT, REFUTE, or NEI (Not Enough Information).
- Reproduction of state-of-the-art scientific claim verification models - baselines
- Uses MultiVerS model, based on Longformer
- Development of Python package for model training - pyvers
- Ingestion of multiple data sources with consistent labeling, from both data files and HuggingFace datasets
- Uses HF models pretrained on natural language inference (NLI) datasets to support the claim verification task
- Fine-tunes models using PyTorch Lightning for scalable model training, evaluation, and reporting (a minimal sketch follows this list)
- Deployment of final model to HuggingFace - fine-tuned model
- Web app for end users - AI4citations
- Input a claim and evidence statements to get results
- Barchart visualization of class probabilities
- Choose from pretrained and fine-tuned models
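
To make the training setup concrete, here is a minimal PyTorch Lightning sketch of fine-tuning a Hugging Face classifier on the three labels. This is a generic illustration, not the pyvers API; the checkpoint name and learning rate are placeholder assumptions.

```python
# Generic sketch of Lightning-based fine-tuning for 3-way claim verification
# (SUPPORT, NEI, REFUTE). Not the pyvers API; checkpoint and lr are placeholders.
import lightning.pytorch as pl
import torch
from transformers import AutoModelForSequenceClassification

class ClaimVerifier(pl.LightningModule):
    def __init__(self, model_name="microsoft/deberta-v3-base", lr=2e-5):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=3
        )
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch holds tokenized evidence/claim pairs plus integer labels
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# trainer = pl.Trainer(max_epochs=5)
# trainer.fit(ClaimVerifier(), datamodule=...)  # datamodule supplies tokenized pairs
```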
The model generated in this project achieves a 7 percentage point increase in average macro F1 over the best baseline model fine-tuned on a single dataset:
Macro F1 on test split:

| Model | SciFact | Citation-Integrity | Average |
|-------|---------|--------------------|---------|
| SciFact baseline [1] | 0.81 | 0.15 | 0.48 |
| Citation-Integrity baseline [2] | 0.74 | 0.44 | 0.59 |
| Fine-tuned DeBERTa [3] | 0.84 | 0.47 | 0.66 |
- [1] MultiVerS pretrained on FeverSci and fine-tuned on SciFact by Wadden et al. (2021)
- [2] MultiVerS pretrained on HealthVer and fine-tuned on Citation-Integrity by Sarol et al. (2024)
- [3] DeBERTa v3 pretrained on multiple NLI datasets and fine-tuned on shuffled data from SciFact and Citation-Integrity in this project
All steps of the project, from data exploration and processing to model training and deployment, are recorded in notebooks and blog posts.
- Project Proposal
- Data Wrangling
- Citation-Integrity: Data quality (some claims are very short sentence fragments) and class imbalance (less than 10% of claims are NEI)
- SciFact: Partial test dataset (corrected with data from `scifact_10`; see below) and class imbalance (40% each for NEI and Support, 20% for Refute)
- Data Exploration
- Citation-Integrity: Main topics are cells, cancer, COVID-19, patients, infection, and disease
- SciFact: Main topics are gene expression, cancer, treatment, and infection
- Baselines: MultiVerS model fine-tuned on single datasets
- Reproduction of Citation-Integrity
- Model baselines
- Comparison of starting checkpoints
- eval.py: Metrics calculation module (see the metrics sketch after this list)
- Model Development: DeBERTa fine-tuned on multiple datasets
- Model Deployment: fine-tuned model on HuggingFace and the AI4citations web app
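
The headline numbers in the results table above are macro F1 scores. Below is a minimal sketch of that calculation with scikit-learn; eval.py in this repo may compute additional metrics, and the label sequences shown are illustrative only.

```python
# Sketch of the macro-F1 calculation used for the results table; the example
# predictions below are illustrative, not actual model outputs.
from sklearn.metrics import classification_report, f1_score

labels = ["SUPPORT", "NEI", "REFUTE"]
y_true = ["SUPPORT", "NEI", "REFUTE", "SUPPORT", "NEI", "REFUTE"]
y_pred = ["SUPPORT", "NEI", "SUPPORT", "SUPPORT", "NEI", "REFUTE"]

print(f1_score(y_true, y_pred, labels=labels, average="macro"))
print(classification_report(y_true, y_pred, labels=labels))
```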
The project utilizes two primary datasets, normalized with consistent labeling:
- SciFact: 1,409 scientific claims verified against 5,183 abstracts
  - Source: GitHub | Paper
  - Downloaded from: https://scifact.s3-us-west-2.amazonaws.com/release/latest/data.tar.gz
  - Test fold includes labels and abstract IDs from `scifact_10`
- Citation-Integrity: 3,063 citation instances from biomedical publications
  - Source: GitHub | Paper
  - Downloaded from: https://github.com/ScienceNLP-Lab/Citation-Integrity/ (Google Drive link)
For more details on data format, see MultiVerS data documentation.
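
As a quick orientation to that format, the sketch below reads SciFact-style JSONL files. The directory layout is an assumption; file and field names follow the SciFact release and the MultiVerS data documentation.

```python
# Peek at the claims/corpus JSONL layout used by MultiVerS-style datasets.
# Paths are assumed; adjust to wherever the data archive was extracted.
import json

def read_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

claims = read_jsonl("data/scifact/claims_train.jsonl")
corpus = read_jsonl("data/scifact/corpus.jsonl")

print(claims[0].keys())  # e.g. id, claim, evidence, cited_doc_ids
print(corpus[0].keys())  # e.g. doc_id, title, abstract (list of sentences)
```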
The best approach wasn't always obvious at the beginning.
- The order of sentence pairs in the tokenizer is important (see the sketch after this list)
- Model documentation and papers are not clear about ordering sentence pairs for natural language inference
- Experimenting with pretrained DeBERTa suggests that it was trained with evidence before claim
- Maintaining the same order for fine-tuning and inference gives improved performance
- The pyvers package uses this order consistently, improving reliability of NLI classification
- See zero-shot demonstration in the pyvers README
- Overfitting deep networks isn't necessarily bad
- Fine-tuning pretrained transformer models on small datasets begins overfitting after 1 or 2 epochs
- Nevertheless, continued fine-tuning improves prediction accuracy on test data
- The bias-variance tradeoff in classical ML should be rethought for models with large numbers of parameters
- Empirical discovery in this project: blog post
- "Benign overfitting" in the research literature: blog post
TODOs for future development:
- Handle class imbalance: MultiVerS implements reweighting in the loss function. Can we do the same with DeBERTa? (See the loss-weighting sketch after this list.)
- Data augmentation: Use a library such as TextAttack, TextAugment, or nlpaug to add examples with synonyms or back-translations. This may improve generalizability. (See the augmentation sketch after this list.)
- Low-rank adaptation (LoRA): Can speed up fine-tuning and mitigate overfitting compared to optimizing all parameters in the model. (See the LoRA sketch after this list.)
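
For the class-imbalance TODO, a minimal sketch of class-weighted cross-entropy, analogous to the reweighting MultiVerS applies; the class counts below are hypothetical, not the actual label distribution.

```python
# Class-weighted cross-entropy for a 3-label setup (SUPPORT, NEI, REFUTE).
# Counts are illustrative; replace with the real training-set label counts.
import torch
import torch.nn as nn

class_counts = torch.tensor([700.0, 150.0, 250.0])                 # hypothetical counts
weights = class_counts.sum() / (len(class_counts) * class_counts)  # inverse-frequency weights

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(8, 3)          # model outputs for a batch of 8 pairs
labels = torch.randint(0, 3, (8,))  # gold labels
print(loss_fn(logits, labels).item())
```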
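For the data-augmentation TODO, a small sketch with nlpaug's WordNet synonym augmenter, one of the libraries named above; it requires nlpaug and the NLTK WordNet data, and the example claim is illustrative.

```python
# Synonym-based augmentation of a claim with nlpaug (WordNet source).
import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src="wordnet")
claim = "Vitamin D supplementation reduces the risk of respiratory infection."
print(aug.augment(claim))  # augmented text (returned as a list in recent nlpaug versions)
```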
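For the LoRA TODO, a sketch using the PEFT library to wrap a DeBERTa-v3 classifier so that only low-rank adapter weights are trained; the rank, alpha, and target modules are illustrative choices, not a tested configuration for this project.

```python
# LoRA adapters on a DeBERTa-v3 sequence classifier via PEFT (assumed setup).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=3
)
config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query_proj", "value_proj"],  # attention projections in DeBERTa-v2/v3
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of the base parameters train
```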
I wish to thank my mentor, Divya Vellanki, for valuable advice and encouragement throughout this project.
The Springboard MLE bootcamp curriculum provided the conceptual and practical foundations.
This project builds upon several significant contributions from the research community:
- Citation-Integrity dataset by Sarol et al., 2024
- DeBERTa model by He et al., 2021
- MultiVerS model by Wadden et al., 2021
- SciFact dataset by Wadden et al., 2020
- Longformer model by Beltagy et al., 2020