This project develops an NLP framework for automated validation of citations and claims, ensuring references accurately support stated information. We build on established datasets and models to classify citation accuracy as SUPPORT, REFUTE, or NEI (Not Enough Information).
- Reproduction of state-of-the-art scientific claim verification models - baselines
- Uses MultiVerS model, based on Longformer
- Development of Python package for model training - pyvers
- Ingestion of multiple data sources with consistent labeling, from both data files and HuggingFace datasets
- Uses HF models pretrained on natural language inference (NLI) datasets to support the claim verification task
- Fine-tunes models using PyTorch Lightning for scalable model training, evaluation, and reporting (a minimal sketch follows this list)
- Deployment of final model to HuggingFace - fine-tuned model
- Web app for end users - AI4citations
- Input a claim and evidence statements to get results
- Barchart visualization of class probabilities
- Choose from pretrained and fine-tuned models
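
To make the training setup concrete, here is a minimal PyTorch Lightning sketch of fine-tuning a Hugging Face classifier on the three labels. This is a generic illustration, not the pyvers API; the checkpoint name and learning rate are placeholder assumptions.

```python
# Generic sketch of Lightning-based fine-tuning for 3-way claim verification
# (SUPPORT, NEI, REFUTE). Not the pyvers API; checkpoint and lr are placeholders.
import lightning.pytorch as pl
import torch
from transformers import AutoModelForSequenceClassification

class ClaimVerifier(pl.LightningModule):
    def __init__(self, model_name="microsoft/deberta-v3-base", lr=2e-5):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=3
        )
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch holds tokenized evidence/claim pairs plus integer labels
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# trainer = pl.Trainer(max_epochs=5)
# trainer.fit(ClaimVerifier(), datamodule=...)  # datamodule supplies tokenized pairs
```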
The model generated in this project achieves a 7 percentage point increase in average macro F1 over the best baseline model fine-tuned on a single dataset:
Macro F1 on test split:

| Model | SciFact | Citation-Integrity | Average |
|-------|---------|--------------------|---------|
| SciFact baseline [1] | 0.81 | 0.15 | 0.48 |
| Citation-Integrity baseline [2] | 0.74 | 0.44 | 0.59 |
| Fine-tuned DeBERTa [3] | 0.84 | 0.47 | 0.66 |
- [1] MultiVerS pretrained on FeverSci and fine-tuned on SciFact by Wadden et al. (2021)
- [2] MultiVerS pretrained on HealthVer and fine-tuned on Citation-Integrity by Sarol et al. (2024)
- [3] DeBERTa v3 pretrained on multiple NLI datasets and fine-tuned on shuffled data from SciFact and Citation-Integrity in this project
All steps of the project, from data exploration and processing to model training and deployment, are recorded in notebooks and blog posts.
- Project Proposal
- Data Wrangling
- Citation-Integrity: Data quality (some claims are very short sentence fragments) and class imbalance (less than 10% of claims are NEI)
- SciFact: Partial test dataset (corrected with data from `scifact_10`; see below) and class imbalance (40% each for NEI and Support, 20% for Refute)
- Data Exploration
- Citation-Integrity: Main topics are cells, cancer, COVID-19, patients, infection, and disease
- SciFact: Main topics are gene expression, cancer, treatment, and infection
- Baselines: MultiVerS model fine-tuned on single datasets
- Reproduction of Citation-Integrity
- Model baselines
- Comparison of starting checkpoints
- eval.py: Metrics calculation module (see the metrics sketch after this list)
- Model Development: DeBERTa fine-tuned on multiple datasets
- Model Deployment: fine-tuned model on HuggingFace and the AI4citations web app
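
The headline numbers in the results table above are macro F1 scores. Below is a minimal sketch of that calculation with scikit-learn; eval.py in this repo may compute additional metrics, and the label sequences shown are illustrative only.

```python
# Sketch of the macro-F1 calculation used for the results table; the example
# predictions below are illustrative, not actual model outputs.
from sklearn.metrics import classification_report, f1_score

labels = ["SUPPORT", "NEI", "REFUTE"]
y_true = ["SUPPORT", "NEI", "REFUTE", "SUPPORT", "NEI", "REFUTE"]
y_pred = ["SUPPORT", "NEI", "SUPPORT", "SUPPORT", "NEI", "REFUTE"]

print(f1_score(y_true, y_pred, labels=labels, average="macro"))
print(classification_report(y_true, y_pred, labels=labels))
```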
The project utilizes two primary datasets, normalized with consistent labeling:
- SciFact: 1,409 scientific claims verified against 5,183 abstracts
  - Source: GitHub | Paper
  - Downloaded from: https://scifact.s3-us-west-2.amazonaws.com/release/latest/data.tar.gz
  - Test fold includes labels and abstract IDs from `scifact_10`
- Citation-Integrity: 3,063 citation instances from biomedical publications
  - Source: GitHub | Paper
  - Downloaded from: https://github.com/ScienceNLP-Lab/Citation-Integrity/ (Google Drive link)
For more details on data format, see MultiVerS data documentation.
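
As a quick orientation to that format, the sketch below reads SciFact-style JSONL files. The directory layout is an assumption; file and field names follow the SciFact release and the MultiVerS data documentation.

```python
# Peek at the claims/corpus JSONL layout used by MultiVerS-style datasets.
# Paths are assumed; adjust to wherever the data archive was extracted.
import json

def read_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

claims = read_jsonl("data/scifact/claims_train.jsonl")
corpus = read_jsonl("data/scifact/corpus.jsonl")

print(claims[0].keys())  # e.g. id, claim, evidence, cited_doc_ids
print(corpus[0].keys())  # e.g. doc_id, title, abstract (list of sentences)
```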
The best approach wasn't always obvious at the beginning.
- The order of sentence pairs in the tokenizer is important (see the sketch after this list)
- Model documentation and papers are not clear about ordering sentence pairs for natural language inference
- Experimenting with pretrained DeBERTa suggests that it was trained with evidence before claim
- Maintaining the same order for fine-tuning and inference gives improved performance
- The pyvers package uses this order consistently, improving reliability of NLI classification
- See zero-shot demonstration in the pyvers README
- Overfitting deep networks isn't necessarily bad
- Fine-tuning pretrained transformer models on small datasets begins overfitting after 1 or 2 epochs
- Nevertheless, continued fine-tuning improves prediction accuracy on test data
- The bias-variance tradeoff in classical ML should be rethought for models with large numbers of parameters
- Empirical discovery in this project: blog post
- "Benign overfitting" in the research literature: blog post
TODOs for future development:
- Handle class imbalance: MultiVerS implements reweighting in the loss function. Can we do the same with DeBERTa? (See the loss-weighting sketch after this list.)
- Data augmentation: Use a library such as TextAttack, TextAugment, or nlpaug to add examples with synonyms or back-translations. This may improve generalizability. (See the augmentation sketch after this list.)
- Low-rank adaptation (LoRA): Can speed up fine-tuning and mitigate overfitting compared to optimizing all parameters in the model. (See the LoRA sketch after this list.)
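
For the class-imbalance TODO, a minimal sketch of class-weighted cross-entropy, analogous to the reweighting MultiVerS applies; the class counts below are hypothetical, not the actual label distribution.

```python
# Class-weighted cross-entropy for a 3-label setup (SUPPORT, NEI, REFUTE).
# Counts are illustrative; replace with the real training-set label counts.
import torch
import torch.nn as nn

class_counts = torch.tensor([700.0, 150.0, 250.0])                 # hypothetical counts
weights = class_counts.sum() / (len(class_counts) * class_counts)  # inverse-frequency weights

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(8, 3)          # model outputs for a batch of 8 pairs
labels = torch.randint(0, 3, (8,))  # gold labels
print(loss_fn(logits, labels).item())
```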
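For the data-augmentation TODO, a small sketch with nlpaug's WordNet synonym augmenter, one of the libraries named above; it requires nlpaug and the NLTK WordNet data, and the example claim is illustrative.

```python
# Synonym-based augmentation of a claim with nlpaug (WordNet source).
import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src="wordnet")
claim = "Vitamin D supplementation reduces the risk of respiratory infection."
print(aug.augment(claim))  # augmented text (returned as a list in recent nlpaug versions)
```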
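For the LoRA TODO, a sketch using the PEFT library to wrap a DeBERTa-v3 classifier so that only low-rank adapter weights are trained; the rank, alpha, and target modules are illustrative choices, not a tested configuration for this project.

```python
# LoRA adapters on a DeBERTa-v3 sequence classifier via PEFT (assumed setup).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=3
)
config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query_proj", "value_proj"],  # attention projections in DeBERTa-v2/v3
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of the base parameters train
```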
I wish to thank my mentor, Divya Vellanki, for valuable advice and encouragement throughout this project.
The Springboard MLE bootcamp curriculum provided the conceptual and practical foundations.
This project builds upon several significant contributions from the research community:
- Citation-Integrity dataset by Sarol et al., 2024
- DeBERTa model by He et al., 2021
- MultiVerS model by Wadden et al., 2021
- SciFact dataset by Wadden et al., 2020
- Longformer model by Beltagy et al., 2020