This repository documents experimental work on fine-tuning large language models and building a multi-model inference workflow for response generation, comparison, and evaluation.
The project focuses on generating answers to the same input query using different LLMs and systematically comparing their outputs to study response quality, behavior, and suitability.
Experiments follow a staged workflow, progressing from model fine-tuning to controlled inference and cross-model response evaluation.
The goal of this project is to explore how different fine-tuned language models respond to the same query and how their outputs can be compared and evaluated in a structured manner.
Rather than relying on a single model response, this work emphasizes side-by-side response generation, qualitative evaluation, and simple selection strategies to better understand model behavior.
The fine-tuning stage prepares the task-specific language models used for downstream evaluation.
Key steps include:
- Loading and preprocessing a benchmark QA dataset
- Configuring tokenizers and training parameters
- Fine-tuning pretrained QA and generative language models
- Saving trained checkpoints for comparative inference
This stage produces multiple fine-tuned models used for response generation.
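The preprocessing step above can be sketched as follows. This is a minimal illustration, not the code in LLM_tuning.py. It assumes SQuAD-style records (`question`, `context`, and `answers` holding `text` and `answer_start`), which the project does not confirm. The function names, the sample record, and the prompt template are all hypothetical. The sketch shows why one dataset needs two shapes: span-prediction QA models train on answer positions, while generative models train on a prompt and a target string.

```python
from typing import Any, Dict

# Hypothetical sample in an assumed SQuAD-style layout.
SAMPLE: Dict[str, Any] = {
    "question": "Who wrote Hamlet?",
    "context": "Hamlet was written by William Shakespeare.",
    "answers": {"text": ["William Shakespeare"], "answer_start": [22]},
}


def to_extractive_example(record: Dict[str, Any]) -> Dict[str, Any]:
    """Build a span-prediction example from the first gold answer."""
    answer = record["answers"]["text"][0]
    start = record["answers"]["answer_start"][0]
    end = start + len(answer)
    # Reject records whose annotated offset does not match the context.
    if record["context"][start:end] != answer:
        raise ValueError("answer span does not match the context")
    return {
        "question": record["question"].strip(),
        "context": record["context"],
        "start_char": start,
        "end_char": end,
    }


def to_generative_example(record: Dict[str, Any]) -> Dict[str, str]:
    """Build a prompt/target pair for a generative model (assumed template)."""
    prompt = (
        f"question: {record['question'].strip()} "
        f"context: {record['context'].strip()}"
    )
    return {"prompt": prompt, "target": record["answers"]["text"][0]}


if __name__ == "__main__":
    print(to_extractive_example(SAMPLE))
    print(to_generative_example(SAMPLE))
```

In a real run, the character offsets would still need to be mapped to token positions by the model's tokenizer before training. That step is omitted here because it depends on the tokenizer in use.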
The inference stage uses the fine-tuned models to generate and evaluate responses to a shared set of queries.
Key aspects include:
- Submitting the same input question to multiple fine-tuned LLMs
- Collecting generated responses from each model
- Comparing outputs based on relevance, completeness, and response style
- Exploring simple evaluation and selection logic to identify preferred responses
This stage focuses on understanding differences between model outputs rather than optimizing a single response.
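The workflow described above can be sketched in a few functions. This is an illustrative sketch under stated assumptions, not the logic in LLM_main.py. Each model is represented by a plain callable that maps a question to an answer string. The two stub models, the scoring heuristics (question-term overlap as a proxy for relevance, capped length as a proxy for completeness), the equal weights, and the `target_len` parameter are all hypothetical choices made for the example.

```python
import re
from typing import Callable, Dict, List, Tuple

# A "model" here is any callable that maps a question to an answer string.
Model = Callable[[str], str]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def generate_all(question: str, models: Dict[str, Model]) -> Dict[str, str]:
    """Submit the same question to every model and collect the responses."""
    return {name: model(question) for name, model in models.items()}


def score_response(question: str, response: str, target_len: int = 12) -> Dict[str, float]:
    """Score one response with two crude, assumed heuristics."""
    q_tokens = set(tokenize(question))
    r_tokens = tokenize(response)
    if not q_tokens or not r_tokens:
        return {"relevance": 0.0, "completeness": 0.0, "total": 0.0}
    # Relevance proxy: share of question terms that reappear in the response.
    relevance = len(q_tokens & set(r_tokens)) / len(q_tokens)
    # Completeness proxy: response length relative to a target, capped at 1.
    completeness = min(len(r_tokens) / target_len, 1.0)
    total = 0.5 * relevance + 0.5 * completeness
    return {"relevance": relevance, "completeness": completeness, "total": total}


def select_preferred(
    question: str, responses: Dict[str, str]
) -> Tuple[str, Dict[str, Dict[str, float]]]:
    """Return the name of the highest-scoring model and all per-model scores."""
    scores = {name: score_response(question, text) for name, text in responses.items()}
    best = max(scores, key=lambda name: scores[name]["total"])
    return best, scores


if __name__ == "__main__":
    # Stub models standing in for fine-tuned checkpoints.
    models: Dict[str, Model] = {
        "qa_model": lambda q: "Paris",
        "generative_model": lambda q: "The capital of France is Paris, a city on the Seine.",
    }
    question = "What is the capital of France?"
    responses = generate_all(question, models)
    best, scores = select_preferred(question, responses)
    for name, text in responses.items():
        print(f"{name}: {text!r} -> {scores[name]}")
    print("preferred:", best)
```

These heuristics reward longer answers that echo the question, so they penalize a short but correct extractive answer. That bias is one reason to treat any such score as a starting point for comparison rather than a measure of correctness. In practice each callable could wrap a Hugging Face Transformers pipeline loaded from a saved checkpoint.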
Tools and libraries used:
- Python
- PyTorch
- Hugging Face Transformers
- Pretrained QA and generative language models
Repository files:
- LLM_tuning.py: model fine-tuning workflows for multiple LLM variants
- LLM_main.py: inference pipeline for multi-model response generation and evaluation
- README.md: project documentation
Key observations:
- Different fine-tuned models produce notably different responses to the same query.
- QA-oriented and generative models vary in structure, verbosity, and factual focus.
- Side-by-side comparison provides clearer insight into model strengths and limitations than isolated outputs.
- Explicit evaluation criteria are necessary for meaningful response selection.
This repository represents an experimental, portfolio-oriented project focused on multi-model LLM evaluation.
The emphasis is on comparing and understanding model responses through structured inference and evaluation rather than deploying a single production system.