LLM Fine-tuning and Inference

This repository documents experimental work on fine-tuning large language models and building a multi-model inference workflow for response generation, comparison, and evaluation.
The project sends the same input query to several LLMs and compares their outputs side by side to study response quality and behavior.

Experiments follow a staged workflow, progressing from model fine-tuning to controlled inference and cross-model response evaluation.

Project Motivation

This project explores how different fine-tuned language models respond to the same query and how their outputs can be compared and evaluated in a structured way.
Rather than relying on a single model response, this work emphasizes side-by-side response generation, qualitative evaluation, and simple selection strategies to better understand model behavior.

Project Structure

1. Model Fine-tuning

This stage prepares task-specific language models used for downstream evaluation.

Key steps include:

Loading and preprocessing a benchmark QA dataset
Configuring tokenizers and training parameters
Fine-tuning pretrained QA and generative language models
Saving trained checkpoints for comparative inference

This stage produces multiple fine-tuned models used for response generation.
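
The preprocessing step above can be illustrated with a minimal sketch. It assumes a SQuAD-style record layout (question, context, answer text, character offset), which this README does not confirm; the names QAExample, to_extractive_target, and to_generative_pair are hypothetical and are not taken from LLM_tuning.py. The sketch shows how one record could feed both a QA-oriented (span-extraction) model and a generative model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class QAExample:
    """One record in an assumed SQuAD-style layout (not confirmed by the repo)."""
    question: str
    context: str
    answer_text: str
    answer_start: int  # character offset of the answer inside `context`


def to_extractive_target(ex: QAExample) -> Tuple[int, int]:
    """Return the (start, end) character span a span-extraction model would learn."""
    end = ex.answer_start + len(ex.answer_text)
    # Guard against misaligned annotations before they reach training.
    if ex.context[ex.answer_start:end] != ex.answer_text:
        raise ValueError("answer span does not match the context")
    return ex.answer_start, end


def to_generative_pair(ex: QAExample) -> Tuple[str, str]:
    """Return a (prompt, target) text pair for a generative model."""
    prompt = f"question: {ex.question.strip()} context: {ex.context.strip()}"
    return prompt, ex.answer_text.strip()


# Illustrative record; the offset is computed rather than hard-coded.
context = "The Eiffel Tower is in Paris."
example = QAExample(
    question="Where is the Eiffel Tower?",
    context=context,
    answer_text="Paris",
    answer_start=context.index("Paris"),
)
span = to_extractive_target(example)
prompt, target = to_generative_pair(example)
```

In a real run, the character span would still need to be mapped to token positions by the tokenizer, and the prompt template would have to match whatever format the chosen generative model expects.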

2. Inference and Cross-model Response Evaluation

An inference workflow uses the fine-tuned models to generate and evaluate responses.

Key aspects include:

Submitting the same input question to multiple fine-tuned LLMs
Collecting generated responses from each model
Comparing outputs based on relevance, completeness, and response style
Exploring simple evaluation and selection logic to identify preferred responses

This stage focuses on understanding differences between model outputs rather than optimizing a single response.
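
The fan-out described above can be sketched as follows. This is an illustration under stated assumptions, not the code in LLM_main.py: each model is represented by a plain Python callable that maps a question to a response string, and the two lambdas are stand-ins with canned answers. The helper names collect_responses and side_by_side are hypothetical.

```python
from typing import Callable, Dict

# Assumption: every model is wrapped as a function from question text to answer text.
ModelFn = Callable[[str], str]


def collect_responses(question: str, models: Dict[str, ModelFn]) -> Dict[str, str]:
    """Send the same question to every model and collect the stripped responses."""
    return {name: generate(question).strip() for name, generate in models.items()}


def side_by_side(question: str, responses: Dict[str, str]) -> str:
    """Format the responses as aligned rows so they are easy to compare."""
    width = max(len(name) for name in responses)
    rows = [f"Q: {question}"]
    for name, text in responses.items():
        rows.append(f"{name.ljust(width)} | {text}")
    return "\n".join(rows)


# Stand-in models with canned outputs; real wrappers would call the fine-tuned checkpoints.
models = {
    "qa_model": lambda q: "Paris",
    "generative_model": lambda q: "The Eiffel Tower is located in Paris, France. ",
}

question = "Where is the Eiffel Tower located?"
responses = collect_responses(question, models)
report = side_by_side(question, responses)
print(report)
```

Keeping the models behind a single callable interface means the comparison code does not need to know whether a response came from a QA-oriented or a generative model.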

Tech Stack

Python
PyTorch
Hugging Face Transformers
Pretrained QA and generative language models

Repository Contents

LLM_tuning.py — model fine-tuning workflows for multiple LLM variants
LLM_main.py — inference pipeline for multi-model response generation and evaluation
README.md — project documentation

Observations and Insights

Different fine-tuned models produce notably different responses to the same query.
QA-oriented and generative models vary in structure, verbosity, and factual focus.
Side-by-side comparison provides clearer insight into model strengths and limitations than isolated outputs.
Explicit evaluation criteria are necessary for meaningful response selection.
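
One way to make the criteria explicit is a small scoring function. The sketch below is a hypothetical heuristic, not the selection logic used in this repository: relevance is approximated by the overlap between question terms and response terms, completeness by response length up to an assumed target, and the weights, stopword list, and target length are arbitrary illustrative values.

```python
import re
from typing import Dict, List

# Assumed, deliberately small stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "of", "in", "to", "and",
             "what", "who", "where", "when", "how"}


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(question: str, response: str) -> float:
    """Fraction of non-stopword question terms that also appear in the response."""
    q_terms = {t for t in tokens(question) if t not in STOPWORDS}
    if not q_terms:
        return 0.0
    return len(q_terms & set(tokens(response))) / len(q_terms)


def completeness(response: str, target_len: int = 20) -> float:
    """Response length relative to an assumed target length, capped at 1.0."""
    return min(len(tokens(response)) / target_len, 1.0)


def score(question: str, response: str, w_rel: float = 0.7, w_comp: float = 0.3) -> float:
    """Weighted sum of the two criteria; the weights are illustrative."""
    return w_rel * relevance(question, response) + w_comp * completeness(response)


def select_preferred(question: str, responses: Dict[str, str]) -> str:
    """Return the name of the model whose response scores highest."""
    return max(responses, key=lambda name: score(question, responses[name]))


question = "Where is the Eiffel Tower located?"
responses = {
    "qa_model": "Paris",
    "generative_model": "The Eiffel Tower is located in Paris, France.",
}
preferred = select_preferred(question, responses)
```

Note that a lexical-overlap heuristic like this favors verbose answers that repeat the question and penalizes terse extractive answers such as "Paris", even when they are correct. That bias is itself a reason to state the criteria explicitly and to review the scores alongside the side-by-side outputs.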

Notes

This repository represents an experimental, portfolio-oriented project focused on multi-model LLM evaluation.
The emphasis is on comparing and understanding model responses through structured inference and evaluation rather than deploying a single production system.
