This repository provides the implementation of "Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation" (LaViC), published at KDD 2025. The process consists of three steps:
- Image Crawling (crawl_images.py): Crawls images before training LaViC.
- Visual Knowledge Self-Distillation (knowledge_distillation.py): Compresses each product image’s patch embeddings into a small set of [CLS]-positioned embeddings (one per sub-image), applying LoRA to the vision module.
- Recommendation Prompt Tuning (prompt_tuning.py): Fine-tunes LLaVA on a recommendation task by applying LoRA to the large language model, using 5 [CLS]-positioned tokens per item in candidate-based conversational recommendation.
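To see why compressing each image matters for candidate-based recommendation, the arithmetic below compares the visual token budget with and without distillation. It is a rough sketch, not code from this repository: the 5 tokens per item come from the description above, while the 576 patches per sub-image and the 10 candidates per conversation are assumptions chosen for illustration.

```python
# Rough visual-token budget per conversation (illustrative numbers only).
SUB_IMAGES_PER_ITEM = 5       # from the README: 5 [CLS]-positioned tokens per item
PATCHES_PER_SUB_IMAGE = 576   # assumption: 24 x 24 patches per sub-image
NUM_CANDIDATES = 10           # assumption: candidates shown per conversation


def visual_tokens(num_candidates: int, tokens_per_sub_image: int) -> int:
    """Total visual tokens when every candidate contributes all of its sub-images."""
    return num_candidates * SUB_IMAGES_PER_ITEM * tokens_per_sub_image


# Feeding every patch embedding to the language model.
uncompressed = visual_tokens(NUM_CANDIDATES, PATCHES_PER_SUB_IMAGE)
# Feeding one [CLS]-positioned embedding per sub-image.
distilled = visual_tokens(NUM_CANDIDATES, 1)

print(f"patch tokens: {uncompressed}")
print(f"[CLS] tokens: {distilled}")
```

Under these assumptions, the uncompressed prompt would carry 28,800 visual tokens against 50 for the distilled one, which is the gap the self-distillation step is meant to close.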
LaViC/
├── data/
│   ├── all_beauty/
│   │   ├── train.jsonl
│   │   ├── valid.jsonl
│   │   └── test.jsonl
│   ├── amazon_fashion/
│   │   ├── train.jsonl
│   │   ├── valid.jsonl
│   │   └── test.jsonl
│   ├── amazon_home/
│   │   ├── train.jsonl
│   │   ├── valid.jsonl
│   │   └── test.jsonl
│   ├── train_images/
│   ├── valid_images/
│   ├── item2meta_train.json
│   └── item2meta_valid.jsonl
└── src/
    ├── crawl_images.py
    ├── knowledge_distillation.py
    └── prompt_tuning.py
- data/:
  - Subdirectories for each domain (e.g., all_beauty, amazon_fashion, amazon_home), each containing train.jsonl, valid.jsonl, and test.jsonl with conversational data and associated ground-truth items.
  - train_images/ and valid_images/: Directories holding the product images (not included by default; download them with the provided crawl_images.py).
  - item2meta_train.json (obtained by unzipping item2meta_train.json.zip) and item2meta_valid.jsonl: item2meta_train.json is a dictionary mapping item IDs (ASINs) to metadata (e.g., title, categories, features, description, images). We also provide image descriptions generated by LLaVA-v1.6. item2meta_valid.jsonl is a line-by-line JSON file that similarly describes items for validation, including a title, an image_name, and a pre-generated image description from LLaVA-v1.6.
- src/:
  - crawl_images.py: Downloads product images from the URLs in item2meta_train.json and item2meta_valid.jsonl into train_images/ and valid_images/, respectively.
  - knowledge_distillation.py: Distills image knowledge into [CLS] embeddings.
  - prompt_tuning.py: Fine-tunes the language model for conversation-based recommendation.
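The two metadata files use different layouts (one JSON dictionary versus one JSON object per line), so they need different loaders. The sketch below shows one way to read both and to list validation images that have not been crawled yet. The function names are hypothetical, and it assumes that image_name is a file name relative to valid_images/, which this README does not state explicitly.

```python
import json
from pathlib import Path


def load_train_meta(path):
    """item2meta_train.json: a single JSON dictionary keyed by item ID (ASIN)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_valid_meta(path):
    """item2meta_valid.jsonl: one JSON object per line; blank lines are skipped."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def missing_images(records, image_dir):
    """Return image_name values with no matching file in image_dir.

    Assumes image_name is a file name relative to image_dir.
    """
    image_dir = Path(image_dir)
    names = [r["image_name"] for r in records if "image_name" in r]
    return [name for name in names if not (image_dir / name).is_file()]
```

For example, calling `missing_images(load_valid_meta("data/item2meta_valid.jsonl"), "data/valid_images")` from the LaViC/ root after crawling should return an empty list if every validation image was downloaded.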
Install required libraries:
cd LaViC
pip install -r requirements.txt
Populate train_images/ and valid_images/:
cd src
python crawl_images.py
Run visual knowledge self-distillation:
python knowledge_distillation.py \
--model_name llava-hf/llava-v1.6-mistral-7b-hf \
--train_data ../data/item2meta_train.json \
--val_data ../data/item2meta_valid.jsonl \
--train_images_dir ../data/train_images \
--val_images_dir ../data/valid_images \
--output_dir ./out_distilled \
--lr 5e-5 --weight_decay 1e-5 --num_epochs 5 --batch_size 4
- Key Arguments:
  - --model_name: The base LLaVA model to distill (e.g., llava-hf/llava-v1.6-mistral-7b-hf).
  - --train_data, --val_data: JSON or JSONL paths with product info and descriptions.
  - --train_images_dir, --val_images_dir: Image directories.
  - --output_dir: Where to save checkpoints and the final "vision_lora_adapter_best".
  - --lr, --weight_decay, --num_epochs, --batch_size: Basic training hyperparameters.
Run recommendation prompt tuning:
python prompt_tuning.py \
--model_dir ./out_distilled/vision_lora_adapter_best \
--candidate_type candidates_st \
--finetune_output_dir ./out_finetuned \
--max_length 2048 \
--batch_size 1 \
--lr 5e-5 --weight_decay 1e-5 \
--num_epochs 1 \
--item_meta_path ../data/item2meta_train.json \
--image_dir ../data/train_images \
--category all_beauty
- Key Arguments:
  - --model_dir: Path to the distilled model from the previous distillation step.
  - --candidate_type: Which key in your conversation JSON indicates the candidate items (e.g., candidates_st or candidates_gpt_large).
  - --category: The domain (subdirectory) for your data (e.g., all_beauty, amazon_fashion, or amazon_home).
  - --item_meta_path: JSON with item metadata (titles, etc.).
  - --image_dir: Directory containing product images.
  - --finetune_output_dir: Where to save the final LoRA adapter for the LM side.
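How --category and --candidate_type select data can be sketched as follows. This is an illustration, not the logic of prompt_tuning.py itself: the data/<category>/<split>.jsonl layout and the candidate key names come from this README, while the helper names, the sample record, and the assumption that candidates are stored as a list of item IDs are hypothetical.

```python
import json
from pathlib import Path


def split_path(data_root, category, split):
    """Build the path to <data_root>/<category>/<split>.jsonl."""
    return Path(data_root) / category / f"{split}.jsonl"


def candidates_of(record, candidate_type):
    """Return the candidates stored under the chosen key of one conversation record."""
    if candidate_type not in record:
        raise KeyError(f"record has no '{candidate_type}' field")
    return record[candidate_type]


# Hypothetical record: real files may carry other fields next to the candidate keys,
# and the item IDs here are placeholders.
line = json.dumps({"candidates_st": ["B0001", "B0002"], "candidates_gpt_large": ["B0003"]})
record = json.loads(line)

print(split_path("../data", "all_beauty", "train"))
print(candidates_of(record, "candidates_st"))
```

Failing loudly on a missing key is a deliberate choice in this sketch: a mistyped --candidate_type would otherwise produce empty candidate lists without any visible error.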
To cite LaViC in your work, please use the following BibTeX entry:
@inproceedings{jeon25adapting,
title = "Adapting large vision-language models to visually-aware conversational recommendation",
author = "Hyunsik Jeon and Satoshi Koide and Yu Wang and Zhankui He and Julian McAuley",
year = "2025",
booktitle = "KDD"
}