LaViC: Large Vision-Language Conversational Recommendation Framework

This repository provides the implementation of "Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation" (LaViC), published at KDD 2025. The pipeline consists of the following steps:

Image Crawling (crawl_images.py): Crawls images before training LaViC.
Visual Knowledge Self-Distillation (knowledge_distillation.py): Compresses each product image’s patch embeddings into a small set of [CLS]-positioned embeddings (one per sub-image), applying LoRA to the vision module.
Recommendation Prompt Tuning (prompt_tuning.py): Fine-tunes LLaVA on a recommendation task by applying LoRA to the large language model, using 5 [CLS]-positioned tokens per item in candidate-based conversational recommendation.
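
The compression idea described above (one [CLS]-positioned embedding kept per sub-image, 5 per item) can be illustrated with a toy sketch. This is not the repository's code: it assumes each sub-image produces a sequence of token embeddings whose first position is the [CLS] slot (a common ViT convention), and it uses plain Python lists instead of tensors. The function name select_cls_embeddings and all shapes are hypothetical.

```python
from typing import List

Embedding = List[float]


def select_cls_embeddings(
    sub_image_tokens: List[List[Embedding]], cls_index: int = 0
) -> List[Embedding]:
    """Keep only the [CLS]-positioned embedding of every sub-image."""
    return [tokens[cls_index] for tokens in sub_image_tokens]


# Toy input (made up): 5 sub-images, each with 1 [CLS] slot + 3 patch tokens.
# Every embedding has dimension 2 and is filled with [sub_image_idx, token_idx].
NUM_SUB_IMAGES, NUM_PATCHES = 5, 3
item_tokens = [
    [[float(s), float(t)] for t in range(1 + NUM_PATCHES)]
    for s in range(NUM_SUB_IMAGES)
]

compressed = select_cls_embeddings(item_tokens)
total_tokens = sum(len(tokens) for tokens in item_tokens)
print(f"{total_tokens} token embeddings -> {len(compressed)} per item")
```

The selection shown here is only the final read-out; per the description above, it is the self-distillation step (with LoRA on the vision module) that trains these positions to carry the information of the patch embeddings.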

Repository Structure

LaViC/
  ├── data/
  │   ├── all_beauty/
  │   │   ├── train.jsonl
  │   │   ├── valid.jsonl
  │   │   └── test.jsonl
  │   ├── amazon_fashion/
  │   │   ├── train.jsonl
  │   │   ├── valid.jsonl
  │   │   └── test.jsonl
  │   ├── amazon_home/
  │   │   ├── train.jsonl
  │   │   ├── valid.jsonl
  │   │   └── test.jsonl
  │   ├── train_images/
  │   ├── valid_images/
  │   ├── item2meta_train.json
  │   └── item2meta_valid.jsonl
  └── src/
      ├── crawl_images.py
      ├── knowledge_distillation.py
      └── prompt_tuning.py

data/:
- Subdirectories for each domain (e.g., all_beauty, amazon_fashion, amazon_home), each containing train.jsonl, valid.jsonl, and test.jsonl with conversational data and the associated ground-truth items.
- train_images/ and valid_images/: directories holding the product images (not included by default; download them with the provided crawl_images.py).
- item2meta_train.json (obtained by unzipping item2meta_train.json.zip) and item2meta_valid.jsonl:
  - item2meta_train.json is a dictionary mapping item IDs (ASINs) to metadata (title, categories, features, description, images, and so on). We also provide image descriptions generated by LLaVA-v1.6.
  - item2meta_valid.jsonl is a JSON Lines file (one JSON object per line) that similarly describes items for validation, including a title, an image_name, and a pre-generated image description by LLaVA-v1.6.
src/:
- crawl_images.py: Downloads product images from URLs in item2meta_train.json and item2meta_valid.jsonl into train_images/ and valid_images/, respectively.
- knowledge_distillation.py: Distills image knowledge into [CLS] embeddings.
- prompt_tuning.py: Fine-tunes the language model for conversation-based recommendation.
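
As a quick way to inspect the two metadata files described above, the following minimal sketch loads a JSON dictionary and a JSON Lines file with the standard library. The helper names are hypothetical, and the demo runs on tiny made-up files rather than the real data; only the field names title and image_name come from the description above.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_json_dict(path: Path) -> Dict[str, dict]:
    """Load a file like item2meta_train.json: one dict keyed by item ID."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def load_jsonl(path: Path) -> List[dict]:
    """Load a file like item2meta_valid.jsonl: one JSON object per line."""
    records = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                records.append(json.loads(line))
    return records


# Demo on made-up files; the item ID and values below are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "item2meta_train.json"
    valid_path = Path(tmp) / "item2meta_valid.jsonl"
    train_path.write_text(
        json.dumps({"B000TEST01": {"title": "Example lipstick"}}),
        encoding="utf-8",
    )
    valid_path.write_text(
        json.dumps({"title": "Example scarf", "image_name": "example.jpg"}) + "\n",
        encoding="utf-8",
    )
    train_meta = load_json_dict(train_path)
    valid_meta = load_jsonl(valid_path)

print(train_meta["B000TEST01"]["title"])
print(valid_meta[0]["image_name"])
```

To use it on the real files, point the two loaders at ../data/item2meta_train.json and ../data/item2meta_valid.jsonl after unzipping.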

Quick Start

1. Environment Setup

Install required libraries:

cd LaViC
pip install -r requirements.txt

2. Image Crawling

Populate train_images/ and valid_images/:

cd src
python crawl_images.py
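
For readers who want to see what such a crawl step involves, here is a rough sketch, not the contents of crawl_images.py. It assumes (hypothetically) that each metadata entry has an "images" field holding a list of URL strings, and it invents its own file-naming scheme; the real script may organize both differently. The demo only builds the download plan and does not touch the network.

```python
import urllib.request
from pathlib import Path
from typing import Dict, List, Tuple


def plan_downloads(item2meta: Dict[str, dict], out_dir: Path) -> List[Tuple[str, Path]]:
    """Pair every image URL with a destination file under out_dir."""
    plan = []
    for item_id, meta in item2meta.items():
        for idx, url in enumerate(meta.get("images", [])):
            # Keep the URL's file extension, ignoring any query string.
            suffix = Path(url.split("?")[0]).suffix or ".jpg"
            plan.append((url, out_dir / f"{item_id}_{idx}{suffix}"))
    return plan


def download_all(plan: List[Tuple[str, Path]], timeout: float = 10.0) -> int:
    """Fetch every planned file that is not on disk yet; return the count fetched."""
    fetched = 0
    for url, dest in plan:
        if dest.exists():
            continue  # already downloaded in an earlier run
        dest.parent.mkdir(parents=True, exist_ok=True)
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                dest.write_bytes(resp.read())
            fetched += 1
        except OSError as err:  # URLError and timeouts are OSError subclasses
            print(f"skip {url}: {err}")
    return fetched


# Made-up metadata; example.com URLs are placeholders.
demo_meta = {
    "B000TEST01": {
        "images": [
            "https://example.com/img/a.jpg",
            "https://example.com/img/b.png?x=1",
        ]
    }
}
plan = plan_downloads(demo_meta, Path("../data/train_images"))
for url, dest in plan:
    print(url, "->", dest.name)
```

Separating the plan from the download makes it easy to check destinations first and to resume an interrupted crawl, since files already on disk are skipped.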

3. Visual Knowledge Self-Distillation

python knowledge_distillation.py \
  --model_name llava-hf/llava-v1.6-mistral-7b-hf \
  --train_data ../data/item2meta_train.json \
  --val_data ../data/item2meta_valid.jsonl \
  --train_images_dir ../data/train_images \
  --val_images_dir ../data/valid_images \
  --output_dir ./out_distilled \
  --lr 5e-5 --weight_decay 1e-5 --num_epochs 5 --batch_size 4

Key Arguments:
- --model_name: The base LLaVA model to distill (e.g., llava-hf/llava-v1.6-mistral-7b-hf).
- --train_data, --val_data: JSON or JSONL paths with product info and descriptions.
- --train_images_dir, --val_images_dir: Image directories.
- --output_dir: Where to save checkpoints and the final "vision_lora_adapter_best".
- --lr, --weight_decay, --num_epochs, --batch_size: Basic training hyperparameters.
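
When trying several hyperparameter settings, it can be convenient to assemble the command above programmatically. The sketch below is a hypothetical helper (build_distill_command is not part of the repository) that only uses the flags and default values shown in this section.

```python
import shlex
from typing import List


def build_distill_command(
    lr: float = 5e-5,
    weight_decay: float = 1e-5,
    num_epochs: int = 5,
    batch_size: int = 4,
    output_dir: str = "./out_distilled",
) -> List[str]:
    """Return the knowledge_distillation.py invocation as an argument list."""
    return [
        "python", "knowledge_distillation.py",
        "--model_name", "llava-hf/llava-v1.6-mistral-7b-hf",
        "--train_data", "../data/item2meta_train.json",
        "--val_data", "../data/item2meta_valid.jsonl",
        "--train_images_dir", "../data/train_images",
        "--val_images_dir", "../data/valid_images",
        "--output_dir", output_dir,
        "--lr", str(lr),
        "--weight_decay", str(weight_decay),
        "--num_epochs", str(num_epochs),
        "--batch_size", str(batch_size),
    ]


cmd = build_distill_command()
print(shlex.join(cmd))
# To launch training from src/, pass the list to subprocess.run(cmd, check=True).
```

Passing an argument list rather than a single shell string avoids quoting problems if a path contains spaces.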

4. Recommendation Prompt Tuning

python prompt_tuning.py \
  --model_dir ./out_distilled/vision_lora_adapter_best \
  --candidate_type candidates_st \
  --finetune_output_dir ./out_finetuned \
  --max_length 2048 \
  --batch_size 1 \
  --lr 5e-5 --weight_decay 1e-5 \
  --num_epochs 1 \
  --item_meta_path ../data/item2meta_train.json \
  --image_dir ../data/train_images \
  --category all_beauty

Key Arguments:
- --model_dir: Path to the distilled model produced by the previous step (e.g., ./out_distilled/vision_lora_adapter_best).
- --candidate_type: Which key in your conversation JSON indicates the candidate items (e.g., candidates_st or candidates_gpt_large).
- --category: The domain (subdirectory) for your data (e.g., all_beauty, amazon_fashion, or amazon_home).
- --item_meta_path: JSON with item metadata (titles, etc.).
- --image_dir: Directory containing product images.
- --finetune_output_dir: Where to save the final LoRA adapter for the LM side.
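
To make the role of --candidate_type concrete, the sketch below reads a conversation JSONL file and pulls out the candidate list stored under the chosen key. It is an illustration under assumptions: the helper iter_candidates is hypothetical, and apart from the two candidate keys named above, the record fields and item IDs in the demo are made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, List


def iter_candidates(jsonl_path: Path, candidate_type: str) -> Iterator[List[str]]:
    """Yield the candidate item list of each conversation record."""
    with jsonl_path.open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            if candidate_type not in record:
                raise KeyError(f"line {line_no}: no '{candidate_type}' field")
            yield record[candidate_type]


# Demo on a made-up record; "conversation" is a placeholder field name.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.jsonl"
    rows = [
        {
            "conversation": "made-up dialogue text",
            "candidates_st": ["A1", "A2"],
            "candidates_gpt_large": ["A2", "A3"],
        }
    ]
    path.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    st = list(iter_candidates(path, "candidates_st"))
    gpt = list(iter_candidates(path, "candidates_gpt_large"))

print(st)
print(gpt)
```

Running a check like this on ../data/all_beauty/train.jsonl before training is a cheap way to confirm that the key passed to --candidate_type exists in every record.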

Citation

To cite LaViC in your work, please use the following BibTeX entry:

@inproceedings{jeon25adapting,
  title = "Adapting large vision-language models to visually-aware conversational recommendation",
  author = "Hyunsik Jeon and Satoshi Koide and Yu Wang and Zhankui He and Julian McAuley",
  year = "2025",
  booktitle = "KDD"
}
