CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, arXiv:2103.14899

PaddlePaddle training/validation code and pretrained models for CrossViT.

The official PyTorch implementation is here.

This implementation is developed by PPViT.

Figure: CrossViT model overview.

Update

  • Update (2021-09-27): Model FLOPs and # params are uploaded.
  • Update (2021-09-22): Evaluation is supported for more models.
  • Update (2021-09-16): Code is released and ported weights are uploaded.

Model Zoo

| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link |
|-------|-------|-------|---------|-------|------------|----------|---------------|------|
| cross_vit_tiny_224 | 73.20 | 91.90 | 6.9M | 1.3G | 224 | 0.875 | bicubic | google/baidu(scvb) |
| cross_vit_small_224 | 81.01 | 95.33 | 26.7M | 5.2G | 224 | 0.875 | bicubic | google/baidu(32us) |
| cross_vit_base_224 | 82.12 | 95.87 | 104.7M | 20.2G | 224 | 0.875 | bicubic | google/baidu(jj2q) |
| cross_vit_9_224 | 73.78 | 91.93 | 8.5M | 1.6G | 224 | 0.875 | bicubic | google/baidu(mjcb) |
| cross_vit_15_224 | 81.51 | 95.72 | 27.4M | 5.2G | 224 | 0.875 | bicubic | google/baidu(n55b) |
| cross_vit_18_224 | 82.29 | 96.00 | 43.1M | 8.3G | 224 | 0.875 | bicubic | google/baidu(xese) |
| cross_vit_9_dagger_224 | 76.92 | 93.61 | 8.7M | 1.7G | 224 | 0.875 | bicubic | google/baidu(58ah) |
| cross_vit_15_dagger_224 | 82.23 | 95.93 | 28.1M | 5.6G | 224 | 0.875 | bicubic | google/baidu(qwup) |
| cross_vit_18_dagger_224 | 82.51 | 96.03 | 44.1M | 8.7G | 224 | 0.875 | bicubic | google/baidu(qtw4) |
| cross_vit_15_dagger_384 | 83.75 | 96.75 | 28.1M | 16.4G | 384 | 1.0 | bicubic | google/baidu(w71e) |
| cross_vit_18_dagger_384 | 84.17 | 96.82 | 44.1M | 25.8G | 384 | 1.0 | bicubic | google/baidu(99b6) |

*Results are evaluated on the ImageNet2012 validation set.
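
The Image Size and Crop_pct columns together describe the evaluation preprocessing. The following is a minimal sketch, assuming the common convention (used by timm-style pipelines, not confirmed for this repository) that the image is first resized to image_size / crop_pct with the listed interpolation and then center-cropped to image_size. The helper name eval_resize_size is hypothetical.

```python
import math


def eval_resize_size(image_size: int, crop_pct: float) -> int:
    """Side length to resize to before center-cropping to image_size.

    Assumes the timm-style convention: resize = floor(image_size / crop_pct).
    """
    return int(math.floor(image_size / crop_pct))


# 224 models in the table use crop_pct 0.875, 384 models use crop_pct 1.0
print(eval_resize_size(224, 0.875))  # 256
print(eval_resize_size(384, 1.0))    # 384
```

Under this assumption, a crop_pct of 1.0 means the 384 models see the whole resized image with no border cropped away.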

Notebooks

We provide a few notebooks in AI Studio to help you get started:

*(coming soon)*

Requirements

Data

The ImageNet2012 dataset is expected in the following folder structure:

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
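
Before launching a long run, it can help to confirm that the dataset folder matches this layout. The following is a small sketch, not part of the repository; the function name summarize_imagenet_dir is hypothetical, and it only assumes the train/ and val/ class-folder structure shown above.

```python
from pathlib import Path


def summarize_imagenet_dir(root):
    """Count class folders and .JPEG files under root/train and root/val."""
    summary = {}
    for split in ("train", "val"):
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split folder: {split_dir}")
        # each class is one sub-folder named by its WordNet id, e.g. n01440764
        class_dirs = sorted(d for d in split_dir.iterdir() if d.is_dir())
        num_images = sum(
            1
            for d in class_dirs
            for f in d.iterdir()
            if f.suffix.upper() == ".JPEG"
        )
        summary[split] = {"classes": len(class_dirs), "images": num_images}
    return summary
```

Calling summarize_imagenet_dir('/dataset/imagenet') returns a dictionary with class and image counts per split; for the full ImageNet2012 dataset, each split should report 1000 class folders.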

Usage

To use the model with pretrained weights, download the .pdparams weight file and update the related file paths in the following Python script. The model config files are located in ./configs/.

For example, assuming the downloaded weight file is stored in ./crossvit_base_224.pdparams, the crossvit_base_224 model can be used in Python as follows:

import paddle
from config import get_config
from crossvit import build_crossvit as build_model
# config files are in ./configs/
config = get_config('./configs/crossvit_base_224.yaml')
# build model
model = build_model(config)
# load pretrained weights; paddle.load takes the full file path, including .pdparams
model_state_dict = paddle.load('./crossvit_base_224.pdparams')
model.set_dict(model_state_dict)

Evaluation

To evaluate CrossViT model performance on ImageNet2012 with a single GPU, run the following script from the command line:

sh run_eval.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg='./configs/crossvit_base_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \
    -eval \
    -pretrained='./crossvit_base_224'

Run evaluation using multiple GPUs:

sh run_eval_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg='./configs/crossvit_base_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \
    -eval \
    -pretrained='./crossvit_base_224'
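
The Acc@1 and Acc@5 figures in the model table are top-1 and top-5 accuracy. As an illustration of the metric only (not the repository's implementation), here is a minimal pure-Python sketch; the function name topk_accuracy and the toy scores are made up for the example.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores."""
    correct = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores for this sample
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        correct += label in topk
    return 100.0 * correct / len(labels)


# toy example: 4 samples, 3 classes
scores = [
    [0.1, 0.7, 0.2],  # highest score: class 1
    [0.5, 0.3, 0.2],  # highest score: class 0
    [0.2, 0.3, 0.5],  # highest score: class 2
    [0.6, 0.3, 0.1],  # highest score: class 0
]
labels = [1, 1, 2, 2]
print(topk_accuracy(scores, labels, k=1))  # 50.0
print(topk_accuracy(scores, labels, k=2))  # 75.0
```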

Training

To train the CrossViT model on ImageNet2012 with a single GPU, run the following script from the command line:

sh run_train.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
  -cfg='./configs/crossvit_base_224.yaml' \
  -dataset='imagenet2012' \
  -batch_size=16 \
  -data_path='/dataset/imagenet'

Run training using multiple GPUs:

sh run_train_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python main_multi_gpu.py \
    -cfg='./configs/crossvit_base_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=32 \
    -data_path='/dataset/imagenet'

Attention Map Visualization

(coming soon)

Reference

@article{chen2021crossvit,
  title={Crossvit: Cross-attention multi-scale vision transformer for image classification},
  author={Chen, Chun-Fu and Fan, Quanfu and Panda, Rameswar},
  journal={arXiv preprint arXiv:2103.14899},
  year={2021}
}