This repository contains a collection of scripts for extracting features from videos using different deep learning models such as SlowFast and CLIP.
```
.
├── README.md
├── requirements.txt
├── raw_videos/      # Directory containing input videos
├── features/        # Directory for extracted features
└── models/
    ├── slow_fast/   # SlowFast feature extraction scripts
    └── clip/        # CLIP feature extraction scripts
```
Set up the environment using conda:
```bash
# Create and activate conda environment
conda create -n video_features python=3.8
conda activate video_features

# Install PyTorch dependencies
conda install pytorch torchvision torchaudio cpuonly -c pytorch

# Install additional dependencies
pip install pytorchvideo opencv-python tqdm
pip install git+https://github.com/openai/CLIP.git
pip install decord gluoncv-torch
```
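As a quick sanity check, the core dependencies can be imported from Python. The snippet below is a minimal sketch for that purpose (the file name `check_env.py` is just illustrative):

```python
# check_env.py -- minimal sanity check that the dependencies import cleanly
import torch
import clip          # installed from the OpenAI CLIP repository
import decord
import pytorchvideo  # noqa: F401

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("CLIP models:", clip.available_models())
print("decord:", decord.__version__)
```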
The SlowFast model is used for extracting spatiotemporal features from videos. It uses two pathways:
- A slow pathway that captures spatial semantics
- A fast pathway that captures motion at fine temporal resolution
To extract features using the SlowFast model:
```bash
python models/slow_fast/extract_features.py --input_dir ./raw_videos/ --output_dir ./features/ --clip_len 2
```
Parameters:

- `--input_dir`: Directory containing input videos
- `--output_dir`: Directory where extracted features will be saved
- `--clip_len`: Length of each clip segment in seconds (default: 2)
The script processes each video by:
- Splitting it into 2-second clips
- Processing each clip through the SlowFast model
- Extracting features
- Saving the features in .npz format (one file per video)
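
For orientation, the loop below sketches this pipeline. It assumes the SlowFast R50 weights are pulled from the PyTorchVideo Torch Hub entry point, and it uses hypothetical helpers (`iter_clip_starts`, `prepare_clip`) for clip decoding and preprocessing; it is not the repository script itself.

```python
# Minimal sketch of the SlowFast extraction loop (illustration, not the repository's exact code).
# `iter_clip_starts` and `prepare_clip` are hypothetical helpers: the first yields the start
# time of each 2-second clip, the second decodes/preprocesses a clip into the
# [slow_frames, fast_frames] pair expected by SlowFast.
import os
import numpy as np
import torch

model = torch.hub.load("facebookresearch/pytorchvideo", "slowfast_r50", pretrained=True)
model.blocks[-1].proj = torch.nn.Identity()  # drop the classification projection to expose embeddings
model.eval()

@torch.no_grad()
def extract_video(video_path, clip_len=2.0):
    features = []
    for start in iter_clip_starts(video_path, clip_len):        # hypothetical helper
        slow, fast = prepare_clip(video_path, start, clip_len)  # hypothetical helper
        feat = model([slow.unsqueeze(0), fast.unsqueeze(0)])    # [1, 2304] with the projection removed
        features.append(feat.squeeze(0).cpu().numpy())
    return np.stack(features)                                   # [n, 2304]

for name in os.listdir("./raw_videos"):
    feats = extract_video(os.path.join("./raw_videos", name))
    out_path = os.path.join("./features", os.path.splitext(name)[0] + ".npz")
    np.savez(out_path, features=feats)
```

Replacing the classification projection with an identity so the pooled 2304-dimensional embedding is returned is one common way to obtain features from the pretrained classifier; the repository's script may do this differently.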
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. This repository uses CLIP's vision encoder to extract visual features from video frames.
To extract features using the CLIP model:
```bash
python models/clip/extract_features.py --input_dir ./raw_videos/ --output_dir ./features/ --model ViT-B/32 --batch_size 32
```
Parameters:

- `--input_dir`: Directory containing input videos
- `--output_dir`: Directory where extracted features will be saved
- `--model`: CLIP model variant (default: `ViT-B/32`; alternatives: `RN50`, etc.)
- `--batch_size`: Number of frames processed simultaneously (default: 32)
- `--device`: Processing device: `cuda`, `cpu`, or `auto` (default: `auto`)
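
How `auto` is resolved is an implementation detail of the script; a plausible sketch (the flag names mirror the CLI above, but the parsing code itself is only an illustration) is:

```python
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--input_dir", required=True)
parser.add_argument("--output_dir", required=True)
parser.add_argument("--model", default="ViT-B/32")
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--device", default="auto", choices=["cuda", "cpu", "auto"])
args = parser.parse_args()

# Resolve "auto" to a concrete device: use the GPU when one is visible, otherwise the CPU
device = args.device if args.device != "auto" else ("cuda" if torch.cuda.is_available() else "cpu")
```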
The script processes each video by:
- Extracting 2-second clips from the video
- Processing each frame through CLIP's image encoder
- Averaging features across frames in each clip
- Saving the features in .npz format (one file per video)
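
A minimal sketch of this per-clip encoding is shown below. It assumes frames are decoded with decord and preprocessed with the transform returned by `clip.load`; it is an illustration, not the repository's exact script.

```python
# Minimal sketch of per-clip CLIP feature extraction (illustration, not the repository's exact code).
import numpy as np
import torch
import clip
from decord import VideoReader
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def encode_video(video_path, clip_length=2.0, batch_size=32):
    vr = VideoReader(video_path)
    fps = vr.get_avg_fps()
    frames_per_clip = int(round(fps * clip_length))
    clip_feats = []
    for start in range(0, len(vr), frames_per_clip):
        idx = list(range(start, min(start + frames_per_clip, len(vr))))
        frames = vr.get_batch(idx).asnumpy()                    # [T, H, W, 3], uint8
        images = torch.stack([preprocess(Image.fromarray(f)) for f in frames]).to(device)
        feats = []
        for i in range(0, len(images), batch_size):             # frame batches through the image encoder
            feats.append(model.encode_image(images[i:i + batch_size]))
        clip_feats.append(torch.cat(feats).mean(dim=0).cpu().numpy())  # average over frames in the clip
    return np.stack(clip_feats), fps                            # [num_clips, 512]
```

The per-clip vectors, together with the video's frame rate and the clip length, are then written to a `.npz` file as described in the output format below.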
The SlowFast feature extractor uses the pre-trained SlowFast R50 model from PyTorchVideo with the following specifications:
- Model Architecture: SlowFast R50 (ResNet-50 backbone)
- Feature Dimension: The extracted features have shape `[n, 2304]`, where:
  - `n`: number of clips in the video
  - `2304`: feature dimensionality
- Processing Pipeline (see the sketch after this list):
  - Video is split into 2-second clips
  - Each clip is processed through a dual-pathway network:
    - Slow pathway: operates at 1/4 temporal resolution to capture spatial semantics
    - Fast pathway: operates at full temporal resolution to capture motion
  - The classification layer is removed to obtain feature embeddings
  - Frames are standardized with mean=[0.45, 0.45, 0.45] and std=[0.225, 0.225, 0.225]
  - The short side of each frame is scaled to 256 pixels
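
A per-clip preprocessing stack consistent with this specification might look like the sketch below. The normalization constants and the 256-pixel short-side scale come from the list above; `ALPHA = 4` and the 32-frame fast pathway are assumptions, and `pack_pathways` is an illustrative helper rather than the repository's code.

```python
# Sketch of per-clip preprocessing for SlowFast (frame counts are assumptions).
import torch
from pytorchvideo.transforms import Normalize, ShortSideScale, UniformTemporalSubsample
from torchvision.transforms import Compose, Lambda

ALPHA = 4        # slow pathway runs at 1/4 the temporal resolution
NUM_FRAMES = 32  # frames fed to the fast pathway (assumed)

transform = Compose([
    UniformTemporalSubsample(NUM_FRAMES),                         # [C, T, H, W] -> 32 frames
    Lambda(lambda x: x / 255.0),                                   # uint8 -> [0, 1]
    Normalize(mean=[0.45, 0.45, 0.45], std=[0.225, 0.225, 0.225]),
    ShortSideScale(size=256),                                      # scale short side to 256 px
])

def pack_pathways(frames):
    """frames: [C, T, H, W] -> [slow, fast] inputs for the dual-pathway network."""
    fast = frames
    slow_idx = torch.linspace(0, frames.shape[1] - 1, frames.shape[1] // ALPHA).long()
    slow = torch.index_select(frames, 1, slow_idx)  # temporally subsample for the slow pathway
    return [slow, fast]
```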
The SlowFast architecture is based on the paper SlowFast Networks for Video Recognition (ICCV 2019) by Christoph Feichtenhofer et al.
The CLIP feature extractor uses OpenAI's CLIP model with the following specifications:
- Model Architecture: By default, ViT-B/32 (Vision Transformer)
- Feature Dimension: The extracted features have shape `[num_clips, 512]`, where:
  - `num_clips`: number of 2-second clips in the video
  - `512`: feature dimensionality (may vary based on the model variant)
- Processing Pipeline:
  - Video is split into 2-second clips
  - Each frame is processed with CLIP's official preprocessing transforms (resize, center crop, normalization)
  - Features are extracted from CLIP's image encoder
  - Features are averaged across all frames in a clip
  - Additional metadata (fps, clip length) is saved alongside the features
The output `.npz` files contain:

- `features`: Feature vectors for each clip
- `fps`: Original video frame rate
- `clip_length`: Length of each clip (2 seconds)
CLIP is based on the paper Learning Transferable Visual Models From Natural Language Supervision by Alec Radford et al.
The extracted features are saved as NumPy `.npz` files in the output directory:

SlowFast: `{video_name}.npz`
- `features`: array of shape `[num_clips, 2304]`

CLIP: `{video_name}.npz`
- `features`: array of shape `[num_clips, 512]`
- `fps`: original video frame rate
- `clip_length`: 2 (seconds)
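
The saved arrays can be read back with NumPy; the file name below is a placeholder:

```python
import numpy as np

data = np.load("./features/example_video.npz")  # placeholder file name
features = data["features"]                     # [num_clips, 2304] for SlowFast, [num_clips, 512] for CLIP
print(features.shape)

# CLIP outputs additionally carry metadata
if "fps" in data:
    print("fps:", float(data["fps"]), "| clip_length:", float(data["clip_length"]))
```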