Image Data Filter Pipeline

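A unified, distributed pipeline that brings together nine categories of image quality and content filtering algorithms for cleaning, scoring, and classifying large-scale image-text datasets (LAION, COYO, Recap, WIT, PixelProse, and others).

It supports multi-node, multi-GPU execution (torchrun), Parquet input and output, and resuming interrupted runs (Resume).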
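Resuming can be pictured as skipping every input Parquet file whose output already exists. The snippet below is a minimal sketch of that idea, not the pipeline's actual implementation: the helper name `pending_shards` and the rule that each input file produces one output file of the same name are assumptions made for illustration.

```python
from pathlib import Path


def pending_shards(parquet_dir, output_dir):
    """Return the input Parquet files that have no finished output yet.

    Assumption (hypothetical): each input file yields exactly one output
    file with the same name inside output_dir, so an existing output file
    means that shard was already processed.
    """
    out = Path(output_dir)
    inputs = sorted(Path(parquet_dir).glob("*.parquet"))
    return [p for p in inputs if not (out / p.name).exists()]
```

For this check to be reliable, outputs would need to be written atomically (for example, written under a temporary name and renamed once complete), so that a crash never leaves a partial file that looks finished.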

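📦 Integrated filtering algorithms

| # | Algorithm | Purpose | Model / tool | Enable switch |
|---|---|---|---|---|
| 1 | Watermark detection | Detect watermarked images | `watermark_model_v1.pt` (in-house) | `--enable_watermark` |
| 2 | Aesthetic scoring | Rate the visual appeal of an image | `aesthetic_predictor_v2_5.pth` + `sac+logos+ava1-l14-linearMSE.pth` | `--enable_aesthetic` |
| 3 | Quality assessment | Combination of several operators (details below) | OpenCV / PIL, no external model | `--enable_quality` |
| 4 | OCR text recognition | Detect text in images (Chinese / English) | PaddleOCR `ch_PP-OCRv4_det/rec` (ONNX) | `--enable_ocr` |
| 5 | SSCD duplicate / near-duplicate detection | Image deduplication | `sscd_disc_mixup.torchscript.pt` | `--enable_sscd` |
| 6 | NSFW content safety | Filter sensitive content | `h14_nsfw.pth` (CLIP ViT-H/14) | `--enable_nsfw` |
| 7 | Places365 scene classification | 365-class scene recognition | `rope_vit_reg4_b14_capi-places365.pt` (ViT) or `wideresnet18_places365.pth.tar` (CNN) | `--enable_places365` |
| 8 | ImageNet21K semantic classification | 21,841 fine-grained classes + semantic tree | `imagenet21k_miil_tree.pth` + backbone weights | (enabled by default) |
| 9 | CLIP Score / MetaCLIP | Image-text alignment + category collision detection | OpenAI CLIP `ViT-B/32` + MetaCLIP `ViT-H-14-quickgelu` | `--enable_clipscore` / `--enable_metaclip` |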
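Several of the rows above (CLIP Score, SSCD) come down to comparing embedding vectors. As a rough illustration only, the sketch below computes cosine similarity in plain Python and applies a threshold for near-duplicate detection. The function names and the `0.75` threshold are hypothetical and are not taken from this repository; the real pipeline obtains the vectors from the models listed in the table.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_near_duplicate(desc_a, desc_b, threshold=0.75):
    """Flag two image descriptors as near-duplicates.

    The threshold is a hypothetical value for illustration; a real
    deployment would tune it on labelled pairs.
    """
    return cosine_similarity(desc_a, desc_b) >= threshold
```

The same `cosine_similarity` applies to image-text matching: one vector is the image embedding and the other is the caption embedding.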

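🎛️ Operators supported by the Quality module

The Quality module lives in quality/ and contains two files: enhanced_quality.py (comprehensive quality assessment) and missing_filters.py (fills in the Stage 1 / Stage 2 operators from the paper that were previously missing).

A. `EnhancedQualityProcessor`: CleanVision-style comprehensive assessment

A single call to assess_quality(image) returns all of the metrics:

| Operator | Fields | Method | Default threshold |
|---|---|---|---|
| Size check | `is_too_small` | Lower bound on width / height | min_width=64, min_height=64 |
| Abnormal aspect ratio | `is_odd_aspect_ratio` | Lower and upper bound on aspect_ratio | [0.1, 10.0] |
| Grayscale detection | `is_grayscale` | Sampled consistency check across the R/G/B channels | n/a |
| Brightness analysis | `brightness`, `is_dark`, `is_light` | CleanVision weighted brightness formula | dark < 50, light > 200 |
| Blur | `blur_score`, `is_blurry` | Variance of the Laplacian | < 100 counts as blurry |
| Contrast | `contrast`, `is_low_contrast` | Standard deviation of grayscale values | < 30 counts as low contrast |
| Entropy | `entropy`, `is_low_entropy` | Shannon entropy of the histogram | < 4.0 counts as low information |
| Color statistics | `mean_rgb`, `std_rgb`, ... | Per-channel RGB mean / standard deviation / skewness | n/a |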
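To make two of these operators concrete, here is a dependency-free sketch of the blur and entropy checks using the thresholds from the table. It is an illustration under stated assumptions, not the code in enhanced_quality.py: the real module relies on OpenCV / PIL, whereas this version works on a grayscale image given as a nested list of 0..255 integers, evaluates the Laplacian on interior pixels only, and uses the hypothetical function names `laplacian_variance`, `shannon_entropy`, and `assess`.

```python
import math


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels."""
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def shannon_entropy(gray):
    """Shannon entropy (in bits) of the 256-bin intensity histogram."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[int(value)] += 1
    total = sum(hist)
    return -sum((c / total) * math.log2(c / total) for c in hist if c)


def assess(gray, blur_threshold=100.0, entropy_threshold=4.0):
    """Return blur and entropy metrics with the table's default thresholds."""
    blur = laplacian_variance(gray)
    entropy = shannon_entropy(gray)
    return {
        "blur_score": blur,
        "is_blurry": blur < blur_threshold,
        "entropy": entropy,
        "is_low_entropy": entropy < entropy_threshold,
    }
```

A perfectly flat image has zero Laplacian variance and zero entropy, so both flags fire; a sharp checkerboard passes the blur check but still has low entropy because it contains only two intensity levels.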

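B. `MissingFilters`: Stage 1 and Stage 2 operators from the paper

| Operator | Method | Stage | Key fields |
|---|---|---|---|
| File size filter | `check_file_size` | Stage 1 | `is_file_too_small` (default < 1KB) |
| Broken file detection | `check_broken_file` | Stage 1 | `is_broken`, `is_truncated`, `has_valid_format` |
| Rotation detection | `check_rotation` | Stage 2 | Uses the EXIF Orientation tag to identify 90 / 180 / 270 degree rotations |
| Saturation filter | `check_saturation` | Stage 2 | `is_oversaturated` (mean saturation > 0.75, or 30% of pixels with saturation > 0.8) |
| Texture complexity | `check_texture_complexity` | Stage 2 | Combines three methods: FFT high-frequency ratio, local variance, and Canny edge density |

Calling apply_all_filters(image_path) runs every Stage 1 and Stage 2 operator in one pass and returns an overall should_filter verdict.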
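The saturation rule in the table can be sketched with the standard library alone. This is an illustrative reading of the rule, not the code in missing_filters.py: it assumes pixels arrive as `(r, g, b)` tuples in 0..255, measures saturation as the S channel of HSV, and interprets "30% of pixels" as a strict `>` comparison. The result keys other than `is_oversaturated` are hypothetical names.

```python
import colorsys


def check_saturation(pixels, mean_threshold=0.75,
                     pixel_threshold=0.8, ratio_threshold=0.30):
    """Flag an image as oversaturated from its RGB pixels.

    pixels: iterable of (r, g, b) tuples with values in 0..255.
    """
    sats = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1]
            for r, g, b in pixels]
    if not sats:
        return {"mean_saturation": 0.0, "high_saturation_ratio": 0.0,
                "is_oversaturated": False}
    mean_sat = sum(sats) / len(sats)
    # Share of pixels whose saturation exceeds the per-pixel threshold.
    high_ratio = sum(s > pixel_threshold for s in sats) / len(sats)
    return {
        "mean_saturation": mean_sat,
        "high_saturation_ratio": high_ratio,
        "is_oversaturated": (mean_sat > mean_threshold
                             or high_ratio > ratio_threshold),
    }
```

Using two conditions joined by `or` catches both uniformly vivid images (high mean) and images where a large vivid region sits on a dull background (high ratio, moderate mean).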

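📥 Downloading model weights

The repository does not include model weights (about 890MB in total). Download them yourself into models/ and the corresponding subdirectories, following the tables below.

1. In-house / bundled weights (place under `models/` in the repository root)

| File | Size | Purpose |
|---|---|---|
| `aesthetic_predictor_v2_5.pth` | 2.6M | Main aesthetic scoring model |
| `sac+logos+ava1-l14-linearMSE.pth` | 3.7M | Aesthetic scoring (CLIP + MLP) |
| `ch_PP-OCRv4_det_infer.onnx` | 4.7M | PaddleOCR detection |
| `ch_PP-OCRv4_rec_infer.onnx` | 11M | PaddleOCR recognition |
| `ppocr_keys_v1.txt` | 26K | OCR dictionary |
| `h14_nsfw.pth` | 22M | NSFW binary classification head |
| `watermark_model_v1.pt` | 48M | Watermark detection |
| `sscd_disc_mixup.torchscript.pt` | 99M | SSCD copy detection |
| `rope_vit_reg4_b14_capi-places365.pt` | 344M | Places365 ViT model |
| `rope_vit_places365/` | 344M | Places365 in HuggingFace directory format |

For internal distribution or mirror URLs for the weights, contact the repository maintainer.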
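Before launching a long job it can help to confirm that the files from the table are in place. The following is a small sketch of such a check, not a script shipped with the repository: `missing_weights` is a hypothetical helper, and treating every listed file as required is an assumption (in practice, only the weights for the algorithms you enable are needed).

```python
from pathlib import Path

# File names copied from the table above (the directory-format entry is omitted).
EXPECTED_WEIGHTS = [
    "aesthetic_predictor_v2_5.pth",
    "sac+logos+ava1-l14-linearMSE.pth",
    "ch_PP-OCRv4_det_infer.onnx",
    "ch_PP-OCRv4_rec_infer.onnx",
    "ppocr_keys_v1.txt",
    "h14_nsfw.pth",
    "watermark_model_v1.pt",
    "sscd_disc_mixup.torchscript.pt",
    "rope_vit_reg4_b14_capi-places365.pt",
]


def missing_weights(models_dir="models", expected=EXPECTED_WEIGHTS):
    """Return the expected weight files that are not present in models_dir."""
    root = Path(models_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing weight files:")
        for name in missing:
            print(f"  models/{name}")
    else:
        print("All expected weight files are present.")
```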

2. Places365 CNN (optional, place under `places365/`)

places365/wideresnet18_places365.pth.tar   # 44MB

Source: CSAILVision/places365 model zoo

3. ImageNet21K backbone weights (required, place under `ImageNet21K/`)

The repository already includes the semantic tree imagenet21k_miil_tree.pth (7.4M); the backbone classifier must be downloaded separately from the MIIL Model Zoo:

| Model | Top-5 accuracy | Download link |
|---|---|---|
| ViT-B-16 (recommended) | 84.4% | vit_base_patch16_224_miil_21k.pth |
| TResNet-L (V2) | 83.9% | tresnet_l_v2_miil_21k.pth |
| TResNet-M | 83.1% | tresnet_m_miil_21k.pth |
| ResNet50 | 82.0% | resnet50_miil_21k.pth |
| Mixer-B-16 | 82.3% | mixer_b16_224_miil_in21k.pth |

See ImageNet21K/MODEL_ZOO.md for the complete model zoo.

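4. MetaCLIP (optional, needed only for `--enable_metaclip`)

# Default paths (can be overridden with the --metaclip_* arguments)
h14_fullcc2.5b_state_dict.pt              # ~4GB
text_features_metaclip_h14.npy            # precomputed text features
text_features_metaclip_h14_categories.json # category list

Download from facebookresearch/MetaCLIP. If you do not need it, pass --disable_metaclip to skip it.

5. CLIP Score (downloaded automatically)

The default is openai:ViT-B/32, which is downloaded automatically from OpenAI / HuggingFace into ~/.cache/ on first run; no manual placement is needed.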

🚀 Quick start

Install dependencies

git clone git@github.com:zpwithme/imagedatafilterpipeline.git
cd imagedatafilterpipeline
pip install -r requirements.txt   # see ImageNet21K/requirements.txt
pip install paddlepaddle paddleocr onnxruntime-gpu open_clip_torch

Download weights

mkdir -p models
# Download the weight files listed in the tables above into models/
# Places365 CNN:
wget -P places365/ http://places2.csail.mit.edu/models_places365/wideresnet18_places365.pth.tar

Run Parquet inference with all algorithm types

# Single GPU
python ImageNet21K/pipeline_all/parquet_alltype_inference.py \
    --parquet_dir /path/to/parquets \
    --image_root  /path/to/images \
    --output_dir  /path/to/output \
    --disable_metaclip

# Multi-GPU
torchrun --nproc_per_node 8 \
    ImageNet21K/pipeline_all/parquet_alltype_inference.py \
    --parquet_dir /path/to/parquets \
    --image_root  /path/to/images \
    --output_dir  /path/to/output

# Multi-node (shown for node 0 of 4)
torchrun --nproc_per_node 8 \
    ImageNet21K/pipeline_all/parquet_alltype_inference.py \
    --parquet_dir /path/to/parquets \
    --image_root  /path/to/images \
    --output_dir  /path/to/output \
    --node_rank 0 --node_world_size 4
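In the multi-node command, `--node_rank` and `--node_world_size` are passed to the script rather than to torchrun, which suggests each node works through its own slice of the input. One way such a split could work is sketched below. This is an assumption for illustration, not the script's documented behavior: `shard_for_worker` is a hypothetical helper that assigns files round-robin by a global worker index.

```python
def shard_for_worker(files, node_rank, node_world_size,
                     local_rank, gpus_per_node):
    """Pick the files one GPU worker should process.

    Workers are numbered globally as node_rank * gpus_per_node + local_rank,
    and the sorted file list is dealt out round-robin so that every file is
    assigned to exactly one worker.
    """
    total_workers = node_world_size * gpus_per_node
    global_rank = node_rank * gpus_per_node + local_rank
    return sorted(files)[global_rank::total_workers]
```

Under torchrun, the per-process values could be read from the `LOCAL_RANK` and `LOCAL_WORLD_SIZE` environment variables that torchrun sets for each worker. Sorting the list first keeps the assignment identical on every node regardless of directory listing order.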

Selectively enabling algorithms

# Turn off specific algorithms
python ImageNet21K/pipeline_all/parquet_alltype_inference.py ... \
    --disable_metaclip \
    --disable_nsfw \
    --disable_watermark
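This README mentions both `--enable_*` and `--disable_*` switches. One common way to offer such paired switches with `argparse` is to have both flags write to the same destination. The sketch below is a generic pattern and an assumption about how this could be wired, not a copy of the script's argument parser; the `ALGORITHMS` list and the default of "enabled" are hypothetical.

```python
import argparse

# Hypothetical subset of algorithm names, used only for this illustration.
ALGORITHMS = ["watermark", "nsfw", "metaclip"]


def build_parser(defaults=None):
    """Build a parser with paired --enable_X / --disable_X switches."""
    defaults = defaults or {}
    parser = argparse.ArgumentParser()
    for name in ALGORITHMS:
        default = defaults.get(name, True)
        # Both flags share one destination, so the last one given wins.
        parser.add_argument(f"--enable_{name}", dest=name,
                            action="store_true", default=default)
        parser.add_argument(f"--disable_{name}", dest=name,
                            action="store_false", default=default)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

With this pattern, `--disable_nsfw` turns one algorithm off while leaving the rest at their defaults, and an algorithm that defaults to off can be turned on with its `--enable_` flag.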

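📁 Repository layout

imagedatafilterpipeline/
├── README.md                          # this file
├── ImageNet21K/                       # semantic classification + pipeline entry point
│   ├── pipeline_all/
│   │   └── parquet_alltype_inference.py   # main entry point: dispatches all 9 algorithm categories
│   ├── imagenet21k_miil_tree.pth      # semantic tree (7.4M, included)
│   ├── MODEL_ZOO.md                   # list of ImageNet21K backbone weights
│   └── ...
├── quality/                           # quality assessment operators
│   ├── enhanced_quality.py            # CleanVision-style comprehensive assessment
│   └── missing_filters.py             # Stage 1 / Stage 2 operators from the paper
├── aesthetic/                         # aesthetic scoring
├── watermark/                         # watermark detection
├── nsfw/                              # NSFW content filtering
├── sscd/                              # SSCD duplicate detection
├── places365/                         # scene classification (365 classes)
├── t2v_metrics/                       # CLIP Score / image-text alignment evaluation
├── yaml_configs/                      # inference configs for each dataset
├── utils/                             # shared utilities
├── filter_pipeline*.py                # variants of the filtering pipeline
├── distributed_inference.py           # distributed inference scheduling
└── *.yaml / *.sh                      # job submission scripts for each dataset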

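📑 Documentation index

OPTIMIZATION_GUIDE.md: performance tuning guide
PARQUET_COLUMNS_AND_FILTERS.md: description of the Parquet output columns
README_scene_classification.md: Places365 scene classification in detail
README_corrupted_detector.md: corrupted image detection
WEBDATASET_OPTIMIZATION.md: WebDataset optimization
ImageNet21K/README.md: ImageNet21K sub-toolchain (classification / filtering / multi-caption merging)
ImageNet21K/MODEL_ZOO.md: list of 21K backbone weights
ImageNet21K/QUICKSTART_ENHANCED.md: get started with the enhanced classification tool in 5 minutes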

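🔧 Third-party components

The repository vendors the following third-party code directly (each component retains its own LICENSE):

| Component | Upstream |
|---|---|
| `sscd/sscd-copy-detection/` | facebookresearch/sscd-copy-detection |
| `nsfw/opennsfw2/` | bhky/opennsfw2 |
| `t2v_metrics/` | linzhiqiu/t2v_metrics |
| `places365/` | CSAILVision/places365 |
| `ImageNet21K/` | Alibaba-MIIL/ImageNet21K |

📝 License

The code is released under the MIT License (inheriting the LICENSE of each upstream component). Model weights are subject to the release terms of their respective original authors.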
