This repo contains the preprocessing and training/inference code for MassID45: A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level.
Requirements for downloading and preprocessing the data can be found in the image_preprocessing folder. We recommend creating a dedicated Python environment for image_preprocessing.
The CutLER, Mask-RCNN, Mask2Former, MaskDINO, and Grounded-SAM-2 submodules each contain their own installation instructions (see the INSTALL.md files in each submodule). We recommend creating separate Python or Conda environments for each submodule.
Code to run the watershed algorithm and crop the bulk images into annotator patches is contained in the arth-imrec submodule.
The bulk images can be downloaded from Google Drive by running image_preprocessing/download_data.py. Scripts for downloading the ENA sequence data can be found in the image_preprocessing/ENA_sequence_scripts folder.
The image_preprocessing folder contains utility scripts for downloading and assembling the bulk image data. The data pipeline is as follows:
- Download the bulk images: download_data.py
- Assemble the annotated patches into the full bulk images: assemble_annotations.py
- Crop the bulk images to exclude excess areas: crop_bulk_imgs.py
- Divide the bulk images into tiles for training: tile_imgs.py
- Lastly, postprocess the sliced annotations: postprocess_dataset.py

The entire preprocessing pipeline can be run via
sbatch pipeline.sh

The CutLER, Mask-RCNN, Mask2Former, MaskDINO, and Grounded-SAM-2 submodules each have their own usage instructions (see the MassID45 Instructions section in each submodule's README.md). We ran our experiments using 4 RTX6000 GPUs for training, and 1 RTX6000 GPU for testing.
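For machines without SLURM, the preprocessing steps listed above could be chained from a small Python driver. The following is only a sketch, not the contents of pipeline.sh: it assumes that all five scripts live in the image_preprocessing folder and that each runs without command-line arguments, neither of which is confirmed here. The names `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical helpers introduced for illustration.

```python
import subprocess
import sys
from pathlib import Path

# Script order taken from the data pipeline list in this README.
PIPELINE_STEPS = [
    ("download_data.py", "Download the bulk images"),
    ("assemble_annotations.py", "Assemble the annotated patches into the full bulk images"),
    ("crop_bulk_imgs.py", "Crop the bulk images to exclude excess areas"),
    ("tile_imgs.py", "Divide the bulk images into tiles for training"),
    ("postprocess_dataset.py", "Postprocess the sliced annotations"),
]


def build_commands(script_dir="image_preprocessing", python=sys.executable):
    """Return one argv list per step, in pipeline order.

    Assumption: every script sits in script_dir and needs no arguments.
    """
    return [[python, str(Path(script_dir) / script)] for script, _ in PIPELINE_STEPS]


def run_pipeline(script_dir="image_preprocessing", dry_run=True):
    """Print each step; when dry_run is False, also execute it.

    check=True makes subprocess.run raise on a non-zero exit code,
    so later steps never run on the output of a failed one.
    """
    for (_, description), cmd in zip(PIPELINE_STEPS, build_commands(script_dir)):
        print(f"{description}: {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the steps one after another. If any script expects arguments or a specific working directory, the command list needs to be adjusted accordingly.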
Model checkpoints can be downloaded from Zenodo.