|
| 1 | + |
| 2 | +# FastDup Manual |
| 3 | + |
| 4 | +FastDup is a tool for fast detection of duplicate and near-duplicate images. |
| 5 | + |
| 6 | +# Installation |
| 7 | +## Ubuntu 20.04 LTS Machine Setup |
| 8 | +Required setup |
| 9 | +- sudo apt update |
| 10 | +- sudo apt -y install software-properties-common |
| 11 | +- sudo add-apt-repository -y ppa:deadsnakes/ppa |
| 12 | +- sudo apt update |
| 13 | +- sudo apt -y install python3.8 |
| 14 | +- sudo apt -y install python3-pip |
| 15 | +- pip install --upgrade pip |
| 16 | + |
| 17 | + |
| 18 | + |
| 19 | +# Pip Package setup |
| 20 | +Download the latest FastDup wheel from the following shared folder: `s3://visualdb` |
| 21 | + |
| 22 | +Latest version: 0.25 |
| 23 | + |
| 24 | +## For pip (python 3.8) install using |
| 25 | +``` |
| 26 | +pip install fastdup-<VERSION>-cp38-cp38-linux_x86_64.whl |
| 27 | +``` |
| 28 | + |
| 29 | +## For conda (python 3.7.11) install using |
| 30 | +``` |
| 31 | +conda install -y pandas tqdm opencv numpy |
| 32 | +conda install fastdup-<VERSION>-py37_0.tar.bz |
| 33 | +``` |
| 34 | + |
| 35 | + |
| 36 | +# Currently supported software/hardware |
| 37 | + |
| 38 | +Operating system |
| 39 | +- `Ubuntu 20.04 LTS` |
| 40 | + |
| 41 | +Software versions |
| 42 | +- `Python 3.8` (via pip) or `Python 3.7` (via pip or conda) or a `debian package` (Python is not required) |
| 43 | + |
| 44 | +Hardware support |
| 45 | +- CPU (GPU not needed!) |
| 46 | + |
| 47 | + |
| 48 | +# Running the code |
| 49 | +``` |
| 50 | +> python3 |
| 51 | +> import fastdup |
| 52 | +> fastdup.__version__ # prints the version number |
| 53 | +> fastdup.run("/path/to/your/folder") # main running function |
| 54 | +``` |
| 55 | + |
| 56 | +Detailed Python API documentation |
| 57 | + |
| 58 | +``` |
| 59 | + Run the fastdup tool to find duplicate and near-duplicate images in a corpus of images. |
| 60 | + The only mandatory argument is input_dir. Given an image directory, it compares all pairs of images and stores the most similar ones in the output file output_similarity. |
| 61 | + |
| 62 | + Parameters: |
| 63 | + input_dir (str): Location of the images directory (or videos). |
| 64 | +Alternatively, the location of a file listing the full paths of the images, one image per row. |
| 65 | + |
| 66 | + work_dir (str): Working directory for saving intermediate results and outputs. |
| 67 | + |
| 68 | + compute (str): Compute type [cpu|gpu]. Default is cpu. |
| 69 | + |
| 70 | + verbose (boolean): Verbosity. Default is False. |
| 71 | + |
| 72 | + num_threads (int): Number of threads. Default is -1, which auto-configures by the number of cores. |
| 73 | + |
| 74 | + num_images (int): Number of images to run on. Default is -1, which means run on all the images in the input_dir folder. |
| 75 | + |
| 76 | + nnmodel (str): Nearest Neighbor model for clustering the features together, when using turi (has no effect when using faiss). Supported options are brute_force (exact), ball_tree and lsh (both approximate). Default is brute_force. |
| 77 | + |
| 78 | + distance (str): Distance metric for the Nearest Neighbors algorithm. Default is cosine. Other distances are euclidean, squared_euclidean, manhattan. |
| 79 | + |
| 80 | + threshold (float): Similarity measure in the range 0 to 1, where 1 is totally identical, 0.98 and above is almost identical, and 0.85 and above is very similar. Default is 0.85, which means that only image pairs with similarity larger than 0.85 are stored. |
| 81 | + |
| 82 | + lower_threshold (float): Similarity measure to outline images that are far away (outliers) vs. the total distribution. Default value is 0.3. |
| 83 | + |
| 84 | + model_path (str): Optional location of an ONNX model file; should not be used. |
| 85 | + |
| 86 | + version (bool): Print out the version number. This function takes no argument. |
| 87 | + |
| 88 | + nearest_neighbors_k (int): For each image, how many similar images to look for. Default is 2. |
| 89 | + |
| 90 | + run_mode (int): This software can run for either feature vector extraction and similarity measurement (0), or just feature vector extraction (1), or just similarity measure computation (2). |
| 91 | + |
| 92 | + nn_provider (str): Provider of the nearest neighbor algorithm, allowed values are turi|faiss. |
| 93 | + |
| 94 | + min_offset (int): Optional min offset at which to start iterating over the full file list. Default is -1. |
| 95 | + |
| 96 | + max_offset (int): Optional max offset at which to stop iterating over the full file list. Default is -1. |
| 97 | + |
| 98 | + faiss_mode (str): When nn_provider='faiss', selects the faiss mode. Supported options are HNSW32 and any other faiss string. |
| 99 | + |
| 100 | + faiss_param (str): When nn_provider='faiss', assigns optional faiss parameters, for example efSearch=175. Multiple params are supported, for example 'efSearch=175,nprobes=200'. |
| 101 | + |
| 102 | + |
| 103 | + |
| 104 | + |
| 105 | + Returns: |
| 106 | + Status code 0 = success, 1 = error. |
| 107 | +``` |
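As an illustration of how the parameters above combine, here is a minimal sketch that assembles keyword arguments for a run. The helper `build_run_kwargs` is hypothetical (it is not part of fastdup); only the parameter names are taken from the documentation above, and the sketch assumes they can be passed as keyword arguments.

```python
def build_run_kwargs(input_dir, work_dir, threshold=0.85, faiss_params=None):
    """Hypothetical helper: collect keyword arguments for fastdup.run."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in the range 0 to 1")
    kwargs = {"input_dir": input_dir, "work_dir": work_dir, "threshold": threshold}
    if faiss_params:
        kwargs["nn_provider"] = "faiss"
        kwargs["faiss_mode"] = "HNSW32"
        # faiss_param is documented as a comma-separated list of key=value pairs
        kwargs["faiss_param"] = ",".join(f"{k}={v}" for k, v in faiss_params.items())
    return kwargs


kwargs = build_run_kwargs(
    "/path/to/your/folder",
    "/path/to/work_dir",
    threshold=0.9,
    faiss_params={"efSearch": 175, "nprobes": 200},
)
print(kwargs["faiss_param"])
# With fastdup installed, the call would presumably be: fastdup.run(**kwargs)
```

Keeping the faiss options in a dictionary and joining them only at the end avoids hand-editing the parameter string.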
| 108 | + |
| 109 | +## Input / output formats |
| 110 | +The input to the fastdup tool is given in the input_dir argument. There are a few options: |
| 111 | +- Location of a local folder. In that case all images in this folder are searched recursively. |
| 112 | +- Location of an s3 path. Again, all images in the path are used recursively. |
| 113 | +- A file containing image locations (either local or full s3 paths), one image per row. |
| 114 | + |
| 115 | +The intermediate outputs and final outputs are stored in the folder work_dir. |
| 116 | +Feature extraction related files: |
| 117 | +- Binary numpy array containing n rows of 576 columns with the feature vectors (default filename is features.dat). |
| 118 | +- An additional csv file containing the full paths of the images corresponding to the feature vectors (default filename is features.dat.csv). This is needed for two reasons: |
| 119 | +  - The order of extraction may change depending on the file system listing. |
| 120 | +  - When an image is corrupted, its feature vector is skipped and not generated. In that case an additional output file is provided (features.bad.csv). |
| 121 | + |
| 122 | +Similarity pair list |
| 123 | +The output of the fastdup tool is a similarity file (filename is similarity.csv), which is a csv file with 3 columns: from, to, distance. The file is sorted from the most similar image pairs to the least similar. |
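A minimal sketch of reading the similarity file with the Python standard library follows. It assumes the file starts with a header row naming the three columns and that the `distance` column holds the similarity score described for `threshold` (larger means more similar); neither is confirmed by this manual. The helper `load_similar_pairs` and the sample rows are made up for illustration.

```python
import csv
import io


def load_similar_pairs(csv_file, min_similarity=0.98):
    """Return (from, to, score) tuples whose score is at least min_similarity."""
    pairs = []
    for row in csv.DictReader(csv_file):
        score = float(row["distance"])
        if score >= min_similarity:
            pairs.append((row["from"], row["to"], score))
    return pairs


# In practice, pass open("/path/to/work_dir/similarity.csv", newline="").
sample = io.StringIO(
    "from,to,distance\n"
    "a.jpg,b.jpg,1.0\n"
    "c.jpg,d.jpg,0.99\n"
    "e.jpg,f.jpg,0.87\n"
)
pairs = load_similar_pairs(sample)
print(pairs)
```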
| 124 | + |
| 125 | +Note: to load the binary features, the following Python function is provided: |
| 126 | + |
| 127 | +``` |
| 128 | +def load_binary_feature(filename): |
| 129 | + |
| 130 | + Example Python function for loading the stored binary features and their matching filenames. |
| 131 | + |
| 132 | + Parameters: |
| 133 | + filename (str): The binary feature file location |
| 134 | + |
| 135 | + Returns: |
| 136 | + A list of length X with all image file names. |
| 137 | + An np matrix of shape X rows x 576 cols. Each row corresponds to the feature vector of a single image. |
| 138 | + |
| 139 | + Example: |
| 140 | + import fastdup |
| 141 | + file_list, mat_features = fastdup.load_binary_feature('features.dat') |
| 142 | + |
| 143 | +``` |
| 144 | + |
| 145 | +Faiss index files |
| 146 | +When using faiss, an additional intermediate results file is created: faiss.index. |
| 147 | +## Support for cloud storage |
| 148 | +FastDup supports two types of cloud storage: |
| 149 | +- Amazon s3, via the aws cli |
| 150 | +- Min.io cloud storage api |
| 151 | + |
| 152 | +## Amazon s3 aws cli support |
| 153 | +### Preliminaries: |
| 154 | +- Install aws cli using the command |
| 155 | +`sudo apt install awscli` |
| 156 | +- Configure your aws using the command |
| 157 | +`aws configure` |
| 158 | +- Make sure you can access your bucket using |
| 159 | +`aws s3 ls s3://<your bucket name>` |
| 160 | + |
| 161 | +## How to run |
| 162 | +There are two options to run. |
| 163 | +- In the input_dir argument, put the full path to your bucket, for example: `s3://mybucket/myfolder/myother_folder/` |
| 164 | +This option is useful for testing but is not recommended for large corpora of images, as listing files in s3 is a slow operation. In this mode, all the images in the recursive subfolders of the given folder are used. |
| 165 | +- Alternatively (and recommended), create a file with the list of all your images in the following format: |
| 166 | +``` |
| 167 | +s3://mybucket/myfolder/myother_folder/image1.jpg |
| 168 | +s3://mybucket/myfolder2/myother_folder4/image2.jpg |
| 169 | +s3://mybucket/myfolder3/myother_folder5/image3.jpg |
| 170 | +``` |
| 171 | +Assuming the filename is `files.txt`, you can run with `input_dir='/path/to/files.txt'` |
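The file list can be produced by a short script. The sketch below is only illustrative: the helper `make_file_list` is hypothetical, the keys are assumed to have been obtained already (for example from a bucket listing), and the set of image extensions is an assumption rather than a statement of what fastdup supports.

```python
# Assumed set of image extensions; adjust to your data.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def make_file_list(bucket, keys):
    """Turn object keys into full s3:// paths, keeping only image files."""
    return [
        f"s3://{bucket}/{key}"
        for key in keys
        if key.lower().endswith(IMAGE_EXTENSIONS)
    ]


keys = [
    "myfolder/myother_folder/image1.jpg",
    "myfolder/readme.txt",
    "myfolder2/myother_folder4/image2.JPG",
]
lines = make_file_list("mybucket", keys)
file_list_text = "\n".join(lines) + "\n"  # one image per row
print(file_list_text, end="")
# To save it: open("files.txt", "w").write(file_list_text)
```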
| 172 | + |
| 173 | +Notes: |
| 174 | +- Currently we support a single cloud provider and a single bucket. |
| 175 | +- It is OK to have images with the same name, as long as they are nested in different subfolders. |
| 176 | +- In terms of performance, it is better to first copy the full bucket to the local node, provided the local disk is large enough, and then give input_dir as the local folder location of the copied data. The instructions above are for the case where the dataset is larger than the local disk (and potentially multiple nodes run in parallel). |
| 177 | + |
| 178 | + |
| 179 | + |
| 180 | +## Min.io support |
| 181 | +### Preliminaries: |
| 182 | +Install the min.io client using the commands |
| 183 | +``` |
| 184 | +wget https://dl.min.io/client/mc/release/linux-amd64/mc |
| 185 | +sudo mv mc /usr/bin/ |
| 186 | +sudo chmod +x /usr/bin/mc |
| 187 | +``` |
| 188 | +Configure the client to point to the cloud provider |
| 189 | + |
| 190 | +``` |
| 191 | +mc alias set myminio/ http://MINIO-SERVER MYUSER MYPASSWORD |
| 192 | +``` |
| 193 | +For example for google cloud: |
| 194 | +``` |
| 195 | +/usr/bin/mc alias set google https://storage.googleapis.com/ <access_key> <secret_key> |
| 196 | +``` |
| 197 | +Make sure the bucket is accessible using the command: |
| 198 | +``` |
| 199 | +/usr/bin/mc ls google/mybucket/myfolder/myotherfolder/ |
| 200 | +``` |
| 201 | + |
| 202 | +## How to run |
| 203 | +There are two options to run. |
| 204 | +- In the input_dir argument, put the full path to your cloud storage provider as defined by the minio alias, for example: `minio://google/mybucket/myfolder/myother_folder/` |
| 205 | +(Note that google is the alias set for google cloud, and the path has to start with the `minio://` prefix.) |
| 206 | +This option is useful for testing but is not recommended for large corpora of images, as listing files in cloud storage is a slow operation. In this mode, all the images in the recursive subfolders of the given folder are used. |
| 207 | +- Alternatively (and recommended), create a file with the list of all your images in the following format: |
| 208 | +``` |
| 209 | +minio://google/mybucket/myfolder/myother_folder/image1.jpg |
| 210 | +minio://google/mybucket/myfolder/myother_folder/image2.jpg |
| 211 | +minio://google/mybucket/myfolder/myother_folder/image3.jpg |
| 212 | +``` |
| 213 | +Assuming the filename is `files.txt`, you can run with `input_dir='/path/to/files.txt'` |
| 214 | + |
| 215 | + |
| 216 | +## Error handling |
| 217 | +When bad images are encountered, namely corrupted images that cannot be read, an additional csv output file called features.dat.bad is generated, listing the filenames of the bad images. In addition, a printout states the number of good and bad images encountered. The filenames of the good images are stored in features.dat.csv; that is, the bad images are excluded from the image listing. The function fastdup.load_binary_feature() reads the features corresponding to the good images and returns a list of all the good images and a numpy array of their corresponding features. |
| 218 | +The output file similarity.csv with the list of all similar pairs does not include any of the bad images. |
| 219 | + |
| 220 | + |
| 221 | +## Speeding up the nearest neighbor search |
| 222 | +Once a short feature vector is generated for each image, we cluster the vectors to find similarities using a nearest neighbor method. FastDup supports two families of algorithms (selected using the nn_provider argument): |
| 223 | +- turi |
| 224 | +- faiss |
| 225 | +Turi (nn_provider='turi') has the following methods: |
| 226 | +- nnmodel='brute_force' (exact method but may be slower) |
| 227 | +- nnmodel='ball_tree' (approximate method) |
| 228 | +- nnmodel='lsh' (locality sensitive hashing, approximate method) |
| 229 | +Faiss (nn_provider='faiss') supports multiple methods: |
| 230 | +- faiss_mode='HNSW32' (the default) |
| 231 | + |
| 232 | + |
| 233 | + |
| 234 | + |
| 235 | + |
| 236 | +Example command line: |
| 237 | +``` |
| 238 | +> import fastdup |
| 239 | +> fastdup.run("/path/to/folder", nn_provider="turi", nnmodel='brute_force') |
| 240 | +> fastdup.run("/path/to/folder", nn_provider="faiss", faiss_mode='HNSW32') |
| 241 | +``` |
| 242 | + |
| 243 | + |
| 244 | +## Resuming a stored run |
| 245 | +There are 3 supported running modes: |
| 246 | +- run_mode=0 (the default) does the feature extraction and NN embedding to provide similarities. It uses the input_dir argument to find the directory to run on (or a list of files to run on). The features are extracted and saved into feature_out_file (by default, features.dat in the same folder stores the numpy features, and features.dat.csv stores the image file names corresponding to the numpy features). |
| 247 | +For larger datasets it may be wise to split the run into two, so that intermediate results are stored in case you encounter an error. |
| 248 | +- run_mode=1 extracts the features and stores them, but does not compute the NN embedding. For large datasets, it is possible to extract the features on several computing nodes in parallel. Use the min_offset and max_offset flags to allocate a subset of the images to each computing node. Offsets run from 0 to n-1, where n is the number of images in the input_dir folder. |
| 249 | +- run_mode=2 reads a stored feature file and computes the NN embedding to provide similarities. The input_dir param is ignored, and features_out_file is used to point to the numpy feature file (give a full path and filename). |
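When splitting feature extraction across nodes, each node needs its own offset range. The sketch below divides n images as evenly as possible. The helper `split_offsets` is hypothetical, and it assumes half-open ranges (min_offset inclusive, max_offset exclusive); this manual does not state whether max_offset is inclusive, so check that before relying on the boundaries.

```python
def split_offsets(num_images, num_nodes):
    """Split offsets 0..num_images-1 into contiguous (min_offset, max_offset) ranges.

    Assumption: max_offset is exclusive. The first (num_images % num_nodes)
    nodes receive one extra image each.
    """
    base, extra = divmod(num_images, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


ranges = split_offsets(10, 3)
print(ranges)  # [(0, 4), (4, 7), (7, 10)]
```

Each node would then run with run_mode=1 and its own min_offset and max_offset, and a final run_mode=2 pass would compute the similarities.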
| 250 | + |
| 251 | +## Visualizing the outputs |
| 252 | +Once fastdup has run, there are two ways to look at the results. When running from a jupyter notebook, the code produces a table gallery. Otherwise, when running from a python shell, an html report is generated. |
| 253 | + |
| 254 | +The following command creates the html report: |
| 255 | +``` |
| 256 | +def create_duplicates_gallery(similarity_file, save_path, num_images=20, descending=True): |
| 257 | + |
| 258 | + Function to create and display a gallery of images computed by the similarity metrics |
| 259 | + |
| 260 | + Parameters: |
| 261 | + similarity_file (str): csv file with the similarities computed by the fastdup tool |
| 262 | + save_path (str): output folder location for the visuals |
| 263 | + num_images (int): Max number of images to display (default = 50) |
| 264 | + descending (boolean): If False, print the similarities from the least similar to the most similar. Default is True. |
| 265 | +``` |
| 266 | + |
| 267 | +# Example of the html generated. |
| 268 | + |
| 269 | +Example for the html report generation: |
| 270 | +``` |
| 271 | +import fastdup |
| 272 | +fastdup.create_duplicates_gallery('/path/to/similarity.csv', save_path='/path/to/report/') |
| 273 | +``` |
| 274 | + |
| 275 | +Note: the report should be generated on the same machine since we assume that the input folder for reading the images exists under the same location. |
| 276 | + |
| 277 | +Notes |
| 278 | +This is an experimental version tested up to 13M images |
| 279 | + |
| 280 | + |
| 281 | + |