Commit b6e7f38

Create README.md
1 parent 75034aa commit b6e7f38

1 file changed

+281
-0
lines changed

README.md

Lines changed: 281 additions & 0 deletions
@@ -0,0 +1,281 @@

# FastDup Manual

FastDup is a tool for fast detection of duplicate and near-duplicate images.

# Installation
## Ubuntu 20.04 LTS Machine Setup
Required setup:

- `sudo apt update`
- `sudo apt -y install software-properties-common`
- `sudo add-apt-repository -y ppa:deadsnakes/ppa`
- `sudo apt update`
- `sudo apt -y install python3.8`
- `sudo apt -y install python3-pip`
- `pip install --upgrade pip`

# Pip Package setup
Download the latest FastDup wheel from the following shared folder: `s3://visualdb`

Latest version: 0.25

## For pip (Python 3.8) install using
```
pip install fastdup-<VERSION>-cp38-cp38-linux_x86_64.whl
```

## For conda (Python 3.7.11) install using
```
conda install -y pandas tqdm opencv numpy
conda install fastdup-<VERSION>-py37_0.tar.bz
```

# Currently supported software/hardware

Operating system
- `Ubuntu 20.04 LTS`

Software versions
- `Python 3.8` (via pip), `Python 3.7` (via pip or conda), or a `debian package` (Python is not required)

Hardware support
- CPU (GPU not needed!)

# Running the code
```
> python3
> import fastdup
> fastdup.__version__  # prints the version number
> fastdup.run("/path/to/your/folder")  # main running function
```

Detailed Python API documentation

```
Run the fastdup tool to find duplicate and near-duplicate images in a corpus of images.
The only mandatory argument is input_dir. Given an image directory, it compares all pairs of images and stores the most similar ones in the output similarity file.

Parameters:
    input_dir (str): Location of the images directory (or videos).
        Alternatively, the location of a file listing full image paths, one image per row.

    work_dir (str): Working directory for saving intermediate results and outputs.

    compute (str): Compute type [cpu|gpu]. Default is cpu.

    verbose (boolean): Verbosity. Default is False.

    num_threads (int): Number of threads. Default is -1, which auto-configures by the number of cores.

    num_images (int): Number of images to run on. Default is -1, which means run on all the images in the input_dir folder.

    nnmodel (str): Nearest neighbor model for clustering the features together when using turi (has no effect when using faiss). Supported options are brute_force (exact), ball_tree and lsh (both approximate). Default is brute_force.

    distance (str): Distance metric for the nearest neighbors algorithm. Default is cosine. Other distances are euclidean, squared_euclidean, manhattan.

    threshold (float): Similarity measure in the range 0 to 1, where 1 is totally identical, 0.98 and above is almost identical, and 0.85 and above is very similar. Default is 0.85, which means that only image pairs with similarity larger than 0.85 are stored.

    lower_threshold (float): Similarity threshold for flagging images that are far away from the total distribution (outliers). Default is 0.3.

    model_path (str): Optional location of an ONNX model file. Should not be used.

    version (bool): Print out the version number. This function takes no argument.

    nearest_neighbors_k (int): For each image, how many similar images to look for. Default is 2.

    run_mode (int): The software can run feature vector extraction and similarity measurement (0), just feature vector extraction (1), or just similarity measure computation (2).

    nn_provider (str): Provider of the nearest neighbor algorithm. Allowed values are turi|faiss.

    min_offset (int): Optional min offset at which to start iterating on the full file list. Default is -1.

    max_offset (int): Optional max offset at which to stop iterating on the full file list. Default is -1.

    faiss_mode (str): When nn_provider='faiss', selects the faiss mode. Supported options are HNSW32 and any other faiss string.

    faiss_param (str): When nn_provider='faiss', assigns optional faiss parameters, for example 'efSearch=175'. Multiple params are supported, for example 'efSearch=175,nprobes=200'.

Returns:
    Status code 0 = success, 1 = error.
```
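
The parameter list above names a handful of options with a closed set of allowed values, plus a comma-separated format for `faiss_param`. The sketch below shows one way to check options and build that string before starting a long run. The helper names `check_run_options` and `format_faiss_param` are hypothetical and are not part of fastdup; the allowed values are copied from the documentation above.

```python
# Hypothetical pre-flight helpers; they are NOT part of the fastdup package.
# Allowed values are taken from the parameter documentation above.
ALLOWED_VALUES = {
    "compute": {"cpu", "gpu"},
    "nn_provider": {"turi", "faiss"},
    "nnmodel": {"brute_force", "ball_tree", "lsh"},
    "distance": {"cosine", "euclidean", "squared_euclidean", "manhattan"},
    "run_mode": {0, 1, 2},
}


def check_run_options(**options):
    """Raise ValueError if an option falls outside the documented values."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError("%s=%r is not one of %s" % (name, value, sorted(allowed, key=str)))
    threshold = options.get("threshold")
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in the range 0 to 1")
    return options


def format_faiss_param(params):
    """Join a dict into the 'key=value,key=value' form used by faiss_param."""
    return ",".join("%s=%s" % (key, value) for key, value in params.items())


options = check_run_options(
    nn_provider="faiss",
    faiss_mode="HNSW32",
    faiss_param=format_faiss_param({"efSearch": 175, "nprobes": 200}),
    threshold=0.9,
)
```

The resulting `options` dict could then be passed on as keyword arguments, for example `fastdup.run("/path/to/your/folder", **options)`.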

## Input / output formats
The input to the fastdup tool is given in the input_dir argument. There are a few options:
- Location of a local folder. In that case all images in this folder are searched recursively.
- Location of an s3 path. Again, all images in the path are used recursively.
- A file containing image locations (either local or full s3 paths), one image per row.
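
For the third option, the list file can be produced with a few lines of Python. This is a minimal sketch for a local folder; the helper name `write_image_list` is hypothetical, and the set of file extensions is an assumption, since the formats fastdup accepts are not listed here.

```python
import os

# Assumed set of image extensions; adjust to the formats in your corpus.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff"}


def write_image_list(root_dir, list_path, extensions=IMAGE_EXTENSIONS):
    """Walk root_dir recursively and write one absolute image path per row."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in extensions:
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    paths.sort()  # stable order, independent of the file system listing
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")
    return len(paths)
```

Calling `write_image_list("/path/to/your/folder", "/path/to/files.txt")` returns the number of paths written; the resulting file can then be given as input_dir.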

The intermediate and final outputs are stored in the folder work_dir.

Feature extraction related files:
- A binary numpy array containing n rows of 576 columns with the feature vectors (default filename is features.dat).
- An additional csv file containing the full paths of the images corresponding to the feature vectors (default filename is features.dat.csv). This is needed for two reasons:
  - The order of extraction may change depending on the file system listing.
  - When an image is corrupted, its feature vector is skipped and not generated. In that case an additional output file is provided (features.bad.csv).

Similarity pair list:

The output of the fastdup tool is a similarity file (filename is similarity.csv), which is a csv file with 3 columns: from, to, distance. The file is sorted from the closest matching images to the least similar ones.
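
The similarity file can be post-processed with the standard csv module. The sketch below keeps only near-identical pairs. It assumes that the file has a header row with the three column names and that the `distance` column holds the similarity score on the same 0 to 1 scale as the threshold parameter; both are assumptions to verify against your own output. The function name `load_similar_pairs` and the sample rows are made up for illustration.

```python
import csv
import io


def load_similar_pairs(csv_file, min_similarity=0.98):
    """Return (from, to, score) tuples whose score is at least min_similarity."""
    reader = csv.DictReader(csv_file)
    pairs = []
    for row in reader:
        score = float(row["distance"])
        if score >= min_similarity:
            pairs.append((row["from"], row["to"], score))
    return pairs


# Tiny in-memory stand-in for similarity.csv (made-up rows).
sample = io.StringIO(
    "from,to,distance\n"
    "/data/a.jpg,/data/b.jpg,0.99\n"
    "/data/a.jpg,/data/c.jpg,0.90\n"
)
near_duplicates = load_similar_pairs(sample)
```

With a real run, pass an open file handle instead, for example `open("/path/to/work_dir/similarity.csv", newline="")`.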

Note: to work with the binary features, we provide the following function in Python:

```
def load_binary_feature(filename):

    Example Python function for loading the stored binary features and their matching filenames.

    Parameters:
        filename (str): The binary feature file location

    Returns:
        A list with all image file names, of length X.
        An np matrix of shape X rows x 576 cols. Each row corresponds to the feature vector of a single image.

    Example:
        import fastdup
        file_list, mat_features = fastdup.load_binary_feature('features.dat')

```
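
If fastdup is not installed on the machine that consumes the features, the file can in principle be read with the standard library. The sketch below is a guess at the layout, not a documented format: it assumes a headerless, row-major dump of little-endian float32 values with 576 columns per row. The function name `read_feature_rows` is hypothetical; prefer the loader provided by fastdup where it is available.

```python
import struct

FEATURE_DIM = 576  # columns per feature vector, as stated above


def read_feature_rows(path, dim=FEATURE_DIM):
    """Read a raw float32 file into a list of rows with `dim` values each.

    Assumption: headerless, row-major, little-endian float32 values.
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    row_bytes = dim * 4  # 4 bytes per float32 value
    if len(raw) % row_bytes != 0:
        raise ValueError("file size is not a multiple of one feature row")
    n_rows = len(raw) // row_bytes
    values = struct.unpack("<%df" % (n_rows * dim), raw)
    return [list(values[i * dim:(i + 1) * dim]) for i in range(n_rows)]
```

The size check fails early when the file does not match the assumed layout, which is a useful signal that the assumption is wrong.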

Faiss index files:

When using faiss, an additional intermediate results file is created: faiss.index.

# Support for cloud storage
FastDup supports two types of cloud storage:
- Amazon s3 aws cli
- Min.io cloud storage api

## Amazon s3 aws cli support
### Preliminaries:
- Install the aws cli using the command
`sudo apt install awscli`
- Configure aws using the command
`aws configure`
- Make sure you can access your bucket using
`aws s3 ls s3://<your bucket name>`

### How to run
There are two options to run:
- In the input_dir argument, put the full path of your bucket, for example: `s3://mybucket/myfolder/myother_folder/`
  This option is useful for testing, but it is not recommended for large corpora of images because listing files in s3 is a slow operation. In this mode, all the images in the recursive subfolders of the given folder are used.
- Alternatively (and recommended), create a file with the list of all your images in the following format:
```
s3://mybucket/myfolder/myother_folder/image1.jpg
s3://mybucket/myfolder2/myother_folder4/image2.jpg
s3://mybucket/myfolder3/myother_folder5/image3.jpg
```
Assuming the filename is files.txt, you can run with `input_dir='/path/to/files.txt'`
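
One way to produce such a list is to save the output of `aws s3 ls s3://mybucket/ --recursive` once and convert it offline. The sketch below assumes the usual four-column listing (date, time, size, key); the function name `s3_listing_to_uris` and the sample rows are made up for illustration.

```python
def s3_listing_to_uris(listing_text, bucket):
    """Convert `aws s3 ls --recursive` output into full s3:// paths."""
    uris = []
    for line in listing_text.splitlines():
        # Each row is: date, time, size, key. The key may contain spaces,
        # so split at most three times.
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        key = parts[3]
        if key.endswith("/"):
            continue  # skip folder placeholder objects
        uris.append("s3://%s/%s" % (bucket, key))
    return uris


# Made-up listing rows standing in for the saved command output.
sample_listing = (
    "2022-01-01 10:00:00     123456 myfolder/myother_folder/image1.jpg\n"
    "2022-01-01 10:00:01          0 myfolder/empty/\n"
    "2022-01-01 10:00:02      98765 myfolder2/my photo.jpg\n"
)
uris = s3_listing_to_uris(sample_listing, "mybucket")
```

Writing `"\n".join(uris)` to files.txt gives a file in the format shown above.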

Notes:
- Currently we support a single cloud provider and a single bucket.
- It is OK to have images with the same name as long as they are nested in different subfolders.
- For best performance, copy the full bucket to the local node first if the local disk is large enough, then give input_dir as the local folder location of the copied data. The instructions above are for the case where the dataset is larger than the local disk (and potentially multiple nodes run in parallel).

## Min.io support
### Preliminaries
Install the min.io client using the command
```
wget https://dl.min.io/client/mc/release/linux-amd64/mc
sudo mv mc /usr/bin/
chmod +x /usr/bin/mc
```
Configure the client to point to the cloud provider

```
mc alias set myminio/ http://MINIO-SERVER MYUSER MYPASSWORD
```
For example, for google cloud:
```
/usr/bin/mc alias set google https://storage.googleapis.com/ <access_key> <secret_key>
```
Make sure the bucket is accessible using the command:
```
/usr/bin/mc ls google/mybucket/myfolder/myotherfolder/
```

### How to run
There are two options to run:
- In the input_dir argument, put the full path of your cloud storage provider as defined by the minio alias, for example: `minio://google/mybucket/myfolder/myother_folder/`
  (Note that google is the alias set for google cloud, and the path has to start with the `minio://` prefix.)
  This option is useful for testing, but it is not recommended for large corpora of images because listing files in cloud storage is a slow operation. In this mode, all the images in the recursive subfolders of the given folder are used.
- Alternatively (and recommended), create a file with the list of all your images in the following format:
```
minio://google/mybucket/myfolder/myother_folder/image1.jpg
minio://google/mybucket/myfolder/myother_folder/image2.jpg
minio://google/mybucket/myfolder/myother_folder/image3.jpg
```
Assuming the filename is `files.txt`, you can run with `input_dir='/path/to/files.txt'`

## Error handling
When bad images are encountered, namely corrupted images that cannot be read, an additional csv output file called features.dat.bad is generated, listing the bad image filenames. In addition, a printout states the number of good and bad images encountered. The good image filenames are stored in features.dat.csv; the bad images are excluded from that listing. The function fastdup.load_binary_feature() reads the features corresponding to the good images and returns a list of all the good images and a numpy array of their corresponding features.
The output file similarity.csv, which lists all similar pairs, does not include any of the bad images.


## Speeding up the nearest neighbor search
Once short feature vectors are generated for each image, we cluster them to find similarities using a nearest neighbor method. FastDup supports two families of algorithms (selected using the nn_provider argument):
- turi
- faiss

Turi (nn_provider='turi') offers the following methods:
- nnmodel='brute_force' (exact method, but may be slower)
- nnmodel='ball_tree' (approximate method)
- nnmodel='lsh' (locality sensitive hashing, approximate method)

Faiss (nn_provider='faiss') supports multiple methods:
- faiss_mode='HNSW32' (the default)

Example command line:
```
> import fastdup
> fastdup.run("/path/to/folder", nn_provider="turi", nnmodel='brute_force')
> fastdup.run("/path/to/folder", nn_provider="faiss", faiss_mode='HNSW32')
```
242+
243+
244+
## Resuming a stored run
245+
There are 3 supported running modes:
246+
run_mode=0 (the default) does the feature extraction and NN embedding to provide similarities. It uses the input_dir command line argument for finding the directory to run on (or a list of files to run on). The features are extracted and saved into feature_out_file (the default features out file is features.dat in the same folder for storing the numpy features and features.dat.csv for storing the image file names corresponding to the numpy features).
247+
For larger dataset it may be wise to split the run into two, to make sure intermediate results are stored in case you encounter an error.
248+
run_mode=1 computes the extracted features and stores them, does not compute the NN embedding. For large datasets, it is possible to run on a few computing nodes, to extract the features, in parallel. Use the min_offset and max_offset flags to allocate a subset of the images for each computing node. Offsets start from 0 to n-1 where n is the number of images in the input_dir folder.
249+
run_mode=2, reads a stored feature file and computes the NN embedding to provide similarities. The input_dir param is ignored, and the features_out_file is used to point to the numpy feature file. (Give a full path and filename).
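
When splitting feature extraction across nodes, each node needs its own offset range. The sketch below divides the offsets 0 to n-1 into contiguous, nearly equal ranges. The function name `split_offsets` is hypothetical, and the ranges are half-open, [start, end); whether max_offset is inclusive or exclusive is not stated above, so check the boundaries on a small run first.

```python
def split_offsets(num_images, num_nodes):
    """Split offsets 0..num_images-1 into contiguous (start, end) ranges.

    Assumption: each range is half-open, [start, end).
    """
    if num_images < 0 or num_nodes <= 0:
        raise ValueError("num_images must be >= 0 and num_nodes must be > 0")
    base, extra = divmod(num_images, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        # The first `extra` nodes take one additional image each.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


node_ranges = split_offsets(10, 3)
```

Each node would then run with run_mode=1 and its own pair of values as min_offset and max_offset, followed by a run_mode=2 pass once all features are stored.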

## Visualizing the outputs
Once fastdup has run, you can look at the results in one of two ways. When running from a jupyter notebook, the code produces a table gallery. Otherwise, when running from a python shell, an html report is generated.

The following function creates the html report:
```
def create_duplicates_gallery(similarity_file, save_path, num_images=20, descending=True):

    Function to create and display a gallery of images computed by the similarity metrics

    Parameters:
        similarity_file (str): csv file with the computed similarities by the fastdup tool
        save_path (str): output folder location for the visuals
        num_images (int): Max number of images to display (default = 20)
        descending (boolean): If False, print the similarities from the least similar to the most similar. Default is True.
```

# Example of the html generated.

Example for the html report generation:
```
import fastdup
fastdup.create_duplicates_gallery('/path/to/similarity.csv', save_path='/path/to/report/')
```

Note: the report should be generated on the same machine, since we assume that the input folder for reading the images exists under the same location.

Notes
This is an experimental version tested on up to 13M images.
