Commit 66267d1

author
dbickson
committed
cleaning
1 parent d1b6f42 commit 66267d1

2 files changed

+66
-66
lines changed

README.md

Lines changed: 0 additions & 66 deletions
@@ -50,69 +50,3 @@ pip install fastdup
5050
[Detailed instructions](CLOUD.md)
5151

5252

53-
## Error handling
54-
When bad images are encountered, namely corrupted images that can not be read, an additional csv output file is generated called features.dat.bad. The bad images filenames are stored there. In addition there is a printout that states the number of good and bad images encountered. The good images filenames are stored in the file features.dat.csv file. Namely the bad images are excluded from the total images listing. The function fastdup.load_binary_features() reads the features corresponding to the good images and returns a list of all the good images, and a numpy array of all their corresponding features.
55-
The output file similarity.csv with the list of all similar pairs does not include any of the bad images.
56-
57-
58-
## Speeding up the nearest neighbor search
59-
Once short feature vectors are generated per each image, we cluster them to find similarities using a nearest neighbor method. FastDup supports two families of algorithms (given using the nn_provider command line argument)
60-
- turi
61-
- faiss
62-
Turi (nn_provider=’turi’) has the following methods inside
63-
- nnmodel=’brute_force’ (exact method but may be slower)
64-
- nnmodel=’ball_tree’ (approximate method)
65-
- nnmodel=’lsh’ (locality sensitive hashing, approximate method)
66-
Faiss (nn_provider=’faiss’) supports multiple methods
67-
- faiss_mode=’HSNW32’ the default
68-
69-
70-
71-
72-
73-
Example command line:
74-
```
75-
> import fastdup
76-
> fastdup.run(“/path/to/folder”, nn_provider=”turi”, nnmodel=’brute_force’)
77-
> fastdup.run(“/path/to/folder”, nn_provider=”faiss”, faiss_mode=’HSNW32’)
78-
```
79-
80-
81-
## Resuming a stored run
82-
There are 3 supported running modes:
83-
run_mode=0 (the default) does the feature extraction and NN embedding to provide similarities. It uses the input_dir command line argument for finding the directory to run on (or a list of files to run on). The features are extracted and saved into feature_out_file (the default features out file is features.dat in the same folder for storing the numpy features and features.dat.csv for storing the image file names corresponding to the numpy features).
84-
For larger dataset it may be wise to split the run into two, to make sure intermediate results are stored in case you encounter an error.
85-
run_mode=1 computes the extracted features and stores them, does not compute the NN embedding. For large datasets, it is possible to run on a few computing nodes, to extract the features, in parallel. Use the min_offset and max_offset flags to allocate a subset of the images for each computing node. Offsets start from 0 to n-1 where n is the number of images in the input_dir folder.
86-
run_mode=2, reads a stored feature file and computes the NN embedding to provide similarities. The input_dir param is ignored, and the features_out_file is used to point to the numpy feature file. (Give a full path and filename).
87-
88-
## Visualizing the outputs
89-
Once fastdup runs you can look at the results in an easy way using two options. When running from a jupyter notebook the code will produce a table gallery. Otherwise when running a from python shell an html report will be generated.
90-
91-
The following command creates the html report:
92-
```
93-
def create_duplicates_gallery(similarity_file, save_path, num_images=20, descending=True):
94-
95-
Function to create and display a gallery of images computed by the similarity metrics
96-
97-
Parameters:
98-
similarity_file (str): csv file with the computed similarities by the fastdup tool
99-
save_path (str): output folder location for the visuals
100-
num_images(int): Max number of images to display (deafult = 50)
101-
descending (boolean): If False, print the similarities from the least similar to the most similar. Default is True.
102-
```
103-
104-
# Example of the html generated.
105-
106-
Example for the html report generation:
107-
```
108-
import fastdup
109-
fastdup.generate_duplicates_gallery(‘/path/to/similarity.csv’, save_path=’/path/to/report/’)
110-
```
111-
112-
Note: the report should be generated on the same machine since we assume that the input folder for reading the images exists under the same location.
113-
114-
Notes
115-
This is an experimental version tested up to 13M images
116-
117-
118-

RUN.md

Lines changed: 66 additions & 0 deletions
@@ -91,3 +91,69 @@ def load_binary_feature(filename):
9191
Faiss index files
9292
When using faiss an additional intermediate results file is created: faiss.index.
9393

94+
## Error handling
95+
When bad images are encountered, namely corrupted images that cannot be read, their filenames are written to an additional csv output file called features.dat.bad. A printout also reports the number of good and bad images encountered. The filenames of the good images are stored in features.dat.csv, so the bad images are excluded from the image listing. The function fastdup.load_binary_feature() reads the features of the good images and returns a list of the good image filenames together with a numpy array of their corresponding features.
96+
The output file similarity.csv with the list of all similar pairs does not include any of the bad images.
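
As a rough illustration of the files described above, the following sketch counts the good and bad images after a run. It is not part of fastdup: the helpers `read_filenames` and `summarize_run` are hypothetical, and the code assumes each file holds one filename per row in the first column with no header row, which should be checked against the actual output.

```python
import csv
from pathlib import Path


def read_filenames(csv_path):
    """Return the first column of a CSV file as a list of filenames.

    Assumption: one filename per row, no header row. Returns an empty
    list when the file does not exist (for example when no bad images
    were encountered and features.dat.bad was never written).
    """
    path = Path(csv_path)
    if not path.exists():
        return []
    with path.open(newline="") as handle:
        return [row[0] for row in csv.reader(handle) if row]


def summarize_run(good_csv="features.dat.csv", bad_csv="features.dat.bad"):
    """Count good and bad images from the two listing files."""
    good = read_filenames(good_csv)
    bad = read_filenames(bad_csv)
    return {"good": len(good), "bad": len(bad), "bad_files": bad}
```

Calling `summarize_run()` in the folder that holds the output files returns the two counts plus the list of unreadable files, which can then be inspected or removed from the dataset.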
97+
98+
99+
## Speeding up the nearest neighbor search
100+
Once a short feature vector has been generated for each image, we cluster the vectors to find similarities using a nearest neighbor method. FastDup supports two families of algorithms, selected with the nn_provider argument:
101+
- turi
102+
- faiss
103+
Turi (nn_provider='turi') provides the following methods:
104+
- nnmodel='brute_force' (exact method but may be slower)
105+
- nnmodel='ball_tree' (approximate method)
106+
- nnmodel='lsh' (locality sensitive hashing, approximate method)
107+
Faiss (nn_provider='faiss') supports multiple methods:
108+
- faiss_mode='HNSW32' (the default)
109+
110+
111+
112+
113+
114+
Example usage from Python:
115+
```
116+
import fastdup
117+
fastdup.run("/path/to/folder", nn_provider="turi", nnmodel="brute_force")
118+
fastdup.run("/path/to/folder", nn_provider="faiss", faiss_mode="HNSW32")
119+
```
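
To make the exact versus approximate distinction concrete, here is a minimal pure-Python sketch of brute-force nearest neighbor search. It is an illustration only, not the implementation used by fastdup, Turi, or faiss; the function names and the choice of cosine similarity are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_neighbors(features, k=1):
    """For each vector, return its k most similar other vectors.

    Every pair is compared, so the result is exact, but the cost grows
    quadratically with the number of images.
    """
    neighbors = {}
    for i, query in enumerate(features):
        scored = [
            (cosine_similarity(query, other), j)
            for j, other in enumerate(features)
            if j != i
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        neighbors[i] = [(j, score) for score, j in scored[:k]]
    return neighbors
```

Approximate methods such as lsh avoid comparing every pair, trading some accuracy for speed on large datasets.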
120+
121+
122+
## Resuming a stored run
123+
There are 3 supported running modes:
124+
run_mode=0 (the default) performs the feature extraction and then the NN embedding to provide similarities. It reads the images from the directory (or list of files) given by the input_dir argument. The extracted features are saved into feature_out_file; by default the numpy features are stored in features.dat in the same folder, and the corresponding image filenames are stored in features.dat.csv.
125+
For larger datasets it may be wise to split the run in two, so that intermediate results are stored in case you encounter an error.
126+
run_mode=1 extracts the features and stores them, but does not compute the NN embedding. For large datasets, feature extraction can run in parallel on several computing nodes. Use the min_offset and max_offset flags to assign a subset of the images to each computing node. Offsets range from 0 to n-1, where n is the number of images in the input_dir folder.
127+
run_mode=2 reads a stored feature file and computes the NN embedding to provide similarities. The input_dir parameter is ignored, and features_out_file points to the numpy feature file (give a full path and filename).
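
The offset allocation for a multi-node feature extraction can be sketched as follows. The helper `split_offsets` is hypothetical and not part of fastdup. It returns half-open ranges; whether fastdup treats max_offset as inclusive or exclusive is an assumption to verify before use.

```python
def split_offsets(num_images, num_nodes):
    """Split range(num_images) into contiguous chunks, one per node.

    Returns a list of (start, end) pairs with `end` exclusive. If
    max_offset turns out to be inclusive, pass end - 1 instead.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    base, extra = divmod(num_images, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        # The first `extra` nodes take one additional image each.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


for node, (min_offset, max_offset) in enumerate(split_offsets(10, 3)):
    print(f"node {node}: min_offset={min_offset}, max_offset={max_offset}")
```

Each node would then run with run_mode=1 and its own pair of offsets, and a later run with run_mode=2 would compute the similarities from the stored features.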
128+
129+
## Visualizing the outputs
130+
Once fastdup has run, there are two easy ways to look at the results. When running from a Jupyter notebook, the code produces a table gallery. Otherwise, when running from a Python shell, an html report is generated.
131+
132+
The following function creates the html report:
133+
```
134+
def create_duplicates_gallery(similarity_file, save_path, num_images=20, descending=True):
135+
136+
Function to create and display a gallery of images computed by the similarity metrics
137+
138+
Parameters:
139+
similarity_file (str): csv file with the computed similarities by the fastdup tool
140+
save_path (str): output folder location for the visuals
141+
num_images (int): Max number of images to display (default = 20)
142+
descending (boolean): If False, print the similarities from the least similar to the most similar. Default is True.
143+
```
144+
145+
# Example of the html generated.
146+
147+
Example for the html report generation:
148+
```
149+
import fastdup
150+
fastdup.create_duplicates_gallery('/path/to/similarity.csv', save_path='/path/to/report/')
151+
```
152+
153+
Note: the report should be generated on the same machine that ran fastdup, since the images are assumed to exist under the same input folder location.
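
Before generating the report on a given machine, one could check that the image paths listed in the similarity file still resolve locally. The sketch below is not part of fastdup; the helper `find_missing_images` is hypothetical, and the column names `from` and `to` are assumptions about the layout of similarity.csv that should be adjusted to the real file.

```python
import csv
import os


def find_missing_images(similarity_file, columns=("from", "to")):
    """Return the sorted image paths in the CSV that do not exist locally.

    Assumption: the file has a header row and the image paths are stored
    in the columns named in `columns`.
    """
    missing = set()
    with open(similarity_file, newline="") as handle:
        for row in csv.DictReader(handle):
            for column in columns:
                path = row.get(column)
                if path and not os.path.exists(path):
                    missing.add(path)
    return sorted(missing)
```

An empty result suggests the report can find every image; a non-empty list points to paths that moved or belong to another machine.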
154+
155+
Notes
156+
This is an experimental version tested up to 13M images
157+
158+
159+
