Commit e138852

Update README.md
1 parent c3bd126 commit e138852

1 file changed: +7 -7 lines changed

README.md

Lines changed: 7 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -1,12 +1,12 @@
11

22
# FastDup
33

4-
FastDup is a tool for gaining insights from a large image collection. It can find anomalies, duplicate and near duplicate images, clusters of similaritity, learn the normal behavior and temporal interactions between images. It can be used for smart subsampling of a higher quality dataset, outlier removal, novelty detection of new information to be sent for tagging. FastDup scales to millions of images running on CPU only.
4+
FastDup is a tool for gaining insights from a large image collection. It can find anomalies, duplicate and near duplicate images, clusters of similarity, learn the normal behavior and temporal interactions between images. It can be used for smart subsampling of a higher quality dataset, outlier removal, novelty detection of new information to be sent for tagging. FastDup scales to millions of images running on CPU only.
55

66
From the authors of [GraphLab](https://github.com/jegonzal/PowerGraph) and [Turi Create](https://github.com/apple/turicreate).
77

88
![alt text](https://github.com/visualdatabase/fastdup/blob/main/gallery/imagenet21k_duplicates.png)
9-
*Duplicates and near duplicates identified in ImageNet data*
9+
*Duplicates and near duplicates identified in ms-coco dataset*
1010

1111
![alt text](https://github.com/visualdatabase/fastdup/blob/main/gallery/landmark_outliers.png)
1212
*Outliers in a landmarks 2021 dataset (dataset intention is to capture recognizable landmarks, like the empire state building etc.)*
@@ -23,7 +23,7 @@ From the authors of [GraphLab](https://github.com/jegonzal/PowerGraph) and [Turi
2323

2424

2525
## Results on Key Datasets
26-
We have thourougly tested fastdup across various famous visual dataset. Ranging from Academic datasets to Kaggle competitions. A key finding we have made using FastDup is that there are ~1.2M (!) duplicate images on the ImageNet21K dataset, a new unknown result! Full results are below.
26+
We have thoroughly tested fastdup across various famous visual datasets, ranging from pillar academic datasets to Kaggle competitions. A key finding we have made using FastDup is that there are ~1.2M (!) duplicate images in the ImageNet-21K dataset, a previously unknown result! Full results are below.
2727

2828
### FastDup is FAST
2929
|Dataset |Total Images |cost [$]|spot cost [$]|processing [sec]|Identical pairs|Anomalies|
@@ -40,7 +40,7 @@ We have thourougly tested fastdup across various famous visual dataset. Ranging
4040

4141
* Experiments on a 32 core Google cloud machine, with 128GB RAM (no GPU required).
4242

43-
* We run on the full ImageNet dataset (11.5M images) to compare all pairs of images in less than 3 hours WITHOUT a GPU (with Google cloud cost of 5$).
43+
* We run on the full ImageNet-21K dataset (11.5M images) to compare all pairs of images in less than 3 hours WITHOUT a GPU (with Google cloud cost of 5$).
4444

4545
## Quick Installation (Ubuntu 20.04 or Ubuntu 18.04)
4646
For Python 3.7 and 3.8
@@ -81,10 +81,10 @@ fastdup.run(input_dir="/path/to/your/folder", work_dir="/path/to/your/folder") #
8181
|--|--------------|-------------------|
8282
|Operating Systems | Ubuntu 20.04, Ubuntu 18.04 | Plus Amazon Linux, RedHat, Windows, Mac OS|
8383
|Python Versions | Python 3.7+3.8+conda | Plus Python 3.6, 3.9, 3.10|
84-
|Compute | CPU | GPU, TPU, Intel OpenVino|
84+
|Compute | CPU | Plus GPU, TPU, Intel OpenVino|
8585
|Storage| NFS, local | Plus ec2 s3, google cloud storage, minio |
86-
|Cloud Instance | On demand | Support for spot instance|
87-
|Numbr of images | Up to 1 million | Up to 1 billion|
86+
|Cloud Instance | On demand | Plus spot instance|
87+
|Number of images | Up to 1 million | Up to 1 billion|
8888
|Execution | Single node | Cluster|
8989
|Features | Outlier detection, duplicate detection | Plus novelty detection, wrong label detection, missing label detection, data summarization, connected components, train/test leaks, temporal sequence detection, advanced visual search, label quality analysis|
9090
|Input | Images | Plus Video|
