examples/analyzing-hf-datasets.ipynb (+8 -8)
@@ -18,13 +18,13 @@
 "[](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb)\n",
 "[](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb)\n",
 "\n",
-"This notebook shows how you can use fastdup to analyze any datasets from [Hugging Face Datasets](https://huggingface.co/docs/datasets/index).\n",
+"This notebook shows how you can use fastdup to analyze any dataset from [Hugging Face Datasets](https://huggingface.co/docs/datasets/index).\n",
 "\n",
 "We will analyze an image classification dataset for:\n",
 "\n",
-"+ Duplicates / near-duplicates.\n",
-"+ Outliers.\n",
-"+ Wrong labels."
+"+ Duplicates / near-duplicates\n",
+"+ Outliers\n",
+"+ Wrong labels"
 ]
 },
 {
@@ -202,7 +202,7 @@
 "id": "61b315c3",
 "metadata": {},
 "source": [
-"## Get labels mapping\n",
+"## Get Labels Mapping\n",
 "\n",
 "Tiny ImageNet follows the original ImageNet class names. Let's download the class mappings `classes.py`."
 ]
@@ -257,7 +257,7 @@
 "id": "edb6463d",
 "metadata": {},
 "source": [
-"Now we can get the class names by providing the class id. For example"
+"Now we can get the class names by providing the class ID. For example:"
 ]
 },
 {
@@ -362,7 +362,7 @@
 "source": [
 "## Load Annotations\n",
 "\n",
-"To load the image labels into fastdup we need to prepare a DataFrame with the following column\n",
+"To load the image labels into fastdup, we need to prepare a DataFrame with the following columns:\n",
 "+ `filename`\n",
 "+ `label`\n",
 "+ `split`\n"
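The three-column annotation DataFrame described in this cell can be sketched with pandas; the file names and labels below are placeholders, not real dataset entries:

```python
import pandas as pd

# Minimal sketch of the annotation DataFrame fastdup expects,
# using the three columns named above. Paths and labels here are
# illustrative placeholders only.
annotations = pd.DataFrame(
    {
        "filename": ["images/n01443537_0.jpg", "images/n01443537_1.jpg"],
        "label": ["goldfish", "goldfish"],
        "split": ["train", "valid"],
    }
)

print(annotations.columns.tolist())  # ['filename', 'label', 'split']
```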
@@ -606,7 +606,7 @@
 "id": "1017106b",
 "metadata": {},
 "source": [
-"There are several methods we can use to inspect the issues found\n",
+"There are several methods we can use to inspect the issues found:\n",
 "\n",
 "```python\n",
 "fd.vis.duplicates_gallery() # create a visual gallery of duplicates\n",
examples/analyzing-kaggle-datasets.ipynb (+13 -13)
@@ -18,7 +18,7 @@
 "[](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-kaggle-datasets.ipynb)\n",
 "[](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-kaggle-datasets.ipynb)\n",
 "\n",
-"This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze any computer vision datasets from [Kaggle](https://kaggle.com)."
+"This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze any computer vision dataset from [Kaggle](https://kaggle.com)."
 ]
 },
 {
@@ -28,7 +28,7 @@
 "source": [
 "## Install Kaggle API\n",
 "\n",
-"To load data programmatically from Kaggle we will need to install the [Kaggle API](https://github.com/Kaggle/kaggle-api). The API lets us pull data from Kaggle using Python.\n",
+"To load data programmatically from Kaggle, we will need to install the [Kaggle API](https://github.com/Kaggle/kaggle-api). The API lets us pull data from Kaggle using Python.\n",
 "\n",
 "To install the API, run:"
 ]
@@ -48,21 +48,21 @@
 "id": "eb3fd4c9-bdfb-4ba9-aef6-528d9811b588",
 "metadata": {},
 "source": [
-"To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com/ . \n",
+"Note: to use the Kaggle API, you'll need to sign up for a Kaggle account at https://www.kaggle.com/. \n",
 "\n",
-"Then, go to the 'Account' tab and select 'Create API Token'. This will trigger the download of `kaggle.json`, a file containing your API credentials. \n",
+"Go to the 'Account' tab and select 'Create API Token'. This will trigger the download of `kaggle.json`, a file containing your API credentials. \n",
 "\n",
-"Place this file in the location `~/.kaggle/kaggle.json` (on Windows in the location `C:\\Users\\<Windows-username>\\.kaggle\\kaggle.json`\n",
+"Place this file at `~/.kaggle/kaggle.json` (on Windows, at `C:\\Users\\<Windows-username>\\.kaggle\\kaggle.json`).\n",
 "\n",
-"Read more [here](https://github.com/Kaggle/kaggle-api#api-credentials)."
+"For more information, see the [Kaggle API credentials documentation](https://github.com/Kaggle/kaggle-api#api-credentials)."
 ]
 },
 {
 "cell_type": "markdown",
 "id": "4b6ae131-572a-4008-a9c1-6f49b21c029e",
 "metadata": {},
 "source": [
-"If the set up is done correctly, you should be able to run the kaggle commands on your terminal. For instance, to list kaggle datasets that has the term \"computer vision\" , run:"
+"If the setup is done correctly, you should be able to run `kaggle` commands in your terminal. For instance, to list Kaggle datasets that have the term \"computer vision\", run:"
 ]
 },
 {
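The credential setup above can be sanity-checked with a few lines of Python. This helper is illustrative only, not part of the Kaggle API:

```python
import json
from pathlib import Path

# Expected location of the Kaggle credentials file described above.
# On Windows, Path.home() resolves under C:\Users\<username> instead.
cred_path = Path.home() / ".kaggle" / "kaggle.json"

def kaggle_credentials_ok(path: Path) -> bool:
    """Rough sanity check: the file exists and holds a username and key."""
    if not path.is_file():
        return False
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    return {"username", "key"} <= set(creds)
```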
@@ -132,7 +132,7 @@
 "id": "622ed625-8e11-4e39-85ed-ae2faf3320a8",
 "metadata": {},
 "source": [
-"Let's say we're interested to analyze the [RVL-CDIP Test Dataset](https://www.kaggle.com/datasets/pdavpoojan/the-rvlcdip-dataset-test). You can head to the dataset page and click on Copy API command and paste it in your terminal.\n",
+"Let's say we're interested in analyzing the [RVL-CDIP Test Dataset](https://www.kaggle.com/datasets/pdavpoojan/the-rvlcdip-dataset-test). Head to the dataset page, click \"Copy API command\", and paste the command into your terminal.\n",
examples/analyzing-object-detection-dataset.ipynb (+18 -18)
@@ -19,7 +19,7 @@
 "[](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb)\n",
 "[](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb)\n",
 "\n",
-"In this tutorial, we will analyze an object detection dataset with bounding boxes and identify potential issues. By the end of the notebook, you'll discover how to load a COCOformat bounding box annotations into fastdup and inspect the dataset for issues at the bounding box level."
+"In this tutorial, we will analyze an object detection dataset with bounding boxes and identify potential issues. By the end of the notebook, you'll discover how to load COCO-format bounding box annotations into fastdup and inspect the dataset for issues at the bounding box level."
 ]
 },
 {
@@ -100,7 +100,7 @@
 },
 "source": [
 "## Download Dataset\n",
-"We will be using the [COCO minitrain](https://github.com/giddyyupp/coco-minitrain) dataset for this tutorial. COCO minitrain is a curated mini training set with about 25,000 images or 20% of the original [COCO dataset](https://cocodataset.org/#home).\n",
+"We will be using the [COCO minitrain](https://github.com/giddyyupp/coco-minitrain) dataset for this tutorial. COCO minitrain is a curated mini training set with about 25,000 images, or 20% of the original [COCO dataset](https://cocodataset.org/#home).\n",
 "\n",
 "Let's download the dataset into our local drive."
 ]
@@ -128,9 +128,9 @@
 },
 "source": [
 "## Load Annotations\n",
-"fastdup expects the annotations to be in a specific format.\n",
+"fastdup requires the annotations to be in a specific format.\n",
 "\n",
-"We will use a simple converter to convert the COCO format JSON annotation file into the fastdup annotation dataframe. This converter is applicable to any dataset which uses COCO format."
+"We will use a simple converter to convert the COCO-format JSON annotation file into the fastdup annotation DataFrame. This converter is applicable to any dataset that uses the COCO format."
 ]
 },
 {
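A minimal version of such a converter might look like the sketch below. The COCO dict is a tiny hypothetical stand-in for the real JSON file, and the output column names (`col_x`, `row_y`, `width`, `height`) are assumptions about fastdup's bounding-box schema, so check the fastdup docs for the exact names:

```python
import pandas as pd

# Hypothetical, minimal COCO-format annotation dict; a real file would
# be loaded with json.load(open("instances_train2017.json")).
coco = {
    "images": [{"id": 1, "file_name": "000000000139.jpg"}],
    "annotations": [
        {"image_id": 1, "bbox": [10.0, 20.0, 30.0, 40.0], "category_id": 18}
    ],
    "categories": [{"id": 18, "name": "dog"}],
}

def coco_to_df(coco: dict) -> pd.DataFrame:
    """Flatten COCO-format annotations into one row per bounding box."""
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    rows = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO bbox order: x, y, width, height
        rows.append(
            {
                "filename": id_to_file[ann["image_id"]],
                "col_x": x,
                "row_y": y,
                "width": w,
                "height": h,
                "label": id_to_name[ann["category_id"]],
            }
        )
    return pd.DataFrame(rows)

df = coco_to_df(coco)
print(df.iloc[0]["label"])  # dog
```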
@@ -272,7 +272,7 @@
 "source": [
 "## Run fastdup\n",
 "\n",
-"Run fastdup with annotations on the dataset. If you're running on a free Google Colab instance, specify the `num_images` to limit the run to fewer images."
+"Run fastdup with annotations on the dataset. If you're running on a free Google Colab instance, you may want to specify `num_images` to limit the run to fewer images."
 ]
 },
 {
@@ -293,10 +293,10 @@
 "id": "3b4f5823"
 },
 "source": [
-"## Class distribution\n",
-"The dataset contains 25k images and 183k objects, an average of 7.3 objects per image. \n",
+"## Class Distribution\n",
+"The dataset contains 25k images and 183k objects, for an average of 7.3 objects per image. \n",
 "\n",
-"Interestingly, we see a highly unbalanced class distribution, where all 80 coco classes are present here, but there is a strong balance towards the person class, that accounts for over 56k instances (30.6%). Car and Chair classes also contain over 8k instances each, while at the bottom of the list the toaster and hair drier classes contain as few as 40 instances. \n",
+"Interestingly, we see a highly unbalanced class distribution. All 80 COCO classes are present, but the distribution is strongly skewed towards the person class, which accounts for over 56k instances (30.6%). The car and chair classes also contain over 8k instances each, while the toaster and hair drier classes contain as few as 40 instances. \n",
 "\n",
 "Using `Plotly` we get a useful interactive histogram. "
 ]
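The per-class counts behind such a histogram can be computed with pandas. The `label` column below is a tiny placeholder for the real annotation DataFrame (which has ~183k rows over 80 classes), and the Plotly call is left as a comment since it only matters interactively:

```python
import pandas as pd

# Placeholder annotation DataFrame; only the label column matters here.
df = pd.DataFrame({"label": ["person", "person", "car", "chair", "person"]})

# Per-class instance counts, sorted from most to least frequent.
counts = df["label"].value_counts()
print(counts.idxmax())  # person

# With Plotly installed, an interactive histogram is one call away, e.g.:
#   import plotly.express as px
#   px.histogram(df, x="label").show()
```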
@@ -184820,7 +184820,7 @@
 },
 "source": [
 "## Outliers\n",
-"Visualize outliers from the run."
+"Using fastdup's gallery feature, we can visualize outliers from the run."
 ]
 },
 {
@@ -185963,8 +185963,8 @@
 "id": "c0f1fade"
 },
 "source": [
-"## Size and shape issues\n",
-"Objects come in various shapes and sizes, and sometimes objects might be incorrectly labeled or too small to be useful. We will now find the smallest, narrowest and widest objects, and asses their usefulness. "
+"## Size and Shape Issues\n",
+"Objects come in various shapes and sizes, and sometimes objects might be incorrectly labeled or too small to be useful. We will now find the smallest, narrowest, and widest objects, and assess their usefulness. "
 ]
 },
 {
@@ -186382,7 +186382,7 @@
 "id": "da5709c9-297e-47cd-9d13-599c4c76a883",
 "metadata": {},
 "source": [
-"Let's visualize here how the top 3 smallest images look like.\n",
+"Let's visualize what the 3 smallest images look like.\n",
 "\n",
 "The following image is labeled as a `person` in the dataset."
 ]
@@ -186475,7 +186475,7 @@
 "id": "a5a1f0b1-7a85-46bd-a8c1-ebaad837c85f",
 "metadata": {},
 "source": [
-"Considering the image size, we can hardly tell if the label is correct."
+"Considering the image size, it is difficult to discern if the label is correct."
 ]
 },
 {
@@ -186789,7 +186789,7 @@
 "id": "9af6979b"
 },
 "source": [
-"Look at that! The slices reveal many items that are either tiny (10x10 pixels) or have extreme aspect ratios!"
+"Using fastdup, we have discovered many items that are either tiny (10x10 pixels) or have extreme aspect ratios!"
 ]
 },
 {
@@ -186799,7 +186799,7 @@
 "source": [
 "## Bad Bounding Boxes\n",
 "\n",
-"Bounding boxes that are either too small or go beyond image boundaries are flagged as bad bounding box in fastdup.\n",
+"Bounding boxes that are either too small or go beyond image boundaries are flagged as bad bounding boxes in fastdup.\n",
 "\n",
 "We can get a list of bad bounding boxes by reading the `atrain_features.bad.csv` file."
 ]
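Reading and counting the flagged boxes is a one-liner with pandas. Here a tiny in-memory CSV stands in for the real `atrain_features.bad.csv`, and the column names are illustrative assumptions rather than fastdup's exact schema:

```python
import io
import pandas as pd

# In the actual run you would read the file fastdup writes, e.g.:
#   bad = pd.read_csv("work_dir/atrain_features.bad.csv")
# Here we parse a small in-memory stand-in instead.
csv_text = io.StringIO(
    "filename,col_x,row_y,width,height\n"
    "img_001.jpg,5,5,3,2\n"
    "img_002.jpg,630,470,40,40\n"
)
bad = pd.read_csv(csv_text)

print(len(bad))  # 2
```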
@@ -186972,7 +186972,7 @@
 "id": "6ea5ddca-4c6f-4f92-afd6-32467ed3a437",
 "metadata": {},
 "source": [
-"We have found 18,592 (!) bounding boxes that are either too small or go beyond image boundaries. This is 10% of the data! Filtering them would both save us grusome debugging of training errors and failures and help up provide the model with useful size objects."
+"We have found 18,592 (!) bounding boxes that are either too small or go beyond image boundaries. This is 10% of the data! Filtering them out would save us gruesome debugging of training errors and failures, and help us provide the model with usefully sized objects."
 ]
 },
 {
@@ -186983,9 +186983,9 @@
 },
 "source": [
 "## Possible Mislabels\n",
-"The fastdup similarity search and gallery is a strong tool for finding objects that are possibly mislabeled. By finding each object's nearest neighbors and their classes, we can find objects with classes contradicting their neighbors' - a strong sign for mislabels.\n",
+"The fastdup similarity search and similarity gallery are strong tools for finding objects that are possibly mislabeled. By finding each object's nearest neighbors and their classes, we can find objects whose classes contradict their neighbors' (a strong sign of mislabels).\n",
 "\n",
-"Running similarity gallery shows if an image has high similarity with two of its closest neighbors yet has different labels. This helps surface potential mislabeling in the dataset. "
+"Running the similarity gallery shows if an image has high similarity with two of its closest neighbors, yet has different labels. This helps surface potential mislabeling in the dataset. "