Skip to content

Commit 3e45a63

Browse files
committed
Deprecated use_ann
1 parent ca6386f commit 3e45a63

1 file changed

Lines changed: 10 additions & 15 deletions

File tree

README.md

Lines changed: 10 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -318,21 +318,23 @@ deduplicated_texts = semhash.self_deduplicate()
318318
<summary> Using custom ANN backends </summary>
319319
<br>
320320

321-
The following code snippet shows how to use a custom ANN backend and custom args with SemHash:
321+
By default, we use [USearch](https://github.com/unum-cloud/USearch) as the ANN (approximate-nearest neighbors) backend for deduplication. We recommend keeping this since the recall for smaller datasets is ~100%, and it's needed for larger datasets (>1M samples) since these will take too long to deduplicate without ANN. If you want to use a flat/exact-matching backend, you can set `ann_backend=Backend.BASIC` in the SemHash constructor:
322322

323323
```python
324-
from datasets import load_dataset
325324
from semhash import SemHash
326325
from vicinity import Backend
327326

328-
# Load a dataset to deduplicate
329-
texts = load_dataset("ag_news", split="train")["text"]
327+
semhash = SemHash.from_records(records=texts, ann_backend=Backend.BASIC)
328+
```
330329

331-
# Initialize a SemHash with the model and custom ann backend and custom args
332-
semhash = SemHash.from_records(records=texts, ann_backend=Backend.FAISS, nlist=50)
330+
Any backend from [Vicinity](https://github.com/MinishLab/vicinity) can be used with SemHash. The following code snippet shows how to use [FAISS](https://github.com/facebookresearch/faiss) with a custom `nlist` parameter:
333331

334-
# Deduplicate the texts
335-
deduplicated_texts = semhash.self_deduplicate()
332+
```python
333+
from datasets import load_dataset
334+
from semhash import SemHash
335+
from vicinity import Backend
336+
337+
semhash = SemHash.from_records(records=texts, ann_backend=Backend.FAISS, nlist=50)
336338
```
337339

338340
For the full list of supported ANN backends and args, see the [Vicinity docs](https://github.com/MinishLab/vicinity/tree/main?tab=readme-ov-file#supported-backends).
@@ -398,13 +400,6 @@ representative_texts = semhash.self_find_representative().selected
398400
```
399401
</details>
400402

401-
NOTE: By default, we use the ANN (approximate-nearest neighbors) backend for deduplication. We recommend keeping this since the recall for smaller datasets is ~100%, and it's needed for larger datasets (>1M samples) since these will take too long to deduplicate without ANN. If you want to use the flat/exact-matching backend, you can set `ann_backend=Backend.BASIC` in the SemHash constructor:
402-
403-
```python
404-
from vicinity import Backend
405-
406-
semhash = SemHash.from_records(records=texts, ann_backend=Backend.BASIC)
407-
```
408403

409404

410405

0 commit comments

Comments
 (0)