
Slow Execution Time When Scanning Large Files #1461

Open
vinay-cldscle opened this issue Oct 4, 2024 · 6 comments

Comments


vinay-cldscle commented Oct 4, 2024

Hey team,
When I scan a file that is 7 MB and contains more than 700,000 lines, passing the data in chunks (chunk size: 100,000 lines), it takes about 7 to 10 minutes to complete. Is this normal behavior? Can the execution time be reduced? Does batch analysis support TXT files? I would like the scan to complete within 1 minute. Is that possible?
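For context, here is a minimal sketch of the chunking described above. The original code is not shown in this issue, so the function name and line-based splitting strategy are assumptions:

```python
def chunk_lines(lines, chunk_size=100_000):
    """Yield successive chunks of at most chunk_size lines,
    joined back into a single string per chunk."""
    for start in range(0, len(lines), chunk_size):
        yield "\n".join(lines[start:start + chunk_size])

# 700,000 one-word lines split into 7 chunks of 100,000 lines each.
lines = ["word"] * 700_000
chunks = list(chunk_lines(lines))
```

Each chunk would then be passed to the analyzer individually, which is where the per-chunk NLP cost adds up.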

Contributor

omri374 commented Oct 8, 2024

Hi @vinay-cldscle, have you looked into the BatchAnalyzerEngine option?

Author

vinay-cldscle commented Oct 15, 2024

Hi @omri374, yes, I tried using BatchAnalyzerEngine for TXT files, but it is not working:
analyzer_engine = AnalyzerEngine()
analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer_engine)

error:
results = analyzer.analyze(texts=text_chunks, language="en", return_decision_process=True)
^^^^^^^^^^^^^^^^
AttributeError: 'BatchAnalyzerEngine' object has no attribute 'analyze'

Does the batch analyzer work only for lists and dicts?

Contributor

omri374 commented Oct 15, 2024

Please see the python API reference here: https://microsoft.github.io/presidio/api/analyzer_python/#presidio_analyzer.BatchAnalyzerEngine.analyze_iterator

Your text_chunks should be an iterable (such as List[str]); then you can call batch_analyzer.analyze_iterator(text_chunks, ...)
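To illustrate the shape of that call without requiring Presidio to be installed, here is a sketch: the `DummyBatchAnalyzer` below is a stand-in, not part of Presidio. With the real library, the equivalent would be `BatchAnalyzerEngine(analyzer_engine=AnalyzerEngine()).analyze_iterator(text_chunks, language="en")`:

```python
from typing import Iterable, Iterator, List

class DummyBatchAnalyzer:
    """Stand-in mimicking the shape of BatchAnalyzerEngine.analyze_iterator:
    it consumes an iterable of strings and yields one result list per item."""

    def analyze_iterator(self, texts: Iterable[str],
                         language: str) -> Iterator[List[str]]:
        for text in texts:
            # A real analyzer would yield RecognizerResult objects here;
            # this placeholder just reports the text length.
            yield [f"{language}:{len(text)} chars"]

text_chunks = ["Hello John Smith", "Call me at 555-0100"]
results = list(DummyBatchAnalyzer().analyze_iterator(text_chunks, language="en"))
```

The key point is that there is no `analyze` method on the batch engine; the iterator method takes the whole iterable and returns one result collection per input item.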

@solomonbrjnih

Agree with OP that even BatchAnalyzerEngine could really benefit from additional speedups.

For example, using BatchAnalyzerEngine on a 489K CSV with 2001 rows takes about 15 seconds (macOS 13.6, Python 3.11, Presidio Analyzer 2.x). This dataset is not very big by today's standards.

presidio-structured does not do much better.

$ cat benchmarks/batch-iter.py
from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine

analyzer = AnalyzerEngine(supported_languages=["en"])
batch_analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer)

with open("tests/data/big/COVID-19_Treatments_20241216-small.csv") as f:
    results = batch_analyzer.analyze_iterator(f, language="en", batch_size=100)
    for result in results:
        pass
$ time python3 benchmarks/batch-iter.py
# real	0m15.470s
# user	0m13.594s
# sys	0m2.890s

Contributor

omri374 commented Feb 3, 2025

@solomonbrjnih thanks for the input. presidio-structured is sampling rows, and the lower the sampling ratio, the quicker it should be to calculate. Could you provide more insights into your presidio-structured process?

On BatchAnalyzerEngine, it essentially runs an NLP pipeline on every cell of the table, so for 2000 rows (assuming around 20 columns), it would have to pass 40K values through the spaCy NLP pipeline.

Note that it's possible to tweak the underlying spaCy pipeline, for example by providing a different number of processes (n_process) or batch size. This isn't officially supported in Presidio yet. See #883.
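The speedup idea above (fanning per-cell NLP work out over workers instead of processing cells one at a time) can be sketched generically with the standard library. `analyze_cell` here is a placeholder for the per-cell analysis, not a Presidio API:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_cell(cell: str) -> int:
    # Placeholder for the per-cell NLP call; here it just counts tokens.
    return len(cell.split())

def analyze_table(cells, max_workers=4):
    # Fan the cells out over a worker pool instead of analyzing one-by-one.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_cell, cells))

cells = ["John Smith", "555-0100", "Seattle WA"] * 3
counts = analyze_table(cells)
```

With spaCy itself, the corresponding knobs are `nlp.pipe(texts, n_process=..., batch_size=...)`; since real NLP work is CPU-bound, multiple processes (rather than threads) are what actually help in Python, which is what the `n_process` discussion in #883 is about.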

Presidio-structured, on the other hand, starts with sampling, so this process should be much quicker.

Contributor

omri374 commented Feb 4, 2025

Update: a new PR is open, #1521.
