
Slow Execution Time When Scanning Large Files #1461

Open
vinay-cldscle opened this issue Oct 4, 2024 · 6 comments

Comments


vinay-cldscle commented Oct 4, 2024

Hey team,
When I scan a file that is 7 MB and contains more than 700,000 lines, passing the data in chunks (chunk size: 100,000 lines), it takes about 7 to 10 minutes to complete. Is this normal behavior? Can the execution time be reduced? Does batch analysis support TXT files? I would like the scan to complete within 1 minute. Is that possible?
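For context, here is a minimal sketch of the chunking described above. The original code is not shown in this issue, so the function name and line-based splitting strategy are assumptions:

```python
def chunk_lines(lines, chunk_size=100_000):
    """Yield successive chunks of at most chunk_size lines,
    joined back into a single string per chunk."""
    for start in range(0, len(lines), chunk_size):
        yield "\n".join(lines[start:start + chunk_size])

# 700,000 one-word lines split into 7 chunks of 100,000 lines each.
lines = ["word"] * 700_000
chunks = list(chunk_lines(lines))
```

Each chunk would then be passed to the analyzer individually, which is where the per-chunk NLP cost adds up.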

Contributor

omri374 commented Oct 8, 2024

Hi @vinay-cldscle, have you looked into the BatchAnalyzerEngine option?

Author

vinay-cldscle commented Oct 15, 2024

Hi @omri374, yes, I tried using BatchAnalyzerEngine for TXT files, but it is not working:
analyzer_engine = AnalyzerEngine()
analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer_engine)

error:
results = analyzer.analyze(texts=text_chunks, language="en", return_decision_process=True)
^^^^^^^^^^^^^^^^
AttributeError: 'BatchAnalyzerEngine' object has no attribute 'analyze'

Does the batch analyzer work only for lists and dicts?

Contributor

omri374 commented Oct 15, 2024

Please see the python API reference here: https://microsoft.github.io/presidio/api/analyzer_python/#presidio_analyzer.BatchAnalyzerEngine.analyze_iterator

Your text_chunks should be an iterable (such as List[str]); then you can call batch_analyzer.analyze_iterator(text_chunks, ...)
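To illustrate the shape of that call without requiring Presidio to be installed, here is a sketch: the `DummyBatchAnalyzer` below is a stand-in, not part of Presidio. With the real library, the equivalent would be `BatchAnalyzerEngine(analyzer_engine=AnalyzerEngine()).analyze_iterator(text_chunks, language="en")`:

```python
from typing import Iterable, Iterator, List

class DummyBatchAnalyzer:
    """Stand-in mimicking the shape of BatchAnalyzerEngine.analyze_iterator:
    it consumes an iterable of strings and yields one result list per item."""

    def analyze_iterator(self, texts: Iterable[str],
                         language: str) -> Iterator[List[str]]:
        for text in texts:
            # A real analyzer would yield RecognizerResult objects here;
            # this placeholder just reports the text length.
            yield [f"{language}:{len(text)} chars"]

text_chunks = ["Hello John Smith", "Call me at 555-0100"]
results = list(DummyBatchAnalyzer().analyze_iterator(text_chunks, language="en"))
```

The key point is that there is no `analyze` method on the batch engine; the iterator method takes the whole iterable and returns one result collection per input item.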

@solomonbrjnih

Agree with OP that even BatchAnalyzerEngine could really benefit from additional speedups.

For example, using BatchAnalyzerEngine on a 489K CSV with 2001 rows takes about 15 seconds (macOS 13.6, Python 3.11, Presidio Analyzer 2.x). This dataset is not very big by today's standards.

presidio-structured does not do much better.

$ cat benchmarks/batch-iter.py
from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine

analyzer = AnalyzerEngine(supported_languages=["en"])
batch_analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer)

with open("tests/data/big/COVID-19_Treatments_20241216-small.csv") as f:
    results = batch_analyzer.analyze_iterator(f, language="en", batch_size=100)
    for result in results:
        pass
$ time python3 benchmarks/batch-iter.py
# real	0m15.470s
# user	0m13.594s
# sys	0m2.890s

Contributor

omri374 commented Feb 3, 2025

@solomonbrjnih thanks for the input. presidio-structured is sampling rows, and the lower the sampling ratio, the quicker it should be to calculate. Could you provide more insights into your presidio-structured process?

On BatchAnalyzerEngine, it essentially runs an NLP pipeline on every cell of the table, so for 2000 rows (assuming around 20 columns), it would have to pass 40K values through the spaCy NLP pipeline.

Note that it's possible to tweak the underlying spaCy pipeline, for example by providing a different number of processes (n_process) or batch size. This isn't officially supported in Presidio yet. See #883.
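The speedup idea above (fanning per-cell NLP work out over workers instead of processing cells one at a time) can be sketched generically with the standard library. `analyze_cell` here is a placeholder for the per-cell analysis, not a Presidio API:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_cell(cell: str) -> int:
    # Placeholder for the per-cell NLP call; here it just counts tokens.
    return len(cell.split())

def analyze_table(cells, max_workers=4):
    # Fan the cells out over a worker pool instead of analyzing one-by-one.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_cell, cells))

cells = ["John Smith", "555-0100", "Seattle WA"] * 3
counts = analyze_table(cells)
```

With spaCy itself, the corresponding knobs are `nlp.pipe(texts, n_process=..., batch_size=...)`; since real NLP work is CPU-bound, multiple processes (rather than threads) are what actually help in Python, which is what the `n_process` discussion in #883 is about.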

Presidio-structured, on the other hand, starts with sampling, so this process should be much quicker.

Contributor

omri374 commented Feb 4, 2025

Update: a new PR is open, #1521.
