Slow Execution Time When Scanning Large Files #1461
Hi @vinay-cldscle, have you looked into the BatchAnalyzerEngine?
Hi @omri374 Yes, I tried using the BatchAnalyzerEngine for txt files, but it's not working. Error: "Batch analyzer works only for list and dict".
Please see the Python API reference here: https://microsoft.github.io/presidio/api/analyzer_python/#presidio_analyzer.BatchAnalyzerEngine.analyze_iterator
Agree with OP that even BatchAnalyzerEngine could really benefit from additional speedups. For example, using presidio-structured does not do much better.
@solomonbrjnih thanks for the input. presidio-structured samples rows, and the lower the sampling ratio, the quicker the calculation should be. Could you provide more insight into your presidio-structured process? Note that it's possible to tweak the underlying spaCy pipeline, for example by providing a different number of processes (n_process) or batch size. This isn't officially supported in Presidio yet; see #883. presidio-structured, on the other hand, starts with sampling, so that process should be much quicker.
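The spaCy tweak mentioned above is not surfaced through Presidio's API, but in plain spaCy it looks like this. A minimal sketch: the blank pipeline and corpus size are placeholders so it runs without a downloaded model; in practice you would load the model Presidio uses (e.g. `en_core_web_lg`).

```python
import spacy

# Blank English pipeline; stands in for the model Presidio loads.
nlp = spacy.blank("en")

texts = ["some placeholder text"] * 10_000  # illustrative corpus

# nlp.pipe streams documents: batch_size controls how many texts are
# buffered per batch, and n_process fans work out to worker processes.
docs = nlp.pipe(texts, batch_size=1000, n_process=1)
count = sum(1 for _ in docs)
```

Larger `batch_size` values and `n_process > 1` tend to help only on big corpora, since spawning workers has fixed overhead.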
Update: a new PR, #1521.
Hey team,
When I tried to scan a file that is 7 MB and contains more than 700,000 lines, I passed the data in chunks (chunk size: 100,000). It takes about 7 to 10 minutes to complete. Is this normal behavior? Can we reduce the execution time? Does batch analysis support TXT files? I would like to complete the execution within 1 minute; is that possible?
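The chunking described above can be sketched with the standard library alone (the `read_in_chunks` helper is mine; the 100,000-line chunk size is from the report). Each yielded chunk is a list of strings, which is the shape the batch analyzer expects.

```python
from itertools import islice
from typing import Iterator, List


def read_in_chunks(path: str, chunk_size: int = 100_000) -> Iterator[List[str]]:
    """Yield successive lists of up to chunk_size lines from a text file."""
    with open(path, encoding="utf-8") as f:
        while True:
            # islice pulls at most chunk_size lines without loading the
            # whole file into memory at once.
            chunk = list(islice(f, chunk_size))
            if not chunk:
                return
            yield chunk
```

Feeding each chunk to the analyzer keeps peak memory bounded, though total runtime is still dominated by the NLP pipeline, not the file I/O.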