Timeout for PDF extraction from OpenOffice supported document format. #34
When extract_pdf() is run on a document of more than 400-500 pages, JODConverter fails with an exception: Exception in thread "main" org.artofsolving.jodconverter.office.OfficeException: task did not complete within timeout at org.artofsolving.jodconverter.office.PooledOfficeManager.execute...
I vote for supporting a timeout option. 👍
Me too. I have run into this issue before. Very heavy docs can take almost 5 min to convert to PDF.
+1 here too; what are the chances this pull request will be merged?
+1 Has this been resolved yet? I am running into this problem as well. Anything over 1.5 MB in .doc format seems to time out, along with a lot of PDFs.
@@ -94,6 +94,9 @@ def parse_options
 opts.on('--no-clean', 'disable cleaning of OCR\'d text') do |c|
   @options[:clean] = false
 end
+opts.on('-t', '--timeout [SEC]', 'Timeout for PDF extraction from OpenOffice document format (default is 1 hour)') do |t|
Perhaps change this message to "Timeout for PDF extraction from OpenOffice supported document format" so as not to lead people into thinking the flag will only apply to OpenOffice files and not to .doc, .xlsx, etc.
@alxndrmlr, will do, thanks.
Original work by documentcloud#34 with modification to not use a default timeout (causing no change from existing functionality).
When extract_pdf() is run on a document of more than 400-500 pages,
JODConverter fails with an exception:
Exception in thread "main" org.artofsolving.jodconverter.office.OfficeException: task did not complete within timeout
    at org.artofsolving.jodconverter.office.PooledOfficeManager.execute(PooledOfficeManager.java:88)
    at org.artofsolving.jodconverter.office.ProcessPoolOfficeManager.execute(ProcessPoolOfficeManager.java:78)
    at org.artofsolving.jodconverter.OfficeDocumentConverter.convert(OfficeDocumentConverter.java:78)
    at org.artofsolving.jodconverter.OfficeDocumentConverter.convert(OfficeDocumentConverter.java:69)
    at org.artofsolving.jodconverter.cli.Convert.main(Convert.java:118)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
    at java.util.concurrent.FutureTask.get(FutureTask.java:91)
    at org.artofsolving.jodconverter.office.PooledOfficeManager.execute(PooledOfficeManager.java:85)
    ...
The new JODConverter 3.0b4 accepts a timeout parameter, which solves the problem.
I don't know whether the timeout should be hardcoded or exposed as a documented Docsplit option. I did both in separate commits.
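For reference, here is a minimal sketch of the "documented option" variant using Ruby's standard OptionParser, with no default timeout so that existing behaviour is unchanged when the flag is absent. The helper name parse_timeout_options and the help text are assumptions made for illustration; this is not Docsplit's actual parse_options code.

```ruby
require 'optparse'

# Hypothetical helper (not part of Docsplit): parses an optional
# -t/--timeout flag. When the flag is absent, :timeout stays unset,
# so callers can fall back to the converter's own default.
def parse_timeout_options(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on('-t', '--timeout [SEC]', Integer,
            'Timeout in seconds for PDF extraction (no default)') do |t|
      # t is nil when the flag is given without a value; ignore that case.
      options[:timeout] = t if t
    end
  end
  parser.parse(argv) # parse (not parse!) leaves argv unmodified
  options
end
```

With this sketch, parse_timeout_options(['--timeout=600']) yields a hash with :timeout set to 600, while an empty argument list yields an empty hash, so the caller only passes a timeout on when one was explicitly requested.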