Timeout for PDF extraction from OpenOffice supported document format. #34

Open
wants to merge 3 commits into
base: master


vrybas
Contributor

@vrybas vrybas commented Jan 13, 2012

When we call extract_pdf() on a document longer than 400-500 pages,
JODConverter fails with this exception:

    Exception in thread "main" org.artofsolving.jodconverter.office.OfficeException: task did not complete within timeout
        at org.artofsolving.jodconverter.office.PooledOfficeManager.execute(PooledOfficeManager.java:88)
        at org.artofsolving.jodconverter.office.ProcessPoolOfficeManager.execute(ProcessPoolOfficeManager.java:78)
        at org.artofsolving.jodconverter.OfficeDocumentConverter.convert(OfficeDocumentConverter.java:78)
        at org.artofsolving.jodconverter.OfficeDocumentConverter.convert(OfficeDocumentConverter.java:69)
        at org.artofsolving.jodconverter.cli.Convert.main(Convert.java:118)
    Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
        at java.util.concurrent.FutureTask.get(FutureTask.java:91)
        at org.artofsolving.jodconverter.office.PooledOfficeManager.execute(PooledOfficeManager.java:85)
        ...

The new JODConverter 3.0b4 accepts a timeout parameter, which solves the problem.

I don't know whether the timeout should be hardcoded or exposed as a documented Docsplit option, so I did both in separate commits.
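
To make the two alternatives concrete, here is a minimal sketch of how a caller-supplied timeout could be threaded into the JODConverter command line. It is not the code from the commits: the helper names, the jar path, and the `DEFAULT_TIMEOUT` constant are hypothetical, and `-t SECONDS` is assumed to be the timeout flag of the JODConverter 3.0b4 command line. Leaving the default as `nil` means no flag is passed, so existing behaviour is unchanged unless a timeout is requested.

```ruby
require 'shellwords'

# Hypothetical default: nil means "pass no timeout flag and let
# JODConverter apply whatever default it ships with".
DEFAULT_TIMEOUT = nil

# Build the argument list for a JODConverter CLI call.
# Assumptions: the jar path is a placeholder, and '-t SECONDS' is assumed
# to be the timeout flag of the JODConverter 3.0b4 command line.
def jodconverter_args(input, output, timeout: DEFAULT_TIMEOUT,
                      jar: 'vendor/jodconverter/jodconverter.jar')
  args = ['java', '-jar', jar]
  unless timeout.nil?
    seconds = Integer(timeout) # raises ArgumentError on non-numeric strings
    raise ArgumentError, 'timeout must be positive' unless seconds.positive?
    args += ['-t', seconds.to_s]
  end
  args + [input, output]
end

# Join the arguments into a single shell-escaped command string.
def jodconverter_command(input, output, timeout: DEFAULT_TIMEOUT)
  Shellwords.join(jodconverter_args(input, output, timeout: timeout))
end
```

Calling `jodconverter_command('big report.doc', 'big report.pdf', timeout: 3600)` would yield a command with `-t 3600` and escaped file names, while omitting `timeout:` produces the same command as before the change.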

@tienle

tienle commented May 7, 2012

I vote for supporting a timeout option. 👍

@jravetch

Me too. I have run into this issue before. Very heavy docs can take almost 5 minutes to convert to PDF.

@mromaine

+1 here too; what are the chances this pull request will be granted?

@pzaich

pzaich commented Nov 1, 2012

+1 Has this been resolved yet? I am running into this problem as well. Anything over 1.5 MB in .doc format seems to time out, along with a lot of PDFs.

@@ -94,6 +94,9 @@ def parse_options
         opts.on('--no-clean', 'disable cleaning of OCR\'d text') do |c|
           @options[:clean] = false
         end
+        opts.on('-t', '--timeout [SEC]', 'Timeout for PDF extraction from OpenOffice document format (default is 1 hour)') do |t|

Perhaps change this message to "Timeout for PDF extraction from OpenOffice supported document format" so that people are not led to think the flag applies only to OpenOffice files and not to .doc or .xlsx files.

Contributor Author

@alxndrmlr, will do, thanks.
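
For reference, a rough standalone sketch of how the option under review behaves with Ruby's `OptionParser`, using the reworded description suggested above. It is a simplified stand-in, not the actual `parse_options` from the diff: the free function, the local `options` hash, the `Integer` coercion, and the `nil` default are all assumptions made for illustration.

```ruby
require 'optparse'

# Parse a Docsplit-style argument list and return the options hash.
# Hypothetical sketch: the real method stores its results in @options.
def parse_options(argv)
  options = { timeout: nil } # assumed default: no timeout passed along
  parser = OptionParser.new do |opts|
    # '[SEC]' marks the argument as optional; Integer coerces "90" to 90.
    opts.on('-t', '--timeout [SEC]', Integer,
            'Timeout for PDF extraction from OpenOffice supported document format') do |t|
      options[:timeout] = t
    end
  end
  parser.parse(argv) # parse a copy of argv, leaving the caller's array intact
  options
end
```

With this sketch, `parse_options(%w[--timeout 90])` and `parse_options(%w[-t 90])` both set `:timeout` to the integer `90`, and an empty argument list leaves it `nil`.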

doxavore pushed a commit to ebp/docsplit that referenced this pull request Apr 25, 2014
Original work by documentcloud#34 with
modification to not use a default timeout (causing no change from
existing functionality).