getPageText crashes on larger PDFs #9
Example to test with: https://arxiv.org/pdf/1708.08021.pdf For reference, this issue does not occur with PyMuPDF. |
Hey @rybon, this is likely because the in-memory representation of the PDF page exceeds the memory allocated to the WASM worker. This limit varies depending on the JS engine, see for example V8. It might be possible to work around this using the browser's memory API, but this isn't something I'm familiar with.
This isn't surprising, as that library uses native C bindings instead of WebAssembly. Thanks for providing an example file, I'll use this to investigate further. |
Thanks! I appreciate your efforts. By the way, I encountered this issue by looping over the number of pages in the document and grabbing the text per page. It seems memory is not freed in C after the text is retrieved. I tried to fix it by mucking about in the C code (adding a few I tried version |
Fortunately there's a workaround: loading the same PDF with PDF.js and grabbing the page text via its API instead. This works in the browser, web worker and Node.js. |
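For anyone wanting to try the PDF.js route, here is a minimal sketch of per-page text extraction. It assumes the `pdfjs-dist` package and uses its documented `getDocument`, `getPage` and `getTextContent` calls. The helper names `joinTextItems` and `extractPageTexts` are made up for illustration and are not part of either library.

```javascript
// Joins the text items of one page into a single string.
// Items without a `str` field (e.g. marked-content markers) contribute "".
function joinTextItems(items) {
  return items.map((item) => item.str ?? "").join(" ");
}

// Walks every page of an already loaded PDF.js document and collects its text.
// `pdf` is assumed to be the object resolved by `getDocument(...).promise`.
async function extractPageTexts(pdf) {
  const texts = [];
  // PDF.js page numbers are 1-based.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    texts.push(joinTextItems(content.items));
  }
  return texts;
}

// Usage (assumes `pdfjsLib` is the imported pdfjs-dist module):
// const pdf = await pdfjsLib.getDocument(urlOrTypedArray).promise;
// const texts = await extractPageTexts(pdf);
```

Joining items with a single space is a simplification. PDF.js returns positioned text runs, so reconstructing line breaks or columns would need the item coordinates as well.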
@rybon I've added an automated test suite on this branch. In it, I've tested the entire API and in particular ran Keen to diagnose further - could you let me know which OS, architecture and hardware you've been running this on, as well as which version of NodeJS and which browser?
That makes sense, PDF.js is potentially a better alternative for this use case. I originally created MuPDF.js as a way to render PDF files to images in the browser. |
Hi, just adding onto this: I have a similar problem, getting a memory access out of bounds error when looping through multiple pages of a PDF with getPageText() (initially triggered when I was sequentially scanning 15 different PDFs, but in my case it occurs for PDFs exceeding even one page). I'm on Windows 11, AMD64, Node 16.13.2. Ultimately the browser will be Electron 18.3.13, but the error is thrown in my testing, which is just running Node. I don't have a sample file to test with at the moment since the PDFs I'm scanning are confidential, but I'll see if I can find a generic example to attach. |
Thanks @bfarmilo ! I'll see if I can reproduce the issue on different architectures and node versions using cloud services. It looks like WASM's promise of portability has some caveats 🤔 this package might well need compatibility information in the documentation. |
I managed to resolve all these issues by reusing the build setup of this repo with some minor tweaks and compiling to WASM from the mupdf source code on GitHub. I am no longer using this package itself. |
By the way, I also started seeing these issues with |
@rybon Great to hear that you managed to get it working.
Update: I've been able to replicate the issue in the test suite. Based on what you've said, it looks like the cause is the options I'm passing to emscripten/cmake, if you're having success building from the original mupdf sources instead of using my build scripts. Thank you for the pointer - I'm confident I can resolve this issue now. |
@andytango I copied the contents of the mupdf GitHub repo (after cloning and initializing its Git submodules) to the I cannot make use of Further tweaks:
The build will output a In your project code:
Note: some API calls require a Note: running this code in the browser requires using Parcel as the bundler. |
Some notes for those who want to process a PDF in parallel (browser example, Node.js requires a slightly different API):
Spin up as many workers as CPU cores and pass each of them a slice of pages:
In the worker use the same mupdf code as stated in the previous comment. |
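The page-slicing step described above could look roughly like the sketch below. It is an illustration, not the code from the original comment: `slicePages`, `startWorkers`, the `workerUrl` parameter and the `{ start, end }` message shape are all hypothetical, and page indices are assumed to be 0-based with an exclusive `end`.

```javascript
// Splits `pageCount` pages into at most `workerCount` contiguous slices.
// Each slice is { start, end } with `start` inclusive and `end` exclusive
// (0-based indices are an assumption; adjust if your API is 1-based).
function slicePages(pageCount, workerCount) {
  if (pageCount <= 0) return [];
  const n = Math.max(1, Math.min(workerCount, pageCount));
  const base = Math.floor(pageCount / n);
  const extra = pageCount % n; // the first `extra` slices get one more page
  const slices = [];
  let start = 0;
  for (let i = 0; i < n; i++) {
    const size = base + (i < extra ? 1 : 0);
    slices.push({ start, end: start + size });
    start += size;
  }
  return slices;
}

// Browser-only sketch: one worker per CPU core, each receiving one slice.
// `workerUrl` and the message shape are hypothetical.
function startWorkers(pageCount, workerUrl) {
  const cores = navigator.hardwareConcurrency || 1;
  return slicePages(pageCount, cores).map((slice) => {
    const worker = new Worker(workerUrl);
    worker.postMessage(slice);
    return worker;
  });
}
```

Contiguous slices keep each worker's output in page order, so the results can be concatenated by worker index once all of them have replied.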
I've looked into this and also ran some builds from the HEAD of the master branch on the mupdf sources on GitHub. Am I right in thinking that you also worked from the latest commit on their master branch? This is an unstable development branch, and looking through the code, there are a number of TODOs and FIXMEs in their wrapper JS file at What I can do in the meantime is update the documentation with your helpful comments on this thread, so that anyone else who is encountering memory or performance issues can work around them. It is also very useful to see the changes that will be coming to the next release of This is particularly the case given it will take quite some time to write TypeScript declarations for this new API, so again, thank you for the heads up. |
Yeah, from master. I think from commit |
I made PR #58 which could fix this problem |
getPageText() crashes with a RuntimeError: memory access out of bounds exception on PDFs larger than around 20 pages. This happens in the browser / WebWorker and Node.js. On Node.js it crashes with bus error