Parsing complete scanned document #117

SBhat2615 · 2024-08-26T13:14:05Z

SBhat2615
Aug 26, 2024

I am trying to parse the scanned document below.

I tried converting the scanned document to a searchable PDF using Tesseract, but still got no result.
What is the recommended way to parse such documents?

I am using the latest pymupdf4llm.

scansmpl.pdf
scansmpl.pdf-searchable.pdf

JorjMcKie · 2024-08-26T13:46:02Z

JorjMcKie
Aug 26, 2024
Maintainer

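I don't understand what you did wrong. Here is my script and the associated output:

from pathlib import Path
import pymupdf4llm

filename = "scansmp.pdf-searchable.pdf"
# Extract the (OCRed) page text as Markdown
md = pymupdf4llm.to_markdown(filename)
# Save the Markdown under the same name, with ".pdf" replaced by ".md"
Path(filename.replace(".pdf", ".md")).write_bytes(md.encode())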

        THE SLEREXE COMPANY LIMITED
                    SAPORS LANE - BOOLE - DORSET - BH 25 8ER
                         TELEPHONE BOOLE (945 13) 51617 - TELEX 123456
          Our Ref. 350/PJC/EAC 18th January, 1972.
          Dr. P.N. Cundall,
         Mining Surveys Ltd.,
          Holroyd Road,
          Reading,
          Berks.
           Dear Pete,
            Permit me to introduce you to the facility of facsimile
          transmission.
             In facsimile a photocell is caused to perform a raster scan over
          the subject copy. The variations of print density on the document
          cause the photocell to generate an analogous electrical video signal.
          This signal is used to modulate a carrier, which is transmitted to a
          remote destination over a radio or cable communications link.
            At the remote terminal, demodulation reconstructs the video
          signal, which is used to modulate the density of print produced by a
         printing device. This device is scanning in a raster scan synchronised
         with that at the transmitting terminal. As a result, a facsimile
          copy of the subject document is produced.
            Probably you have uses for this facility in your organisation.
                               Yours sincerely,
           ThA.
                                 P.J. CROSS
                              Group Leader - Facsimile Research



-----


SBhat2615 · 2024-08-26T13:57:40Z

SBhat2615
Aug 26, 2024
Author

I had this parameter set:

write_images=True

Why does this affect the output?


JorjMcKie · 2024-08-26T14:19:02Z

JorjMcKie
Aug 26, 2024
Maintainer

Converted this to a discussion item.
An OCRed page still consists of an image covering the full page. The option you used writes that image to the image folder, references it in the MD output, and also outputs the text on it.

For OCRed pages it probably does not make much sense to do it this way, as there is always exactly one image, plus the text.
The way you did it, you should have received both the text and the image reference, provided you used a recent version of the package.
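
To make the "text plus image reference" output concrete, here is a minimal sketch that separates the two in a Markdown string. It assumes the image references use standard Markdown image syntax (`![alt](path)`); the helper name `split_image_refs` and the sample string are hypothetical and are not part of pymupdf4llm.

```python
import re

# Standard Markdown image syntax: ![alt text](path).
# Assumption: image references in the generated Markdown follow this form.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def split_image_refs(md: str) -> tuple[str, list[str]]:
    """Return (text without image references, list of referenced image paths)."""
    paths = IMAGE_REF.findall(md)
    text = IMAGE_REF.sub("", md)
    # Collapse the runs of blank lines left behind by removed references.
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return text, paths


# Hypothetical Markdown for a one-page OCRed scan: one full-page image plus text.
sample = "![](images/scan-page-0.png)\n\nTHE SLEREXE COMPANY LIMITED\n\nDear Pete,\n"
text, images = split_image_refs(sample)
print(images)
print(text)
```

For an OCRed scan the list of image paths would be expected to hold just the single full-page image, which is why writing images adds little in that case.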


SBhat2615 · 2024-08-26T15:01:24Z

SBhat2615
Aug 26, 2024
Author

Got it. The version I was using wasn't the latest. That's why I wasn't getting any output.
Thanks for your quick reply @JorjMcKie!
