[Feature Request] [PDF] Option to use PDF outlines (ToC metadata) for header extraction #107
Replies: 2 comments 1 reply
-
This is an excellent idea, and one I have been considering myself for some time. It comes with the same issue: what if the text found in the document is only almost equal to the text in the outline item? I keep thinking about it. Maybe both texts can be made (more) equal with some extra treatment before comparing, or by using regular expressions ...
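To make the "extra treatment before comparing" idea concrete, here is a minimal sketch. It assumes that differences in whitespace, case, and punctuation are the only ones that matter; `normalize` and `same_title` are hypothetical helper names, not part of any library.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Reduce a title to a comparison key.

    Applies Unicode NFKC normalization and case folding, then drops
    every character that is not a letter or a digit.
    """
    text = unicodedata.normalize("NFKC", text).casefold()
    # \W matches anything that is not a word character; the underscore
    # counts as a word character, so it is removed explicitly.
    return re.sub(r"[\W_]+", "", text)


def same_title(outline_title: str, extracted_text: str) -> bool:
    """Return True if both strings are equal after normalization."""
    return normalize(outline_title) == normalize(extracted_text)
```

With this, an outline item such as `4 Appendix` compares equal to an extracted `4Appendix`, while `5 Appendix` does not. Titles that differ in actual wording would still need a fuzzier comparison.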
-
I was also thinking about using regex, but I suck at it, so I ended up doing
Come to think of it, I could add a "\n" at the end to improve accuracy, but whatever :/ It works for my small set of documents (they all have consistent formatting), but it's more of a hack than a solution.
-
Some PDFs have built-in outlines (or chapters), so instead of relying on font size to detect headers, we could use those for much more accurate header extraction (when the outlines are available).
Header extraction based on font size can break when a title has multiple font sizes for aesthetic reasons.
For example, take a numbered chapter title which says `1 Introduction`, where the `1` is larger than `Introduction`. The extracted header would look like this:

I tried to implement it myself, but I'm not sure how I can map the outline text to the extracted text and append the appropriate hashtags, especially when the outline text is slightly different from the extracted text.
For example, an outline item `4 Appendix` may refer to the extracted text `4Appendix`.
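One possible way to do the mapping is sketched below, using only the Python standard library. It assumes the outline is already available as `(level, title)` pairs (PyMuPDF's `Document.get_toc()` returns entries of the form `[level, title, page]`, so the page would be dropped first) and that the extracted text is a list of lines. The names `apply_outline` and `_key` and the 0.9 threshold are illustrative assumptions, not an existing API.

```python
from difflib import SequenceMatcher


def _key(text: str) -> str:
    # Compare on case-folded letters and digits only, so that
    # "4 Appendix" and "4Appendix" produce the same key.
    return "".join(ch for ch in text.casefold() if ch.isalnum())


def apply_outline(lines, toc, threshold=0.9):
    """Prefix lines that match an outline title with '#' * level.

    lines: extracted text lines, in document order.
    toc: (level, title) pairs taken from the PDF outline.
    Each outline entry is used at most once.
    """
    pending = [(level, _key(title)) for level, title in toc]
    out = []
    for line in lines:
        key = _key(line)
        best_index, best_score = None, 0.0
        # Find the unused outline entry most similar to this line.
        for i, (_, wanted) in enumerate(pending):
            score = SequenceMatcher(None, wanted, key).ratio()
            if score > best_score:
                best_index, best_score = i, score
        if key and best_index is not None and best_score >= threshold:
            level, _ = pending.pop(best_index)
            out.append("#" * level + " " + line.strip())
        else:
            out.append(line)
    return out
```

The similarity threshold tolerates small differences such as a missing space or a single typo. A body line that happens to repeat a title closely would be marked as a header too, so restricting the search to the page given in the outline entry would be a sensible refinement.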