(Crosspost: marker#469) unusual bboxes for whitespace can affect span bboxes #38

@conjuncts

marker#469 seems to be an issue with pdftext.

Here is the raw pypdfium2 output for page 2 (index 1) of the linked PDF (result of text_page.get_text_range(40:44)):

(62.515220642089844, 89.5628662109375, 68.42025756835938, 94.87738037109375, '(a)')
(159.5265655517578, 89.28570556640625, 165.97390747070312, 94.87738037109375, '(b)')
(257.2007141113281, 89.5628662109375, 262.86474609375, 94.87738037109375, '(c)')
(354.03131103515625, 89.28570556640625, 360.41839599609375, 94.87738037109375, '(d)')

Note that these words are far apart horizontally (gaps of roughly 91 units between words that are only about 6 units wide), yet they are still placed in the same span.
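
The spacing can be checked directly from the four rectangles above. This is only a quick sketch: it assumes the first and third values of each tuple are the left and right edges, and the variable names are mine.

```python
# Rectangles copied from the pypdfium2 output above.
# Assumption: the first and third values are the horizontal extents.
rects = [
    (62.515220642089844, 89.5628662109375, 68.42025756835938, 94.87738037109375, "(a)"),
    (159.5265655517578, 89.28570556640625, 165.97390747070312, 94.87738037109375, "(b)"),
    (257.2007141113281, 89.5628662109375, 262.86474609375, 94.87738037109375, "(c)"),
    (354.03131103515625, 89.28570556640625, 360.41839599609375, 94.87738037109375, "(d)"),
]

# Width of each word and the empty gap to the next word.
widths = [r[2] - r[0] for r in rects]
gaps = [nxt[0] - cur[2] for cur, nxt in zip(rects, rects[1:])]

for r, w in zip(rects, widths):
    print(f"{r[4]}: width {w:.2f}")
for g in gaps:
    print(f"gap {g:.2f}")
```

Each gap comes out more than ten times wider than the words it separates.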

For the same text, here is the provider output that marker sees. The bbox extends far beyond the page boundaries (note the negative y coordinate).

[62.26817321777344, -175.2513427734375, 696.563720703125, 167.5281982421875]

Here are the relevant spans produced by dictionary_output():

{'bbox': [62.26817321777344, -175.2513427734375, 696.563720703125, 167.5281982421875], 'text': '(a) (b) (c) (d)', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_start_idx': 203, 'char_end_idx': 217, 'chars': [{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, ...], 'url': ''}
{'bbox': [447.8277587890625, 93.931396484375, 447.8277587890625, 93.931396484375], 'text': '\n', 'rotation': 0.0, 'font': {'name': '', 'flags': 0, 'size': 1.0, 'weight': -1}, 'char_start_idx': 218, 'char_end_idx': 219, 'chars': [{...}, {...}], 'url': ''}

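The culprit seems to be bad bboxes for whitespace. In the chars below, the space at char_idx 206 has a bbox roughly 343 units wide and 343 units tall, close to the reported font size of 327.68, while the neighboring glyphs are only about 6 units tall.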

{'bbox': [62.26817321777344, 89.1995849609375, 63.90109634399414, 95.22509765625], 'char': '(', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_idx': 203}
{'bbox': [63.89507293701172, 89.1995849609375, 67.08258819580078, 95.22509765625], 'char': 'a', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_idx': 204}
{'bbox': [67.02835845947266, 89.1995849609375, 68.66128540039062, 95.22509765625], 'char': ')', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_idx': 205}
{'bbox': [159.27951049804688, -175.2513427734375, 502.0589904785156, 167.5281982421875], 'char': ' ', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_idx': 206}
{'bbox': [159.27951049804688, 89.1995849609375, 160.9124298095703, 95.22509765625], 'char': '(', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_idx': 207}
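
To illustrate the effect, here is a minimal sketch that merges the five character bboxes above into a span bbox, once naively and once skipping whitespace. It assumes the span bbox is the union of its character bboxes. `union_bbox` and `span_bbox` are hypothetical helpers written for this example, not pdftext functions, and skipping whitespace is just one possible workaround, not a proposed fix.

```python
# Character bboxes copied from the output above, as [x0, y0, x1, y1].
chars = [
    {"char": "(", "bbox": [62.26817321777344, 89.1995849609375, 63.90109634399414, 95.22509765625]},
    {"char": "a", "bbox": [63.89507293701172, 89.1995849609375, 67.08258819580078, 95.22509765625]},
    {"char": ")", "bbox": [67.02835845947266, 89.1995849609375, 68.66128540039062, 95.22509765625]},
    {"char": " ", "bbox": [159.27951049804688, -175.2513427734375, 502.0589904785156, 167.5281982421875]},
    {"char": "(", "bbox": [159.27951049804688, 89.1995849609375, 160.9124298095703, 95.22509765625]},
]


def union_bbox(boxes):
    """Return the smallest [x0, y0, x1, y1] box enclosing all given boxes."""
    boxes = list(boxes)
    return [
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    ]


def span_bbox(chars, skip_whitespace=False):
    """Hypothetical helper (not pdftext API): merge char bboxes into one span bbox."""
    kept = [c for c in chars if not (skip_whitespace and c["char"].isspace())]
    # Fall back to all chars if the span is whitespace only.
    return union_bbox(c["bbox"] for c in (kept or chars))


naive = span_bbox(chars)
filtered = span_bbox(chars, skip_whitespace=True)
print("with whitespace:   ", naive)
print("without whitespace:", filtered)
```

With the whitespace included, the vertical extent of the merged box is taken entirely from the space character, which matches the y values of the span bbox reported above. Without it, the box stays within the 89.2 to 95.2 band of the visible glyphs.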
