marker#469 seems to be an issue with pdftext.
Here is the raw pypdfium2 output for page 2 (index 1) of the linked PDF (the result of text_page.get_text_range(40:44)):
(62.515220642089844, 89.5628662109375, 68.42025756835938, 94.87738037109375, '(a)')
(159.5265655517578, 89.28570556640625, 165.97390747070312, 94.87738037109375, '(b)')
(257.2007141113281, 89.5628662109375, 262.86474609375, 94.87738037109375, '(c)')
(354.03131103515625, 89.28570556640625, 360.41839599609375, 94.87738037109375, '(d)')
Note that these individual words have a lot of horizontal spacing, yet are still put in the same span.
For the same text, here is the bbox from the provider output that marker sees. It extends far beyond the page boundaries (note the negative y coordinate).
[62.26817321777344, -175.2513427734375, 696.563720703125, 167.5281982421875]
Here are the relevant spans produced by dictionary_output():
{'bbox': [62.26817321777344, -175.2513427734375, 696.563720703125, 167.5281982421875], 'text': '(a) (b) (c) (d)', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_start_idx': 203, 'char_end_idx': 217, 'chars': [{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, ...], 'url': ''}
{'bbox': [447.8277587890625, 93.931396484375, 447.8277587890625, 93.931396484375], 'text': '\n', 'rotation': 0.0, 'font': {'name': '', 'flags': 0, 'size': 1.0, 'weight': -1}, 'char_start_idx': 218, 'char_end_idx': 219, 'chars': [{...}, {...}], 'url': ''}
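The culprit seems to be bad bboxes for whitespace. In the chars below, the space at char_idx 206 has a bbox spanning y from -175.25 to 167.53, while its neighbors span 89.20 to 95.23.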
{'bbox': [62.26817321777344, 89.1995849609375, 63.90109634399414, 95.22509765625], 'char': '(', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_idx': 203}
{'bbox': [63.89507293701172, 89.1995849609375, 67.08258819580078, 95.22509765625], 'char': 'a', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_idx': 204}
{'bbox': [67.02835845947266, 89.1995849609375, 68.66128540039062, 95.22509765625], 'char': ')', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_idx': 205}
{'bbox': [159.27951049804688, -175.2513427734375, 502.0589904785156, 167.5281982421875], 'char': ' ', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_idx': 206}
{'bbox': [159.27951049804688, 89.1995849609375, 160.9124298095703, 95.22509765625], 'char': '(', 'rotation': 0.0, 'font': {'name': 'CIDFont+F3', 'flags': 524294, 'size': 327.67999267578125, 'weight': 3016}, 'char_idx': 207}
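To illustrate the effect, here is a minimal sketch that recomputes a span bbox as the union of its char bboxes, with and without whitespace chars. The helper `union_bbox` is hypothetical and is not part of pdftext or marker; the `chars` list only copies the `bbox` and `char` fields from the dump above. Skipping whitespace is one possible workaround, not necessarily the right fix.

```python
from typing import Iterable, List, Optional


def union_bbox(chars: Iterable[dict], skip_whitespace: bool = True) -> Optional[List[float]]:
    """Union of char bboxes as [x0, y0, x1, y1]; hypothetical helper, not a pdftext API."""
    boxes = [
        c["bbox"]
        for c in chars
        if not (skip_whitespace and c["char"].isspace())
    ]
    if not boxes:
        return None
    return [
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    ]


# bbox and char fields copied from the char dump above (char_idx 203 to 207).
chars = [
    {"bbox": [62.26817321777344, 89.1995849609375, 63.90109634399414, 95.22509765625], "char": "("},
    {"bbox": [63.89507293701172, 89.1995849609375, 67.08258819580078, 95.22509765625], "char": "a"},
    {"bbox": [67.02835845947266, 89.1995849609375, 68.66128540039062, 95.22509765625], "char": ")"},
    {"bbox": [159.27951049804688, -175.2513427734375, 502.0589904785156, 167.5281982421875], "char": " "},
    {"bbox": [159.27951049804688, 89.1995849609375, 160.9124298095703, 95.22509765625], "char": "("},
]

with_spaces = union_bbox(chars, skip_whitespace=False)
without_spaces = union_bbox(chars, skip_whitespace=True)
print("including whitespace:", with_spaces)
print("excluding whitespace:", without_spaces)
```

With the whitespace char included, the union inherits its y range of -175.25 to 167.53. With it excluded, the y range stays at 89.20 to 95.23, matching the visible glyphs.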