@glypt glypt commented Nov 7, 2025

PR docling-project/docling#2589 revealed a bug: the caption is given the same item number as the table, as seen in https://github.com/docling-project/docling/pull/2589/files#diff-0ff184cc09560c89eb50dc9cf939c40bdf55d372d26823c956b55fb8f6c5e16fR1-R5.

This PR introduces global numbering that increments at each item.

Contributor

github-actions bot commented Nov 7, 2025

DCO Check Passed

Thanks @glypt, all your commits are properly signed off. 🎉

dosubot bot commented Nov 7, 2025

Related Documentation

Checked 3 published document(s) in 1 knowledge base(s). No updates required.

mergify bot commented Nov 7, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

codecov bot commented Nov 10, 2025

Codecov Report

❌ Patch coverage is 58.33333% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling_core/types/doc/document.py | 58.33% | 5 Missing ⚠️ |

@ceberam ceberam self-assigned this Nov 10, 2025
Collaborator

ceberam commented Nov 11, 2025

Thanks @glypt for your contributions and for suggesting this PR.
However, the issue you described is not an actual bug. The code works as designed.
The indented text serialization has the goal of exposing the document hierarchy in a succinct yet visual manner. In the serialization, we iterate over the document items through the line:

for i, (item, level) in enumerate(self.iterate_items(with_groups=True)):

The index i in the enumeration distinguishes those items and helps us track their order in the Docling document (typically the reading order of the original document).
For TextItem objects, we show their indent level and the beginning of the text.
For TableItem (a table) and PictureItem (a picture), in addition to their indent level, we also show their caption (if it exists), since it may help identify that table or picture. The caption line should not be interpreted as a separate node item.
With this consideration, the example output that you marked as a bug is actually correct:

item-0 at level 0: unspecified: group _root_
  item-1 at level 1: section: group sheet: Duck Observations
    item-2 at level 2: caption: Number of freshwater ducks per year
    item-3 at level 2: table with [7x2]
      item-3 at level 3: caption: Number of freshwater ducks per year

This shows that the item at index 2 is a caption. Right after that, we find a table at index 3, which has a certain caption.
The line item-3 at level 3: caption: Number of freshwater ducks per year should be read as: item at index 3 is indented at level 3 and has a caption starting with the following text: Number of freshwater ducks per year.
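
To make the numbering concrete, here is a minimal sketch of the idea. It is not the docling-core implementation: `Node`, `walk` and `render` are hypothetical stand-ins (with `walk` playing the role of `iterate_items`), and the labels are simplified strings. It only illustrates that a caption line reuses the index of the item it describes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a document item."""
    label: str
    caption: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, level: int = 0) -> Iterator[Tuple[Node, int]]:
    """Yield (node, level) pairs in reading order (stand-in for iterate_items)."""
    yield node, level
    for child in node.children:
        yield from walk(child, level + 1)


def render(root: Node) -> List[str]:
    """Build the indented text lines, one per item plus one per caption."""
    lines = []
    for i, (node, level) in enumerate(walk(root)):
        lines.append(f"{'  ' * level}item-{i} at level {level}: {node.label}")
        if node.caption is not None:
            # The caption is extra detail about item i, not a new item,
            # so it reuses index i and is only indented one level deeper.
            lines.append(
                f"{'  ' * (level + 1)}item-{i} at level {level + 1}: "
                f"caption: {node.caption}"
            )
    return lines


doc = Node("unspecified: group _root_", children=[
    Node("section: group sheet: Duck Observations", children=[
        Node("caption: Number of freshwater ducks per year"),
        Node("table with [7x2]", caption="Number of freshwater ducks per year"),
    ]),
])
print("\n".join(render(doc)))
```

Under these assumptions the sketch prints the same five lines as the example above, with the table and its caption line both labeled item-3.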

If we applied the changes in your PR, the serialization would look like:

item-0 at level 0: unspecified: group _root_
  item-1 at level 1: section: group sheet: Duck Observations
    item-2 at level 2: caption: Number of freshwater ducks per year
    item-3 at level 2: table with [7x2]
      item-4 at level 3: caption: Number of freshwater ducks per year

and it would therefore count more items than there are in the document, since the caption would be counted as two different items.
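
The difference in counting can be sketched as follows. This is an illustration only, assuming the PR's counter advances on every printed line, as the output above suggests. The flat `items` list and both helper functions are hypothetical and are not part of docling-core.

```python
from typing import List, Optional, Tuple

# Hypothetical flat view of the example: (label, caption of that item or None),
# listed in reading order.
items: List[Tuple[str, Optional[str]]] = [
    ("group _root_", None),
    ("group sheet: Duck Observations", None),
    ("caption: Number of freshwater ducks per year", None),
    ("table with [7x2]", "Number of freshwater ducks per year"),
]


def last_index_shared(entries: List[Tuple[str, Optional[str]]]) -> int:
    """Current scheme: a caption line reuses the index of its item."""
    return len(entries) - 1


def last_index_global(entries: List[Tuple[str, Optional[str]]]) -> int:
    """Assumed PR scheme: every printed line, caption lines included, gets a new index."""
    counter = -1
    for _label, caption in entries:
        counter += 1  # the item itself
        if caption is not None:
            counter += 1  # the extra caption line of that item
    return counter


print(len(items), last_index_shared(items), last_index_global(items))
```

With four items, the shared scheme ends at index 3, while the global counter ends at index 4, which reads as if the document held five items.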

Please let me know if the explanation is reasonable.

@ceberam ceberam added the invalid This doesn't seem right label Nov 11, 2025