Skip to content

Kylel/2023/hardening codebase#259

Open
kyleclo wants to merge 25 commits into
mainfrom
kylel/2023/spangroup-overlap-bug
Open

Kylel/2023/hardening codebase#259
kyleclo wants to merge 25 commits into
mainfrom
kylel/2023/spangroup-overlap-bug

Conversation

@kyleclo

@kyleclo kyleclo commented Jul 5, 2023

Copy link
Copy Markdown
Contributor

Big PR. A lot of it is actually just auto linting/formatting. Here are main changes:

  1. For Box, I added some assertions to make sure problematic Boxes are caught at creation time:
        if w < 0 or h < 0:
            raise ValueError(f"Width and height must be non-negative, got {w} and {h}")
        if page < 0:
            raise ValueError(f"Page must be non-negative, got {page}")
        if l < 0 or t < 0:
            raise ValueError(f"Left and top must be non-negative, got {l} and {t}")

This has been helpful for debugging issues & the library assumes "nice" Boxes.

  1. For Box, I added another assertion for the is_overlap() method:
    def is_overlap(
        self, other: "Box", x: float = 0.0, y: float = 0, center: bool = False
    ) -> bool:
         ...
        if self.page != other.page:
            return False
  1. For Span, I added a merge_boxes: bool = True option to small_spans_to_big_span(). This should preserve the default behavior. But gives us the option to not do that and just create a big Span that has no larger Box. I would prefer that Not-Merging is actually the default behavior, but I think that might break too many things, so maybe in the future.

  2. For Box and Span, I added utility method cluster_boxes() and cluster_spans(). It's based on overlap.

  3. For Box, I also added utility method shrink(). It's not used anywhere now, but it's helpful for debugging.

  4. For BoxGroup and SpanGroup, I've added a new flag in the constructor: allow_overlap: Optional[bool] = False. This should preserve default behavior, which is that it disallows Box or Span within itself to have overlaps. For example, this should catch if a single BoxGroup has duplicate Boxes or a single SpanGroup has duplicate Spans. Otherwise, classes behave as they originally did.

  5. For Indexers, I added a new BoxGroupIndexer class that behaves similarly to SpanGroupIndexer. It's not used, but it was helpful for debugging.

  6. Besides the above library changes, almost everything else I added was a missing test.

@kyleclo kyleclo requested review from cmwilhelm and regan-huff and removed request for cmwilhelm and regan-huff July 5, 2023 18:59
@kyleclo kyleclo changed the title Kylel/2023/spangroup overlap bug Kylel/2023/hardening codebase Jul 6, 2023
@kyleclo kyleclo marked this pull request as ready for review July 7, 2023 00:14

@cmwilhelm cmwilhelm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes generally make sense to me; I called out the places I believe to be breaking and have suggested alternatives.

Whenever this is good to go I also suggest we merge/rebase in the latest main (to get CI to rerun) and bump the MMDA version in pyproject.toml

id: Optional[int] = None,
doc: Optional["Document"] = None,
metadata: Optional[Metadata] = None,
allow_overlap: Optional[bool] = False,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allow_overlap=False as the default is a behavioral change and will almost certainly break stuff in practice. Can we switch the default to True?

id: Optional[int] = None,
doc: Optional["Document"] = None,
metadata: Optional[Metadata] = None,
allow_overlap: Optional[bool] = False,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same story here as BoxGroup -- False as the default is a breaking change in behavior, and there's no telling how much data we already have that violates it.

Comment thread src/mmda/types/box.py
@property
def xywh(self) -> Tuple[float, float, float, float]:
"""Return a tuple of the (left, top, width, height) format."""
return self.l, self.t, self.w, self.h

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave the method in but add a logging.warning() with a deprecation notice. This may break clients of mmda that use .xywh

Comment thread src/mmda/types/box.py
center (bool) if True, only consider overlapping if this box's center is contained by other
"""
if self.page != other.page:
return False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯

f"Detected overlap with existing SpanGroup(s) {matches} for {span_group}"
f"Detected overlap! While processing the Span {span} as part of query SpanGroup {span_group.to_json()}, we found that it overlaps with existing SpanGroup(s):\n"
+ "\n".join(
[f"\t{i}\t{m} " for i, m in zip(match_ids, matches)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is much more informative output.

Comment thread src/mmda/types/box.py
if page < 0:
raise ValueError(f"Page must be non-negative, got {page}")
if l < 0 or t < 0:
raise ValueError(f"Left and top must be non-negative, got {l} and {t}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you find any evidence of these negatively positioned or dimensioned boxes when you were running through things? As with the allow_overlap flags in SpanGroup and BoxGroup, I think it won't be possible to audit what existing data already violates these constraints

@regan-huff regan-huff left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love for this PR to include or be accompanied by documentation for the current best expectations for how SpanGroups and BoxGroups will be used.

I also support the more conservative approach Chris suggests of leaving the defaults in a state that will allow existing/old data to be loaded from the annotation store without throwing an error. But perhaps the documentation can explain that the defaults are for backwards compatibility.

I think what we'll try to do to handle the possible impact of these model changes is attempt to update the mmda version for the whole Semantic Reader more or less simultaneously, so we can hopefully have a tight response loop if we do discover some data incompatibilities with the updated mmda models.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants