Kylel/2023/hardening codebase#259
Conversation
cmwilhelm
left a comment
There was a problem hiding this comment.
Changes generally make sense to me; I called out the places I believe to be breaking and have suggested alternatives.
Whenever this is good to go I also suggest we merge/rebase in the latest main (to get CI to rerun) and bump the MMDA version in pyproject.toml
| id: Optional[int] = None, | ||
| doc: Optional["Document"] = None, | ||
| metadata: Optional[Metadata] = None, | ||
| allow_overlap: Optional[bool] = False, |
There was a problem hiding this comment.
allow_overlap=False as the default is a behavioral change and will almost certainly break stuff in practice. Can we switch the default to True?
| id: Optional[int] = None, | ||
| doc: Optional["Document"] = None, | ||
| metadata: Optional[Metadata] = None, | ||
| allow_overlap: Optional[bool] = False, |
There was a problem hiding this comment.
Same story here as BoxGroup -- False as the default is a breaking change in behavior, and there's no telling how much data we already have that violates it.
| @property | ||
| def xywh(self) -> Tuple[float, float, float, float]: | ||
| """Return a tuple of the (left, top, width, height) format.""" | ||
| return self.l, self.t, self.w, self.h |
There was a problem hiding this comment.
Let's leave the method in but add a logging.warning() with a deprecation notice. This may break clients of mmda that use .xywh
| center (bool) if True, only consider overlapping if this box's center is contained by other | ||
| """ | ||
| if self.page != other.page: | ||
| return False |
| f"Detected overlap with existing SpanGroup(s) {matches} for {span_group}" | ||
| f"Detected overlap! While processing the Span {span} as part of query SpanGroup {span_group.to_json()}, we found that it overlaps with existing SpanGroup(s):\n" | ||
| + "\n".join( | ||
| [f"\t{i}\t{m} " for i, m in zip(match_ids, matches)] |
There was a problem hiding this comment.
Thanks this is much more informative output.
| if page < 0: | ||
| raise ValueError(f"Page must be non-negative, got {page}") | ||
| if l < 0 or t < 0: | ||
| raise ValueError(f"Left and top must be non-negative, got {l} and {t}") |
There was a problem hiding this comment.
Did you find any evidence of these negatively positioned or dimensioned boxes when you were running through things? As with the allow_overlap flags in SpanGroup and BoxGroup, I think it won't be possible to audit what existing data already violates these constraints
regan-huff
left a comment
There was a problem hiding this comment.
I would love for this PR to include or be accompanied by documentation for the current best expectations for how SpanGroups and BoxGroups will be used.
I also support the more conservative approach Chris suggests of leaving the defaults in a state that will allow existing/old data to be loaded from the annotation store without throwing an error. But perhaps the documentation can explain that the defaults are for backwards compatibility.
I think what we'll try to do to handle the possible impact of these model changes is attempt to update the mmda version for the whole Semantic Reader more or less simultaneously, so we can hopefully have a tight response loop if we do discover some data incompatibilities with the updated mmda models.
Big PR. A lot of it is actually just auto linting/formatting. Here are main changes:
Box, I added some assertions to make sure problematic Boxes are caught at creation time:This has been helpful for debugging issues & the library assumes "nice" Boxes.
Box, I added another assertion for theis_overlap()method:For
Span, I added amerge_boxes: bool = Trueoption tosmall_spans_to_big_span(). This should preserve the default behavior. But gives us the option to not do that and just create a bigSpanthat has no largerBox. I would prefer that Not-Merging is actually the default behavior, but I think that might break too many things, so maybe in the future.For
BoxandSpan, I added utility methodcluster_boxes()andcluster_spans(). It's based on overlap.For
Box, I also added utility methodshrink(). It's not used anywhere now, but it's helpful for debugging.For
BoxGroupandSpanGroup, I've added a new flag in the constructor:allow_overlap: Optional[bool] = False. This should preserve default behavior, which is that it disallowsBoxorSpanwithin itself to have overlaps. For example, this should catch if a singleBoxGrouphas duplicateBoxesor a singleSpanGrouphas duplicateSpans. Otherwise, classes behave as they originally did.For
Indexers, I added a newBoxGroupIndexerclass that behaves similarly toSpanGroupIndexer. It's not used, but it was helpful for debugging.Besides the above library changes, almost everything else I added was a missing test.