adding relations#158
Conversation
… the same object, just the values need to be the same after to/from json
…t instantiate spangroups
…; base Relation class on storage of these names; define to and from_json
soldni
left a comment
There was a problem hiding this comment.
The overall design seems good to me! I don't quite understand why we need AnnotationName classes though. What does the extra overhead of this class get us?
| # TODO[kylel] - when does this ever get called? infinite loop? | ||
| return self.__getattribute__(field) |
There was a problem hiding this comment.
This SO answer has a good explainer. I don't think you want to keep this here--- __getattribute__ is called before __getattr__.
|
|
||
| @property | ||
| def type(self) -> str: | ||
| logging.warning(msg='`.type` to be deprecated in future versions. Use `.metadata.type`') |
There was a problem hiding this comment.
nit: would not use the root logger. I would add
Logger = logger.getLogger(__file__)after imports, and then call Logger.warning instead.
| # TODO[kylel] - when does this ever get called? infinite loop? | ||
| return self.__getattribute__(field) |
There was a problem hiding this comment.
You probably can remove it; this stack overflow post has a good explanation, but tl;dr is that __getattribute__ supersedes __getattr__ if defined, so you will never be in a situation where __getattribute__ should be called after __getattr__.
I suspect that the function of this call here is to raise an error if field is not defined; if that is the case, I would explicitly raise an AttributeError instead with a more descriptive error message.
| boxes: List[Box], | ||
| id: Optional[int] = None, | ||
| doc: Optional['Document'] = None, | ||
| field: Optional[str] = None, | ||
| metadata: Optional[Metadata] = None, |
There was a problem hiding this comment.
Not related to this PR specifically, but I think we should document these arguments in a docstring.
There was a problem hiding this comment.
yea we'll need that
|
|
||
| return doc | ||
|
|
||
| def locate_annotation(self, name: AnnotationName) -> Annotation: |
There was a problem hiding this comment.
I am not quite sure on why we need AnnotationName. If a name is just a combo of relation name + id, can we just expect a tuple of the two here? or require two args? Overall, having a class to just hold annotation names would just result in computational overhead.
Without the class, we would need to code somewhere how IDs are constructed in the library. For now, it's As well, we need some way to parse this ID for use in lookup of that specific element within a Document. I don't want |
| def from_str(cls, s: str) -> 'AnnotationName': | ||
| field, id = s.split('-') | ||
| id = int(id) | ||
| return AnnotationName(field=field, id=id) |
There was a problem hiding this comment.
| return AnnotationName(field=field, id=id) | |
| return cls(field=field, id=id) |
Prevents issues when inheriting (if we ever decide to).
|
|
||
| class AnnotationName: | ||
| """Stores a name that uniquely identifies this Annotation within a Document""" | ||
|
|
There was a problem hiding this comment.
| __slots__ = ("field", "id") | |
Speeds up class creation by roughly 20%:
>>> %timeit AnnotationName('relation', 1)
140 ns ± 0.391 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
# after adding slots
>>> %timeit AnnotationName('relation', 1)
115 ns ± 0.163 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
@kyleclo Sounds good! added two small suggestions to improve it, but otherwise ok to merge! |
This PR extends this library functionality substantially -- Adding a new Annotation type called Relation. A Relation is a link between 2 annotations (e.g. a Citation linked to its Bib Entry). The input Annotations are called
keyandvalue.A few things needed to change to support Relations:
Annotation NamesRelations store references to Annotation objects. But we didn't want
Relation.to_json()to also.to_json()those objects. We only want to store minimal identifiers of thekeyandvalue. Something short likebib_entry-5orsentence-13. We call these short stringsnames.To do this, we added to Annotation class, an optional attribute called
field: strwhich stores this name. It's automatically populated when you runDocument.annotate(new_field = list_of_annotations); each of those input annotations will have the new field name stored under.field.We also added a method
name()that returns the name of a particular Annotation object that is unique at the document-level. Names are a minimal class that basically stores.fieldand.id.In short, now after you annotate a Document with annotations, you can do stuff like:
Lookups based on names
To support reconstructing a Relation object given the names of
keyandvalue, we need the ability to lookup those involved Annotations. We introduce a new method to enable this:to and from JSON
Finally, we need some way to serializing to JSON and reconstructing from JSON. For serialization, now that we have Names, this makes the JSON quite minimal:
Reconstructing a Relation from JSON is more tricky because it's meaningless without a Document object. The Document object must also store the specific Annotations correctly so we can correctly perform the lookup based on these Names.
The API for this is similar, but you must also pass in the Document object: