Skip to content

Conversation

@BurnzZ
Copy link
Contributor

@BurnzZ BurnzZ commented Jan 27, 2023

This is related to #115 which implements approach 3.

swap_item_cls is omitted for now to reduce its complexity. We can add it later.

TODO:

  • docs
  • changelog

@codecov
Copy link

codecov bot commented Jan 27, 2023

Codecov Report

Merging #118 (d07f361) into feat-fields-disabled (bedf6c5) will decrease coverage by 0.40%.
The diff coverage is 95.83%.

❗ Current head d07f361 differs from pull request most recent head a493052. Consider uploading reports for the commit a493052 to get more accurate results

Additional details and impacted files
@@                   Coverage Diff                    @@
##           feat-fields-disabled     #118      +/-   ##
========================================================
- Coverage                 99.57%   99.18%   -0.40%     
========================================================
  Files                        25       25              
  Lines                       937      976      +39     
========================================================
+ Hits                        933      968      +35     
- Misses                        4        8       +4     
Impacted Files Coverage Δ
web_poet/pages.py 96.29% <93.54%> (-3.71%) ⬇️
web_poet/fields.py 99.07% <97.50%> (+0.13%) ⬆️
web_poet/__init__.py 100.00% <100.00%> (ø)
web_poet/_typing.py 90.47% <0.00%> (-9.53%) ⬇️

@BurnzZ BurnzZ mentioned this pull request Jan 30, 2023
1 task
@BurnzZ BurnzZ force-pushed the feat-select-fields branch from d07f361 to fee1e3f Compare January 30, 2023 11:26
@BurnzZ BurnzZ changed the base branch from master to feat-fields-disabled January 30, 2023 11:26
@BurnzZ BurnzZ force-pushed the feat-select-fields branch from fee1e3f to e1e2451 Compare February 1, 2023 02:13
@property
def fields_to_extract(self) -> Iterable[str]:
"""Returns an Iterable of field names which should populate the designated
``item_cls``.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm proposing to expose the fields_to_extract property method since it's a handy utility for frameworks like scrapy-poet to see which fields to populate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we choose a name in line with the corresponding attribute of @field to exclude or include fields from to_item output? (e.g. to_item_fields)


unknown_fields = set(fields) - set(page_obj_fields.keys()).union({"*"})
if unknown_fields:
self._handle_unknown_field(unknown_fields, select_fields.on_unknown_field)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handling of unknown fields seems to be more optimal to be placed here as opposed to something like item_from_fields() which could've already extracted some fields (which could be expensive, i.e., having additional request) before erroring out.

item_cls: Type[T] = dict, # type: ignore[assignment]
*,
skip_nonitem_fields: bool = False,
field_names: Optional[Iterable[str]] = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an unknown field has been unintentionally passed, it'd error out with:

# omitted stack trace

File ~/dev/z/web-poet/web_poet/fields.py:235, in <dictcomp>(.0)
    233     if skip_nonitem_fields:
    234         field_names = _without_unsupported_field_names(item_cls, field_names)
--> 235 return item_cls(**{name: getattr(obj, name) for name in field_names})

AttributeError: 'SomePage' object has no attribute 'unknown_field'

I think this is descriptive enough and we don't need additional handling of unknown fields here. Handling it in the .fields_to_extract property method should be enough.

field_names=self.fields_to_extract,
)

def _get_select_fields(self) -> Optional[SelectFields]:
Copy link
Contributor Author

@BurnzZ BurnzZ Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_select_fields() is not being cached.

I think this should be okay since the user might want to re-use the same page instance but with different SelectFields later on:

page = SomePage(SelectFields(fields={"*": False, "x": True}))
item1 = page.to_item()

page.select_fields = SelectFields(fields={"*": False, "x": True, "y": True})
item2 = page.to_item()

Any thoughts about this usage?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK with not caching.

However, I think we should avoid covering this scenario in the docs, and I am not sure if we should even cover it in tests. Changing the value of select_fields does not feel like a very elegant API to me.

@BurnzZ BurnzZ changed the title WIP: Introduce SelectFields Introduce SelectFields Feb 1, 2023
@BurnzZ BurnzZ marked this pull request as ready for review February 1, 2023 04:40
@BurnzZ BurnzZ requested review from Gallaecio, kmike and wRAR February 1, 2023 04:40
assert page.fields_to_extract == ["y"]
assert page.call_counter == {"y": 1}

# boolean-like values are not supported. They are simply ignored and the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should raise an error, not ignore wrong values.

assert page.call_counter == {"x": 1, "y": 1}

# If a non-SelectFields instance was passed to the `select_fields` init
# parameter, it would simply ignore it and use the default field directives.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes more sense to me to raise an error in this case.

#:
#: Setting it to ``"raise"`` would raise an :class:`AttributeError`,
#: ``"warn"`` produces a :class:`UserWarning`, while ``"ignore"`` does nothing.
on_unknown_field: UnknownFieldActions = "raise"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! Could you please elaborate on the use cases for on_unknown_field?

Also, what does unknown field mean? Does it mean that the field is not in the item, or does it mean that the field is not defined on the page object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean that the field is not defined on the page object?

It's mostly regarding about this but because of the parity between the page object and its item class, the field should generally not be in the item as well.

The unknown field can be derived by something like set(fields_to_include) - set(fields_to_exclude). This means unknown fields that are to be excluded won't be considered as actual unknown fields.

I'm also thinking of removing this parameter and simply raise any unknown fields by default. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I see.

It seems the case with field not in the item and a field not in the page object are very different.

If a field is not in the item, we should probably raise an exception. But there is a case where skip_nonitem_fields=True is used; in this case it actually could make sense to have fields in SelectFields which are not in the item.

If a field is not defined in the page object, but is present in the item, then final item.field returned by to_item is going to have a default value, None in most cases. It could also be the case that the attribute is populated in to_item method without using web-poet fields. It's not an error. It seems that we should allow to pick any fields which are present in the item, even if they're not fields of the page object.

We should have a clear answer though what happens when some of the item attributes are extracted using @fields, and some of them are added later, in the to_item method.

There is also a complicated case where there is a field defined, but then the value assigned to the final item is changed in to_item method. Probably it's fine not to support this case, but we should at least document it.

Overall the problem looks more complicated than I initially thought :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I suggested such an option, I was thinking of a scenario like this one:

B team writes BPage, subclassing APage from A team, and returning BItem, which does not subclass AItem. B team might not want new fields in APage to break BPage. But they might want a warning or error to look into it eventually in case we want to do something about it.

But it can indeed be more complex than that. When a field is removed, B team may want a different resolution. For example, they might want a warning for new fields, but an exception for missing fields.

@BurnzZ
Copy link
Contributor Author

BurnzZ commented Mar 1, 2023

Closing in favor of #138

@BurnzZ BurnzZ closed this Mar 1, 2023
@wRAR wRAR deleted the feat-select-fields branch May 24, 2025 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants