-
Notifications
You must be signed in to change notification settings - Fork 150
Optimize re-sync content #7754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Optimize re-sync content #7754
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| Optimized sync with content caching in `QueryExistingContents` and query fix in `QueryExistingArtifacts`. |
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -30,8 +30,32 @@ class QueryExistingContents(Stage): | |||||||||
|
|
||||||||||
| This stage drains all available items from `self._in_q` and batches everything into one large | ||||||||||
| call to the db for efficiency. | ||||||||||
|
|
||||||||||
| When `repo_version` is provided, content from that version is cached in memory on first | ||||||||||
| access (per content type) so that repeat syncs can resolve most items without per-batch | ||||||||||
| database queries. | ||||||||||
|
|
||||||||||
| Plugins with expensive fields that are not needed during sync can pass | ||||||||||
| `deferred_fields` - a mapping of content model class to a list of field names | ||||||||||
| to exclude from queries via Django's `defer()`. | ||||||||||
| """ | ||||||||||
|
|
||||||||||
| def __init__(self, repo_version=None, deferred_fields=None, *args, **kwargs): | ||||||||||
| super().__init__(*args, **kwargs) | ||||||||||
| self._repo_version = repo_version | ||||||||||
| self._deferred_fields = deferred_fields or {} | ||||||||||
| self._content_cache = {} | ||||||||||
|
|
||||||||||
| def _ensure_type_cache(self, model_type): | ||||||||||
| if model_type not in self._content_cache and self._repo_version is not None: | ||||||||||
| deferred = self._deferred_fields.get(model_type, ()) | ||||||||||
| cache = {} | ||||||||||
| for content in self._repo_version.get_content( | ||||||||||
| model_type.objects.defer(*deferred) | ||||||||||
| ).iterator(): | ||||||||||
| cache[content.natural_key()] = content | ||||||||||
| self._content_cache[model_type] = cache | ||||||||||
|
|
||||||||||
| async def run(self): | ||||||||||
| """ | ||||||||||
| The coroutine for this stage. | ||||||||||
|
|
@@ -42,25 +66,46 @@ async def run(self): | |||||||||
| async for batch in self.batches(): | ||||||||||
| content_q_by_type = defaultdict(lambda: Q(pk__in=[])) | ||||||||||
| d_content_by_nat_key = defaultdict(list) | ||||||||||
| cache_hits_by_type = defaultdict(set) | ||||||||||
|
|
||||||||||
| for d_content in batch: | ||||||||||
| if d_content.content._state.adding: | ||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Theoretically content could be passed through already saved, in which case I think we're probably not
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not sure I understand this. Why do we need to call
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, it basically resets the orphan cleanup protection timer. |
||||||||||
| model_type = type(d_content.content) | ||||||||||
| unit_q = d_content.content.q() | ||||||||||
| content_q_by_type[model_type] = content_q_by_type[model_type] | unit_q | ||||||||||
| d_content_by_nat_key[d_content.content.natural_key()].append(d_content) | ||||||||||
|
|
||||||||||
| nat_key = d_content.content.natural_key() | ||||||||||
| await sync_to_async(self._ensure_type_cache)(model_type) | ||||||||||
| cached = self._content_cache.get(model_type, {}).get(nat_key) | ||||||||||
| if cached is not None: | ||||||||||
| d_content.content = cached | ||||||||||
| cache_hits_by_type[model_type].add(cached.pk) | ||||||||||
| else: | ||||||||||
| unit_q = d_content.content.q() | ||||||||||
| content_q_by_type[model_type] = content_q_by_type[model_type] | unit_q | ||||||||||
| d_content_by_nat_key[nat_key].append(d_content) | ||||||||||
|
|
||||||||||
| db_results_by_type = defaultdict(list) | ||||||||||
| for model_type, content_q in content_q_by_type.items(): | ||||||||||
| try: | ||||||||||
| await sync_to_async(model_type.objects.filter(content_q).touch)() | ||||||||||
| except AttributeError: | ||||||||||
| raise TypeError( | ||||||||||
| "Plugins which declare custom ORM managers on their content classes " | ||||||||||
| "should have those managers inherit from " | ||||||||||
| "pulpcore.plugin.models.ContentManager." | ||||||||||
| ) | ||||||||||
| deferred = self._deferred_fields.get(model_type, ()) | ||||||||||
| async for result in sync_to_async_iterable( | ||||||||||
| model_type.objects.filter(content_q).iterator() | ||||||||||
| model_type.objects.filter(content_q).defer(*deferred).iterator() | ||||||||||
| ): | ||||||||||
| db_results_by_type[model_type].append(result) | ||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Move this here and delete the added loops on line 107-108.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But this would cause "swap before touch" which we wanted to avoid.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We need to verify that it's OK to swap before touching. It MIGHT be ok, but IIRC the |
||||||||||
|
|
||||||||||
| all_types = set(cache_hits_by_type.keys()) | set(db_results_by_type.keys()) | ||||||||||
| for model_type in all_types: | ||||||||||
| pks = cache_hits_by_type.get(model_type, set()) | ||||||||||
| pks = pks | {r.pk for r in db_results_by_type.get(model_type, [])} | ||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||
| if pks: | ||||||||||
| try: | ||||||||||
| await sync_to_async(model_type.objects.filter(pk__in=pks).touch)() | ||||||||||
| except AttributeError: | ||||||||||
| raise TypeError( | ||||||||||
| "Plugins which declare custom ORM managers on their content classes " | ||||||||||
| "should have those managers inherit from " | ||||||||||
| "pulpcore.plugin.models.ContentManager." | ||||||||||
| ) | ||||||||||
|
|
||||||||||
| for model_type, results in db_results_by_type.items(): | ||||||||||
| for result in results: | ||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A little worried that the additional loops here would outweigh the reduced # of queries
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can rework it, but that would result in "swap before touch". Alternatively, I can revert the last commit and keep the two queries. Which option would you prefer?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mdellweg I feel like it was either you or Dennis that dealt with the |
||||||||||
| for d_content in d_content_by_nat_key[result.natural_key()]: | ||||||||||
| d_content.content = result | ||||||||||
|
|
||||||||||
|
|
||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is
deferredthe right way? Models can have a lot of fields, but only a few are ever used during a sync. Wouldn'tonlybe the more practical option?Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See this discussion (hidden because it was marked resolved) #7754 (comment)
TL;DR yes
.only()could be theoretically more efficient, but it could also easily cause a regression without plugin intervention. You have to keep track of which fields are being used rigorously.OTOH
.defer()would not cause a regression, and since it's really only a small number of fields on a small number of models that are a problem, it is less invasive.