[IMP] orm: parallel iter_browse #320
Conversation
upgradeci retry with always only base in all versions
force-pushed from 866b0fe to c484df4
force-pushed from c484df4 to efa83ee
To simplify the code, I think we should check `UPG_PARALLEL_ITER_BROWSE` only once.
Proposal: get rid of the hack injecting params. Code is simpler, easier to read, and less hacky.
It is true that it looks that way, but there's one big disadvantage: it reduces the possible use-cases, because now everything in there (including …
Do you have a use case where we need to pass more complex objects? I think once we have it, we can make this hacky.
Honestly, I find it more hacky to pass around all the constant parameters through IPC for each task than to store them away once before …
You are missing the point. In your analysis, which isn't wrong, you're tying the code to the internal implementation of the standard Python library. That's hacky (not to mention attaching values to functions). The interface of … I take it that perhaps you meant "less performant" instead of "more hacky"; with that I could agree. Yet the performance improvement is probably meaningless given the time the rest would take. EDIT: I just don't like hacky stuff if there is a clearer way to express something. I agree that sometimes we need to sacrifice "clarity" for the sake of higher goals. I don't see them here.
What do you mean? The fact that it forks? This whole thing is tied to that, as are all of our other uses of ProcessPoolExecutor; that's why #320 (comment) is important. I also wanted simplicity in the callback signature, to avoid as many serialisation issues as we can. OTOH, your suggestion is more readable and easier to understand. I think both are valid goals, and they are only in conflict here because … Anyway. It is true that input to this code, if used, has to be chosen very carefully for all kinds of reasons. So maybe it is a good idea to restrict, in this way, the arguments that can be passed to the model attr being called, for the purpose of limiting the use-cases to simpler ones that hopefully have a lower risk of dangerous use.
No, the fact that … Once more, I'm OK with this. You could probably include some comments about why you chose to do this. I'd say something like: "we want to avoid the overhead of serializing the same parameters for all ids via pickle in the map implementation. Warning: we need to clean them up in _end to avoid any potential conflicts". Note that we need to keep in mind --some testing would be nice-- what happens for multiple calls to iter_browse with multiprocessing within the same run (upgrade step). I don't see anything problematic at the moment.
Yes. We should honor that --if we can--, that's the point.
force-pushed from d16c3f9 to 37353dc
force-pushed from 37353dc to ef79dc6
force-pushed from c592fe8 to 7a16d08
force-pushed from 546ff12 to d1b8479
This allows the caller to stay memory-efficient on huge numbers of ids, so that even many millions of records can be browsed.
In some cases, e.g. if it is known that calling a certain method on the model will only trigger inserts, or it is clear that updates will be disjoint, such method calls can be done in parallel.
Like the same support added to `__attr__` in the parent commit, this can only be used by callers when it is known that database modifications will be disjoint, not causing concurrency issues or side-effects on the results. `create` returns an `iter_browse` object for the caller to browse created records. With the multiprocessing strategy, we make the following changes to it:
- To support vast amounts of created records, we process values in a generator and initialize the returned `iter_browse` object with it. As this requires the caller of `create` to always consume/iterate the result (otherwise records will not be created), it is not applied to the other strategies, as that would break the existing API.
- Make `__iter__` yield chunks if the strategy is multiprocessing. This way a caller can process chunks of freshly created records, e.g. `for records in util.iter_browse(strategy="multiprocessing").create(SQLStr)`, and since everything from input to output is a generator, it will be perfectly memory-efficient.
- Do not pass the logger to the returned `iter_browse` object from `create` if the strategy is multiprocessing, because it would lead to interleaved logging from the input generator and this one when the caller iterates it.
Done to be able to create millions of records memory-efficiently.
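The lazy, chunked behaviour described in the commit message can be sketched as below. The names `chunks`, `lazy_create`, and `create_batch` are illustrative stand-ins, not the real API: nothing is consumed from the input and no record is "created" until the caller iterates the result, one chunk at a time.

```python
from itertools import islice

def chunks(iterable, size):
    # Yield successive lists of at most `size` items from any iterable,
    # without ever materializing the whole input.
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def lazy_create(values, chunk_size, create_batch):
    # create_batch stands in for the real model-level multi-create.
    # Both input values and output records flow lazily, chunk by chunk,
    # so the caller MUST iterate the result for anything to happen.
    for batch in chunks(values, chunk_size):
        yield create_batch(batch)
```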
force-pushed from d1b8479 to 060b1a3
ids = []
size = len(values)
it = chunks(values, self._chunk_size, fmt=list)
if self._strategy == "multiprocessing" and not multi:
As discussed in DM: we should have two private implementations, the current one under `_create` and the new one in `_create_multiprocessing`, and we call each depending on the options.
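The suggested split could look something like this. The method names `_create` / `_create_multiprocessing` come from the comment; the class shape and bodies are hypothetical, with the strategy-specific logic reduced to placeholders so only the dispatch is illustrated.

```python
class IterBrowse:
    """Hypothetical sketch: one public create() dispatching on strategy."""

    def __init__(self, strategy="chunks"):
        self._strategy = strategy

    def create(self, values, multi=True, **kw):
        if self._strategy == "multiprocessing":
            if not multi:
                raise ValueError(
                    "The multiprocessing strategy only supports the multi version of `create`"
                )
            return self._create_multiprocessing(values, **kw)
        return self._create(values, multi=multi, **kw)

    def _create(self, values, multi=True, **kw):
        # Existing sequential implementation would live here.
        return ("sequential", list(values))

    def _create_multiprocessing(self, values, **kw):
        # New generator/pool-based implementation would live here.
        return ("multiprocessing", list(values))
```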
if self._strategy == "multiprocessing" and not multi:
    raise ValueError("The multiprocessing strategy only supports the multi version of `create`")
if isinstance(values, SQLStr):
Why don't we add a `query` parameter, making `values=None` by default? To make this more explicit:
def create(self, values=None, query=None, **kw):
In any case, if we keep hacking `values` for queries, I wouldn't use `SQLStr`, just check against plain `str`. Normal values of `create` are never strings, so this won't risk any confusion. I still believe being explicit is better, though.
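The explicit signature proposed in the review could be sketched like this; the SQL handling is stubbed out (the real code would stream ids from the query), so only the mutually exclusive argument validation is shown. The return values are illustrative only.

```python
def create(values=None, query=None, **kw):
    # Hypothetical sketch of the proposed explicit signature: exactly one
    # of `values` or `query` must be supplied.
    if (values is None) == (query is None):
        raise ValueError("provide exactly one of `values` or `query`")
    if query is not None:
        # Real code would execute the SQL and feed ids/values lazily.
        return ("from-query", query)
    return ("from-values", list(values))
```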
TODO: test on 1000 DBs, fixups
[IMP] orm: iter_browse accept generator or query as ids
This allows the caller to stay memory-efficient on huge numbers of ids, so that even many millions of records can be browsed.
[IMP] orm: add optional parallelism to iter_browse.attr()
In some cases, e.g. if it is known that calling a certain method on the model will only trigger inserts, or it is clear that updates will be disjoint, such method calls can be done in parallel.
[IMP] orm: add optional parallelism to iter_browse.create()
Like the same support added to `__attr__` in the parent commit, this can only be used by callers when it is known that database modifications will be disjoint, not causing concurrency issues or side-effects on the results. `create` returns an `iter_browse` object for the caller to browse created records. To support vast amounts of created records in the multiprocessing strategy, we process values in a generator and initialize the returned `iter_browse` object with it. As this requires the caller of `create` to always consume/iterate the result (otherwise records will not be created), it is not applied to the other strategies, as it would break the existing API.
[IMP] orm: iter_browse.create() accept generator or query as values
Done to be able to create millions of records memory-efficiently.