[IMP] orm: parallel iter_browse #320
Conversation
upgradeci retry with always only base in all versions
force-pushed from 866b0fe to c484df4
force-pushed from c484df4 to efa83ee
To simplify the code, I think we should check `UPG_PARALLEL_ITER_BROWSE` only once.
Proposal: get rid of the hack injecting params. Code is simpler, easier to read, and less hacky.
It is true that it looks that way, but there's one big disadvantage: it reduces the possible use-cases, because now everything in there (including …
Do you have a use case where we need to pass more complex objects? I think once we have it, we can make this hacky.
Honestly, I find it more hacky to pass around all the constant parameters through IPC for each task than to store them away once before …
You are missing the point. In your analysis, which isn't wrong, you're tying the code to the internal implementation of the standard Python library. That's hacky (not to mention attaching values to functions). The interface of … I take it that perhaps you meant "less performant" instead of "more hacky"; with that I could agree. Yet the performance improvement is probably meaningless given the time the rest would take. EDIT: I just don't like hacky stuff if there is a clearer way to express something. I agree that sometimes we need to sacrifice "clarity" for the sake of higher goals. I don't see them here.
What do you mean? The fact that it forks? This whole thing is tied to that, as are all of our other uses of ProcessPoolExecutor; that's why #320 (comment) is important. I also wanted simplicity in the callback signature, to avoid as many serialisation issues as we can. OTOH, your suggestion is more readable and easier to understand. I think both are valid goals, and they are only in conflict here because … Anyway. It is true that input to this code, if used, has to be chosen very carefully for all kinds of reasons. So maybe it is a good idea to restrict, in this way, the arguments that can be passed to the model attr being called, for the purpose of limiting the use-cases to simpler ones that hopefully have a lower risk of dangerous use.
No, the fact that … Once more, I'm OK with this. You could probably include some comments about why you chose to do this. I'd say something like: "we want to avoid the overhead of serializing the same parameters for all ids via pickle in the map implementation. Warning: we need to clean them up in _end to avoid any potential conflicts". Note that we need to keep in mind --some testing would be nice-- what happens for multiple calls to iter_browse with multiprocessing within the same run (upgrade step). I don't see anything problematic at the moment.
Yes. We should honor that --if we can--, that's the point.
force-pushed from d16c3f9 to 37353dc
force-pushed from 37353dc to ef79dc6
force-pushed from c592fe8 to 7a16d08
force-pushed from 546ff12 to d1b8479
This allows the caller to stay memory-efficient on huge numbers of ids, so that even many millions of records can be browsed.
In some cases, e.g. if it is known that calling a certain method on the model will only trigger inserts, or it is clear that updates will be disjoint, such method calls can be done in parallel.
Like the same support added to `__attr__` in the parent commit, this can only be used by callers when it is known that database modifications will be disjoint, not causing concurrency issues or side-effects on the results. `create` returns an `iter_browse` object for the caller to browse created records. With the multiprocessing strategy, we make the following changes to it:
- To support vast amounts of created records, we process values in a generator and initialize the returned `iter_browse` object with it. As this requires the caller of `create` to always consume/iterate the result (otherwise records will not be created), it is not applied to the other strategies, as that would break the existing API.
- Make `__iter__` yield chunks if the strategy is multiprocessing. This way a caller can process chunks of freshly created records, e.g. `for records in util.iter_browse(strategy="multiprocessing").create(SQLStr)`, and since everything from input to output is a generator, it will be perfectly memory-efficient.
- Do not pass the logger to the returned `iter_browse` object from `create` if the strategy is multiprocessing, because it would lead to interleaved logging from the input generator and this one when the caller iterates it.
Done to be able to create millions of records memory-efficiently.
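The lazy, chunked behaviour described in the commit message can be sketched as below. The names `chunks`, `lazy_create`, and `create_batch` are illustrative stand-ins, not the real API: nothing is consumed from the input and no record is "created" until the caller iterates the result, one chunk at a time.

```python
from itertools import islice

def chunks(iterable, size):
    # Yield successive lists of at most `size` items from any iterable,
    # without ever materializing the whole input.
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def lazy_create(values, chunk_size, create_batch):
    # create_batch stands in for the real model-level multi-create.
    # Both input values and output records flow lazily, chunk by chunk,
    # so the caller MUST iterate the result for anything to happen.
    for batch in chunks(values, chunk_size):
        yield create_batch(batch)
```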
force-pushed from d1b8479 to 060b1a3
ids = []
size = len(values)
it = chunks(values, self._chunk_size, fmt=list)
if self._strategy == "multiprocessing" and not multi:
As discussed in DM: we should have two private implementations, the current one under `_create` and the new one in `_create_multiprocessing`, and we call each depending on the options.
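The suggested split could look something like this. The method names `_create` / `_create_multiprocessing` come from the comment; the class shape and bodies are hypothetical, with the strategy-specific logic reduced to placeholders so only the dispatch is illustrated.

```python
class IterBrowse:
    """Hypothetical sketch: one public create() dispatching on strategy."""

    def __init__(self, strategy="chunks"):
        self._strategy = strategy

    def create(self, values, multi=True, **kw):
        if self._strategy == "multiprocessing":
            if not multi:
                raise ValueError(
                    "The multiprocessing strategy only supports the multi version of `create`"
                )
            return self._create_multiprocessing(values, **kw)
        return self._create(values, multi=multi, **kw)

    def _create(self, values, multi=True, **kw):
        # Existing sequential implementation would live here.
        return ("sequential", list(values))

    def _create_multiprocessing(self, values, **kw):
        # New generator/pool-based implementation would live here.
        return ("multiprocessing", list(values))
```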
if self._strategy == "multiprocessing" and not multi:
    raise ValueError("The multiprocessing strategy only supports the multi version of `create`")
if isinstance(values, SQLStr):
Why don't we add a `query` parameter, making `values=None` by default? To make this more explicit:
def create(self, values=None, query=None, **kw):
In any case, if we keep hacking `values` for queries, I wouldn't use `SQLStr`, just check against plain `str`. Normal values of `create` are never strings, so this won't risk any confusion. I still believe being explicit is better, though.
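The explicit signature proposed in the review could be sketched like this; the SQL handling is stubbed out (the real code would stream ids from the query), so only the mutually exclusive argument validation is shown. The return values are illustrative only.

```python
def create(values=None, query=None, **kw):
    # Hypothetical sketch of the proposed explicit signature: exactly one
    # of `values` or `query` must be supplied.
    if (values is None) == (query is None):
        raise ValueError("provide exactly one of `values` or `query`")
    if query is not None:
        # Real code would execute the SQL and feed ids/values lazily.
        return ("from-query", query)
    return ("from-values", list(values))
```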
TODO: test on 1000 DBs, fixups
[IMP] orm: iter_browse accept generator or query as ids
This allows the caller to stay memory-efficient on huge numbers of ids, so that even many millions of records can be browsed.
[IMP] orm: add optional parallelism to iter_browse.attr()
In some cases, e.g. if it is known that calling a certain method on the model will only trigger inserts, or it is clear that updates will be disjoint, such method calls can be done in parallel.
[IMP] orm: add optional parallelism to iter_browse.create()
Like the same support added to `__attr__` in the parent commit, this can only be used by callers when it is known that database modifications will be disjoint, not causing concurrency issues or side-effects on the results. `create` returns an `iter_browse` object for the caller to browse created records. To support vast amounts of created records in the multiprocessing strategy, we process values in a generator and initialize the returned `iter_browse` object with it. As this requires the caller of `create` to always consume/iterate the result (otherwise records will not be created), it is not applied to the other strategies, as it would break the existing API.
[IMP] orm: iter_browse.create() accept generator or query as values
Done to be able to create millions of records memory-efficiently.