Description
I have a large code I'm working on. Somewhere deep in the bowels of this code I have a line that essentially amounts to:
long_array * tall_array

Much later, 3-4 levels up the call stack, I have a function that applies a cpn.sum over one of the axes, essentially throwing away one of the (large) dimensions.
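For concreteness, here is a minimal sketch of the shape of the problem, assuming `import cupynumeric as cpn`; the array shapes and function names are purely illustrative, and in the real code the multiply and the reduction live several call-stack levels apart:

```python
import cupynumeric as cpn

def scale(long_array, tall_array):
    # Deep in the call stack: a broadcasted elementwise multiply.
    # With illustrative shapes (N, 1) and (1, M) this materializes an
    # (N, M) intermediate, which is where the memory blows up.
    return long_array * tall_array

def reduce_one_axis(product):
    # Several levels up: the large axis is immediately summed away.
    return cpn.sum(product, axis=1)

long_array = cpn.ones((100_000, 1))
tall_array = cpn.ones((1, 50_000))
result = reduce_one_axis(scale(long_array, tall_array))  # shape (100_000,)
```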
Right now, this code scales very inefficiently. I'm forced to use multiple nodes purely for memory capacity reasons, but the code doesn't actually need them: the work runs far too quickly for distributed execution to make sense, so I'm essentially throwing away the extra compute.
The ideal solution would be to fuse the multiply and the sum to avoid the memory bloat. Doing this in user code is painful, because the code is deliberately factored for reuse: the offending multiply sits multiple levels down the call stack, each level serves a conceptually distinct purpose, and each could be called from arbitrary other code. Fusing by hand means breaking down the code's abstractions just to apply this one optimization (see the sketch below).
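Under the illustrative shapes above, the manual fusion amounts to something like the following hypothetical rewrite; getting there in the real code means pulling the reduction down into the layer that owns the multiply, which is exactly the abstraction break described above:

```python
import cupynumeric as cpn

# Hypothetical manually fused version of the pattern, with the same
# illustrative shapes as above. Algebraically,
#   sum_j(long[i, 0] * tall[0, j]) == long[i, 0] * sum_j(tall[0, j]),
# so the (N, M) intermediate never has to be materialized.
def fused_scale_and_reduce(long_array, tall_array):
    return long_array[:, 0] * cpn.sum(tall_array)

long_array = cpn.ones((100_000, 1))
tall_array = cpn.ones((1, 50_000))
fused = fused_scale_and_reduce(long_array, tall_array)  # shape (100_000,)
```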
If this could be done automatically with reasonable overhead, it would be far more effective from a code-reuse and readability perspective.
LANL/SLAC, medium priority.