
Conversation

@user202729
Contributor

In this code, we want to compute $a, a^2, a^3, \ldots, a^{\texttt{len1}}$.
Previously, each power was computed as $a^i = a^{i-1} \cdot a$. Now, we use $a^i = a^{2^j} \cdot a^{i-2^j}$, where $j$ is flint_ctz(i).
Either way, len1 multiplications are performed, but the latter makes the multiplication tree depth only $O(\log \texttt{len1})$.

The advantage is that if we have an arb_poly, then the accumulated error would be approximately linear in the computation tree depth. This change reduces the final error.
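
For illustration, a small exponent-only Python sketch (not the gr code; powers of two are assumed to be computed by squaring) comparing the multiplication-tree depth of the two schemes:

from math import log2

def ctz(i):
    return (i & -i).bit_length() - 1        # flint_ctz for i > 0

def max_depths(n):
    """Multiplication-tree depth of a^i, i <= n, under both schemes."""
    seq = [0] * (n + 1)                      # a^i = a^(i-1) * a
    new = [0] * (n + 1)                      # a^i = a^(2^j) * a^(i-2^j), j = ctz(i)
    for i in range(2, n + 1):
        seq[i] = seq[i - 1] + 1
        j = ctz(i)
        if i == 1 << j:                      # power of two: square a^(i/2)
            new[i] = new[i >> 1] + 1
        else:
            new[i] = max(new[1 << j], new[i - (1 << j)]) + 1
    return max(seq), max(new)

for n in (15, 100, 1000):
    print(n, max_depths(n), round(2 * log2(n), 1))
# e.g. n = 1000: sequential depth 999, ctz-based depth 17 (within ~2*log2 n)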

The disadvantages of this method include:

  • if multiplying an arbitrary entry by $a$ is somehow faster than multiplying two arbitrary entries, then we get a slowdown. (Half of all multiplications still multiply by $a$, however.)
  • over something like ℤ, where multiplying an $x$-bit integer by a $y$-bit integer has complexity roughly $O((x+y) \log\min(x, y))$, this would make the $\log$ factor larger, since the smaller operand is no longer always $a$ itself (see the small sketch after this list).
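
As a rough illustration of the second point, a bit-size-only Python sketch (no actual arithmetic; x and n are made-up values, and powers of two are assumed to be obtained by squaring):

# bit sizes (x_bits, y_bits) of each multiplication when computing a^1..a^n over ZZ,
# assuming a has x bits; purely bookkeeping
x, n = 64, 12

def ctz(i):
    return (i & -i).bit_length() - 1

seq = [((i - 1) * x, x) for i in range(2, n + 1)]          # a^i = a^(i-1) * a
new = []
for i in range(2, n + 1):
    j = ctz(i)
    if i == 1 << j:
        new.append(((i // 2) * x, (i // 2) * x))           # squaring at powers of two
    else:
        new.append(((1 << j) * x, (i - (1 << j)) * x))     # a^(2^j) * a^(i-2^j)

print(max(min(p) for p in seq), max(min(p) for p in new))  # 64 256
# sequentially the smaller operand is always x bits; here it grows to a constant fraction of n*x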

What do you think?

Attached benchmark below. The code is rather ugly, but I'm pretty sure it measures the right thing.
c.c

@user202729 user202729 force-pushed the gr-poly-compose-binary branch 2 times, most recently from 5906852 to 9156a6e on November 9, 2025 at 07:12
@fredrik-johansson
Collaborator

fredrik-johansson commented Nov 9, 2025

This problem shows up in a few places. Note that there is a _gr_vec_set_powers which can be overloaded by rings to do the optimal thing, so one solution would be to implement this function in terms of _gr_vec_set_powers + some reordering of coefficients.

In the Brent-Kung power series composition, I used this criterion for whether to favor reduced multiplication depth when computing successive powers:

if (len2 >= n && (gr_ctx_is_finite(ctx) == T_TRUE || gr_ctx_has_real_prec(ctx) == T_TRUE))

  • In a finite ring, we probably don't have the issue of coefficient growth, so we benefit from half of the multiplications being squarings
  • If the base ring uses real approximations (e.g. floats or balls), assume that minimizing the depth minimizes error
  • In other cases, assume that coefficient growth is the main issue, so favor repeated multiplication by the generator

You could do the same thing here, and we should probably do something similar by default in _gr_vec_set_powers. Again, it's possible to do better for specific ring implementations though.

About the particular algorithm used: your way to reduce the depth to $O(\log n)$ looks a bit more complicated than just doing $a^n = a^{\lfloor n / 2 \rfloor} a^{\lceil n / 2 \rceil}$. I haven't thought carefully about this: is there a reason to favor one over the other? Edit: I guess the advantage is that you only need logarithmic scratch space.
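
For what it's worth, a quick exponent-only Python check of the halving rule suggests it gives depth exactly $\lceil \log_2 n \rceil$, slightly better than the roughly $2\log_2 n$ of the ctz-based rule:

from math import ceil, log2

# depth of a^i under a^i = a^(i//2) * a^((i+1)//2), for all i up to N
N = 1000
depth = [0] * (N + 1)
for i in range(2, N + 1):
    depth[i] = depth[(i + 1) // 2] + 1
print(max(depth), ceil(log2(N)))   # 10 10, i.e. exactly ceil(log2 N)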

@user202729
Contributor Author

user202729 commented Nov 9, 2025

_gr_vec_set_powers

looks useful. We could use that as a subroutine here; the caveat is $O(n)$ extra space.

you only need logarithmic scratch space

that's the intention, indeed. Do you think it's worth it?

we benefit from half of the multiplications being squaring

Actually this solution doesn't have that benefit either. Indeed I haven't thought about it.

If the goal is to use as many squares as possible, it's also possible to use only $O(\log n)$ extra space, at the cost of a worse memory access pattern.

Something like

work(result, pos, apow)    /* invariant: apow = a^pos */
{
    result[pos] *= apow
    if ((pos<<1) < len)
    {
        tmp = apow^2       /* a^(2*pos) */
        work(result, pos<<1, tmp)
        if ((pos<<1)+1 < len)
        {
            tmp *= a       /* a^(2*pos+1) */
            work(result, (pos<<1)+1, tmp)
        }
    }
}
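
A quick exponent-only mock of this sketch (plain Python, hypothetical names) checking that every position gets scaled by the right power, and that the number of live tmp values (the recursion depth) stays logarithmic:

# res[pos] *= a^pos, tracked as integer exponents; also records the peak recursion depth
def check(length):
    got = {}
    depth = max_depth = 0

    def work(pos, apow):                     # invariant: apow stands for a^pos
        nonlocal depth, max_depth
        depth += 1
        max_depth = max(max_depth, depth)
        got[pos] = apow                      # result[pos] *= apow
        if (pos << 1) < length:
            tmp = apow * 2                   # tmp = apow^2
            work(pos << 1, tmp)
            if (pos << 1) + 1 < length:
                tmp += 1                     # tmp *= a
                work((pos << 1) + 1, tmp)
        depth -= 1

    work(1, 1)
    assert got == {i: i for i in range(1, length)}
    return max_depth

for length in (2, 3, 10, 100, 1000):
    print(length, check(length))             # peak depth stays around log2(length)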

the actual code may want to use a non-recursive implementation for performance.


non-recursive implementation

int maxbit = FLINT_CLOG2(len1);
gr_ptr t;
GR_TMP_INIT_VEC(t, maxbit, ctx);   /* O(log len1) scratch: one running power of a per chain level */

/* powers of two first: t[j] = a^(2^j), multiplied into res[2^j] */
status |= gr_set(GR_ENTRY(t, 0, sz), a, ctx);
status |= gr_mul(GR_ENTRY(res, 1, sz), GR_ENTRY(res, 1, sz), a, ctx);
for (int j = 1; j < maxbit; ++j)
{
	status |= gr_sqr(GR_ENTRY(t, j, sz), GR_ENTRY(t, j-1, sz), ctx);
	status |= gr_mul(GR_ENTRY(res, 1<<j, sz), GR_ENTRY(res, 1<<j, sz), GR_ENTRY(t, j, sz), ctx);
}
/* indices i in [2^(maxbit-1)+1, len1): update res along the chain i>>bit, i>>(bit-1), ..., i, with bit = flint_ctz(i) */
for (i = ((slong) 1 << (maxbit-1)) + 1; i < len1; i++)
{
	int bit = flint_ctz(i);
	status |= gr_mul(GR_ENTRY(t, maxbit-1-bit, sz), GR_ENTRY(t, maxbit-1-bit, sz), a, ctx);
	status |= gr_mul(GR_ENTRY(res, i>>bit, sz), GR_ENTRY(res, i>>bit, sz), GR_ENTRY(t, maxbit-1-bit, sz), ctx);

	for (int j = bit; j > 0; --j)
	{
		status |= gr_sqr(GR_ENTRY(t, maxbit-j, sz), GR_ENTRY(t, maxbit-j-1, sz), ctx);
		status |= gr_mul(GR_ENTRY(res, i>>(j-1), sz), GR_ENTRY(res, i>>(j-1), sz), GR_ENTRY(t, maxbit-j, sz), ctx);
	}
}

/* remaining chains, ending at i in [(len1+1)/2, 2^(maxbit-1)); these indices are not reached by the loop above */
for (i = (len1 + 1) >> 1; i < ((slong)1 << (maxbit-1)); i++)
{
	int bit = flint_ctz(i);
	status |= gr_mul(GR_ENTRY(t, maxbit-2-bit, sz), GR_ENTRY(t, maxbit-2-bit, sz), a, ctx);
	status |= gr_mul(GR_ENTRY(res, i>>bit, sz), GR_ENTRY(res, i>>bit, sz), GR_ENTRY(t, maxbit-2-bit, sz), ctx);

	for (int j = bit; j > 0; --j)
	{
		status |= gr_sqr(GR_ENTRY(t, maxbit-j-1, sz), GR_ENTRY(t, maxbit-j-2, sz), ctx);
		status |= gr_mul(GR_ENTRY(res, i>>(j-1), sz), GR_ENTRY(res, i>>(j-1), sz), GR_ENTRY(t, maxbit-j-1, sz), ctx);
	}
}

GR_TMP_CLEAR_VEC(t, maxbit, ctx);

sanity check that it's correct

def ctz(n: int) -> int:
    assert n
    return (n & -n).bit_length() - 1

def ceil_log2(n: int) -> int:
    assert n > 0
    b = n.bit_length() - 1
    return b if (1 << b) == n else b + 1

for n in range(2, 100):
    maxbit = ceil_log2(n)
    t = [None]*maxbit
    seen = set()
    t[0] = 1
    def print_and_add(x, y):
        #print(x)
        assert x==y
        assert x not in seen
        seen.add(x)
    print_and_add(t[0], 1)
    for j in range(1, maxbit):
        t[j] = t[j-1]*2
        print_and_add(t[j], 1<<j)
    #
    for i in range((1<<(maxbit-1))+1, n):
        bit = ctz(i)
        t[maxbit-1-bit] += 1
        print_and_add(t[maxbit-1-bit], i>>bit)
        for j in range(bit, 0, -1):
            t[maxbit-j] = t[maxbit-j-1]*2
            print_and_add(t[maxbit-j], i>>(j-1))
    #
    for i in range((n+1)>>1, 1<<(maxbit-1)):
        bit = ctz(i)
        t[maxbit-2-bit] += 1
        print_and_add(t[maxbit-2-bit], i>>bit)
        for j in range(bit, 0, -1):
            t[maxbit-j-1] = t[maxbit-j-2]*2
            print_and_add(t[maxbit-j-1], i>>(j-1))
    assert seen==set(range(1, n))

@user202729 user202729 force-pushed the gr-poly-compose-binary branch from 9156a6e to 099d6a9 on November 10, 2025 at 14:24
@user202729 user202729 marked this pull request as draft November 10, 2025 14:40
@user202729 user202729 force-pushed the gr-poly-compose-binary branch 2 times, most recently from d1557e4 to 00ddb74 on November 10, 2025 at 14:43
@user202729
Contributor Author

user202729 commented Nov 10, 2025

if this were C++ I could do template <typename Fn> void for_each_power(gr_ptr a, slong len, Fn fn) or such to avoid the code duplication. One can get almost the same thing with a C function pointer, but those tend not to be inlined as well. (And the syntax is uglier.)

anyway, I believe it's fixed.

I have two possible algorithms for compose.c. See the latest commit.

  • the old one is shorter, but does not use as many squares.
  • the new one is longer (more complicated), but half of all operations are squares.
  • the new one uses $\log n$ scratch space; the old one uses $2 \log n$.

which one do you prefer?

That said, maybe memory allocation is cheap enough that it would be faster to allocate a new buffer, call _gr_vec_set_powers, and then multiply.
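
In plain Python, that alternative is shaped roughly like the following (a sketch only; the naive power filling and the final loop stand in for a temporary gr vector filled by _gr_vec_set_powers followed by gr_mul calls):

def scale_by_powers(res, a):
    powers = [1] * len(res)                  # O(len) extra space
    for i in range(1, len(res)):
        powers[i] = powers[i - 1] * a        # a real implementation may do this smarter per ring
    for i in range(1, len(res)):
        res[i] *= powers[i]
    return res

print(scale_by_powers([7, 7, 7, 7], 2))      # [7, 14, 28, 56]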

@user202729 user202729 marked this pull request as ready for review November 10, 2025 15:10
if (i % 2 == 0)
    status |= sqr(GR_ENTRY(res, i, sz), GR_ENTRY(res, i / 2, sz), ctx);
else
    status |= mul(GR_ENTRY(res, i, sz), GR_ENTRY(res, i - 1, sz), x, ctx);
Collaborator

Should probably be mul(GR_ENTRY(res, i, sz), GR_ENTRY(res, (i + 1) / 2, sz), GR_ENTRY(res, i / 2, sz), ctx);

Contributor Author

@user202729 user202729 Nov 11, 2025

Actually res[i] = res[i-1] * x has the advantage that, even in a ball field, if the initial value x is exact and has precision much less than prec, multiplying by it could be slightly faster than multiplying by a general number.

But if the goal is to reduce multiplication depth, res[i] = res[i/2] * res[(i+1)/2] is better. Quick testing shows that res[i] = res[i/2] * res[(i+1)/2] results in less error too.

Doing that in the $O(\log n)$ scratch space version, though, is... difficult.
