
Conversation

@user202729
Contributor

In this code, we want to compute $a, a^2, a^3, \ldots, a^{\texttt{len1}}$.
Previously, each power was computed as $a^i = a^{i-1} \cdot a$. Now, we use $a^i = a^{2^j} \cdot a^{i-2^j}$, where $j$ is flint_ctz(i).
Either way, len1 multiplications are performed, but the latter makes the multiplication tree depth only $O(\log \texttt{len1})$.

The advantage is that if we have an arb_poly, then the accumulated error would be approximately linear in the computation tree depth. This change reduces the final error.
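
For illustration, a small exponent-only Python sketch (not the gr code; powers of two are assumed to be computed by squaring) comparing the multiplication-tree depth of the two schemes:

from math import log2

def ctz(i):
    return (i & -i).bit_length() - 1        # flint_ctz for i > 0

def max_depths(n):
    """Multiplication-tree depth of a^i, i <= n, under both schemes."""
    seq = [0] * (n + 1)                      # a^i = a^(i-1) * a
    new = [0] * (n + 1)                      # a^i = a^(2^j) * a^(i-2^j), j = ctz(i)
    for i in range(2, n + 1):
        seq[i] = seq[i - 1] + 1
        j = ctz(i)
        if i == 1 << j:                      # power of two: square a^(i/2)
            new[i] = new[i >> 1] + 1
        else:
            new[i] = max(new[1 << j], new[i - (1 << j)]) + 1
    return max(seq), max(new)

for n in (15, 100, 1000):
    print(n, max_depths(n), round(2 * log2(n), 1))
# e.g. n = 1000: sequential depth 999, ctz-based depth 17 (within ~2*log2 n)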

The disadvantages of this method include:

  • if multiplying an arbitrary entry by $a$ is somehow faster than multiplying two arbitrary entries, then we get a slowdown. (Half of all multiplications still multiply by $a$, however.)
  • over something like ℤ, where multiplying an $x$-bit integer by a $y$-bit integer has complexity roughly $O((x+y) \log\min(x, y))$, this would make the $\log$ factor larger, since the smaller operand is no longer always $a$ itself (see the small sketch after this list).
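
As a rough illustration of the second point, a bit-size-only Python sketch (no actual arithmetic; x and n are made-up values, and powers of two are assumed to be obtained by squaring):

# bit sizes (x_bits, y_bits) of each multiplication when computing a^1..a^n over ZZ,
# assuming a has x bits; purely bookkeeping
x, n = 64, 12

def ctz(i):
    return (i & -i).bit_length() - 1

seq = [((i - 1) * x, x) for i in range(2, n + 1)]          # a^i = a^(i-1) * a
new = []
for i in range(2, n + 1):
    j = ctz(i)
    if i == 1 << j:
        new.append(((i // 2) * x, (i // 2) * x))           # squaring at powers of two
    else:
        new.append(((1 << j) * x, (i - (1 << j)) * x))     # a^(2^j) * a^(i-2^j)

print(max(min(p) for p in seq), max(min(p) for p in new))  # 64 256
# sequentially the smaller operand is always x bits; here it grows to a constant fraction of n*x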

What do you think?

Attached benchmark below. The code is rather ugly, but I'm pretty sure it measures the right thing.
c.c

@user202729 user202729 force-pushed the gr-poly-compose-binary branch 2 times, most recently from 5906852 to 9156a6e on November 9, 2025 at 07:12
@fredrik-johansson
Collaborator

fredrik-johansson commented Nov 9, 2025

This problem shows up in a few places. Note that there is a _gr_vec_set_powers which can be overloaded by rings to do the optimal thing, so one solution would be to implement this function in terms of _gr_vec_set_powers + some reordering of coefficients.

In the Brent-Kung power series composition, I used this criterion for whether to favor reduced multiplication depth when computing successive powers:

if (len2 >= n && (gr_ctx_is_finite(ctx) == T_TRUE || gr_ctx_has_real_prec(ctx) == T_TRUE))

  • In a finite ring, we probably don't have the issue of coefficient growth, so we benefit from half of the multiplications being squarings
  • If the base ring uses real approximations (e.g. floats or balls), assume that minimizing the depth minimizes error
  • In other cases, assume that coefficient growth is the main issue, so favor repeated multiplication by the generator

You could do the same thing here, and we should probably do something similar by default in _gr_vec_set_powers. Again, it's possible to do better for specific ring implementations though.

About the particular algorithm used: your way to reduce the depth to $O(\log n)$ looks a bit more complicated than just doing $a^n = a^{\lfloor n / 2 \rfloor} a^{\lceil n / 2 \rceil}$. I haven't thought carefully about this: is there a reason to favor one over the other? Edit: I guess the advantage is that you only need logarithmic scratch space.
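
For what it's worth, a quick exponent-only Python check of the halving rule suggests it gives depth exactly $\lceil \log_2 n \rceil$, slightly better than the roughly $2\log_2 n$ of the ctz-based rule:

from math import ceil, log2

# depth of a^i under a^i = a^(i//2) * a^((i+1)//2), for all i up to N
N = 1000
depth = [0] * (N + 1)
for i in range(2, N + 1):
    depth[i] = depth[(i + 1) // 2] + 1
print(max(depth), ceil(log2(N)))   # 10 10, i.e. exactly ceil(log2 N)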

@user202729
Contributor Author

user202729 commented Nov 9, 2025

_gr_vec_set_powers

looks useful. We could use that as a subroutine here; the caveat is $O(n)$ extra space.

you only need logarithmic scratch space

that's the intention, indeed. Do you think it's worth it?

we benefit from half of the multiplications being squaring

Actually this solution doesn't have that benefit either. Indeed I haven't thought about it.

If the goal is to use as many squares as possible, it's also possible to use only $O(\log n)$ extra space, at the cost of a worse memory access pattern.

Something like

work(result, pos, apow)    /* invariant: apow = a^pos */
{
    result[pos] *= apow
    if ((pos<<1) < len)
    {
        tmp = apow^2       /* a^(2*pos) */
        work(result, pos<<1, tmp)
        if ((pos<<1)+1 < len)
        {
            tmp *= a       /* a^(2*pos+1) */
            work(result, (pos<<1)+1, tmp)
        }
    }
}
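
A quick exponent-only mock of this sketch (plain Python, hypothetical names) checking that every position gets scaled by the right power, and that the number of live tmp values (the recursion depth) stays logarithmic:

# res[pos] *= a^pos, tracked as integer exponents; also records the peak recursion depth
def check(length):
    got = {}
    depth = max_depth = 0

    def work(pos, apow):                     # invariant: apow stands for a^pos
        nonlocal depth, max_depth
        depth += 1
        max_depth = max(max_depth, depth)
        got[pos] = apow                      # result[pos] *= apow
        if (pos << 1) < length:
            tmp = apow * 2                   # tmp = apow^2
            work(pos << 1, tmp)
            if (pos << 1) + 1 < length:
                tmp += 1                     # tmp *= a
                work((pos << 1) + 1, tmp)
        depth -= 1

    work(1, 1)
    assert got == {i: i for i in range(1, length)}
    return max_depth

for length in (2, 3, 10, 100, 1000):
    print(length, check(length))             # peak depth stays around log2(length)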

the actual code may want to use a non-recursive implementation for performance.


non-recursive implementation

int maxbit = FLINT_CLOG2(len1);
gr_ptr t;
GR_TMP_INIT_VEC(t, maxbit, ctx);   /* O(log len1) scratch: one running power of a per chain level */

/* powers of two first: t[j] = a^(2^j), multiplied into res[2^j] */
status |= gr_set(GR_ENTRY(t, 0, sz), a, ctx);
status |= gr_mul(GR_ENTRY(res, 1, sz), GR_ENTRY(res, 1, sz), a, ctx);
for (int j = 1; j < maxbit; ++j)
{
	status |= gr_sqr(GR_ENTRY(t, j, sz), GR_ENTRY(t, j-1, sz), ctx);
	status |= gr_mul(GR_ENTRY(res, 1<<j, sz), GR_ENTRY(res, 1<<j, sz), GR_ENTRY(t, j, sz), ctx);
}
/* indices i in [2^(maxbit-1)+1, len1): update res along the chain i>>bit, i>>(bit-1), ..., i, with bit = flint_ctz(i) */
for (i = ((slong) 1 << (maxbit-1)) + 1; i < len1; i++)
{
	int bit = flint_ctz(i);
	status |= gr_mul(GR_ENTRY(t, maxbit-1-bit, sz), GR_ENTRY(t, maxbit-1-bit, sz), a, ctx);
	status |= gr_mul(GR_ENTRY(res, i>>bit, sz), GR_ENTRY(res, i>>bit, sz), GR_ENTRY(t, maxbit-1-bit, sz), ctx);

	for (int j = bit; j > 0; --j)
	{
		status |= gr_sqr(GR_ENTRY(t, maxbit-j, sz), GR_ENTRY(t, maxbit-j-1, sz), ctx);
		status |= gr_mul(GR_ENTRY(res, i>>(j-1), sz), GR_ENTRY(res, i>>(j-1), sz), GR_ENTRY(t, maxbit-j, sz), ctx);
	}
}

/* remaining chains, ending at i in [(len1+1)/2, 2^(maxbit-1)); these indices are not reached by the loop above */
for (i = (len1 + 1) >> 1; i < ((slong)1 << (maxbit-1)); i++)
{
	int bit = flint_ctz(i);
	status |= gr_mul(GR_ENTRY(t, maxbit-2-bit, sz), GR_ENTRY(t, maxbit-2-bit, sz), a, ctx);
	status |= gr_mul(GR_ENTRY(res, i>>bit, sz), GR_ENTRY(res, i>>bit, sz), GR_ENTRY(t, maxbit-2-bit, sz), ctx);

	for (int j = bit; j > 0; --j)
	{
		status |= gr_sqr(GR_ENTRY(t, maxbit-j-1, sz), GR_ENTRY(t, maxbit-j-2, sz), ctx);
		status |= gr_mul(GR_ENTRY(res, i>>(j-1), sz), GR_ENTRY(res, i>>(j-1), sz), GR_ENTRY(t, maxbit-j-1, sz), ctx);
	}
}

GR_TMP_CLEAR_VEC(t, maxbit, ctx);

sanity check that it's correct

def ctz(n: int) -> int:
    assert n
    return (n & -n).bit_length() - 1

def ceil_log2(n: int) -> int:
    assert n > 0
    b = n.bit_length() - 1
    return b if (1 << b) == n else b + 1

for n in range(2, 100):
    maxbit = ceil_log2(n)
    t = [None]*maxbit
    seen = set()
    t[0] = 1
    def print_and_add(x, y):
        #print(x)
        assert x==y
        assert x not in seen
        seen.add(x)
    print_and_add(t[0], 1)
    for j in range(1, maxbit):
        t[j] = t[j-1]*2
        print_and_add(t[j], 1<<j)
    #
    for i in range((1<<(maxbit-1))+1, n):
        bit = ctz(i)
        t[maxbit-1-bit] += 1
        print_and_add(t[maxbit-1-bit], i>>bit)
        for j in range(bit, 0, -1):
            t[maxbit-j] = t[maxbit-j-1]*2
            print_and_add(t[maxbit-j], i>>(j-1))
    #
    for i in range((n+1)>>1, 1<<(maxbit-1)):
        bit = ctz(i)
        t[maxbit-2-bit] += 1
        print_and_add(t[maxbit-2-bit], i>>bit)
        for j in range(bit, 0, -1):
            t[maxbit-j-1] = t[maxbit-j-2]*2
            print_and_add(t[maxbit-j-1], i>>(j-1))
    assert seen==set(range(1, n))

@user202729 user202729 force-pushed the gr-poly-compose-binary branch from 9156a6e to 099d6a9 on November 10, 2025 at 14:24
@user202729 user202729 marked this pull request as draft November 10, 2025 14:40
@user202729 user202729 force-pushed the gr-poly-compose-binary branch 2 times, most recently from d1557e4 to 00ddb74 on November 10, 2025 at 14:43
@user202729
Contributor Author

user202729 commented Nov 10, 2025

if this were C++ I could do template <typename Fn> void for_each_power(gr_ptr a, slong len, Fn fn) or such to avoid the code duplication. One can get almost the same thing with a C function pointer, but those tend not to be inlined as well. (And the syntax is uglier.)

anyway, I believe it's fixed.

I have two possible algorithms for compose.c. See the latest commit.

  • the old one is shorter, but does not use as many squares.
  • the new one is longer (more complicated), but half of all operations are squares.
  • the new one uses $\log n$ scratch space; the old one uses $2 \log n$.

which one do you prefer?

That said, maybe memory allocation is cheap enough that it would be faster to allocate a new buffer, call _gr_vec_set_powers, and then multiply.
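
In plain Python, that alternative is shaped roughly like the following (a sketch only; the naive power filling and the final loop stand in for a temporary gr vector filled by _gr_vec_set_powers followed by gr_mul calls):

def scale_by_powers(res, a):
    powers = [1] * len(res)                  # O(len) extra space
    for i in range(1, len(res)):
        powers[i] = powers[i - 1] * a        # a real implementation may do this smarter per ring
    for i in range(1, len(res)):
        res[i] *= powers[i]
    return res

print(scale_by_powers([7, 7, 7, 7], 2))      # [7, 14, 28, 56]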

@user202729 user202729 marked this pull request as ready for review November 10, 2025 15:10
if (i % 2 == 0)
    status |= sqr(GR_ENTRY(res, i, sz), GR_ENTRY(res, i / 2, sz), ctx);
else
    status |= mul(GR_ENTRY(res, i, sz), GR_ENTRY(res, i - 1, sz), x, ctx);
Collaborator

Should probably be mul(GR_ENTRY(res, i, sz), GR_ENTRY(res, (i + 1) / 2, sz), GR_ENTRY(res, i / 2, sz), ctx);

Contributor Author

@user202729 user202729 Nov 11, 2025

Actually res[i] = res[i-1] * x has the advantage that, even in a ball field, if the initial value x is exact and has precision much less than prec, multiplying by it could be slightly faster than multiplying by a general number.

But if the goal is to reduce multiplication depth, res[i] = res[i/2] * res[(i+1)/2] is better. Quick testing shows that res[i] = res[i/2] * res[(i+1)/2] results in less error too.

Doing that in the $O(\log n)$ scratch space version, though, is... difficult.
