Skip to content

Conversation

@stooart-mon
Copy link
Contributor

@stooart-mon stooart-mon commented Dec 1, 2025

MemorySegments allocated from shared Arena from
java.lang.foreign.Arena.ofShared() have their lifecycle controlled by jdk.internal.foreign.SharedSession. This class ensures that the MemorySegments can't be freed until after a thread has called Arena.close(). This is implemented using a counter that is atomically incremented when used, and decremented when not used, on every invocation of a downcall. While shared Arenas allow any thread to use it and to close it, this tracking has a cost when multiple threads are contended on it. This patch changes the implementation to use multiple counters to reduce contention. sun.nio.ch.IOUtil, java.nio.Buffer and sun.nio.ch.SimpleAsynchronousFileChannelImpl are modified as they have threads releasing the scope different from the ones that allocated them, so a ticket that tracks the counter has to be passed over.

The microbenchmark org.openjdk.bench.java.lang.foreign. CallOverheadConstant.panama_identity_memory_address_shared_3 was used to generate the following results. The scalability was checked on a number of platforms with the JMH parameter "-t" specifying the number of threads. Measurements are in ns/op .

The hardware are the Neoverse-N1, N2, V1 and V2, Intel Xeon 8375c and the AMD Epyc 9654.

Threads N1 N2 V1 V2 Xeon Epyc
1 30.88 32.15 33.54 32.82 27.46 8.45
2 142.56 134.48 132.01 131.50 116.68 46.53
4 310.18 282.75 287.59 271.82 251.88 86.11
8 702.02 710.29 736.72 670.63 533.46 194.60
16 1,436.17 1,684.80 1,833.69 1,782.78 1,100.15 827.28
24 2,185.55 2,508.86 2,732.22 2,815.26 1,646.09 1,530.28
32 2,942.48 3,432.84 3,643.64 3,782.23 2,236.81 2,278.52
48 4,466.56 5,174.72 5,401.95 5,621.41 4,926.30 3,026.58

After:

Threads N1 N2 V1 V2 Xeon Epyc
1 32.41 32.11 34.43 31.32 27.94 9.82
2 32.64 33.72 35.11 31.30 28.02 9.81
4 32.71 36.84 34.67 31.35 28.12 10.49
8 58.22 31.60 36.87 31.72 47.09 16.52
16 70.15 47.76 52.37 47.26 70.91 14.53
24 77.38 78.14 81.67 71.98 87.20 21.70
32 87.54 98.01 84.73 86.79 109.25 22.65
48 121.54 128.14 120.51 104.35 175.08 26.85

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Warning

 ⚠️ Found trailing period in issue title for 8371260: Improve scaling of downcalls using MemorySegments allocated with shared arenas.

Issue

  • JDK-8371260: Improve scaling of downcalls using MemorySegments allocated with shared arenas. (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28575/head:pull/28575
$ git checkout pull/28575

Update a local copy of the PR:
$ git checkout pull/28575
$ git pull https://git.openjdk.org/jdk.git pull/28575/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28575

View PR using the GUI difftool:
$ git pr show -t 28575

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28575.diff

Using Webrev

Link to Webrev Comment

…with shared arenas

MemorySegments allocated from shared Arena from
java.lang.foreign.Arena.ofShared() have their lifecycle controlled by
jdk.internal.foreign.SharedSession. This class ensures that the
MemorySegments can't be freed until after a thread has called
Arena.close(). This is implemented using a counter that is atomically
incremented when used, and decremented when not used, on every invocation
of a downcall. While shared Arenas allow any thread to use it and to
close it, this tracking has a cost when multiple threads are contended
on it. This patch changes the implementation to use multiple counters to
reduce contention. sun.nio.ch.IOUtil, java.nio.Buffer and
sun.nio.ch.SimpleAsynchronousFileChannelImpl are modified as they have
threads releasing the scope different from the ones that allocated them,
so a ticket that tracks the counter has to be passed over.

The microbenchmark org.openjdk.bench.java.lang.foreign.
CallOverheadConstant.panama_identity_memory_address_shared_3
was used to generate the following results. The scalability was checked
on a number of platforms with the JMH parameter "-t" specifying the
number of threads. Measurements are in ns/op .

The hardware are the Neoverse-N1, N2, V1 and V2, Intel Xeon 8375c and the
AMD Epyc 9654.

Threads    N1         N2            V1         V2       Xeon       Epyc
    1      30.88      32.15      33.54      32.82      27.46       8.45
    2     142.56     134.48     132.01     131.50     116.68      46.53
    4     310.18     282.75     287.59     271.82     251.88      86.11
    8     702.02     710.29     736.72     670.63     533.46     194.60
   16   1,436.17   1,684.80   1,833.69   1,782.78   1,100.15     827.28
   24   2,185.55   2,508.86   2,732.22   2,815.26   1,646.09   1,530.28
   32   2,942.48   3,432.84   3,643.64   3,782.23   2,236.81   2,278.52
   48   4,466.56   5,174.72   5,401.95   5,621.41   4,926.30   3,026.58

After
Threads    N1         N2            V1       V2       Xeon       Epyc
    1      32.41      32.11      34.43    31.32      27.94       9.82
    2      32.64      33.72      35.11    31.30      28.02       9.81
    4      32.71      36.84      34.67    31.35      28.12      10.49
    8      58.22      31.60      36.87    31.72      47.09      16.52
   16      70.15      47.76      52.37    47.26      70.91      14.53
   24      77.38      78.14      81.67    71.98      87.20      21.70
   32      87.54      98.01      84.73    86.79     109.25      22.65
   48     121.54     128.14     120.51   104.35     175.08      26.85
@bridgekeeper
Copy link

bridgekeeper bot commented Dec 1, 2025

👋 Welcome back smonteith! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Dec 1, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot changed the title 8371260: Improve scaling of downcalls using MemorySegments allocated with shared arenas 8371260: Improve scaling of downcalls using MemorySegments allocated with shared arenas. Dec 1, 2025
@openjdk
Copy link

openjdk bot commented Dec 1, 2025

@stooart-mon The following labels will be automatically applied to this pull request:

  • core-libs
  • net
  • nio
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 1, 2025
@mlbridge
Copy link

mlbridge bot commented Dec 1, 2025

Webrevs

@AlanBateman
Copy link
Contributor

Can you confirm that you aren't planning to try to try to integrate this before the fork for JDK 26 this week? There are several discussion points, and some of the changes are in risky areas, will likely need back time in main line.

private final long position; // file position
private final PendingFuture<Integer,A> result;
private volatile boolean released;
private int ticket; // to release buffer scope
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for substitution, so needs to be declared with buf (further down), and needs to be volatile (can't use plain access as it may be released on different thread).

@stooart-mon
Copy link
Contributor Author

Can you confirm that you aren't planning to try to try to integrate this before the fork for JDK 26 this week? There are several discussion points, and some of the changes are in risky areas, will likely need back time in main line.

I am not planning, and wouldn't want to integrate this just before a fork. While I've checked it the best I can, I'd appreciate some scrutiny.

@mcimadamore
Copy link
Contributor

Thanks for working on this problem.

At the high level I understand your goal here. It is a bit unfortunate that we have to change the API for acquire/release which cascades in many smaller (but simple) changes throughout the JDK. But let's set that aside.

The main (and more serious) issue with this PR is the decoupling of the liveness bit from the ref counting part. This is something that will require some time to work through to ensure correctness is preserved in all cases.

More generally, it is generally advisable for such deep changes to perhaps reach out to panama-dev mailing list first, and maybe have a discussion/reach some consensus there, before moving ahead with a PR.

@stooart-mon
Copy link
Contributor Author

Thanks for working on this problem.

At the high level I understand your goal here. It is a bit unfortunate that we have to change the API for acquire/release which cascades in many smaller (but simple) changes throughout the JDK. But let's set that aside.
Yes, that was regrettable. Originally I retained the "void release0()" that recalculated the "ticket", which resulted in fewer changes. However, it was felt the explicit ticket passing was less likely to introduce errors.

The main (and more serious) issue with this PR is the decoupling of the liveness bit from the ref counting part. This is something that will require some time to work through to ensure correctness is preserved in all cases.

Yes, I had been grappling with the atomicity. The "acquireCount" and "state" are decoupled in the most recent revision of the, so this closing should either fail and not update state, or pass and complete the shutdown. However, the "state" and ScopedAccesses are complex, so I appreciate the scrutiny in that area.

More generally, it is generally advisable for such deep changes to perhaps reach out to panama-dev mailing list first, and maybe have a discussion/reach some consensus there, before moving ahead with a PR.

That was my mistake - I expected that with the foreign ABI being merged, discussions on internals would have reverted to here.

@mcimadamore
Copy link
Contributor

That was my mistake - I expected that with the foreign ABI being merged, discussions on internals would have reverted to here.

In general they are -- but I find that one issue with PRs is that they tend to be very code-centric, whereas in some cases it might be useful to discuss design options more abstractly and then converge towards an implementation. And doing design discussions via PR is suboptimal. But, you did right in bringing this up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants