
Conversation


@pvts-mat pvts-mat commented Nov 26, 2025

[LTS 9.2]
CVE-2025-38084 VULN-71577
CVE-2025-38085 VULN-71586
CVE-2024-57883 VULN-46929

Summary

The driving CVE was CVE-2025-38085. Fix for CVE-2025-38084 was included because it was closely related (same patch set). Additionally, the fix for CVE-2025-38085 required a prerequisite which had its own CVE-2024-57883.

The changes differ visibly from the upstream. Most of the differences result from using stable 5.15 backports, which applied cleanly to the ciqlts9_2 codebase. The exception is 14967a9, which hasn't been backported to 5.15 yet and was therefore adapted to ciqlts9_2 from the upstream by hand. The following table summarizes all the commits used and their roles.

| # | Subject | kernel-mainline | linux-5.15.y | Role | Ch-p basis | Clean? | Additional changes |
|---|---------|-----------------|--------------|------|------------|--------|--------------------|
| 1 | `hugetlb: unshare some PMDs when splitting VMAs` | b30c14c | bd9a23a | Prerequisite for (3) | linux-5.15.y | Yes | None |
| 2 | `mm: hugetlb: independent PMD page table shared count` | 59d9094 | 8410996 | Prerequisite for (4), fix for CVE-2024-57883 | linux-5.15.y | Yes | `RH_KABI_*` macro to address kABI breakage |
| 3 | `mm/hugetlb: unshare page tables during VMA split, not before` | 081056d | 366298f | Fix for CVE-2025-38084 | linux-5.15.y | Yes | None |
| 4 | `mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race` | 1013af4 | a3d864c | Fix for CVE-2025-38085 | linux-5.15.y | Yes | Exposing `tlb_remove_table_sync_one()`, done in 5.15 by a commit not backported to `ciqlts9_2` |
| 5 | `mm/hugetlb: make detecting shared pte more reliable` | 35e8761 | - | Prerequisite for (6) | kernel-mainline | No | Resolved trivial conflicts |
| 6 | `mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count` | 14967a9 | - | Bugfix for (2) | kernel-mainline | No | Adapted upstream to `ciqlts9_2` in a similar fashion as linux-5.15.y commit (2) did |

Commits

89824bf:

hugetlb: unshare some PMDs when splitting VMAs

jira VULN-71577
cve-pre CVE-2025-38084
commit-author James Houghton <[email protected]>
commit b30c14cd61025eeea2f2e8569606cd167ba9ad2d
upstream-diff Stable 5.15 backport bd9a23a4bb8a320ffa6e45cf09a068ec0a335350
  was used for the actual (clean) cherry-pick

6b0f840:

mm: hugetlb: independent PMD page table shared count

jira VULN-46929
cve CVE-2024-57883
commit-author Liu Shixin <[email protected]>
commit 59d9094df3d79443937add8700b2ef1a866b1081
upstream-diff Stable 5.15 backport 8410996eb6fea116fe1483ed977aacf580eee7b4
  was used for the actual (clean) cherry-pick. Additionally, the `atomic_t
  pt_share_count' field in `include/linux/mm_types.h' was wrapped in the
  RH_KABI_BROKEN_INSERT macro to avoid kABI checker complaints. This is
  justified because the inserted field (it is included, as
  CONFIG_ARCH_WANT_HUGE_PMD_SHARE gets enabled for at least
  `kernel-x86_64-rhel.config') is placed within a union which already
  contains a field of the same type, `atomic_t pt_frag_refcount', so the
  size of the union cannot change.

d7056d3:

mm/hugetlb: unshare page tables during VMA split, not before

jira VULN-71577
cve CVE-2025-38084
commit-author Jann Horn <[email protected]>
commit 081056dc00a27bccb55ccc3c6f230a3d5fd3f7e0
upstream-diff Stable 5.15 backport 366298f2b04d2bf1f2f2b7078405bdf9df9bd5d0
  was used for the actual (clean) cherry-pick

76a736b:

mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race

jira VULN-71586
cve CVE-2025-38085
commit-author Jann Horn <[email protected]>
commit 1013af4f585fccc4d3e5c5824d174de2257f7d6d
upstream-diff Stable 5.15 backport a3d864c901a300c295692d129159fc3001a56185
  was used for the actual cherry-pick. Additionally the
  2ba99c5e08812494bc57f319fb562f527d9bacd8 minus changes in `mm/khugepaged.c'
  was included to expose the `tlb_remove_table_sync_one' function.

35e8761:

mm/hugetlb: make detecting shared pte more reliable

jira VULN-46929
cve-bf CVE-2024-57883
commit-author Miaohe Lin <[email protected]>
commit 3aa4ed8040e1535d95c03cef8b52cf11bf0d8546
upstream-diff Accounted for e95a9851787bbb3cd4deb40fe8bab03f731852d1 not
  being backported to ciqlts9_2 - dropped the unnecessary braces in a
  one-statement `if' conditional.

314f5ab:

mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count

jira VULN-46929
cve-bf CVE-2024-57883
commit-author Jane Chu <[email protected]>
commit 14967a9c7d247841b0312c48dcf8cd29e55a4cc8
upstream-diff |
  include/linux/mm_types.h
        Removed the definition of `ptdesc_pmd_is_shared()' function in
        alignment with stable-5.15 backport
        8410996eb6fea116fe1483ed977aacf580eee7b4 (it omits the definition
        of `ptdesc_pmd_pts_*()' functions family, to which
        `ptdesc_pmd_is_shared()' belongs).
  mm/hugetlb.c
        copy_hugetlb_page_range()
              1. Used CONFIG_ARCH_WANT_HUGE_PMD_SHARE instead of
                 CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING, because the latter
                 was introduced only in the non-backported commit
                 188cac58a8bcdf82c7f63275b68f7a46871e45d6.
              2. Since `ptdesc_pmd_is_shared()' was not defined, read the
                 `pt_share_count' field directly, as is done in the
                 stable-5.15 backport
                 8410996eb6fea116fe1483ed977aacf580eee7b4. (Compare
                 changes to `huge_pmd_unshare()' in `mm/hugetlb.c' between
                 upstream 59d9094df3d79443937add8700b2ef1a866b1081 and
                 stable-5.15 8410996eb6fea116fe1483ed977aacf580eee7b4.)
        huge_pmd_unshare()
              No change to the conditional. It was arguably not needed in
              the upstream either, probably introduced only for the sake
              of clarity in the presence of the `ptdesc_pmd_is_shared()'
              function, which is missing here.

kABI check: passed

$ DEBUG=1 CVE=CVE-batch-12 ./ninja.sh _kabi_check_kernel__x86_64--test--ciqlts9_2-CVE-batch-12

[0/1] kabi_check_kernel	Check ABI of kernel [ciqlts9_2-CVE-batch-12]	_kabi_check_kernel__x86_64--test--ciqlts9_2-CVE-batch-12
++ uname -m
+ python3 /data/src/ctrliq-github-haskell/kernel-dist-git-el-9.2/SOURCES/check-kabi -k /data/src/ctrliq-github-haskell/kernel-dist-git-el-9.2/SOURCES/Module.kabi_x86_64 -s vms/x86_64--build--ciqlts9_2/build_files/kernel-src-tree-ciqlts9_2-CVE-batch-12/Module.symvers
kABI check passed
+ touch state/kernels/ciqlts9_2-CVE-batch-12/x86_64/kabi_checked

Boot test: passed

boot-test.log

Kselftests: passed relative

Reference

kselftests–ciqlts9_2–run1.log

Patch

kselftests–ciqlts9_2-CVE-batch-12–run1.log

Comparison

The results in reference and patch were the same, except for the net/forwarding:vxlan_asymmetric.sh test, which failed in the patched version for no obvious reason.

$ ktests.xsh diff -d kselftests--ciqlts9_2--run1.log kselftests--ciqlts9_2-CVE-batch-12--run1.log

Column    File
--------  --------------------------------------------
Status0   kselftests--ciqlts9_2--run1.log
Status1   kselftests--ciqlts9_2-CVE-batch-12--run1.log

TestCase                            Status0  Status1  Summary
net/forwarding:vxlan_asymmetric.sh  pass     fail     diff

The test was repeated three times on the patched version.

kselftests–ciqlts9_2-vxlan_asymmetric–run1.log
kselftests–ciqlts9_2-vxlan_asymmetric–run2.log
kselftests–ciqlts9_2-vxlan_asymmetric–run3.log

$ ktests.xsh diff kselftests--ciqlts9_2-vxlan_asymmetric*.log

Column    File
--------  ------------------------------------------------
Status0   kselftests--ciqlts9_2-vxlan_asymmetric--run1.log
Status1   kselftests--ciqlts9_2-vxlan_asymmetric--run2.log
Status2   kselftests--ciqlts9_2-vxlan_asymmetric--run3.log

TestCase                            Status0  Status1  Status2  Summary
net/forwarding:vxlan_asymmetric.sh  pass     pass     pass     same

@pvts-mat pvts-mat marked this pull request as ready for review November 27, 2025 13:40
@pvts-mat pvts-mat force-pushed the ciqlts9_2-CVE-batch-12 branch from 314f5ab to cda475b Compare November 30, 2025 23:19
@PlaidCat PlaidCat requested a review from a team December 2, 2025 22:09
jira VULN-71577
cve-pre CVE-2025-38084
commit-author James Houghton <[email protected]>
commit b30c14c
upstream-diff Stable 5.15 backport bd9a23a
  was used for the actual (clean) cherry-pick

PMD sharing can only be done in PUD_SIZE-aligned pieces of VMAs; however,
it is possible that HugeTLB VMAs are split without unsharing the PMDs
first.

Without this fix, it is possible to hit the uffd-wp-related WARN_ON_ONCE
in hugetlb_change_protection [1].  The key there is that
hugetlb_unshare_all_pmds will not attempt to unshare PMDs in
non-PUD_SIZE-aligned sections of the VMA.

It might seem ideal to unshare in hugetlb_vm_op_open, but we need to
unshare in both the new and old VMAs, so unsharing in hugetlb_vm_op_split
seems natural.

[1]: https://lore.kernel.org/linux-mm/CADrL8HVeOkj0QH5VZZbRzybNE8CG-tEGFshnA+bG9nMgcWtBSg@mail.gmail.com/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6dfeaff ("hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp")
	Signed-off-by: James Houghton <[email protected]>
	Reviewed-by: Mike Kravetz <[email protected]>
	Acked-by: Peter Xu <[email protected]>
	Cc: Axel Rasmussen <[email protected]>
	Cc: Muchun Song <[email protected]>
	Cc: <[email protected]>
	Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit b30c14c)
	Signed-off-by: Marcin Wcisło <[email protected]>
jira VULN-46929
cve CVE-2024-57883
commit-author Liu Shixin <[email protected]>
commit 59d9094
upstream-diff Stable 5.15 backport 8410996eb6fea116fe1483ed977aacf580eee7b4
  was used for the actual (clean) cherry-pick. Additionally, the `atomic_t
  pt_share_count' field in `include/linux/mm_types.h' was wrapped in the
  RH_KABI_BROKEN_INSERT macro to avoid kABI checker complaints. This is
  justified because the inserted field (it is included, as
  CONFIG_ARCH_WANT_HUGE_PMD_SHARE gets enabled for at least
  `kernel-x86_64-rhel.config') is placed within a union which already
  contains a field of the same type, `atomic_t pt_frag_refcount', so the
  size of the union cannot change.

The folio refcount may be increased unexpectedly through try_get_folio() by
a caller such as split_huge_pages.  In huge_pmd_unshare(), we use the refcount
to check whether a pmd page table is shared.  The check is incorrect if
the refcount is increased by the above caller, and this can cause the page
table to be leaked:

 BUG: Bad page state in process sh  pfn:109324
 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324
 flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff)
 page_type: f2(table)
 raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000
 raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000
 page dumped because: nonzero mapcount
 ...
 CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G    B              6.13.0-rc2master+ ctrliq#7
 Tainted: [B]=BAD_PAGE
 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
 Call trace:
  show_stack+0x20/0x38 (C)
  dump_stack_lvl+0x80/0xf8
  dump_stack+0x18/0x28
  bad_page+0x8c/0x130
  free_page_is_bad_report+0xa4/0xb0
  free_unref_page+0x3cc/0x620
  __folio_put+0xf4/0x158
  split_huge_pages_all+0x1e0/0x3e8
  split_huge_pages_write+0x25c/0x2d8
  full_proxy_write+0x64/0xd8
  vfs_write+0xcc/0x280
  ksys_write+0x70/0x110
  __arm64_sys_write+0x24/0x38
  invoke_syscall+0x50/0x120
  el0_svc_common.constprop.0+0xc8/0xf0
  do_el0_svc+0x24/0x38
  el0_svc+0x34/0x128
  el0t_64_sync_handler+0xc8/0xd0
  el0t_64_sync+0x190/0x198

The issue may be triggered by damon, offline_page, page_idle, etc, which
will increase the refcount of page table.

1. The page table itself will be discarded after reporting the
   "nonzero mapcount".

2. The HugeTLB page mapped by the page table misses freeing, since we
   treat the page table as shared, and a shared page table will not be
   unmapped.

Fix it by introducing independent PMD page table shared count.  As
described by comment, pt_index/pt_mm/pt_frag_refcount are used for s390
gmap, x86 pgds and powerpc, pt_share_count is used for x86/arm64/riscv
pmds, so we can reuse the field as pt_share_count.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 39dde65 ("[PATCH] shared page table for hugetlb page")
        Signed-off-by: Liu Shixin <[email protected]>
        Cc: Kefeng Wang <[email protected]>
        Cc: Ken Chen <[email protected]>
        Cc: Muchun Song <[email protected]>
        Cc: Nanyong Sun <[email protected]>
        Cc: Jane Chu <[email protected]>
        Cc: <[email protected]>
        Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 59d9094)
        Signed-off-by: Marcin Wcisło <[email protected]>
jira VULN-71577
cve CVE-2025-38084
commit-author Jann Horn <[email protected]>
commit 081056d
upstream-diff Stable 5.15 backport 366298f2b04d2bf1f2f2b7078405bdf9df9bd5d0
  was used for the actual (clean) cherry-pick

Currently, __split_vma() triggers hugetlb page table unsharing through
vm_ops->may_split().  This happens before the VMA lock and rmap locks are
taken - which is too early, it allows racing VMA-locked page faults in our
process and racing rmap walks from other processes to cause page tables to
be shared again before we actually perform the split.

Fix it by explicitly calling into the hugetlb unshare logic from
__split_vma() in the same place where THP splitting also happens.  At that
point, both the VMA and the rmap(s) are write-locked.

An annoying detail is that we can now call into the helper
hugetlb_unshare_pmds() from two different locking contexts:

1. from hugetlb_split(), holding:
    - mmap lock (exclusively)
    - VMA lock
    - file rmap lock (exclusively)
2. hugetlb_unshare_all_pmds(), which I think is designed to be able to
   call us with only the mmap lock held (in shared mode), but currently
   only runs while holding mmap lock (exclusively) and VMA lock

Backporting note:
This commit fixes a racy protection that was introduced in commit
b30c14c ("hugetlb: unshare some PMDs when splitting VMAs"); that
commit claimed to fix an issue introduced in 5.13, but it should actually
also go all the way back.

[[email protected]: v2]
  Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 39dde65 ("[PATCH] shared page table for hugetlb page")
	Signed-off-by: Jann Horn <[email protected]>
	Cc: Liam Howlett <[email protected]>
	Reviewed-by: Lorenzo Stoakes <[email protected]>
	Reviewed-by: Oscar Salvador <[email protected]>
	Cc: Lorenzo Stoakes <[email protected]>
	Cc: Vlastimil Babka <[email protected]>
	Cc: <[email protected]>	[b30c14c: hugetlb: unshare some PMDs when splitting VMAs]
	Cc: <[email protected]>
	Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 081056d)
	Signed-off-by: Marcin Wcisło <[email protected]>
jira VULN-71586
cve CVE-2025-38085
commit-author Jann Horn <[email protected]>
commit 1013af4
upstream-diff Stable 5.15 backport a3d864c901a300c295692d129159fc3001a56185
  was used for the actual cherry-pick. Additionally the
  2ba99c5 minus changes in `mm/khugepaged.c'
  was included to expose the `tlb_remove_table_sync_one' function.

huge_pmd_unshare() drops a reference on a page table that may have
previously been shared across processes, potentially turning it into a
normal page table used in another process in which unrelated VMAs can
afterwards be installed.

If this happens in the middle of a concurrent gup_fast(), gup_fast() could
end up walking the page tables of another process.  While I don't see any
way in which that immediately leads to kernel memory corruption, it is
really weird and unexpected.

Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one(),
just like we do in khugepaged when removing page tables for a THP
collapse.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 39dde65 ("[PATCH] shared page table for hugetlb page")
	Signed-off-by: Jann Horn <[email protected]>
	Reviewed-by: Lorenzo Stoakes <[email protected]>
	Cc: Liam Howlett <[email protected]>
	Cc: Muchun Song <[email protected]>
	Cc: Oscar Salvador <[email protected]>
	Cc: Vlastimil Babka <[email protected]>
	Cc: <[email protected]>
	Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 1013af4)
	Signed-off-by: Marcin Wcisło <[email protected]>
jira VULN-46929
cve-bf CVE-2024-57883
commit-author Miaohe Lin <[email protected]>
commit 3aa4ed8
upstream-diff Accounted for e95a985 not
  being backported to ciqlts9_2 - dropped the unnecessary braces in a
  one-statement `if' conditional.

If the pagetables are shared, we shouldn't copy or take references.  Since
src could have unshared and dst shares with another vma, huge_pte_none()
is thus used to determine whether dst_pte is shared.  But this check isn't
reliable.  A shared pte could have pte none in pagetable in fact.  The
page count of ptep page should be checked here in order to reliably
determine whether pte is shared.

[[email protected]: remove unused local variable dst_entry in copy_hugetlb_page_range()]
  Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
	Signed-off-by: Miaohe Lin <[email protected]>
	Signed-off-by: Lukas Bulwahn <[email protected]>
	Reviewed-by: Mike Kravetz <[email protected]>
	Cc: Muchun Song <[email protected]>
	Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 3aa4ed8)
	Signed-off-by: Marcin Wcisło <[email protected]>
jira VULN-46929
cve-bf CVE-2024-57883
commit-author Jane Chu <[email protected]>
commit 14967a9
upstream-diff |
  include/linux/mm_types.h
        Removed the definition of `ptdesc_pmd_is_shared()' function in
        alignment with stable-5.15 backport
        8410996eb6fea116fe1483ed977aacf580eee7b4 (it omits the definition
        of `ptdesc_pmd_pts_*()' functions family, to which
        `ptdesc_pmd_is_shared()' belongs).
  mm/hugetlb.c
        copy_hugetlb_page_range()
              1. Used CONFIG_ARCH_WANT_HUGE_PMD_SHARE instead of
                 CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING, because the latter
                 was introduced only in the non-backported commit
                 188cac5.
              2. Since `ptdesc_pmd_is_shared()' was not defined, read the
                 `pt_share_count' field directly, as is done in the
                 stable-5.15 backport
                 8410996eb6fea116fe1483ed977aacf580eee7b4. (Compare
                 changes to `huge_pmd_unshare()' in `mm/hugetlb.c' between
                 upstream 59d9094 and
                 stable-5.15 8410996eb6fea116fe1483ed977aacf580eee7b4.)
        huge_pmd_unshare()
              No change to the conditional. It was arguably not needed in
              the upstream either, probably introduced only for the sake
              of clarity in the presence of the `ptdesc_pmd_is_shared()'
              function, which is missing here.

commit 59d9094 ("mm: hugetlb: independent PMD page table shared
count") introduced ->pt_share_count dedicated to hugetlb PMD share count
tracking, but omitted fixing copy_hugetlb_page_range(), leaving the
function relying on page_count() for tracking that no longer works.

When lazy page table copy for hugetlb is disabled, that is, when commit
bcd51a3 ("hugetlb: lazy page table copies in fork()") is reverted, fork()'ing
with hugetlb PMD sharing quickly locks up -

[  239.446559] watchdog: BUG: soft lockup - CPU#75 stuck for 27s!
[  239.446611] RIP: 0010:native_queued_spin_lock_slowpath+0x7e/0x2e0
[  239.446631] Call Trace:
[  239.446633]  <TASK>
[  239.446636]  _raw_spin_lock+0x3f/0x60
[  239.446639]  copy_hugetlb_page_range+0x258/0xb50
[  239.446645]  copy_page_range+0x22b/0x2c0
[  239.446651]  dup_mmap+0x3e2/0x770
[  239.446654]  dup_mm.constprop.0+0x5e/0x230
[  239.446657]  copy_process+0xd17/0x1760
[  239.446660]  kernel_clone+0xc0/0x3e0
[  239.446661]  __do_sys_clone+0x65/0xa0
[  239.446664]  do_syscall_64+0x82/0x930
[  239.446668]  ? count_memcg_events+0xd2/0x190
[  239.446671]  ? syscall_trace_enter+0x14e/0x1f0
[  239.446676]  ? syscall_exit_work+0x118/0x150
[  239.446677]  ? arch_exit_to_user_mode_prepare.constprop.0+0x9/0xb0
[  239.446681]  ? clear_bhb_loop+0x30/0x80
[  239.446684]  ? clear_bhb_loop+0x30/0x80
[  239.446686]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

There are two options to resolve the potential latent issue:
  1. warn against PMD sharing in copy_hugetlb_page_range(),
  2. fix it.
This patch opts for the second option.
While at it, simplify the comment, the details are not actually relevant
anymore.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 59d9094 ("mm: hugetlb: independent PMD page table shared count")
	Signed-off-by: Jane Chu <[email protected]>
	Reviewed-by: Harry Yoo <[email protected]>
	Acked-by: Oscar Salvador <[email protected]>
	Acked-by: David Hildenbrand <[email protected]>
	Cc: Jann Horn <[email protected]>
	Cc: Liu Shixin <[email protected]>
	Cc: Muchun Song <[email protected]>
	Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 14967a9)
	Signed-off-by: Marcin Wcisło <[email protected]>
@pvts-mat pvts-mat force-pushed the ciqlts9_2-CVE-batch-12 branch from cda475b to 111f19f Compare December 2, 2025 23:08

PlaidCat commented Dec 4, 2025

Needs removed

We cannot backport this one despite it being a CVE: mm: hugetlb: independent PMD page table shared count 9c9e4aa

The struct page change doesn't meet the criteria for RH_KABI_BROKEN_INSERT. While the union-size argument may be true, it doesn't mean that already-compiled OOT drivers would know what to do with these new bits if they incorrectly use pt_share_count, because they're unaware of it.

In addition, Red Hat has yet to fix this CVE, I SUSPECT for the exact same reason: it breaks a core structure and they're doing everything they can to not address it. It seems "fix deferred" may end up meaning it is addressed in a new minor release, when they can refresh the kABI for 9 and 10 ... but it's also a year old, so I'm not sure what they're doing.
https://access.redhat.com/security/cve/cve-2024-57883

```c
int huge_pmd_unshare(...) {
    BUG_ON(page_count(virt_to_page(ptep)) == 0);
    if (page_count(virt_to_page(ptep)) == 1)  // ← uses page refcount
        return 0;
    pud_clear(pud);
    put_page(virt_to_page(ptep));             // ← drops page refcount
```

We should just need to insert the new call, but it's not clean:

```c
int huge_pmd_unshare(...) {
    BUG_ON(page_count(virt_to_page(ptep)) == 0);
    if (page_count(virt_to_page(ptep)) == 1)  // ← uses page refcount
        return 0;
    pud_clear(pud);

    tlb_remove_table_sync_one();              // ← called before the drop
    put_page(virt_to_page(ptep));             // ← drops page refcount
```

We can also drop these BFs:

  • mm/hugetlb: make detecting shared pte more reliable ab8f879
  • mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count 111f19f

Needs minor rework

mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race 40ea049
You are correct that we do need the exposure code for tlb_remove_table_sync_one(), but could you make it an independent precondition? I didn't really investigate why the other code was dropped from mm/khugepaged.c.

Could you elaborate on that please?

@kerneltoast
Collaborator

Needs removed

We cannot backport this one despite it being a CVE: mm: hugetlb: independent PMD page table shared count 9c9e4aa

The struct page change doesn't meet the criteria for RH_KABI_BROKEN_INSERT. While the union-size argument may be true, it doesn't mean that already-compiled OOT drivers would know what to do with these new bits if they incorrectly use pt_share_count, because they're unaware of it.

Not quite. The trick with the unions in struct page is that every subsystem using struct page can reuse the scratch space however it pleases provided that it owns the page. When ownership of the page is released back to the buddy allocator, the contents of that scratch space don't matter. And then when that page is allocated again, the meaning of the scratch space will be up to the interpretation of that page owner.

mm: hugetlb: independent PMD page table shared count is OK.


pvts-mat commented Dec 4, 2025

In addition red hat has yet to fix this CVE, I SUSPECT for the exact same reason

It has relatively low CVSS score (5.5) and also severity level "LOW". I often see such CVEs neglected by RH, this could be one of them.

I wasn't 100% sure about this kABI issue admittedly, so I'm glad we're having this discussion.

@kerneltoast
Collaborator

In addition red hat has yet to fix this CVE, I SUSPECT for the exact same reason

It has relatively low CVSS score (5.5) and also severity level "LOW". I often see such CVEs neglected by RH, this could be one of them.

I wasn't 100% sure about this kABI issue admittedly, so I'm glad we're having this discussion.

The resolution for the kABI failure is correct. There isn't a kABI break here, so it is correct to use RH_KABI_BROKEN_INSERT to silence the false positive from the kABI checker.


pvts-mat commented Dec 5, 2025

mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race 40ea049
You are correct that we do need the exposure code for tlb_remove_table_sync_one(), but could you make it an independent precondition? I didn't really investigate why the other code was dropped from mm/khugepaged.c.

Could you elaborate on that please?

The only functional change in commit 2ba99c5 is the one made to the mm/khugepaged.c file. The rest is just exposing the tlb_remove_table_sync_one() function, which was needed for that change. This commit was backported to 5.15 as 79ad784, and the a3d864c901a300c295692d129159fc3001a56185 commit from 5.15, used as a base for the ciqlts9_2 backport, relied on it, using tlb_remove_table_sync_one() as well. I could have

  1. exposed tlb_remove_table_sync_one() arbitrarily by hand, adding outside noise to the codebase,
  2. backported the whole 79ad784 as a prerequisite, introducing functional changes unrelated to the CVE-2025-38085 fix, or
  3. exposed tlb_remove_table_sync_one() exactly as it was done in 79ad784, without any of the functional changes it introduced.

I opted for (3).

could you make it an idepented precondition

You mean to do (3) as a separate commit? Or to do (2)?


PlaidCat commented Dec 5, 2025

In addition red hat has yet to fix this CVE, I SUSPECT for the exact same reason

It has relatively low CVSS score (5.5) and also severity level "LOW". I often see such CVEs neglected by RH, this could be one of them.
I wasn't 100% sure about this kABI issue admittedly, so I'm glad we're having this discussion.

The resolution for the kABI failure is correct. There isn't a kABI break here, so it is correct to use RH_KABI_BROKEN_INSERT to silence the false positive from the kABI checker.

To be clear, @kerneltoast corrected me on some struct page-SPECIFIC context where there is no true kABI breakage. In another structure, one expected to be shared beyond the original owner, the union would have a breakage: if another user, like an OOT driver, didn't know about the additional member, it could try to access a version of the union that has bits set in a way it doesn't understand, causing a true breakage. Since only the owner is using this struct, it will not be accessed by someone else with a different understanding of the types of data that could be in the union.

To clarify, there is true kABI breakage, and then there is Red Hat's kABI checker, which is not a deep checksum; for lack of a better description, it's strcmp(old definition, new definition) != 0. It mostly gets the job done and can deal at a high level with conditions like this, where there could or could not actually be a break. It tends to fall on the side of being as restrictive as possible, the big-hammer approach, hence the relatively simple implementation. And it's been in their workflow for decades, so it's probably not a priority to improve a lot.

So I'll take back my "Needs removed" on the advice and guidance of @kerneltoast.

mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race 40ea049
You are correct that we do need the exposure code for tlb_remove_table_sync_one(), but could you make it an independent precondition? I didn't really investigate why the other code was dropped from mm/khugepaged.c.
Could you elaborate on that please?

The only functional change in commit 2ba99c5 is the one made to the mm/khugepaged.c file. The rest is just exposing the tlb_remove_table_sync_one() function, which was needed for that change. This commit was backported to 5.15 as 79ad784, and the a3d864c901a300c295692d129159fc3001a56185 commit from 5.15, used as a base for the ciqlts9_2 backport, relied on it, using tlb_remove_table_sync_one() as well. I could have

1. exposed `tlb_remove_table_sync_one()` arbitrarily by hand, adding outside noise to the codebase,

2. backported the whole [79ad784](https://github.com/ctrliq/kernel-src-tree/commit/79ad784c9d2165de7811582aaa3012c997cf4d26) as a prerequisite, introducing functional changes unrelated to the [CVE-2025-38085](https://github.com/advisories/GHSA-5q6f-wm2q-mpgg) fix, or

3. exposed `tlb_remove_table_sync_one()` exactly as it was done in [79ad784](https://github.com/ctrliq/kernel-src-tree/commit/79ad784c9d2165de7811582aaa3012c997cf4d26), without any of the functional changes it introduced.

I opted for (3).

could you make it an idepented precondition

You mean to do (3) as a separate commit? Or to do (2)?

2 ... the exposure of the function, unless this exposes something really gross; my quick glance is that it looks like it should slot right in. But it's also fixing stuff, so that is also good.


pvts-mat commented Dec 5, 2025

2 ... the exposure of the function, unless this exposes something really gross; my quick glance is that it looks like it should slot right in.

Now that I've tried to plug it in quickly, I recollect why I avoided it: there are conflicts, and they don't seem trivial. I would have to take a closer look at the history of changes, especially given that the 5.15 version differs from upstream in that mm/khugepaged.c file. It is a bugfix, but there seems to be no CVE associated with it. Do you want me to work on including it?


pvts-mat commented Dec 5, 2025

To be clear, @kerneltoast corrected me on some struct page-SPECIFIC context where there is no true kABI breakage. In another structure, one expected to be shared beyond the original owner, the union would have a breakage: if another user, like an OOT driver, didn't know about the additional member, it could try to access a version of the union that has bits set in a way it doesn't understand, causing a true breakage. Since only the owner is using this struct, it will not be accessed by someone else with a different understanding of the types of data that could be in the union.

Yes, I get the levels of thought here. I was wondering, though: if some union were "public", in the sense that it could be used by someone other than its owner (which allocated and initialized it), then there would have to be some switch somewhere (a variable) encoding how it should be interpreted. It would therefore be reasonable to assume that a driver, or any other user, checks the value of this switch before using the union. Introducing a new field to the union would imply a new value for the switch, which the driver should be able to recognize that it doesn't recognize, and so reject the unknown data structure gracefully. At least that was my thinking, which I recorded in the upstream-diff.

Is it too much to expect from drivers? Or is my assumption wrong: can unions be used without such a switch somehow?


PlaidCat commented Dec 5, 2025

2 ... the exposure of the unless this exposes something really gross, my quick glance is it looks like it should slot right in.

Now that I tried to plug it in quickly I recollect why I avoided it - there are conflicts, and they don't seem trivial. I would have to take a closer look at the history of changes, especially given that the 5.15 version differs from upstream in the mm/khugepaged.c file. It is a bugfix, but there seems to be no CVE associated with it. Do you want me to work on including it?

What is the conflict? Maybe I missed something when I glanced at the insertion points.

Can you show the conflicts in a diff in a comment?

Yes, I get the levels of thought here. I was wondering though: if some union was "public", in the sense that it could be used by parties other than its owner (which allocated and initialized it), then there would have to be some switch somewhere (a variable) encoding how it should be interpreted? It would therefore be reasonable to assume that a driver, or any other user, checks the value of this switch before using the union. Introducing a new field to it would imply using a new value for the switch, which the driver should be able to recognize as one it doesn't know, and reject the unknown data structure gracefully. At least that was my thinking, which I saved from upstream-diff.

Is it too much to expect from drivers? Or is my assumption wrong and can unions be used without such a switch somehow?

I think it's hard to know how good the OOT drivers are because we can't always see their code. The big thing to consider is that even MAJOR driver vendors in the kernel src tree regularly get yelled at because of bad code quality, and usually those LKML requests come from their OOT 'dev' driver. My general assumption is that if it can be used wrong, it will be used wrong.

@pvts-mat
Contributor Author

pvts-mat commented Dec 5, 2025

Can you show the conflicts in a diff in a comment?

Which one? Variant from kernel-mainline or linux-5.15.y? They are different and both generate conflicts. Here you go FWIW:

linux-5.15.y 79ad784:

index 6223b06e48027462c443814499b9d6a20471ea50,1735123e462ad68c4dfd2f26c008451f16250538..0000000000000000000000000000000000000000
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@@ -1514,7 -1536,13 +1515,17 @@@ void collapse_pte_mapped_thp(struct mm_
  	}
  
  	/* step 4: collapse pmd */
++<<<<<<< HEAD
 +	collapse_and_free_pmd(mm, vma, haddr, pmd);
++=======
+ 	_pmd = pmdp_collapse_flush(vma, haddr, pmd);
+ 	mm_dec_nr_ptes(mm);
+ 	tlb_remove_table_sync_one();
+ 	pte_free(mm, pmd_pgtable(_pmd));
+ 
+ 	i_mmap_unlock_write(vma->vm_file->f_mapping);
+ 
++>>>>>>> 79ad784c9d2165de7811582aaa3012c997cf4d26 (mm/khugepaged: fix GUP-fast interaction by sending IPI)
  drop_hpage:
  	unlock_page(hpage);
  	put_page(hpage);
@@@ -1591,8 -1621,13 +1602,18 @@@ static void retract_page_tables(struct 
  		 * reverse order. Trylock is a way to avoid deadlock.
  		 */
  		if (mmap_write_trylock(mm)) {
++<<<<<<< HEAD
 +			if (!khugepaged_test_exit(mm))
 +				collapse_and_free_pmd(mm, vma, addr, pmd);
++=======
+ 			if (!khugepaged_test_exit(mm)) {
+ 				/* assume page table is clear */
+ 				_pmd = pmdp_collapse_flush(vma, addr, pmd);
+ 				mm_dec_nr_ptes(mm);
+ 				tlb_remove_table_sync_one();
+ 				pte_free(mm, pmd_pgtable(_pmd));
+ 			}
++>>>>>>> 79ad784c9d2165de7811582aaa3012c997cf4d26 (mm/khugepaged: fix GUP-fast interaction by sending IPI)
  			mmap_write_unlock(mm);
  		} else {
  			/* Try again later */

kernel-mainline 2ba99c5:

diff --cc mm/khugepaged.c
index 6223b06e48027462c443814499b9d6a20471ea50,294cb75d9c2225b8bd53ae70ab5cb061aacb9f09..0000000000000000000000000000000000000000
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@@ -1138,13 -1051,14 +1138,14 @@@ static void collapse_huge_page(struct m
  	_pmd = pmdp_collapse_flush(vma, address, pmd);
  	spin_unlock(pmd_ptl);
  	mmu_notifier_invalidate_range_end(&range);
+ 	tlb_remove_table_sync_one();
  
  	spin_lock(pte_ptl);
 -	result =  __collapse_huge_page_isolate(vma, address, pte, cc,
 -					       &compound_pagelist);
 +	isolated = __collapse_huge_page_isolate(vma, address, pte,
 +			&compound_pagelist);
  	spin_unlock(pte_ptl);
  
 -	if (unlikely(result != SCAN_SUCCEED)) {
 +	if (unlikely(!isolated)) {
  		pte_unmap(pte);
  		spin_lock(pmd_ptl);
  		BUG_ON(!pmd_none(*pmd));
@@@ -1410,9 -1401,17 +1411,13 @@@ static void collapse_and_free_pmd(struc
  	pmd_t pmd;
  
  	mmap_assert_write_locked(mm);
 -	if (vma->vm_file)
 -		lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
 -	/*
 -	 * All anon_vmas attached to the VMA have the same root and are
 -	 * therefore locked by the same lock.
 -	 */
 -	if (vma->anon_vma)
 -		lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
 -
 +	ptl = pmd_lock(vma->vm_mm, pmdp);
  	pmd = pmdp_collapse_flush(vma, addr, pmdp);
++<<<<<<< HEAD
 +	spin_unlock(ptl);
++=======
+ 	tlb_remove_table_sync_one();
++>>>>>>> 2ba99c5e08812494bc57f319fb562f527d9bacd8 (mm/khugepaged: fix GUP-fast interaction by sending IPI)
  	mm_dec_nr_ptes(mm);
  	page_table_check_pte_clear_range(mm, addr, pmd);
  	pte_free(mm, pmd_pgtable(pmd));

This says nothing, really. The deeper dive is unavoidable. Given below.

First we need to establish whether to use kernel-mainline or linux-5.15.y. Upstream generates smaller conflicts, but that doesn't have to mean much.

kernel-mainline modifies functions collapse_huge_page(…), collapse_and_free_pmd(…).
linux-5.15.y modifies functions collapse_huge_page(…), collapse_pte_mapped_thp(…), retract_page_tables(…).

Cherry-picking the changes in collapse_huge_page(…) applies cleanly for both versions and results in the same code, so we can assume this part is settled.

kernel-mainline introduces the function collapse_and_free_pmd(…) in e59a47b and upgrades it in 8d3c106 to the form that 2ba99c5 deals with. These two commits are not backported to linux-5.15.y. The first one is, however, backported to ciqlts9_2, but not the second one. So ciqlts9_2 is somewhat halfway there, which explains the conflicts for both variants.

Introducing collapse_and_free_pmd(…) was basically a code refactor, putting similar code into one place. The linux-5.15.y variant of the patch spreads the tlb_remove_table_sync_one() call into the many places that existed before the refactor. Since ciqlts9_2 already has collapse_and_free_pmd(…) defined, the preferred cherry-pick to use is the upstream variant.

The conflict for kernel-mainline's 2ba99c5 patch stems from the lack of the backported collapse_and_free_pmd(…) upgrade from 8d3c106. That commit is a fix, but it is not associated with any CVE and it introduced a bug which would require at least one more patch to include, so perhaps it's best not to backport it just for the sake of 2ba99c5.

This means manual resolution of kernel-mainline's 2ba99c5. The spin_unlock(ptl) line definitely has to stay, because it is the closing pair of the ptl = pmd_lock(vma->vm_mm, pmdp); two lines before it. The introduced tlb_remove_table_sync_one() call can then go either before or after the unlocking. Looking at the change made in collapse_huge_page(…), which has a very similar code sequence,

 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * This removes any huge TLB entry from the CPU so we won't allow
 	 * huge and small TLB entries for the same virtual address to
 	 * avoid the risk of CPU bugs in that area.
 	 *
 	 * Parallel fast GUP is fine since fast GUP will back off when
 	 * it detects PMD is changed.
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(&range);
+	tlb_remove_table_sync_one();

suggests that the tlb_remove_table_sync_one() call doesn't need to be placed within the lock, so the conflict resolution would be to keep both lines

spin_unlock(ptl);
tlb_remove_table_sync_one();
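
For reference, the resolved collapse_and_free_pmd(…) would then read roughly as follows. This is a sketch assembled from the hunks quoted above (the 2ba99c5 change applied on top of the ciqlts9_2 state of the code), not the final backport:

```c
static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
				  unsigned long addr, pmd_t *pmdp)
{
	spinlock_t *ptl;
	pmd_t pmd;

	mmap_assert_write_locked(mm);
	ptl = pmd_lock(vma->vm_mm, pmdp);
	pmd = pmdp_collapse_flush(vma, addr, pmdp);
	spin_unlock(ptl);            /* kept from HEAD: pairs with pmd_lock() above */
	tlb_remove_table_sync_one(); /* added by 2ba99c5, placed outside the lock */
	mm_dec_nr_ptes(mm);
	page_table_check_pte_clear_range(mm, addr, pmd);
	pte_free(mm, pmd_pgtable(pmd));
}
```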

Here's how the modified PR would look:

d23c840 hugetlb: unshare some PMDs when splitting VMAs
9c9e4aa mm: hugetlb: independent PMD page table shared count
408b318 mm/hugetlb: unshare page tables during VMA split, not before
e4c8f85 mm/khugepaged: fix GUP-fast interaction by sending IPI
cf25739 mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
03565d2 mm/hugetlb: make detecting shared pte more reliable
f8a3cfc mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count

Do you want me to proceed with this version?

@pvts-mat pvts-mat closed this Dec 5, 2025
@pvts-mat pvts-mat reopened this Dec 5, 2025
@PlaidCat
Collaborator

PlaidCat commented Dec 5, 2025

Let me take a look; it'll probably be a Monday thing.

@kerneltoast
Collaborator

To be clear, @kerneltoast corrected me on some struct page SPECIFIC context that doesn't have a true kABI breakage. In another structure, where it's expected to be shared beyond the original owner, the union would have a breakage: if the other user, like an OOT driver, didn't know about the additional parameter, it could try to access a version of the union that has bits set in a way it doesn't understand, causing a true breakage. Since only the owner is using this struct, it will not be accessed by someone else with a different understanding of the types of data that could be in the union.

Yes, I get the levels of thought here. I was wondering though: if some union was "public", in the sense that it could be used by parties other than its owner (which allocated and initialized it), then there would have to be some switch somewhere (a variable) encoding how it should be interpreted? It would therefore be reasonable to assume that a driver, or any other user, checks the value of this switch before using the union. Introducing a new field to it would imply using a new value for the switch, which the driver should be able to recognize as one it doesn't know, and reject the unknown data structure gracefully. At least that was my thinking, which I saved from upstream-diff.

Is it too much to expect from drivers? Or is my assumption wrong and can unions be used without such a switch somehow?

A simpler way of looking at it: struct page comes with some extra bytes you can use as you please for your own purposes when you allocate a page. The union facilitates that, so that you can actually put a name and type on those extra bytes; otherwise, the extra bytes would be like u8 extra_bytes[40]; in struct page and then you'd see a mess of code everywhere doing stuff like struct myextrapagestuff *s = (void *)page->extra_bytes; in order to give those extra bytes a type.

Generally, a union can serve one of two purposes: it can either provide a cast to reinterpret your data in a more convenient way (e.g., a union of two u32 integers can be combined together and read as a single u64 without needing to do any bit shifting), OR it can allow you to reuse existing bytes for multiple mutually exclusive purposes. In the latter case, there must be some way of determining which union member is the one that accurately represents the type of data stored within.

If you have a union containing a float and an int, how do you know which one to use to interpret the data? If the union is passed around to code that can deal with both cases (float and int), then you'd need to also provide a flag that says "hey this is a float" or "hey this is an int".

If the union is reusing bytes for mutually exclusive purposes, then it means you don't need to be able to store both types of data at the same time, but you do need to be able to store both of them.

Adding a new member to a union without changing the size of the union isn't ever inherently a kABI break. What is a kABI break is trampling over data that something else stored and expects to retrieve fully intact.

Therefore, to verify kABI correctness for new union members, you only need to audit the code accessing the new member to make sure it doesn't modify those bytes while something else is actively using those bytes for a different union member. It is ultimately just a synchronization problem.
