Fix gradient mismatch in checkpointing by preserving node metadata #2553

Copilot · 2025-09-30T12:56:47Z

Description

Fixes gradient computation errors in test_checkpointing_thunderfx by preserving PyTorch FX node metadata during checkpoint function conversion.

Problem

The test test_networks.py::test_checkpointing_thunderfx was failing with significant gradient mismatches between eager PyTorch and thunderfx, particularly on B200/GB200 GPUs:

AssertionError: Tensor-likes are not close!

Mismatched elements: 62 / 20480 (0.3%)
Greatest absolute difference: 9818.546875 at index (1, 7) (up to 0.001 allowed)
Greatest relative difference: 22.85102081298828 at index (1, 6) (up to 0.001 allowed)

Root Cause

In thunder/dynamo/utils.py, the _checkpoint_function_converter function converts PyTorch operators to Thunder operators inside checkpointed functions. However, it was not copying the FX node metadata (node.meta) from the original nodes to the new Thunder nodes.

This metadata contains critical information for gradient computation:

example_value: Tensor with shape, dtype, device information
requires_grad: Whether gradients should be computed
tensor_meta: Additional tensor properties (stride, layout, etc.)
Other PyTorch dynamo annotations needed for autograd

Without this metadata, the Thunder operators lacked the necessary information to properly participate in the autograd graph during the backward pass of activation checkpointing, resulting in incorrect gradient computation.

Solution

Added a single line to copy metadata when creating Thunder nodes:

thunder_node.meta = n.meta.copy()

This ensures all tensor properties are preserved, allowing the Thunder operators to correctly compute gradients during the recomputation phase of activation checkpointing.

Impact

Minimal change: Only 3 lines added (2 comments + 1 code)
No breaking changes: Only adds missing functionality
Targeted fix: Only affects checkpoint function conversion
Resolves: Gradient mismatch issues in checkpointing on all GPUs

Fixes #[issue_number_if_available]

Original prompt

This section details on the original issue you should resolve

<issue_title>test_networks::test_checkpointing_thunderfx fails on (G)B200 due to grads mismatch</issue_title>
<issue_description>## 🐛 Bug

test_networks.py::test_checkpointing_thunderfx fails due to grads mismatch between eager pytorch and thunderfx.

To Reproduce

Steps to reproduce the behavior:

Run test_networks.py::test_checkpointing_thunderfx

See grad mismatch e.g.
>       assert_close(grads_res, grads_ref, atol=1e-3, rtol=1e-3)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 62 / 20480 (0.3%)
E       Greatest absolute difference: 9818.546875 at index (1, 7) (up to 0.001 allowed)
E       Greatest relative difference: 22.85102081298828 at index (1, 6) (up to 0.001 allowed)
E
E       The failure occurred for item [1]
Expected behavior

Environment

pjnl-20250926

Additional context

It seems that the test case has been failing since mid August (b: 8/13, gb: 8/21)

The pytorch checkpointing function itself seems stable, from the file's commit history -- https://github.com/pytorch/pytorch/commits/viable/strict/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py</issue_description>

Comments on the Issue (you are @copilot in this section)
@IvanYashchuk @kiya00, could you please take a look at this failure? Can the repro script generation tools help with the starting point for this problem?

Fixes #2543

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

The issue was that when converting torch operators to thunder operators in checkpointed functions, the node metadata (containing requires_grad, dtype, shape, etc.) was not being copied. This caused gradient computation issues during backward pass, especially on B200/GB200 GPUs where numerical precision is critical. The fix adds a single line to copy metadata from original nodes to the new thunder nodes, preserving all tensor properties necessary for correct gradient computation. Co-authored-by: IvanYashchuk <[email protected]>

Initial plan

0127eba

Copilot AI assigned Copilot and IvanYashchuk Sep 30, 2025

Copilot started work on behalf of IvanYashchuk September 30, 2025 12:56 View session

Copilot AI changed the title ~~[WIP] test_networks::test_checkpointing_thunderfx fails on (G)B200 due to grads mismatch~~ Fix gradient mismatch in checkpointing by preserving node metadata Sep 30, 2025

Copilot AI requested a review from IvanYashchuk September 30, 2025 13:12

Copilot finished work on behalf of IvanYashchuk September 30, 2025 13:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix gradient mismatch in checkpointing by preserving node metadata #2553

Fix gradient mismatch in checkpointing by preserving node metadata #2553

Uh oh!

Copilot AI commented Sep 30, 2025 •

edited

Loading

Uh oh!

Uh oh!

Fix gradient mismatch in checkpointing by preserving node metadata #2553

Are you sure you want to change the base?

Fix gradient mismatch in checkpointing by preserving node metadata #2553

Uh oh!

Conversation

Copilot AI commented Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Problem

Root Cause

Solution

Impact

To Reproduce

Expected behavior

Environment

Additional context

Comments on the Issue (you are @copilot in this section)

Uh oh!

Uh oh!

Copilot AI commented Sep 30, 2025 •

edited

Loading