
Conversation

@littlebullGit
Contributor

@littlebullGit littlebullGit commented Sep 22, 2025

This PR enhances the ModelCheckpoint callback to properly handle manual optimization scenarios, ensuring checkpoints reflect the intended model state and providing clear user guidance.

Fixes #20947

Key Changes:

  • Manual Optimization Support: Ensures checkpoints capture the model state before optimization when using manual optimization with every_n_train_steps (see the sketch after this list).
  • User Warning: Adds a clear warning when pre-optimization state isn't saved, helping users understand the checkpoint behavior.
  • Documentation: Updates docstrings and examples to clarify the behavior with manual optimization.
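
For context, here is a minimal sketch of the scenario this PR addresses; the module and the values are illustrative, not code from the PR:

    import torch
    from torch import nn
    from lightning.pytorch import LightningModule, Trainer
    from lightning.pytorch.callbacks import ModelCheckpoint

    class ManualOptModule(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 2)
            # Opt out of automatic optimization: we step the optimizer ourselves.
            self.automatic_optimization = False

        def training_step(self, batch, batch_idx):
            opt = self.optimizers()
            loss = self.layer(batch).sum()
            opt.zero_grad()
            self.manual_backward(loss)
            opt.step()  # weights change here, inside training_step
            return loss

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    # With every_n_train_steps, the checkpoint fires around training_step;
    # this PR ensures the saved weights are the intended (pre-optimization) state.
    checkpoint = ModelCheckpoint(every_n_train_steps=100)
    trainer = Trainer(callbacks=[checkpoint], max_steps=500)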

Testing:

  • Added test cases to verify checkpoint behavior with manual optimization
  • Ensured backward compatibility with automatic optimization
  • Verified warning messages are shown in appropriate scenarios

📚 Documentation preview 📚: https://pytorch-lightning--21239.org.readthedocs.build/en/21239/

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Sep 22, 2025
@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch from 4f06495 to 16552e5 on September 22, 2025 03:53
@Borda Borda changed the title Fix ModelCheckpoint with manual optimization and every_n_train_steps Sep 22, 2025
@littlebullGit littlebullGit reopened this Sep 22, 2025
@littlebullGit
Contributor Author

littlebullGit commented Sep 23, 2025

The link-check failure (generated/CONTRIBUTING: line 6) is a broken link, https://medium.com/pytorch-lightning/quick-contribution-guide-86d977171b3a, returning "429 Client Error: Too Many Requests". It is not related to my code; the other failure simply timed out.
@Borda @SkafteNicki let me know how to proceed.

@SkafteNicki
Collaborator

The link-check failure (generated/CONTRIBUTING: line 6) is a broken link, https://medium.com/pytorch-lightning/quick-contribution-guide-86d977171b3a, returning "429 Client Error: Too Many Requests". It is not related to my code; the other failure simply timed out.
@Borda @SkafteNicki let me know how to proceed.

Our CI is broken at the moment; there is nothing you can do. Please stand by while it is being fixed.

@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch 2 times, most recently from 059625b to 7672de3 on October 1, 2025 12:48
@littlebullGit
Contributor Author

@SkafteNicki @justusschock can we try CI again? The last run looks fine to me.

@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch from ecfbb57 to a1250f7 on October 17, 2025 21:05
@github-actions github-actions bot added the fabric lightning.fabric.Fabric label Oct 17, 2025
@codecov

codecov bot commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 84.37500% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (b15d394) to head (c3d7394).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (b15d394) and HEAD (c3d7394). Click for more details.

HEAD has 575 uploads less than BASE
Flag               BASE (b15d394)   HEAD (c3d7394)
cpu                158              27
python3.10         35               6
lightning_fabric   35               0
pytest             80               0
lightning          87               15
python3.12.7       54               9
python3.11         34               6
python3.12         17               3
python             18               3
pytorch2.2.2       8                3
pytest-full        78               27
pytorch2.1         17               6
pytorch_lightning  36               12
pytorch2.5.1       9                3
pytorch2.7         9                3
pytorch2.4.1       9                3
pytorch2.6         9                3
pytorch2.8         9                3
pytorch2.3         8                3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21239     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         269      266      -3     
  Lines       23744    23720     -24     
=========================================
- Hits        20572    18697   -1875     
- Misses       3172     5023   +1851     

@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch 3 times, most recently from 49cc37d to 69819dc on October 18, 2025 04:10
@deependujha
Collaborator

deependujha commented Oct 18, 2025

Hey @littlebullGit, thanks for looking into the port issue! I was digging into the same thing, mentioned here.

The changes in this PR don’t fully overlap, so instead of merging directly, it might be cleaner to open a separate PR for this fix if that works for you.

update - 1:

Borrowed the code into #21195 to see if it works. Thanks for the great work.

update - 2:

Doesn't seem to work.

@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch 5 times, most recently from 2e1d277 to 5f2079f on October 18, 2025 23:02
@littlebullGit
Contributor Author

@deependujha I finally got it fixed; all checks pass for this PR.
@SkafteNicki @Borda @justusschock My main fix here is the ModelCheckpoint behavior. However, the build kept running into an unrelated Address in Use error, so I took a deeper look and believe I fixed that issue as well. Below is my write-up. I cannot find an issue directly related to it; #2537 seems related, but it is very old. Let me know if you want me to open a separate PR or keep the fix in this one.

Port Management Fix for EADDRINUSE Errors

Problem

Distributed tests were failing in CI with EADDRINUSE (address already in use) errors because:

  1. OS TIME_WAIT state: When tests release ports, the OS keeps them reserved for 30-120 seconds
  2. Immediate reuse: Test suite immediately tried to reuse the same ports → collision with TIME_WAIT
  3. Untracked spawn ports: Multi-process tests (DDP spawn) set their own MASTER_PORT values that were never tracked
  4. No cleanup mechanism: Released ports had no protection against immediate reallocation

Solution: Two-Layer Defense

Layer 1: Proactive Prevention (Deque)

1. Queue-Based Port Reuse Prevention

  • File: src/lightning/fabric/utilities/port_manager.py
  • What: _recently_released deque with 1024 slots
  • How:
    # On release:
    self._recently_released.append(port)
    
    # On allocate:
    if port not in self._recently_released:
        return port  # Safe to use
  • Why: Prevents rapid reuse of ports within our process, giving TIME_WAIT sufficient time to clear

2. OS-Based Port Allocation

  • File: src/lightning/fabric/utilities/port_manager.py
  • What: Use OS's bind(("", 0)) to pick ports, check against our deque
  • How:
    for attempt in range(max_attempts):
        port = self._find_free_port()  # OS picks port
        
        # Check our tracking structures
        if port not in self._allocated_ports and port not in self._recently_released:
            self._allocated_ports.add(port)
            return port
  • Why: The OS naturally avoids TIME_WAIT ports (without SO_REUSEADDR); we just prevent intra-process reuse. A combined sketch of both mechanisms follows below.
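
Putting the two mechanisms together, here is a minimal self-contained sketch of the allocation logic described above (names mirror the write-up, but this is an illustration, not the exact PR code):

    import socket
    import threading
    from collections import deque

    class PortManager:
        """Tracks allocated ports and quarantines recently released ones."""

        def __init__(self, history: int = 1024):
            self._lock = threading.Lock()
            self._allocated_ports = set()
            # Bounded history: the oldest entries fall out automatically.
            self._recently_released = deque(maxlen=history)

        @staticmethod
        def _find_free_port() -> int:
            # Ask the OS for an ephemeral port; without SO_REUSEADDR the OS
            # naturally skips ports sitting in TIME_WAIT.
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("", 0))
                return s.getsockname()[1]

        def allocate(self, max_attempts: int = 100) -> int:
            with self._lock:
                for _ in range(max_attempts):
                    port = self._find_free_port()
                    # Reject ports we still hold or released too recently.
                    if port not in self._allocated_ports and port not in self._recently_released:
                        self._allocated_ports.add(port)
                        return port
                raise RuntimeError(f"No free port found after {max_attempts} attempts")

        def release(self, port: int) -> None:
            with self._lock:
                self._allocated_ports.discard(port)
                self._recently_released.append(port)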

Layer 2: Reactive Recovery (Retry Logic)

3. Automatic Retry on EADDRINUSE

  • Files: tests/tests_fabric/conftest.py, tests/tests_pytorch/conftest.py
  • What: Added pytest_runtest_makereport() hook that catches EADDRINUSE errors
  • How:
    def pytest_runtest_makereport(item, call):
        # Simplified sketch: call.excinfo is None when the test passed
        if call.excinfo is not None and "EADDRINUSE" in str(call.excinfo.value):
            if retry_count < 3:
                manager.release_all()  # Clear allocation state
                time.sleep(0.5)        # Let TIME_WAIT ports settle
                raise Rerun()          # Trigger a rerun (conftest detail)
  • Why: Handles the rare race window where a port enters TIME_WAIT between our check and TCPStore binding
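
The hook above is deliberately simplified (retry_count, manager, and Rerun stand in for conftest plumbing). The recovery logic itself boils down to this runnable, pytest-free sketch, assuming a port_factory callable such as PortManager.allocate from the sketch above:

    import errno
    import socket
    import time

    def bind_with_retry(port_factory, retries: int = 3, backoff: float = 0.5) -> socket.socket:
        """Bind a listening socket, retrying with a fresh port on EADDRINUSE."""
        last_error = None
        for _ in range(retries):
            port = port_factory()
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                sock.bind(("", port))
                return sock  # caller is responsible for closing it
            except OSError as exc:
                sock.close()
                if exc.errno != errno.EADDRINUSE:
                    raise
                last_error = exc
                # The port entered TIME_WAIT between allocation and bind; let it settle.
                time.sleep(backoff)
        raise last_error  # exhausted retries; re-raise the last EADDRINUSE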

Supporting Changes

4. Reserve Externally Assigned Ports

  • File: src/lightning/fabric/utilities/port_manager.py
  • What: Added reserve_existing_port() method
  • Why: Tracks ports assigned by external launchers (e.g., torch.distributed.spawn)
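
Continuing the PortManager sketch above, such a method can be as small as this (the bool return and range check are illustrative choices, not necessarily the PR's exact signature):

    def reserve_existing_port(self, port: int) -> bool:
        """Record a port bound by an external launcher so we never hand it out."""
        if not (1024 <= port <= 65535):
            return False  # reject privileged/invalid ports
        with self._lock:
            self._allocated_ports.add(port)
            return True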

5. Check for Pre-Assigned MASTER_PORT

  • File: src/lightning/fabric/plugins/environments/lightning.py
  • What: Modified find_free_network_port() to check MASTER_PORT environment variable
  • Why: Prevents conflicts when external launchers pre-assign ports
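
A sketch of that check (the real find_free_network_port lives in lightning.fabric; _port_manager here is an assumed module-level instance of the sketch above):

    import os

    def find_free_network_port() -> int:
        # Honor a MASTER_PORT that an external launcher already exported,
        # instead of allocating a new port that could collide with it.
        existing = os.environ.get("MASTER_PORT")
        if existing is not None and existing.isdigit():
            port = int(existing)
            _port_manager.reserve_existing_port(port)
            return port
        return _port_manager.allocate()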

6. Test Fixture Cleanup

  • Files: tests/tests_fabric/conftest.py, tests/tests_pytorch/conftest.py
  • What: Added post-test MASTER_PORT capture and cleanup
  • Why: Tracks ports set by spawned child processes that parent doesn't know about
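
A minimal sketch of such a fixture (again assuming the _port_manager instance from above; details are illustrative):

    import os
    import pytest

    @pytest.fixture(autouse=True)
    def track_master_port():
        before = os.environ.get("MASTER_PORT")
        yield
        after = os.environ.get("MASTER_PORT")
        # A spawned child (or the test body) may have set MASTER_PORT; record
        # it so the manager avoids it, then restore the original environment.
        if after is not None and after != before and after.isdigit():
            _port_manager.reserve_existing_port(int(after))
        if before is None:
            os.environ.pop("MASTER_PORT", None)
        else:
            os.environ["MASTER_PORT"] = before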

Test Coverage

  • Added 3 tests for queue-based reuse prevention
  • Added 1 test for environment teardown
  • Updated 1 test for new allocation approach
  • Total: 45 tests passing in port_manager suite

Key Insights from Development

Why Both Layers Are Needed

Deque alone isn't enough:

  • If we allocate >1024 ports during the TIME_WAIT window (30-120 s), we cycle back to ports that are still in TIME_WAIT
  • Example: In heavy CI load, 1024 ports might be exhausted in <30s

Retry alone isn't enough:

  • Would constantly retry on every rapid reuse scenario
  • Wasteful and slow
  • Doesn't prevent the problem, just masks it

Together they work:

  • Deque prevents 99% of conflicts (intra-process reuse)
  • Retry handles the 1% edge case (race window)

Impact

  • ✅ Eliminates EADDRINUSE errors in distributed test suites (DDP, FSDP, spawn-based tests)
  • ✅ CI now passes on all port-sensitive tests
  • ✅ Addresses the root cause, not just the symptoms
  • ✅ Graceful degradation via retry logic for unpredictable scenarios
  • ✅ No user-facing changes; entirely internal to test infrastructure

Files Changed

Core Implementation

  • src/lightning/fabric/utilities/port_manager.py - Port allocation logic
  • src/lightning/fabric/plugins/environments/lightning.py - Environment port handling

Test Infrastructure

  • tests/tests_fabric/conftest.py - Fabric test retry logic
  • tests/tests_pytorch/conftest.py - PyTorch test retry logic
  • tests/tests_fabric/utilities/test_port_manager.py - Port manager tests

Commit

Branch: fix/20947-checkpoint-manual-opt
Commit: 5f2079f42

@Borda
Collaborator

Borda commented Oct 22, 2025

However, the build kept running into an unrelated Address in Use error, so I took a deeper look and believe I fixed that issue as well. Below is my write-up. I cannot find an issue directly related to it.

Let's move it to its own PR, for easier tracking and any further development 🦩

@littlebullGit
Contributor Author

EADDRINUSE
#21309
Please see the PR above. I will rebase this PR once that one is merged. Thanks.

@Borda
Collaborator

Borda commented Oct 23, 2025

#21309

merged, pls resolve conflicts :)

- Ensure checkpoints reflect the model state before optimization when using manual optimization
- Add warning when pre-optimization state isn't saved
- Update documentation to clarify the behavior with manual optimization

Fixes Lightning-AI#20947
@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch from 5f2079f to 3a8aa9f on October 23, 2025 20:14
@littlebullGit
Contributor Author

#21309

merged, pls resolve conflicts :)

@Borda, it has been a while. Are there any issues with this PR?
