
Conversation

@littlebullGit
Contributor

@littlebullGit littlebullGit commented Sep 22, 2025

This PR enhances the ModelCheckpoint callback to properly handle manual optimization scenarios, ensuring checkpoints reflect the intended model state and providing clear user guidance.

Fixes #20947

Key Changes:

  • Manual Optimization Support: Ensures checkpoints capture the model state before optimization when using manual optimization with every_n_train_steps (see the sketch after this list).
  • User Warning: Adds a clear warning when pre-optimization state isn't saved, helping users understand the checkpoint behavior.
  • Documentation: Updates docstrings and examples to clarify the behavior with manual optimization.
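
For context, here is a minimal sketch of the scenario this PR addresses; the module and the values are illustrative, not code from the PR:

    import torch
    from torch import nn
    from lightning.pytorch import LightningModule, Trainer
    from lightning.pytorch.callbacks import ModelCheckpoint

    class ManualOptModule(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 2)
            # Opt out of automatic optimization: we step the optimizer ourselves.
            self.automatic_optimization = False

        def training_step(self, batch, batch_idx):
            opt = self.optimizers()
            loss = self.layer(batch).sum()
            opt.zero_grad()
            self.manual_backward(loss)
            opt.step()  # weights change here, inside training_step
            return loss

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    # With every_n_train_steps, the checkpoint fires around training_step;
    # this PR ensures the saved weights are the intended (pre-optimization) state.
    checkpoint = ModelCheckpoint(every_n_train_steps=100)
    trainer = Trainer(callbacks=[checkpoint], max_steps=500)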

Testing:

  • Added test cases to verify checkpoint behavior with manual optimization
  • Ensured backward compatibility with automatic optimization
  • Verified warning messages are shown in appropriate scenarios

📚 Documentation preview 📚: https://pytorch-lightning--21239.org.readthedocs.build/en/21239/

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Sep 22, 2025
@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch from 4f06495 to 16552e5 on September 22, 2025 03:53
@Borda Borda changed the title Fix ModelCheckpoint with manual optimization and every_n_train_steps Sep 22, 2025
@littlebullGit littlebullGit reopened this Sep 22, 2025
@littlebullGit
Contributor Author

littlebullGit commented Sep 23, 2025

The link-check failure (generated/CONTRIBUTING: line 6) is a broken link, https://medium.com/pytorch-lightning/quick-contribution-guide-86d977171b3a, returning "429 Client Error: Too Many Requests". It is not related to my code; the other failure simply timed out.
@Borda @SkafteNicki let me know how to proceed.

@SkafteNicki
Collaborator

The link-check failure (generated/CONTRIBUTING: line 6) is a broken link, https://medium.com/pytorch-lightning/quick-contribution-guide-86d977171b3a, returning "429 Client Error: Too Many Requests". It is not related to my code; the other failure simply timed out.
@Borda @SkafteNicki let me know how to proceed.

Our CI is broken at the moment; there is nothing you can do. Please stand by while it is being fixed.

@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch 2 times, most recently from 059625b to 7672de3 on October 1, 2025 12:48
@littlebullGit
Contributor Author

@SkafteNicki @justusschock can we try CI again? The last run looks fine to me.

@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch from ecfbb57 to a1250f7 on October 17, 2025 21:05
@github-actions github-actions bot added the fabric lightning.fabric.Fabric label Oct 17, 2025
@codecov

codecov bot commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 84.37500% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (b15d394) to head (c3d7394).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (b15d394) and HEAD (c3d7394). Click for more details.

HEAD has 575 uploads less than BASE
Flag               BASE (b15d394)   HEAD (c3d7394)
cpu                158              27
python3.10         35               6
lightning_fabric   35               0
pytest             80               0
lightning          87               15
python3.12.7       54               9
python3.11         34               6
python3.12         17               3
python             18               3
pytorch2.2.2       8                3
pytest-full        78               27
pytorch2.1         17               6
pytorch_lightning  36               12
pytorch2.5.1       9                3
pytorch2.7         9                3
pytorch2.4.1       9                3
pytorch2.6         9                3
pytorch2.8         9                3
pytorch2.3         8                3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21239     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         269      266      -3     
  Lines       23744    23720     -24     
=========================================
- Hits        20572    18697   -1875     
- Misses       3172     5023   +1851     

@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch 3 times, most recently from 49cc37d to 69819dc on October 18, 2025 04:10
@deependujha
Collaborator

deependujha commented Oct 18, 2025

Hey @littlebullGit, thanks for looking into the port issue! I was digging into the same thing, mentioned here.

The changes in this PR don’t fully overlap, so instead of merging directly, it might be cleaner to open a separate PR for this fix if that works for you.

update - 1:

Borrowed the code into #21195 to see if it works. Thanks for the great work.

update - 2:

Doesn't seem to work.

@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch 5 times, most recently from 2e1d277 to 5f2079f on October 18, 2025 23:02
@littlebullGit
Contributor Author

@deependujha I finally got it fixed; all checks pass for this PR.
@SkafteNicki @Borda @justusschock My main fix here is the ModelCheckpoint behavior. However, the build kept running into an unrelated Address in Use error, so I took a deeper look and believe I fixed that issue as well. Below is my write-up. I cannot find an issue directly related to it; #2537 seems related, but it is very old. Let me know if you want me to open a separate PR or keep the fix in this one.

Port Management Fix for EADDRINUSE Errors

Problem

Distributed tests were failing in CI with EADDRINUSE (address already in use) errors because:

  1. OS TIME_WAIT state: When tests release ports, the OS keeps them reserved for 30-120 seconds
  2. Immediate reuse: Test suite immediately tried to reuse the same ports → collision with TIME_WAIT
  3. Untracked spawn ports: Multi-process tests (DDP spawn) set their own MASTER_PORT values that were never tracked
  4. No cleanup mechanism: Released ports had no protection against immediate reallocation

Solution: Two-Layer Defense

Layer 1: Proactive Prevention (Deque)

1. Queue-Based Port Reuse Prevention

  • File: src/lightning/fabric/utilities/port_manager.py
  • What: _recently_released deque with 1024 slots
  • How:
    # On release:
    self._recently_released.append(port)
    
    # On allocate:
    if port not in self._recently_released:
        return port  # Safe to use
  • Why: Prevents rapid reuse of ports within our process, giving TIME_WAIT sufficient time to clear

2. OS-Based Port Allocation

  • File: src/lightning/fabric/utilities/port_manager.py
  • What: Use OS's bind(("", 0)) to pick ports, check against our deque
  • How:
    for attempt in range(max_attempts):
        port = self._find_free_port()  # OS picks port
        
        # Check our tracking structures
        if port not in self._allocated_ports and port not in self._recently_released:
            self._allocated_ports.add(port)
            return port
  • Why: The OS naturally avoids TIME_WAIT ports (without SO_REUSEADDR); we just prevent intra-process reuse. A combined sketch of both mechanisms follows below.
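
Putting the two mechanisms together, here is a minimal self-contained sketch of the allocation logic described above (names mirror the write-up, but this is an illustration, not the exact PR code):

    import socket
    import threading
    from collections import deque

    class PortManager:
        """Tracks allocated ports and quarantines recently released ones."""

        def __init__(self, history: int = 1024):
            self._lock = threading.Lock()
            self._allocated_ports = set()
            # Bounded history: the oldest entries fall out automatically.
            self._recently_released = deque(maxlen=history)

        @staticmethod
        def _find_free_port() -> int:
            # Ask the OS for an ephemeral port; without SO_REUSEADDR the OS
            # naturally skips ports sitting in TIME_WAIT.
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("", 0))
                return s.getsockname()[1]

        def allocate(self, max_attempts: int = 100) -> int:
            with self._lock:
                for _ in range(max_attempts):
                    port = self._find_free_port()
                    # Reject ports we still hold or released too recently.
                    if port not in self._allocated_ports and port not in self._recently_released:
                        self._allocated_ports.add(port)
                        return port
                raise RuntimeError(f"No free port found after {max_attempts} attempts")

        def release(self, port: int) -> None:
            with self._lock:
                self._allocated_ports.discard(port)
                self._recently_released.append(port)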

Layer 2: Reactive Recovery (Retry Logic)

3. Automatic Retry on EADDRINUSE

  • Files: tests/tests_fabric/conftest.py, tests/tests_pytorch/conftest.py
  • What: Added pytest_runtest_makereport() hook that catches EADDRINUSE errors
  • How:
    def pytest_runtest_makereport(item, call):
        # Simplified sketch: call.excinfo is None when the test passed
        if call.excinfo is not None and "EADDRINUSE" in str(call.excinfo.value):
            if retry_count < 3:
                manager.release_all()  # Clear allocation state
                time.sleep(0.5)        # Let TIME_WAIT ports settle
                raise Rerun()          # Trigger a rerun (conftest detail)
  • Why: Handles the rare race window where a port enters TIME_WAIT between our check and TCPStore binding
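
The hook above is deliberately simplified (retry_count, manager, and Rerun stand in for conftest plumbing). The recovery logic itself boils down to this runnable, pytest-free sketch, assuming a port_factory callable such as PortManager.allocate from the sketch above:

    import errno
    import socket
    import time

    def bind_with_retry(port_factory, retries: int = 3, backoff: float = 0.5) -> socket.socket:
        """Bind a listening socket, retrying with a fresh port on EADDRINUSE."""
        last_error = None
        for _ in range(retries):
            port = port_factory()
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                sock.bind(("", port))
                return sock  # caller is responsible for closing it
            except OSError as exc:
                sock.close()
                if exc.errno != errno.EADDRINUSE:
                    raise
                last_error = exc
                # The port entered TIME_WAIT between allocation and bind; let it settle.
                time.sleep(backoff)
        raise last_error  # exhausted retries; re-raise the last EADDRINUSE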

Supporting Changes

4. Reserve Externally Assigned Ports

  • File: src/lightning/fabric/utilities/port_manager.py
  • What: Added reserve_existing_port() method
  • Why: Tracks ports assigned by external launchers (e.g., torch.distributed.spawn)
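
Continuing the PortManager sketch above, such a method can be as small as this (the bool return and range check are illustrative choices, not necessarily the PR's exact signature):

    def reserve_existing_port(self, port: int) -> bool:
        """Record a port bound by an external launcher so we never hand it out."""
        if not (1024 <= port <= 65535):
            return False  # reject privileged/invalid ports
        with self._lock:
            self._allocated_ports.add(port)
            return True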

5. Check for Pre-Assigned MASTER_PORT

  • File: src/lightning/fabric/plugins/environments/lightning.py
  • What: Modified find_free_network_port() to check MASTER_PORT environment variable
  • Why: Prevents conflicts when external launchers pre-assign ports
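
A sketch of that check (the real find_free_network_port lives in lightning.fabric; _port_manager here is an assumed module-level instance of the sketch above):

    import os

    def find_free_network_port() -> int:
        # Honor a MASTER_PORT that an external launcher already exported,
        # instead of allocating a new port that could collide with it.
        existing = os.environ.get("MASTER_PORT")
        if existing is not None and existing.isdigit():
            port = int(existing)
            _port_manager.reserve_existing_port(port)
            return port
        return _port_manager.allocate()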

6. Test Fixture Cleanup

  • Files: tests/tests_fabric/conftest.py, tests/tests_pytorch/conftest.py
  • What: Added post-test MASTER_PORT capture and cleanup
  • Why: Tracks ports set by spawned child processes that parent doesn't know about
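
A minimal sketch of such a fixture (again assuming the _port_manager instance from above; details are illustrative):

    import os
    import pytest

    @pytest.fixture(autouse=True)
    def track_master_port():
        before = os.environ.get("MASTER_PORT")
        yield
        after = os.environ.get("MASTER_PORT")
        # A spawned child (or the test body) may have set MASTER_PORT; record
        # it so the manager avoids it, then restore the original environment.
        if after is not None and after != before and after.isdigit():
            _port_manager.reserve_existing_port(int(after))
        if before is None:
            os.environ.pop("MASTER_PORT", None)
        else:
            os.environ["MASTER_PORT"] = before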

Test Coverage

  • Added 3 tests for queue-based reuse prevention
  • Added 1 test for environment teardown
  • Updated 1 test for new allocation approach
  • Total: 45 tests passing in port_manager suite

Key Insights from Development

Why Both Layers Are Needed

Deque alone isn't enough:

  • If we allocate >1024 ports during the TIME_WAIT window (30-120 s), we cycle back to ports that are still in TIME_WAIT
  • Example: In heavy CI load, 1024 ports might be exhausted in <30s

Retry alone isn't enough:

  • Would constantly retry on every rapid reuse scenario
  • Wasteful and slow
  • Doesn't prevent the problem, just masks it

Together they work:

  • Deque prevents 99% of conflicts (intra-process reuse)
  • Retry handles the 1% edge case (race window)

Impact

  • ✅ Eliminates EADDRINUSE errors in distributed test suites (DDP, FSDP, spawn-based tests)
  • ✅ CI now passes on all port-sensitive tests
  • ✅ Addresses the root cause, not just the symptoms
  • ✅ Graceful degradation via retry logic for unpredictable scenarios
  • ✅ No user-facing changes; entirely internal to test infrastructure

Files Changed

Core Implementation

  • src/lightning/fabric/utilities/port_manager.py - Port allocation logic
  • src/lightning/fabric/plugins/environments/lightning.py - Environment port handling

Test Infrastructure

  • tests/tests_fabric/conftest.py - Fabric test retry logic
  • tests/tests_pytorch/conftest.py - PyTorch test retry logic
  • tests/tests_fabric/utilities/test_port_manager.py - Port manager tests

Commit

Branch: fix/20947-checkpoint-manual-opt
Commit: 5f2079f42

@Borda
Collaborator

Borda commented Oct 22, 2025

However, the build kept running into an unrelated Address in Use error, so I took a deeper look and believe I fixed that issue as well. Below is my write-up. I cannot find an issue directly related to it.

Let's move it to its own PR, for easier tracking and any further development 🦩

@littlebullGit
Contributor Author

EADDRINUSE
#21309
Please see the PR above. I will rebase this PR once that one is merged. Thanks.

@Borda
Collaborator

Borda commented Oct 23, 2025

#21309

merged, pls resolve conflicts :)

- Ensure checkpoints reflect the model state before optimization when using manual optimization
- Add warning when pre-optimization state isn't saved
- Update documentation to clarify the behavior with manual optimization

Fixes Lightning-AI#20947
@littlebullGit littlebullGit force-pushed the fix/20947-checkpoint-manual-opt branch from 5f2079f to 3a8aa9f on October 23, 2025 20:14
@littlebullGit
Contributor Author

#21309

merged, pls resolve conflicts :)

@Borda, it has been a while. Are there any issues with this PR?
