Fix ModelCheckpoint with manual optimization and every_n_train_steps
#21239
base: master
Force-pushed from 4f06495 to 16552e5.
The link-check error is: (generated/CONTRIBUTING: line 6) broken https://medium.com/pytorch-lightning/quick-contribution-guide-86d977171b3a - 429 Client Error: Too Many Requests for url: https://medium.com/pytorch-lightning/quick-contribution-guide-86d977171b3a. It is not related to my code. The other failure is just a timeout.
Our CI is broken at the moment; there is nothing you can do. Please stand by while it is being fixed.
Force-pushed from 059625b to 7672de3.
@SkafteNicki @justusschock can we try CI again? The last run seems fine to me.
Force-pushed from ecfbb57 to a1250f7.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21239     +/-   ##
=========================================
- Coverage      87%      79%     -8%
=========================================
  Files         269      266      -3
  Lines       23744    23720     -24
=========================================
- Hits        20572    18697   -1875
- Misses       3172     5023   +1851
Force-pushed from 49cc37d to 69819dc.
Hey @littlebullGit, thanks for looking into the port issue! I was digging into the same thing, mentioned here. The changes in this PR don’t fully overlap, so instead of merging directly, it might be cleaner to open a separate PR for this fix, if that works for you.
I borrowed the code for #21195 to see if it works. Thanks for the great work.
It doesn't seem to work.
Force-pushed from 2e1d277 to 5f2079f.
@deependujha finally I got it fixed. All checks passed for this PR.

Port Management Fix for EADDRINUSE Errors

Problem
Distributed tests were failing in CI with

Solution: Two-Layer Defense

Layer 1: Proactive Prevention (Deque)
1. Queue-Based Port Reuse Prevention
2. OS-Based Port Allocation

Layer 2: Reactive Recovery (Retry Logic)
3. Automatic Retry on EADDRINUSE

Supporting Changes
4. Reserve Externally Assigned Ports
5. Check for Pre-Assigned MASTER_PORT
6. Test Fixture Cleanup

Test Coverage

Key Insights from Development
Why Both Layers Are Needed
Deque alone isn't enough:
Retry alone isn't enough:
Together they work:

Impact

Files Changed
Core Implementation
Test Infrastructure

Commit
Branch:
Let's move it to its own PR so it can be tracked later and developed further if needed 🦩
merged, pls resolve conflicts :)
- Ensure checkpoints reflect the model state before optimization when using manual optimization
- Add warning when pre-optimization state isn't saved
- Update documentation to clarify the behavior with manual optimization

Fixes Lightning-AI#20947
Force-pushed from 5f2079f to 3a8aa9f.
This PR enhances the ModelCheckpoint callback to properly handle manual optimization scenarios, ensuring checkpoints reflect the intended model state and providing clear user guidance.
Fixes #20947
Key Changes:
every_n_train_steps.
Testing:
📚 Documentation preview 📚: https://pytorch-lightning--21239.org.readthedocs.build/en/21239/
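
To illustrate the behavior the description refers to, here is a dependency-free toy sketch, not Lightning code. It assumes that under manual optimization the training step both computes the loss and applies the update, so a checkpoint written afterwards no longer matches the loss that was logged. The names `ToyModel`, `manual_training_step`, and `train` are hypothetical; only `every_n_train_steps` is taken from the PR title.

```python
import copy


class ToyModel:
    """Stand-in for a model: a single weight fit to minimize (w - target)^2."""

    def __init__(self):
        self.state = {"w": 0.0}


def manual_training_step(model, target=1.0, lr=0.1):
    """Mimic a manual-optimization step: loss and update happen in one call."""
    w = model.state["w"]
    loss = (w - target) ** 2
    # Snapshot taken before the update, so it matches the loss computed above.
    pre_step_state = copy.deepcopy(model.state)
    # The "optimizer step": plain gradient descent on the squared error.
    model.state["w"] = w - lr * 2 * (w - target)
    return loss, pre_step_state


def train(num_steps, every_n_train_steps):
    """Run the toy loop and keep a checkpoint every N steps."""
    model = ToyModel()
    checkpoints = []
    for step in range(1, num_steps + 1):
        loss, pre_step_state = manual_training_step(model)
        if step % every_n_train_steps == 0:
            # Store the pre-optimization state, not the already-updated model.
            checkpoints.append({"step": step, "loss": loss, "state": pre_step_state})
    return model, checkpoints
```

In this toy, each stored state reproduces its recorded loss exactly, whereas the live model has already moved one update further by the time the checkpoint is written.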