- 
                Notifications
    You must be signed in to change notification settings 
- Fork 560
Ensuring PJRT Client destroy/destructor is called #9675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
          
     Open
      
      
            saarthak-aws
  wants to merge
  2
  commits into
  pytorch:master
  
    
      
        
          
  
    
      Choose a base branch
      
     
    
      
        
      
      
        
          
          
        
        
          
            
              
              
              
  
           
        
        
          
            
              
              
           
        
       
     
  
        
          
            
          
            
          
        
       
    
      
from
saarthak-aws:pjrt-client-not-destroyed
  
      
      
   
  
    
  
  
  
 
  
      
    base: master
Could not load branches
            
              
  
    Branch not found: {{ refName }}
  
            
                
      Loading
              
            Could not load tags
            
            
              Nothing to show
            
              
  
            
                
      Loading
              
            Are you sure you want to change the base?
            Some commits from the old base branch may be removed from the timeline,
            and old review comments may become outdated.
          
          
  
     Open
                    Changes from all commits
      Commits
    
    
            Show all changes
          
          
            2 commits
          
        
        Select commit
          Hold shift + click to select a range
      
      
    File filter
Filter by extension
Conversations
          Failed to load comments.   
        
        
          
      Loading
        
  Jump to
        
          Jump to file
        
      
      
          Failed to load files.   
        
        
          
      Loading
        
  Diff view
Diff view
There are no files selected for viewing
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
              
      
      Oops, something went wrong.
        
    
  
  Add this suggestion to a batch that can be applied as a single commit.
  This suggestion is invalid because no changes were made to the code.
  Suggestions cannot be applied while the pull request is closed.
  Suggestions cannot be applied while viewing a subset of changes.
  Only one suggestion per line can be applied in a batch.
  Add this suggestion to a batch that can be applied as a single commit.
  Applying suggestions on deleted lines is not supported.
  You must change the existing code in this line in order to create a valid suggestion.
  Outdated suggestions cannot be applied.
  This suggestion has been applied or marked resolved.
  Suggestions cannot be applied from pending reviews.
  Suggestions cannot be applied on multi-line comments.
  Suggestions cannot be applied while the pull request is queued to merge.
  Suggestion cannot be applied right now. Please check back later.
  
    
  
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This violates Google's C++ style guide: https://google.github.io/styleguide/cppguide.html#Static_and_Global_Variables
For singleton objects, we deliberately do not want their destructors to be called, as that can lead to race condition at program exit time.
I'm not sure what this PR is trying to achieve. Could you clarify why you want to sure that the PjRt client dtor is called? Usually we don't destroy the singleton objects - we just let the OS reclaim the resources when the process terminates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhanyong-wan thanks for the feedback. Could you give an example of the race condition you mentioned and why it was not addressed until v2.8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The style guide I mentioned noted: "When destructors are trivial, their execution is not subject to ordering at all (they are effectively not "run"); otherwise we are exposed to the risk of accessing objects after the end of their lifetime. Therefore, we only allow objects with static storage duration if they are trivially destructible. Fundamental types (like pointers and int) are trivially destructible, as are arrays of trivially destructible types."
For example, at program exit time there could be long-running threads accessing global variables. If a global variable is destructed, such access is undefined behavior.
As to why it wasn't addressed until v2.8, I don't know the history, but my guess is that we just noticed the potential race and decided to fix it.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PR #9384, we introduced StatusOr<T> for error handling, which can be trivially destructible when T is trivially destructible. However, looking at PjrtComputationClient's implementation with its explicit destructor and member variables, it appears to not be trivially destructible. Could you shed some light on why we think PjrtComputationClient could be trivially destructible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rajkthakur ,
StatusOr<T>is not trivially destructible, regardless of whetherTis trivially destructible.PjrtComputationClientis not trivially destructible and not meant to be. I don't understand what you mean by "we think PjrtComputationClient could be trivially destructible".There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying. It seems a bug that neuron hangs sometimes if the clean-up is left to the OS. My suggestion would be to root cause and fix that bug.
Re: the shutdown approach, I don't think we can count on no further access to client_ after the atexit hook is called. The whole point of Google's policy on global variable destruction is that there can be long-running threads after the exit hook is called. Think about the case where someone starts a computation in a long-running thread and then exit. The thread is never joined and thus may still access client_ after the program exit hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we investigate why leaving cleanup to the OS leaves Neuron backend in a bad state, do you have any thoughts on what would be the correct approach for implementing the Shutdown method?
We would have to leave the client_ accessible after we have destroyed the actual xla::PjRtClient (since destruction ends up calling PJRT_Client_Destroy). One way I can think of doing so is to switch to a stub implementation of _client at this point, so that long running threads can access _client, but they would get some default behavior. Is that the right approach/pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best course of action is to fix the hang, as implementing Shutdown correctly adds significant complexity to the design.
That said, here's how Shutdown should work if done correctly: it should allow in-flight computation that needs the client to finish, and it should let new computation (if any) that wants to use the client fail to get the client. This means we'll likely need to use a shared_ptr to hold the client (so that in-flight computation can extend its lifespan).
As you can see, this is doable but not trivial. Hence my advice to avoid it.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a shorter shutdown function could be something like:
PjRtClientintoxla::PjRtCApiClientPJRT_Client_Destroy(@zhanyong-wan what do you think?)
Notes:
PJRT_Client_Destroywas already called for the givenPJRT_Client, erroring out otherwise. Else, we are going to get UB.PjRtClient, this will probably need to be added as a new virtual function ofComputationClientWith all that said, I believe the best solution would be to figure out what exactly is causing the hanging problem. On the other hand, it feels like not calling
PJRT_Client_Destroyis a bug from the perspective of PJRT semantics (couldn't really find anywhere).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ysiraichi , let's avoid the hack and fix the hang. It adds significant complexity if different devices have different shutdown logic.