Fix OSError: [Errno 24] Too many open files in multi-copy benchmark #5037
      
        
When running benchmarks with a large number of copies, the process may raise:
OSError: [Errno 24] Too many open files
Example command:
(fbgemm_gpu_env)$ ulimit -n 1048576
(fbgemm_gpu_env)$ python ./bench/tbe/tbe_inference_benchmark.py nbit-cpu \
    --num-embeddings=40000000 --bag-size=2 --embedding-dim=96 \
    --batch-size=162 --num-tables=8 --weights-precision=int4 \
    --output-dtype=fp32 --copies=96 --iters=30000
PyTorch multiprocessing provides two shared-memory strategies:
1. file_descriptor (default)
2. file_system
The default file_descriptor strategy uses file descriptors as shared-memory handles, so sharing many tensors across processes can leave a large number of FDs open.
If the number of open FDs exceeds the system limit and that limit cannot be raised, use the file_system strategy instead.
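To judge how close a process is to the descriptor limit, a small check along the following lines may help. This is only a sketch: it assumes a Unix-like system (for the `resource` module) and a Linux-style `/proc` filesystem for the open-descriptor count. The helper name `fd_headroom` is hypothetical and not part of this patch.

```python
import os
import resource


def fd_headroom():
    """Return the RLIMIT_NOFILE limits and, where available, the open-FD count.

    Hypothetical helper for illustration; not part of the benchmark code.
    """
    # (soft, hard) limits on the number of open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # /proc/self/fd is Linux-specific; report None where it does not exist.
    # The count is approximate: listing the directory briefly opens one FD.
    fd_dir = "/proc/self/fd"
    open_fds = len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None

    return {"soft": soft, "hard": hard, "open": open_fds}


if __name__ == "__main__":
    info = fd_headroom()
    print(f"soft={info['soft']} hard={info['hard']} open={info['open']}")
```

If the open count approaches the soft limit and the hard limit leaves no room to raise it, that is the situation in which the file_system strategy becomes the practical option.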
This patch allows switching to the file_system strategy by setting:
export PYTORCH_SHARE_STRATEGY='file_system'
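For illustration, an environment-variable switch of this kind could be wired up roughly as follows. This is a sketch under stated assumptions, not the code in this patch: the helper names `resolve_share_strategy` and `apply_share_strategy` are hypothetical, and the only PyTorch calls used are `torch.multiprocessing.get_all_sharing_strategies()` and `torch.multiprocessing.set_sharing_strategy()`.

```python
import os

# The two strategy names described in the PyTorch multiprocessing docs.
KNOWN_STRATEGIES = ("file_descriptor", "file_system")


def resolve_share_strategy(env=None):
    """Read PYTORCH_SHARE_STRATEGY from env; return None if unset or empty.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    value = env.get("PYTORCH_SHARE_STRATEGY", "")
    if not value:
        return None
    if value not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown sharing strategy: {value!r}")
    return value


def apply_share_strategy(env=None):
    """Apply the requested strategy, leaving PyTorch's default if none is set."""
    strategy = resolve_share_strategy(env)
    if strategy is None:
        return None

    # Imported lazily so that resolving the variable does not require torch.
    import torch.multiprocessing as mp

    # Not every platform supports every strategy, so check before switching.
    if strategy not in mp.get_all_sharing_strategies():
        raise ValueError(f"{strategy!r} is not supported on this platform")
    mp.set_sharing_strategy(strategy)
    return strategy
```

In a sketch like this, `apply_share_strategy()` would be called once at startup, before any worker processes are spawned or tensors are shared, so that all shared-memory handles use the same strategy.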
Reference:
https://pytorch.org/docs/stable/multiprocessing.html#sharing-strategies