report scuba events for detailed sparse static memory info #5029
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2042
Add granular sparse static memory breakdown metrics for TBE to enable validation of planner estimates against runtime memory usage. This implementation separates static sparse memory (weights, optimizer states, cache) from ephemeral memory (activations, IO buffers, gradients) and provides per-component HBM/UVM categorization. The existing `tbe.total_hbm_usage` metric aggregates all memory without distinguishing persistent storage from ephemeral buffers, making it difficult to identify and validate static sparse parameter estimates.

Changes
1. New Scuba Metrics (`tbe_stats_reporters.py`)

Added 10 granular memory metrics to `SyncBatchODSStatsReporter`.

HBM metrics:
- `tbe.hbm.sparse_params` - embedding weights in HBM
- `tbe.hbm.optimizer_states` - momentum states in HBM
- `tbe.hbm.cache` - cache storage in HBM
- `tbe.hbm.total_static_sparse` - total static memory in HBM
- `tbe.hbm.ephemeral` - ephemeral memory in HBM (activations, temp buffers, etc.)

UVM metrics: same structure for UVM.
2. Memory Categorization Logic (`split_table_batched_embeddings_ops_training.py`)

- `_get_tensor_memory()` - get a tensor's memory size
- `_categorize_memory_by_location()` - categorize tensors into HBM/UVM
- `_report_tbe_mem_usage()` - refactored with clean list-based tensor grouping

3. Memory Components
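The categorization step splits each tracked tensor's bytes into an HBM bucket or a UVM bucket. A self-contained sketch of that logic, using a plain-Python stand-in for tensors (the field names, the UVM-detection rule, and the helper signatures are assumptions, not the actual FBGEMM implementation):

```python
# Hypothetical sketch of the HBM/UVM categorization described above.
# The real helpers operate on torch.Tensors; a lightweight stand-in is
# used here so the logic is runnable without torch.

from dataclasses import dataclass

@dataclass
class FakeTensor:
    nbytes: int    # memory footprint in bytes
    is_cuda: bool  # True if the storage lives on the GPU
    is_uvm: bool   # True if the storage is UVM (managed) memory

def get_tensor_memory(t) -> int:
    """Return the tensor's memory size in bytes (0 for None)."""
    return t.nbytes if t is not None else 0

def categorize_memory_by_location(tensors):
    """Split a list of tensors into (hbm_bytes, uvm_bytes) totals."""
    hbm, uvm = 0, 0
    for t in tensors:
        if t is None:
            continue
        if t.is_uvm:
            uvm += get_tensor_memory(t)
        elif t.is_cuda:
            hbm += get_tensor_memory(t)
    return hbm, uvm
```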
Static Sparse:
- `weights_dev`, `weights_host`, `weights_uvm`
- `momentum1_dev/host/uvm`, `momentum2_dev/host/uvm`
- `lxu_cache_weights`, `lxu_cache_state`, `lxu_state`, cache aux data

Ephemeral (calculated):
`ephemeral = total_mem_usage - static_sparse`

Detailed analysis revealed a QPS drop when enabling the additional logging. A head-to-head comparison of the time it takes to do the logging shows a 4x increase in duration (see https://fburl.com/scuba/tbe_stats_runtime/y85ur4k9).
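The ephemeral calculation above is a simple subtraction of the statically tracked components from the total usage counter. A minimal sketch (the function name and the clamp-at-zero guard are assumptions, added because the two counters may be sampled at slightly different times):

```python
# Sketch of the ephemeral-memory derivation: ephemeral memory is
# whatever the total usage counter reports beyond the statically
# tracked sparse components. Clamping at 0 is an assumed guard, not
# necessarily part of the actual implementation.

def compute_ephemeral(total_mem_usage: int, static_sparse: int) -> int:
    """Return ephemeral bytes: total minus static sparse, floored at 0."""
    return max(0, total_mem_usage - static_sparse)
```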
Ran the following models without the added logging:
Ran the following models with the added logging:
Differential Revision: D84624978