
Conversation

@ashuaibi7


Summary:
X-link: facebookresearch/FBGEMM#2042

Add granular sparse static memory breakdown metrics for TBE to enable validation of planner estimates against runtime memory usage. This implementation separates static sparse memory (weights, optimizer states, cache) from ephemeral memory (activations, IO buffers, gradients) and provides per-component HBM/UVM categorization. The existing `tbe.total_hbm_usage` aggregates all memory without distinguishing between persistent storage and ephemeral buffers, making it difficult to identify and validate static sparse parameter estimates.

## Changes

### 1. New Scuba Metrics (`tbe_stats_reporters.py`)
Added 10 granular memory metrics to `SyncBatchODSStatsReporter`:

**HBM metrics:**
- `tbe.hbm.sparse_params` - Embedding weights in HBM
- `tbe.hbm.optimizer_states` - Momentum states in HBM
- `tbe.hbm.cache` - Cache storage in HBM
- `tbe.hbm.total_static_sparse` - Total static memory in HBM
- `tbe.hbm.ephemeral` - Ephemeral memory in HBM (activations, temp buffers, etc.)

**UVM metrics:** the same five metrics, reported under `tbe.uvm.*` (see the sketch below)
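
A minimal sketch of how these ten keys could be assembled and emitted; the dictionary layout and the `report_fn` callable are assumptions for illustration, not the actual `SyncBatchODSStatsReporter` interface:

```python
from typing import Callable, Dict

# The five per-location components described above; "ephemeral" is derived,
# the rest are static sparse storage.
MEMORY_COMPONENTS = (
    "sparse_params",
    "optimizer_states",
    "cache",
    "total_static_sparse",
    "ephemeral",
)

def emit_memory_metrics(
    breakdown: Dict[str, Dict[str, int]],   # {"hbm": {...}, "uvm": {...}}, in bytes
    report_fn: Callable[[str, int], None],  # stand-in for the reporter's emit path
) -> None:
    # Produces exactly the ten keys listed above: tbe.hbm.* and tbe.uvm.*
    for location in ("hbm", "uvm"):
        for component in MEMORY_COMPONENTS:
            report_fn(f"tbe.{location}.{component}", breakdown[location][component])
```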

### 2. Memory Categorization Logic (`split_table_batched_embeddings_ops_training.py`)
- Added helper methods:
  - `_get_tensor_memory()` - Get tensor memory size
  - `_categorize_memory_by_location()` - Categorize tensors into HBM/UVM
- Refactored `_report_tbe_mem_usage()` with clean list-based tensor grouping (sketched after this list)
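
A rough sketch of what the two helpers could look like, assuming a simple CUDA-vs-everything-else split for HBM/UVM; the actual implementations in the diff may detect host/UVM tensors differently:

```python
from typing import Dict, Iterable

import torch

def _get_tensor_memory(tensor: torch.Tensor) -> int:
    # Tensor storage size in bytes (0 for empty placeholder tensors).
    return tensor.numel() * tensor.element_size()

def _categorize_memory_by_location(tensors: Iterable[torch.Tensor]) -> Dict[str, int]:
    # Assumption for this sketch: CUDA-resident tensors count toward HBM,
    # host/managed tensors toward UVM. The real helper may use finer checks.
    totals = {"hbm": 0, "uvm": 0}
    for t in tensors:
        bucket = "hbm" if t.is_cuda else "uvm"
        totals[bucket] += _get_tensor_memory(t)
    return totals
```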

### 3. Memory Components
**Static Sparse:**
- Weights: `weights_dev`, `weights_host`, `weights_uvm`
- Optimizer: `momentum1_dev/host/uvm`, `momentum2_dev/host/uvm`
- Cache: `lxu_cache_weights`, `lxu_cache_state`, `lxu_state`, cache aux data

**Ephemeral (calculated):**
- `ephemeral = total_mem_usage - static_sparse`
- Includes IO buffers, activations, gradients (see the sketch below)
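
Putting the pieces together, a hypothetical sketch of the list-based grouping and the ephemeral calculation; `tbe` and `total_mem_usage` stand in for the module instance and the existing aggregate, and `_get_tensor_memory` is the helper sketched above:

```python
from typing import Dict

def compute_static_sparse_and_ephemeral(tbe, total_mem_usage: int) -> Dict[str, int]:
    # Group the persistent tensors by component, mirroring the lists above.
    weight_tensors = [tbe.weights_dev, tbe.weights_host, tbe.weights_uvm]
    optimizer_tensors = [
        tbe.momentum1_dev, tbe.momentum1_host, tbe.momentum1_uvm,
        tbe.momentum2_dev, tbe.momentum2_host, tbe.momentum2_uvm,
    ]
    cache_tensors = [tbe.lxu_cache_weights, tbe.lxu_cache_state, tbe.lxu_state]

    static_sparse = sum(
        _get_tensor_memory(t)
        for t in weight_tensors + optimizer_tensors + cache_tensors
    )
    # Ephemeral memory is whatever the aggregate counts beyond static sparse
    # storage: IO buffers, activations, gradients.
    return {
        "static_sparse": static_sparse,
        "ephemeral": total_mem_usage - static_sparse,
    }
```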



Detailed analysis revealed a QPS drop when the additional logging is enabled. A head-to-head comparison of the time spent doing the logging shows a ~4x increase in duration (see https://fburl.com/scuba/tbe_stats_runtime/y85ur4k9).

- Avg QPS across 4 runs without the added logging: 246k vs. with it: 243k (~1.2% QPS drop)

Ran the following models without the added logging:
- aps-icvrbase-tbe-dump-test-old-1-d234a33214
- aps-icvrbase-tbe-dump-test-old-2-f5d7f5d97a
- aps-icvrbase-tbe-dump-test-old-3-92aa2d14c3
- aps-icvrbase-tbe-dump-test-old-timed-9ac1869846

Ran the following models with the added logging:
- aps-icvrbase-tbe-dump-test-new-1-fcb93df6a6
- aps-icvrbase-tbe-dump-test-new-2-3f15ec3a29
- aps-icvrbase-tbe-dump-test-new-3-211d3c3f01
- aps-icvrbase-tbe-dump-test-new-timed-6e6a932849

Differential Revision: D84624978
@netlify

netlify bot commented Oct 20, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 0a2d78c |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68f69038040fff00080058a1 |
| 😎 Deploy Preview | https://deploy-preview-5029--pytorch-fbgemm-docs.netlify.app |

meta-cla bot added the `cla signed` label Oct 20, 2025
@meta-codesync
Contributor

meta-codesync bot commented Oct 20, 2025

@ashuaibi7 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84624978.
