
Conversation

@ashuaibi7


Summary:
X-link: facebookresearch/FBGEMM#2042

Add granular sparse static memory breakdown metrics for TBE to enable validation of planner estimates against runtime memory usage. This implementation separates static sparse memory (weights, optimizer states, cache) from ephemeral memory (activations, IO buffers, gradients) and provides per-component HBM/UVM categorization. The existing `tbe.total_hbm_usage` aggregates all memory without distinguishing between persistent storage and ephemeral buffers, making it difficult to identify and validate static sparse parameter estimates.

## Changes

### 1. New Scuba Metrics (`tbe_stats_reporters.py`)
Added 10 granular memory metrics to `SyncBatchODSStatsReporter`:

**HBM metrics:**
- `tbe.hbm.sparse_params` - Embedding weights in HBM
- `tbe.hbm.optimizer_states` - Momentum states in HBM
- `tbe.hbm.cache` - Cache storage in HBM
- `tbe.hbm.total_static_sparse` - Total static memory in HBM
- `tbe.hbm.ephemeral` - Ephemeral memory in HBM (activations, temp buffers, etc.)

**UVM metrics:** the same five metrics, reported under `tbe.uvm.*` (see the sketch below)
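
A minimal sketch of how these ten keys could be assembled and emitted; the dictionary layout and the `report_fn` callable are assumptions for illustration, not the actual `SyncBatchODSStatsReporter` interface:

```python
from typing import Callable, Dict

# The five per-location components described above; "ephemeral" is derived,
# the rest are static sparse storage.
MEMORY_COMPONENTS = (
    "sparse_params",
    "optimizer_states",
    "cache",
    "total_static_sparse",
    "ephemeral",
)

def emit_memory_metrics(
    breakdown: Dict[str, Dict[str, int]],   # {"hbm": {...}, "uvm": {...}}, in bytes
    report_fn: Callable[[str, int], None],  # stand-in for the reporter's emit path
) -> None:
    # Produces exactly the ten keys listed above: tbe.hbm.* and tbe.uvm.*
    for location in ("hbm", "uvm"):
        for component in MEMORY_COMPONENTS:
            report_fn(f"tbe.{location}.{component}", breakdown[location][component])
```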

### 2. Memory Categorization Logic (`split_table_batched_embeddings_ops_training.py`)
- Added helper methods:
  - `_get_tensor_memory()` - Get tensor memory size
  - `_categorize_memory_by_location()` - Categorize tensors into HBM/UVM
- Refactored `_report_tbe_mem_usage()` with clean list-based tensor grouping (sketched after this list)
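
A rough sketch of what the two helpers could look like, assuming a simple CUDA-vs-everything-else split for HBM/UVM; the actual implementations in the diff may detect host/UVM tensors differently:

```python
from typing import Dict, Iterable

import torch

def _get_tensor_memory(tensor: torch.Tensor) -> int:
    # Tensor storage size in bytes (0 for empty placeholder tensors).
    return tensor.numel() * tensor.element_size()

def _categorize_memory_by_location(tensors: Iterable[torch.Tensor]) -> Dict[str, int]:
    # Assumption for this sketch: CUDA-resident tensors count toward HBM,
    # host/managed tensors toward UVM. The real helper may use finer checks.
    totals = {"hbm": 0, "uvm": 0}
    for t in tensors:
        bucket = "hbm" if t.is_cuda else "uvm"
        totals[bucket] += _get_tensor_memory(t)
    return totals
```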

### 3. Memory Components
**Static Sparse:**
- Weights: `weights_dev`, `weights_host`, `weights_uvm`
- Optimizer: `momentum1_dev/host/uvm`, `momentum2_dev/host/uvm`
- Cache: `lxu_cache_weights`, `lxu_cache_state`, `lxu_state`, cache aux data

**Ephemeral (calculated):**
- `ephemeral = total_mem_usage - static_sparse`
- Includes IO buffers, activations, gradients (see the sketch below)
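
Putting the pieces together, a hypothetical sketch of the list-based grouping and the ephemeral calculation; `tbe` and `total_mem_usage` stand in for the module instance and the existing aggregate, and `_get_tensor_memory` is the helper sketched above:

```python
from typing import Dict

def compute_static_sparse_and_ephemeral(tbe, total_mem_usage: int) -> Dict[str, int]:
    # Group the persistent tensors by component, mirroring the lists above.
    weight_tensors = [tbe.weights_dev, tbe.weights_host, tbe.weights_uvm]
    optimizer_tensors = [
        tbe.momentum1_dev, tbe.momentum1_host, tbe.momentum1_uvm,
        tbe.momentum2_dev, tbe.momentum2_host, tbe.momentum2_uvm,
    ]
    cache_tensors = [tbe.lxu_cache_weights, tbe.lxu_cache_state, tbe.lxu_state]

    static_sparse = sum(
        _get_tensor_memory(t)
        for t in weight_tensors + optimizer_tensors + cache_tensors
    )
    # Ephemeral memory is whatever the aggregate counts beyond static sparse
    # storage: IO buffers, activations, gradients.
    return {
        "static_sparse": static_sparse,
        "ephemeral": total_mem_usage - static_sparse,
    }
```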



Detailed analysis revealed a QPS drop when the additional logging is enabled. A head-to-head comparison of the time spent doing the logging shows a ~4x increase in duration (see https://fburl.com/scuba/tbe_stats_runtime/y85ur4k9).

- Avg QPS across 4 runs without the added logging: 246k vs. with it: 243k (~1.2% QPS drop)

Ran the following models without the added logging:
- aps-icvrbase-tbe-dump-test-old-1-d234a33214
- aps-icvrbase-tbe-dump-test-old-2-f5d7f5d97a
- aps-icvrbase-tbe-dump-test-old-3-92aa2d14c3
- aps-icvrbase-tbe-dump-test-old-timed-9ac1869846

Ran the following models with the added logging:
- aps-icvrbase-tbe-dump-test-new-1-fcb93df6a6
- aps-icvrbase-tbe-dump-test-new-2-3f15ec3a29
- aps-icvrbase-tbe-dump-test-new-3-211d3c3f01
- aps-icvrbase-tbe-dump-test-new-timed-6e6a932849

Differential Revision: D84624978
@netlify

netlify bot commented Oct 20, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 0a2d78c |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68f69038040fff00080058a1 |
| 😎 Deploy Preview | https://deploy-preview-5029--pytorch-fbgemm-docs.netlify.app |

meta-cla bot added the `cla signed` label Oct 20, 2025
@meta-codesync
Contributor

meta-codesync bot commented Oct 20, 2025

@ashuaibi7 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84624978.
