@nick-youngblut nick-youngblut commented Dec 19, 2025

Summary

This PR enhances STATE's handling of highly variable genes (HVGs) by storing gene names directly in AnnData objects alongside the HVG expression matrix. This enables downstream tools like pdex to properly map predictions back to gene IDs without requiring additional metadata files.

Changes

Core Infrastructure

  • New constants module (src/state/tx/constants.py): Centralizes shared constants for TX workflows
  • New HVG utilities (src/state/tx/utils/hvg.py): Provides functions to retrieve and validate HVG gene names with fallback mechanisms

Preprocessing Enhancements

  • Enhanced preprocess_train: Now stores HVG gene names in adata.uns["X_hvg_var_names"] alongside the HVG matrix in adata.obsm["X_hvg"]
  • Updated inference preprocessing: Added validation and warning when HVG names are missing

CLI Improvements

  • Enhanced infer command: Added --verbose flag to show HVG name mapping details and status reporting
  • Updated predict command: Preserves HVG names in prediction outputs

Documentation & Migration

  • New repository guidelines (AGENTS.md): Comprehensive development guidelines
  • Migration guide (docs/migration/hvg_var_names.md): Backward compatibility notes and backfill script for existing data
  • Updated README: Added section on accessing HVG gene names

Testing

  • New test suites: Comprehensive tests for HVG utilities, inference pipeline, prediction outputs, and preprocessing workflows
  • All existing tests pass: No regressions introduced

Backward Compatibility

This change is fully backward compatible:

  • Existing preprocessed data: Inference commands continue working without modification
  • Non-blocking warnings: Users are notified when HVG names are missing but execution proceeds
  • Fallback mechanisms: Code can still recover gene names from adata.var.highly_variable when available
  • No API changes: Existing workflows continue functioning unchanged
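The non-blocking behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `hvg_names_or_warn` and the logger name are hypothetical, the function takes a plain mapping standing in for `adata.uns`, and treating a length mismatch as a warning rather than an error is an assumption.

```python
import logging
from typing import Mapping, Optional

logger = logging.getLogger("hvg_compat_sketch")  # hypothetical logger name

HVG_VAR_NAMES_KEY = "X_hvg_var_names"  # uns key named in this PR


def hvg_names_or_warn(uns: Mapping[str, object], n_hvg: int) -> Optional[list]:
    """Return stored HVG names, or None after a warning if they are unusable.

    `uns` stands in for `adata.uns`; `n_hvg` is the number of columns in
    `adata.obsm["X_hvg"]`. Nothing is raised, so callers keep running.
    """
    names = uns.get(HVG_VAR_NAMES_KEY)
    if names is None:
        logger.warning(
            "%s is missing; continuing without HVG gene names", HVG_VAR_NAMES_KEY
        )
        return None
    names = [str(n) for n in names]
    if len(names) != n_hvg:
        # Assumption: a mismatch is reported but does not stop execution.
        logger.warning(
            "%s has %d names but the HVG matrix has %d columns; ignoring names",
            HVG_VAR_NAMES_KEY, len(names), n_hvg,
        )
        return None
    return names
```

With real data, a caller would pass `adata.uns` and `adata.obsm["X_hvg"].shape[1]`, and fall back to positional column indices when the result is `None`.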

Usage Examples

Accessing HVG Gene Names

import anndata as ad
import pandas as pd

# After preprocessing with latest STATE version
adata = ad.read_h5ad("preprocessed.h5ad")
# Returns None for files preprocessed before this change
hvg_names = adata.uns.get("X_hvg_var_names")

# Construct downstream AnnData for tools like pdex
adata_for_pdex = ad.AnnData(
    X=adata.obsm["X_hvg"],
    obs=adata.obs,
    var=pd.DataFrame(index=hvg_names),
)

Backfilling Existing Data

For pre-existing datasets, use the backfill script in the migration guide (docs/migration/hvg_var_names.md) to add HVG names to existing files.

Technical Details

  • HVG names are stored as NumPy arrays of Python strings for h5ad compatibility
  • Naming convention {obsm_key}_var_names allows extension to other embedding types
  • Comprehensive validation ensures gene name arrays match embedding dimensions
  • Fallback logic prioritizes explicit uns keys over implicit var-based recovery
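A rough sketch of the lookup order and dimension check listed above, written against plain Python containers so it runs without AnnData. The function name `resolve_hvg_names` and its signature are hypothetical and are not the API of `state.tx.utils.hvg`; raising `ValueError` on a mismatched explicit key is also an assumption.

```python
from typing import Mapping, Optional, Sequence


def resolve_hvg_names(
    uns: Mapping[str, Sequence[str]],
    n_features: int,
    obsm_key: str = "X_hvg",
    var_names: Optional[Sequence[str]] = None,
    highly_variable: Optional[Sequence[bool]] = None,
) -> Optional[list]:
    """Resolve gene names for an embedding with `n_features` columns.

    Explicit uns keys are tried first ({obsm_key}_var_names, then
    X_hvg_var_names); the boolean `highly_variable` mask over `var_names`
    is only used when neither key is present.
    """
    for key in (f"{obsm_key}_var_names", "X_hvg_var_names"):
        if key in uns:
            names = [str(n) for n in uns[key]]
            if len(names) != n_features:
                # Assumption: an explicit but inconsistent key is an error.
                raise ValueError(
                    f"{key} has {len(names)} names, expected {n_features}"
                )
            return names

    # Implicit recovery: keep the var names flagged as highly variable.
    if var_names is not None and highly_variable is not None:
        names = [str(n) for n, keep in zip(var_names, highly_variable) if keep]
        if len(names) == n_features:
            return names

    return None
```

With an AnnData object, the arguments would come from `adata.uns`, `adata.obsm[obsm_key].shape[1]`, `adata.var_names`, and `adata.var["highly_variable"]` when that column exists.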

Testing

  • All existing tests pass
  • New test coverage includes:
    • HVG name retrieval with multiple fallback scenarios
    • Inference pipeline preservation of HVG metadata
    • Prediction output includes HVG names
    • Preprocessing correctly stores HVG names
    • End-to-end workflow validation

Risks & Mitigations

  • Low risk: Fully backward compatible with existing workflows
  • Data integrity: Validation ensures HVG name arrays match embedding dimensions
  • Performance: Minimal overhead - only stores additional metadata
  • Migration: Clear documentation and backfill scripts provided

Note

Stores HVG gene names in adata.uns['X_hvg_var_names'] and propagates them across preprocess, infer, and predict with utilities, CLI updates, docs, and tests.

  • Core TX:
    • Add state.tx.constants with HVG_VAR_NAMES_KEY and HVG_OBSM_KEY.
    • New state.tx.utils.hvg to fetch/validate HVG names with fallbacks ({obsm_key}_var_names, X_hvg_var_names, or var.highly_variable).
  • Preprocessing:
    • tx preprocess_train: writes HVG matrix to obsm['X_hvg'] and gene names to uns['X_hvg_var_names'].
    • tx preprocess_infer: logs HVG-name presence and warns if missing when using X_hvg.
  • Inference/Prediction CLI:
    • tx infer: new --verbose; reports HVG-name status and warns when missing.
    • tx predict: includes HVG names in outputs (uns['X_hvg_var_names']); sets var when dimensions match.
  • Datasets:
    • scGPTPerturbationDataset: default hvg_names_uns_key='X_hvg_var_names' and uses it to read gene names.
  • Docs:
    • Update README with accessing HVG names.
    • Add migration guide docs/migration/hvg_var_names.md.
    • Add AGENTS.md guidelines.
  • Tests:
    • Add tests for HVG utils, preprocess (train/infer), infer pipeline, and predict outputs ensuring uns['X_hvg_var_names'] presence.

Written by Cursor Bugbot for commit 4750d7a. This will update automatically on new commits.

- Added a `--verbose` flag to the inference CLI for detailed gene name mapping output.
- Implemented HVG name retrieval and validation in preprocessing and prediction scripts.
- Introduced constants for HVG variable names and updated relevant functions to utilize them.
- Enhanced logging for HVG name availability and warnings for missing data.
- Updated dataset class to default to the new HVG names key.
- Added storage of highly variable gene (HVG) names in `adata.uns["X_hvg_var_names"]` for improved downstream mapping.
- Updated documentation to reflect changes in HVG gene name access and backward compatibility.
- Introduced tests to validate HVG name retrieval and ensure compatibility with existing workflows.
- Enhanced inference and preprocessing scripts to preserve HVG names during data processing.
@nick-youngblut nick-youngblut requested a review from a team as a code owner December 19, 2025 22:41
@gemini-code-assist

Summary of Changes

Hello @nick-youngblut, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the STATE framework's ability to manage Highly Variable Genes (HVGs) by ensuring their names are consistently stored and accessible within AnnData objects. This explicit storage facilitates seamless integration with downstream analysis tools that require gene ID mapping, improving data integrity and usability. The changes are implemented with full backward compatibility, providing clear migration paths and robust error handling for existing datasets.

Highlights

  • HVG Gene Name Storage: Highly Variable Gene (HVG) names are now explicitly stored in AnnData objects under adata.uns['X_hvg_var_names'] alongside the HVG expression matrix, enabling direct mapping of predictions to gene IDs.
  • New Utility Modules: Introduced src/state/tx/constants.py for centralized shared constants and src/state/tx/utils/hvg.py for functions to retrieve, validate, and detect preprocessing versions of HVG gene names with fallback mechanisms.
  • Preprocessing Enhancements: The preprocess_train command now automatically stores HVG gene names, and preprocess_infer includes validation and warnings for missing HVG names.
  • CLI Improvements: The infer command gained a --verbose flag to display HVG name mapping details, and the predict command now preserves HVG names in its output.
  • Documentation and Migration: Added new repository guidelines (AGENTS.md), a detailed migration guide (docs/migration/hvg_var_names.md) with a backfill script for existing data, and updated the README.md to reflect the new HVG name access.
  • Comprehensive Testing: New test suites have been added for HVG utilities, inference pipeline, prediction outputs, and preprocessing workflows, ensuring no regressions and proper functionality.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a well-structured enhancement to store highly variable gene (HVG) names directly within AnnData objects, which is a valuable improvement for downstream tooling. The changes include new constants and utility modules, updates to preprocessing and CLI commands, and comprehensive documentation and testing. While the overall implementation is solid, I've identified a couple of critical issues, including a hardcoded file path and a logic error in a default parameter that could lead to incorrect behavior. I've also noted some opportunities for code simplification to improve maintainability. Addressing these points will make this a very strong contribution.

Comment on lines 337 to 339
gene_names = np.load(
"/large_storage/ctc/userspace/aadduri/datasets/tahoe_19k_to_2k_names.npy", allow_pickle=True
)

critical

This code contains a hardcoded absolute file path. This is a critical issue as it makes the code non-portable and will cause it to fail on any system where this specific path does not exist. This path should be removed or provided through a configuration option.
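One possible shape for such a configuration option, as a minimal sketch only: the environment variable name `STATE_GENE_NAMES_PATH`, the function name, and the idea of an explicit CLI value taking precedence are all hypothetical and not part of this PR.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "STATE_GENE_NAMES_PATH"  # hypothetical variable name


def resolve_gene_names_path(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Pick the gene-names file from an explicit value, then the environment.

    Returns None when nothing is configured, so the caller can skip the
    lookup instead of depending on a machine-specific absolute path.
    """
    env = os.environ if env is None else env
    raw = cli_value or env.get(ENV_VAR)
    if not raw:
        return None
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"gene names file not found: {path}")
    return path
```

The existing `np.load(..., allow_pickle=True)` call would then run only when this returns a path.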

@nick-youngblut nick-youngblut Dec 19, 2025

@abhinadduri
I'm not sure how to update this

- Moved HVG name assignment to a single conditional block for clarity and consistency.
- Removed redundant code for HVG name storage in `adata.uns` to enhance maintainability.
- Cleaned up test file by removing unused numpy import.