Store and Access HVG Gene Names in AnnData #246

nick-youngblut · 2025-12-19T22:41:34Z

Summary

This PR enhances STATE's handling of highly variable genes (HVGs) by storing gene names directly in AnnData objects alongside the HVG expression matrix. This enables downstream tools like pdex to properly map predictions back to gene IDs without requiring additional metadata files.

Changes

Core Infrastructure

New constants module (src/state/tx/constants.py): Centralizes shared constants for TX workflows
New HVG utilities (src/state/tx/utils/hvg.py): Provides functions to retrieve and validate HVG gene names with fallback mechanisms

Preprocessing Enhancements

Enhanced preprocess_train: Now stores HVG gene names in adata.uns["X_hvg_var_names"] alongside the HVG matrix in adata.obsm["X_hvg"]
Updated inference preprocessing: Added validation and warning when HVG names are missing
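
As a rough illustration of the storage step described above, the sketch below copies the HVG columns into obsm["X_hvg"] and records the matching gene names in uns["X_hvg_var_names"]. The helper name store_hvg is hypothetical, and a SimpleNamespace stands in for an AnnData object so the snippet runs without anndata installed; the actual preprocess_train code may differ.

```python
from types import SimpleNamespace

import numpy as np


def store_hvg(adata, hvg_mask):
    """Hypothetical helper: keep the HVG matrix and its gene names together."""
    mask = np.asarray(hvg_mask, dtype=bool)
    # Columns of X that correspond to highly variable genes.
    adata.obsm["X_hvg"] = adata.X[:, mask]
    # Gene names in the same column order as obsm["X_hvg"].
    adata.uns["X_hvg_var_names"] = np.array(
        [str(name) for name in np.asarray(adata.var_names)[mask]]
    )


# Stand-in for an AnnData object with 3 cells and 4 genes (assumption).
adata = SimpleNamespace(
    X=np.arange(12, dtype=float).reshape(3, 4),
    var_names=["g0", "g1", "g2", "g3"],
    obsm={},
    uns={},
)
store_hvg(adata, [True, False, True, False])
print(adata.obsm["X_hvg"].shape, adata.uns["X_hvg_var_names"].tolist())
```

Keeping the names in the same column order as the matrix is what lets a downstream tool map column i of X_hvg back to a gene ID.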

CLI Improvements

Enhanced infer command: Added --verbose flag to show HVG name mapping details and status reporting
Updated predict command: Preserves HVG names in prediction outputs

Documentation & Migration

New repository guidelines (AGENTS.md): Comprehensive development guidelines
Migration guide (docs/migration/hvg_var_names.md): Backward compatibility notes and backfill script for existing data
Updated README: Added section on accessing HVG gene names

Testing

New test suites: Comprehensive tests for HVG utilities, inference pipeline, prediction outputs, and preprocessing workflows
All existing tests pass: No regressions introduced

Backward Compatibility

This change is fully backward compatible:

Existing preprocessed data: Inference commands continue working without modification
Non-blocking warnings: Users are notified when HVG names are missing but execution proceeds
Fallback mechanisms: Code can still recover gene names from adata.var.highly_variable when available
No API changes: Existing workflows continue functioning unchanged
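
The non-blocking behavior listed above could look roughly like the following. This is a minimal sketch: check_hvg_names and the logger name are hypothetical, and a SimpleNamespace with a uns dict stands in for AnnData. The point is only that a missing key produces a warning, not an exception.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("hvg_check")  # hypothetical logger name


def check_hvg_names(adata, uns_key="X_hvg_var_names"):
    """Return True if HVG names are present; warn (never raise) otherwise."""
    if uns_key in adata.uns:
        return True
    logger.warning(
        "HVG names not found in adata.uns[%r]; continuing without gene-name mapping.",
        uns_key,
    )
    return False


# Older file without names vs. file written by the updated preprocessing.
legacy = SimpleNamespace(uns={})
current = SimpleNamespace(uns={"X_hvg_var_names": ["g0", "g2"]})
legacy_ok = check_hvg_names(legacy)
current_ok = check_hvg_names(current)
```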

Usage Examples

Accessing HVG Gene Names

import anndata as ad
import pandas as pd

# After preprocessing with the latest STATE version
adata = ad.read_h5ad("preprocessed.h5ad")
hvg_names = adata.uns.get("X_hvg_var_names")  # None for files preprocessed before this change

# Construct a downstream AnnData for tools like pdex
adata_for_pdex = ad.AnnData(
    X=adata.obsm["X_hvg"],
    obs=adata.obs,
    var=pd.DataFrame(index=hvg_names),
)

Backfilling Existing Data

For pre-existing datasets, use the provided backfill script to add HVG names to existing files.
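
The actual backfill script lives in the migration guide; the following is only a sketch of the idea, assuming the original var table still carries a highly_variable column. The function name backfill_hvg_names is hypothetical and a SimpleNamespace stands in for AnnData. With a real file you would read it with anndata, call the helper, and write it back out.

```python
from types import SimpleNamespace

import numpy as np


def backfill_hvg_names(adata, obsm_key="X_hvg", uns_key="X_hvg_var_names"):
    """Hypothetical backfill: recover HVG names from var.highly_variable."""
    if uns_key in adata.uns:
        return False  # names already stored, nothing to do
    mask = np.asarray(adata.var["highly_variable"], dtype=bool)
    names = np.asarray(adata.var_names)[mask]
    n_cols = adata.obsm[obsm_key].shape[1]
    if len(names) != n_cols:
        raise ValueError(
            f"{len(names)} highly variable genes but {obsm_key} has {n_cols} columns"
        )
    adata.uns[uns_key] = np.array([str(name) for name in names])
    return True


# Stand-in for a file preprocessed before this change (assumption).
old = SimpleNamespace(
    var={"highly_variable": [True, False, True]},
    var_names=["g0", "g1", "g2"],
    obsm={"X_hvg": np.zeros((2, 2))},
    uns={},
)
changed = backfill_hvg_names(old)
changed_again = backfill_hvg_names(old)
```

The second call returns False, so running the backfill twice on the same object is harmless.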

Technical Details

HVG names are stored as NumPy arrays of Python strings for h5ad compatibility
Naming convention {obsm_key}_var_names allows extension to other embedding types
Comprehensive validation ensures gene name arrays match embedding dimensions
Fallback logic prioritizes explicit uns keys over implicit var-based recovery
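
To make the lookup order and the dimension check concrete, here is a sketch under stated assumptions: resolve_hvg_names is a hypothetical name (not necessarily what src/state/tx/utils/hvg.py exposes), raising on a mismatched explicit key and returning None on a mismatched fallback are illustrative choices, and a SimpleNamespace stands in for AnnData.

```python
from types import SimpleNamespace

import numpy as np


def resolve_hvg_names(adata, obsm_key="X_hvg"):
    """Hypothetical lookup: explicit uns keys first, then var.highly_variable."""
    n_features = adata.obsm[obsm_key].shape[1]

    # 1. Explicit keys in uns take priority.
    for key in (f"{obsm_key}_var_names", "X_hvg_var_names"):
        if key in adata.uns:
            names = [str(name) for name in adata.uns[key]]
            if len(names) != n_features:
                raise ValueError(
                    f"{key} has {len(names)} names but {obsm_key} has {n_features} columns"
                )
            return names

    # 2. Implicit recovery from the var table, only if the counts agree.
    if "highly_variable" in adata.var:
        mask = np.asarray(adata.var["highly_variable"], dtype=bool)
        names = [str(name) for name in np.asarray(adata.var_names)[mask]]
        if len(names) == n_features:
            return names

    return None


new_style = SimpleNamespace(
    uns={"X_hvg_var_names": np.array(["A", "B"])},
    var={"highly_variable": [True, True, False]},
    var_names=["x", "y", "z"],
    obsm={"X_hvg": np.zeros((4, 2))},
)
old_style = SimpleNamespace(
    uns={},
    var={"highly_variable": [True, True, False]},
    var_names=["x", "y", "z"],
    obsm={"X_hvg": np.zeros((4, 2))},
)
from_uns = resolve_hvg_names(new_style)
from_var = resolve_hvg_names(old_style)
```

When both sources exist, the names stored in uns win; the var-based path is used only for objects that lack the explicit key.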

Testing

All existing tests pass
New test coverage includes:
- HVG name retrieval with multiple fallback scenarios
- Inference pipeline preservation of HVG metadata
- Prediction output includes HVG names
- Preprocessing correctly stores HVG names
- End-to-end workflow validation

Risks & Mitigations

Low risk: Fully backward compatible with existing workflows
Data integrity: Validation ensures HVG name arrays match embedding dimensions
Performance: Minimal overhead - only stores additional metadata
Migration: Clear documentation and backfill scripts provided

Note

Stores HVG gene names in adata.uns['X_hvg_var_names'] and propagates them across preprocess, infer, and predict with utilities, CLI updates, docs, and tests.

Core TX:
- Add state.tx.constants with HVG_VAR_NAMES_KEY and HVG_OBSM_KEY.
- New state.tx.utils.hvg to fetch/validate HVG names with fallbacks ({obsm_key}_var_names, X_hvg_var_names, or var.highly_variable).
Preprocessing:
- tx preprocess_train: writes HVG matrix to obsm['X_hvg'] and gene names to uns['X_hvg_var_names'].
- tx preprocess_infer: logs HVG-name presence and warns if missing when using X_hvg.
Inference/Prediction CLI:
- tx infer: new --verbose; reports HVG-name status and warns when missing.
- tx predict: includes HVG names in outputs (uns['X_hvg_var_names']); sets var when dimensions match.
Datasets:
- scGPTPerturbationDataset: default hvg_names_uns_key='X_hvg_var_names' and uses it to read gene names.
Docs:
- Update README with accessing HVG names.
- Add migration guide docs/migration/hvg_var_names.md.
- Add AGENTS.md guidelines.
Tests:
- Add tests for HVG utils, preprocess (train/infer), infer pipeline, and predict outputs ensuring uns['X_hvg_var_names'] presence.

Written by Cursor Bugbot for commit 4750d7a. This will update automatically on new commits.
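
For the predict behavior summarized above ("sets var when dimensions match"), a minimal sketch might look like this. attach_hvg_names is a hypothetical name, and a SimpleNamespace stands in for the prediction AnnData; the real command may handle the mismatch case differently.

```python
from types import SimpleNamespace

import numpy as np


def attach_hvg_names(pred, hvg_names):
    """Hypothetical helper: always store names in uns, set var_names only if widths agree."""
    names = [str(name) for name in hvg_names]
    pred.uns["X_hvg_var_names"] = np.array(names)
    if pred.X.shape[1] == len(names):
        pred.var_names = names
        return True
    return False


# Prediction in HVG space: widths agree, so var_names is replaced.
hvg_pred = SimpleNamespace(X=np.zeros((2, 2)), var_names=["0", "1"], uns={})
set_on_hvg = attach_hvg_names(hvg_pred, ["g0", "g2"])

# Prediction in a different feature space: names are kept in uns only.
other_pred = SimpleNamespace(X=np.zeros((2, 4)), var_names=["0", "1", "2", "3"], uns={})
set_on_other = attach_hvg_names(other_pred, ["g0", "g2"])
```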

…mmands, coding style, testing, and security

- Added a `--verbose` flag to the inference CLI for detailed gene name mapping output.
- Implemented HVG name retrieval and validation in preprocessing and prediction scripts.
- Introduced constants for HVG variable names and updated relevant functions to utilize them.
- Enhanced logging for HVG name availability and warnings for missing data.
- Updated dataset class to default to the new HVG names key.

- Added storage of highly variable gene (HVG) names in `adata.uns["X_hvg_var_names"]` for improved downstream mapping.
- Updated documentation to reflect changes in HVG gene name access and backward compatibility.
- Introduced tests to validate HVG name retrieval and ensure compatibility with existing workflows.
- Enhanced inference and preprocessing scripts to preserve HVG names during data processing.

gemini-code-assist · 2025-12-19T22:41:58Z

Summary of Changes

Hello @nick-youngblut, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the STATE framework's ability to manage Highly Variable Genes (HVGs) by ensuring their names are consistently stored and accessible within AnnData objects. This explicit storage facilitates seamless integration with downstream analysis tools that require gene ID mapping, improving data integrity and usability. The changes are implemented with full backward compatibility, providing clear migration paths and robust error handling for existing datasets.

Highlights

HVG Gene Name Storage: Highly Variable Gene (HVG) names are now explicitly stored in AnnData objects under adata.uns['X_hvg_var_names'] alongside the HVG expression matrix, enabling direct mapping of predictions to gene IDs.
New Utility Modules: Introduced src/state/tx/constants.py for centralized shared constants and src/state/tx/utils/hvg.py for functions to retrieve, validate, and detect preprocessing versions of HVG gene names with fallback mechanisms.
Preprocessing Enhancements: The preprocess_train command now automatically stores HVG gene names, and preprocess_infer includes validation and warnings for missing HVG names.
CLI Improvements: The infer command gained a --verbose flag to display HVG name mapping details, and the predict command now preserves HVG names in its output.
Documentation and Migration: Added new repository guidelines (AGENTS.md), a detailed migration guide (docs/migration/hvg_var_names.md) with a backfill script for existing data, and updated the README.md to reflect the new HVG name access.
Comprehensive Testing: New test suites have been added for HVG utilities, inference pipeline, prediction outputs, and preprocessing workflows, ensuring no regressions and proper functionality.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service for Gemini Code Assist. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request is a well-structured enhancement to store highly variable gene (HVG) names directly within AnnData objects, which is a valuable improvement for downstream tooling. The changes include new constants and utility modules, updates to preprocessing and CLI commands, and comprehensive documentation and testing. While the overall implementation is solid, I've identified a couple of critical issues, including a hardcoded file path and a logic error in a default parameter that could lead to incorrect behavior. I've also noted some opportunities for code simplification to improve maintainability. Addressing these points will make this a very strong contribution.

gemini-code-assist · 2025-12-19T22:43:39Z

src/state/_cli/_tx/_predict.py

            gene_names = np.load(
                "/large_storage/ctc/userspace/aadduri/datasets/tahoe_19k_to_2k_names.npy", allow_pickle=True
            )


This code contains a hardcoded absolute file path. This is a critical issue as it makes the code non-portable and will cause it to fail on any system where this specific path does not exist. This path should be removed or provided through a configuration option.
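
One way to act on this suggestion, sketched under assumptions: the --gene-names-path flag and the load_gene_names helper below are hypothetical and not part of the existing CLI. The idea is to load names from a file only when the user supplies a path, and otherwise fall back to the names stored in uns.

```python
import argparse

import numpy as np


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical option replacing the hardcoded absolute path.
    parser.add_argument(
        "--gene-names-path",
        default=None,
        help="Optional .npy file with gene names; defaults to names stored in uns.",
    )
    return parser


def load_gene_names(path, uns):
    """Prefer an explicit file when given, else the names stored in uns."""
    if path is not None:
        return np.load(path, allow_pickle=True)
    return uns.get("X_hvg_var_names")


# No flag passed: names come from the (stand-in) uns mapping.
args = build_parser().parse_args([])
gene_names = load_gene_names(args.gene_names_path, {"X_hvg_var_names": ["g0", "g2"]})
```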

@abhinadduri
I'm not sure how to update this

- Moved HVG name assignment to a single conditional block for clarity and consistency.
- Removed redundant code for HVG name storage in `adata.uns` to enhance maintainability.
- Cleaned up test file by removing unused numpy import.

nick-youngblut added 3 commits December 19, 2025 13:12

docs: add repository guidelines for project structure, development co…

0ff67a0

…mmands, coding style, testing, and security

nick-youngblut requested a review from a team as a code owner December 19, 2025 22:41

gemini-code-assist bot reviewed Dec 19, 2025

cursor bot reviewed Dec 19, 2025
