Store and Access HVG Gene Names in AnnData #246
…mmands, coding style, testing, and security
- Added a `--verbose` flag to the inference CLI for detailed gene name mapping output.
- Implemented HVG name retrieval and validation in preprocessing and prediction scripts.
- Introduced constants for HVG variable names and updated relevant functions to use them.
- Enhanced logging for HVG name availability and warnings for missing data.
- Updated dataset class to default to the new HVG names key.

- Added storage of highly variable gene (HVG) names in `adata.uns["X_hvg_var_names"]` for improved downstream mapping.
- Updated documentation to reflect changes in HVG gene name access and backward compatibility.
- Introduced tests to validate HVG name retrieval and ensure compatibility with existing workflows.
- Enhanced inference and preprocessing scripts to preserve HVG names during data processing.
Summary of Changes
Hello @nick-youngblut, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the STATE framework's handling of Highly Variable Genes (HVGs) by ensuring their names are consistently stored and accessible within AnnData objects. Storing the names explicitly supports downstream analysis tools that require gene ID mapping. The changes are described as fully backward compatible, with a migration path and error handling for existing datasets.
Code Review
This pull request is a well-structured enhancement to store highly variable gene (HVG) names directly within AnnData objects, which is a valuable improvement for downstream tooling. The changes include new constants and utility modules, updates to preprocessing and CLI commands, and comprehensive documentation and testing. While the overall implementation is solid, I've identified a couple of critical issues, including a hardcoded file path and a logic error in a default parameter that could lead to incorrect behavior. I've also noted some opportunities for code simplification to improve maintainability. Addressing these points will make this a very strong contribution.
gene_names = np.load(
    "/large_storage/ctc/userspace/aadduri/datasets/tahoe_19k_to_2k_names.npy", allow_pickle=True
)
@abhinadduri
I'm not sure how to update this
- Moved HVG name assignment to a single conditional block for clarity and consistency.
- Removed redundant code for HVG name storage in `adata.uns` to enhance maintainability.
- Cleaned up test file by removing unused numpy import.
Summary
This PR enhances STATE's handling of highly variable genes (HVGs) by storing gene names directly in AnnData objects alongside the HVG expression matrix. This enables downstream tools like `pdex` to map predictions back to gene IDs without requiring additional metadata files.

Changes
Core Infrastructure
- `src/state/tx/constants.py`: Centralizes shared constants for TX workflows
- `src/state/tx/utils/hvg.py`: Provides functions to retrieve and validate HVG gene names with fallback mechanisms
Preprocessing Enhancements
- `preprocess_train`: Now stores HVG gene names in `adata.uns["X_hvg_var_names"]` alongside the HVG matrix in `adata.obsm["X_hvg"]`
CLI Improvements
- `infer` command: Added `--verbose` flag to show HVG name mapping details and status reporting
- `predict` command: Preserves HVG names in prediction outputs
Documentation & Migration
- `AGENTS.md`: Comprehensive development guidelines
- `docs/migration/hvg_var_names.md`: Backward compatibility notes and backfill script for existing data
Testing
Backward Compatibility
This change is fully backward compatible:
- Falls back to `adata.var.highly_variable` when available
Usage Examples
Accessing HVG Gene Names
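The original snippet for this section did not survive extraction. As a minimal sketch of what access might look like, assuming only the keys named in this PR (`obsm["X_hvg"]` and `uns["X_hvg_var_names"]`), the following pairs each HVG matrix column with its gene name. The `SimpleNamespace` stand-in and the gene names are hypothetical and are used only so the example runs without a data file; a real AnnData object exposes the same `.obsm` and `.uns` attributes.

```python
from types import SimpleNamespace

# Hypothetical stand-in exposing the same attributes as an AnnData object.
adata = SimpleNamespace(
    obsm={"X_hvg": [[0.0, 1.5], [2.0, 0.3]]},  # cells x HVGs
    uns={"X_hvg_var_names": ["GeneA", "GeneC"]},
)

# Gene names are stored next to the matrix, one name per column.
hvg_names = [str(n) for n in adata.uns["X_hvg_var_names"]]
hvg_matrix = adata.obsm["X_hvg"]

# Column j of the HVG matrix corresponds to hvg_names[j].
col_index = {name: j for j, name in enumerate(hvg_names)}
gene_c_values = [row[col_index["GeneC"]] for row in hvg_matrix]
```

Building the name-to-column index once keeps later lookups by gene ID cheap, which is the kind of mapping a downstream tool needs.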
Backfilling Existing Data
For pre-existing datasets, use the provided backfill script to add HVG names to existing files.
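The backfill script itself lives in `docs/migration/hvg_var_names.md` and is not reproduced here. The following is only a sketch of the idea, assuming the names can be recovered from `adata.var["highly_variable"]` and that the recovered count must match the width of `obsm["X_hvg"]`. The function name `backfill_hvg_names` is hypothetical, not the script's actual interface.

```python
HVG_OBSM_KEY = "X_hvg"                  # key named in this PR
HVG_VAR_NAMES_KEY = "X_hvg_var_names"   # key named in this PR


def backfill_hvg_names(adata):
    """Add HVG names to adata.uns in place; return True if they were written.

    Hypothetical helper: works on any object exposing AnnData-like
    .uns, .obsm, .var and .var_names attributes.
    """
    if HVG_VAR_NAMES_KEY in adata.uns:
        return False  # names already stored, nothing to do
    if HVG_OBSM_KEY not in adata.obsm or "highly_variable" not in adata.var:
        return False  # not enough information to recover the names
    mask = adata.var["highly_variable"]
    names = [str(n) for n, keep in zip(adata.var_names, mask) if keep]
    # Refuse to write names that cannot line up with the matrix columns.
    if len(names) != adata.obsm[HVG_OBSM_KEY].shape[1]:
        return False
    adata.uns[HVG_VAR_NAMES_KEY] = names
    return True
```

With a real file, one would load it with `anndata.read_h5ad`, call the helper, and save the result with `adata.write_h5ad`. The width check matters because a mask recomputed after preprocessing may no longer describe the stored matrix.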
Technical Details
- `{obsm_key}_var_names` allows extension to other embedding types
Testing
Risks & Mitigations
Note
Stores HVG gene names in `adata.uns['X_hvg_var_names']` and propagates them across preprocess, infer, and predict with utilities, CLI updates, docs, and tests.
- `state.tx.constants` with `HVG_VAR_NAMES_KEY` and `HVG_OBSM_KEY`.
- `state.tx.utils.hvg` to fetch/validate HVG names with fallbacks (`{obsm_key}_var_names`, `X_hvg_var_names`, or `var.highly_variable`).
- `tx preprocess_train`: writes HVG matrix to `obsm['X_hvg']` and gene names to `uns['X_hvg_var_names']`.
- `tx preprocess_infer`: logs HVG-name presence and warns if missing when using `X_hvg`.
- `tx infer`: new `--verbose`; reports HVG-name status and warns when missing.
- `tx predict`: includes HVG names in outputs (`uns['X_hvg_var_names']`); sets `var` when dimensions match.
- `scGPTPerturbationDataset`: default `hvg_names_uns_key='X_hvg_var_names'` and uses it to read gene names.
- `README` with accessing HVG names.
- `docs/migration/hvg_var_names.md`.
- `AGENTS.md` guidelines.
- `uns['X_hvg_var_names']` presence.

Written by Cursor Bugbot for commit 4750d7a.
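The fallback order listed in the note (`{obsm_key}_var_names`, then `X_hvg_var_names`, then `var.highly_variable`) could be sketched as below. This is an illustration under stated assumptions, not the code in `state.tx.utils.hvg`: the function name `resolve_hvg_names`, its return shape, and the stand-in object are all hypothetical.

```python
from types import SimpleNamespace

DEFAULT_NAMES_KEY = "X_hvg_var_names"  # key named in this PR


def resolve_hvg_names(adata, obsm_key="X_hvg"):
    """Return (names, source), trying each fallback in the documented order."""
    # 1) and 2): explicit names stored in .uns
    for key in (f"{obsm_key}_var_names", DEFAULT_NAMES_KEY):
        if key in adata.uns:
            return [str(n) for n in adata.uns[key]], f"uns[{key!r}]"
    # 3): derive names from the highly_variable mask on .var
    if "highly_variable" in adata.var:
        mask = adata.var["highly_variable"]
        names = [str(n) for n, keep in zip(adata.var_names, mask) if keep]
        return names, "var.highly_variable"
    return None, "missing"


# Hypothetical legacy object without stored names: the mask is used instead.
legacy = SimpleNamespace(
    uns={},
    var={"highly_variable": [True, False, True]},
    var_names=["GeneA", "GeneB", "GeneC"],
)
names, source = resolve_hvg_names(legacy)
```

Returning the source along with the names makes it straightforward for a caller to log where the names came from, or to warn when they are missing, as the `infer` and `preprocess_infer` entries describe.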