Allow aggregating expression by multiple columns #42

Merged
maddyduran merged 3 commits into develop from agg_multiple_cols on Aug 19, 2025

@rfriedman22
Collaborator

I extended hooke:::aggregated_expr_data() so that the user can specify multiple columns to aggregate by (e.g. both cell type and perturbation). I tested the updated function with default parameters and confirmed that it returns the same result as before: mean_expression, fraction_expressing, and specificity are equal for all values, and the first column of the result is still cell_group when only one column is given. When multiple columns are specified, the result puts all of those columns first. For example:

agg_expr <- hooke:::aggregated_expr_data(cds, c("perturbation", "cell_type_broad_abbrev")) %>%
  head()

returns:

  perturbation cell_type_broad_abbrev            gene_id gene_short_name
1     ctrl-inj                    RPC ENSDARG00000000001         slc35a5
2     ctrl-inj                    RPC ENSDARG00000000002          ccdc80
3     ctrl-inj                    RPC ENSDARG00000000018            nrf1
4     ctrl-inj                    RPC ENSDARG00000000019           ube2h
5     ctrl-inj                    RPC ENSDARG00000000068       slc9a3r1a
6     ctrl-inj                    RPC ENSDARG00000000069             dap
  fraction_expressing mean_expression specificity
1         0.002456332     0.002757203 0.051766341
2         0.002183406     0.001340389 0.075315424
3         0.086244541     0.083973716 0.077649102
4         0.021834061     0.019834487 0.016199531
5         0.003548035     0.002764963 0.008075803
6         0.036026201     0.034195728 0.027027247
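
For anyone who wants to see the grouping pattern in isolation, here is a minimal base-R sketch of aggregating by one or several columns. It is not the hooke implementation: the toy table, the aggregate_by() helper, and the cell_type column name are assumptions made up for illustration, and specificity is left out.

```r
# Toy data: 4 cells, 2 genes. All names below are hypothetical.
cells <- data.frame(
  cell = paste0("c", 1:4),
  perturbation = c("ctrl-inj", "ctrl-inj", "ko", "ko"),
  cell_type = c("RPC", "RPC", "RPC", "RPC"),
  stringsAsFactors = FALSE
)
expr <- data.frame(
  cell = rep(cells$cell, times = 2),
  gene_id = rep(c("g1", "g2"), each = 4),
  expression = c(0, 2, 1, 3, 0, 0, 4, 0),
  stringsAsFactors = FALSE
)
# Long format: one row per (cell, gene) with the cell's metadata attached.
long <- merge(expr, cells, by = "cell")

# Hypothetical helper: aggregate expression by any number of grouping columns.
aggregate_by <- function(long, group_cols) {
  by_cols <- c(group_cols, "gene_id")
  agg <- aggregate(
    long["expression"],
    by = long[by_cols],
    FUN = function(x) {
      c(fraction_expressing = mean(x > 0), mean_expression = mean(x))
    }
  )
  # aggregate() stores the two statistics in a matrix column; flatten it so
  # the grouping columns come first, followed by gene_id and the statistics.
  out <- cbind(agg[by_cols], as.data.frame(agg$expression))
  out <- out[do.call(order, unname(as.list(out[by_cols]))), ]
  rownames(out) <- NULL
  out
}

by_one <- aggregate_by(long, "cell_type")
by_two <- aggregate_by(long, c("perturbation", "cell_type"))
print(by_two)
```

With a single grouping column the result has one row per group and gene; with two, the grouping columns appear first in the order given, which mirrors the column layout shown above.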

rfriedman22 requested a review from maddyduran on August 4, 2025 21:23
@rfriedman22
Collaborator Author

Per discussion with Maddy, the specificity scoring is now optional. This is the slowest step, and if we are aggregating by e.g. cell type and perturbation, we don't want specificity metrics to be stratified by perturbation status.
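
A rough sketch of what an opt-out flag for this step could look like, under stated assumptions: the argument name compute_specificity is hypothetical and may not match the name used in the PR, and the score below (each group's share of a gene's summed mean expression) is a simple stand-in, not the specificity metric hooke computes.

```r
# Hypothetical wrapper around an already-aggregated table.
# `compute_specificity` is an assumed argument name for illustration only.
add_specificity <- function(agg, compute_specificity = TRUE) {
  if (!compute_specificity) {
    return(agg)  # skip the expensive step and leave the table unchanged
  }
  # Stand-in score (NOT hooke's metric): a group's share of the gene's
  # total mean expression across all groups.
  gene_total <- ave(agg$mean_expression, agg$gene_id, FUN = sum)
  agg$specificity <- ifelse(gene_total > 0, agg$mean_expression / gene_total, 0)
  agg
}

agg <- data.frame(
  cell_group = c("RPC", "RPC", "MG", "MG"),
  gene_id = c("g1", "g2", "g1", "g2"),
  mean_expression = c(1, 0, 3, 0),
  stringsAsFactors = FALSE
)

with_spec <- add_specificity(agg)
without_spec <- add_specificity(agg, compute_specificity = FALSE)
```

Because any score of this kind is computed across whatever groups are present, grouping by cell type and perturbation together would split each cell type's score by perturbation status, which is the behavior the flag lets callers avoid.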

maddyduran merged commit c685628 into develop on Aug 19, 2025
1 check failed
maddyduran deleted the agg_multiple_cols branch on August 19, 2025 16:04
