Collaborator

@agerardy agerardy commented Oct 7, 2025

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes
#944
This PR implements normalization support for 3D EHRData objects. The implementation enables all existing normalization functions to work with longitudinal data of shape (n_obs, n_var, n_timestamps) while maintaining backward compatibility with 2D data.

Technical details
Treats .R as a named layer with 3D structure. Uses helper functions (_get_target_layer, _set_target_layer, normalize_3d_data, and _normalize_2d_data) to avoid code duplication.
Each variable is processed independently by flattening the observation and time dimensions into one axis (n_obs x n_timestamps), applying the sklearn normalization function, then reshaping back to 3D.
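
The flatten, scale, reshape step can be sketched in plain NumPy. This is only an illustration, not the code in this PR: `normalize_3d_per_variable` and `zscore_columns` are hypothetical names, and `zscore_columns` is a NaN-aware stand-in for an sklearn scaler's `fit_transform`, which accepts the same 2D column input.

```python
import numpy as np


def zscore_columns(values):
    """Stand-in for an sklearn scaler's fit_transform: column-wise z-score, ignoring NaNs."""
    mean = np.nanmean(values, axis=0)
    std = np.nanstd(values, axis=0)
    return (values - mean) / std


def normalize_3d_per_variable(data, scale_func=zscore_columns):
    """Hypothetical helper: normalize an (n_obs, n_var, n_timestamps) array variable by variable."""
    n_obs, n_var, n_timestamps = data.shape
    out = np.empty_like(data, dtype=float)
    for var_idx in range(n_var):
        # Flatten observations and time into one column of n_obs * n_timestamps values
        flat = data[:, var_idx, :].reshape(-1, 1)
        # Scale the column, then restore the (n_obs, n_timestamps) layout
        out[:, var_idx, :] = scale_func(flat).reshape(n_obs, n_timestamps)
    return out


rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(10, 3, 4))
data[0, 1, 2] = np.nan  # missing values stay missing after scaling

normalized = normalize_3d_per_variable(data)
```

Because each variable is scaled over all observations and timestamps at once, its statistics are shared across time rather than computed per timestamp.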

Added tests for the new functions, including group functionality and NaN cases.

Examples:

import ehrapy as ep
import ehrdata as ed

edata = ed.dt.ehrdata_blobs(
    n_observations=100,
    base_timepoints=24,
    cluster_std=0.5,
    n_centers=3,
    seasonality=True,
    time_shifts=True,
    variable_length=False,
)

# standard scaling
ep.pp.scale_norm(edata)

# log transformation
ep.pp.offset_negative_values(edata)
ep.pp.log_norm(edata)
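
Why the offset step comes before the log transformation can be illustrated with plain NumPy. This sketch does not call ehrapy. It assumes that the offset shifts the data so that the lowest negative value becomes 0 and that the log transform is a natural log with an offset of 1.

```python
import numpy as np

values = np.array([[-3.0, 0.5], [1.0, 2.0], [4.0, -1.5]])

# Assumed offset behavior: shift everything so the lowest negative value becomes 0
shifted = values - np.minimum(values.min(), 0.0)

# Assumed log transform: natural log with an offset of 1, i.e. log(x + 1)
logged = np.log1p(shifted)
```

Without the shift, entries below -1 would have no real logarithm and would come out as NaN.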

@agerardy agerardy linked an issue Oct 7, 2025 that may be closed by this pull request
14 tasks
if group_key is None:
    var_values = scale_func(var_values)

if hasattr(edata, "R") and edata.R is not None and edata.R.ndim == 3:
Collaborator

If the edata object has R, then it will be used regardless of what the layer argument specifies.

This line is a good place to start investigating: set a breakpoint here with a debugger to see which branch of the if/else statement the code actually enters, and whether it matches what you think it should :)

Member

@agerardy this hasn't been addressed, right? I mean the first part of Eljas' comment.

Collaborator Author

I've moved this logic to its own function, _get_target_layer, and changed it to check the layer argument first.

    if layer is None:
        if hasattr(edata, "R") and edata.R is not None:
            return edata.R, "R"
        else:
            return edata.X, "X"
    else:
        return edata.layers[layer], layer

I hope this is how it needs to work.
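
The selection order discussed in this thread can be checked in isolation. The following is a self-contained sketch rather than the PR's exact code: `get_target_layer` is a simplified, hypothetical version of the helper, and a `SimpleNamespace` stands in for a real EHRData object.

```python
from types import SimpleNamespace


def get_target_layer(edata, layer=None):
    """Simplified sketch: an explicit layer wins, then R, then X."""
    if layer is not None:
        # An explicit layer argument takes precedence, even when R exists
        return edata.layers[layer], layer
    if getattr(edata, "R", None) is not None:
        return edata.R, "R"
    return edata.X, "X"


# Minimal stand-in objects; the string values only mark which array was picked
with_r = SimpleNamespace(X="x-data", R="r-data", layers={"raw": "raw-data"})
without_r = SimpleNamespace(X="x-data", R=None, layers={})

picked_layer = get_target_layer(with_r, layer="raw")
picked_default = get_target_layer(with_r)
picked_fallback = get_target_layer(without_r)
```

With this ordering, R is only used when no layer is given, which addresses the concern that R was overriding the layer argument.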

@Zethson Zethson mentioned this pull request Oct 16, 2025
@agerardy agerardy marked this pull request as ready for review October 20, 2025 10:16
@agerardy agerardy requested a review from Zethson October 20, 2025 10:17
Member

@Zethson Zethson left a comment

Thank you! Already looks pretty good.

  1. Many of my comments are repetitive so I stopped repeating them after some time 😄
  2. Many of your tests have tons of useless comments. Let the code speak for itself and clean up any LLM leftovers, please.
  3. Please also follow the comments that I make in Öyku's PRs. One of them is to improve the PR description and add some usage examples.

Just a first quick pass. I'll let @eroell have a go and then I might have a look again.

Thanks!

>>> edata = ed.dt.mimic_2()
>>> edata_norm = ep.pp.scale_norm(edata, copy=True)
>>> # Works automatically with both 2D and 3D data
>>> edata_3d_norm = ep.pp.scale_norm(edata_3d, copy=True)
Member

Let's keep it simple and not distinguish between 2D and 3D. We should rather finally have a proper 3D test dataset, @eroell.

Collaborator

Gathering a few comments from other places here, to keep them less dispersed:
a) Showing some output is very helpful, see e.g. here.
b) One test of either 2D or 3D is enough. Can you use a 3D one here, ideally the physionet2012 dataset?
c) That way, the edata_3d variable would never need to be introduced.

Collaborator Author

@agerardy agerardy Oct 21, 2025

I'll just reply to this one and mark the duplicates resolved :) I've written the examples with fake numbers for now, but will try to get them actually running with the physionet2012 dataset. It certainly looks better but doesn't work yet.

assert np.array_equal(expected_adata.X, ep.pp.log_norm(to_normalize_adata, copy=True).X)


def test_scale_norm_3d(edata_blob_small_3d):
Collaborator

Every changed function should be tested. Do you prefer to write one classic example here and expand it to every other normalization function once we have iterated, or to create them all now? Both options are OK.

Collaborator Author

I'm not sure what you mean by this. I have written a 3D test function for every normalization function?

Development

Successfully merging this pull request may close these issues.

Longitudinal normalization

4 participants