
Add grouped_stats for categorical or continuous binning #774

Open
rhugonnet wants to merge 2 commits into GlacioHack:main from rhugonnet:add_grouped_stats

@rhugonnet
Member

rhugonnet commented Nov 29, 2025

This PR opens a discussion on the implementation of binning, following #668, by adding a minimal example using pandas.
As a reminder, see the discussion in that PR. For the justification of the implementation below, see the points I raised at the bottom, in these three comments here.

Here's an example of the function's input and output (values would be passed as a 1D array for point clouds, or as a 2D array flattened with ravel() for rasters):

```python
import numpy as np

# Binning variables (one array per variable) and the values to aggregate
arrays = {"slope": np.random.normal(size=100), "aspect": np.random.normal(size=100)}
values = {"band1": np.random.normal(size=100), "band2": np.random.normal(size=100)}
statistics = ["mean", "std"]
# One entry per binning variable: explicit bin edges, or a number of bins
bins = [np.linspace(-2, 2, 10), 10]

df = _grouped_stats(arrays, bins, values, statistics)
```

```
                                                   band1         band2    
                                                    mean std      mean std
bin_slope        bin_aspect                                               
(-2.001, -1.556] (-2.3529999999999998, -1.925]       NaN NaN       NaN NaN
                 (-1.925, -1.501]                    NaN NaN       NaN NaN
                 (-1.501, -1.078]              -1.121366 NaN  0.936404 NaN
                 (-1.078, -0.654]                    NaN NaN       NaN NaN
                 (-0.654, -0.231]               1.140788 NaN  1.200292 NaN
...                                                  ...  ..       ...  ..
(1.556, 2.0]     (-0.231, 0.192]                0.502442 NaN -0.083028 NaN
                 (0.192, 0.616]                      NaN NaN       NaN NaN
                 (0.616, 1.039]                 2.094497 NaN  1.168949 NaN
                 (1.039, 1.463]                      NaN NaN       NaN NaN
                 (1.463, 1.886]                      NaN NaN       NaN NaN
[90 rows x 4 columns]
```
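
For readers who want to reproduce a table of this shape without the PR branch, here is a minimal sketch of how such a function could be built on pandas. It is an assumption about the general approach (pd.cut() followed by a groupby aggregation), not the PR's actual `_grouped_stats`; the name `grouped_stats_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def grouped_stats_sketch(arrays, bins, values, statistics):
    """Hypothetical stand-in for _grouped_stats: bin with pd.cut, then aggregate."""
    df = pd.DataFrame(values)
    group_cols = []
    for (name, arr), b in zip(arrays.items(), bins):
        col = f"bin_{name}"
        # pd.cut accepts either explicit bin edges or an integer number of bins
        df[col] = pd.cut(arr, bins=b)
        group_cols.append(col)
    # observed=False keeps empty bin combinations, which show up as NaN rows
    return df.groupby(group_cols, observed=False)[list(values)].agg(statistics)


rng = np.random.default_rng(42)
arrays = {"slope": rng.normal(size=100), "aspect": rng.normal(size=100)}
values = {"band1": rng.normal(size=100), "band2": rng.normal(size=100)}
bins = [np.linspace(-2, 2, 10), 10]

df = grouped_stats_sketch(arrays, bins, values, ["mean", "std"])
print(df.shape)
```

With 9 intervals from the explicit edges and 10 integer bins, the result has 90 rows, one per bin combination. Points falling outside the explicit edges are left out of the grouping.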

@rhugonnet
Member Author

For rasters, we would add an option return_masks=True that also computes a 2D mask for each bin combination derived with pd.cut(), and returns them as a dictionary of masks.
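
A rough sketch of what such masks could look like, assuming the bin index of each pixel is taken from the codes of the pd.cut() result and reshaped back to the raster shape. The helper name `bin_masks_sketch`, the raster names and the bin edges are all hypothetical; the PR may structure this differently.

```python
import itertools

import numpy as np
import pandas as pd


def bin_masks_sketch(rasters, bins):
    """Hypothetical helper: one boolean 2D mask per combination of bins."""
    names = list(rasters)
    shape = rasters[names[0]].shape
    codes, intervals = {}, {}
    for name, b in zip(names, bins):
        cat = pd.cut(rasters[name].ravel(), bins=b)
        # codes hold the bin index of each pixel (-1 when outside every bin)
        codes[name] = np.asarray(cat.codes).reshape(shape)
        intervals[name] = list(cat.categories)
    masks = {}
    for combo in itertools.product(*(range(len(intervals[n])) for n in names)):
        mask = np.ones(shape, dtype=bool)
        for name, i in zip(names, combo):
            mask &= codes[name] == i
        # Key each mask by the tuple of intervals it corresponds to
        masks[tuple(intervals[name][i] for name, i in zip(names, combo))] = mask
    return masks


rng = np.random.default_rng(0)
rasters = {
    "slope": rng.uniform(0, 40, size=(4, 5)),
    "aspect": rng.uniform(0, 360, size=(4, 5)),
}
masks = bin_masks_sketch(rasters, [[0, 10, 20, 40], [0, 180, 360]])
print(len(masks))
```

Each mask has the shape of the input rasters, and the masks are mutually exclusive, so they could be used directly to index the value rasters.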

@rhugonnet
Member Author

@belletva @adebardo @adehecq

@adebardo
Contributor

  • If we want to add the ability to perform mask-based classification (for example, land cover), do you think we would need a new function?
  • And, in that case, could we have a pandas-based function for the different processing steps?
  • And if I want to combine information between masks and binning, should that also be added in this module?

@rhugonnet
Member Author

rhugonnet commented Dec 15, 2025

For a classification (categorical binning), the exact same function works; we just need to set the number of bins equal to the number of categories if we want to keep them all separate (this can be the default):

```python
import numpy as np

# HERE: Binning arrays are now categorical
# (np.random.random_integers is deprecated; randint excludes the upper bound, hence 11 and 21)
arrays = {
    "classif1": np.random.randint(0, 11, size=100),
    "classif2": np.random.randint(10, 21, size=100),
}
values = {"band1": np.random.normal(size=100), "band2": np.random.normal(size=100)}
statistics = ["mean", "std"]

# HERE: Enforce 10 bins per variable
bins = [10, 10]

df = _grouped_stats(arrays, bins, values, statistics)
df
```

```
                                band1               band2          
                                 mean       std      mean       std
bin_classif1  bin_classif2                                         
(-0.011, 1.0] (9.989, 11.0]  0.087436  1.723041 -0.835955  1.061464
              (11.0, 12.0]  -0.301942  1.084940 -0.527191  1.107538
              (12.0, 13.0]   1.185885       NaN  1.460575       NaN
              (13.0, 14.0]   0.713294  0.948106  0.073826  0.609808
              (14.0, 15.0]   0.817307       NaN -0.717526       NaN
...                               ...       ...       ...       ...
(9.0, 10.0]   (15.0, 16.0]        NaN       NaN       NaN       NaN
              (16.0, 17.0]        NaN       NaN       NaN       NaN
              (17.0, 18.0]   0.238242  0.557863  0.488529  1.221224
              (18.0, 19.0]  -0.912399       NaN  2.245666       NaN
              (19.0, 20.0]  -0.517997       NaN  0.346709       NaN
[100 rows x 4 columns]
```

If we want, we can also overwrite the output index to show only the center value of each category instead of an interval (it looks like the first bin is shifted slightly, by about 0.01, by default).
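
One possible way to do this, sketched under the assumption that the categories are integers: pass explicit edges centered on each category, then replace the interval labels by their midpoints. The pandas documentation notes that with an integer `bins` the range is extended by 0.1% to include the extreme values, which is the likely source of the small shift; explicit edges avoid it. The variable names below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "classif": rng.integers(0, 11, size=200),  # categories 0..10 inclusive
    "band1": rng.normal(size=200),
})

# Edges at -0.5, 0.5, ..., 10.5 give exactly one bin per integer category
edges = np.arange(0, 12) - 0.5
df["bin_classif"] = pd.cut(df["classif"], bins=edges)

stats = df.groupby("bin_classif", observed=False)["band1"].agg(["mean", "std"])

# Replace each interval label by its center value, i.e. the category itself
stats.index = pd.Index([interval.mid for interval in stats.index], name="classif")
print(stats.index.tolist())
```

With half-integer edges the midpoints coincide with the category values, so the index reads as the classes themselves rather than as intervals.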

I'm not sure I understand "combine information between masks and binning": do you mean doing both simultaneously? If so, then yes, we can have any number of categorical and continuous variables binned simultaneously, following the above. 🙂
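
As an illustration of binning a categorical mask and a continuous variable together, here is a small pandas-only sketch. The land-cover classes, slope edges and column names are hypothetical, and it bypasses `_grouped_stats` entirely to stay self-contained.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "landcover": rng.integers(0, 3, size=n),  # hypothetical class mask: 0, 1, 2
    "slope": rng.uniform(0, 40, size=n),      # continuous variable
    "band1": rng.normal(size=n),
})

# Categorical variable: use the classes directly as groups
df["bin_landcover"] = pd.Categorical(df["landcover"], categories=[0, 1, 2])
# Continuous variable: cut into intervals
df["bin_slope"] = pd.cut(df["slope"], bins=[0, 10, 20, 40])

stats = df.groupby(["bin_landcover", "bin_slope"], observed=False)["band1"].agg(["mean", "std"])
print(stats.shape)
```

The result has one row per (class, slope interval) pair, here 3 x 3 = 9 rows, in the same layout as the tables above.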
