Conversation
Pull request overview
Adds an optional “on-the-fly” proximity computation path that calculates join-count expected values analytically from the edgelist (DuckDB) rather than relying on precomputed proximity tables, and introduces a test to validate the analytic results against the existing JCS proximity reference data.
Changes:
- Add a `calculate_from_edgelist` option to `PNAPixelDataset.proximity()` / `Proximity` to compute proximity directly from the DuckDB `edgelist`.
- Implement a `jcs_with_analytical_stats()` DuckDB query to produce proximity metrics with analytical expected mean/SD logic.
- Add a new test comparing analytic proximity outputs to the precomputed JCS proximity dataset via correlation checks.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `src/pixelator/pna/analysis/proximity.py` | Adds DuckDB-backed analytic proximity query (`jcs_with_analytical_stats`). |
| `src/pixelator/pna/pixeldataset/__init__.py` | Exposes the new analytic path via `calculate_from_edgelist` and routes `Proximity.to_polars()` accordingly. |
| `tests/pna/analysis/test_analysis.py` | Adds a test validating analytic proximity outputs against stored JCS proximity reference data. |
```diff
 def to_df(self) -> pd.DataFrame:
     """Get the edgelist as a pandas DataFrame."""
     return self.to_polars().to_pandas()

 def to_polars(self) -> pl.DataFrame:
     """Get the edgelist as a polars DataFrame."""
-    return self._post_process(
-        self._querier.read_proximity(
+    if not self._calculate_from_edgelist:
```
Proximity.to_polars() / to_df() docstrings say “Get the edgelist…”, but these methods return proximity data. Since this method now has two execution paths, having accurate docstrings will help avoid confusion for API users.
Yeah. We need to update the docstrings here. I think they probably suffer from some copy-pasting I have done at some point when building the basis here...
```diff
 [
     f"Proximity({n_proximity:,} elements",
     f"add_marker_counts={self._add_marker_counts}",
-    f"add_logratio={self._add_log2_ratio_col})",
+    f"add_logratio={self._add_log2_ratio_col}",
+    f"calculate_from_edgelist={self._calculate_from_edgelist}",
+    ")",
 ]
```
`__str__()` now builds the closing `)` as a separate element in the list passed to `", ".join(...)`, which results in an extra comma before the closing parenthesis (e.g., `..., calculate_from_edgelist=False, )`). Consider formatting the final string without introducing that trailing separator.
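One way to avoid the stray separator is to join only the `key=value` fields and append the closing parenthesis outside the join. A minimal sketch (the class and fields are illustrative stand-ins, not the actual `Proximity` implementation):

```python
class ProximitySketch:
    """Illustrative stand-in for Proximity's __str__ formatting."""

    def __init__(self) -> None:
        self._add_marker_counts = False
        self._calculate_from_edgelist = False

    def __str__(self) -> str:
        fields = [
            f"add_marker_counts={self._add_marker_counts}",
            f"calculate_from_edgelist={self._calculate_from_edgelist}",
        ]
        # Join only the fields; ")" is appended after the join,
        # so no ", )" can appear in the output.
        return f"Proximity(3 elements, {', '.join(fields)})"
```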
```sql
COALESCE(obs.join_count, 0) as join_count,
COALESCE(exp.join_count_expected_mean, 0) as join_count_expected_mean,
LOG2(GREATEST(COALESCE(obs.join_count, 0), 1) / GREATEST(COALESCE(exp.join_count_expected_mean, 0), 1)) AS log2_ratio,
(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z,
```
The generated SQL has a trailing comma after AS join_count_z, which will make the query fail to execute (syntax error before FROM). Remove the trailing comma in the SELECT list.
```diff
-(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z,
+(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z
```
```python
if isinstance(components, str):
    components = [components]
if isinstance(markers, str):
    markers = [markers]

if markers is not None:
    marker_list = ", ".join(f"'{m}'" for m in markers) if markers else ""
    where_marker1 = f"WHERE marker_1 IN ({marker_list})"
    and_marker1 = f"AND marker_1 IN ({marker_list})"
    and_marker2 = f"AND marker_2 IN ({marker_list})"
else:
    where_marker1 = ""
    and_marker1 = ""
    and_marker2 = ""

if components is not None:
    component_list = ", ".join(f"'{c}'" for c in components) if components else ""
    where_component = f"WHERE component IN ({component_list})"
else:
```
markers/components being an empty set/list currently produces SQL like IN (), which is invalid. This can happen because normalize_input_to_set([]) returns an empty set (and the rest of the codebase treats that as “no filter”). Consider normalizing empty collections to None (or generating TRUE conditions) before building the query.
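A hedged sketch of the normalization the comment suggests, collapsing empty collections to "no filter" before any SQL is built (the helper name is hypothetical, not the codebase's `normalize_input_to_set`):

```python
def normalize_filter(values):
    """Collapse None and empty collections to None, meaning 'no filter'.

    This keeps the SQL-building step from ever seeing an empty
    collection, which would otherwise render as the invalid `IN ()`.
    """
    if isinstance(values, str):
        values = [values]
    if not values:  # None, empty list, empty set, ...
        return None
    return set(values)
```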
```python
if markers is not None:
    marker_list = ", ".join(f"'{m}'" for m in markers) if markers else ""
    where_marker1 = f"WHERE marker_1 IN ({marker_list})"
    and_marker1 = f"AND marker_1 IN ({marker_list})"
    and_marker2 = f"AND marker_2 IN ({marker_list})"
```
The SQL is built via string interpolation of markers/components into quoted literals. To avoid quoting edge cases (e.g., names containing ') and reduce injection risk, prefer DuckDB parameter binding (e.g., marker_1 IN $markers) similar to PixelDataQuerier.read_proximity().
I think copilot is right here. And it is probably something we should go over the code and look for in general.
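One possible shape for this is to build the condition and its bound parameters together instead of interpolating quoted literals, assuming the query is then executed with DuckDB named parameters (e.g. `con.execute(sql, params)`). The helper below is a hypothetical sketch, not the codebase's API:

```python
def marker_filter(markers):
    """Return (condition_sql, params) for a marker filter.

    The marker values never enter the SQL text; they travel in the
    parameter dict, so quoting edge cases (names containing ') and
    injection are handled by the driver rather than by string escaping.
    """
    if not markers:
        return "TRUE", {}
    return (
        "marker_1 IN (SELECT unnest($markers))",
        {"markers": sorted(markers)},
    )
```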
```python
component_list_sql = ", ".join(f"'{c}'" for c in components) if components else ""
```
component_list_sql is computed but never used. Removing it would reduce noise and avoid confusion about intended query structure.
```diff
-component_list_sql = ", ".join(f"'{c}'" for c in components) if components else ""
```
```sql
COALESCE(obs.marker_B, exp.marker_B) as marker_2,
COALESCE(obs.join_count, 0) as join_count,
COALESCE(exp.join_count_expected_mean, 0) as join_count_expected_mean,
LOG2(GREATEST(COALESCE(obs.join_count, 0), 1) / GREATEST(COALESCE(exp.join_count_expected_mean, 0), 1)) AS log2_ratio,
(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z,
```
jcs_with_analytical_stats() currently doesn’t return join_count_expected_sd (available in expected_agg) or a p-value column (the precomputed proximity table includes join_count_pvalue). If calculate_from_edgelist=True is meant to be a drop-in alternative, consider returning these columns too (e.g., expose the analytical SD and compute a two-sided p-value from join_count_z).
This is a valid comment.
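For the p-value part, a two-sided p-value can be derived from the z-score with nothing beyond the standard library, assuming a normal approximation is acceptable (which the PR would need to confirm):

```python
from math import erfc, sqrt

def two_sided_pvalue(z: float) -> float:
    """Two-sided p-value under a standard normal: P(|Z| >= |z|).

    erfc is the complementary error function; erfc(|z| / sqrt(2))
    equals the total two-sided tail mass of N(0, 1) beyond |z|.
    """
    return erfc(abs(z) / sqrt(2))
```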
```python
    calculate_from_edgelist: bool = False,
):
    """Create a new instance of Proximity."""
    self._querier = querier
    self._components = normalize_input_to_set(components)
    self._markers = normalize_input_to_set(markers)
    self._add_marker_counts = add_marker_counts
    self._add_log2_ratio_col = add_log2_ratio
    self._calculate_from_edgelist = calculate_from_edgelist
```
With calculate_from_edgelist=True, __len__() / is_empty() and the string representation will still be driven by read_proximity_len() (the precomputed table). If the proximity table is absent, len() will be 0 even though to_polars() can produce results from the edgelist. Consider adjusting these methods to reflect the edgelist-based mode (or clearly documenting that length is unavailable/expensive in that mode).
This is a valid comment.
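A sketch of how `__len__()` / `is_empty()` could follow the active mode. This is a stub with injected values, not the real class: `precomputed_len` stands in for `read_proximity_len()`, and `analytic_rows` for the height of the edgelist-derived result:

```python
class ProximityLenSketch:
    """Stub illustrating mode-aware length reporting."""

    def __init__(self, calculate_from_edgelist: bool,
                 precomputed_len: int, analytic_rows: int) -> None:
        self._calculate_from_edgelist = calculate_from_edgelist
        self._precomputed_len = precomputed_len
        self._analytic_rows = analytic_rows

    def __len__(self) -> int:
        if self._calculate_from_edgelist:
            # In the real class this branch would materialize the analytic
            # query, which may be expensive; documenting that cost is the
            # alternative the comment mentions.
            return self._analytic_rows
        return self._precomputed_len

    def is_empty(self) -> bool:
        return len(self) == 0
```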
johandahlberg left a comment
Great work on this! I think there are some things, mostly around clarity and API compatibility, that we need to address before merging this.
```python
if isinstance(markers, str):
    markers = [markers]

if markers is not None:
```
I have a slight preference for using something like the following for cases like the one we have here:

```python
set_markers = markers is not None
where_marker1 = f"WHERE marker_1 IN ({marker_list})" if set_markers else ""
and_marker1 = f"AND marker_1 IN ({marker_list})" if set_markers else ""
```

Since the assignment reads on a single line, instead of having to go to another place to look for the alternative value. What do you think?
```python
    and_marker1 = ""
    and_marker2 = ""

if components is not None:
```
I think perhaps this would read more fluently:

```python
component_list = ", ".join(f"'{c}'" for c in components) if components else ""
where_component = "TRUE" if components is None else f"WHERE component IN ({component_list})"
```
```python
component_list_sql = ", ".join(f"'{c}'" for c in components) if components else ""

cte_group_edges = f"""
```
I have two comments around this section.

Firstly, I think (maybe a bit paradoxically) that this would be more readable (at least to me) if we put all of it into one query string, rather than building it from multiple string variables.

Secondly, I think that we only need to filter for components once if we do something like:

```sql
current_edgelist AS (
    SELECT * FROM edgelist WHERE {where_component}
),
group_edges AS (
    SELECT sample, component, COUNT(*) as n_edges
    FROM current_edgelist
    GROUP BY sample, component
)
-- ... etc
```

This should simplify the rest of the SQL expression.
Proximity scores can be computed relatively quickly by using analytical values for the expected join counts instead of permutations. This PR brings that in as an optional way to compute proximity scores.
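For intuition, one common analytic form for the expected join count treats each edge's endpoints as independently labeled with each marker's frequency. This is an illustrative assumption only; the model the PR's SQL actually implements may differ:

```python
def expected_join_count(n_edges: int, p_a: float, p_b: float,
                        same_marker: bool) -> float:
    """Expected number of A-B edges if each endpoint carries marker A
    with probability p_a and marker B with p_b, independently per edge.

    For two distinct markers the pair can occur in either orientation
    (A-B or B-A), hence the factor of 2.
    """
    if same_marker:
        return n_edges * p_a * p_b
    return 2.0 * n_edges * p_a * p_b
```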
Fixes: PNA-2031
Type of change
How Has This Been Tested?
WIP
PR checklist:
`pyproject.toml` and cited it properly