Skip to content

Enable on-the-fly proximity score calculation#324

Open
ptajvar wants to merge 5 commits intodevfrom
feature/PNA-2301-enable-on-the-fly-proximity-score-calculation
Open

Enable on-the-fly proximity score calculation#324
ptajvar wants to merge 5 commits intodevfrom
feature/PNA-2301-enable-on-the-fly-proximity-score-calculation

Conversation

@ptajvar
Copy link
Contributor

@ptajvar ptajvar commented Mar 19, 2026

Proximity scores can be computed relatively fast using analytical values for expected joint counts instead of permutations. We are bringing it in this PR as an optional way to compute proximity scores.

Fixes: PNA-2031

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

WIP

PR checklist:

  • This comment contains a description of changes (with reason).
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • If a new tool or package is included, I have updated dependencies in pyproject.toml and cited it properly
  • I have checked my code and documentation and corrected any misspellings
  • I have documented any significant changes to the code in CHANGELOG.md

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optional “on-the-fly” proximity computation path that calculates join-count expected values analytically from the edgelist (DuckDB) rather than relying on precomputed proximity tables, and introduces a test to validate the analytic results against the existing JCS proximity reference data.

Changes:

  • Add calculate_from_edgelist option to PNAPixelDataset.proximity() / Proximity to compute proximity directly from the DuckDB edgelist.
  • Implement jcs_with_analytical_stats() DuckDB query to produce proximity metrics with analytical expected mean/SD logic.
  • Add a new test comparing analytic proximity outputs to the precomputed JCS proximity dataset via correlation checks.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
src/pixelator/pna/analysis/proximity.py Adds DuckDB-backed analytic proximity query (jcs_with_analytical_stats).
src/pixelator/pna/pixeldataset/__init__.py Exposes the new analytic path via calculate_from_edgelist and routes Proximity.to_polars() accordingly.
tests/pna/analysis/test_analysis.py Adds a test validating analytic proximity outputs against stored JCS proximity reference data.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 286 to +292
def to_df(self) -> pd.DataFrame:
"""Get the edgelist as a pandas DataFrame."""
return self.to_polars().to_pandas()

def to_polars(self) -> pl.DataFrame:
"""Get the edgelist as a polars DataFrame."""
return self._post_process(
self._querier.read_proximity(
if not self._calculate_from_edgelist:
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proximity.to_polars() / to_df() docstrings say “Get the edgelist…”, but these methods return proximity data. Since this method now has two execution paths, having accurate docstrings will help avoid confusion for API users.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. We need to update the docstrings here. I think they probably suffer from some copy-pasting I have done at some point when building the basis here...

Comment on lines 308 to 314
[
f"Proximity({n_proximity:,} elements",
f"add_marker_counts={self._add_marker_counts}",
f"add_logratio={self._add_log2_ratio_col})",
f"add_logratio={self._add_log2_ratio_col}",
f"calculate_from_edgelist={self._calculate_from_edgelist}",
")",
]
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__str__() now builds the closing ) as a separate element in the list passed to ", ".join(...), which results in an extra comma before the closing parenthesis (e.g., ..., calculate_from_edgelist=False, )). Consider formatting the final string without introducing that trailing separator.

Copilot uses AI. Check for mistakes.
COALESCE(obs.join_count, 0) as join_count,
COALESCE(exp.join_count_expected_mean, 0) as join_count_expected_mean,
LOG2(GREATEST(COALESCE(obs.join_count, 0), 1) / GREATEST(COALESCE(exp.join_count_expected_mean, 0), 1)) AS log2_ratio,
(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z,
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated SQL has a trailing comma after AS join_count_z, which will make the query fail to execute (syntax error before FROM). Remove the trailing comma in the SELECT list.

Suggested change
(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z,
(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z

Copilot uses AI. Check for mistakes.
Comment on lines +338 to +356
if isinstance(components, str):
components = [components]
if isinstance(markers, str):
markers = [markers]

if markers is not None:
marker_list = ", ".join(f"'{m}'" for m in markers) if markers else ""
where_marker1 = f"WHERE marker_1 IN ({marker_list})"
and_marker1 = f"AND marker_1 IN ({marker_list})"
and_marker2 = f"AND marker_2 IN ({marker_list})"
else:
where_marker1 = ""
and_marker1 = ""
and_marker2 = ""

if components is not None:
component_list = ", ".join(f"'{c}'" for c in components) if components else ""
where_component = f"WHERE component IN ({component_list})"
else:
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

markers/components being an empty set/list currently produces SQL like IN (), which is invalid. This can happen because normalize_input_to_set([]) returns an empty set (and the rest of the codebase treats that as “no filter”). Consider normalizing empty collections to None (or generating TRUE conditions) before building the query.

Copilot uses AI. Check for mistakes.
Comment on lines +343 to +347
if markers is not None:
marker_list = ", ".join(f"'{m}'" for m in markers) if markers else ""
where_marker1 = f"WHERE marker_1 IN ({marker_list})"
and_marker1 = f"AND marker_1 IN ({marker_list})"
and_marker2 = f"AND marker_2 IN ({marker_list})"
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SQL is built via string interpolation of markers/components into quoted literals. To avoid quoting edge cases (e.g., names containing ') and reduce injection risk, prefer DuckDB parameter binding (e.g., marker_1 IN $markers) similar to PixelDataQuerier.read_proximity().

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think copilot is right here. And it is probably something we should go over the code and look for in general.

Comment on lines +359 to +360
component_list_sql = ", ".join(f"'{c}'" for c in components) if components else ""

Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

component_list_sql is computed but never used. Removing it would reduce noise and avoid confusion about intended query structure.

Suggested change
component_list_sql = ", ".join(f"'{c}'" for c in components) if components else ""

Copilot uses AI. Check for mistakes.
Comment on lines +474 to +478
COALESCE(obs.marker_B, exp.marker_B) as marker_2,
COALESCE(obs.join_count, 0) as join_count,
COALESCE(exp.join_count_expected_mean, 0) as join_count_expected_mean,
LOG2(GREATEST(COALESCE(obs.join_count, 0), 1) / GREATEST(COALESCE(exp.join_count_expected_mean, 0), 1)) AS log2_ratio,
(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z,
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jcs_with_analytical_stats() currently doesn’t return join_count_expected_sd (available in expected_agg) or a p-value column (the precomputed proximity table includes join_count_pvalue). If calculate_from_edgelist=True is meant to be a drop-in alternative, consider returning these columns too (e.g., expose the analytical SD and compute a two-sided p-value from join_count_z).

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valid comment.

Comment on lines +175 to +183
calculate_from_edgelist: bool = False,
):
"""Create a new instance of Proximity."""
self._querier = querier
self._components = normalize_input_to_set(components)
self._markers = normalize_input_to_set(markers)
self._add_marker_counts = add_marker_counts
self._add_log2_ratio_col = add_log2_ratio
self._calculate_from_edgelist = calculate_from_edgelist
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With calculate_from_edgelist=True, __len__() / is_empty() and the string representation will still be driven by read_proximity_len() (the precomputed table). If the proximity table is absent, len() will be 0 even though to_polars() can produce results from the edgelist. Consider adjusting these methods to reflect the edgelist-based mode (or clearly documenting that length is unavailable/expensive in that mode).

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valid comment.

Copy link
Contributor

@johandahlberg johandahlberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on this! I think there there are some things, mostly around clarity and API compatibility that we need to address before merging this.

Comment on lines +343 to +347
if markers is not None:
marker_list = ", ".join(f"'{m}'" for m in markers) if markers else ""
where_marker1 = f"WHERE marker_1 IN ({marker_list})"
and_marker1 = f"AND marker_1 IN ({marker_list})"
and_marker2 = f"AND marker_2 IN ({marker_list})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think copilot is right here. And it is probably something we should go over the code and look for in general.

if isinstance(markers, str):
markers = [markers]

if markers is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a slight preference for using something like below for cases like we have here.

set_markers = markers is not None
where_marker1 = f"WHERE marker_1 IN ({marker_list})" if set_markers else ""
and_marker1 = f"AND marker_1 IN ({marker_list})" if set_markers else ""

Since the assignment reads on a single line, instead of having to go to another place to look for the alternative value. What do you think?

and_marker1 = ""
and_marker2 = ""

if components is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think perhaps this would read more fluently:

component_list = ", ".join(f"'{c}'" for c in components) if components else ""
where_component = "TRUE" if component is None else f"WHERE component IN ({component_list})"


component_list_sql = ", ".join(f"'{c}'" for c in components) if components else ""

cte_group_edges = f"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two comments around this section.

Firstly, I think (maybe a bit paradoxically) that this would be more readable (at least to me) if we put all of it into one query string, rather than building it from multiple string variables.

Secondly, I think that we only need to filter for components once if we do something like:

       current_edgelist AS (
            SELECT * FROM edgelist WHERE {where_component}
        ),
        group_edges AS (
            SELECT sample, component, COUNT(*) as n_edges 
            FROM current_edgelist
            GROUP BY sample, component
        )
... etc

This should simplify the rest of the SQL expression.

Comment on lines +474 to +478
COALESCE(obs.marker_B, exp.marker_B) as marker_2,
COALESCE(obs.join_count, 0) as join_count,
COALESCE(exp.join_count_expected_mean, 0) as join_count_expected_mean,
LOG2(GREATEST(COALESCE(obs.join_count, 0), 1) / GREATEST(COALESCE(exp.join_count_expected_mean, 0), 1)) AS log2_ratio,
(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valid comment.

Comment on lines +175 to +183
calculate_from_edgelist: bool = False,
):
"""Create a new instance of Proximity."""
self._querier = querier
self._components = normalize_input_to_set(components)
self._markers = normalize_input_to_set(markers)
self._add_marker_counts = add_marker_counts
self._add_log2_ratio_col = add_log2_ratio
self._calculate_from_edgelist = calculate_from_edgelist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valid comment.

Comment on lines 286 to +292
def to_df(self) -> pd.DataFrame:
"""Get the edgelist as a pandas DataFrame."""
return self.to_polars().to_pandas()

def to_polars(self) -> pl.DataFrame:
"""Get the edgelist as a polars DataFrame."""
return self._post_process(
self._querier.read_proximity(
if not self._calculate_from_edgelist:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. We need to update the docstrings here. I think they probably suffer from some copy-pasting I have done at some point when building the basis here...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants