Conversation
Pull request overview
Adds an optional “on-the-fly” proximity computation path that calculates join-count expected values analytically from the edgelist (DuckDB) rather than relying on precomputed proximity tables, and introduces a test to validate the analytic results against the existing JCS proximity reference data.
Changes:
- Add a `calculate_from_edgelist` option to `PNAPixelDataset.proximity()` / `Proximity` to compute proximity directly from the DuckDB `edgelist`.
- Implement a `jcs_with_analytical_stats()` DuckDB query to produce proximity metrics with analytical expected mean/SD logic.
- Add a new test comparing analytic proximity outputs to the precomputed JCS proximity dataset via correlation checks.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `src/pixelator/pna/analysis/proximity.py` | Adds DuckDB-backed analytic proximity query (`jcs_with_analytical_stats`). |
| `src/pixelator/pna/pixeldataset/__init__.py` | Exposes the new analytic path via `calculate_from_edgelist` and routes `Proximity.to_polars()` accordingly. |
| `tests/pna/analysis/test_analysis.py` | Adds a test validating analytic proximity outputs against stored JCS proximity reference data. |
```diff
 def to_df(self) -> pd.DataFrame:
     """Get the edgelist as a pandas DataFrame."""
     return self.to_polars().to_pandas()

 def to_polars(self) -> pl.DataFrame:
     """Get the edgelist as a polars DataFrame."""
-    return self._post_process(
-        self._querier.read_proximity(
+    if not self._calculate_from_edgelist:
```
Proximity.to_polars() / to_df() docstrings say “Get the edgelist…”, but these methods return proximity data. Since this method now has two execution paths, having accurate docstrings will help avoid confusion for API users.
Yeah. We need to update the docstrings here. I think they probably suffer from some copy-pasting I have done at some point when building the basis here...
```diff
 [
     f"Proximity({n_proximity:,} elements",
     f"add_marker_counts={self._add_marker_counts}",
-    f"add_logratio={self._add_log2_ratio_col})",
+    f"add_logratio={self._add_log2_ratio_col}",
+    f"calculate_from_edgelist={self._calculate_from_edgelist}",
+    ")",
 ]
```
`__str__()` now builds the closing `)` as a separate element in the list passed to `", ".join(...)`, which results in an extra comma before the closing parenthesis (e.g., `..., calculate_from_edgelist=False, )`). Consider formatting the final string without introducing that trailing separator.
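One way to avoid the stray separator is to join only the `key=value` fields and append the closing parenthesis outside the join. A minimal sketch (the class and fields are illustrative stand-ins, not the actual `Proximity` implementation):

```python
class ProximitySketch:
    """Illustrative stand-in for Proximity's __str__ formatting."""

    def __init__(self) -> None:
        self._add_marker_counts = False
        self._calculate_from_edgelist = False

    def __str__(self) -> str:
        fields = [
            f"add_marker_counts={self._add_marker_counts}",
            f"calculate_from_edgelist={self._calculate_from_edgelist}",
        ]
        # Join only the fields; ")" is appended after the join,
        # so no ", )" can appear in the output.
        return f"Proximity(3 elements, {', '.join(fields)})"
```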
```sql
COALESCE(obs.join_count, 0) as join_count,
COALESCE(exp.join_count_expected_mean, 0) as join_count_expected_mean,
LOG2(GREATEST(COALESCE(obs.join_count, 0), 1) / GREATEST(COALESCE(exp.join_count_expected_mean, 0), 1)) AS log2_ratio,
(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z,
```
The generated SQL has a trailing comma after AS join_count_z, which will make the query fail to execute (syntax error before FROM). Remove the trailing comma in the SELECT list.
```diff
-(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z,
+(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z
```
```python
if isinstance(components, str):
    components = [components]
if isinstance(markers, str):
    markers = [markers]

if markers is not None:
    marker_list = ", ".join(f"'{m}'" for m in markers) if markers else ""
    where_marker1 = f"WHERE marker_1 IN ({marker_list})"
    and_marker1 = f"AND marker_1 IN ({marker_list})"
    and_marker2 = f"AND marker_2 IN ({marker_list})"
else:
    where_marker1 = ""
    and_marker1 = ""
    and_marker2 = ""

if components is not None:
    component_list = ", ".join(f"'{c}'" for c in components) if components else ""
    where_component = f"WHERE component IN ({component_list})"
else:
```
markers/components being an empty set/list currently produces SQL like IN (), which is invalid. This can happen because normalize_input_to_set([]) returns an empty set (and the rest of the codebase treats that as “no filter”). Consider normalizing empty collections to None (or generating TRUE conditions) before building the query.
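A hedged sketch of the normalization the comment suggests, collapsing empty collections to "no filter" before any SQL is built (the helper name is hypothetical, not the codebase's `normalize_input_to_set`):

```python
def normalize_filter(values):
    """Collapse None and empty collections to None, meaning 'no filter'.

    This keeps the SQL-building step from ever seeing an empty
    collection, which would otherwise render as the invalid `IN ()`.
    """
    if isinstance(values, str):
        values = [values]
    if not values:  # None, empty list, empty set, ...
        return None
    return set(values)
```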
```python
if markers is not None:
    marker_list = ", ".join(f"'{m}'" for m in markers) if markers else ""
    where_marker1 = f"WHERE marker_1 IN ({marker_list})"
    and_marker1 = f"AND marker_1 IN ({marker_list})"
    and_marker2 = f"AND marker_2 IN ({marker_list})"
```
The SQL is built via string interpolation of markers/components into quoted literals. To avoid quoting edge cases (e.g., names containing ') and reduce injection risk, prefer DuckDB parameter binding (e.g., marker_1 IN $markers) similar to PixelDataQuerier.read_proximity().
I think copilot is right here. And it is probably something we should go over the code and look for in general.
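One possible shape for this is to build the condition and its bound parameters together instead of interpolating quoted literals, assuming the query is then executed with DuckDB named parameters (e.g. `con.execute(sql, params)`). The helper below is a hypothetical sketch, not the codebase's API:

```python
def marker_filter(markers):
    """Return (condition_sql, params) for a marker filter.

    The marker values never enter the SQL text; they travel in the
    parameter dict, so quoting edge cases (names containing ') and
    injection are handled by the driver rather than by string escaping.
    """
    if not markers:
        return "TRUE", {}
    return (
        "marker_1 IN (SELECT unnest($markers))",
        {"markers": sorted(markers)},
    )
```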
```python
component_list_sql = ", ".join(f"'{c}'" for c in components) if components else ""
```
component_list_sql is computed but never used. Removing it would reduce noise and avoid confusion about intended query structure.
```diff
-component_list_sql = ", ".join(f"'{c}'" for c in components) if components else ""
```
```sql
COALESCE(obs.marker_B, exp.marker_B) as marker_2,
COALESCE(obs.join_count, 0) as join_count,
COALESCE(exp.join_count_expected_mean, 0) as join_count_expected_mean,
LOG2(GREATEST(COALESCE(obs.join_count, 0), 1) / GREATEST(COALESCE(exp.join_count_expected_mean, 0), 1)) AS log2_ratio,
(COALESCE(obs.join_count, 0) - COALESCE(exp.join_count_expected_mean, 0)) / COALESCE(exp.join_count_expected_sd, 1) AS join_count_z,
```
jcs_with_analytical_stats() currently doesn’t return join_count_expected_sd (available in expected_agg) or a p-value column (the precomputed proximity table includes join_count_pvalue). If calculate_from_edgelist=True is meant to be a drop-in alternative, consider returning these columns too (e.g., expose the analytical SD and compute a two-sided p-value from join_count_z).
This is a valid comment.
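For the p-value part, a two-sided p-value can be derived from the z-score with nothing beyond the standard library, assuming a normal approximation is acceptable (which the PR would need to confirm):

```python
from math import erfc, sqrt

def two_sided_pvalue(z: float) -> float:
    """Two-sided p-value under a standard normal: P(|Z| >= |z|).

    erfc is the complementary error function; erfc(|z| / sqrt(2))
    equals the total two-sided tail mass of N(0, 1) beyond |z|.
    """
    return erfc(abs(z) / sqrt(2))
```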
```python
    calculate_from_edgelist: bool = False,
):
    """Create a new instance of Proximity."""
    self._querier = querier
    self._components = normalize_input_to_set(components)
    self._markers = normalize_input_to_set(markers)
    self._add_marker_counts = add_marker_counts
    self._add_log2_ratio_col = add_log2_ratio
    self._calculate_from_edgelist = calculate_from_edgelist
```
With calculate_from_edgelist=True, __len__() / is_empty() and the string representation will still be driven by read_proximity_len() (the precomputed table). If the proximity table is absent, len() will be 0 even though to_polars() can produce results from the edgelist. Consider adjusting these methods to reflect the edgelist-based mode (or clearly documenting that length is unavailable/expensive in that mode).
This is a valid comment.
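A sketch of how `__len__()` / `is_empty()` could follow the active mode. This is a stub with injected values, not the real class: `precomputed_len` stands in for `read_proximity_len()`, and `analytic_rows` for the height of the edgelist-derived result:

```python
class ProximityLenSketch:
    """Stub illustrating mode-aware length reporting."""

    def __init__(self, calculate_from_edgelist: bool,
                 precomputed_len: int, analytic_rows: int) -> None:
        self._calculate_from_edgelist = calculate_from_edgelist
        self._precomputed_len = precomputed_len
        self._analytic_rows = analytic_rows

    def __len__(self) -> int:
        if self._calculate_from_edgelist:
            # In the real class this branch would materialize the analytic
            # query, which may be expensive; documenting that cost is the
            # alternative the comment mentions.
            return self._analytic_rows
        return self._precomputed_len

    def is_empty(self) -> bool:
        return len(self) == 0
```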
johandahlberg left a comment
Great work on this! I think there are some things, mostly around clarity and API compatibility, that we need to address before merging this.
```python
if isinstance(markers, str):
    markers = [markers]

if markers is not None:
```
I have a slight preference for using something like the following for cases like the one we have here:

```python
set_markers = markers is not None
where_marker1 = f"WHERE marker_1 IN ({marker_list})" if set_markers else ""
and_marker1 = f"AND marker_1 IN ({marker_list})" if set_markers else ""
```

Since the assignment reads on a single line, instead of having to go to another place to look for the alternative value. What do you think?
```python
    and_marker1 = ""
    and_marker2 = ""

if components is not None:
```
I think perhaps this would read more fluently:

```python
component_list = ", ".join(f"'{c}'" for c in components) if components else ""
where_component = "TRUE" if components is None else f"WHERE component IN ({component_list})"
```
```python
component_list_sql = ", ".join(f"'{c}'" for c in components) if components else ""

cte_group_edges = f"""
```
I have two comments around this section.

Firstly, I think (maybe a bit paradoxically) that this would be more readable (at least to me) if we put all of it into one query string, rather than building it from multiple string variables.

Secondly, I think that we only need to filter for components once if we do something like:

```sql
current_edgelist AS (
    SELECT * FROM edgelist WHERE {where_component}
),
group_edges AS (
    SELECT sample, component, COUNT(*) as n_edges
    FROM current_edgelist
    GROUP BY sample, component
)
-- ... etc
```

This should simplify the rest of the SQL expression.
Proximity scores can be computed relatively quickly by using analytical values for the expected join counts instead of permutations. This PR brings that in as an optional way to compute proximity scores.
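For intuition, one common analytic form for the expected join count treats each edge's endpoints as independently labeled with each marker's frequency. This is an illustrative assumption only; the model the PR's SQL actually implements may differ:

```python
def expected_join_count(n_edges: int, p_a: float, p_b: float,
                        same_marker: bool) -> float:
    """Expected number of A-B edges if each endpoint carries marker A
    with probability p_a and marker B with p_b, independently per edge.

    For two distinct markers the pair can occur in either orientation
    (A-B or B-A), hence the factor of 2.
    """
    if same_marker:
        return n_edges * p_a * p_b
    return 2.0 * n_edges * p_a * p_b
```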
Fixes: PNA-2031
Type of change
How Has This Been Tested?
WIP
PR checklist:
`pyproject.toml` and cited it properly