Adding optimization rewrite pass to utilize server with information about masked columns #443

knassre-bodo · 2025-10-09T04:49:56Z

This PR augments relational optimization to rewrite expressions containing an UNMASK operator when a mask server is mounted to the PyDough session and the environment variable PYDOUGH_ENABLE_MASK_REWRITES is set to 1.

When this is the case, two additional shuttles are added to the additional_shuttles list, before the masking literal comparisons shuttle:
The first shuttle, MaskServerCandidateShuttle, is a no-op shuttle that traverses the entire tree to find expressions that can potentially be rewritten and adds them to a pool.
The second shuttle, MaskServerRewriteShuttle, looks for expressions in the candidate shuttle's pool. Once it finds one, it sends every candidate in the pool to the mask server in a single batch request and uses the results to create the new relational node. The candidate pool is then emptied so future invocations do not redo the same batch calculation.


knassre-bodo · 2025-10-15T16:49:35Z

tests/test_mock_mask_server.py

            [
                MaskServerInput(
-                    table_path="srv.db.tbl",
+                    table_path="db.tbl",


The srv. part is added elsewhere

knassre-bodo · 2025-10-16T19:20:42Z

pydough/conversion/mask_server_candidate_visitor.py

+        self.stack.clear()
+
+    def visit_call_expression(self, expr: CallExpression) -> None:
+        # TODO: ADD COMMENTS


Suggested change

# TODO: ADD COMMENTS

john-sanchez31

Comments regarding IN and ISIN operators and a type hint

john-sanchez31 · 2025-10-17T14:54:37Z

pydough/conversion/mask_server_candidate_visitor.py

+    mapping each such operator to the string name used in the linear string
+    serialization format recognized by the Mask Server.
+
+    Note: ISIN is handled separately.


Are these the operators used in the mock server? If so, we should add IN and NOT_IN (can be found in the lookup table)

These are the operators used in the real server (which the mock server should emulate). The point isn't to include all of their operators (e.g. we don't do regex); it's to include all of the mappings from our operators to theirs. ISIN is handled separately from this mapping, and we don't currently use NOT_ISIN at all: we just do ISIN and sometimes wrap the result in a NOT call. There is no operator in PyDough that maps to NOT_ISIN.

pydough/conversion/mask_server_candidate_visitor.py

pydough/mask_server/mask_server_candidate_visitor.py


juankx-bodo

I'll leave the approval to @hadia206 and @john-sanchez31

juankx-bodo · 2025-10-31T05:52:06Z

tests/test_mock_mask_server.py

-                    table_path="srv.db.orders",
+                    table_path="db.orders",
                    column_name="order_date",
                    expression=["BETWEEN", 3, "__col__", "2025-01-01", "2025-02-01"],


We need to test the use of QUOTE for values that match predicate reserved words, such as an operator name or __col__.

We should also test values containing single quotes, double quotes, commas, square brackets, and curly braces.

john-sanchez31

Left some comments. Once those are fixed or clarified I'll approve, but overall LGTM!

john-sanchez31 · 2025-11-03T14:40:52Z

pydough/mask_server/mask_server_candidate_visitor.py

+        self.processed_candidates: set[RelationalExpression] = set()
+        """
+        The set of all relational expressions that have already been added to
+        the candidate pool at lest once. This is used to avoid adding the same


Suggested change

the candidate pool at lest once. This is used to avoid adding the same

the candidate pool at least once. This is used to avoid adding the same

john-sanchez31 · 2025-11-03T15:07:50Z

pydough/conversion/mask_server_candidate_visitor.py

+            or step_literal not in ([1], ["NULL"])
+        ):
+            return None
+        print(start_literal, stop_literal, step_literal)


Is this print meant to be part of the code? If not, it should be deleted.

john-sanchez31 · 2025-11-03T15:12:48Z

pydough/mask_server/mask_server_candidate_visitor.py

+        result.extend(in_list)
+        return result
+
+    def convert_slice_call_to_server_expression(


As I understand it, this function only supports positive values, but PyDough supports slicing with negative values. How does that work? I assume those values wouldn't reach this function, so are they handled somewhere else?

john-sanchez31 · 2025-11-03T15:14:18Z

pydough/mask_server/mask_server_candidate_visitor.py

+    def convert_join_strings_call_to_server_expression(
+        self, input_exprs: list[list[str | int | float | None | bool] | None]
+    ) -> list[str | int | float | None | bool] | None:
+        """


Add the Args and Returns sections to this docstring (for consistency)

john-sanchez31 · 2025-11-03T15:15:05Z

pydough/mask_server/mask_server_candidate_visitor.py

+            "JOIN_STRINGS operator requires at least three inputs."
+        )
+        # If the delimiter expression could not be converted, return None.
+        delimiter_expr = input_exprs[0]


Type hint missing

john-sanchez31 · 2025-11-03T15:20:48Z

pydough/mask_server/mask_server_candidate_visitor.py

+        amount: int,
+        unit_str: str,
+    ) -> list[str | int | float | None | bool] | None:
+        """


Add a Returns section

john-sanchez31

Just fix the missing type hints and TODO docstrings, but overall LGTM! Nice job with the new dry run algorithm, impressive!

john-sanchez31 · 2025-11-24T22:19:35Z

pydough/mask_server/mask_server.py

    """

-    def __init__(self, base_url: str, token: str | None = None):
+    def __init__(self, base_url: str, server_address: str, token: str | None = None):


What would be the difference between base_url and server_address?

I think base_url is to contact the predicate server and server_address is to write the Fully Qualified Column Name as f"{server_address}/{table_path}". Is this correct?

Where will server_address be configured? This is very specific to the database instance we are connecting to. For example, metadata can be re-used for the same database on different servers, even with different engines. However, the server_address is directly associated (1:1) with the database instance.

I think base_url is to contact the predicate server and server_address is to write the Fully Qualified Column Name as f"{server_address}/{table_path}". Is this correct?

Yes, that is correct.

Where will server_address be configured?

When you configure/mount the MaskServerInfo class, you pass in the server_address (same place the token gets passed).

john-sanchez31 · 2025-11-24T22:38:58Z

pydough/mask_server/mask_server.py

+                response: dict = item.get("response", None)
+                if response is None:
+                    # In this case, use a dummy value as a default to indicate
+                    # the dry run was successful


Did you mean to indicate the dry run was unsuccessful?

No, I mean successful. I do need to adjust this slightly.

pydough/mask_server/mask_server_candidate_visitor.py

pydough/mask_server/mask_server_rewrite_shuttle.py

pydough/mask_server/min_cover_set.py

juankx-bodo · 2025-11-26T01:55:59Z

pydough/configs/session.py

        self._error_builder = builder

+    @property
+    def mask_server(self) -> Union["MaskServerInfo", None]:


Is there a particular reason to use Union["MaskServerInfo", None] instead of MaskServerInfo | None with from __future__ import annotations?

Because it is imported via if TYPE_CHECKING:, so MaskServerInfo won't always be imported, but the type checker will recognize "MaskServerInfo" (which can't be done with | None). This is how we avoid circular imports.
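
A small sketch of the pattern described in the reply, assuming annotations are evaluated eagerly (that is, without `from __future__ import annotations`). The `Session` class and the module path in the guarded import are hypothetical placeholders, not the PyDough definitions.

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Only seen by the type checker, so there is no import cycle at runtime.
    # The module path is a hypothetical placeholder.
    from mask_server_module import MaskServerInfo


class Session:
    """Hypothetical stand-in for the session class."""

    def __init__(self) -> None:
        self._mask_server = None

    @property
    def mask_server(self) -> Union["MaskServerInfo", None]:
        # The quoted name becomes a forward reference; nothing is imported.
        return self._mask_server


# A quoted name cannot be combined with `| None` when evaluated at runtime:
try:
    "MaskServerInfo" | None
    pipe_works = True
except TypeError:
    pipe_works = False
```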

juankx-bodo · 2025-11-26T02:02:04Z

pydough/conversion/relational_converter.py

                        pydop.MaskedExpressionFunctionOperator(
-                            hybrid_expr.column.column_property, True
+                            hybrid_expr.column.column_property,
+                            node.collection.collection.table_path,


Is this the reason why we need to use the full table path in metadata?

EXACTLY (plus it's a good idea in general)

juankx-bodo · 2025-11-26T02:10:48Z

pydough/conversion/relational_converter.py

-    # PYDOUGH_ENABLE_MASK_REWRITES is set to 1.
+    # PYDOUGH_ENABLE_MASK_REWRITES is set to 1. If a masking rewrite server has
+    # been attached to the session, include the shuttles for that as well.
    if os.getenv("PYDOUGH_ENABLE_MASK_REWRITES") == "1":


Is there a reason why PYDOUGH_ENABLE_MASK_REWRITES is not in PyDoughConfigs?

Because we wanted an environment variable as a "switch"

juankx-bodo · 2025-11-26T02:31:59Z

pydough/mask_server/mask_server.py

    """

-    def __init__(self, base_url: str, token: str | None = None):
+    def __init__(self, base_url: str, server_address: str, token: str | None = None):


I think base_url is to contact the predicate server and server_address is to write the Fully Qualified Column Name as f"{server_address}/{table_path}". Is this correct?

Where will server_address be configured? This is very specific to the database instance we are connecting to. For example, metadata can be re-used for the same database on different servers, even with different engines. However, the server_address is directly associated (1:1) with the database instance.

juankx-bodo · 2025-11-26T02:47:59Z

pydough/mask_server/mask_server.py

+        for idx, item in enumerate(batch):
+            pyd_logger.info(
+                f"({idx + 1}) {item.table_path}.{item.column_name}: {item.expression}"
+            )


Should this log entry be debug level instead of info?

🤷 Rn I'm keeping everything the same logging level for simplicity. We can revise down if we think it is appropriate.

juankx-bodo · 2025-11-26T02:55:58Z

pydough/mask_server/mask_server.py

+        request: ServerRequest = self.generate_request(
+            batch, path, method, dry_run, hard_limit
+        )
        response_json = self.connection.send_server_request(request)


In case of a predicate_server failure, users will not be able to query the database at all. Not even with the MASK functions. This could be a critical point of failure for the system.

Failure vs error are very different. If there is a legitimate error with connecting to the server, my understanding was that we wanted to abort. If the server responds just fine but indicates it failed to derive an answer, then that's fine and we proceed normally.
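
The distinction drawn in that reply could look something like the sketch below. Everything here is an assumption made for illustration: `MaskServerUnreachable`, `rewrite_or_keep`, and the `"answer"` response field are hypothetical and do not reflect the real response schema.

```python
from typing import Callable


class MaskServerUnreachable(RuntimeError):
    """Hypothetical error raised when the server cannot be contacted."""


def rewrite_or_keep(
    send: Callable[[dict], dict], request: dict, original: str
) -> str:
    try:
        response = send(request)
    except OSError as exc:
        # Error: the request itself did not go through, so abort.
        raise MaskServerUnreachable("mask server request failed") from exc
    answer = response.get("answer")  # hypothetical response field
    if answer is None:
        # Failure: the server replied but derived nothing; keep the original.
        return original
    return answer
```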

pydough/mask_server/mask_server.py

juankx-bodo · 2025-11-26T03:04:23Z

pydough/mask_server/mask_server.py

-                "column_reference": f"{item.table_path}.{item.column_name}",
+                "column_ref": {
+                    "kind": "fqn",
+                    "value": f"{self.server_address}.{item.table_path}.{item.column_name}",


The separator should be "/". item.table_path is a composite name whose elements are separated by ".". Any element can be enclosed in double quotes or backticks and contain "." as part of the element name. Additionally, any character in the name equal to the enclosure character is escaped by doubling it.

We could wait to see the real implementation before making this kind of change.

Gonna do this. The problem for table path is how to handle varying edge cases of what table_path looks like:

db.schema.col -> db/schema/col

"a.b"."c.d"."e.f" -> a.b/c.d/e.f

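One way the quoting rules described in this thread could be handled is sketched below. It is not the implementation from this PR: `split_table_path` and `to_fqn` are hypothetical helpers, and joining the column name with "/" as well is an assumption.

```python
def split_table_path(table_path: str) -> list[str]:
    """Split a dotted path, honoring "..." and `...` enclosures."""
    parts: list[str] = []
    current: list[str] = []
    quote: str | None = None
    i = 0
    while i < len(table_path):
        ch = table_path[i]
        if quote is not None:
            if ch == quote:
                if i + 1 < len(table_path) and table_path[i + 1] == quote:
                    current.append(quote)  # doubled char is an escaped quote
                    i += 1
                else:
                    quote = None  # closing enclosure
            else:
                current.append(ch)
        elif ch in ('"', "`"):
            quote = ch  # opening enclosure
        elif ch == ".":
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if quote is not None:
        raise ValueError(f"unterminated enclosure in {table_path!r}")
    parts.append("".join(current))
    return parts


def to_fqn(server_address: str, table_path: str, column_name: str) -> str:
    """Join the server address, path elements, and column name with "/"."""
    return "/".join([server_address, *split_table_path(table_path), column_name])
```

With this sketch, the two cases listed above come out as db/schema/col and a.b/c.d/e.f respectively.
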
juankx-bodo · 2025-11-26T14:55:17Z

pydough/mask_server/mask_server.py

+
        assert batch != [], "Batch cannot be empty."

        path: str = "v1/predicates/batch-evaluate"


path could be a class variable, so we don't need to pass it as a parameter to other class methods like generate_request()

Good point, moved into a class var of MaskServerInfo that gets passed into the ServerRequest by generate_request

juankx-bodo · 2025-11-26T15:04:32Z

pydough/mask_server/mask_server.py

+        self,
+        batch: list[MaskServerInput],
+        path: str,
+        method: RequestMethod,


Including path and method as parameters looks like an attempt to make generate_request() more general. However, given all the other specific parameters and actions, I think this method is very specific to batch-evaluate. Maybe path and method could be class properties, since they will not change for this method. If more request methods are required in the future, those paths and methods could also be part of the class.

I think method doesn't need to be a class property; it can just get baked into the method's construction of a ServerRequest instance.
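
A minimal sketch of the arrangement discussed in these two threads, with the path held as a class variable and the method fixed inside the request construction. The names mirror the ones in the diff, but the definitions are simplified stand-ins: the attribute name `BATCH_EVALUATE_PATH`, the use of POST, and the payload keys are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class RequestMethod(Enum):
    POST = "POST"


@dataclass
class ServerRequest:
    path: str
    method: RequestMethod
    payload: dict


class MaskServerInfo:
    # Class-level constant: no need to thread the path through call sites.
    BATCH_EVALUATE_PATH: str = "v1/predicates/batch-evaluate"

    def generate_request(self, batch: list[dict], dry_run: bool) -> ServerRequest:
        # The HTTP method is fixed for this endpoint, so it is set here
        # rather than passed in (POST is an assumption).
        return ServerRequest(
            path=self.BATCH_EVALUATE_PATH,
            method=RequestMethod.POST,
            payload={"items": batch, "dry_run": dry_run},
        )
```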

juankx-bodo · 2025-11-26T15:19:37Z

pydough/mask_server/mask_server.py

        """
-        Generate a list of server outputs from the server response.
+        Generate a list of server outputs from the server response of a
+        non-dry-run request.


What happens when the request is a dry-run? We are calling generate_result(response_json) in L174 for all batch-evaluate requests.

I didn't like the design idea of having the dry-run and the actual call on the same API path, because they are different things called at different times. We can't change that, but could it make sense to separate them on our side? At least in how we process the response?

This comment should be rolled back. The function is the same for both; the difference is that dry-runs have an empty payload for the records.

juankx-bodo · 2025-11-26T15:31:32Z

pydough/mask_server/mask_server_candidate_visitor.py

+    - `DATEDIFF`
+    """
+
+    PREDICATE_OPERATORS: set[str] = {


What are the criteria for a predicate operator to be included here?

These are the operators that are actually predicates, i.e. they return a boolean.

E.g. SUBSTRING can be inside the expression, but should not be the expression itself.

E.g. we wouldn't send abs(expr + 2) to the predicate server, but we would send abs(expr + 2) < 13, we wouldn't send LOWER(expr[:5]) but we would send CONTAINS(LOWER(expr[:5]), 'a')
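
The rule in this reply can be sketched as a check on the outermost operator of a serialized expression. The operator names and the three-entry set below are illustrative assumptions, not the contents of PREDICATE_OPERATORS in this PR, and `is_candidate` is a hypothetical helper.

```python
# Illustrative subset only; the server operator names are assumed.
PREDICATE_OPERATORS: set[str] = {"LT", "EQ", "CONTAINS"}


def is_candidate(expression: list) -> bool:
    """Only expressions whose outermost operator is a predicate are sent."""
    return bool(expression) and expression[0] in PREDICATE_OPERATORS


# abs(expr + 2): the outermost operator is not a predicate.
abs_only = ["ABS", 1, "ADD", 2, "__col__", 2]
# abs(expr + 2) < 13: the same sub-expression nested inside a predicate.
abs_lt_13 = ["LT", 2, "ABS", 1, "ADD", 2, "__col__", 2, 13]
```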

juankx-bodo · 2025-11-26T15:42:44Z

pydough/mask_server/mask_server_candidate_visitor.py

+                # from the earlier check.
+                for inp in input_exprs:
+                    assert inp is not None
+                    result.extend(inp)


Remember that a literal string may need to be wrapped in QUOTE if it matches an operator name.

Ahhh good point. I'll do that for literal string handling.

john-sanchez31 · 2025-12-03T22:23:43Z

pydough/mask_server/mask_server.py

                },
                ...
            ],
            "expression_format": {"name": "linear", "version": "0.2.0"}


Suggested change

"expression_format": {"name": "linear", "version": "0.2.0"}

"expression_format": {"name": "linear", "version": "0.2.0"},

john-sanchez31 · 2025-12-03T22:46:04Z

pydough/mask_server/mask_server_rewrite_shuttle.py

+    Mask Server and replacing the candidate expressions with the appropriate
+    responses from the server.


Suggested change

Mask Server and replacing the candidate expressions with the appropriate

responses from the server.

Mask Server. First sends all candidates using the dry run flag, then selects the best candidates to be replaced with the appropriate response from the Mask Server.

john-sanchez31 · 2025-12-04T16:04:42Z

pydough/mask_server/mask_server_candidate_visitor.py

+        self.processed_candidates: set[RelationalExpression] = set()
+        """
+        The set of all relational expressions that have already been added to
+        the candidate pool at lest once. This is used to avoid adding the same


john-sanchez31 · 2025-12-04T16:06:29Z

pydough/mask_server/mask_server_candidate_visitor.py

+    def convert_join_strings_call_to_server_expression(
+        self, input_exprs: list[list[str | int | float | None | bool] | None]
+    ) -> list[str | int | float | None | bool] | None:
+        """


john-sanchez31 · 2025-12-04T16:06:54Z

pydough/mask_server/mask_server_candidate_visitor.py

+            "JOIN_STRINGS operator requires at least three inputs."
+        )
+        # If the delimiter expression could not be converted, return None.
+        delimiter_expr = input_exprs[0]


john-sanchez31 · 2025-12-04T16:07:38Z

pydough/mask_server/mask_server_candidate_visitor.py

+        amount: int,
+        unit_str: str,
+    ) -> list[str | int | float | None | bool] | None:
+        """


knassre-bodo added 10 commits October 9, 2025 00:49

Initial implementaitons of candidate vs rewrite shuttle

4d6488c

Initial implementation of predicate server integration working on cry…

5369379

…ptbank_filter_count_01

WIP adding to lookup table

36cab6e

Rewriting the rest of the filter count queries

ed6650c

Moving server address into mask server info setup

cc2bbed

[RUN ALL]

a6d4b29

Adding more tests

beadb15

Merge branch 'main' into kian/mask_server_rewrite

1b4bcac

Switching up relational shuttle handling for simplification

5ea82f1

Minor adjustments to file placement

f0f512c

knassre-bodo commented Oct 15, 2025


knassre-bodo added 5 commits October 15, 2025 13:32

Moved some logic from rewrite shuttle to candidate visitor

54ecef1

Added more tests

557aaeb

Added rewrite shuttle docstrings/comments

6b109d9

Adding remaining documentation

1377916

Removing dead rule

891c472

knassre-bodo marked this pull request as ready for review October 16, 2025 19:08

knassre-bodo added 2 commits October 16, 2025 15:08

Merge branch 'main' into kian/mask_server_rewrite

7d7580b

[RUN ALL]

62db4bf

knassre-bodo requested review from a team, hadia206, john-sanchez31 and juankx-bodo and removed request for a team October 16, 2025 19:09

[RUN ALL]

c9f6a59

knassre-bodo commented Oct 16, 2025


john-sanchez31 reviewed Oct 17, 2025


knassre-bodo added 3 commits October 26, 2025 09:25

Adding logging to keep track of the batch requests sent

7c37110

Ensuring non-predicate sub-expressions are not sent to the server [RU…

127244f

…N CI]

Ensuring non-predicate sub-expressions are not sent to the server [RU…

1f2dc6d

…N CI]

Adding date/datetime/timestamp literal handling tests [RUN CI]

b278f9b

knassre-bodo requested a review from john-sanchez31 October 29, 2025 17:10

knassre-bodo added 2 commits October 30, 2025 17:04

Added new operators support, need to add new tests for datetime, quar…

dcbb69c

…ter/hour/minute/second, coalesce,iff, join_strings, smallest/largest, and abs

Added more tests, handled predicate pushdown bug with least/greatest,…

feabd8a

… handled cases where the in/not in list contains a NULL

juankx-bodo reviewed Oct 31, 2025


Added remaining tests [RUN CI]

940dd16

john-sanchez31 reviewed Nov 3, 2025


knassre-bodo added 10 commits November 5, 2025 13:46

Predicate server revisions with new API

a6f6a37

JSON request/response reformatting WIP

af10c5b

Adding four-phase algorithm, need to implement step #3

0371ec5

Updating rewrite handling, need to add DP algorithm

3996ced

Finishing implementation of min cover set

29e0e3f

Added edge case tests for selection algorithm

f9c05b2

Minor test adjustment

4f274fd

Minor test adjustment

18379ef

Merge branch 'main' into kian/mask_server_rewrite

f512f8b

Resolving conflicts [RUN ALL]

90f0671

knassre-bodo requested review from john-sanchez31 and juankx-bodo November 24, 2025 18:43

john-sanchez31 reviewed Nov 25, 2025


juankx-bodo reviewed Nov 26, 2025


knassre-bodo added 3 commits November 26, 2025 11:05

Merge branch 'main' into kian/mask_server_rewrite

f6a571b

Added the FQN slash handling

b728348

Revisions, QUOTE operator handling, docstrings/documentation [RUN ALL]

8e03b04

knassre-bodo requested review from john-sanchez31 and juankx-bodo December 2, 2025 19:14

Fixing mask server tests [RUN ALL]

a3c79cf

john-sanchez31 reviewed Dec 4, 2025

