@knassre-bodo knassre-bodo commented Oct 9, 2025

Augmenting relational optimization to rewrite expressions containing an UNMASK operator when a server is mounted to the PyDough session (and the environment variable is activated):

  • When this is the case, two additional shuttles are added to the additional_shuttles list, before the masking literal comparisons shuttle.
  • The first shuttle, MaskServerCandidateShuttle, is a no-op shuttle that traverses the entire tree to find expressions that can potentially be rewritten and adds them to a pool.
  • The second shuttle, MaskServerRewriteShuttle, looks for expressions in the candidate shuttle's pool. Once it finds one, it sends every candidate in the pool to the mask server in a single batch request and processes the results to create the new relational node. The candidate pool is then emptied so that future invocations do not redo the same batch calculation.

[
MaskServerInput(
- table_path="srv.db.tbl",
+ table_path="db.tbl",
Contributor Author

The srv. part is added elsewhere

@knassre-bodo knassre-bodo marked this pull request as ready for review October 16, 2025 19:08
@knassre-bodo knassre-bodo requested review from a team, hadia206, john-sanchez31 and juankx-bodo and removed request for a team October 16, 2025 19:09
self.stack.clear()

def visit_call_expression(self, expr: CallExpression) -> None:
# TODO: ADD COMMENTS
Contributor Author

Suggested change
# TODO: ADD COMMENTS

@john-sanchez31 john-sanchez31 left a comment

Comments regarding IN and ISIN operators and a type hint

mapping each such operator to the string name used in the linear string
serialization format recognized by the Mask Server.
Note: ISIN is handled separately.
Contributor

Are these the operators used in the mock server? If so, we should add IN and NOT_IN (can be found in the lookup table)

Contributor Author

@knassre-bodo knassre-bodo Oct 23, 2025

These are the operators used in the real server (which the mock server should emulate). The point isn't to include all of their operators (e.g. we don't do regex); it's to include all of the mappings from our operators to theirs. ISIN is handled separately from this mapping, and we don't currently use NOT_ISIN at all: we just do ISIN and sometimes wrap the result in a NOT call. There is no operator in PyDough which maps to NOT_ISIN.

…ter/hour/minute/second, coalesce,iff, join_strings, smallest/largest, and abs
… handled cases where the in/not in list contains a NULL
@juankx-bodo juankx-bodo left a comment

I'll leave the approval to @hadia206 and @john-sanchez31

- table_path="srv.db.orders",
+ table_path="db.orders",
column_name="order_date",
expression=["BETWEEN", 3, "__col__", "2025-01-01", "2025-02-01"],
Contributor

We need to test the use of QUOTE, for values using predicate reserved words like an OP name or __col__

Contributor

We should also test values containing single quotes, double quotes, commas, square brackets, and curly braces

@john-sanchez31 john-sanchez31 left a comment

Left some comments. After those are fixed or clarified I'll approve, but overall LGTM!

self.processed_candidates: set[RelationalExpression] = set()
"""
The set of all relational expressions that have already been added to
the candidate pool at lest once. This is used to avoid adding the same
Contributor

Suggested change
the candidate pool at lest once. This is used to avoid adding the same
the candidate pool at least once. This is used to avoid adding the same

Contributor

Reminder

or step_literal not in ([1], ["NULL"])
):
return None
print(start_literal, stop_literal, step_literal)
Contributor

Is this print part of the code? If not it should be deleted

result.extend(in_list)
return result

def convert_slice_call_to_server_expression(
Contributor

As I understand it, this function only supports positive values, but PyDough can support slicing with negative values. How does that work? I assume those values wouldn't enter this function, so is it managed somewhere else?

def convert_join_strings_call_to_server_expression(
self, input_exprs: list[list[str | int | float | None | bool] | None]
) -> list[str | int | float | None | bool] | None:
"""
Contributor

Add the Args and Returns section in this docstring (for consistency)

Contributor

Reminder

"JOIN_STRINGS operator requires at least three inputs."
)
# If the delimiter expression could not be converted, return None.
delimiter_expr = input_exprs[0]
Contributor

Type hint missing

Contributor

Reminder

amount: int,
unit_str: str,
) -> list[str | int | float | None | bool] | None:
"""
Contributor

Add a Returns section

Contributor

Reminder

@john-sanchez31 john-sanchez31 left a comment

Just fix the missing type hints and TODO docstrings, but overall LGTM! Nice job with the new dry run algorithm, impressive!

"""

- def __init__(self, base_url: str, token: str | None = None):
+ def __init__(self, base_url: str, server_address: str, token: str | None = None):
Contributor

What would be the difference between base_url and server_address?

Contributor

I think base_url is to contact the predicate server and server_address is to write the Fully Qualified Column Name as f"{server_address}/{table_path}". Is this correct?

Where will server_address be configured? This is very specific to the database instance we are connecting to. For example, metadata can be re-used for the same database on different servers, even with different engines. However, the server_address is directly associated (1:1) with the database instance.

Contributor Author

@knassre-bodo knassre-bodo Dec 1, 2025

I think base_url is to contact the predicate server and server_address is to write the Fully Qualified Column Name as f"{server_address}/{table_path}". Is this correct?

Yes, that is correct.

Where will server_address be configured?

When you configure/mount the MaskServerInfo class, you pass in the server_address (same place the token gets passed).

response: dict = item.get("response", None)
if response is None:
# In this case, use a dummy value as a default to indicate
# the dry run was successful
Contributor

Did you mean to indicate the dry run was unsuccessful?

Contributor Author

@knassre-bodo knassre-bodo Dec 1, 2025

No, I mean successful. I do need to adjust this slightly.

self._error_builder = builder

@property
def mask_server(self) -> Union["MaskServerInfo", None]:
Contributor

Is there a special reason to use Union["MaskServerInfo", None] instead of MaskServerInfo | None with from __future__ import annotations?

Contributor Author

Because it is imported via if TYPE_CHECKING:, so MaskServerInfo won't always be imported, but the type checker will recognize "MaskServerInfo" (which can't be done with | None). This is how we avoid circular imports.
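A minimal sketch of the pattern being described, with a hypothetical module path (`mask_server`) and a hypothetical `Session` class. The import only runs for the type checker, so the name does not exist at runtime. A string forward reference inside `Union[...]` is accepted, whereas applying `|` to a string raises `TypeError` when the annotation is evaluated. With `from __future__ import annotations` the annotation is not evaluated at definition time, so the `|` spelling would also be accepted in that case.

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Only seen by the type checker, which avoids a circular import at runtime.
    from mask_server import MaskServerInfo  # hypothetical module path


class Session:
    """Hypothetical session object holding an optional mask server."""

    def __init__(self) -> None:
        self._mask_server = None

    @property
    def mask_server(self) -> Union["MaskServerInfo", None]:
        # The quoted name is a forward reference; nothing is imported here.
        return self._mask_server


def pipe_with_string_fails() -> bool:
    """Show that the PEP 604 spelling cannot take a quoted forward reference."""
    try:
        _ = "MaskServerInfo" | None  # str does not support the | operator
    except TypeError:
        return True
    return False
```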

pydop.MaskedExpressionFunctionOperator(
- hybrid_expr.column.column_property, True
+ hybrid_expr.column.column_property,
+ node.collection.collection.table_path,
Contributor

Is this the reason why we need to use the full table path in metadata?

Contributor Author

@knassre-bodo knassre-bodo Dec 2, 2025

EXACTLY (plus it's a good idea in general)

- # PYDOUGH_ENABLE_MASK_REWRITES is set to 1.
+ # PYDOUGH_ENABLE_MASK_REWRITES is set to 1. If a masking rewrite server has
+ # been attached to the session, include the shuttles for that as well.
if os.getenv("PYDOUGH_ENABLE_MASK_REWRITES") == "1":
Contributor

Is there a reason why PYDOUGH_ENABLE_MASK_REWRITES is not in PyDoughConfigs?

Contributor Author

Because we wanted an environment variable as a "switch"

for idx, item in enumerate(batch):
pyd_logger.info(
f"({idx + 1}) {item.table_path}.{item.column_name}: {item.expression}"
)
Contributor

Should this log entry be debug level instead of info?

Contributor Author

🤷 Rn I'm keeping everything the same logging level for simplicity. We can revise down if we think it is appropriate.

request: ServerRequest = self.generate_request(
batch, path, method, dry_run, hard_limit
)
response_json = self.connection.send_server_request(request)
Contributor

In case of a predicate_server failure, users will not be able to query the database at all. Not even with the MASK functions. This could be a critical point of failure for the system.

Contributor Author

Failure vs error are very different. If there is a legitimate error with connecting to the server, my understanding was that we wanted to abort. If the server responds just fine but indicates it failed to derive an answer, then that's fine and we proceed normally.
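The distinction drawn here could look roughly like the sketch below: a transport-level error aborts, while a well-formed response that declines an item is recorded as `None` so the caller keeps the original expression. The response shape (`"items"`, `"status"`, `"response"`), the exception class, and the function name are assumptions, not the server's documented contract.

```python
from typing import Any, Callable


class MaskServerConnectionError(RuntimeError):
    """Hypothetical error raised when the server cannot be reached at all."""


def evaluate_batch(
    send: Callable[[list[str]], dict[str, Any]], batch: list[str]
) -> list[Any]:
    """Return one entry per batch item; None means 'no rewrite, proceed normally'."""
    try:
        payload = send(batch)
    except OSError as exc:
        # A genuine connection error: abort rather than silently continue.
        raise MaskServerConnectionError("could not reach the mask server") from exc
    results: list[Any] = []
    for item in payload["items"]:
        if item.get("status") == "ok":
            results.append(item.get("response"))
        else:
            # The server answered but could not derive a result for this item.
            results.append(None)
    return results
```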

- "column_reference": f"{item.table_path}.{item.column_name}",
+ "column_ref": {
+ "kind": "fqn",
+ "value": f"{self.server_address}.{item.table_path}.{item.column_name}",
Contributor

The separator should be "/". item.table_path is a composite name whose elements are separated by ".". Any element can be enclosed in double quotes or backticks and contain "." as part of the element name. Additionally, any character in the name equal to the enclosing character is escaped by doubling it.

Contributor

We could wait to see the real implementation before making this kind of change.

Contributor Author

Gonna do this. The problem for table path is how to handle varying edge cases of what table_path looks like:

  • db.schema.col -> db/schema/col
  • "a.b"."c.d"."e.f" -> a.b/c.d/e.f
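One possible way to handle both edge cases is a small quote-aware splitter, sketched below under the rules stated in the review comment: elements are separated by ".", may be enclosed in double quotes or backticks, and a doubled enclosing character is an escaped literal. The function names are hypothetical, and what to do when an element itself contains "/" is left open here.

```python
def split_table_path(path: str) -> list[str]:
    """Split a dotted path into elements, honoring quoted elements."""
    parts: list[str] = []
    current: list[str] = []
    quote: str | None = None  # the active enclosing character, if any
    i = 0
    while i < len(path):
        ch = path[i]
        if quote is not None:
            if ch == quote:
                if i + 1 < len(path) and path[i + 1] == quote:
                    current.append(quote)  # doubled quote is a literal quote
                    i += 1
                else:
                    quote = None  # closing quote
            else:
                current.append(ch)
        elif ch in ('"', "`"):
            quote = ch
        elif ch == ".":
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if quote is not None:
        raise ValueError(f"unterminated quoted element in {path!r}")
    parts.append("".join(current))
    return parts


def to_slash_path(path: str) -> str:
    """Join the unquoted elements with '/', as in the two examples above."""
    return "/".join(split_table_path(path))
```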


assert batch != [], "Batch cannot be empty."

path: str = "v1/predicates/batch-evaluate"
Contributor

path could be a class variable, so we don't need to pass it as a parameter to other class methods like generate_request()

Contributor Author

@knassre-bodo knassre-bodo Dec 2, 2025

Good point, moved into a class var of MaskServerInfo that gets passed into the ServerRequest by generate_request

self,
batch: list[MaskServerInput],
path: str,
method: RequestMethod,
Contributor

Including path and method in the parameters looks like an attempt to make generate_request() more general. However, given all the other specific parameters and actions, I think this method is very specific to batch-evaluate. Maybe path and method could be class properties, since they will not change for this method. If more request methods are required in the future, those paths and methods could also be part of the class.

Contributor Author

I think method doesn't need to be a class property; it can just get baked into the method's construction of a ServerRequest instance.

"""
- Generate a list of server outputs from the server response.
+ Generate a list of server outputs from the server response of a
+ non-dry-run request.
Contributor

What happens when the request is a dry-run? We are calling generate_result(response_json) in L174 for all batch-evaluate requests.

I didn't like the design idea of having the dry-run and the actual call in the same API path, because they are different things called at different times. We can't change that, but could it make sense to separate them on our side, at least in how we process the response?

Contributor Author

This comment should be rolled-back. The function is the same for both, the difference is that dry-runs have an empty payload for the records.

- `DATEDIFF`
"""

PREDICATE_OPERATORS: set[str] = {
Contributor

What are the criteria for a predicate operator to be included here?

Contributor Author

@knassre-bodo knassre-bodo Nov 26, 2025

These are the operators that are actually predicates, i.e. they return a boolean.

E.g. SUBSTRING can be inside the expression, but should not be the expression itself.

E.g. we wouldn't send abs(expr + 2) to the predicate server, but we would send abs(expr + 2) < 13, we wouldn't send LOWER(expr[:5]) but we would send CONTAINS(LOWER(expr[:5]), 'a')

# from the earlier check.
for inp in input_exprs:
assert inp is not None
result.extend(inp)
Contributor

Remember that a literal string may need to be wrapped in QUOTE if it matches an operator name.

Contributor Author

Ahhh good point. I'll do that for literal string handling.

},
...
],
"expression_format": {"name": "linear", "version": "0.2.0"}
Contributor

Suggested change
"expression_format": {"name": "linear", "version": "0.2.0"}
"expression_format": {"name": "linear", "version": "0.2.0"},

Comment on lines +31 to +32
Mask Server and replacing the candidate expressions with the appropriate
responses from the server.
Contributor

Suggested change
Mask Server and replacing the candidate expressions with the appropriate
responses from the server.
Mask Server. First send all candidates using the dry run flag, then selects the best candidates to be replaced with the appropriate response from the Mask Server.
