Adding optimization rewrite pass to utilize server with information about masked columns #443
…ptbank_filter_count_01
| [ | ||
| MaskServerInput( | ||
| table_path="srv.db.tbl", | ||
| table_path="db.tbl", |
The srv. part is added elsewhere
| self.stack.clear() | ||
|
|
||
| def visit_call_expression(self, expr: CallExpression) -> None: | ||
| # TODO: ADD COMMENTS |
| # TODO: ADD COMMENTS |
john-sanchez31
left a comment
Comments regarding IN and ISIN operators and a type hint
| mapping each such operator to the string name used in the linear string | ||
| serialization format recognized by the Mask Server. | ||
| Note: ISIN is handled separately. |
Are these the operators used in the mock server? If so, we should add IN and NOT_IN (can be found in the lookup table)
These are the operators used in the real server (which the mock server should emulate). And the point isn't to include all of their operators (e.g. we don't do regex), it's to include all of the mappings from our operators to theirs. ISIN is handled separately from this mapping, and we don't currently use NOT_ISIN at all; we just do ISIN and sometimes wrap the result in a NOT call. There is no operator in PyDough which maps to NOT_ISIN.
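A minimal sketch of the "ISIN, optionally wrapped in NOT" approach described above. It assumes the linear format is a flat prefix list of the form operator name, argument count, arguments (inferred from the BETWEEN test input in this PR); the server-side names "IN" and "NOT" and both helper functions are hypothetical.

```python
def linear_call(op: str, args: list[list]) -> list:
    """Flatten one call into the assumed linear form: [op, n_args, *args]."""
    result: list = [op, len(args)]
    for arg in args:
        result.extend(arg)
    return result


def isin_expression(column: list, values: list, negated: bool = False) -> list:
    """Build a membership test; negation is a NOT wrapper, not a NOT_IN operator."""
    # "IN" is an assumed server operator name, used here for illustration only.
    expr = linear_call("IN", [column] + [[value] for value in values])
    return linear_call("NOT", [expr]) if negated else expr
```

With these assumptions, `isin_expression(["__col__"], [1, 2], negated=True)` yields a single flat list that starts with the NOT wrapper and contains the membership test as its only argument.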
…ter/hour/minute/second, coalesce,iff, join_strings, smallest/largest, and abs
… handled cases where the in/not in list contains a NULL
juankx-bodo
left a comment
I'll leave the approval to @hadia206 and @john-sanchez31
| table_path="srv.db.orders", | ||
| table_path="db.orders", | ||
| column_name="order_date", | ||
| expression=["BETWEEN", 3, "__col__", "2025-01-01", "2025-02-01"], |
We need to test the use of QUOTE, for values using predicate reserved words like an OP name or __col__
We should also test values having single-quote, double-quote, comma, square brackets and curly braces
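A rough sketch of the kind of test values these two comments ask for. The reserved-word set and the `["QUOTE", value]` wrapping are assumptions made only for illustration; the actual QUOTE encoding used by the server is not shown in this thread.

```python
# Assumed subset of words that the linear format treats specially.
RESERVED_WORDS: set[str] = {"__col__", "BETWEEN", "IN", "NOT", "QUOTE"}

# Literals worth covering in tests: reserved words plus punctuation-heavy strings.
AWKWARD_LITERALS: list[str] = [
    "BETWEEN",
    "__col__",
    "it's",
    'say "hi"',
    "a,b",
    "[x]",
    "{y}",
]


def literal_to_tokens(value: object) -> list:
    """Wrap a string literal in QUOTE (assumed form) when it is a reserved word."""
    if isinstance(value, str) and value in RESERVED_WORDS:
        return ["QUOTE", value]
    return [value]
```

Each entry in `AWKWARD_LITERALS` could then be fed through the conversion and compared against the payload the server is expected to receive.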
john-sanchez31
left a comment
Left some comments, after those are fixed or clarified I'll approve but overall LGTM!
| self.processed_candidates: set[RelationalExpression] = set() | ||
| """ | ||
| The set of all relational expressions that have already been added to | ||
| the candidate pool at lest once. This is used to avoid adding the same |
| the candidate pool at lest once. This is used to avoid adding the same | |
| the candidate pool at least once. This is used to avoid adding the same |
Reminder
| or step_literal not in ([1], ["NULL"]) | ||
| ): | ||
| return None | ||
| print(start_literal, stop_literal, step_literal) |
Is this print part of the code? If not it should be deleted
| result.extend(in_list) | ||
| return result | ||
|
|
||
| def convert_slice_call_to_server_expression( |
As I understand it, this function only supports positive values, but PyDough supports slicing with negative values. How does that work? I assume those values never reach this function, so are they handled somewhere else?
| def convert_join_strings_call_to_server_expression( | ||
| self, input_exprs: list[list[str | int | float | None | bool] | None] | ||
| ) -> list[str | int | float | None | bool] | None: | ||
| """ |
Add the Args and Returns section in this docstring (for consistency)
Reminder
| "JOIN_STRINGS operator requires at least three inputs." | ||
| ) | ||
| # If the delimiter expression could not be converted, return None. | ||
| delimiter_expr = input_exprs[0] |
Type hint missing
Reminder
| amount: int, | ||
| unit_str: str, | ||
| ) -> list[str | int | float | None | bool] | None: | ||
| """ |
Add a Returns section
Reminder
john-sanchez31
left a comment
Just fix the missing type hints and TODO docstrings, but overall LGTM! Nice job with the new dry run algorithm, impressive!
| """ | ||
|
|
||
| def __init__(self, base_url: str, token: str | None = None): | ||
| def __init__(self, base_url: str, server_address: str, token: str | None = None): |
What would be the difference between base_url and server_address?
I think base_url is used to contact the predicate server and server_address is used to write the Fully Qualified Column Name as f"{server_address}/{table_path}". Is this correct?
Where will server_address be configured? This is very specific to the database instance we are connecting to. For example, metadata can be re-used for the same database on different servers, even with different engines. However, the server_address is directly associated (1:1) with the database instance.
> I think base_url is used to contact the predicate server and server_address is used to write the Fully Qualified Column Name as f"{server_address}/{table_path}". Is this correct?
Yes, that is correct.
> Where will server_address be configured?
When you configure/mount the MaskServerInfo class, you pass in the server_address (same place the token gets passed).
pydough/mask_server/mask_server.py
Outdated
| response: dict = item.get("response", None) | ||
| if response is None: | ||
| # In this case, use a dummy value as a default to indicate | ||
| # the dry run was successful |
Did you mean to indicate the dry run was unsuccessful?
No, I mean successful. I do need to adjust this slightly.
| self._error_builder = builder | ||
|
|
||
| @property | ||
| def mask_server(self) -> Union["MaskServerInfo", None]: |
Is there a special reason to use Union["MaskServerInfo", None] instead of MaskServerInfo | None with from __future__ import annotations?
Because it is imported via if TYPE_CHECKING:, so MaskServerInfo won't always be imported, but the type checker will recognize "MaskServerInfo" (which can't be done with | None). This is how we avoid circular imports.
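The pattern described in this reply can be sketched as follows. The module name in the guarded import and the `Session` class are hypothetical stand-ins; only the use of `TYPE_CHECKING` and `Union["MaskServerInfo", None]` comes from the thread.

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Seen only by the type checker, so no import (and no import cycle) at runtime.
    from mask_server_module import MaskServerInfo  # hypothetical module path


class Session:
    def __init__(self) -> None:
        self._mask_server = None

    @property
    def mask_server(self) -> Union["MaskServerInfo", None]:
        # The string forward reference is accepted by Union at runtime.
        return self._mask_server


def pipe_with_string_fails() -> bool:
    """Show that applying | to a plain string and None raises at runtime."""
    try:
        "MaskServerInfo" | None  # str does not implement the | operator
    except TypeError:
        return True
    return False
```

When annotations are evaluated at function definition time, the `|` form fails on a string forward reference, which is the limitation the reply refers to. With `from __future__ import annotations`, annotations are stored as strings and not evaluated at definition time, so that limitation would not arise.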
| pydop.MaskedExpressionFunctionOperator( | ||
| hybrid_expr.column.column_property, True | ||
| hybrid_expr.column.column_property, | ||
| node.collection.collection.table_path, |
Is this the reason why we need to use the full table path in metadata?
EXACTLY (plus it's a good idea in general)
| # PYDOUGH_ENABLE_MASK_REWRITES is set to 1. | ||
| # PYDOUGH_ENABLE_MASK_REWRITES is set to 1. If a masking rewrite server has | ||
| # been attached to the session, include the shuttles for that as well. | ||
| if os.getenv("PYDOUGH_ENABLE_MASK_REWRITES") == "1": |
Is there a reason why PYDOUGH_ENABLE_MASK_REWRITES is not in PyDoughConfigs?
Because we wanted an environment variable as a "switch"
| for idx, item in enumerate(batch): | ||
| pyd_logger.info( | ||
| f"({idx + 1}) {item.table_path}.{item.column_name}: {item.expression}" | ||
| ) |
Should this log entry be debug level instead of info?
🤷 Rn I'm keeping everything the same logging level for simplicity. We can revise down if we think it is appropriate.
| request: ServerRequest = self.generate_request( | ||
| batch, path, method, dry_run, hard_limit | ||
| ) | ||
| response_json = self.connection.send_server_request(request) |
In case of a predicate_server failure, users will not be able to query the database at all. Not even with the MASK functions. This could be a critical point of failure for the system.
Failure vs error are very different. If there is a legitimate error with connecting to the server, my understanding was that we wanted to abort. If the server responds just fine but indicates it failed to derive an answer, then that's fine and we proceed normally.
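The distinction drawn here can be sketched like this. The exception class, the `"items"` and `"result"` keys, and the `"SUCCESS"` marker are assumptions for illustration; only the `"response"` key appears in the diff.

```python
from typing import Any, Callable


class MaskServerConnectionError(RuntimeError):
    """Hypothetical error raised when the server cannot be reached at all."""


def evaluate_batch(
    send: Callable[[dict[str, Any]], dict[str, Any]], request: dict[str, Any]
) -> list[Any]:
    """Abort on transport errors; treat per-item failures as 'no rewrite'."""
    try:
        response = send(request)
    except OSError as exc:
        # A genuine connection error: stop instead of silently continuing.
        raise MaskServerConnectionError("could not reach the mask server") from exc
    outputs: list[Any] = []
    for item in response.get("items", []):
        if item.get("result") == "SUCCESS":  # assumed status marker
            outputs.append(item.get("response"))
        else:
            # The server answered but could not derive a result: proceed normally.
            outputs.append(None)
    return outputs
```

A `None` entry would leave the corresponding expression untouched, so the query still runs with its original masking functions.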
pydough/mask_server/mask_server.py
Outdated
| "column_reference": f"{item.table_path}.{item.column_name}", | ||
| "column_ref": { | ||
| "kind": "fqn", | ||
| "value": f"{self.server_address}.{item.table_path}.{item.column_name}", |
The separator should be "/". item.table_path is a composite name whose elements are separated by ".". Any element can be enclosed in double quotes or backticks and contain "." as part of its name. Additionally, any character in the name that equals the enclosing character is escaped by doubling it.
We could wait to see the real implementation before making this kind of change.
Gonna do this. The problem for table path is how to handle varying edge cases of what table_path looks like:
db.schema.col -> db/schema/col
"a.b"."c.d"."e.f" -> a.b/c.d/e.f
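One possible way to handle both of these cases, as a sketch only: it assumes elements may be wrapped in double quotes or backticks and that the enclosing character is escaped by doubling it, as described earlier in this thread. Both function names are hypothetical.

```python
def split_table_path(path: str) -> list[str]:
    """Split a dotted table path into elements, honoring quoted elements."""
    parts: list[str] = []
    current: list[str] = []
    quote: str | None = None
    i = 0
    while i < len(path):
        ch = path[i]
        if quote is not None:
            if ch == quote:
                if i + 1 < len(path) and path[i + 1] == quote:
                    # Doubled enclosing character: keep one copy, skip the other.
                    current.append(quote)
                    i += 1
                else:
                    quote = None
            else:
                current.append(ch)
        elif ch in ('"', "`"):
            quote = ch
        elif ch == ".":
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if quote is not None:
        raise ValueError(f"Unterminated quoted element in {path!r}")
    parts.append("".join(current))
    return parts


def to_slash_path(path: str) -> str:
    """Join the unquoted elements with "/" instead of "."."""
    return "/".join(split_table_path(path))
```

This sketch does not address element names that themselves contain "/", which would need an agreed escaping rule on the server side.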
pydough/mask_server/mask_server.py
Outdated
|
|
||
| assert batch != [], "Batch cannot be empty." | ||
|
|
||
| path: str = "v1/predicates/batch-evaluate" |
path could be a class variable, so we don't need to pass it as a parameter to other class methods like generate_request()
Good point, moved into a class var of MaskServerInfo that gets passed into the ServerRequest by generate_request
pydough/mask_server/mask_server.py
Outdated
| self, | ||
| batch: list[MaskServerInput], | ||
| path: str, | ||
| method: RequestMethod, |
Including path and method as parameters looks like an attempt to make generate_request() more general. However, given all the other specific parameters and actions, I think this method is very specific to batch-evaluate. Maybe path and method could be class properties, since they will not change for this method. If more request methods are required in the future, those paths and methods could also be part of the class.
I think method doesn't need to be a class property; it can just get baked into the method's construction of a ServerRequest instance.
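A sketch of the shape these two threads converge on: the path lives in a class variable and the HTTP method is fixed inside generate_request. The classes below are simplified stand-ins that reuse the names from the PR; the attribute name, the "POST" method, and the payload keys are assumptions.

```python
from dataclasses import dataclass
from typing import Any, ClassVar


@dataclass
class ServerRequest:
    # Simplified stand-in for the real request type.
    path: str
    method: str
    payload: dict[str, Any]


class MaskServerInfo:
    # The batch-evaluate path no longer needs to be threaded through call sites.
    BATCH_EVALUATE_PATH: ClassVar[str] = "v1/predicates/batch-evaluate"

    def generate_request(self, batch: list[Any], dry_run: bool) -> ServerRequest:
        assert batch != [], "Batch cannot be empty."
        return ServerRequest(
            path=self.BATCH_EVALUATE_PATH,
            method="POST",  # assumed; baked in here, not passed as a parameter
            payload={"items": batch, "dry_run": dry_run},  # assumed payload keys
        )
```

If further endpoints are added later, each would get its own class variable and its own request-building method.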
pydough/mask_server/mask_server.py
Outdated
| """ | ||
| Generate a list of server outputs from the server response. | ||
| Generate a list of server outputs from the server response of a | ||
| non-dry-run request. |
What happens when the request is a dry-run? We are calling generate_result(response_json) in L174 for all batch-evaluate requests.
I didn't like the design idea of having the dry-run and the actual call on the same API path, because they are different things called at different times. We can't change that, but could it make sense to separate them on our side? At least in how we process the response?
This comment should be rolled-back. The function is the same for both, the difference is that dry-runs have an empty payload for the records.
| - `DATEDIFF` | ||
| """ | ||
|
|
||
| PREDICATE_OPERATORS: set[str] = { |
What are the criteria for a predicate operator to be included here?
These are the operators that are actually predicates, i.e. they return a boolean.
E.g. SUBSTRING can be inside the expression, but should not be the expression itself.
E.g. we wouldn't send abs(expr + 2) to the predicate server, but we would send abs(expr + 2) < 13, we wouldn't send LOWER(expr[:5]) but we would send CONTAINS(LOWER(expr[:5]), 'a')
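The rule in this reply can be sketched as a check on the outermost operator only. The operator names in the set and in the sample expressions are assumed placeholders, and the flat list layout (operator, argument count, arguments) is inferred from the BETWEEN test input in this PR.

```python
# Assumed subset of operator names that return a boolean.
PREDICATE_OPERATORS: set[str] = {"LT", "CONTAINS", "BETWEEN", "NOT"}


def is_predicate_expression(expression: list) -> bool:
    """Only expressions whose outermost operator is a predicate get sent."""
    return len(expression) > 0 and expression[0] in PREDICATE_OPERATORS


# abs(expr + 2): a valid sub-expression, but not a predicate on its own.
abs_only = ["ABS", 1, "ADD", 2, "__col__", 2]
# abs(expr + 2) < 13: the outermost operator is a comparison, so it qualifies.
abs_compared = ["LT", 2, "ABS", 1, "ADD", 2, "__col__", 2, 13]
```

Non-predicate operators can still appear anywhere below the top level, which matches the SUBSTRING and LOWER examples above.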
| # from the earlier check. | ||
| for inp in input_exprs: | ||
| assert inp is not None | ||
| result.extend(inp) |
Remember that a literal string may need to be wrapped in QUOTE if it matches an operator name.
Ahhh good point. I'll do that for literal string handling.
| }, | ||
| ... | ||
| ], | ||
| "expression_format": {"name": "linear", "version": "0.2.0"} |
| "expression_format": {"name": "linear", "version": "0.2.0"} | |
| "expression_format": {"name": "linear", "version": "0.2.0"}, |
| Mask Server and replacing the candidate expressions with the appropriate | ||
| responses from the server. |
| Mask Server and replacing the candidate expressions with the appropriate | |
| responses from the server. | |
| Mask Server. First send all candidates using the dry run flag, then selects the best candidates to be replaced with the appropriate response from the Mask Server. |
Augmenting relational optimization to rewrite expressions containing an UNMASK operator when a server is mounted to the PyDough session (and the environment variable is activated):
- New shuttles are added to the additional_shuttles list, before the masking literal comparisons shuttle.
- MaskServerCandidateShuttle is a no-op shuttle that just traverses the entire tree to find expressions that can potentially be rewritten and adds them to a pool.
- MaskServerRewriteShuttle looks for expressions in the candidate shuttle's pool, and once it finds one it sends every candidate in the pool to the mask server in a single batch request, processing the results to create the new relational node. The candidate pool is then emptied so future invocations will not redo the same batch calculation.
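The two-pass flow described above can be sketched as follows. The class names, the `send_batch` callable, and the use of plain hashable values as expressions are simplifications for illustration, not the PR's actual shuttle classes.

```python
from typing import Any, Callable, Hashable


class CandidatePool:
    """Collects rewrite candidates, adding each expression at most once."""

    def __init__(self) -> None:
        self.pending: list[Hashable] = []
        self.processed: set[Hashable] = set()

    def add(self, expr: Hashable) -> None:
        if expr not in self.processed:
            self.processed.add(expr)
            self.pending.append(expr)


class Rewriter:
    """On the first pooled expression it meets, sends the whole pool as one batch."""

    def __init__(
        self, pool: CandidatePool, send_batch: Callable[[list[Hashable]], list[Any]]
    ) -> None:
        self.pool = pool
        self.send_batch = send_batch
        self.responses: dict[Hashable, Any] = {}

    def rewrite(self, expr: Hashable) -> Any:
        if expr in self.pool.pending:
            batch = list(self.pool.pending)
            self.responses.update(zip(batch, self.send_batch(batch)))
            # Empty the pool so later visits do not repeat the batch request.
            self.pool.pending.clear()
        replacement = self.responses.get(expr)
        return expr if replacement is None else replacement
```

In this sketch a `None` response means the server had no rewrite for that candidate, so the original expression is kept.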