Make tests pass with openml-services #217
Conversation
Reviewer's Guide

Adjusts OpenML routing, URL formatting, and tests to align with the openml-services stack, externalizing server/minio URLs into config and simplifying flow subflow structures.

Sequence diagram for dataset request using routing configuration

sequenceDiagram
actor User
participant API as OpenML_API
participant DatasetsRouter
participant Formatting as Formatting_module
participant Config as Config_module
participant ConfigFile as config_toml
User->>API: GET /openml/datasets/{id}
API->>DatasetsRouter: get_dataset(dataset_id)
DatasetsRouter->>Formatting: _format_dataset_url(dataset_row)
Formatting->>Config: load_routing_configuration()
Config->>ConfigFile: read_text()
ConfigFile-->>Config: TOML contents
Config-->>Formatting: routing configuration (server_url, minio_url)
Formatting-->>DatasetsRouter: dataset_url
DatasetsRouter->>Formatting: _format_parquet_url(dataset_row)
Formatting->>Config: load_routing_configuration() [cached]
Config-->>Formatting: routing configuration
Formatting-->>DatasetsRouter: parquet_url
DatasetsRouter-->>API: DatasetMetadata JSON (without minio_url field)
API-->>User: HTTP 200 with dataset metadata and service-based URLs
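To make the sequence above concrete, here is a minimal sketch of what a cached routing loader and a URL formatter could look like. This is not the repository's code: the config file location, the `[routing]` table name, and the download path fragment are assumptions; only the function name `load_routing_configuration` and the `server_url`/`minio_url` keys come from the diagram.

```python
from functools import cache
from pathlib import Path
import tomllib

CONFIG_PATH = Path("config.toml")  # assumed location of the routing/database configuration


@cache
def load_routing_configuration(file: Path = CONFIG_PATH) -> dict[str, str]:
    """Read the TOML file once and return the routing table with server_url and minio_url."""
    configuration = tomllib.loads(file.read_text())
    return configuration["routing"]  # table name is an assumption


def _format_dataset_url(dataset_row: dict) -> str:
    """Build a dataset download URL from the configured server_url (illustrative path)."""
    routing = load_routing_configuration()
    # Strip any trailing slash before joining to avoid double- or missing-slash issues.
    base = routing["server_url"].rstrip("/")
    return f"{base}/data/v1/download/{dataset_row['file_id']}/{dataset_row['name']}"
```

Caching with `functools.cache` mirrors the `[cached]` note on the second configuration lookup in the diagram.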
Class diagram for updated configuration and dataset schemas

classDiagram
class ConfigModule {
Path CONFIG_PATH
+TomlTable _load_configuration(file Path)
+TomlTable load_routing_configuration(file Path)
+TomlTable load_database_configuration(file Path)
}
class DatasetMetadata {
+int did
+str name
+str version
+str url
+str~None parquet_url
+int file_id
+DatasetFileFormat format_
}
class JsonLDGraph {
+str~dict~ context
+list~JsonLDObject~ graph
}
class DatasetFileFormat {
<<enumeration>>
ARFF
CSV
PARQUET
}
DatasetMetadata ..> DatasetFileFormat : uses
ConfigModule <.. DatasetMetadata : provides_routing_for_urls
ConfigModule <.. JsonLDGraph : shared_configuration_context
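A rough Pydantic rendering of the schemas in the class diagram might look like the sketch below. The field names mirror the diagram; the enum values, aliases, and exact unions are assumptions rather than the repository's actual definitions, and `JsonLDObject` is simplified to a plain dict.

```python
from enum import Enum

from pydantic import BaseModel, Field


class DatasetFileFormat(str, Enum):
    ARFF = "arff"
    CSV = "csv"
    PARQUET = "parquet"


class DatasetMetadata(BaseModel):
    did: int
    name: str
    version: str
    url: str  # built from the configured server_url
    parquet_url: str | None = None  # built from the configured minio_url; may be absent
    file_id: int
    format_: DatasetFileFormat = Field(alias="format")  # trailing underscore avoids the builtin name


class JsonLDGraph(BaseModel):
    # A serialization alias keeps the JSON-LD "@context"/"@graph" keys in the output while the
    # attribute names stay valid Python identifiers, avoiding a blanket "# type: ignore".
    context: str | dict = Field(serialization_alias="@context")
    graph: list[dict] = Field(serialization_alias="@graph")  # JsonLDObject simplified to dict
```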
File-Level Changes
Possibly linked issues
Hey there - I've reviewed your changes - here's some feedback:

- In `convert_flow_naming_and_defaults`, the loop `for subflow in flow["subflows"]: subflow = convert_flow_naming_and_defaults(subflow)` does not update the list entries; consider enumerating and assigning back into `flow["subflows"][i]` or rebuilding the list so converted subflows are preserved (see the sketch after this list).
- Now that `server_url` and `minio_url` come from config and are concatenated with path fragments (e.g. in `_format_dataset_url` and `_format_parquet_url`), it would be safer to normalize/validate that these base URLs always end with exactly one trailing slash (or use a URL join helper) to avoid subtle double-slash or missing-slash issues when config changes.
- Instead of adding `# type: ignore` on `JsonLDGraph.context`, consider adjusting the annotation (e.g. allowing a broader mapping type) or using a `typing.Annotated`/`AliasChoices` style approach so that Pydantic's serialization alias still type-checks cleanly.
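A minimal sketch of the enumerate-and-assign-back fix from the first point above; the actual renaming logic is elided and the signature is assumed:

```python
def convert_flow_naming_and_defaults(flow: dict) -> dict:
    # ... renaming and default handling for the flow itself (elided) ...

    # Assign converted subflows back by index; rebinding the loop variable alone
    # would leave the entries in flow["subflows"] unchanged.
    for i, subflow in enumerate(flow.get("subflows", [])):
        flow["subflows"][i] = convert_flow_naming_and_defaults(subflow)
    return flow
```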
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `convert_flow_naming_and_defaults`, the loop `for subflow in flow["subflows"]: subflow = convert_flow_naming_and_defaults(subflow)` does not update the list entries; consider enumerating and assigning back into `flow["subflows"][i]` or rebuilding the list so converted subflows are preserved.
- Now that `server_url` and `minio_url` come from config and are concatenated with path fragments (e.g. in `_format_dataset_url` and `_format_parquet_url`), it would be safer to normalize/validate that these base URLs always end with exactly one trailing slash (or use a URL join helper) to avoid subtle double-slash or missing-slash issues when config changes.
- Instead of adding `# type: ignore` on `JsonLDGraph.context`, consider adjusting the annotation (e.g. allowing a broader mapping type) or using a `typing.Annotated`/`AliasChoices` style approach so that Pydantic’s serialization alias still type-checks cleanly.
## Individual Comments
### Comment 1
<location> `tests/routers/openml/migration/flows_migration_test.py:41-42` </location>
<code_context>
new = nested_remove_single_element_list(new)
expected = php_api.get(f"/flow/{flow_id}").json()["flow"]
+ if subflow := expected.get("component"):
+ expected["component"] = subflow["flow"]
# The reason we don't transform "new" to str is that it becomes harder to ignore numeric type
</code_context>
<issue_to_address>
**suggestion:** Component post-processing assumes a single `component` entry with a `flow` key and may not handle multiple components
`component` is assumed to be a dict with a `flow` key here. If the PHP API ever returns multiple components (e.g. a list) or changes the shape, this will fail or produce incorrect comparisons. Consider asserting the expected shape and, if a list is returned, normalizing each element (as in `convert_flow_naming_and_defaults`) so migration tests handle multi-component flows correctly.
</issue_to_address>
expected = php_api.get(f"/flow/{flow_id}").json()["flow"]
if subflow := expected.get("component"):
suggestion: Component post-processing assumes a single `component` entry with a `flow` key and may not handle multiple components

`component` is assumed to be a dict with a `flow` key here. If the PHP API ever returns multiple components (e.g. a list) or changes the shape, this will fail or produce incorrect comparisons. Consider asserting the expected shape and, if a list is returned, normalizing each element (as in `convert_flow_naming_and_defaults`) so migration tests handle multi-component flows correctly.
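A hedged sketch of how the test post-processing could normalize both shapes mentioned in the suggestion; the helper name is hypothetical, the dict branch matches the diff above, and the list branch assumes a possible multi-component response from the PHP API:

```python
def normalize_expected_components(expected: dict) -> dict:
    """Unwrap the nested 'flow' object(s) under 'component', whether a dict or a list."""
    component = expected.get("component")
    if isinstance(component, list):
        # Multiple components: normalize each entry.
        expected["component"] = [entry["flow"] for entry in component]
    elif isinstance(component, dict):
        # Single component, as the PHP API currently returns it.
        expected["component"] = component["flow"]
    return expected
```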
As per the linked issue, the database setup container exiting is actually what caused the non-zero return value for `docker compose up`, even though we expect that container to exit.
We currently still maintain the relevant definition files in this repository so they can change independently for a little while, while the server is under the most active development. We can then consider which changes should be merged to services to reduce duplication again.
Update the repo and tests to work with openml services.
For now we are duplicating some of the code, as this gives me some flexibility to change things for testing purposes. Depending on how divergent these things become, it might make sense to either use the services directly or provide something on top of them.
Summary by Sourcery
Adjust routing, formatting, and test expectations to align the API with the openml-services deployment and data model.
Bug Fixes:
Enhancements:
Tests: