@PGijsbers (Contributor) commented Dec 16, 2025

Update the repo and tests to work with openml services.
For now, we duplicate some of the openml-services code here, because that gives me the flexibility to change things for testing purposes. Depending on how far the two diverge, it might make sense either to use services directly or to provide something on top of it.

Summary by Sourcery

Adjust routing, formatting, and test expectations to align the API with the openml-services deployment and data model.

Bug Fixes:

  • Fix dataset status transition tests to target dynamic in-preparation and deactivated dataset IDs.
  • Correct flow subflow handling and migration tests to match the updated flow/component structure returned by the PHP API.
  • Normalize dataset and task URL expectations in migration and router tests to account for base URL changes and HttpURL serialization quirks.

Enhancements:

  • Introduce shared TOML-based routing configuration and use it to generate dataset, parquet, and task URLs instead of hard-coded hosts.
  • Simplify flow subflow representation to return nested flows directly rather than wrapped objects with identifiers.
  • Relax JSON-LD schema typing for mldcat_ap graph context to accommodate a wider range of context values.

Tests:

  • Update OpenML router tests and migration tests to assert against the new URLs, statuses, and flow structures used by openml-services.
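
The URL normalization mentioned under Bug Fixes can be illustrated with a small helper. This is only a sketch of one way to make `http://host:80/...` and `http://host/...` compare equal in a test; the function name `strip_default_port` and the example host are assumptions and are not taken from the PR.

```python
from urllib.parse import urlsplit, urlunsplit

# Ports that may be omitted without changing the meaning of the URL.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Drop an explicit port when it is the default for the URL's scheme."""
    parts = urlsplit(url)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # Keep everything before the final ":" (host and any userinfo).
        parts = parts._replace(netloc=parts.netloc.rsplit(":", 1)[0])
    return urlunsplit(parts)


# Hypothetical URLs: both spellings normalize to the same string.
with_port = strip_default_port("http://example.test:80/data/1")
without_port = strip_default_port("http://example.test/data/1")
print(with_port == without_port)
```

Normalizing both sides before comparing keeps the assertion independent of whether a serializer chooses to print the default port.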

@coderabbitai (bot) commented Dec 16, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.


@sourcery-ai (bot) commented Dec 16, 2025

Reviewer's Guide

Adjusts OpenML routing, URL formatting, and tests to align with the openml-services stack, externalizing server/minio URLs into config and simplifying flow subflow structures.

Sequence diagram for dataset request using routing configuration

sequenceDiagram
    actor User
    participant API as OpenML_API
    participant DatasetsRouter
    participant Formatting as Formatting_module
    participant Config as Config_module
    participant ConfigFile as config_toml

    User->>API: GET /openml/datasets/{id}
    API->>DatasetsRouter: get_dataset(dataset_id)
    DatasetsRouter->>Formatting: _format_dataset_url(dataset_row)
    Formatting->>Config: load_routing_configuration()
    Config->>ConfigFile: read_text()
    ConfigFile-->>Config: TOML contents
    Config-->>Formatting: routing configuration (server_url, minio_url)
    Formatting-->>DatasetsRouter: dataset_url
    DatasetsRouter->>Formatting: _format_parquet_url(dataset_row)
    Formatting->>Config: load_routing_configuration() [cached]
    Config-->>Formatting: routing configuration
    Formatting-->>DatasetsRouter: parquet_url
    DatasetsRouter-->>API: DatasetMetadata JSON (without minio_url field)
    API-->>User: HTTP 200 with dataset metadata and service-based URLs

Class diagram for updated configuration and dataset schemas

classDiagram
    class ConfigModule {
        Path CONFIG_PATH
        +TomlTable _load_configuration(file Path)
        +TomlTable load_routing_configuration(file Path)
        +TomlTable load_database_configuration(file Path)
    }

    class DatasetMetadata {
        +int did
        +str name
        +str version
        +str url
        +str~None parquet_url
        +int file_id
        +DatasetFileFormat format_
    }

    class JsonLDGraph {
        +str~dict~context
        +list~JsonLDObject~ graph
    }

    class DatasetFileFormat {
        <<enumeration>>
        ARFF
        CSV
        PARQUET
    }

    DatasetMetadata ..> DatasetFileFormat : uses
    ConfigModule <.. DatasetMetadata : provides_routing_for_urls
    ConfigModule <.. JsonLDGraph : shared_configuration_context

File-Level Changes

Change Details Files
Externalize routing configuration (server and MinIO URLs) and reuse it across formatting and task URL generation.
  • Introduce CONFIG_PATH and a shared _load_configuration helper in config.py
  • Add load_routing_configuration to read routing section from config.toml
  • Refactor load_database_configuration to use _load_configuration
  • Extend config.toml with [routing] section containing minio_url and server_url
  • Use load_routing_configuration() in core/formatting.py to derive dataset and parquet URLs instead of hard-coded hosts
  • Use load_routing_configuration() in tasks router to inject base_url into JSON template
src/config.py
src/config.toml
src/core/formatting.py
src/routers/openml/tasks.py
Align dataset- and task-related tests and schemas with new openml-services behavior and URL layout.
  • Update expected dataset URL, parquet URL, and status fields in datasets_test to match openml-services and new routing
  • Change tests that reference fixed dataset IDs to use constants for IN_PREPARATION_ID and DEACTIVATED_DATASETS
  • Adjust constants for in-preparation and deactivated datasets to match new fixture data
  • Remove deprecated minio_url field from DatasetMetadata schema and stop populating it in datasets router
  • Normalize URL comparison in datasets migration test to account for port 80 omission
  • Update expected data_splits_url in task_test to use php-api host and port
tests/routers/openml/datasets_test.py
tests/constants.py
src/schemas/datasets/openml.py
src/routers/openml/datasets.py
tests/routers/openml/migration/datasets_migration_test.py
tests/routers/openml/task_test.py
Simplify OpenML flow subflow structure to match new API and update tests/migration logic accordingly.
  • Change flows.get_flow to return subflows as a flat list of child flows instead of identifier/flow wrapper objects
  • Update flows_test expected JSON shape for subflows to match new flat structure
  • Adjust flows migration test conversion to treat subflows as plain flows instead of nested under subflow['flow'] and map expected component accordingly
src/routers/openml/flows.py
tests/routers/openml/flows_test.py
tests/routers/openml/migration/flows_migration_test.py
Align test HTTP clients and dataset/task URLs with new PHP API service endpoints and path conventions.
  • Introduce PHP_API_URL constant in tests/conftest.py and point it to openml-php-rest-api:80/api/v1/json
  • Update php_api fixture to use PHP_API_URL
  • Adjust expected dataset URL/parquet URL and task data_splits_url hostnames to php-api/minio to match the openml-services stack
tests/conftest.py
tests/routers/openml/datasets_test.py
tests/routers/openml/task_test.py
Minor schema typing adjustment for JSON-LD graph context to satisfy type checking.
  • Add type: ignore to JsonLDGraph.context Field definition where serialization_alias is '@context'
src/schemas/datasets/mldcat_ap.py
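
To illustrate the subflow change listed above (wrapper objects replaced by plain nested flows), here is a minimal sketch of the transformation. The helper name `flatten_subflows` and the example field values are assumptions; only the `identifier`/`flow` wrapper shape and the `subflows` key come from the description.

```python
def flatten_subflows(flow: dict) -> dict:
    """Return a copy of `flow` whose subflows are plain nested flows.

    Entries shaped like {"identifier": ..., "flow": {...}} are unwrapped;
    entries that are already plain flows are kept. A flow without a
    "subflows" key gets an empty list.
    """
    children = []
    for entry in flow.get("subflows", []):
        child = entry["flow"] if "flow" in entry else entry
        children.append(flatten_subflows(child))
    return {**flow, "subflows": children}


# Hypothetical flow in the old, wrapped shape.
wrapped = {
    "id": 1,
    "name": "pipeline",
    "subflows": [
        {"identifier": "step", "flow": {"id": 2, "name": "scaler", "subflows": []}},
    ],
}
flat = flatten_subflows(wrapped)
print(flat["subflows"][0]["name"])
```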

Possibly linked issues

  • #(unknown): PR updates parquet_url construction and related tests to match the new MinIO bucket and server URL structure.
  • #: PR introduces routing config and replaces hard-coded URLs, directly addressing the issue’s test server configurability.


@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • In convert_flow_naming_and_defaults, the loop for subflow in flow["subflows"]: subflow = convert_flow_naming_and_defaults(subflow) does not update the list entries; consider enumerating and assigning back into flow["subflows"][i] or rebuilding the list so converted subflows are preserved.
  • Now that server_url and minio_url come from config and are concatenated with path fragments (e.g. in _format_dataset_url and _format_parquet_url), it would be safer to normalize/validate that these base URLs always end with exactly one trailing slash (or use a URL join helper) to avoid subtle double-slash or missing-slash issues when config changes.
  • Instead of adding # type: ignore on JsonLDGraph.context, consider adjusting the annotation (e.g. allowing a broader mapping type) or using a typing.Annotated/AliasChoices style approach so that Pydantic’s serialization alias still type-checks cleanly.
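
The first point above can be reproduced in isolation. In this sketch `rename_keys` is a hypothetical stand-in for `convert_flow_naming_and_defaults`, assumed to return a new dict; if the real function mutates its argument in place, the original loop would already behave as intended.

```python
def rename_keys(flow: dict) -> dict:
    """Hypothetical stand-in: returns a NEW dict with an upper-cased name."""
    return {**flow, "name": flow["name"].upper()}


flow = {"name": "parent", "subflows": [{"name": "child"}]}

# Rebinding the loop variable only changes the local name `subflow`;
# the list still holds the original, unconverted dict.
for subflow in flow["subflows"]:
    subflow = rename_keys(subflow)
after_loop = flow["subflows"][0]["name"]

# Rebuilding the list stores the converted subflows.
flow["subflows"] = [rename_keys(subflow) for subflow in flow["subflows"]]
after_rebuild = flow["subflows"][0]["name"]

print(after_loop, after_rebuild)
```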

## Individual Comments

### Comment 1
<location> `tests/routers/openml/migration/flows_migration_test.py:41-42` </location>
<code_context>
     new = nested_remove_single_element_list(new)

     expected = php_api.get(f"/flow/{flow_id}").json()["flow"]
+    if subflow := expected.get("component"):
+        expected["component"] = subflow["flow"]
     # The reason we don't transform "new" to str is that it becomes harder to ignore numeric type
</code_context>

<issue_to_address>
**suggestion:** Component post-processing assumes a single `component` entry with a `flow` key and may not handle multiple components

`component` is assumed to be a dict with a `flow` key here. If the PHP API ever returns multiple components (e.g. a list) or changes the shape, this will fail or produce incorrect comparisons. Consider asserting the expected shape and, if a list is returned, normalizing each element (as in `convert_flow_naming_and_defaults`) so migration tests handle multi-component flows correctly.
</issue_to_address>
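
One way to act on this suggestion is to check the shape explicitly and accept either a single component or a list of them. This is a sketch only: `unwrap_components` is a hypothetical helper, and whether the PHP API ever returns a list here is not confirmed in this thread.

```python
def unwrap_components(flow: dict) -> dict:
    """Replace each component wrapper with the flow it contains.

    A single wrapper stays a single flow (as in the test code above);
    a list of wrappers becomes a list of flows.
    """
    component = flow.get("component")
    if not component:
        return flow
    entries = component if isinstance(component, list) else [component]
    for entry in entries:
        # Fail loudly on an unexpected shape instead of comparing wrong data.
        if not isinstance(entry, dict) or "flow" not in entry:
            raise TypeError(f"unexpected component shape: {entry!r}")
    unwrapped = [entry["flow"] for entry in entries]
    return {**flow, "component": unwrapped if isinstance(component, list) else unwrapped[0]}


# Hypothetical payloads in both shapes.
single = unwrap_components({"id": 1, "component": {"identifier": "a", "flow": {"id": 2}}})
multiple = unwrap_components(
    {"id": 1, "component": [{"identifier": "a", "flow": {"id": 2}},
                            {"identifier": "b", "flow": {"id": 3}}]}
)
print(single["component"], multiple["component"])
```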

As per the linked issue, the database setup container exiting is
actually what caused the non-zero return value for docker compose up,
even when we expect the database setup container to exit.
@PGijsbers PGijsbers merged commit 323b87b into develop Dec 18, 2025
5 checks passed
@PGijsbers PGijsbers deleted the all-green branch December 18, 2025 12:44
@PGijsbers PGijsbers restored the all-green branch December 18, 2025 12:45
PGijsbers added a commit that referenced this pull request Dec 19, 2025
For now, the relevant definition files are still maintained in this repository so that they can change independently for a little while, as the server is under its most active development. We can then consider which changes should be merged into services to reduce duplication again.
@PGijsbers PGijsbers deleted the all-green branch December 19, 2025 08:59