Make tests pass with openml-services #217

PGijsbers · 2025-12-16T14:16:32Z

Update the repository and tests to work with openml-services.
For now, some of the code is duplicated, which gives me the flexibility to change things for testing purposes. Depending on how far the two diverge, it might make sense either to use services directly or to provide a layer on top of it.

Summary by Sourcery

Adjust routing, formatting, and test expectations to align the API with the openml-services deployment and data model.

Bug Fixes:

Fix dataset status transition tests to target dynamic in-preparation and deactivated dataset IDs.
Correct flow subflow handling and migration tests to match the updated flow/component structure returned by the PHP API.
Normalize dataset and task URL expectations in migration and router tests to account for base URL changes and HttpURL serialization quirks.

Enhancements:

Introduce shared TOML-based routing configuration and use it to generate dataset, parquet, and task URLs instead of hard-coded hosts.
Simplify flow subflow representation to return nested flows directly rather than wrapped objects with identifiers.
Relax JSON-LD schema typing for mldcat_ap graph context to accommodate a wider range of context values.

Tests:

Update OpenML router tests and migration tests to assert against the new URLs, statuses, and flow structures used by openml-services.

coderabbitai · 2025-12-16T14:16:40Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

sourcery-ai · 2025-12-16T14:16:41Z

Reviewer's Guide

Adjusts OpenML routing, URL formatting, and tests to align with the openml-services stack, externalizing server/minio URLs into config and simplifying flow subflow structures.

Sequence diagram for dataset request using routing configuration

sequenceDiagram
    actor User
    participant API as OpenML_API
    participant DatasetsRouter
    participant Formatting as Formatting_module
    participant Config as Config_module
    participant ConfigFile as config_toml

    User->>API: GET /openml/datasets/{id}
    API->>DatasetsRouter: get_dataset(dataset_id)
    DatasetsRouter->>Formatting: _format_dataset_url(dataset_row)
    Formatting->>Config: load_routing_configuration()
    Config->>ConfigFile: read_text()
    ConfigFile-->>Config: TOML contents
    Config-->>Formatting: routing configuration (server_url, minio_url)
    Formatting-->>DatasetsRouter: dataset_url
    DatasetsRouter->>Formatting: _format_parquet_url(dataset_row)
    Formatting->>Config: load_routing_configuration() [cached]
    Config-->>Formatting: routing configuration
    Formatting-->>DatasetsRouter: parquet_url
    DatasetsRouter-->>API: DatasetMetadata JSON (without minio_url field)
    API-->>User: HTTP 200 with dataset metadata and service-based URLs
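
The formatting step in the diagram above amounts to combining a configured base URL with a path. A minimal sketch of that step follows, assuming a hypothetical `join_url` helper and a made-up path layout; neither is taken from the PR. It joins the two parts with exactly one slash regardless of how the base URL is written in the config.

```python
def join_url(base_url: str, path: str) -> str:
    """Join a base URL and a path with exactly one slash between them."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


def format_dataset_url(server_url: str, file_id: int, filename: str) -> str:
    """Build a dataset download URL; the path layout here is illustrative only."""
    return join_url(server_url, f"data/download/{file_id}/{filename}")
```

Plain string joining is used instead of `urllib.parse.urljoin` because `urljoin` replaces the last path segment of a base URL that lacks a trailing slash, which is easy to trip over when the base comes from configuration.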

Class diagram for updated configuration and dataset schemas

classDiagram
    class ConfigModule {
        Path CONFIG_PATH
        +TomlTable _load_configuration(file Path)
        +TomlTable load_routing_configuration(file Path)
        +TomlTable load_database_configuration(file Path)
    }

    class DatasetMetadata {
        +int did
        +str name
        +str version
        +str url
        +str~None parquet_url
        +int file_id
        +DatasetFileFormat format_
    }

    class JsonLDGraph {
        +str~dict~context
        +list~JsonLDObject~ graph
    }

    class DatasetFileFormat {
        <<enumeration>>
        ARFF
        CSV
        PARQUET
    }

    DatasetMetadata ..> DatasetFileFormat : uses
    ConfigModule <.. DatasetMetadata : provides_routing_for_urls
    ConfigModule <.. JsonLDGraph : shared_configuration_context

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Externalize routing configuration (server and MinIO URLs) and reuse it across formatting and task URL generation. | Introduce CONFIG_PATH and a shared _load_configuration helper in config.py; Add load_routing_configuration to read routing section from config.toml; Refactor load_database_configuration to use _load_configuration; Extend config.toml with [routing] section containing minio_url and server_url; Use load_routing_configuration() in core/formatting.py to derive dataset and parquet URLs instead of hard-coded hosts; Use load_routing_configuration() in tasks router to inject base_url into JSON template | `src/config.py`, `src/config.toml`, `src/core/formatting.py`, `src/routers/openml/tasks.py` |
| Align dataset- and task-related tests and schemas with new openml-services behavior and URL layout. | Update expected dataset URL, parquet URL, and status fields in datasets_test to match openml-services and new routing; Change tests that reference fixed dataset IDs to use constants for IN_PREPARATION_ID and DEACTIVATED_DATASETS; Adjust constants for in-preparation and deactivated datasets to match new fixture data; Remove deprecated minio_url field from DatasetMetadata schema and stop populating it in datasets router; Normalize URL comparison in datasets migration test to account for port 80 omission; Update expected data_splits_url in task_test to use php-api host and port | `tests/routers/openml/datasets_test.py`, `tests/constants.py`, `src/schemas/datasets/openml.py`, `src/routers/openml/datasets.py`, `tests/routers/openml/migration/datasets_migration_test.py`, `tests/routers/openml/task_test.py` |
| Simplify OpenML flow subflow structure to match new API and update tests/migration logic accordingly. | Change flows.get_flow to return subflows as a flat list of child flows instead of identifier/flow wrapper objects; Update flows_test expected JSON shape for subflows to match new flat structure; Adjust flows migration test conversion to treat subflows as plain flows instead of nested under subflow['flow'] and map expected component accordingly | `src/routers/openml/flows.py`, `tests/routers/openml/flows_test.py`, `tests/routers/openml/migration/flows_migration_test.py` |
| Align test HTTP clients and dataset/task URLs with new PHP API service endpoints and path conventions. | Introduce PHP_API_URL constant in tests/conftest.py and point it to openml-php-rest-api:80/api/v1/json; Update php_api fixture to use PHP_API_URL; Adjust expected dataset URL/parquet URL and task data_splits_url hostnames to php-api/minio to match the openml-services stack | `tests/conftest.py`, `tests/routers/openml/datasets_test.py`, `tests/routers/openml/task_test.py` |
| Minor schema typing adjustment for JSON-LD graph context to satisfy type checking. | Add type: ignore to JsonLDGraph.context Field definition where serialization_alias is '@context' | `src/schemas/datasets/mldcat_ap.py` |
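
The "port 80 omission" normalization mentioned in the table can be pictured as follows. This is only a sketch of one way to compare URLs that differ by an explicit default port; `strip_default_port` is a hypothetical helper, not the function used in the migration test, and it deliberately ignores userinfo and IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit

# Well-known default ports per scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Drop an explicit port when it is the default for the URL's scheme."""
    parts = urlsplit(url)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # Rebuild the network location from the hostname alone.
        parts = parts._replace(netloc=parts.hostname or "")
    return urlunsplit(parts)
```

Applying the helper to both sides of an assertion makes `http://php-api:80/...` and `http://php-api/...` compare equal while leaving non-default ports untouched.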

Possibly linked issues

#(unknown): PR updates parquet_url construction and related tests to match the new MinIO bucket and server URL structure.
#: PR introduces routing config and replaces hard-coded URLs, directly addressing the issue’s test server configurability.

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

In convert_flow_naming_and_defaults, the loop for subflow in flow["subflows"]: subflow = convert_flow_naming_and_defaults(subflow) does not update the list entries; consider enumerating and assigning back into flow["subflows"][i] or rebuilding the list so converted subflows are preserved.
Now that server_url and minio_url come from config and are concatenated with path fragments (e.g. in _format_dataset_url and _format_parquet_url), it would be safer to normalize/validate that these base URLs always end with exactly one trailing slash (or use a URL join helper) to avoid subtle double-slash or missing-slash issues when config changes.
Instead of adding # type: ignore on JsonLDGraph.context, consider adjusting the annotation (e.g. allowing a broader mapping type) or using a typing.Annotated/AliasChoices style approach so that Pydantic’s serialization alias still type-checks cleanly.
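
To make the first comment concrete, here is a small self-contained sketch of the rebinding pitfall. The `rename` function is a stand-in for the real conversion and is purely hypothetical. Note that the pitfall only bites when the conversion returns a new object; if `convert_flow_naming_and_defaults` mutates its argument in place, the original loop may already behave as intended.

```python
from typing import Any

Flow = dict[str, Any]


def rename(flow: Flow) -> Flow:
    """Stand-in conversion: return a shallow copy with an upper-cased name."""
    return {**flow, "name": flow["name"].upper()}


def convert_buggy(flow: Flow) -> Flow:
    flow = rename(flow)
    for subflow in flow.get("subflows", []):
        # Rebinds the loop variable only; the list still holds the old dicts.
        subflow = convert_buggy(subflow)
    return flow


def convert_fixed(flow: Flow) -> Flow:
    flow = rename(flow)
    # Rebuild the list so the converted subflows are actually kept.
    flow["subflows"] = [convert_fixed(sub) for sub in flow.get("subflows", [])]
    return flow
```

With a tree such as `{"name": "a", "subflows": [{"name": "b", "subflows": []}]}`, the buggy variant converts only the top-level name, while the fixed variant converts every level.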

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `convert_flow_naming_and_defaults`, the loop `for subflow in flow["subflows"]: subflow = convert_flow_naming_and_defaults(subflow)` does not update the list entries; consider enumerating and assigning back into `flow["subflows"][i]` or rebuilding the list so converted subflows are preserved.
- Now that `server_url` and `minio_url` come from config and are concatenated with path fragments (e.g. in `_format_dataset_url` and `_format_parquet_url`), it would be safer to normalize/validate that these base URLs always end with exactly one trailing slash (or use a URL join helper) to avoid subtle double-slash or missing-slash issues when config changes.
- Instead of adding `# type: ignore` on `JsonLDGraph.context`, consider adjusting the annotation (e.g. allowing a broader mapping type) or using a `typing.Annotated`/`AliasChoices` style approach so that Pydantic’s serialization alias still type-checks cleanly.

## Individual Comments

### Comment 1
<location> `tests/routers/openml/migration/flows_migration_test.py:41-42` </location>
<code_context>
     new = nested_remove_single_element_list(new)

     expected = php_api.get(f"/flow/{flow_id}").json()["flow"]
+    if subflow := expected.get("component"):
+        expected["component"] = subflow["flow"]
     # The reason we don't transform "new" to str is that it becomes harder to ignore numeric type
</code_context>

<issue_to_address>
**suggestion:** Component post-processing assumes a single `component` entry with a `flow` key and may not handle multiple components

`component` is assumed to be a dict with a `flow` key here. If the PHP API ever returns multiple components (e.g. a list) or changes the shape, this will fail or produce incorrect comparisons. Consider asserting the expected shape and, if a list is returned, normalizing each element (as in `convert_flow_naming_and_defaults`) so migration tests handle multi-component flows correctly.
</issue_to_address>

sourcery-ai · 2025-12-16T14:18:22Z

tests/routers/openml/migration/flows_migration_test.py

    expected = php_api.get(f"/flow/{flow_id}").json()["flow"]
+    if subflow := expected.get("component"):


suggestion: Component post-processing assumes a single component entry with a flow key and may not handle multiple components

component is assumed to be a dict with a flow key here. If the PHP API ever returns multiple components (e.g. a list) or changes the shape, this will fail or produce incorrect comparisons. Consider asserting the expected shape and, if a list is returned, normalizing each element (as in convert_flow_naming_and_defaults) so migration tests handle multi-component flows correctly.
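
One way to act on this suggestion is sketched below, assuming the PHP response wraps each component as a dict with a `flow` key (as the diff implies) and might return either a single wrapper or a list of them. The helper name `unwrap_components` and the list case are assumptions for illustration; the source does not confirm that the PHP API ever returns a list here.

```python
from typing import Any


def unwrap_components(expected: dict[str, Any]) -> dict[str, Any]:
    """Replace each {"flow": ...} wrapper under "component" with the flow itself."""
    component = expected.get("component")
    if not component:
        return expected

    wrappers = component if isinstance(component, list) else [component]
    flows = []
    for wrapper in wrappers:
        # Fail loudly if the response shape is not what the test assumes.
        if not isinstance(wrapper, dict) or "flow" not in wrapper:
            raise ValueError(f"unexpected component shape: {wrapper!r}")
        flows.append(wrapper["flow"])

    # Keep the original arity: a single wrapper yields a flow, a list yields a list.
    expected["component"] = flows if isinstance(component, list) else flows[0]
    return expected
```

Raising on an unexpected shape means a change in the PHP response surfaces as a clear test error rather than as a confusing comparison failure later on.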

As per the linked issue, the database setup container exiting is actually what caused the non-zero return value for docker compose up, even when we expect the database setup container to exit.

We currently still maintain the relevant definition files in this repository so that they can change independently for a little while, while the server is under its most active development. We can then consider which changes should be merged into services to reduce duplication again.

PGijsbers added 6 commits December 12, 2025 15:12

Declare constant for php api url

39ddd19

non-php non-slow tests green

0997ee3

Dataset migration tests are green

a11c08e

Update expected output for dataset metadata

699c7cb

Update test to account for removal of one nested level

8bbb7a0

Make non-integration tests pass again

d2814e1

sourcery-ai bot reviewed Dec 16, 2025

PGijsbers added 8 commits December 17, 2025 16:48

Update files to work with latest versions of services

175326f

Update the workflow for new docker compose

bf5becd

Add environment file for elastic search

cca1c13

Add PHP configuration

df51570

Correct container names

4669e4f

Without parallel build and index wait process is no longer needed

e43f697

add a network test

c490886

Add back wait

1050576

PGijsbers force-pushed the all-green branch from 764408f to 1050576 on December 18, 2025 10:13

PGijsbers added 3 commits December 18, 2025 11:16

Check if containers exit non-zero instead of the compose command

dc32125

As per the linked issue, the database setup container exiting is actually what caused the non-zero return value for docker compose up, even when we expect the database setup container to exit.

Clean up workflow

a718089

Change state back to account for database update script

02de613

PGijsbers merged commit 323b87b into develop Dec 18, 2025
5 checks passed

PGijsbers deleted the all-green branch December 18, 2025 12:44

PGijsbers restored the all-green branch December 18, 2025 12:45

PGijsbers deleted the all-green branch December 19, 2025 08:59
