adds simple smoke tests by warmbowski · Pull Request #76 · allenai/olmo-eval

warmbowski · 2026-02-27T18:10:17Z

Description

Adds some simple smoke tests for running against deployed models on the litellm provider. These are mostly recreate some sanity check logic from this script that Pradeep runs against deployed olmo3 models on deployment: https://github.com/allenai/open-instruct/blob/pd-olmo-api-eval/scripts/evaluate_api_endpoints.py#L125

closes https://github.com/allenai/playground-issues-repo/issues/965

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)

Testing

Unit tests pass locally (pytest tests/ --ignore=tests/integration/)
Integration tests pass (if applicable)
New tests added for new functionality

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have added/updated documentation as needed
My changes generate no new warnings
Any dependent changes have been merged and published

warmbowski · 2026-03-19T20:52:19Z

    extracted_answer: Any = None
    metadata: dict[str, Any] = field(default_factory=dict)
    tool_calls: list[ToolCall] | None = None
+    provider_extras: dict[str, Any] = field(default_factory=dict)


this is a dict set up for provider specific (in this case litellm) properties from the response. Only thing add in this is has_reasoning. I suppose other providers could also implement this on LMOutput (i.e. embedded thinking tag). Mainly, I am wondering if this is okay to add this dict (can rename, maybe can just be in metadata, but that looks like its for metric calls).

undfined · 2026-03-30T17:16:08Z

+
+
+@dataclass(frozen=True, slots=True)
+class SubstringScorer(Scorer):


Look fro existing implementation of this, I think we have something.

undfined

lgtm

warmbowski requested a review from undfined February 27, 2026 18:10

warmbowski changed the title ~~adds smoke simple smoke tests~~ adds simple smoke tests Feb 27, 2026

warmbowski force-pushed the paull/add_labs_smoke_tests_for_deployed_models branch from 88eda73 to 4ea4ed0 Compare March 17, 2026 23:58

warmbowski marked this pull request as ready for review March 19, 2026 20:47

warmbowski commented Mar 19, 2026

View reviewed changes

undfined reviewed Mar 30, 2026

View reviewed changes

warmbowski added 4 commits April 8, 2026 08:32

adds smoke simple smoke tests

34d8e14

fix type errors

87f5995

refactor reasoning check into provider specific dict

ca7a14c

refactor for feedback

e922f33

warmbowski force-pushed the paull/add_labs_smoke_tests_for_deployed_models branch from 332a01a to e922f33 Compare April 8, 2026 15:51

warmbowski requested a review from undfined April 8, 2026 17:37

undfined approved these changes Apr 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adds simple smoke tests#76

adds simple smoke tests#76
warmbowski wants to merge 4 commits into
mainfrom
paull/add_labs_smoke_tests_for_deployed_models

warmbowski commented Feb 27, 2026 •

edited

Loading

Uh oh!

warmbowski Mar 19, 2026

Uh oh!

undfined Mar 30, 2026

Uh oh!

undfined left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants



		@dataclass(frozen=True, slots=True)
		class SubstringScorer(Scorer):

Conversation

warmbowski commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Testing

Checklist

Uh oh!

warmbowski Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

undfined Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

undfined left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

warmbowski commented Feb 27, 2026 •

edited

Loading