Update write-kaggle-benchmarks skill for v0.5.0 SDK + CLI updates #11

Merged: dolaameng merged 3 commits into main from marvin/update-write-kaggle-benchmarks-skill on Jun 4, 2026

@kaggle-agent

Collaborator

Syncs write-kaggle-benchmarks/SKILL.md with recent updates to its two dependency skills (kaggle-benchmarks and kaggle-cli).

Changes

1. Model slug format (correctness fix — old guidance was inverted)

  • All -m / dict-key examples switched from google/gemini-3.5-flash / anthropic/claude-haiku-4-5 to bare canonical slugs (gemini-2.5-pro, claude-sonnet-4).
  • Slug gotcha rewritten to teach the CLI's normalization rules (prefix stripped, @-, output always canonical) and tell agents to standardize on the bare slug.
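
To make the "standardize on the bare slug" advice concrete, here is a minimal sketch of the prefix-stripping step only. `to_bare_slug` is a hypothetical helper, not part of the Kaggle CLI or SDK, and it does not attempt to reproduce the CLI's other normalization rules.

```python
def to_bare_slug(model: str) -> str:
    """Drop a leading provider prefix such as 'google/' or 'anthropic/'.

    Hypothetical helper: it mirrors only the "prefix stripped" rule and
    leaves an already-bare slug unchanged.
    """
    # Keep whatever follows the last '/', or the whole string if there is none.
    return model.rsplit("/", 1)[-1]


if __name__ == "__main__":
    for name in ("google/gemini-3.5-flash", "anthropic/claude-haiku-4-5", "claude-haiku-4-5"):
        print(name, "->", to_bare_slug(name))
```

Normalizing once, where the model name enters a script, keeps `-m` arguments and dict keys consistent with each other.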

2. New "SDK features beyond the basics" section covering v0.5.0 additions absent from the recipe:

  • Return-type annotation rule for tasks that return a value.
  • store_task=False for sub-tasks.
  • Run object properties (run.passed, run.result, run.assertion_results).
  • reasoning= and temperature= on llm.prompt(), plus kbench.last_reasoning_traces().
  • Video / audio content types.
  • chats.fork() and contexts.enter() for branching / isolated histories.
  • ChatRoom + Participant multi-agent pattern (Pattern I.5).
  • Custom assertions via @assertion_handler() and assert_tool_was_invoked.
  • kbench.benchmark as an alias for kbench.task.
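
This thread does not quote the wording of the return-type annotation rule. Assuming it means that a task function which returns a value must declare its return type, a pre-flight check could be sketched with the standard library alone. `has_return_annotation` and both example functions are hypothetical and do not touch the SDK.

```python
import inspect
from typing import Callable


def has_return_annotation(fn: Callable) -> bool:
    """Return True if `fn` declares a return type (including `-> None`)."""
    return inspect.signature(fn).return_annotation is not inspect.Signature.empty


def score_answer(answer: str) -> float:
    # Annotated: the return type is declared as float.
    return 1.0 if answer.strip() == "4" else 0.0


def unannotated(answer: str):
    # Not annotated: returns a value without declaring its type.
    return answer == "4"


if __name__ == "__main__":
    for fn in (score_answer, unannotated):
        print(fn.__name__, has_return_annotation(fn))
```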

3. CLI nuances added to Gotchas:

  • Dataset detach warning when re-pushing without -d.
  • -f -s together to backfill source notebooks into cached runs.
  • Sequential (non-interleaved) log streaming behavior.

4. Left alone (already correct): pacing / checkpoint discipline, silent-no-op .run() gotcha, init/auth env vars, --wait/--poll-interval semantics, repeated-flag gotcha, delete note, publish flag.

Why a human PR

Filed by Marvin directly (rather than via sloppy) because the orchestrator currently hardcodes branch=ci when cloning, and this repo's default is main — sloppy task limagoog-20260604162052-ad482b13 failed for that reason.
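
One way to avoid a hardcoded branch when cloning is to ask the remote which branch its `HEAD` points at. The sketch below parses the output of `git ls-remote --symref <url> HEAD`. It is an illustration only, not the orchestrator's code, and both function names are hypothetical.

```python
import subprocess
from typing import Optional


def parse_default_branch(ls_remote_output: str) -> Optional[str]:
    """Extract the branch name from `git ls-remote --symref <url> HEAD` output."""
    for line in ls_remote_output.splitlines():
        # The symref line looks like: "ref: refs/heads/main\tHEAD"
        if line.startswith("ref: ") and line.endswith("\tHEAD"):
            ref = line[len("ref: "):].split("\t", 1)[0]
            prefix = "refs/heads/"
            if ref.startswith(prefix):
                return ref[len(prefix):]
    return None


def default_branch(remote_url: str) -> Optional[str]:
    """Query the remote for its default branch (requires git and network access)."""
    result = subprocess.run(
        ["git", "ls-remote", "--symref", remote_url, "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_default_branch(result.stdout)
```

The parsing step is kept separate from the subprocess call so it can be checked without network access.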

@google-cla

google-cla Bot commented Jun 4, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Comment thread write-kaggle-benchmarks/SKILL.md Outdated

#### LLM resolution precedence (highest → lowest):
- 1. **Explicit model in code**: `task.run(llm=kbench.llms["google/gemini-3.5-flash"])`
+ 1. **Explicit model in code**: `task.run(llm=kbench.llms["gemini-2.5-pro"])`

Let's revert all "gemini-2.5-pro" to "gemini-3.5-flash". But we don't need the prefix like "google/" and "anthropic/" any more for these models.

Collaborator Author


Done in 342365b — reverted to gemini-3.5-flash / claude-haiku-4-5 (bare, no prefix). The slug-normalization gotcha already calls out that prefixes are unnecessary.

Sync the recipe with its two dependency skills (kaggle-benchmarks and
kaggle-cli):

- Switch model examples to bare canonical slugs (gemini-2.5-pro,
  claude-sonnet-4) and rewrite the slug gotcha to reflect the CLI's
  normalization rules.
- Add CLI nuances: dataset-detach warning on re-push without -d, -f -s
  to backfill source notebooks into cached runs, sequential log
  streaming behavior.
- New "SDK features beyond the basics" section covering v0.5.0
  additions: return-type annotation rule, store_task=False for
  sub-tasks, run object properties, reasoning=/temperature= on
  llm.prompt(), video/audio content types, chats.fork()/contexts.enter(),
  ChatRoom + Participant multi-agent pattern, @assertion_handler() and
  assert_tool_was_invoked, kbench.benchmark alias.
Per @dolaameng: keep the original example model names but drop the
provider prefixes (google/, anthropic/) — the canonical-bare-slug
guidance still holds.
@kaggle-agent kaggle-agent force-pushed the marvin/update-write-kaggle-benchmarks-skill branch from 342365b to 550de0c Compare June 4, 2026 16:52
@dolaameng dolaameng requested a review from nicholaskang-us June 4, 2026 17:39
Comment thread write-kaggle-benchmarks/SKILL.md Outdated

### Multimodal inputs
```python
from kaggle_benchmarks.content_types import videos, audios
```

Contributor

@dolaameng - 1) What about image inputs? 2) Can we clarify that video input is for YouTube only?

Collaborator Author

Addressed in 887047d per the videos and images cookbook recipes:

  1. Images — added the three images.from_url/from_path/from_base64 constructors and a note on the llm.prompt (auto-base64) vs user.send (raw URL pass-through) distinction.
  2. Videos — made the constraint explicit: YouTube URLs only, select Gemini models only; local files / non-YouTube URLs will error.
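
Both points above can be illustrated without the SDK. The sketch below is a standard-library approximation: `is_youtube_url` is a hypothetical pre-check for the video constraint (the host list is an assumption about what counts as a YouTube URL), and `to_base64` shows the kind of encoding the "auto-base64" note refers to. Neither function is part of `kaggle_benchmarks`.

```python
import base64
from urllib.parse import urlparse

# Assumed set of hosts treated as YouTube; adjust if the SDK accepts others.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def is_youtube_url(url: str) -> bool:
    """Return True only for http(s) URLs whose host is a known YouTube host."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and host in YOUTUBE_HOSTS


def to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


if __name__ == "__main__":
    print(is_youtube_url("https://www.youtube.com/watch?v=abc"))  # YouTube host
    print(is_youtube_url("/tmp/clip.mp4"))  # local path, no scheme or host
```

Rejecting local paths and non-YouTube URLs before a run starts surfaces the constraint earlier than waiting for the SDK to error.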

Per @nicholaskang-us review comment on PR #11. Expands the Multimodal
inputs section to (1) show the three images.from_* constructors plus
the llm.prompt vs user.send URL-handling distinction, and (2) make the
YouTube-only constraint for videos explicit.
@dolaameng dolaameng merged commit d823e96 into main Jun 4, 2026
6 checks passed
@dolaameng dolaameng deleted the marvin/update-write-kaggle-benchmarks-skill branch June 4, 2026 17:58
dolaameng pushed a commit that referenced this pull request Jun 4, 2026
Adds an **Advanced use cases** section to
`write-kaggle-benchmarks/README.md` pointing agents at the two upstream
skill files when they need surface area beyond the focused recipe:

- [`kaggle-benchmarks/SKILL.md`](https://github.com/Kaggle/kaggle-benchmarks/blob/ci/skills/kaggle-benchmarks/SKILL.md): full SDK reference
- [`kaggle-cli/skills/references/benchmarks.md`](https://github.com/Kaggle/kaggle-cli/blob/main/skills/references/benchmarks.md): full CLI reference

Motivated by the [2026-06-04 skill-test
rerun](#11) where many
rubric misses came from SDK surface the focused recipe intentionally
doesn't name (`assert_not_empty`, `kbench.system.send`, `extract_code` +
`script_runner`, `assess_response_with_judge`, `.evaluate(llm=[...])`,
`chats.new`). Pointing agents at the upstream files lets them self-serve
when those advanced patterns come up.

---------

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>