Update write-kaggle-benchmarks skill for v0.5.0 SDK + CLI updates #11
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
  #### LLM resolution precedence (highest → lowest):
- 1. **Explicit model in code**: `task.run(llm=kbench.llms["google/gemini-3.5-flash"])`
+ 1. **Explicit model in code**: `task.run(llm=kbench.llms["gemini-2.5-pro"])`
Let's revert every "gemini-2.5-pro" back to "gemini-3.5-flash". We no longer need provider prefixes such as "google/" and "anthropic/" for these models, though.
Done in 342365b — reverted to gemini-3.5-flash / claude-haiku-4-5 (bare, no prefix). The slug-normalization gotcha already calls out that prefixes are unnecessary.
Sync the recipe with its two dependency skills (kaggle-benchmarks and kaggle-cli):
- Switch model examples to bare canonical slugs (gemini-2.5-pro, claude-sonnet-4) and rewrite the slug gotcha to reflect the CLI's normalization rules.
- Add CLI nuances: dataset-detach warning on re-push without -d, -f -s to backfill source notebooks into cached runs, sequential log streaming behavior.
- New "SDK features beyond the basics" section covering v0.5.0 additions: return-type annotation rule, store_task=False for sub-tasks, run object properties, reasoning=/temperature= on llm.prompt(), video/audio content types, chats.fork()/contexts.enter(), ChatRoom + Participant multi-agent pattern, @assertion_handler() and assert_tool_was_invoked, kbench.benchmark alias.
Per @dolaameng: keep the original example model names but drop the provider prefixes (google/, anthropic/) — the canonical-bare-slug guidance still holds.
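The prefix-dropping and `@` → `-` rules discussed in this thread can be illustrated with a small sketch. The helper name `normalize_slug` and the `some-model@001` input are hypothetical, and the CLI's actual normalization may cover more cases than the two assumed here.

```python
def normalize_slug(slug: str) -> str:
    """Hypothetical helper approximating the bare-slug form discussed above.

    Assumptions (not confirmed against the CLI source):
    - a provider prefix is everything up to and including the first "/";
    - every "@" is replaced by "-".
    """
    # Drop a leading provider prefix such as "google/" or "anthropic/".
    bare = slug.strip().split("/", 1)[-1]
    # Apply the "@" -> "-" substitution mentioned in the PR description.
    return bare.replace("@", "-")


if __name__ == "__main__":
    for example in (
        "google/gemini-3.5-flash",
        "anthropic/claude-haiku-4-5",
        "gemini-3.5-flash",
        "some-model@001",  # hypothetical slug, only to show the "@" rule
    ):
        print(f"{example!r} -> {normalize_slug(example)!r}")
```

Under these assumptions, prefixed and bare inputs map to the same key, which is why the review asks for the bare form in examples.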
Force-pushed from 342365b to 550de0c.
### Multimodal inputs
```python
from kaggle_benchmarks.content_types import videos, audios
```
@dolaameng: 1) What about image inputs? 2) Can we clarify that video input is for YouTube only?
Addressed in 887047d per the videos and images cookbook recipes:
- Images: added the three `images.from_url` / `from_path` / `from_base64` constructors and a note on the `llm.prompt` (auto-base64) vs `user.send` (raw URL pass-through) distinction.
- Videos: made the constraint explicit: YouTube URLs only, select Gemini models only; local files / non-YouTube URLs will error.
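As an illustration of the YouTube-only constraint, a benchmark author could pre-check a URL before handing it to the SDK. This is a minimal sketch using only the standard library; `is_youtube_url` and the host list are hypothetical and are not part of `kaggle_benchmarks`, whose own validation may differ.

```python
from urllib.parse import urlparse

# Hosts treated as YouTube in this sketch (an assumption, not the SDK's list).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def is_youtube_url(url: str) -> bool:
    """Return True if `url` is an http(s) URL on a known YouTube host."""
    parsed = urlparse(url)
    # Local paths and bare filenames have no http(s) scheme, so they fail here.
    if parsed.scheme not in ("http", "https"):
        return False
    return (parsed.hostname or "") in YOUTUBE_HOSTS


if __name__ == "__main__":
    for candidate in (
        "https://www.youtube.com/watch?v=VIDEO_ID",
        "https://youtu.be/VIDEO_ID",
        "https://example.com/clip.mp4",
        "/tmp/clip.mp4",
    ):
        print(candidate, "->", is_youtube_url(candidate))
```

A check like this only catches the local-file and non-YouTube cases early; it says nothing about whether the selected model accepts video input.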
Per @nicholaskang-us review comment on PR #11. Expands the Multimodal inputs section to (1) show the three images.from_* constructors plus the llm.prompt vs user.send URL-handling distinction, and (2) make the YouTube-only constraint for videos explicit.
Adds an **Advanced use cases** section to `write-kaggle-benchmarks/README.md` pointing agents at the two upstream skill files when they need surface area beyond the focused recipe:
- [`kaggle-benchmarks/SKILL.md`](https://github.com/Kaggle/kaggle-benchmarks/blob/ci/skills/kaggle-benchmarks/SKILL.md): full SDK reference
- [`kaggle-cli/skills/references/benchmarks.md`](https://github.com/Kaggle/kaggle-cli/blob/main/skills/references/benchmarks.md): full CLI reference

Motivated by the [2026-06-04 skill-test rerun](#11), where many rubric misses came from SDK surface the focused recipe intentionally doesn't name (`assert_not_empty`, `kbench.system.send`, `extract_code` + `script_runner`, `assess_response_with_judge`, `.evaluate(llm=[...])`, `chats.new`). Pointing agents at the upstream files lets them self-serve when those advanced patterns come up.

---------

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
Syncs `write-kaggle-benchmarks/SKILL.md` with recent updates to its two dependency skills:

Changes

1. Model slug format (correctness fix: old guidance was inverted)
- `-m` / dict-key examples switched from `google/gemini-3.5-flash` / `anthropic/claude-haiku-4-5` to bare canonical slugs (`gemini-2.5-pro`, `claude-sonnet-4`).
- Slug gotcha rewritten to reflect the CLI's normalization rules (… `@` → `-`, output always canonical) and to tell agents to standardize on the bare slug.

2. New "SDK features beyond the basics" section covering v0.5.0 additions absent from the recipe:
- `store_task=False` for sub-tasks.
- Run object properties (`run.passed`, `run.result`, `run.assertion_results`).
- `reasoning=` and `temperature=` on `llm.prompt()`, plus `kbench.last_reasoning_traces()`.
- `chats.fork()` and `contexts.enter()` for branching / isolated histories.
- `ChatRoom` + `Participant` multi-agent pattern (Pattern I.5).
- `@assertion_handler()` and `assert_tool_was_invoked`.
- `kbench.benchmark` as an alias for `kbench.task`.

3. CLI nuances added to Gotchas:
- Dataset-detach warning on re-push without `-d`.
- `-f -s` together to backfill source notebooks into cached runs.

4. Left alone (already correct): pacing / checkpoint discipline, silent-no-op `.run()` gotcha, init/auth env vars, `--wait` / `--poll-interval` semantics, repeated-flag gotcha, `delete` note, publish flag.

Why a human PR

Filed by Marvin directly (rather than via sloppy) because the orchestrator currently hardcodes `branch=ci` when cloning, and this repo's default is `main`; sloppy task limagoog-20260604162052-ad482b13 failed for that reason.
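The branch mismatch described above is the kind of failure that goes away if the clone step looks up the remote's default branch instead of assuming one. The sketch below is one way to do that with `git ls-remote --symref`. The orchestrator's code is not shown in this PR, so the function names are hypothetical and this is not a description of how sloppy works.

```python
import subprocess
from typing import Optional


def parse_default_branch(ls_remote_output: str) -> Optional[str]:
    """Extract the branch name from `git ls-remote --symref <url> HEAD` output.

    The symref line has the form "ref: refs/heads/<branch>\tHEAD".
    Returns None when no such line is present.
    """
    prefix = "refs/heads/"
    for line in ls_remote_output.splitlines():
        if line.startswith("ref: ") and line.rstrip().endswith("HEAD"):
            ref = line[len("ref: "):].split()[0]
            if ref.startswith(prefix):
                return ref[len(prefix):]
    return None


def remote_default_branch(repo_url: str) -> Optional[str]:
    """Ask the remote which branch HEAD points at (requires git and network)."""
    result = subprocess.run(
        ["git", "ls-remote", "--symref", repo_url, "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_default_branch(result.stdout)


if __name__ == "__main__":
    # Offline demonstration with canned output shaped like git's.
    sample = "ref: refs/heads/main\tHEAD\n0123abcd\tHEAD\n"
    print(parse_default_branch(sample))
```

Keeping the parser separate from the subprocess call makes the branch-resolution logic checkable without network access.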