feat: gold standard reference implementations for scenarios #2

@zenprocess

Description

Priority: High

PawBench has no ground-truth outputs to score generated code against. SWE-bench, by comparison, has human-verified patches.

Proposal

For each PawStyle scenario, provide a reference implementation:

  • scenarios/pawstyle-independent/reference/server.py — working backend
  • scenarios/pawstyle-independent/reference/index.html — working frontend
  • scenarios/pawstyle-independent/endpoints.json — expected API contract
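The format of endpoints.json is not specified in this issue. As a rough sketch only, assuming a flat list of endpoints with a method, path, expected status code, and a field-to-type map for the response body, the contract could be loaded like this. The `EXAMPLE_CONTRACT` layout, the `Endpoint` class, `load_contract`, and the `/api/items` paths are all hypothetical and not taken from PawBench:

```python
import json
from dataclasses import dataclass

# Hypothetical contract layout; the issue does not define the real schema.
EXAMPLE_CONTRACT = """
{
  "endpoints": [
    {"method": "GET", "path": "/api/items", "status": 200,
     "response": {"id": "int", "name": "str"}},
    {"method": "POST", "path": "/api/items", "status": 201,
     "response": {"id": "int"}}
  ]
}
"""


@dataclass(frozen=True)
class Endpoint:
    method: str
    path: str
    status: int
    response: tuple  # sorted (field_name, type_name) pairs


def load_contract(text):
    """Parse the (assumed) contract JSON into a list of Endpoint records."""
    raw = json.loads(text)
    return [
        Endpoint(
            method=entry["method"].upper(),
            path=entry["path"],
            status=entry["status"],
            response=tuple(sorted(entry.get("response", {}).items())),
        )
        for entry in raw["endpoints"]
    ]


contract = load_contract(EXAMPLE_CONTRACT)
```

Keeping the contract as data separate from the reference server would let the scorer check a generated backend without importing the reference code.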

Score generated code against the reference:

  • API endpoints match (paths, methods, status codes)
  • JSON response schemas match (field names, types)
  • AST-level comparison for key patterns (not string matching)
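The three checks above might be sketched as follows. This is a minimal illustration under stated assumptions, not a proposed final scorer: the function names (`endpoint_score`, `schema_of`, `schemas_match`, `called_names`, `pattern_score`) are hypothetical, and "key patterns" is interpreted here as the set of function and method names a file calls, which is only one possible reading of the AST-level comparison.

```python
import ast


def endpoint_score(expected, observed):
    """Fraction of expected (method, path, status) triples found in observed."""
    expected, observed = set(expected), set(observed)
    if not expected:
        return 1.0
    return len(expected & observed) / len(expected)


def schema_of(value):
    """Reduce a decoded JSON value to its shape: field names and type names."""
    if isinstance(value, dict):
        return {key: schema_of(item) for key, item in value.items()}
    if isinstance(value, list):
        # Assumption: lists are homogeneous, so the first element is representative.
        return [schema_of(value[0])] if value else []
    return type(value).__name__


def schemas_match(reference_body, generated_body):
    """True when both bodies have the same field names and value types."""
    return schema_of(reference_body) == schema_of(generated_body)


def called_names(source):
    """Collect the names of functions and methods called in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)
    return names


def pattern_score(reference_src, generated_src):
    """Fraction of calls made by the reference that the generated code also makes."""
    reference_calls = called_names(reference_src)
    if not reference_calls:
        return 1.0
    return len(reference_calls & called_names(generated_src)) / len(reference_calls)


# Two sources that differ textually but make the same call.
REFERENCE_SRC = "import json\ndef handler(body):\n    return json.dumps(body)\n"
GENERATED_SRC = (
    "import json as j\n\n"
    "def handler(payload):  # renamed argument and alias\n"
    "    data = j.dumps(payload)\n"
    "    return data\n"
)
```

Because the comparison walks the syntax tree, renamed variables, import aliases, comments, and whitespace do not affect `pattern_score`, whereas a string diff of the two sources would report them as different.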

Depends on: sandbox eval (execution-based scoring)

Labels: enhancement (New feature or request)