Priority: High
PawBench has no ground-truth outputs to score against, whereas SWE-bench has human-verified patches.
Proposal
For each PawStyle scenario, provide a reference implementation:
- scenarios/pawstyle-independent/reference/server.py: working backend
- scenarios/pawstyle-independent/reference/index.html: working frontend
- scenarios/pawstyle-independent/endpoints.json: expected API contract
Score generated code against the reference:
- API endpoints match (paths, methods, status codes)
- JSON response schemas match (field names, types)
- AST-level comparison for key patterns (not string matching)
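
The three checks above could be sketched roughly as follows. This is a minimal illustration, not a proposed final scorer: the contract shape (a dict keyed by method and path, with a status code and a flat field-to-type schema), the function names, and the `/api/pets` paths in the usage note are all assumptions, since the format of endpoints.json is not yet defined. The AST check here only compares counts of node types, which ignores identifiers and whitespace; a real scorer would likely target specific patterns.

```python
import ast
from collections import Counter


def json_type(value):
    """Map a Python value decoded from JSON to a JSON type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    return "null"


def schema_of(payload):
    """Flat schema (field name -> JSON type name) of a JSON object."""
    return {field: json_type(value) for field, value in payload.items()}


def endpoint_score(reference, observed):
    """Fraction of reference endpoints matched on path, method, status, schema.

    Both arguments use a hypothetical shape:
    {(method, path): {"status": int, "schema": {field: type_name}}}
    """
    if not reference:
        return 1.0
    hits = 0
    for key, expected in reference.items():
        got = observed.get(key)
        if (
            got is not None
            and got["status"] == expected["status"]
            and got["schema"] == expected["schema"]
        ):
            hits += 1
    return hits / len(reference)


def ast_node_counts(source):
    """Count AST node types, so names and formatting do not matter."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def ast_similarity(reference_source, generated_source):
    """Multiset Jaccard similarity over AST node-type counts (0.0 to 1.0)."""
    ref = ast_node_counts(reference_source)
    gen = ast_node_counts(generated_source)
    union = sum((ref | gen).values())
    return sum((ref & gen).values()) / union if union else 1.0
```

As a usage example, a generated server that answers `GET /api/pets` correctly but returns 200 instead of 201 for `POST /api/pets` would score 0.5 from `endpoint_score`, while `ast_similarity("x = 1 + 2\n", "y   =   3+4\n")` treats the two snippets as structurally identical despite different names and spacing.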
Depends on: sandbox eval (execution-based scoring)