feat: gold standard reference implementations for scenarios

## Priority: High

PawBench has no ground-truth outputs to score generated code against. SWE-bench, by comparison, has human-verified patches.

### Proposal

For each PawStyle scenario, provide a reference implementation:
- `scenarios/pawstyle-independent/reference/server.py` — working backend
- `scenarios/pawstyle-independent/reference/index.html` — working frontend
- `scenarios/pawstyle-independent/endpoints.json` — expected API contract
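
The format of `endpoints.json` is not specified here. As a rough sketch only, the contract could be a list of endpoint entries keyed by method and path. The `EXAMPLE_CONTRACT` shape, the `/api/pets` routes, and the `load_contract` helper below are all hypothetical and exist purely for illustration:

```python
import json

# Hypothetical contract shape; the real endpoints.json format is still to be decided.
EXAMPLE_CONTRACT = """
{
  "endpoints": [
    {"method": "GET", "path": "/api/pets", "status": 200,
     "response": {"id": "integer", "name": "string"}},
    {"method": "POST", "path": "/api/pets", "status": 201,
     "response": {"id": "integer"}}
  ]
}
"""

REQUIRED_KEYS = {"method", "path", "status"}


def load_contract(text):
    """Parse contract JSON into {(METHOD, path): entry}, rejecting incomplete entries."""
    contract = {}
    for entry in json.loads(text)["endpoints"]:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"endpoint entry missing keys: {sorted(missing)}")
        # Normalise the method so "get" and "GET" refer to the same endpoint.
        contract[(entry["method"].upper(), entry["path"])] = entry
    return contract
```

Keying on `(method, path)` would let a scorer look up the expected status code and response fields for each request it replays against generated code.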

Score generated code against the reference:
- API endpoints match (paths, methods, status codes)
- JSON response schemas match (field names, types)
- AST-level comparison for key patterns (not string matching)
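
A minimal sketch of two of these checks, assuming the reference and generated backends are both Python and that their JSON responses have already been captured as dicts. The function names are hypothetical, and comparing sets of call names is just one possible reading of "key patterns":

```python
import ast


def json_type(value):
    """Map a Python value decoded from JSON to a JSON type name."""
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    return "null"


def schema_score(reference_body, generated_body):
    """Fraction of reference fields present in the generated body with the same type."""
    if not reference_body:
        return 1.0
    hits = sum(
        1
        for key, value in reference_body.items()
        if key in generated_body
        and json_type(generated_body[key]) == json_type(value)
    )
    return hits / len(reference_body)


def call_patterns(source):
    """Collect the names of called functions and methods from Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)
    return names


def pattern_score(reference_source, generated_source):
    """Jaccard similarity of call names; ignores formatting and local variable names."""
    ref = call_patterns(reference_source)
    gen = call_patterns(generated_source)
    if not ref and not gen:
        return 1.0
    return len(ref & gen) / len(ref | gen)
```

For example, `pattern_score("x = json.dumps(load(p))", "y=json.dumps(\n  load(q)\n)")` treats the two snippets as identical because both call `dumps` and `load`, even though the text differs. Extra fields in a generated response do not lower `schema_score`; whether they should is a design decision left open here.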

Depends on: sandbox eval (execution-based scoring)
