Description
Problem
`generate_and_evaluate.py` and `evaluate.py` are getting harder to maintain as more runtime options are added.
Right now, argument parsing, defaults, validation, and execution logic are all mixed together in the same flow. This makes the code harder to read, harder to test, and harder to extend without increasing coupling.
Proposal
I want to move runtime configuration to a Pydantic-based schema and use that as the single source of truth for run options.
Instead of relying directly on many flat CLI arguments throughout execution code, the CLI layer will map arguments into a typed `RunConfig` object. Pipeline code will consume this config object.
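A hedged sketch of what the schema could look like. Section and field names here are assumptions drawn from the example at the end of this issue, not a final API:

```python
# Illustrative schema sketch; names and defaults are placeholders, not final.
from pydantic import BaseModel, Field


class ModelsConfig(BaseModel):
    judge_model: str  # required: which model acts as the judge


class JudgeConfig(BaseModel):
    swap_mode: str = "both"  # illustrative default
    max_out_tokens: int = Field(default=1024, gt=0)


class RuntimeConfig(BaseModel):
    max_model_len: int = Field(default=8192, gt=0)


class RunConfig(BaseModel):
    models: ModelsConfig
    judge: JudgeConfig = Field(default_factory=JudgeConfig)
    runtime: RuntimeConfig = Field(default_factory=RuntimeConfig)
```

With this shape, `RunConfig(models={"judge_model": "my-judge"})` fills in every other section from defaults, and any out-of-range value is rejected at construction time.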
Why this matters
- Keeps execution code focused on execution, not argument plumbing.
- Centralizes validation and default handling.
- Makes invalid configs fail early with clear errors.
- Improves readability for contributors and reviewers.
- Makes future additions safer (new options can be added in schema first, then used where needed).
Planned steps
- Introduce Pydantic config models (`RunConfig` + nested sections).
- Add a CLI adapter that converts CLI args to `RunConfig`.
- Refactor runtime entrypoints to use `RunConfig` directly.
- Add tests for config parsing, defaults, and validation.
- Keep behavior and outputs unchanged during this refactor.
- Add `pydantic` to dependencies.
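One possible shape for the CLI adapter step, assuming argparse stays as the front end. Flag names follow the "before" example at the end of this issue, and the minimal `RunConfig` stand-in is included only so the snippet runs on its own:

```python
import argparse

from pydantic import BaseModel, Field


# Minimal stand-in for the real RunConfig so this sketch is self-contained.
class JudgeConfig(BaseModel):
    swap_mode: str = "both"
    max_out_tokens: int = Field(default=1024, gt=0)


class RunConfig(BaseModel):
    judge_model: str
    max_model_len: int = Field(default=8192, gt=0)
    judge: JudgeConfig = Field(default_factory=JudgeConfig)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--judge-model", required=True)
    parser.add_argument("--swap-mode", default="both")
    parser.add_argument("--max-out-tokens-judge", type=int, default=1024)
    parser.add_argument("--max-model-len", type=int, default=8192)
    return parser


def args_to_config(args: argparse.Namespace) -> RunConfig:
    # The only place that knows both the flat flag names and the nested schema.
    return RunConfig(
        judge_model=args.judge_model,
        max_model_len=args.max_model_len,
        judge={
            "swap_mode": args.swap_mode,
            "max_out_tokens": args.max_out_tokens_judge,
        },
    )
```

An entrypoint would then do `config = args_to_config(build_parser().parse_args())` and pass `config` down, keeping the flag-to-field mapping in one place.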
Expected impact on execution
This is a structural refactor, not a behavior change.
The generation/evaluation flow should remain the same. The main change is how runtime options are represented and validated before execution starts.
Follow-up ideas
After this lands, we can add config file support:
- JSON config loading first
- YAML support later
- CLI overrides on top of config file values
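The override layering could be sketched roughly like this. The merge is done on plain dicts before validation, so the final `RunConfig` still sees one combined input; function names and the one-level-deep merge are assumptions:

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def merge_options(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Layer explicitly-set CLI values over file values.

    ``None`` means "flag not passed", so the file value survives. Nested
    sections are merged one level deep, enough for section.field layouts.
    """
    merged = dict(base)
    for key, value in overrides.items():
        if value is None:
            continue
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = {
                **merged[key],
                **{k: v for k, v in value.items() if v is not None},
            }
        else:
            merged[key] = value
    return merged


def load_run_options(path: Optional[str], cli_overrides: Dict[str, Any]) -> Dict[str, Any]:
    # JSON first (per the plan above); YAML could slot in here later.
    base = json.loads(Path(path).read_text()) if path else {}
    return merge_options(base, cli_overrides)
```

The resulting dict would be fed straight into `RunConfig(**options)`, so validation still happens in exactly one place.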
Example (before vs after in code)
Current style (flat args used everywhere):

```python
def main(args):
    if args.swap_mode == "both":
        ...
    judge = make_model(
        model=args.judge_model,
        max_tokens=args.max_out_tokens_judge,
        max_model_len=args.max_model_len,
    )
```

Proposed style (typed config object):

```python
def main(config: RunConfig):
    if config.judge.swap_mode == "both":
        ...
    judge = make_model(
        model=config.models.judge_model,
        max_tokens=config.judge.max_out_tokens,
        max_model_len=config.runtime.max_model_len,
    )
```
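To illustrate the "fail early with clear errors" point from above, here is a small self-contained example of what an invalid config would do under the proposed style (field names assumed, mirroring the snippet above):

```python
from pydantic import BaseModel, Field, ValidationError


class JudgeConfig(BaseModel):
    max_out_tokens: int = Field(default=1024, gt=0)


class RunConfig(BaseModel):
    judge: JudgeConfig = Field(default_factory=JudgeConfig)


try:
    RunConfig(judge={"max_out_tokens": -5})
except ValidationError as err:
    # The error names the exact field path, before any model is loaded.
    print(err.errors()[0]["loc"])  # ('judge', 'max_out_tokens')
```

Today, a bad `--max-out-tokens-judge` value would only surface once execution reaches the judge setup; with the schema, it fails at config construction.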