Add Codex research workflow and review tooling #453

TrueCrimeDev wants to merge 6 commits into karpathy:master from
Conversation
Co-authored-by: Codex <noreply@openai.com>
Pull request overview
This PR adds a Codex CLI–driven workflow for running structured “research specs” and managing their outputs (runs, datasets, dashboards), plus local tooling to plan/prepare reviewable PRs from commit history or a dirty worktree.
Changes:
- Introduces a Codex research runner (`doctor`, one-shot `run`, and queued `queue`) plus a heavy-equipment research spec and queue templates.
- Adds dataset build + export automation, including optional targeted follow-up passes for unresolved image URLs.
- Adds local operational tooling: results dashboard, PR bundle planning, worktree snapshot commits, and review-branch generation.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| README.md | Documents the new research runner, dataset builder, dashboard, PR planning, snapshots, and review-branch tooling. |
| program.md | Replaces the prior training/experiment program with reusable instructions tailored to Codex research jobs. |
| specs/heavy_construction_industrial_equipment.md | Adds a templated heavy construction equipment research specification with phase/depth variables. |
| job_templates/heavy-equipment-phase1-seed.json | Adds an example queued job for phase 1 at seed depth. |
| job_templates/heavy-equipment-phase1-exhaustive.json | Adds an example queued job for phase 1 at exhaustive depth. |
| scripts/run_research.py | Implements the Codex job runner (prompt rendering, variable substitution, run artifacts, queue processing). |
| scripts/build_heavy_equipment_dataset.py | Automates multi-phase runs, merges outputs into CSV/JSON/markdown, and supports image follow-up passes. |
| scripts/export_heavy_equipment_report.py | Exports a combined markdown report into dataset CSV/JSON + manifest. |
| scripts/research_dashboard.py | Generates a local HTML/terminal dashboard summarizing runs and datasets under results/. |
| scripts/research_pr_plan.py | Plans consolidated/stacked/individual PR bundles from branch-only commits using file-overlap dependency heuristics. |
| scripts/research_snapshot.py | Groups dirty worktree changes into commit buckets and can create snapshot commits (dry-run by default). |
| scripts/research_review_branches.py | Materializes PR bundles as local “review/*” branches via temporary worktrees + cherry-picks. |
python scripts/research_dashboard.py serve
```

By default that writes [results/dashboard/index.html](/C:/Users/uphol/Documents/Design/Coding/autoresearch/results/dashboard/index.html) and serves the repo at `http://127.0.0.1:8765/results/dashboard/index.html`.

The dashboard section includes a markdown link to an absolute local Windows path (`/C:/Users/.../results/dashboard/index.html`), which will be broken for other users and on GitHub. Use a repo-relative link (e.g., `results/dashboard/index.html`), or just refer to the path in inline code without embedding a machine-specific absolute path.

Suggested change:

```suggestion
By default that writes [results/dashboard/index.html](results/dashboard/index.html) and serves the repo at `http://127.0.0.1:8765/results/dashboard/index.html`.
```
## Dashboard

This repo now includes a lightweight run-history dashboard that reads the existing `results\` artifacts directly. It does not require pi, a web framework, or extra build steps.

Typo in the dashboard description: "does not require pi" reads like it should be "does not require pip" (the Python package installer).

Suggested change:

```suggestion
This repo now includes a lightweight run-history dashboard that reads the existing `results\` artifacts directly. It does not require pip, a web framework, or extra build steps.
```
```python
output_dir = output_root / f"{timestamp}-{slug}"
output_dir.mkdir(parents=True, exist_ok=False)
return output_dir
```

`build_output_dir()` uses a timestamp with second-level precision (`%Y%m%d-%H%M%S`) and fails if the directory already exists (`exist_ok=False`). In queue mode (or fast retries), multiple runs can land in the same second and collide, causing an avoidable failure. Consider adding higher-resolution time (microseconds), a random suffix, or a collision-retry loop.

Suggested change:

```suggestion
# Ensure we don't fail spuriously if multiple runs land in the same second.
# Start with the base path and, on collision, append an incrementing suffix.
attempt = 0
while True:
    if attempt == 0:
        dir_name = f"{timestamp}-{slug}"
    else:
        dir_name = f"{timestamp}-{slug}-{attempt}"
    output_dir = output_root / dir_name
    try:
        output_dir.mkdir(parents=True, exist_ok=False)
        return output_dir
    except FileExistsError:
        attempt += 1
```
```python
def dataset_dir(depth: str, tag: str | None) -> Path:
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    suffix = tag or f"heavy-construction-equipment-{depth}"
    path = DATASET_ROOT / f"{timestamp}-{suffix}"
    path.mkdir(parents=True, exist_ok=False)
    return path
```

`dataset_dir()` uses a second-resolution timestamp and an un-sanitized tag to form a folder name. This can (a) collide if multiple datasets are created within the same second and (b) fail or create surprising paths if `--tag` contains spaces, slashes, or other filesystem-unfriendly characters. Consider slugifying the suffix and adding collision avoidance (microseconds / random suffix / retry).
```python
if len(cells) >= 6:
    rows.append(
        {
            "phase_number": phase_number,
            "phase": phase_label,
            "category": current_category,
            "row_number": cells[0],
            "equipment_name": cells[1],
            "also_known_as": cells[2],
            "description": cells[3],
            "typical_use_phase": cells[4],
            "image_url": cells[5],
        }
    )
```

`parse_markdown_report()` splits table rows on every `|`, so any pipe character produced in aliases/descriptions will shift columns and silently corrupt the extracted fields (e.g., `image_url` no longer ends up in `cells[5]`). This is likely with LLM-generated tables. Consider enforcing/escaping pipes in the spec, or make the parser robust (e.g., split with a max column count and/or re-join overflow columns into the description field, and validate the URL column).

Suggested change:

```suggestion
# Expect 6 logical columns:
#   0: row_number
#   1: equipment_name
#   2: also_known_as
#   3: description (may contain '|')
#   4: typical_use_phase
#   5: image_url
if len(cells) < 6:
    i += 1
    continue
if len(cells) == 6:
    row_number = cells[0]
    equipment_name = cells[1]
    also_known_as = cells[2]
    description = cells[3]
    typical_use_phase = cells[4]
    image_url = cells[5]
else:
    # Extra '|' characters likely appeared in a text field.
    # Keep the first three and last two columns fixed, and
    # join any overflow into the description field.
    row_number = cells[0]
    equipment_name = cells[1]
    also_known_as = cells[2]
    if len(cells) > 3:
        if len(cells) > 4:
            typical_use_phase = cells[-2]
            image_url = cells[-1]
            description_parts = cells[3:-2]
            description = "|".join(part.strip() for part in description_parts)
        else:
            # Fallback: treat remaining cell as description, and leave
            # typical_use_phase and image_url empty.
            description = cells[3]
            typical_use_phase = ""
            image_url = ""
    else:
        # Should not happen with len(cells) >= 6, but guard anyway.
        description = ""
        typical_use_phase = ""
        image_url = ""
rows.append(
    {
        "phase_number": phase_number,
        "phase": phase_label,
        "category": current_category,
        "row_number": row_number,
        "equipment_name": equipment_name,
        "also_known_as": also_known_as,
        "description": description,
        "typical_use_phase": typical_use_phase,
        "image_url": image_url,
    }
)
```
```python
cells = [cell.strip() for cell in row_line.strip("|").split("|")]
if len(cells) >= 6:
    phase_number, phase_label = phase_for_category(current_category)
    rows.append(
        {
            "phase_number": phase_number,
            "phase": phase_label,
            "category": current_category,
            "row_number": cells[0],
            "equipment_name": cells[1],
            "also_known_as": cells[2],
            "description": cells[3],
            "typical_use_phase": cells[4],
            "image_url": cells[5],
        }
    )
```

`parse_report()` splits markdown table rows on every `|`, which will mis-parse rows if the report contains pipes inside any cell (common in LLM outputs), leading to wrong CSV/JSON fields (especially `image_url`). Consider using a more robust split (max columns) and/or validating the extracted URL column before accepting the row.

Suggested change:

```suggestion
raw_cells = [cell.strip() for cell in row_line.strip("|").split("|")]
if len(raw_cells) < 6:
    i += 1
    continue
# If there are more than 6 split segments (due to '|' inside cells),
# collapse all remaining segments into the last column.
if len(raw_cells) > 6:
    cells = raw_cells[:5] + ["|".join(raw_cells[5:]).strip()]
else:
    cells = raw_cells
# Validate that the image_url looks like an HTTP(S) URL before accepting the row.
image_url = cells[5]
if not re.match(r"^https?://", image_url):
    i += 1
    continue
phase_number, phase_label = phase_for_category(current_category)
rows.append(
    {
        "phase_number": phase_number,
        "phase": phase_label,
        "category": current_category,
        "row_number": cells[0],
        "equipment_name": cells[1],
        "also_known_as": cells[2],
        "description": cells[3],
        "typical_use_phase": cells[4],
        "image_url": image_url,
    }
)
```
```python
temp_dir = Path(tempfile.mkdtemp(prefix="review-branch-"))
try:
    git(["worktree", "add", "--quiet", "-b", branch_name, str(temp_dir), base_ref], cwd=repo_root)
    try:
        if bundle_commits:
            git(["cherry-pick", *[commit.commit_hash for commit in bundle_commits]], cwd=temp_dir)
    except Exception:
        git(["cherry-pick", "--abort"], cwd=temp_dir, check=False)
        raise
    finally:
        git(["worktree", "remove", "--force", str(temp_dir)], cwd=repo_root, check=False)
finally:
    shutil.rmtree(temp_dir, ignore_errors=True)
```

If a cherry-pick fails in `create_branch_from_bundle()`, the worktree is removed but the newly created branch ref can remain pointing at `base_ref`, leaving behind partially-created review branches. Consider deleting `branch_name` on failure (or creating the branch only after a successful cherry-pick) so a failed run doesn't require manual cleanup.
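One way that cleanup could look, sketched under the assumption that `git()` is a thin `subprocess` wrapper (the real helper's signature may differ) and reusing the names from the snippet above:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(args, cwd, check=True):
    # Thin wrapper assumed to mirror the script's helper.
    return subprocess.run(["git", *args], cwd=cwd, check=check)


def create_branch_from_bundle(repo_root: Path, branch_name: str, base_ref: str, bundle_commits):
    temp_dir = Path(tempfile.mkdtemp(prefix="review-branch-"))
    try:
        git(["worktree", "add", "--quiet", "-b", branch_name, str(temp_dir), base_ref], cwd=repo_root)
        try:
            if bundle_commits:
                git(["cherry-pick", *[c.commit_hash for c in bundle_commits]], cwd=temp_dir)
        except Exception:
            git(["cherry-pick", "--abort"], cwd=temp_dir, check=False)
            # Remove the worktree first (it has the branch checked out), then
            # delete the half-created branch so a failed run leaves nothing behind.
            git(["worktree", "remove", "--force", str(temp_dir)], cwd=repo_root, check=False)
            git(["branch", "-D", branch_name], cwd=repo_root, check=False)
            raise
        finally:
            git(["worktree", "remove", "--force", str(temp_dir)], cwd=repo_root, check=False)
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)
```

The repeated `worktree remove` in the `finally` block is a harmless no-op after the failure path runs, since it is invoked with `check=False`.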