Add Codex research workflow and review tooling #453

TrueCrimeDev wants to merge 6 commits into karpathy:master from
Conversation
Co-authored-by: Codex <noreply@openai.com>
Pull request overview
This PR adds a Codex CLI–driven workflow for running structured “research specs” and managing their outputs (runs, datasets, dashboards), plus local tooling to plan/prepare reviewable PRs from commit history or a dirty worktree.
Changes:
- Introduces a Codex research runner (`doctor`, one-shot `run`, and queued `queue`) plus a heavy-equipment research spec and queue templates.
- Adds dataset build + export automation, including optional targeted follow-up passes for unresolved image URLs.
- Adds local operational tooling: results dashboard, PR bundle planning, worktree snapshot commits, and review-branch generation.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| README.md | Documents the new research runner, dataset builder, dashboard, PR planning, snapshots, and review-branch tooling. |
| program.md | Replaces the prior training/experiment program with reusable instructions tailored to Codex research jobs. |
| specs/heavy_construction_industrial_equipment.md | Adds a templated heavy construction equipment research specification with phase/depth variables. |
| job_templates/heavy-equipment-phase1-seed.json | Adds an example queued job for phase 1 at seed depth. |
| job_templates/heavy-equipment-phase1-exhaustive.json | Adds an example queued job for phase 1 at exhaustive depth. |
| scripts/run_research.py | Implements the Codex job runner (prompt rendering, variable substitution, run artifacts, queue processing). |
| scripts/build_heavy_equipment_dataset.py | Automates multi-phase runs, merges outputs into CSV/JSON/markdown, and supports image follow-up passes. |
| scripts/export_heavy_equipment_report.py | Exports a combined markdown report into dataset CSV/JSON + manifest. |
| scripts/research_dashboard.py | Generates a local HTML/terminal dashboard summarizing runs and datasets under results/. |
| scripts/research_pr_plan.py | Plans consolidated/stacked/individual PR bundles from branch-only commits using file-overlap dependency heuristics. |
| scripts/research_snapshot.py | Groups dirty worktree changes into commit buckets and can create snapshot commits (dry-run by default). |
| scripts/research_review_branches.py | Materializes PR bundles as local “review/*” branches via temporary worktrees + cherry-picks. |
python scripts/research_dashboard.py serve
```

By default that writes [results/dashboard/index.html](/C:/Users/uphol/Documents/Design/Coding/autoresearch/results/dashboard/index.html) and serves the repo at `http://127.0.0.1:8765/results/dashboard/index.html`.

The dashboard section includes a markdown link to an absolute local Windows path (`/C:/Users/.../results/dashboard/index.html`), which will be broken for other users and on GitHub. Use a repo-relative link (e.g., `results/dashboard/index.html`), or just refer to the path in inline code without embedding a machine-specific absolute path.

Suggested change:

```suggestion
By default that writes [results/dashboard/index.html](results/dashboard/index.html) and serves the repo at `http://127.0.0.1:8765/results/dashboard/index.html`.
```
## Dashboard

This repo now includes a lightweight run-history dashboard that reads the existing `results\` artifacts directly. It does not require pi, a web framework, or extra build steps.

Typo in the dashboard description: "does not require pi" reads like it should be "does not require pip" (the Python package installer).

Suggested change:

```suggestion
This repo now includes a lightweight run-history dashboard that reads the existing `results\` artifacts directly. It does not require pip, a web framework, or extra build steps.
```
```python
output_dir = output_root / f"{timestamp}-{slug}"
output_dir.mkdir(parents=True, exist_ok=False)
return output_dir
```

`build_output_dir()` uses a timestamp with second-level precision (`%Y%m%d-%H%M%S`) and fails if the directory already exists (`exist_ok=False`). In queue mode (or fast retries), multiple runs can land in the same second and collide, causing an avoidable failure. Consider adding higher-resolution time (microseconds), a random suffix, or a collision-retry loop.

Suggested change:

```suggestion
# Ensure we don't fail spuriously if multiple runs land in the same second.
# Start with the base path and, on collision, append an incrementing suffix.
attempt = 0
while True:
    if attempt == 0:
        dir_name = f"{timestamp}-{slug}"
    else:
        dir_name = f"{timestamp}-{slug}-{attempt}"
    output_dir = output_root / dir_name
    try:
        output_dir.mkdir(parents=True, exist_ok=False)
        return output_dir
    except FileExistsError:
        attempt += 1
```
```python
def dataset_dir(depth: str, tag: str | None) -> Path:
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    suffix = tag or f"heavy-construction-equipment-{depth}"
    path = DATASET_ROOT / f"{timestamp}-{suffix}"
    path.mkdir(parents=True, exist_ok=False)
    return path
```

`dataset_dir()` uses a second-resolution timestamp and an un-sanitized tag to form a folder name. This can (a) collide if multiple datasets are created within the same second and (b) fail or create surprising paths if `--tag` contains spaces, slashes, or other filesystem-unfriendly characters. Consider slugifying the suffix and adding collision avoidance (microseconds / random suffix / retry).
```python
if len(cells) >= 6:
    rows.append(
        {
            "phase_number": phase_number,
            "phase": phase_label,
            "category": current_category,
            "row_number": cells[0],
            "equipment_name": cells[1],
            "also_known_as": cells[2],
            "description": cells[3],
            "typical_use_phase": cells[4],
            "image_url": cells[5],
        }
    )
```

`parse_markdown_report()` splits table rows on every `|`, so any pipe character produced in aliases/descriptions will shift columns and silently corrupt the extracted fields (e.g., `image_url` no longer ends up in `cells[5]`). This is likely with LLM-generated tables. Consider enforcing/escaping pipes in the spec, or make the parser robust (e.g., split with a max column count and/or re-join overflow columns into the description field, and validate the URL column).

Suggested change:

```suggestion
# Expect 6 logical columns:
#   0: row_number
#   1: equipment_name
#   2: also_known_as
#   3: description (may contain '|')
#   4: typical_use_phase
#   5: image_url
if len(cells) < 6:
    i += 1
    continue
if len(cells) == 6:
    row_number = cells[0]
    equipment_name = cells[1]
    also_known_as = cells[2]
    description = cells[3]
    typical_use_phase = cells[4]
    image_url = cells[5]
else:
    # Extra '|' characters likely appeared in a text field.
    # Keep the first three and last two columns fixed, and
    # join any overflow into the description field.
    row_number = cells[0]
    equipment_name = cells[1]
    also_known_as = cells[2]
    if len(cells) > 3:
        if len(cells) > 4:
            typical_use_phase = cells[-2]
            image_url = cells[-1]
            description_parts = cells[3:-2]
            description = "|".join(part.strip() for part in description_parts)
        else:
            # Fallback: treat remaining cell as description, and leave
            # typical_use_phase and image_url empty.
            description = cells[3]
            typical_use_phase = ""
            image_url = ""
    else:
        # Should not happen with len(cells) >= 6, but guard anyway.
        description = ""
        typical_use_phase = ""
        image_url = ""
rows.append(
    {
        "phase_number": phase_number,
        "phase": phase_label,
        "category": current_category,
        "row_number": row_number,
        "equipment_name": equipment_name,
        "also_known_as": also_known_as,
        "description": description,
        "typical_use_phase": typical_use_phase,
        "image_url": image_url,
    }
)
```
```python
cells = [cell.strip() for cell in row_line.strip("|").split("|")]
if len(cells) >= 6:
    phase_number, phase_label = phase_for_category(current_category)
    rows.append(
        {
            "phase_number": phase_number,
            "phase": phase_label,
            "category": current_category,
            "row_number": cells[0],
            "equipment_name": cells[1],
            "also_known_as": cells[2],
            "description": cells[3],
            "typical_use_phase": cells[4],
            "image_url": cells[5],
        }
    )
```

`parse_report()` splits markdown table rows on every `|`, which will mis-parse rows if the report contains pipes inside any cell (common in LLM outputs), leading to wrong CSV/JSON fields (especially `image_url`). Consider using a more robust split (max columns) and/or validating the extracted URL column before accepting the row.

Suggested change:

```suggestion
raw_cells = [cell.strip() for cell in row_line.strip("|").split("|")]
if len(raw_cells) < 6:
    i += 1
    continue
# If there are more than 6 split segments (due to '|' inside cells),
# collapse all remaining segments into the last column.
if len(raw_cells) > 6:
    cells = raw_cells[:5] + ["|".join(raw_cells[5:]).strip()]
else:
    cells = raw_cells
# Validate that the image_url looks like an HTTP(S) URL before accepting the row.
image_url = cells[5]
if not re.match(r"^https?://", image_url):
    i += 1
    continue
phase_number, phase_label = phase_for_category(current_category)
rows.append(
    {
        "phase_number": phase_number,
        "phase": phase_label,
        "category": current_category,
        "row_number": cells[0],
        "equipment_name": cells[1],
        "also_known_as": cells[2],
        "description": cells[3],
        "typical_use_phase": cells[4],
        "image_url": image_url,
    }
)
```
```python
temp_dir = Path(tempfile.mkdtemp(prefix="review-branch-"))
try:
    git(["worktree", "add", "--quiet", "-b", branch_name, str(temp_dir), base_ref], cwd=repo_root)
    try:
        if bundle_commits:
            git(["cherry-pick", *[commit.commit_hash for commit in bundle_commits]], cwd=temp_dir)
    except Exception:
        git(["cherry-pick", "--abort"], cwd=temp_dir, check=False)
        raise
    finally:
        git(["worktree", "remove", "--force", str(temp_dir)], cwd=repo_root, check=False)
finally:
    shutil.rmtree(temp_dir, ignore_errors=True)
```

If a cherry-pick fails in `create_branch_from_bundle()`, the worktree is removed but the newly created branch ref can remain pointing at `base_ref`, leaving behind partially-created review branches. Consider deleting `branch_name` on failure (or creating the branch only after a successful cherry-pick) so a failed run doesn't require manual cleanup.
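One way that cleanup could look, sketched under the assumption that `git()` is a thin `subprocess` wrapper (the real helper's signature may differ) and reusing the names from the snippet above:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(args, cwd, check=True):
    # Thin wrapper assumed to mirror the script's helper.
    return subprocess.run(["git", *args], cwd=cwd, check=check)


def create_branch_from_bundle(repo_root: Path, branch_name: str, base_ref: str, bundle_commits):
    temp_dir = Path(tempfile.mkdtemp(prefix="review-branch-"))
    try:
        git(["worktree", "add", "--quiet", "-b", branch_name, str(temp_dir), base_ref], cwd=repo_root)
        try:
            if bundle_commits:
                git(["cherry-pick", *[c.commit_hash for c in bundle_commits]], cwd=temp_dir)
        except Exception:
            git(["cherry-pick", "--abort"], cwd=temp_dir, check=False)
            # Remove the worktree first (it has the branch checked out), then
            # delete the half-created branch so a failed run leaves nothing behind.
            git(["worktree", "remove", "--force", str(temp_dir)], cwd=repo_root, check=False)
            git(["branch", "-D", branch_name], cwd=repo_root, check=False)
            raise
        finally:
            git(["worktree", "remove", "--force", str(temp_dir)], cwd=repo_root, check=False)
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)
```

The repeated `worktree remove` in the `finally` block is a harmless no-op after the failure path runs, since it is invoked with `check=False`.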