# Autoresearch agent

> [!NOTE]
> Code available [here](https://github.com/unionai/unionai-examples/tree/main/v2/tutorials/autoresearch).

This tutorial wraps an autonomous AI research loop in a single Flyte task. The task spins up a GPU container, installs the [Claude Code](https://docs.anthropic.com/en/docs/claude-code/overview) CLI, clones a research repository, and points Claude Code at a `program.md` brief. The agent runs experiments to improve a model, writes results to disk, and the task then commits the changes and opens a pull request — with a progress plot rendered both in the PR and in the Flyte UI.

It's an example of using Flyte as durable infrastructure for long-running, autonomous agent work:

- **A GPU `TaskEnvironment`** with the API-key and GitHub secrets the agent needs.
- **`report=True`** to stream a progress plot into the Flyte UI.
- **A reconnecting `run.wait()`** loop in the driver so a dropped client connection doesn't lose track of a multi-hour run.

> [!WARNING]
> This example drives a coding agent that executes arbitrary code and pushes commits to a GitHub repository. Run it against a repository you control, and review the constants described below before launching.

## Define the container image

The image is kept in its own `_image.py` module so edits to the agent logic in `run.py` don't invalidate the image cache. Node.js and the Claude Code CLI are installed at run time (see below) to keep the image small.

```
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "flyte>=2.0.0b22",
#     "PyGithub>=2.5.0",
#     "matplotlib>=3.7.0",
#     "pandas>=2.0.0",
# ]
# ///
#
# Stable image definition — kept separate from run.py so edits to run.py
# don't invalidate the image cache. Only touch this file when the image itself needs to change.

import flyte

# {{docs-fragment image}}
image = (
    flyte.Image.from_uv_script(__file__, name="autoresearch-agent", pre=True)
    .with_apt_packages("git")
)
# {{/docs-fragment image}}
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/autoresearch/_image.py*

## Define the task environment

The task needs a GPU, a generous disk for the cloned repo and model weights, and two secrets: a GitHub token (to clone and push) and an Anthropic API key (for Claude Code).

```
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "flyte>=2.0.0b22",
#     "PyGithub>=2.5.0",
#     "matplotlib>=3.7.0",
# ]
# ///

"""
AutoResearch Agent - Runs the autoresearch workflow using Claude Code CLI in a GPU environment.

This agent:
1. Starts a GPU-enabled container
2. Installs Claude Code CLI
3. Clones the autoresearch repository
4. Points Claude Code at program.md as the prompt and lets it run
5. Commits the result (CSV + code changes in train/) and creates a PR
"""

import os
import shlex
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

from github import Auth, Github

import flyte
import flyte.report
from _image import image as autoresearch_image

GITHUB_USERNAME = "parnianz"
GITHUB_EMAIL = "parnianzargham@gmail.com"
AUTORESEARCH_REPO_URL = "https://github.com/unionai-oss/autoresearch.git"
AUTORESEARCH_REPO_FULL_NAME = "unionai-oss/autoresearch"

# {{docs-fragment env}}
autoresearch_env = flyte.TaskEnvironment(
    name="autoresearch-agent",
    resources=flyte.Resources(
        cpu=8,
        memory="32Gi",
        gpu="T4:1",
        disk="100Gi",
    ),
    secrets=[
        flyte.Secret(key="github_token", as_env_var="GITHUB_TOKEN"),
        flyte.Secret(key="internal-anthropic-api-key", as_env_var="ANTHROPIC_API_KEY"),
    ],
    image=autoresearch_image,
)
# {{/docs-fragment env}}

# {{docs-fragment result}}
@dataclass
class AutoResearchResult:
    """Result of the autoresearch run."""

    pr_url: str
    pr_number: int
    branch_name: str
    files_changed: list[str]
    success: bool
    error_message: Optional[str] = None
# {{/docs-fragment result}}

def clone_repository(repo_url: str, work_dir: Path, github_token: str) -> Path:
    """Clone the autoresearch repository with authentication."""
    repo_name = repo_url.rstrip("/").split("/")[-1].replace(".git", "")
    repo_path = work_dir / repo_name

    # Inject token into HTTPS URL for authentication
    authenticated_url = repo_url.replace(
        "https://", f"https://{GITHUB_USERNAME}:{github_token}@"
    )

    if repo_path.exists():
        subprocess.run(["git", "pull"], cwd=repo_path, check=True)
    else:
        subprocess.run(["git", "clone", authenticated_url, str(repo_path)], check=True)

    return repo_path

# {{docs-fragment task}}
@autoresearch_env.task(report=True)
async def run_autoresearch() -> AutoResearchResult:
    """
    Run the autoresearch workflow end-to-end.

    Steps:
    - Clone https://github.com/unionai-oss/autoresearch
    - Configure git identity
    - Create a new branch
    - Run Claude Code CLI with program.md as the prompt
    - Commit results (CSV + train/ changes)
    - Push and open a PR against the autoresearch repo
    """
    github_token = os.environ["GITHUB_TOKEN"]
    anthropic_api_key = os.environ["ANTHROPIC_API_KEY"]

    # --- Install Node.js + Claude Code at runtime (keeps image small and submission fast) ---
    import tarfile
    import urllib.request as _urllib

    subprocess.run(["apt-get", "update", "-y"], check=False)
    subprocess.run(["apt-get", "install", "-y", "git"], check=False)

    node_url = "https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.gz"
    node_tar = Path("/tmp/node.tar.gz")
    print(f"Downloading Node.js from {node_url}...", flush=True)
    _urllib.urlretrieve(node_url, node_tar)
    size_mb = node_tar.stat().st_size / 1024 / 1024
    print(f"Downloaded {size_mb:.1f} MB to {node_tar}", flush=True)
    if size_mb < 1:
        raise RuntimeError(f"Node.js download appears empty/corrupt ({size_mb:.2f} MB) — network may be restricted")
    node_dir = Path("/tmp/node")
    node_dir.mkdir(exist_ok=True)
    print("Extracting Node.js...", flush=True)
    with tarfile.open(node_tar, "r:gz") as tar:
        members = [m for m in tar.getmembers() if m.name.split("/", 1)[-1]]
        for m in members:
            m.name = m.name.split("/", 1)[-1]
        tar.extractall(str(node_dir), members=[m for m in members if m.name])

    # Add node/npm to PATH for this process and all subprocesses
    node_bin = str(node_dir / "bin")
    os.environ["PATH"] = node_bin + ":" + os.environ.get("PATH", "")
    print(f"Node version: {subprocess.run(['node', '--version'], capture_output=True, text=True).stdout.strip()}", flush=True)

    npm_prefix = "/tmp/npm-global"
    Path(npm_prefix).mkdir(exist_ok=True)
    subprocess.run(["npm", "install", "-g", "--prefix", npm_prefix, "@anthropic-ai/claude-code"], check=True)
    os.environ["PATH"] = str(Path(npm_prefix) / "bin") + ":" + os.environ["PATH"]
    print("Node.js + Claude Code installed.", flush=True)

    # --- Clone repo ---
    work_dir = Path("/tmp/autoresearch_workspace")
    work_dir.mkdir(exist_ok=True, parents=True)
    repo_path = clone_repository(AUTORESEARCH_REPO_URL, work_dir, github_token)

    # --- Git identity ---
    subprocess.run(
        ["git", "config", "--global", "user.email", GITHUB_EMAIL], check=True
    )
    subprocess.run(
        ["git", "config", "--global", "user.name", GITHUB_USERNAME], check=True
    )

    # --- Create branch ---
    import time as _time
    branch_name = f"autoresearch/claude-run-{int(_time.time())}"
    try:
        subprocess.run(
            ["git", "checkout", "-b", branch_name],
            cwd=repo_path,
            check=True,
        )
    except subprocess.CalledProcessError:
        subprocess.run(
            ["git", "checkout", branch_name],
            cwd=repo_path,
            check=True,
        )

    # --- Read program.md to use as the Claude Code prompt ---
    program_md = repo_path / "program.md"
    if not program_md.exists():
        raise FileNotFoundError(
            f"program.md not found in {repo_path}. "
            "Make sure the autoresearch repo has a program.md at its root."
        )

    program_md_content = program_md.read_text()
    print(f"Loaded prompt from program.md ({len(program_md_content)} chars)")
    # {{/docs-fragment task}}

    # Install repo dependencies before handing off to Claude
    for pip_cmd in [
        ["pip", "install", "-e", "."],
        ["pip", "install", "-r", "requirements.txt"],
    ]:
        req_file = repo_path / pip_cmd[-1] if pip_cmd[-1].startswith("req") else None
        if req_file is None or req_file.exists():
            dep_result = subprocess.run(
                pip_cmd, cwd=repo_path, capture_output=True, text=True
            )
            print(f"{' '.join(pip_cmd)}:\n{dep_result.stdout}", flush=True)
            if dep_result.returncode != 0:
                print(f"(non-fatal) {dep_result.stderr}", flush=True)

    # Wrap the program.md content with explicit instructions to write outputs to disk
    prompt = f"""You are running inside an automated GPU pipeline. You MUST write all outputs to disk as actual files.

Here are your instructions from program.md:

{program_md_content}

LOGGING INSTRUCTIONS (follow exactly):
- Before you start any training, print this exact line: [AUTORESEARCH] Training started
- Before training, print what change you are testing: [AUTORESEARCH] Change: <one line description of the code change being tested>
- When training finishes, print this exact line: [AUTORESEARCH] Training finished
- After training, print the key metric value: [AUTORESEARCH] Metric: <metric name>=<value>
- When writing results to CSV, print this exact line: [AUTORESEARCH] Writing results to CSV

IMPORTANT: After completing the above instructions, make sure you have:
1. Written the final results to a CSV file in this repository (e.g. results/results.csv or similar)
2. Saved all code changes you made to the train/ directory (or wherever the training code lives)
3. All files must be written to the current working directory so they appear in git status
If any command fails, debug and fix it rather than stopping. Do not just print results — write them to files on disk."""

    # --- Pre-flight: verify claude is installed and API key is reachable ---
    version_check = subprocess.run(
        ["claude", "--version"], capture_output=True, text=True
    )
    print(f"claude version: {version_check.stdout.strip()} | stderr: {version_check.stderr.strip()}", flush=True)
    if version_check.returncode != 0:
        raise RuntimeError(f"claude CLI not found or broken: {version_check.stderr}")

    # --- Disable Claude Code sandbox ---
    # In Kubernetes/Flyte pods, Claude Code's sandbox tries to spin up a nested container
    # which fails silently and causes file writes to go to an ephemeral space instead of
    # the real working directory. Disabling it makes writes land in the actual filesystem.
    claude_config_dir = Path("/root/.claude")
    claude_config_dir.mkdir(parents=True, exist_ok=True)
    settings = claude_config_dir / "settings.json"
    import json as _json
    existing = _json.loads(settings.read_text()) if settings.exists() else {}
    existing["sandbox"] = False
    settings.write_text(_json.dumps(existing, indent=2))
    print(f"Wrote Claude Code settings: {settings.read_text()}", flush=True)

    # --- Run Claude Code CLI ---
    # Matches swe_agent.py exactly: prompt as positional arg, CI=true enables non-interactive mode
    cmd = [
        "claude",
        "--dangerously-skip-permissions",
        "--max-turns", "100",
        "--model", "claude-haiku-4-5-20251001",
        prompt,
    ]

    print(f"Running: {shlex.join(cmd[:3])} <prompt>", flush=True)

    claude_env = {
        **os.environ,
        "ANTHROPIC_API_KEY": anthropic_api_key,
        "CLAUDE_SKIP_PERMISSIONS": "true",
        "CI": "true",  # Enables non-interactive mode (no TTY required)
    }

    # Stream output line by line so logs appear in real time instead of buffering until done
    proc = subprocess.Popen(
        cmd,
        cwd=repo_path,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout stream
        text=True,
        env=claude_env,
    )

    stdout_lines = []
    for line in proc.stdout:
        line = line.rstrip("\n")
        print(line, flush=True)
        stdout_lines.append(line)

    proc.wait()
    full_output = "\n".join(stdout_lines)
    print(f"Claude Code exit code: {proc.returncode}", flush=True)

    if proc.returncode != 0:
        raise RuntimeError(
            f"Claude Code CLI exited with code {proc.returncode}\n"
            f"output: {full_output[-2000:]}"
        )

    # --- Collect changed files ---
    git_status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )

    print(f"Git status:\n{git_status.stdout}", flush=True)

    files_changed = []
    for line in git_status.stdout.strip().splitlines():
        if line:
            # git status --porcelain: first two chars are XY status flags
            file_path = line[3:].strip()
            files_changed.append(file_path)

    # Also list all files in repo dir for debugging
    all_files = subprocess.run(
        ["find", ".", "-type", "f", "-not", "-path", "./.git/*"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    print(f"All files in repo:\n{all_files.stdout}", flush=True)

    if not files_changed:
        raise RuntimeError(
            "Claude Code ran successfully but produced no file changes.\n"
            f"output: {full_output[-2000:]}"
        )

    # --- Commit ---
    subprocess.run(["git", "add", "."], cwd=repo_path, check=True)
    subprocess.run(["git", "add", "-f", "results.tsv"], cwd=repo_path, check=False)
    subprocess.run(["git", "add", "-f", "results/"], cwd=repo_path, check=False)
    commit_message = (
        "feat: autoresearch run via Claude Code\n\n"
        "Added research results (CSV) and updated train/ code changes.\n"
        "Generated by the autoresearch Flyte agent."
    )
    subprocess.run(
        ["git", "commit", "-m", commit_message],
        cwd=repo_path,
        check=True,
    )

    # --- Push ---
    print(f"GitHub token present: {bool(github_token)}, length: {len(github_token) if github_token else 0}", flush=True)
    authenticated_url = AUTORESEARCH_REPO_URL.replace(
        "https://", f"https://{GITHUB_USERNAME}:{github_token}@"
    )
    subprocess.run(
        ["git", "remote", "set-url", "origin", authenticated_url],
        cwd=repo_path,
        check=True,
    )
    push_result = subprocess.run(
        ["git", "push", "-u", "origin", branch_name, "--force"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    print(f"Push stdout: {push_result.stdout}", flush=True)
    print(f"Push stderr: {push_result.stderr}", flush=True)
    if push_result.returncode != 0:
        raise RuntimeError(f"git push failed (exit {push_result.returncode}):\n{push_result.stderr}")

    # --- Create PR via PyGithub ---
    auth = Auth.Token(github_token)
    gh = Github(auth=auth)
    repo = gh.get_repo(AUTORESEARCH_REPO_FULL_NAME)

    csv_files = [f for f in files_changed if f.endswith(".csv")]
    train_files = [f for f in files_changed if "train" in f]

    pr_body = f"""## AutoResearch Run

This PR was automatically generated by the autoresearch Flyte agent using Claude Code CLI.

### What changed
- **Result CSV files**: {', '.join(f'`{f}`' for f in csv_files) or 'none detected'}
- **Train code changes**: {', '.join(f'`{f}`' for f in train_files) or 'none detected'}

### All changed files
{chr(10).join(f'- `{f}`' for f in files_changed)}

---
🤖 Generated by [autoresearch Flyte agent](https://github.com/unionai-oss/autoresearch)
"""

    existing_prs = list(repo.get_pulls(state="open", head=f"unionai-oss:{branch_name}"))
    if existing_prs:
        pr = existing_prs[0]
        print(f"PR already exists: {pr.html_url}", flush=True)
    else:
        pr = repo.create_pull(
            title="feat: autoresearch results + train changes",
            body=pr_body,
            head=branch_name,
            base="master",
        )
        print(f"PR created: {pr.html_url}", flush=True)

    # --- Generate progress plot from results.tsv ---
    plot_path = repo_path / "progress.png"
    results_tsv = repo_path / "results.tsv"
    if results_tsv.exists():
        import matplotlib
        matplotlib.use("Agg")
        import matplotlib.pyplot as plt
        import pandas as pd

        df = pd.read_csv(str(results_tsv), sep="\t")
        df["val_bpb"] = pd.to_numeric(df["val_bpb"], errors="coerce")
        df["memory_gb"] = pd.to_numeric(df["memory_gb"], errors="coerce")
        df["status"] = df["status"].str.strip().str.upper()

        # Filter out crashes for plotting
        valid = df[df["status"] != "CRASH"].copy()
        valid = valid.reset_index(drop=True)

        if len(valid) > 0 and valid["val_bpb"].notna().any():
            baseline_bpb = valid.loc[0, "val_bpb"]
            best = valid["val_bpb"].min()

            # Only plot points at or below baseline (the interesting region)
            below = valid[valid["val_bpb"] <= baseline_bpb + 0.0005]

            fig, ax = plt.subplots(figsize=(16, 8))

            # Plot discarded as faint background dots
            disc = below[below["status"] == "DISCARD"]
            ax.scatter(disc.index, disc["val_bpb"],
                       c="#cccccc", s=12, alpha=0.5, zorder=2, label="Discarded")

            # Plot kept experiments as prominent green dots
            kept_v = below[below["status"] == "KEEP"]
            ax.scatter(kept_v.index, kept_v["val_bpb"],
                       c="#2ecc71", s=50, zorder=4, label="Kept", edgecolors="black", linewidths=0.5)

            # Running minimum step line
            kept_mask = valid["status"] == "KEEP"
            kept_idx = valid.index[kept_mask]
            kept_bpb = valid.loc[kept_mask, "val_bpb"]
            running_min = kept_bpb.cummin()
            ax.step(kept_idx, running_min, where="post", color="#27ae60",
                    linewidth=2, alpha=0.7, zorder=3, label="Running best")

            # Label each kept experiment with its description
            for idx, bpb in zip(kept_idx, kept_bpb):
                desc = str(valid.loc[idx, "description"]).strip()
                if len(desc) > 45:
                    desc = desc[:42] + "..."
                ax.annotate(desc, (idx, bpb),
                            textcoords="offset points",
                            xytext=(6, 6), fontsize=8.0,
                            color="#1a7a3a", alpha=0.9,
                            rotation=30, ha="left", va="bottom")

            n_total = len(df)
            n_kept = len(df[df["status"] == "KEEP"])
            ax.set_xlabel("Experiment #", fontsize=12)
            ax.set_ylabel("Validation BPB (lower is better)", fontsize=12)
            ax.set_title(f"Autoresearch Progress: {n_total} Experiments, {n_kept} Kept Improvements", fontsize=14)
            ax.legend(loc="upper right", fontsize=9)
            ax.grid(True, alpha=0.2)

            margin = (baseline_bpb - best) * 0.15
            ax.set_ylim(best - margin, baseline_bpb + margin)

            plt.tight_layout()
            plt.savefig(str(plot_path), dpi=150, bbox_inches="tight")
            plt.close(fig)
            print(f"Saved plot to {plot_path}", flush=True)

            # Upload plot to PR as a comment with base64 inline image
            import base64
            img_b64 = base64.b64encode(plot_path.read_bytes()).decode()
            pr_comment = (
                "## Autoresearch Progress\n\n"
                f"![Autoresearch Progress](data:image/png;base64,{img_b64})"
            )
            pr.create_issue_comment(pr_comment)
            print("Posted plot as PR comment.", flush=True)

            # Force-add plot to git and amend commit
            subprocess.run(["git", "add", "-f", str(plot_path)], cwd=repo_path, check=False)
            subprocess.run(
                ["git", "commit", "--amend", "--no-edit"],
                cwd=repo_path, check=False,
            )
            subprocess.run(
                ["git", "push", "-u", "origin", branch_name, "--force"],
                cwd=repo_path, check=False,
            )

            # Show plot in Flyte UI via report
            await flyte.report.replace.aio(
                f"<h2>Autoresearch Progress</h2>"
                f'<img src="data:image/png;base64,{img_b64}" style="max-width:100%"/>'
                f'<p><a href="{pr.html_url}">View PR</a></p>'
            )
            await flyte.report.flush.aio()
        else:
            print("results.tsv found but no valid val_bpb rows — skipping plot.", flush=True)
    else:
        print("results.tsv not found — skipping plot.", flush=True)

    return AutoResearchResult(
        pr_url=pr.html_url,
        pr_number=pr.number,
        branch_name=branch_name,
        files_changed=files_changed,
        success=True,
    )

# {{docs-fragment main}}
if __name__ == "__main__":
    import time

    flyte.init_from_config()

    run = flyte.with_runcontext(mode="remote").run(run_autoresearch)

    print(f"AutoResearch run started: {run.url}")
    print("Waiting for completion...")

    while True:
        try:
            run.wait()
            break
        except Exception as e:
            print(f"Connection dropped ({e}), reconnecting in 30s...")
            time.sleep(30)

    print(f"Done! See run at: {run.url}")
# {{/docs-fragment main}}
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/autoresearch/run.py*

The agent targets a specific repository, identity, and branch via module-level constants. Update these to point at your own fork before running:

GITHUB_USERNAME = "<YOUR_GITHUB_USERNAME>"
GITHUB_EMAIL = "you@example.com"
AUTORESEARCH_REPO_URL = "https://github.com/<YOUR_ORG>/<YOUR_REPO>.git"
AUTORESEARCH_REPO_FULL_NAME = "<YOUR_ORG>/<YOUR_REPO>"
```

## Model the result

The task returns a typed result describing the pull request it created.

```
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "flyte>=2.0.0b22",
#     "PyGithub>=2.5.0",
#     "matplotlib>=3.7.0",
# ]
# ///

"""
AutoResearch Agent - Runs the autoresearch workflow using Claude Code CLI in a GPU environment.

This agent:
1. Starts a GPU-enabled container
2. Installs Claude Code CLI
3. Clones the autoresearch repository
4. Points Claude Code at program.md as the prompt and lets it run
5. Commits the result (CSV + code changes in train/) and creates a PR
"""

import os
import shlex
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

from github import Auth, Github

import flyte
import flyte.report
from _image import image as autoresearch_image

GITHUB_USERNAME = "parnianz"
GITHUB_EMAIL = "parnianzargham@gmail.com"
AUTORESEARCH_REPO_URL = "https://github.com/unionai-oss/autoresearch.git"
AUTORESEARCH_REPO_FULL_NAME = "unionai-oss/autoresearch"

# {{docs-fragment env}}
autoresearch_env = flyte.TaskEnvironment(
    name="autoresearch-agent",
    resources=flyte.Resources(
        cpu=8,
        memory="32Gi",
        gpu="T4:1",
        disk="100Gi",
    ),
    secrets=[
        flyte.Secret(key="github_token", as_env_var="GITHUB_TOKEN"),
        flyte.Secret(key="internal-anthropic-api-key", as_env_var="ANTHROPIC_API_KEY"),
    ],
    image=autoresearch_image,
)
# {{/docs-fragment env}}

# {{docs-fragment result}}
@dataclass
class AutoResearchResult:
    """Result of the autoresearch run."""

    pr_url: str
    pr_number: int
    branch_name: str
    files_changed: list[str]
    success: bool
    error_message: Optional[str] = None
# {{/docs-fragment result}}

def clone_repository(repo_url: str, work_dir: Path, github_token: str) -> Path:
    """Clone the autoresearch repository with authentication."""
    repo_name = repo_url.rstrip("/").split("/")[-1].replace(".git", "")
    repo_path = work_dir / repo_name

    # Inject token into HTTPS URL for authentication
    authenticated_url = repo_url.replace(
        "https://", f"https://{GITHUB_USERNAME}:{github_token}@"
    )

    if repo_path.exists():
        subprocess.run(["git", "pull"], cwd=repo_path, check=True)
    else:
        subprocess.run(["git", "clone", authenticated_url, str(repo_path)], check=True)

    return repo_path

# {{docs-fragment task}}
@autoresearch_env.task(report=True)
async def run_autoresearch() -> AutoResearchResult:
    """
    Run the autoresearch workflow end-to-end.

    Steps:
    - Clone https://github.com/unionai-oss/autoresearch
    - Configure git identity
    - Create a new branch
    - Run Claude Code CLI with program.md as the prompt
    - Commit results (CSV + train/ changes)
    - Push and open a PR against the autoresearch repo
    """
    github_token = os.environ["GITHUB_TOKEN"]
    anthropic_api_key = os.environ["ANTHROPIC_API_KEY"]

    # --- Install Node.js + Claude Code at runtime (keeps image small and submission fast) ---
    import tarfile
    import urllib.request as _urllib

    subprocess.run(["apt-get", "update", "-y"], check=False)
    subprocess.run(["apt-get", "install", "-y", "git"], check=False)

    node_url = "https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.gz"
    node_tar = Path("/tmp/node.tar.gz")
    print(f"Downloading Node.js from {node_url}...", flush=True)
    _urllib.urlretrieve(node_url, node_tar)
    size_mb = node_tar.stat().st_size / 1024 / 1024
    print(f"Downloaded {size_mb:.1f} MB to {node_tar}", flush=True)
    if size_mb < 1:
        raise RuntimeError(f"Node.js download appears empty/corrupt ({size_mb:.2f} MB) — network may be restricted")
    node_dir = Path("/tmp/node")
    node_dir.mkdir(exist_ok=True)
    print("Extracting Node.js...", flush=True)
    with tarfile.open(node_tar, "r:gz") as tar:
        members = [m for m in tar.getmembers() if m.name.split("/", 1)[-1]]
        for m in members:
            m.name = m.name.split("/", 1)[-1]
        tar.extractall(str(node_dir), members=[m for m in members if m.name])

    # Add node/npm to PATH for this process and all subprocesses
    node_bin = str(node_dir / "bin")
    os.environ["PATH"] = node_bin + ":" + os.environ.get("PATH", "")
    print(f"Node version: {subprocess.run(['node', '--version'], capture_output=True, text=True).stdout.strip()}", flush=True)

    npm_prefix = "/tmp/npm-global"
    Path(npm_prefix).mkdir(exist_ok=True)
    subprocess.run(["npm", "install", "-g", "--prefix", npm_prefix, "@anthropic-ai/claude-code"], check=True)
    os.environ["PATH"] = str(Path(npm_prefix) / "bin") + ":" + os.environ["PATH"]
    print("Node.js + Claude Code installed.", flush=True)

    # --- Clone repo ---
    work_dir = Path("/tmp/autoresearch_workspace")
    work_dir.mkdir(exist_ok=True, parents=True)
    repo_path = clone_repository(AUTORESEARCH_REPO_URL, work_dir, github_token)

    # --- Git identity ---
    subprocess.run(
        ["git", "config", "--global", "user.email", GITHUB_EMAIL], check=True
    )
    subprocess.run(
        ["git", "config", "--global", "user.name", GITHUB_USERNAME], check=True
    )

    # --- Create branch ---
    import time as _time
    branch_name = f"autoresearch/claude-run-{int(_time.time())}"
    try:
        subprocess.run(
            ["git", "checkout", "-b", branch_name],
            cwd=repo_path,
            check=True,
        )
    except subprocess.CalledProcessError:
        subprocess.run(
            ["git", "checkout", branch_name],
            cwd=repo_path,
            check=True,
        )

    # --- Read program.md to use as the Claude Code prompt ---
    program_md = repo_path / "program.md"
    if not program_md.exists():
        raise FileNotFoundError(
            f"program.md not found in {repo_path}. "
            "Make sure the autoresearch repo has a program.md at its root."
        )

    program_md_content = program_md.read_text()
    print(f"Loaded prompt from program.md ({len(program_md_content)} chars)")
    # {{/docs-fragment task}}

    # Install repo dependencies before handing off to Claude
    for pip_cmd in [
        ["pip", "install", "-e", "."],
        ["pip", "install", "-r", "requirements.txt"],
    ]:
        req_file = repo_path / pip_cmd[-1] if pip_cmd[-1].startswith("req") else None
        if req_file is None or req_file.exists():
            dep_result = subprocess.run(
                pip_cmd, cwd=repo_path, capture_output=True, text=True
            )
            print(f"{' '.join(pip_cmd)}:\n{dep_result.stdout}", flush=True)
            if dep_result.returncode != 0:
                print(f"(non-fatal) {dep_result.stderr}", flush=True)

    # Wrap the program.md content with explicit instructions to write outputs to disk
    prompt = f"""You are running inside an automated GPU pipeline. You MUST write all outputs to disk as actual files.

Here are your instructions from program.md:

{program_md_content}

LOGGING INSTRUCTIONS (follow exactly):
- Before you start any training, print this exact line: [AUTORESEARCH] Training started
- Before training, print what change you are testing: [AUTORESEARCH] Change: <one line description of the code change being tested>
- When training finishes, print this exact line: [AUTORESEARCH] Training finished
- After training, print the key metric value: [AUTORESEARCH] Metric: <metric name>=<value>
- When writing results to CSV, print this exact line: [AUTORESEARCH] Writing results to CSV

IMPORTANT: After completing the above instructions, make sure you have:
1. Written the final results to a CSV file in this repository (e.g. results/results.csv or similar)
2. Saved all code changes you made to the train/ directory (or wherever the training code lives)
3. All files must be written to the current working directory so they appear in git status
If any command fails, debug and fix it rather than stopping. Do not just print results — write them to files on disk."""

    # --- Pre-flight: verify claude is installed and API key is reachable ---
    version_check = subprocess.run(
        ["claude", "--version"], capture_output=True, text=True
    )
    print(f"claude version: {version_check.stdout.strip()} | stderr: {version_check.stderr.strip()}", flush=True)
    if version_check.returncode != 0:
        raise RuntimeError(f"claude CLI not found or broken: {version_check.stderr}")

    # --- Disable Claude Code sandbox ---
    # In Kubernetes/Flyte pods, Claude Code's sandbox tries to spin up a nested container
    # which fails silently and causes file writes to go to an ephemeral space instead of
    # the real working directory. Disabling it makes writes land in the actual filesystem.
    claude_config_dir = Path("/root/.claude")
    claude_config_dir.mkdir(parents=True, exist_ok=True)
    settings = claude_config_dir / "settings.json"
    import json as _json
    existing = _json.loads(settings.read_text()) if settings.exists() else {}
    existing["sandbox"] = False
    settings.write_text(_json.dumps(existing, indent=2))
    print(f"Wrote Claude Code settings: {settings.read_text()}", flush=True)

    # --- Run Claude Code CLI ---
    # Matches swe_agent.py exactly: prompt as positional arg, CI=true enables non-interactive mode
    cmd = [
        "claude",
        "--dangerously-skip-permissions",
        "--max-turns", "100",
        "--model", "claude-haiku-4-5-20251001",
        prompt,
    ]

    print(f"Running: {shlex.join(cmd[:3])} <prompt>", flush=True)

    claude_env = {
        **os.environ,
        "ANTHROPIC_API_KEY": anthropic_api_key,
        "CLAUDE_SKIP_PERMISSIONS": "true",
        "CI": "true",  # Enables non-interactive mode (no TTY required)
    }

    # Stream output line by line so logs appear in real time instead of buffering until done
    proc = subprocess.Popen(
        cmd,
        cwd=repo_path,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout stream
        text=True,
        env=claude_env,
    )

    stdout_lines = []
    for line in proc.stdout:
        line = line.rstrip("\n")
        print(line, flush=True)
        stdout_lines.append(line)

    proc.wait()
    full_output = "\n".join(stdout_lines)
    print(f"Claude Code exit code: {proc.returncode}", flush=True)

    if proc.returncode != 0:
        raise RuntimeError(
            f"Claude Code CLI exited with code {proc.returncode}\n"
            f"output: {full_output[-2000:]}"
        )

    # --- Collect changed files ---
    git_status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )

    print(f"Git status:\n{git_status.stdout}", flush=True)

    files_changed = []
    for line in git_status.stdout.strip().splitlines():
        if line:
            # git status --porcelain: first two chars are XY status flags
            file_path = line[3:].strip()
            files_changed.append(file_path)

    # Also list all files in repo dir for debugging
    all_files = subprocess.run(
        ["find", ".", "-type", "f", "-not", "-path", "./.git/*"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    print(f"All files in repo:\n{all_files.stdout}", flush=True)

    if not files_changed:
        raise RuntimeError(
            "Claude Code ran successfully but produced no file changes.\n"
            f"output: {full_output[-2000:]}"
        )

    # --- Commit ---
    subprocess.run(["git", "add", "."], cwd=repo_path, check=True)
    subprocess.run(["git", "add", "-f", "results.tsv"], cwd=repo_path, check=False)
    subprocess.run(["git", "add", "-f", "results/"], cwd=repo_path, check=False)
    commit_message = (
        "feat: autoresearch run via Claude Code\n\n"
        "Added research results (CSV) and updated train/ code changes.\n"
        "Generated by the autoresearch Flyte agent."
    )
    subprocess.run(
        ["git", "commit", "-m", commit_message],
        cwd=repo_path,
        check=True,
    )

    # --- Push ---
    print(f"GitHub token present: {bool(github_token)}, length: {len(github_token) if github_token else 0}", flush=True)
    authenticated_url = AUTORESEARCH_REPO_URL.replace(
        "https://", f"https://{GITHUB_USERNAME}:{github_token}@"
    )
    subprocess.run(
        ["git", "remote", "set-url", "origin", authenticated_url],
        cwd=repo_path,
        check=True,
    )
    push_result = subprocess.run(
        ["git", "push", "-u", "origin", branch_name, "--force"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    print(f"Push stdout: {push_result.stdout}", flush=True)
    print(f"Push stderr: {push_result.stderr}", flush=True)
    if push_result.returncode != 0:
        raise RuntimeError(f"git push failed (exit {push_result.returncode}):\n{push_result.stderr}")

    # --- Create PR via PyGithub ---
    auth = Auth.Token(github_token)
    gh = Github(auth=auth)
    repo = gh.get_repo(AUTORESEARCH_REPO_FULL_NAME)

    csv_files = [f for f in files_changed if f.endswith(".csv")]
    train_files = [f for f in files_changed if "train" in f]

    pr_body = f"""## AutoResearch Run

This PR was automatically generated by the autoresearch Flyte agent using Claude Code CLI.

### What changed
- **Result CSV files**: {', '.join(f'`{f}`' for f in csv_files) or 'none detected'}
- **Train code changes**: {', '.join(f'`{f}`' for f in train_files) or 'none detected'}

### All changed files
{chr(10).join(f'- `{f}`' for f in files_changed)}

---
🤖 Generated by [autoresearch Flyte agent](https://github.com/unionai-oss/autoresearch)
"""

    existing_prs = list(repo.get_pulls(state="open", head=f"unionai-oss:{branch_name}"))
    if existing_prs:
        pr = existing_prs[0]
        print(f"PR already exists: {pr.html_url}", flush=True)
    else:
        pr = repo.create_pull(
            title="feat: autoresearch results + train changes",
            body=pr_body,
            head=branch_name,
            base="master",
        )
        print(f"PR created: {pr.html_url}", flush=True)

    # --- Generate progress plot from results.tsv ---
    plot_path = repo_path / "progress.png"
    results_tsv = repo_path / "results.tsv"
    if results_tsv.exists():
        import matplotlib
        matplotlib.use("Agg")
        import matplotlib.pyplot as plt
        import pandas as pd

        df = pd.read_csv(str(results_tsv), sep="\t")
        df["val_bpb"] = pd.to_numeric(df["val_bpb"], errors="coerce")
        df["memory_gb"] = pd.to_numeric(df["memory_gb"], errors="coerce")
        df["status"] = df["status"].str.strip().str.upper()

        # Filter out crashes for plotting
        valid = df[df["status"] != "CRASH"].copy()
        valid = valid.reset_index(drop=True)

        if len(valid) > 0 and valid["val_bpb"].notna().any():
            baseline_bpb = valid.loc[0, "val_bpb"]
            best = valid["val_bpb"].min()

            # Only plot points at or below baseline (the interesting region)
            below = valid[valid["val_bpb"] <= baseline_bpb + 0.0005]

            fig, ax = plt.subplots(figsize=(16, 8))

            # Plot discarded as faint background dots
            disc = below[below["status"] == "DISCARD"]
            ax.scatter(disc.index, disc["val_bpb"],
                       c="#cccccc", s=12, alpha=0.5, zorder=2, label="Discarded")

            # Plot kept experiments as prominent green dots
            kept_v = below[below["status"] == "KEEP"]
            ax.scatter(kept_v.index, kept_v["val_bpb"],
                       c="#2ecc71", s=50, zorder=4, label="Kept", edgecolors="black", linewidths=0.5)

            # Running minimum step line
            kept_mask = valid["status"] == "KEEP"
            kept_idx = valid.index[kept_mask]
            kept_bpb = valid.loc[kept_mask, "val_bpb"]
            running_min = kept_bpb.cummin()
            ax.step(kept_idx, running_min, where="post", color="#27ae60",
                    linewidth=2, alpha=0.7, zorder=3, label="Running best")

            # Label each kept experiment with its description
            for idx, bpb in zip(kept_idx, kept_bpb):
                desc = str(valid.loc[idx, "description"]).strip()
                if len(desc) > 45:
                    desc = desc[:42] + "..."
                ax.annotate(desc, (idx, bpb),
                            textcoords="offset points",
                            xytext=(6, 6), fontsize=8.0,
                            color="#1a7a3a", alpha=0.9,
                            rotation=30, ha="left", va="bottom")

            n_total = len(df)
            n_kept = len(df[df["status"] == "KEEP"])
            ax.set_xlabel("Experiment #", fontsize=12)
            ax.set_ylabel("Validation BPB (lower is better)", fontsize=12)
            ax.set_title(f"Autoresearch Progress: {n_total} Experiments, {n_kept} Kept Improvements", fontsize=14)
            ax.legend(loc="upper right", fontsize=9)
            ax.grid(True, alpha=0.2)

            margin = (baseline_bpb - best) * 0.15
            ax.set_ylim(best - margin, baseline_bpb + margin)

            plt.tight_layout()
            plt.savefig(str(plot_path), dpi=150, bbox_inches="tight")
            plt.close(fig)
            print(f"Saved plot to {plot_path}", flush=True)

            # Upload plot to PR as a comment with base64 inline image
            import base64
            img_b64 = base64.b64encode(plot_path.read_bytes()).decode()
            pr_comment = (
                "## Autoresearch Progress\n\n"
                f"![Autoresearch Progress](data:image/png;base64,{img_b64})"
            )
            pr.create_issue_comment(pr_comment)
            print("Posted plot as PR comment.", flush=True)

            # Force-add plot to git and amend commit
            subprocess.run(["git", "add", "-f", str(plot_path)], cwd=repo_path, check=False)
            subprocess.run(
                ["git", "commit", "--amend", "--no-edit"],
                cwd=repo_path, check=False,
            )
            subprocess.run(
                ["git", "push", "-u", "origin", branch_name, "--force"],
                cwd=repo_path, check=False,
            )

            # Show plot in Flyte UI via report
            await flyte.report.replace.aio(
                f"<h2>Autoresearch Progress</h2>"
                f'<img src="data:image/png;base64,{img_b64}" style="max-width:100%"/>'
                f'<p><a href="{pr.html_url}">View PR</a></p>'
            )
            await flyte.report.flush.aio()
        else:
            print("results.tsv found but no valid val_bpb rows — skipping plot.", flush=True)
    else:
        print("results.tsv not found — skipping plot.", flush=True)

    return AutoResearchResult(
        pr_url=pr.html_url,
        pr_number=pr.number,
        branch_name=branch_name,
        files_changed=files_changed,
        success=True,
    )

# {{docs-fragment main}}
if __name__ == "__main__":
    import time

    flyte.init_from_config()

    run = flyte.with_runcontext(mode="remote").run(run_autoresearch)

    print(f"AutoResearch run started: {run.url}")
    print("Waiting for completion...")

    while True:
        try:
            run.wait()
            break
        except Exception as e:
            print(f"Connection dropped ({e}), reconnecting in 30s...")
            time.sleep(30)

    print(f"Done! See run at: {run.url}")
# {{/docs-fragment main}}
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/autoresearch/run.py*

## The autoresearch task

The task is a long, sequential procedure. It starts by installing Node.js and Claude Code at run time, cloning the repo, configuring git, creating a branch, and loading `program.md` as the prompt:

```
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "flyte>=2.0.0b22",
#     "PyGithub>=2.5.0",
#     "matplotlib>=3.7.0",
# ]
# ///

"""
AutoResearch Agent - Runs the autoresearch workflow using Claude Code CLI in a GPU environment.

This agent:
1. Starts a GPU-enabled container
2. Installs Claude Code CLI
3. Clones the autoresearch repository
4. Points Claude Code at program.md as the prompt and lets it run
5. Commits the result (CSV + code changes in train/) and creates a PR
"""

import os
import shlex
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

from github import Auth, Github

import flyte
import flyte.report
from _image import image as autoresearch_image

GITHUB_USERNAME = "parnianz"
GITHUB_EMAIL = "parnianzargham@gmail.com"
AUTORESEARCH_REPO_URL = "https://github.com/unionai-oss/autoresearch.git"
AUTORESEARCH_REPO_FULL_NAME = "unionai-oss/autoresearch"

# {{docs-fragment env}}
autoresearch_env = flyte.TaskEnvironment(
    name="autoresearch-agent",
    resources=flyte.Resources(
        cpu=8,
        memory="32Gi",
        gpu="T4:1",
        disk="100Gi",
    ),
    secrets=[
        flyte.Secret(key="github_token", as_env_var="GITHUB_TOKEN"),
        flyte.Secret(key="internal-anthropic-api-key", as_env_var="ANTHROPIC_API_KEY"),
    ],
    image=autoresearch_image,
)
# {{/docs-fragment env}}

# {{docs-fragment result}}
@dataclass
class AutoResearchResult:
    """Result of the autoresearch run."""

    pr_url: str
    pr_number: int
    branch_name: str
    files_changed: list[str]
    success: bool
    error_message: Optional[str] = None
# {{/docs-fragment result}}

def clone_repository(repo_url: str, work_dir: Path, github_token: str) -> Path:
    """Clone the autoresearch repository with authentication."""
    repo_name = repo_url.rstrip("/").split("/")[-1].replace(".git", "")
    repo_path = work_dir / repo_name

    # Inject token into HTTPS URL for authentication
    authenticated_url = repo_url.replace(
        "https://", f"https://{GITHUB_USERNAME}:{github_token}@"
    )

    if repo_path.exists():
        subprocess.run(["git", "pull"], cwd=repo_path, check=True)
    else:
        subprocess.run(["git", "clone", authenticated_url, str(repo_path)], check=True)

    return repo_path

# {{docs-fragment task}}
@autoresearch_env.task(report=True)
async def run_autoresearch() -> AutoResearchResult:
    """
    Run the autoresearch workflow end-to-end.

    Steps:
    - Clone https://github.com/unionai-oss/autoresearch
    - Configure git identity
    - Create a new branch
    - Run Claude Code CLI with program.md as the prompt
    - Commit results (CSV + train/ changes)
    - Push and open a PR against the autoresearch repo
    """
    github_token = os.environ["GITHUB_TOKEN"]
    anthropic_api_key = os.environ["ANTHROPIC_API_KEY"]

    # --- Install Node.js + Claude Code at runtime (keeps image small and submission fast) ---
    import tarfile
    import urllib.request as _urllib

    subprocess.run(["apt-get", "update", "-y"], check=False)
    subprocess.run(["apt-get", "install", "-y", "git"], check=False)

    node_url = "https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.gz"
    node_tar = Path("/tmp/node.tar.gz")
    print(f"Downloading Node.js from {node_url}...", flush=True)
    _urllib.urlretrieve(node_url, node_tar)
    size_mb = node_tar.stat().st_size / 1024 / 1024
    print(f"Downloaded {size_mb:.1f} MB to {node_tar}", flush=True)
    if size_mb < 1:
        raise RuntimeError(f"Node.js download appears empty/corrupt ({size_mb:.2f} MB) — network may be restricted")
    node_dir = Path("/tmp/node")
    node_dir.mkdir(exist_ok=True)
    print("Extracting Node.js...", flush=True)
    with tarfile.open(node_tar, "r:gz") as tar:
        members = [m for m in tar.getmembers() if m.name.split("/", 1)[-1]]
        for m in members:
            m.name = m.name.split("/", 1)[-1]
        tar.extractall(str(node_dir), members=[m for m in members if m.name])

    # Add node/npm to PATH for this process and all subprocesses
    node_bin = str(node_dir / "bin")
    os.environ["PATH"] = node_bin + ":" + os.environ.get("PATH", "")
    print(f"Node version: {subprocess.run(['node', '--version'], capture_output=True, text=True).stdout.strip()}", flush=True)

    npm_prefix = "/tmp/npm-global"
    Path(npm_prefix).mkdir(exist_ok=True)
    subprocess.run(["npm", "install", "-g", "--prefix", npm_prefix, "@anthropic-ai/claude-code"], check=True)
    os.environ["PATH"] = str(Path(npm_prefix) / "bin") + ":" + os.environ["PATH"]
    print("Node.js + Claude Code installed.", flush=True)

    # --- Clone repo ---
    work_dir = Path("/tmp/autoresearch_workspace")
    work_dir.mkdir(exist_ok=True, parents=True)
    repo_path = clone_repository(AUTORESEARCH_REPO_URL, work_dir, github_token)

    # --- Git identity ---
    subprocess.run(
        ["git", "config", "--global", "user.email", GITHUB_EMAIL], check=True
    )
    subprocess.run(
        ["git", "config", "--global", "user.name", GITHUB_USERNAME], check=True
    )

    # --- Create branch ---
    import time as _time
    branch_name = f"autoresearch/claude-run-{int(_time.time())}"
    try:
        subprocess.run(
            ["git", "checkout", "-b", branch_name],
            cwd=repo_path,
            check=True,
        )
    except subprocess.CalledProcessError:
        subprocess.run(
            ["git", "checkout", branch_name],
            cwd=repo_path,
            check=True,
        )

    # --- Read program.md to use as the Claude Code prompt ---
    program_md = repo_path / "program.md"
    if not program_md.exists():
        raise FileNotFoundError(
            f"program.md not found in {repo_path}. "
            "Make sure the autoresearch repo has a program.md at its root."
        )

    program_md_content = program_md.read_text()
    print(f"Loaded prompt from program.md ({len(program_md_content)} chars)")
    # {{/docs-fragment task}}

    # Install repo dependencies before handing off to Claude
    for pip_cmd in [
        ["pip", "install", "-e", "."],
        ["pip", "install", "-r", "requirements.txt"],
    ]:
        req_file = repo_path / pip_cmd[-1] if pip_cmd[-1].startswith("req") else None
        if req_file is None or req_file.exists():
            dep_result = subprocess.run(
                pip_cmd, cwd=repo_path, capture_output=True, text=True
            )
            print(f"{' '.join(pip_cmd)}:\n{dep_result.stdout}", flush=True)
            if dep_result.returncode != 0:
                print(f"(non-fatal) {dep_result.stderr}", flush=True)

    # Wrap the program.md content with explicit instructions to write outputs to disk
    prompt = f"""You are running inside an automated GPU pipeline. You MUST write all outputs to disk as actual files.

Here are your instructions from program.md:

{program_md_content}

LOGGING INSTRUCTIONS (follow exactly):
- Before you start any training, print this exact line: [AUTORESEARCH] Training started
- Before training, print what change you are testing: [AUTORESEARCH] Change: <one line description of the code change being tested>
- When training finishes, print this exact line: [AUTORESEARCH] Training finished
- After training, print the key metric value: [AUTORESEARCH] Metric: <metric name>=<value>
- When writing results to CSV, print this exact line: [AUTORESEARCH] Writing results to CSV

IMPORTANT: After completing the above instructions, make sure you have:
1. Written the final results to a CSV file in this repository (e.g. results/results.csv or similar)
2. Saved all code changes you made to the train/ directory (or wherever the training code lives)
3. All files must be written to the current working directory so they appear in git status
If any command fails, debug and fix it rather than stopping. Do not just print results — write them to files on disk."""

    # --- Pre-flight: verify claude is installed and API key is reachable ---
    version_check = subprocess.run(
        ["claude", "--version"], capture_output=True, text=True
    )
    print(f"claude version: {version_check.stdout.strip()} | stderr: {version_check.stderr.strip()}", flush=True)
    if version_check.returncode != 0:
        raise RuntimeError(f"claude CLI not found or broken: {version_check.stderr}")

    # --- Disable Claude Code sandbox ---
    # In Kubernetes/Flyte pods, Claude Code's sandbox tries to spin up a nested container
    # which fails silently and causes file writes to go to an ephemeral space instead of
    # the real working directory. Disabling it makes writes land in the actual filesystem.
    claude_config_dir = Path("/root/.claude")
    claude_config_dir.mkdir(parents=True, exist_ok=True)
    settings = claude_config_dir / "settings.json"
    import json as _json
    existing = _json.loads(settings.read_text()) if settings.exists() else {}
    existing["sandbox"] = False
    settings.write_text(_json.dumps(existing, indent=2))
    print(f"Wrote Claude Code settings: {settings.read_text()}", flush=True)

    # --- Run Claude Code CLI ---
    # Matches swe_agent.py exactly: prompt as positional arg, CI=true enables non-interactive mode
    cmd = [
        "claude",
        "--dangerously-skip-permissions",
        "--max-turns", "100",
        "--model", "claude-haiku-4-5-20251001",
        prompt,
    ]

    print(f"Running: {shlex.join(cmd[:3])} <prompt>", flush=True)

    claude_env = {
        **os.environ,
        "ANTHROPIC_API_KEY": anthropic_api_key,
        "CLAUDE_SKIP_PERMISSIONS": "true",
        "CI": "true",  # Enables non-interactive mode (no TTY required)
    }

    # Stream output line by line so logs appear in real time instead of buffering until done
    proc = subprocess.Popen(
        cmd,
        cwd=repo_path,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout stream
        text=True,
        env=claude_env,
    )

    stdout_lines = []
    for line in proc.stdout:
        line = line.rstrip("\n")
        print(line, flush=True)
        stdout_lines.append(line)

    proc.wait()
    full_output = "\n".join(stdout_lines)
    print(f"Claude Code exit code: {proc.returncode}", flush=True)

    if proc.returncode != 0:
        raise RuntimeError(
            f"Claude Code CLI exited with code {proc.returncode}\n"
            f"output: {full_output[-2000:]}"
        )

    # --- Collect changed files ---
    git_status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )

    print(f"Git status:\n{git_status.stdout}", flush=True)

    files_changed = []
    for line in git_status.stdout.strip().splitlines():
        if line:
            # git status --porcelain: first two chars are XY status flags
            file_path = line[3:].strip()
            files_changed.append(file_path)

    # Also list all files in repo dir for debugging
    all_files = subprocess.run(
        ["find", ".", "-type", "f", "-not", "-path", "./.git/*"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    print(f"All files in repo:\n{all_files.stdout}", flush=True)

    if not files_changed:
        raise RuntimeError(
            "Claude Code ran successfully but produced no file changes.\n"
            f"output: {full_output[-2000:]}"
        )

    # --- Commit ---
    subprocess.run(["git", "add", "."], cwd=repo_path, check=True)
    subprocess.run(["git", "add", "-f", "results.tsv"], cwd=repo_path, check=False)
    subprocess.run(["git", "add", "-f", "results/"], cwd=repo_path, check=False)
    commit_message = (
        "feat: autoresearch run via Claude Code\n\n"
        "Added research results (CSV) and updated train/ code changes.\n"
        "Generated by the autoresearch Flyte agent."
    )
    subprocess.run(
        ["git", "commit", "-m", commit_message],
        cwd=repo_path,
        check=True,
    )

    # --- Push ---
    print(f"GitHub token present: {bool(github_token)}, length: {len(github_token) if github_token else 0}", flush=True)
    authenticated_url = AUTORESEARCH_REPO_URL.replace(
        "https://", f"https://{GITHUB_USERNAME}:{github_token}@"
    )
    subprocess.run(
        ["git", "remote", "set-url", "origin", authenticated_url],
        cwd=repo_path,
        check=True,
    )
    push_result = subprocess.run(
        ["git", "push", "-u", "origin", branch_name, "--force"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    print(f"Push stdout: {push_result.stdout}", flush=True)
    print(f"Push stderr: {push_result.stderr}", flush=True)
    if push_result.returncode != 0:
        raise RuntimeError(f"git push failed (exit {push_result.returncode}):\n{push_result.stderr}")

    # --- Create PR via PyGithub ---
    auth = Auth.Token(github_token)
    gh = Github(auth=auth)
    repo = gh.get_repo(AUTORESEARCH_REPO_FULL_NAME)

    csv_files = [f for f in files_changed if f.endswith(".csv")]
    train_files = [f for f in files_changed if "train" in f]

    pr_body = f"""## AutoResearch Run

This PR was automatically generated by the autoresearch Flyte agent using Claude Code CLI.

### What changed
- **Result CSV files**: {', '.join(f'`{f}`' for f in csv_files) or 'none detected'}
- **Train code changes**: {', '.join(f'`{f}`' for f in train_files) or 'none detected'}

### All changed files
{chr(10).join(f'- `{f}`' for f in files_changed)}

---
🤖 Generated by [autoresearch Flyte agent](https://github.com/unionai-oss/autoresearch)
"""

    existing_prs = list(repo.get_pulls(state="open", head=f"unionai-oss:{branch_name}"))
    if existing_prs:
        pr = existing_prs[0]
        print(f"PR already exists: {pr.html_url}", flush=True)
    else:
        pr = repo.create_pull(
            title="feat: autoresearch results + train changes",
            body=pr_body,
            head=branch_name,
            base="master",
        )
        print(f"PR created: {pr.html_url}", flush=True)

    # --- Generate progress plot from results.tsv ---
    plot_path = repo_path / "progress.png"
    results_tsv = repo_path / "results.tsv"
    if results_tsv.exists():
        import matplotlib
        matplotlib.use("Agg")
        import matplotlib.pyplot as plt
        import pandas as pd

        df = pd.read_csv(str(results_tsv), sep="\t")
        df["val_bpb"] = pd.to_numeric(df["val_bpb"], errors="coerce")
        df["memory_gb"] = pd.to_numeric(df["memory_gb"], errors="coerce")
        df["status"] = df["status"].str.strip().str.upper()

        # Filter out crashes for plotting
        valid = df[df["status"] != "CRASH"].copy()
        valid = valid.reset_index(drop=True)

        if len(valid) > 0 and valid["val_bpb"].notna().any():
            baseline_bpb = valid.loc[0, "val_bpb"]
            best = valid["val_bpb"].min()

            # Only plot points at or below baseline (the interesting region)
            below = valid[valid["val_bpb"] <= baseline_bpb + 0.0005]

            fig, ax = plt.subplots(figsize=(16, 8))

            # Plot discarded as faint background dots
            disc = below[below["status"] == "DISCARD"]
            ax.scatter(disc.index, disc["val_bpb"],
                       c="#cccccc", s=12, alpha=0.5, zorder=2, label="Discarded")

            # Plot kept experiments as prominent green dots
            kept_v = below[below["status"] == "KEEP"]
            ax.scatter(kept_v.index, kept_v["val_bpb"],
                       c="#2ecc71", s=50, zorder=4, label="Kept", edgecolors="black", linewidths=0.5)

            # Running minimum step line
            kept_mask = valid["status"] == "KEEP"
            kept_idx = valid.index[kept_mask]
            kept_bpb = valid.loc[kept_mask, "val_bpb"]
            running_min = kept_bpb.cummin()
            ax.step(kept_idx, running_min, where="post", color="#27ae60",
                    linewidth=2, alpha=0.7, zorder=3, label="Running best")

            # Label each kept experiment with its description
            for idx, bpb in zip(kept_idx, kept_bpb):
                desc = str(valid.loc[idx, "description"]).strip()
                if len(desc) > 45:
                    desc = desc[:42] + "..."
                ax.annotate(desc, (idx, bpb),
                            textcoords="offset points",
                            xytext=(6, 6), fontsize=8.0,
                            color="#1a7a3a", alpha=0.9,
                            rotation=30, ha="left", va="bottom")

            n_total = len(df)
            n_kept = len(df[df["status"] == "KEEP"])
            ax.set_xlabel("Experiment #", fontsize=12)
            ax.set_ylabel("Validation BPB (lower is better)", fontsize=12)
            ax.set_title(f"Autoresearch Progress: {n_total} Experiments, {n_kept} Kept Improvements", fontsize=14)
            ax.legend(loc="upper right", fontsize=9)
            ax.grid(True, alpha=0.2)

            margin = (baseline_bpb - best) * 0.15
            ax.set_ylim(best - margin, baseline_bpb + margin)

            plt.tight_layout()
            plt.savefig(str(plot_path), dpi=150, bbox_inches="tight")
            plt.close(fig)
            print(f"Saved plot to {plot_path}", flush=True)

            # Upload plot to PR as a comment with base64 inline image
            import base64
            img_b64 = base64.b64encode(plot_path.read_bytes()).decode()
            pr_comment = (
                "## Autoresearch Progress\n\n"
                f"![Autoresearch Progress](data:image/png;base64,{img_b64})"
            )
            pr.create_issue_comment(pr_comment)
            print("Posted plot as PR comment.", flush=True)

            # Force-add plot to git and amend commit
            subprocess.run(["git", "add", "-f", str(plot_path)], cwd=repo_path, check=False)
            subprocess.run(
                ["git", "commit", "--amend", "--no-edit"],
                cwd=repo_path, check=False,
            )
            subprocess.run(
                ["git", "push", "-u", "origin", branch_name, "--force"],
                cwd=repo_path, check=False,
            )

            # Show plot in Flyte UI via report
            await flyte.report.replace.aio(
                f"<h2>Autoresearch Progress</h2>"
                f'<img src="data:image/png;base64,{img_b64}" style="max-width:100%"/>'
                f'<p><a href="{pr.html_url}">View PR</a></p>'
            )
            await flyte.report.flush.aio()
        else:
            print("results.tsv found but no valid val_bpb rows — skipping plot.", flush=True)
    else:
        print("results.tsv not found — skipping plot.", flush=True)

    return AutoResearchResult(
        pr_url=pr.html_url,
        pr_number=pr.number,
        branch_name=branch_name,
        files_changed=files_changed,
        success=True,
    )

# {{docs-fragment main}}
if __name__ == "__main__":
    import time

    flyte.init_from_config()

    run = flyte.with_runcontext(mode="remote").run(run_autoresearch)

    print(f"AutoResearch run started: {run.url}")
    print("Waiting for completion...")

    while True:
        try:
            run.wait()
            break
        except Exception as e:
            print(f"Connection dropped ({e}), reconnecting in 30s...")
            time.sleep(30)

    print(f"Done! See run at: {run.url}")
# {{/docs-fragment main}}
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/autoresearch/run.py*

From there the task:

1. Wraps the `program.md` brief with explicit logging and "write outputs to disk" instructions.
2. Disables the Claude Code sandbox (it conflicts with the Flyte pod's container) and runs the CLI non-interactively, streaming its output to the Flyte logs in real time.
3. Collects the files the agent changed via `git status`, commits them, and force-pushes the branch.
4. Opens (or reuses) a pull request with [PyGithub](https://pygithub.readthedocs.io/).
5. If the agent produced a `results.tsv`, renders a progress plot of validation bits-per-byte, attaches it to the PR, and streams it into the Flyte UI:

```
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "flyte>=2.0.0b22",
#     "PyGithub>=2.5.0",
#     "matplotlib>=3.7.0",
# ]
# ///

"""
AutoResearch Agent - Runs the autoresearch workflow using Claude Code CLI in a GPU environment.

This agent:
1. Starts a GPU-enabled container
2. Installs Claude Code CLI
3. Clones the autoresearch repository
4. Points Claude Code at program.md as the prompt and lets it run
5. Commits the result (CSV + code changes in train/) and creates a PR
"""

import os
import shlex
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

from github import Auth, Github

import flyte
import flyte.report
from _image import image as autoresearch_image

GITHUB_USERNAME = "parnianz"
GITHUB_EMAIL = "parnianzargham@gmail.com"
AUTORESEARCH_REPO_URL = "https://github.com/unionai-oss/autoresearch.git"
AUTORESEARCH_REPO_FULL_NAME = "unionai-oss/autoresearch"

# {{docs-fragment env}}
autoresearch_env = flyte.TaskEnvironment(
    name="autoresearch-agent",
    resources=flyte.Resources(
        cpu=8,
        memory="32Gi",
        gpu="T4:1",
        disk="100Gi",
    ),
    secrets=[
        flyte.Secret(key="github_token", as_env_var="GITHUB_TOKEN"),
        flyte.Secret(key="internal-anthropic-api-key", as_env_var="ANTHROPIC_API_KEY"),
    ],
    image=autoresearch_image,
)
# {{/docs-fragment env}}

# {{docs-fragment result}}
@dataclass
class AutoResearchResult:
    """Result of the autoresearch run."""

    pr_url: str
    pr_number: int
    branch_name: str
    files_changed: list[str]
    success: bool
    error_message: Optional[str] = None
# {{/docs-fragment result}}

def clone_repository(repo_url: str, work_dir: Path, github_token: str) -> Path:
    """Clone the autoresearch repository with authentication."""
    repo_name = repo_url.rstrip("/").split("/")[-1].replace(".git", "")
    repo_path = work_dir / repo_name

    # Inject token into HTTPS URL for authentication
    authenticated_url = repo_url.replace(
        "https://", f"https://{GITHUB_USERNAME}:{github_token}@"
    )

    if repo_path.exists():
        subprocess.run(["git", "pull"], cwd=repo_path, check=True)
    else:
        subprocess.run(["git", "clone", authenticated_url, str(repo_path)], check=True)

    return repo_path

# {{docs-fragment task}}
@autoresearch_env.task(report=True)
async def run_autoresearch() -> AutoResearchResult:
    """
    Run the autoresearch workflow end-to-end.

    Steps:
    - Clone https://github.com/unionai-oss/autoresearch
    - Configure git identity
    - Create a new branch
    - Run Claude Code CLI with program.md as the prompt
    - Commit results (CSV + train/ changes)
    - Push and open a PR against the autoresearch repo
    """
    github_token = os.environ["GITHUB_TOKEN"]
    anthropic_api_key = os.environ["ANTHROPIC_API_KEY"]

    # --- Install Node.js + Claude Code at runtime (keeps image small and submission fast) ---
    import tarfile
    import urllib.request as _urllib

    subprocess.run(["apt-get", "update", "-y"], check=False)
    subprocess.run(["apt-get", "install", "-y", "git"], check=False)

    node_url = "https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.gz"
    node_tar = Path("/tmp/node.tar.gz")
    print(f"Downloading Node.js from {node_url}...", flush=True)
    _urllib.urlretrieve(node_url, node_tar)
    size_mb = node_tar.stat().st_size / 1024 / 1024
    print(f"Downloaded {size_mb:.1f} MB to {node_tar}", flush=True)
    if size_mb < 1:
        raise RuntimeError(f"Node.js download appears empty/corrupt ({size_mb:.2f} MB) — network may be restricted")
    node_dir = Path("/tmp/node")
    node_dir.mkdir(exist_ok=True)
    print("Extracting Node.js...", flush=True)
    with tarfile.open(node_tar, "r:gz") as tar:
        members = [m for m in tar.getmembers() if m.name.split("/", 1)[-1]]
        for m in members:
            m.name = m.name.split("/", 1)[-1]
        tar.extractall(str(node_dir), members=[m for m in members if m.name])

    # Add node/npm to PATH for this process and all subprocesses
    node_bin = str(node_dir / "bin")
    os.environ["PATH"] = node_bin + ":" + os.environ.get("PATH", "")
    print(f"Node version: {subprocess.run(['node', '--version'], capture_output=True, text=True).stdout.strip()}", flush=True)

    npm_prefix = "/tmp/npm-global"
    Path(npm_prefix).mkdir(exist_ok=True)
    subprocess.run(["npm", "install", "-g", "--prefix", npm_prefix, "@anthropic-ai/claude-code"], check=True)
    os.environ["PATH"] = str(Path(npm_prefix) / "bin") + ":" + os.environ["PATH"]
    print("Node.js + Claude Code installed.", flush=True)

    # --- Clone repo ---
    work_dir = Path("/tmp/autoresearch_workspace")
    work_dir.mkdir(exist_ok=True, parents=True)
    repo_path = clone_repository(AUTORESEARCH_REPO_URL, work_dir, github_token)

    # --- Git identity ---
    subprocess.run(
        ["git", "config", "--global", "user.email", GITHUB_EMAIL], check=True
    )
    subprocess.run(
        ["git", "config", "--global", "user.name", GITHUB_USERNAME], check=True
    )

    # --- Create branch ---
    import time as _time
    branch_name = f"autoresearch/claude-run-{int(_time.time())}"
    try:
        subprocess.run(
            ["git", "checkout", "-b", branch_name],
            cwd=repo_path,
            check=True,
        )
    except subprocess.CalledProcessError:
        subprocess.run(
            ["git", "checkout", branch_name],
            cwd=repo_path,
            check=True,
        )

    # --- Read program.md to use as the Claude Code prompt ---
    program_md = repo_path / "program.md"
    if not program_md.exists():
        raise FileNotFoundError(
            f"program.md not found in {repo_path}. "
            "Make sure the autoresearch repo has a program.md at its root."
        )

    program_md_content = program_md.read_text()
    print(f"Loaded prompt from program.md ({len(program_md_content)} chars)")
    # {{/docs-fragment task}}

    # Install repo dependencies before handing off to Claude
    for pip_cmd in [
        ["pip", "install", "-e", "."],
        ["pip", "install", "-r", "requirements.txt"],
    ]:
        req_file = repo_path / pip_cmd[-1] if pip_cmd[-1].startswith("req") else None
        if req_file is None or req_file.exists():
            dep_result = subprocess.run(
                pip_cmd, cwd=repo_path, capture_output=True, text=True
            )
            print(f"{' '.join(pip_cmd)}:\n{dep_result.stdout}", flush=True)
            if dep_result.returncode != 0:
                print(f"(non-fatal) {dep_result.stderr}", flush=True)

    # Wrap the program.md content with explicit instructions to write outputs to disk
    prompt = f"""You are running inside an automated GPU pipeline. You MUST write all outputs to disk as actual files.

Here are your instructions from program.md:

{program_md_content}

LOGGING INSTRUCTIONS (follow exactly):
- Before you start any training, print this exact line: [AUTORESEARCH] Training started
- Before training, print what change you are testing: [AUTORESEARCH] Change: <one line description of the code change being tested>
- When training finishes, print this exact line: [AUTORESEARCH] Training finished
- After training, print the key metric value: [AUTORESEARCH] Metric: <metric name>=<value>
- When writing results to CSV, print this exact line: [AUTORESEARCH] Writing results to CSV

IMPORTANT: After completing the above instructions, make sure you have:
1. Written the final results to a CSV file in this repository (e.g. results/results.csv or similar)
2. Saved all code changes you made to the train/ directory (or wherever the training code lives)
3. All files must be written to the current working directory so they appear in git status
If any command fails, debug and fix it rather than stopping. Do not just print results — write them to files on disk."""

    # --- Pre-flight: verify claude is installed and API key is reachable ---
    version_check = subprocess.run(
        ["claude", "--version"], capture_output=True, text=True
    )
    print(f"claude version: {version_check.stdout.strip()} | stderr: {version_check.stderr.strip()}", flush=True)
    if version_check.returncode != 0:
        raise RuntimeError(f"claude CLI not found or broken: {version_check.stderr}")

    # --- Disable Claude Code sandbox ---
    # In Kubernetes/Flyte pods, Claude Code's sandbox tries to spin up a nested container
    # which fails silently and causes file writes to go to an ephemeral space instead of
    # the real working directory. Disabling it makes writes land in the actual filesystem.
    claude_config_dir = Path("/root/.claude")
    claude_config_dir.mkdir(parents=True, exist_ok=True)
    settings = claude_config_dir / "settings.json"
    import json as _json
    existing = _json.loads(settings.read_text()) if settings.exists() else {}
    existing["sandbox"] = False
    settings.write_text(_json.dumps(existing, indent=2))
    print(f"Wrote Claude Code settings: {settings.read_text()}", flush=True)

    # --- Run Claude Code CLI ---
    # Matches swe_agent.py exactly: prompt as positional arg, CI=true enables non-interactive mode
    cmd = [
        "claude",
        "--dangerously-skip-permissions",
        "--max-turns", "100",
        "--model", "claude-haiku-4-5-20251001",
        prompt,
    ]

    print(f"Running: {shlex.join(cmd[:3])} <prompt>", flush=True)

    claude_env = {
        **os.environ,
        "ANTHROPIC_API_KEY": anthropic_api_key,
        "CLAUDE_SKIP_PERMISSIONS": "true",
        "CI": "true",  # Enables non-interactive mode (no TTY required)
    }

    # Stream output line by line so logs appear in real time instead of buffering until done
    proc = subprocess.Popen(
        cmd,
        cwd=repo_path,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout stream
        text=True,
        env=claude_env,
    )

    stdout_lines = []
    for line in proc.stdout:
        line = line.rstrip("\n")
        print(line, flush=True)
        stdout_lines.append(line)

    proc.wait()
    full_output = "\n".join(stdout_lines)
    print(f"Claude Code exit code: {proc.returncode}", flush=True)

    if proc.returncode != 0:
        raise RuntimeError(
            f"Claude Code CLI exited with code {proc.returncode}\n"
            f"output: {full_output[-2000:]}"
        )

    # --- Collect changed files ---
    git_status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )

    print(f"Git status:\n{git_status.stdout}", flush=True)

    files_changed = []
    for line in git_status.stdout.strip().splitlines():
        if line:
            # git status --porcelain: first two chars are XY status flags
            file_path = line[3:].strip()
            files_changed.append(file_path)

    # Also list all files in repo dir for debugging
    all_files = subprocess.run(
        ["find", ".", "-type", "f", "-not", "-path", "./.git/*"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    print(f"All files in repo:\n{all_files.stdout}", flush=True)

    if not files_changed:
        raise RuntimeError(
            "Claude Code ran successfully but produced no file changes.\n"
            f"output: {full_output[-2000:]}"
        )

    # --- Commit ---
    subprocess.run(["git", "add", "."], cwd=repo_path, check=True)
    subprocess.run(["git", "add", "-f", "results.tsv"], cwd=repo_path, check=False)
    subprocess.run(["git", "add", "-f", "results/"], cwd=repo_path, check=False)
    commit_message = (
        "feat: autoresearch run via Claude Code\n\n"
        "Added research results (CSV) and updated train/ code changes.\n"
        "Generated by the autoresearch Flyte agent."
    )
    subprocess.run(
        ["git", "commit", "-m", commit_message],
        cwd=repo_path,
        check=True,
    )

    # --- Push ---
    print(f"GitHub token present: {bool(github_token)}, length: {len(github_token) if github_token else 0}", flush=True)
    authenticated_url = AUTORESEARCH_REPO_URL.replace(
        "https://", f"https://{GITHUB_USERNAME}:{github_token}@"
    )
    subprocess.run(
        ["git", "remote", "set-url", "origin", authenticated_url],
        cwd=repo_path,
        check=True,
    )
    push_result = subprocess.run(
        ["git", "push", "-u", "origin", branch_name, "--force"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    print(f"Push stdout: {push_result.stdout}", flush=True)
    print(f"Push stderr: {push_result.stderr}", flush=True)
    if push_result.returncode != 0:
        raise RuntimeError(f"git push failed (exit {push_result.returncode}):\n{push_result.stderr}")

    # --- Create PR via PyGithub ---
    auth = Auth.Token(github_token)
    gh = Github(auth=auth)
    repo = gh.get_repo(AUTORESEARCH_REPO_FULL_NAME)

    csv_files = [f for f in files_changed if f.endswith(".csv")]
    train_files = [f for f in files_changed if "train" in f]

    pr_body = f"""## AutoResearch Run

This PR was automatically generated by the autoresearch Flyte agent using Claude Code CLI.

### What changed
- **Result CSV files**: {', '.join(f'`{f}`' for f in csv_files) or 'none detected'}
- **Train code changes**: {', '.join(f'`{f}`' for f in train_files) or 'none detected'}

### All changed files
{chr(10).join(f'- `{f}`' for f in files_changed)}

---
🤖 Generated by [autoresearch Flyte agent](https://github.com/unionai-oss/autoresearch)
"""

    existing_prs = list(repo.get_pulls(state="open", head=f"unionai-oss:{branch_name}"))
    if existing_prs:
        pr = existing_prs[0]
        print(f"PR already exists: {pr.html_url}", flush=True)
    else:
        pr = repo.create_pull(
            title="feat: autoresearch results + train changes",
            body=pr_body,
            head=branch_name,
            base="master",
        )
        print(f"PR created: {pr.html_url}", flush=True)

    # --- Generate progress plot from results.tsv ---
    plot_path = repo_path / "progress.png"
    results_tsv = repo_path / "results.tsv"
    if results_tsv.exists():
        import matplotlib
        matplotlib.use("Agg")
        import matplotlib.pyplot as plt
        import pandas as pd

        df = pd.read_csv(str(results_tsv), sep="\t")
        df["val_bpb"] = pd.to_numeric(df["val_bpb"], errors="coerce")
        df["memory_gb"] = pd.to_numeric(df["memory_gb"], errors="coerce")
        df["status"] = df["status"].str.strip().str.upper()

        # Filter out crashes for plotting
        valid = df[df["status"] != "CRASH"].copy()
        valid = valid.reset_index(drop=True)

        if len(valid) > 0 and valid["val_bpb"].notna().any():
            baseline_bpb = valid.loc[0, "val_bpb"]
            best = valid["val_bpb"].min()

            # Only plot points at or below baseline (the interesting region)
            below = valid[valid["val_bpb"] <= baseline_bpb + 0.0005]

            fig, ax = plt.subplots(figsize=(16, 8))

            # Plot discarded as faint background dots
            disc = below[below["status"] == "DISCARD"]
            ax.scatter(disc.index, disc["val_bpb"],
                       c="#cccccc", s=12, alpha=0.5, zorder=2, label="Discarded")

            # Plot kept experiments as prominent green dots
            kept_v = below[below["status"] == "KEEP"]
            ax.scatter(kept_v.index, kept_v["val_bpb"],
                       c="#2ecc71", s=50, zorder=4, label="Kept", edgecolors="black", linewidths=0.5)

            # Running minimum step line
            kept_mask = valid["status"] == "KEEP"
            kept_idx = valid.index[kept_mask]
            kept_bpb = valid.loc[kept_mask, "val_bpb"]
            running_min = kept_bpb.cummin()
            ax.step(kept_idx, running_min, where="post", color="#27ae60",
                    linewidth=2, alpha=0.7, zorder=3, label="Running best")

            # Label each kept experiment with its description
            for idx, bpb in zip(kept_idx, kept_bpb):
                desc = str(valid.loc[idx, "description"]).strip()
                if len(desc) > 45:
                    desc = desc[:42] + "..."
                ax.annotate(desc, (idx, bpb),
                            textcoords="offset points",
                            xytext=(6, 6), fontsize=8.0,
                            color="#1a7a3a", alpha=0.9,
                            rotation=30, ha="left", va="bottom")

            n_total = len(df)
            n_kept = len(df[df["status"] == "KEEP"])
            ax.set_xlabel("Experiment #", fontsize=12)
            ax.set_ylabel("Validation BPB (lower is better)", fontsize=12)
            ax.set_title(f"Autoresearch Progress: {n_total} Experiments, {n_kept} Kept Improvements", fontsize=14)
            ax.legend(loc="upper right", fontsize=9)
            ax.grid(True, alpha=0.2)

            margin = (baseline_bpb - best) * 0.15
            ax.set_ylim(best - margin, baseline_bpb + margin)

            plt.tight_layout()
            plt.savefig(str(plot_path), dpi=150, bbox_inches="tight")
            plt.close(fig)
            print(f"Saved plot to {plot_path}", flush=True)

            # Upload plot to PR as a comment with base64 inline image
            import base64
            img_b64 = base64.b64encode(plot_path.read_bytes()).decode()
            pr_comment = (
                "## Autoresearch Progress\n\n"
                f"![Autoresearch Progress](data:image/png;base64,{img_b64})"
            )
            pr.create_issue_comment(pr_comment)
            print("Posted plot as PR comment.", flush=True)

            # Force-add plot to git and amend commit
            subprocess.run(["git", "add", "-f", str(plot_path)], cwd=repo_path, check=False)
            subprocess.run(
                ["git", "commit", "--amend", "--no-edit"],
                cwd=repo_path, check=False,
            )
            subprocess.run(
                ["git", "push", "-u", "origin", branch_name, "--force"],
                cwd=repo_path, check=False,
            )

            # Show plot in Flyte UI via report
            await flyte.report.replace.aio(
                f"<h2>Autoresearch Progress</h2>"
                f'<img src="data:image/png;base64,{img_b64}" style="max-width:100%"/>'
                f'<p><a href="{pr.html_url}">View PR</a></p>'
            )
            await flyte.report.flush.aio()
        else:
            print("results.tsv found but no valid val_bpb rows — skipping plot.", flush=True)
    else:
        print("results.tsv not found — skipping plot.", flush=True)

    return AutoResearchResult(
        pr_url=pr.html_url,
        pr_number=pr.number,
        branch_name=branch_name,
        files_changed=files_changed,
        success=True,
    )

# {{docs-fragment main}}
if __name__ == "__main__":
    import time

    flyte.init_from_config()

    run = flyte.with_runcontext(mode="remote").run(run_autoresearch)

    print(f"AutoResearch run started: {run.url}")
    print("Waiting for completion...")

    while True:
        try:
            run.wait()
            break
        except Exception as e:
            print(f"Connection dropped ({e}), reconnecting in 30s...")
            time.sleep(30)

    print(f"Done! See run at: {run.url}")
# {{/docs-fragment main}}
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/tutorials/autoresearch/run.py*

The entry point submits the task in `remote` mode and reconnects automatically if the client connection drops during the long run.

## Run the agent

### Create secrets

Get an Anthropic API key from the [Anthropic console](https://console.anthropic.com/) and a [GitHub personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) with permission to push and open PRs on the target repository.

Register both as Flyte secrets. The key names must match those declared in the `TaskEnvironment`:

```
flyte create secret github_token <YOUR_GITHUB_TOKEN>
flyte create secret internal-anthropic-api-key <YOUR_ANTHROPIC_API_KEY>
```

See [Secrets](https://www.union.ai/docs/v2/union/user-guide/task-configuration/secrets/page.md) for scoping and file-based secrets.

### Prepare the research repository

The target repository must contain a `program.md` at its root describing the research task for the agent. Point `AUTORESEARCH_REPO_URL` / `AUTORESEARCH_REPO_FULL_NAME` (and the git identity constants) at a repo you control.

### Run remotely

From the [example directory](https://github.com/unionai/unionai-examples/tree/main/v2/tutorials/autoresearch):

```
cd v2/tutorials/autoresearch
python run.py
```

This task runs remotely (it needs a GPU and network access). Follow the printed run URL to watch the agent's logs stream in, and open the run's report panel to see the progress plot once results are available. When the task finishes, the returned `AutoResearchResult` contains the pull request URL.

---
**Source**: https://github.com/unionai/unionai-docs/blob/main/content/tutorials/agents/autoresearch/_index.md
**HTML**: https://www.union.ai/docs/v2/union/tutorials/agents/autoresearch/
