Parallelized autoresearch agent
Code available here.
This tutorial extends the Autoresearch agent pattern with a code-mode MLE agent that plans batches of training experiments, saves distinct train.py edits, and runs them in parallel via flyte.map. It follows the karpathy/autoresearch loop (minimize validation bits-per-byte on a TinyGPT variant) but orchestrates fan-out batches with durable Flyte tasks and unionai-sandbox execution.
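For readers unfamiliar with the metric, bits-per-byte can be derived from an ordinary cross-entropy loss. The sketch below assumes the loss is a mean per-token value in nats and that the token and byte counts of the validation text are known; the function name and signature are illustrative and are not taken from the example's train.py.

```python
import math


def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert a mean per-token cross-entropy (in nats) to bits per byte.

    Hypothetical helper: total nats = mean loss * token count, nats are
    converted to bits by dividing by ln(2), then normalized by byte count.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```

Normalizing by bytes rather than tokens keeps the score comparable across runs that change the tokenizer, since the byte count of the validation text does not depend on how it is tokenized.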
Compared to the single-threaded Claude Code autoresearch tutorial, this agent:
- Edits the full train.py source (upstream karpathy style) instead of calling a remote coding CLI
- Uses code_mode=True so the LLM writes Python plans that call run_experiment_batch or flyte_map
- Persists a leaderboard, code-edit history, and batch plans in MemoryStore
- Self-heals OOM during sandbox training runs by bumping memory and retrying
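The batching and self-healing ideas in the list above can be sketched in plain Python. This is only an illustration under stated assumptions: asyncio.gather stands in for the Flyte fan-out, the training call is simulated, and doubling the memory request is one possible way to "bump" it. None of the names below (SimulatedOOM, plan_batches, run_with_oom_retry, run_batch) come from the example's tools.py.

```python
import asyncio
from dataclasses import dataclass


class SimulatedOOM(RuntimeError):
    """Hypothetical stand-in for an out-of-memory failure in a sandbox run."""


@dataclass
class RunResult:
    title: str
    memory_gib: int
    attempts: int


def plan_batches(titles: list[str], batch_size: int) -> list[list[str]]:
    """Split experiment titles into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [titles[i:i + batch_size] for i in range(0, len(titles), batch_size)]


async def fake_train(title: str, memory_gib: int, needed_gib: int) -> None:
    """Simulated training run that fails when the memory request is too small."""
    await asyncio.sleep(0)
    if memory_gib < needed_gib:
        raise SimulatedOOM(f"{title}: {memory_gib}Gi < {needed_gib}Gi")


async def run_with_oom_retry(
    title: str, needed_gib: int, start_gib: int = 2, max_attempts: int = 4
) -> RunResult:
    """Retry a run, doubling the memory request after each simulated OOM."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    memory_gib = start_gib
    attempt = 1
    while True:
        try:
            await fake_train(title, memory_gib, needed_gib)
            return RunResult(title, memory_gib, attempt)
        except SimulatedOOM:
            if attempt >= max_attempts:
                raise  # give up and surface the failure to the caller
            memory_gib *= 2  # assumption: "bumping memory" means doubling
            attempt += 1


async def run_batch(batch: list[str], needs: dict[str, int]) -> list[RunResult]:
    """Run one batch concurrently; results keep the order of the batch."""
    return await asyncio.gather(*(run_with_oom_retry(t, needs[t]) for t in batch))
```

With six experiments and a batch size of three, plan_batches yields two batches, and each batch runs its members concurrently while each member retries independently.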
Define the task environments
The example uses three environments — bundle preparation, sandbox experiments, and the agent driver — sharing a Debian-based image with PyTorch and sandbox tooling.
agent_env = flyte.TaskEnvironment(
    name="autoresearch-agent",
    resources=flyte.Resources(cpu=1, memory="2Gi"),
    image=image,
    include=_INCLUDE,
    secrets=[flyte.Secret(key="internal-anthropic-api-key", as_env_var="ANTHROPIC_API_KEY")],
    depends_on=[experiment_env, bundle_env],
)
Supporting modules (train.py, prepare.py, tools.py, and ui.py) live alongside the entry point in the example directory.
The fan-out agent task
The driver task parallelized_autoresearch restores prior memory (default key parallelized-autoresearch), streams Activity / Leaderboard / Code edits / Memory report tabs, and runs the code-mode agent loop.
@agent_env.task(report=True)
async def parallelized_autoresearch(
    n_experiments: int = 6,
    num_shards: int = DEFAULT_NUM_SHARDS,
    memory_key: str = tools.MEMORY_KEY_FANOUT,
    batch_size: int = 3,
    max_turns: int = DEFAULT_MAX_TURNS,
) -> AutoresearchOutput:
    """Drive the fan-out code-edit MLE agent with sandbox batch execution."""
    bundle = await build_bundle(num_shards=num_shards)
    profile = await profile_bundle(bundle)

    memory = await MemoryStore.get_or_create.aio(key=memory_key)
    persisted = await memory.read_json.aio("memory/leaderboard.json", default=[])
    promising = await memory.read_json.aio("memory/promising_code.json", default=[])
    history = await tools.load_research_history(memory_key)
    flyte.logger.info(
        "Fan-out agent restored %d messages, %d experiments, %d promising edits, best val_bpb=%s.",
        len(memory.messages),
        len(persisted),
        len(promising),
        history.get("best_val_bpb"),
    )

    events: list[dict[str, Any]] = []

    async def on_event(ev) -> None:
        events.append({"type": ev.type, "data": ev.data})
        if ev.type in ("tool_start", "tool_end", "tool_error", "turn_start", "agent_end"):
            tab = flyte.report.get_tab("Activity")
            tab.replace(ui.render_activity_log(events))
            await flyte.report.flush.aio()
        if ev.type == "tool_end" and ev.data.get("tool") in (
            "edit_train_code",
            "edit_train_code_batch",
            "<sandbox>",
        ):
            edits = await tools.load_saved_code_edits(memory_key)
            if edits:
                flyte.report.get_tab("Code edits").replace(ui.render_code_edits_panel(edits))
                await flyte.report.flush.aio()

    directive_text = ui.directive_code_edit_fanout(
        n_experiments,
        profile,
        memory_key,
        batch_size=batch_size,
        history=history,
    )

    token = agent_progress_cb.set(on_event)
    run_agent = build_fanout_agent(max_turns=max_turns)
    try:
        result = await run_agent.run.aio(directive_text, memory=memory)
    finally:
        agent_progress_cb.reset(token)

    leaderboard, best = ui.parse_leaderboard(
        memory.messages,
        promising_fallback=promising,
    )
    leaderboard_dicts = [dataclasses.asdict(e) for e in leaderboard]
    code_edits = await tools.load_saved_code_edits(memory_key)

    tab_lb = flyte.report.get_tab("Leaderboard")
    tab_lb.replace(ui.render_leaderboard(leaderboard, best))
    flyte.report.get_tab("Code edits").replace(
        ui.render_code_edits_panel(code_edits, best_title=best.title if best else None)
    )

    await memory.write_json.aio(
        "memory/leaderboard.json",
        leaderboard_dicts,
        actor="parallelized-autoresearch",
        reason=f"leaderboard after {len(leaderboard)} experiments",
    )
    await memory.save.aio()

    audit = await memory.audit_tail(20)
    hypotheses = await memory.read_json.aio("memory/hypotheses.json", default=[])
    promising = await memory.read_json.aio("memory/promising_code.json", default=[])
    tab_mem = flyte.report.get_tab("Memory")
    tab_mem.replace(
        ui.render_memory_panel(
            memory_key,
            len(memory.messages),
            leaderboard_dicts,
            audit,
            hypotheses,
            persisted_promising=promising,
            code_edits=code_edits,
        )
    )

    summary_body = result.summary or result.error or ""
    if result.error and leaderboard:
        best_line = f" Best val_bpb so far: {best.val_bpb} ({best.title})." if best and best.val_bpb else ""
        summary_body = f"{result.error}{best_line}"

    await flyte.report.replace.aio(
        ui.render_summary(
            directive_text,
            leaderboard,
            best,
            summary_body,
            code_edits=code_edits,
        )
    )
    await flyte.report.flush.aio()

    return AutoresearchOutput(
        directive=directive_text,
        dataset_profile=profile,
        best=best,
        leaderboard=leaderboard,
        summary=summary_body,
        memory_key=memory_key,
        total_experiments=len(leaderboard),
    )
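The task installs its on_event callback with a set call and removes it with reset(token) in a finally block. That shape matches the standard-library contextvars API, so the sketch below assumes agent_progress_cb behaves like a contextvars.ContextVar; the source does not confirm this, and progress_cb, emit, fake_agent_loop, and drive are hypothetical names used only for illustration.

```python
import asyncio
import contextvars

# Assumption: the progress hook is a context variable holding an async callback.
progress_cb: contextvars.ContextVar = contextvars.ContextVar("progress_cb", default=None)


async def emit(event: dict) -> None:
    """Forward an event to the currently installed callback, if any."""
    cb = progress_cb.get()
    if cb is not None:
        await cb(event)


async def fake_agent_loop() -> None:
    """Simulated agent loop that reports a few lifecycle events."""
    await emit({"type": "turn_start"})
    await emit({"type": "tool_end", "tool": "edit_train_code"})
    await emit({"type": "agent_end"})


async def drive() -> list[str]:
    seen: list[str] = []

    async def on_event(event: dict) -> None:
        seen.append(event["type"])

    token = progress_cb.set(on_event)  # install the callback for this run
    try:
        await fake_agent_loop()
    finally:
        progress_cb.reset(token)  # restore the previous value even on error

    await emit({"type": "ignored"})  # no callback installed any more
    return seen
```

Resetting in a finally block means a failed agent run cannot leave a stale callback installed for later code in the same context.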
Run the agent
Create secrets
Register an Anthropic API key for the agent LLM calls:
flyte create secret internal-anthropic-api-key <YOUR_ANTHROPIC_API_KEY>
Run remotely
From the example directory:
cd v2/tutorials/parallelized_autoresearch
uv run --script parallelized_autoresearch.py -- --n-experiments 6 --batch-size 3 --num-shards 1
Use --memory-key to resume a prior research session (default: parallelized-autoresearch). Pass a unique key, for example parallelized-autoresearch-20260622-215057, to start with empty memory. Code mode needs more turns than JSON tool mode, so increase --max-turns for larger sweeps.
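A unique memory key in the style of the example above can be generated from a timestamp. This is a small sketch; fresh_memory_key is a hypothetical helper, not part of the example code, and the use of UTC is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def fresh_memory_key(
    prefix: str = "parallelized-autoresearch", now: Optional[datetime] = None
) -> str:
    """Build a key such as parallelized-autoresearch-20260622-215057."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%d-%H%M%S")
    return f"{prefix}-{stamp}"
```

The resulting string can be passed as the --memory-key value so that a run starts from empty memory instead of resuming the default session.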
Or invoke the agent task directly with flyte run (snake_case task inputs):
flyte run parallelized_autoresearch.py parallelized_autoresearch \
    --n_experiments 6 --batch_size 3 --num_shards 1 --max_turns 12 \
    --memory_key parallelized-autoresearch
The first run downloads climbmix data shards and trains a BPE tokenizer. Subsequent runs reuse cached bundle tasks.
See also the single-task Autoresearch agent tutorial for the Claude Code + pull-request workflow.