Autonomous Agents for ML Research Acceleration

When Andrej Karpathy shared an autoresearch demo showing an agent improving a simplified single-GPU implementation of nanochat, it captured something many ML teams were already beginning to feel. Much of our day-to-day ML model development already resembles an agent loop: we form hypotheses, tune training recipes, inspect metrics, and discard unsuccessful ideas. The work still requires judgment, but many steps are repeatable given the appropriate context, tools, and evaluation signal.

We had been seeing a similar pattern in our engineering work at PathAI. As AI agents became more capable, we were able to delegate bounded tasks with clear feedback signals. For example, in debugging, an agent could inspect the relevant code, reproduce the failure, try fixes, run tests, and keep iterating until the issue was resolved.

We wondered: could a similar loop work for applied ML research, if scoped appropriately?

We explored this question during a recent internal hackathon by setting up an autoresearch workflow for PathAI ML tasks. Here, we share what we learned: where autoresearch helped, where it struggled, and what it takes to make autonomous experimentation practical for established ML systems.

What is autoresearch?

At its core, autoresearch is an AI agent running an experimental loop.

The idea is simple: point an agent at a machine learning model training codebase, give it access to compute, define a metric to optimize, and allow it to iterate autonomously. The agent forms a hypothesis, modifies the code or configuration, launches a model training run, evaluates the result, and decides whether the change was beneficial. Successful changes are retained, unsuccessful ones are reverted, and the outcome of every attempt is recorded in a running log.
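The keep-or-revert cycle described above can be sketched in a few lines. This is a minimal illustration, not the workflow used in this post: `ResearchLoop`, `Attempt`, and `toy_training` are hypothetical names, the "training run" is a stand-in function, and a higher metric is assumed to be better.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    """One entry in the running experiment log."""
    hypothesis: str
    metric: float
    kept: bool


@dataclass
class ResearchLoop:
    """Keep-or-revert loop; assumes a higher metric is better."""
    config: dict
    best_metric: float
    log: list = field(default_factory=list)

    def step(self, hypothesis, proposed_config, run_training):
        # Launch a run with the proposed change and evaluate it.
        metric = run_training(proposed_config)
        kept = metric > self.best_metric
        if kept:
            # Retain the change as the new starting point.
            self.config = dict(proposed_config)
            self.best_metric = metric
        # Otherwise self.config is left untouched, which plays the role of a
        # revert here (a real setup would more likely revert a git commit).
        self.log.append(Attempt(hypothesis, metric, kept))
        return kept


def toy_training(config):
    # Stand-in for a real training run: the score peaks at lr = 0.01.
    return 1.0 - 10.0 * abs(config["lr"] - 0.01)


loop = ResearchLoop(config={"lr": 0.1}, best_metric=toy_training({"lr": 0.1}))
loop.step("lower the learning rate", {"lr": 0.01}, toy_training)
loop.step("raise it part of the way back", {"lr": 0.05}, toy_training)
```

After these two steps, the first change is kept and the second is discarded, but both appear in `loop.log`. Recording failures alongside successes is what lets the log double as a record of what has already been ruled out.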

Unlike traditional hyperparameter optimization approaches such as grid search or Bayesian optimization, autoresearch is not constrained to a predefined search space. The agent is free to modify any aspect of the training pipeline that falls within its operating boundaries. The search process becomes open-ended and hypothesis-driven rather than parameter-driven.

For applied ML teams, autoresearch-driven workflows are particularly appealing because there is rarely a shortage of ideas to test. Most researchers have a growing list of promising experiments that never quite make it into the queue because every idea requires setup, execution, and analysis time. Autoresearch offers a way to autonomously work through that backlog.

Adapting autoresearch to PathAI

The original autoresearch examples operated on relatively self-contained training scripts. We needed a system that could work across PathAI training pipelines, which depend on large datasets, shared infrastructure, and internal tooling.

Rather than building a custom workflow for each project, we focused on creating a reusable framework (Figure 1) that could be applied to most training jobs. At the center of the system is a generic playbook that defines how the agent proposes experiments, evaluates results, records what it learns, and stays within a set of safety guardrails.

Figure 1. Autoresearch workflow.

Scoped Environments

To start a task, we give the agent a prepared ML training environment, a command to run the job, and a target metric. The environment already has the right dependencies, datasets, code, and project inputs staged, so the agent can focus on experimentation instead of setup.

Since training runs execute in our cluster environments, we bound what the agent can access and modify. Code changes are versioned through git, runs are tracked in Weights & Biases, and the agent’s actions are limited to the training workspace. We also restrict its edit surface to files where experimentation belongs, such as training code and model configs. Validation logic, metric computation, and other evaluation-critical paths remain fixed. This gives the agent room to explore while preserving trust in the results.
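One way to enforce a restricted edit surface is a path check applied before any agent-proposed change is accepted. The sketch below is an assumption about how such a guard could look, not a description of our tooling: the glob patterns and the `is_edit_allowed` helper are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical layout: experimentation lives in train/ and configs/,
# while evaluation-critical code lives in eval/ and metrics/.
EDITABLE_GLOBS = ("train/*.py", "configs/*.yaml")
PROTECTED_GLOBS = ("eval/*", "metrics/*")


def is_edit_allowed(path: str) -> bool:
    """Return True only for workspace-relative paths on the edit surface."""
    normalized = PurePosixPath(path)
    # Reject anything that could escape the training workspace.
    if normalized.is_absolute() or ".." in normalized.parts:
        return False
    candidate = str(normalized)
    # Protected paths win even if an editable pattern also matches.
    if any(fnmatchcase(candidate, glob) for glob in PROTECTED_GLOBS):
        return False
    return any(fnmatchcase(candidate, glob) for glob in EDITABLE_GLOBS)
```

Checking the protected patterns first means a loosely written editable pattern cannot accidentally expose validation or metric code.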

Learning Across Runs

As experiments accumulate, the process can become collaborative between agents and researchers. Both review results, identify promising directions, and add learnings back into the playbook, allowing knowledge to persist across runs. Their contributions are comparable in scope but differ in kind: researchers contribute judgment and preferences, while agents run experiments and report learnings based on what they observe. Over time, this process allows the agent to operate with better prior knowledge about the task, the codebase, and the kinds of changes that are worth trying.

Three tasks, and what they surfaced

As a proof of concept, we ran autoresearch loops on three representative digital pathology tasks, each with a constrained token budget:

Predicting stains at the slide level using additive multiple instance learning (aMIL)
Detecting artifacts at the patch level using a model deployed in production
Exploring slide-level vision-language alignment

Task 1: Stain Prediction

Despite the large number of slides at our disposal for model training, missing or incorrect stain metadata can still be a problem. We had previous aMIL baselines for this task that we wanted to improve. Because the task was self-contained but hard to prioritize manually, it was a good fit for autoresearch.

The agent ran 17 experiments over 8.5 hours (Figure 2). Each experiment had a short 30-minute compute budget, which naturally favored changes that improved convergence quickly. Over the course of these experiments, the validation F1 improved from 0.57 to 0.65. The six experiments that progressively improved performance were kept; the rest were discarded.

Most of the work stayed in configuration space: learning-rate schedules, iteration limits, and other training settings. This setup was useful for finding quick wins, but it also highlighted that short autoresearch runs may never progress past shallow sweeps into deeper modeling or code changes.

Figure 2. Autoresearch experiments for stain prediction.

Task 2: Artifact Detection

The first step in many digital pathology pipelines is detecting artifacts in input images and filtering them out; these artifacts often arise during sample preparation or slide scanning. For this task, autoresearch ran 36 artifact detection experiments over roughly 60 hours, with each run capped at 5 validations and about 1k training steps. This fixed-step budget made comparisons easier than fixed time limits would have, because model changes, batch size, data loading speed, and cluster conditions can all change how long each experiment takes.

While the agent did not find a recipe that clearly beat the baseline performance over its full training run (70k iterations), it did find configurations that reached comparable performance much faster. The final experiment reached baseline validation performance in ~15k steps versus ~25k for the baseline, roughly 40% fewer steps.
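The 40% figure follows directly from the two step counts. The sketch below shows the arithmetic, along with one way a "steps to reach target" number could be read off a validation curve. The helper names and the toy curve are illustrative assumptions; only the ~15k and ~25k step counts come from the experiments above.

```python
def first_step_reaching(curve, target):
    """Return the first step whose metric is >= target, or None.

    `curve` is a list of (step, metric) pairs sorted by step.
    """
    for step, metric in curve:
        if metric >= target:
            return step
    return None


def step_reduction(baseline_steps, candidate_steps):
    """Fraction of training steps saved relative to the baseline."""
    return (baseline_steps - candidate_steps) / baseline_steps


# Step counts reported above: ~15k for the final experiment, ~25k for baseline.
reduction = step_reduction(25_000, 15_000)  # (25k - 15k) / 25k

# Made-up curve, only to show how a step count would be extracted.
toy_curve = [(5_000, 0.40), (10_000, 0.55), (15_000, 0.70), (20_000, 0.71)]
toy_steps = first_step_reaching(toy_curve, target=0.70)
```

Put another way, the candidate needs 15/25 of the baseline's steps, which is the same as reaching the target about 1.67 times faster in step terms.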

The caveat is that this search was intentionally short-horizon. Fixed-step budgets bias the agent toward faster-converging configurations, so some choices may not transfer directly to full-length training runs. Still, the experiments gave us a useful default configuration for training artifact detection models.

Task 3: Vision-Language Alignment

In a previous blog post, we discussed our approach to learning a shared vision-language space to improve diagnosis prediction. This was a natural fit for autoresearch because of its large and evolving search space. The agent ran for about 30 hours across 22 experiments, naturally moving from configuration tweaks to code changes over time (Figure 3).

This task also surfaced training instability as a real failure mode. One experiment introduced a new aggregation mode that destabilized training; a later experiment fixed it with torch.amp. The agent was able to self-correct because it had enough context about the training framework to detect and correct these issues. The agent experimented with several interesting methods:

Combining the contrastive objective with an auxiliary prototype-classification loss consistently improved results.
Positional encodings, sigmoid-style contrastive losses, label smoothing, and taxonomy-based soft targets all produced little or no benefit in our setup.

Figure 3. Autoresearch experiments for vision-language alignment.

Practical Lessons

Autoresearch Is Best At Structured Elimination

Across these three tasks, autoresearch demonstrated its potential not just through metric improvement, but also through quickly identifying which ideas helped and why. The resulting experiment trail became a map of the problem that would otherwise take much longer to build manually.

Hypothesis Quality Depends On Explicit Guidance

The agent did not always infer training dynamics from logs on its own. Its reasoning improved when we seeded the instructions with interpretation heuristics, such as how to read train-validation divergence, distinguish early validation peaks from plateaus, and separate instability from underfitting.

When a loop got stuck, lightweight intervention was more effective than restarting. Inline guidance let engineers redirect exploration, so the process resembled a continuous research thread.
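Interpretation heuristics of the kind mentioned above can be written down explicitly. The following is a rough sketch under stated assumptions, not the guidance we gave the agent: the `diagnose` function, its labels, and its thresholds (`window`, `spike_factor`, `flat_tol`) are all hypothetical and would need tuning for a real task.

```python
import math


def diagnose(train_loss, val_loss, window=3, spike_factor=2.0, flat_tol=0.01):
    """Label a pair of loss curves using simple, illustrative rules.

    Both inputs are lists of per-validation losses with at least
    `window` entries each.
    """
    # Instability: non-finite values or a sudden jump in training loss.
    if any(not math.isfinite(x) for x in train_loss + val_loss):
        return "unstable"
    if any(b > spike_factor * a for a, b in zip(train_loss, train_loss[1:])):
        return "unstable"

    # Change in each loss over the last `window` points.
    train_delta = train_loss[-1] - train_loss[-window]
    val_delta = val_loss[-1] - val_loss[-window]

    # Train-validation divergence: training improves while validation degrades.
    if train_delta < -flat_tol and val_delta > flat_tol:
        return "diverging"
    if abs(val_delta) <= flat_tol:
        return "plateau"
    if val_delta < -flat_tol:
        return "still improving"
    return "degrading"
```

For example, `diagnose([1.0, 0.8, 0.6, 0.5, 0.4], [1.0, 0.9, 0.85, 0.9, 0.95])` is labeled as diverging, because training loss keeps falling while validation loss rises over the last three points. Rules this coarse will mislabel some curves, so they are better treated as prompts for the agent's reasoning than as verdicts.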

Budgeting Shapes The Search

Fixed iteration or validation budgets worked better than time limits. Time-based budgets were noisy because infrastructure variance changed how much training each experiment actually received. Step-based budgets made comparisons cleaner, but favored fast-converging configurations that still need longer validation before they can be considered durable.

Keep the Loop Tight

Making autoresearch practical required optimizing not only for model performance, but also for the efficiency of the loop itself. In autonomous experimentation, token costs can accumulate quickly, so we implemented two cost-control measures:

Separating responsibilities across models: more capable reasoning models handled planning and experiment design, while smaller, cheaper models were used for execution, monitoring, and routine bookkeeping tasks.
Reducing context use during monitoring: the agent sometimes got ahead of itself, hypothesizing the next experiment or making code changes before the current run had finished. We also found it was rereading the full training log every time it checked progress. To reduce that overhead, we set up a scheduled job that wakes at intervals, checks the run, and shares lightweight updates for the agent to consume.
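The log-rereading problem in particular has a simple remedy: remember how far into the log the last check got, and read only what was appended since. Below is a minimal sketch of that idea, assuming a plain-text log with one record per line. `LogTail`, `summarize`, and the log format are hypothetical, and the metric values in the demo are made up.

```python
import os
import tempfile


class LogTail:
    """Reads only the complete lines appended since the previous check."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte offset of the first unread line

    def read_new(self):
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
        # Consume up to the last newline so a half-written line is
        # left for the next check.
        last_newline = data.rfind(b"\n")
        if last_newline == -1:
            return []
        complete = data[: last_newline + 1]
        self.offset += len(complete)
        return complete.decode("utf-8").splitlines()


def summarize(lines, keyword="val_f1"):
    """Return the most recent line mentioning the metric, or None."""
    matches = [line for line in lines if keyword in line]
    return matches[-1] if matches else None


# Demo with a temporary log file and made-up values.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "train.log")
    with open(log_path, "w", encoding="utf-8") as f:
        f.write("step=100 train_loss=0.90\nstep=100 val_f1=0.10\n")
    tail = LogTail(log_path)
    first = tail.read_new()
    with open(log_path, "a", encoding="utf-8") as f:
        f.write("step=200 train_loss=0.80\nstep=200 val_f1=0.20\nstep=300 tra")
    second = tail.read_new()
    update = summarize(second)
```

A scheduled job could call `read_new` and `summarize` at each wake-up and hand the agent only the resulting one-line update, so the cost of a progress check stays constant no matter how long the log grows.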

Moving Forward

Scaling the autoresearch approach more effectively will require the following improvements:

Parallelization: running multiple independent experiments simultaneously with a shared compute budget
Reliability: recovering from interruptions without the agent losing its thread
Efficiency: testing many ideas with reduced compute cost, reserving full-scale training for paths with the best chance of success

In addition, the autoresearch workflow can be extended to additional engineering, research, and operational tasks across digital pathology, such as optimizing case processing time, improving algorithm GPU utilization, and maximizing deployment reliability.

In clinical domains such as pathology, researchers must consider the tradeoffs between performance and complexity and, ultimately, whether results would be trusted by a clinician. Autoresearch has the potential to accelerate model development, but expert guidance is necessary to translate that loop into meaningful progress.