Autoresearch for Everyone: How to Run 100 AI Experiments While You Sleep
What if you could run 100 machine learning experiments overnight --- on a single GPU --- without writing a line of code?
That’s exactly what Andrej Karpathy’s autoresearch does. Released on March 7, 2026, this 630-line Python script lets AI agents autonomously modify training code, run experiments, evaluate results, and keep improving --- all while you sleep.
Within two days, the announcement had millions of views. Researchers, developers, and companies were already running their own overnight experiments.
Here’s how it works and why it matters.
The Core Loop
Autoresearch’s design is elegant in its simplicity:
- Read the `program.md` file (your Markdown instructions)
- Modify `train.py` based on those instructions
- Train for exactly 5 minutes
- Measure the result (validation loss)
- Keep or discard --- if the metric improved, commit; if not, `git reset`
- Repeat indefinitely
At roughly 12 experiments per hour, you get about 100 experiments in an overnight session. Each successful improvement builds on the last, creating a compounding effect.
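The loop above is essentially a hill climb over commits. Here is a minimal sketch of that control flow in Python --- everything is a stub (in the real script an LLM edits `train.py` and a GPU run produces the loss; the function names here are illustrative), but the keep-or-discard logic is the whole idea:

```python
import random

def run_experiment(seed: int) -> float:
    """Stub for 'modify train.py, train for 5 minutes, return val loss'.
    In autoresearch itself, an LLM makes the edit and a real training
    run on one GPU produces this number."""
    rng = random.Random(seed)
    return 3.0 - 0.5 * rng.random()  # pretend validation loss

def autoresearch_loop(n_experiments: int) -> list[float]:
    best = float("inf")
    kept = []                        # the chain of committed improvements
    for i in range(n_experiments):
        loss = run_experiment(i)
        if loss < best:              # improved: "commit" and build on it
            best = loss
            kept.append(loss)
        # else: "git reset" --- the failed change is discarded
    return kept

improvements = autoresearch_loop(100)
```

Because each kept change becomes the new baseline, the list of committed losses is strictly decreasing --- that is the compounding effect in code.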
What You Need
The barrier to entry is remarkably low:
- One GPU --- the entire system is designed for single-GPU training
- 630 lines of Python --- small enough to fit in any LLM’s context window
- An LLM API key --- Claude, GPT, or another capable model
- A `program.md` file --- your Markdown instructions telling the agent what to optimize
That’s it. No cluster. No distributed training setup. No ML engineering team. One person, one GPU, one Markdown file.
Real Results
Karpathy left autoresearch running for about two days on a depth-12 model. The AI agent autonomously discovered around 20 improvements:
- Training time for the GPT-2 benchmark dropped from 2.02 hours to 1.80 hours
- An 11% improvement with zero human intervention
- The agent found issues humans had missed: attention mechanisms lacking proper scaling, missing regularization, and suboptimal hyperparameters
The key insight: the agent discovered things that experienced ML researchers hadn’t noticed. Not because it’s smarter, but because it could try 100 variations where a human might try 5.
Why 630 Lines Matters
The codebase is intentionally tiny. At ~630 lines, the entire train.py file fits within an LLM’s context window. This is a critical design decision.
If the agent can see the whole system at once, it can make intelligent modifications. It understands how the learning rate interacts with the batch size, how the attention mechanism connects to the output layer, how one change ripples through the entire training pipeline.
Give an AI agent a 50,000-line codebase and it makes local changes that might not make sense globally. Give it 630 lines and it can reason about the whole system.
The 5-Minute Budget
Every experiment runs for exactly 5 minutes. This constraint is brilliant:
It makes experiments comparable. If one run takes 3 minutes and another takes 20, you can’t fairly compare their results. A fixed time budget means every improvement is measured on equal footing.
It enables rapid iteration. 5 minutes is long enough to see meaningful training progress but short enough to run 12 experiments per hour.
It prevents runaway costs. Without a time limit, an agent might train for hours on a single promising change. The 5-minute cap keeps the feedback loop tight.
The Git Memory
Every experiment is a git commit. This gives the system memory:
- Successful changes are committed on a feature branch, building a chain of improvements
- Failed experiments are reverted with `git reset`, leaving no trace
- The history shows exactly what was tried, what worked, and what didn't
This means you can review the agent’s work as a series of git commits. Each commit message explains what the agent changed and why. It’s a complete audit trail of autonomous research.
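The keep-or-discard step maps directly onto plain git commands. A sketch, with illustrative function names (not taken from the actual script):

```python
import pathlib
import subprocess
import tempfile

def git(repo: pathlib.Path, *args: str) -> str:
    """Run a git command inside repo and return its stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def keep_or_discard(repo: pathlib.Path, improved: bool, message: str) -> None:
    """Commit the working tree if the metric improved, else wipe the change."""
    if improved:
        git(repo, "add", "-A")
        git(repo, "commit", "-m", message)
    else:
        git(repo, "reset", "--hard")  # failed experiment leaves no trace

# Demo in a throwaway repository.
repo = pathlib.Path(tempfile.mkdtemp())
git(repo, "init")
git(repo, "config", "user.email", "agent@example.com")
git(repo, "config", "user.name", "autoresearch-agent")

(repo / "train.py").write_text("lr = 3e-4\n")
keep_or_discard(repo, improved=True, message="baseline")

(repo / "train.py").write_text("lr = 1.0\n")  # a change that hurt the metric
keep_or_discard(repo, improved=False, message="unused")
```

After the failed experiment, `train.py` is back at the committed baseline and the log contains only the kept changes --- the audit trail described above.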
Beyond ML: The Pattern That Matters
Autoresearch is about training language models, but the pattern it introduces is universal:
Human writes Markdown instructions → AI agent executes autonomously → Results are measured and kept/discarded → Loop repeats
This pattern works for any domain where you can:
- Define clear goals in natural language
- Measure success automatically
- Keep or discard changes based on results
Companies are already applying this pattern beyond ML research --- to code optimization, marketing experiments, and product development.
The Markdown-First Approach
At the center of autoresearch is a Markdown file. Not Python. Not YAML. Not a GUI. A plain text file that anyone can read and edit.
This matters because it lowers the barrier to directing AI research. You don’t need to be an ML engineer to write a program.md. You need to understand the problem, the goals, and the constraints. The agent handles the implementation.
The skill shift is clear: from knowing how to write training code to knowing how to write effective agent instructions.
Getting Started
If you want to try the autoresearch pattern (even outside ML), start with these steps:
- Define your metric. What does “better” mean, and how do you measure it automatically?
- Write your program.md. Set goals, constraints, and strategy in clear Markdown.
- Keep the scope small. Like autoresearch's 630-line codebase, a system small enough for the agent to reason about end to end gives better results.
- Let it run. The point is autonomous operation. Resist the urge to intervene.
- Review the results. Check the git history to see what the agent tried and what worked.
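To make step 2 concrete, here is what a `program.md` might look like. This is a hypothetical example, not taken from the autoresearch repository:

```markdown
# Goal
Reduce validation loss on the language-model benchmark.

# Constraints
- Each run has a 5-minute budget; never change the budget or the eval code.
- Modify train.py only.
- One change per experiment, so each commit is attributable.

# Strategy hints
- Try learning-rate schedules, init scaling, and regularization first.
- Prefer small, reversible edits over rewrites.
```

Note the shape: a measurable goal, hard constraints the agent must not cross, and soft hints it can use or ignore.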
Building the Knowledge to Write Good Instructions
The quality of your program.md depends on your domain knowledge. The more you understand about the problem space, the better your instructions will be.
This is where having a curated library of reference material in Markdown format becomes valuable. Documentation, papers, blog posts, and examples --- all saved as clean Markdown, ready to inform your agent instructions.
Save converts any webpage to clean Markdown --- building the reference library you need to write effective AI agent instructions. Try Save free.