Autoresearch for Everyone: How to Run 100 AI Experiments While You Sleep
What if you could run 100 machine learning experiments overnight --- on a single GPU --- without writing a line of code?
That’s exactly what Andrej Karpathy’s autoresearch does. Released on March 7, 2026, this 630-line Python script lets AI agents autonomously modify training code, run experiments, evaluate results, and keep improving --- all while you sleep.
Within two days, the announcement had millions of views. Researchers, developers, and companies were already running their own overnight experiments.
Here’s how it works and why it matters.
The Core Loop
Autoresearch’s design is elegant in its simplicity:
- Read the `program.md` file (your Markdown instructions)
- Modify `train.py` based on those instructions
- Train for exactly 5 minutes
- Measure the result (validation loss)
- Keep or discard --- if the metric improved, commit; if not, `git reset`
- Repeat indefinitely
At roughly 12 experiments per hour, you get about 100 experiments in an overnight session. Each successful improvement builds on the last, creating a compounding effect.
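The loop above is essentially a hill climb over commits. Here is a minimal sketch of that control flow in Python --- everything is a stub (in the real script an LLM edits `train.py` and a GPU run produces the loss; the function names here are illustrative), but the keep-or-discard logic is the whole idea:

```python
import random

def run_experiment(seed: int) -> float:
    """Stub for 'modify train.py, train for 5 minutes, return val loss'.
    In autoresearch itself, an LLM makes the edit and a real training
    run on one GPU produces this number."""
    rng = random.Random(seed)
    return 3.0 - 0.5 * rng.random()  # pretend validation loss

def autoresearch_loop(n_experiments: int) -> list[float]:
    best = float("inf")
    kept = []                        # the chain of committed improvements
    for i in range(n_experiments):
        loss = run_experiment(i)
        if loss < best:              # improved: "commit" and build on it
            best = loss
            kept.append(loss)
        # else: "git reset" --- the failed change is discarded
    return kept

improvements = autoresearch_loop(100)
```

Because each kept change becomes the new baseline, the list of committed losses is strictly decreasing --- that is the compounding effect in code.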
What You Need
The barrier to entry is remarkably low:
- One GPU --- the entire system is designed for single-GPU training
- 630 lines of Python --- small enough to fit in any LLM’s context window
- An LLM API key --- Claude, GPT, or another capable model
- A `program.md` file --- your Markdown instructions telling the agent what to optimize
That’s it. No cluster. No distributed training setup. No ML engineering team. One person, one GPU, one Markdown file.
Real Results
Karpathy left autoresearch running for about two days on a depth-12 model. The AI agent autonomously discovered around 20 improvements:
- Training time for the GPT-2 benchmark dropped from 2.02 hours to 1.80 hours
- An 11% improvement with zero human intervention
- The agent found issues humans had missed: attention mechanisms lacking proper scaling, missing regularization, and suboptimal hyperparameters
The key insight: the agent discovered things that experienced ML researchers hadn’t noticed. Not because it’s smarter, but because it could try 100 variations where a human might try 5.
Why 630 Lines Matters
The codebase is intentionally tiny. At ~630 lines, the entire train.py file fits within an LLM’s context window. This is a critical design decision.
If the agent can see the whole system at once, it can make intelligent modifications. It understands how the learning rate interacts with the batch size, how the attention mechanism connects to the output layer, how one change ripples through the entire training pipeline.
Give an AI agent a 50,000-line codebase and it makes local changes that might not make sense globally. Give it 630 lines and it can reason about the whole system.
The 5-Minute Budget
Every experiment runs for exactly 5 minutes. This constraint is brilliant:
It makes experiments comparable. If one run takes 3 minutes and another takes 20, you can’t fairly compare their results. A fixed time budget means every improvement is measured on equal footing.
It enables rapid iteration. 5 minutes is long enough to see meaningful training progress but short enough to run 12 experiments per hour.
It prevents runaway costs. Without a time limit, an agent might train for hours on a single promising change. The 5-minute cap keeps the feedback loop tight.
The Git Memory
Every experiment is a git commit. This gives the system memory:
- Successful changes are committed on a feature branch, building a chain of improvements
- Failed experiments are reverted with `git reset`, leaving no trace
- The history shows exactly what was tried, what worked, and what didn't
This means you can review the agent’s work as a series of git commits. Each commit message explains what the agent changed and why. It’s a complete audit trail of autonomous research.
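The keep-or-discard step maps directly onto plain git commands. A sketch, with illustrative function names (not taken from the actual script):

```python
import pathlib
import subprocess
import tempfile

def git(repo: pathlib.Path, *args: str) -> str:
    """Run a git command inside repo and return its stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def keep_or_discard(repo: pathlib.Path, improved: bool, message: str) -> None:
    """Commit the working tree if the metric improved, else wipe the change."""
    if improved:
        git(repo, "add", "-A")
        git(repo, "commit", "-m", message)
    else:
        git(repo, "reset", "--hard")  # failed experiment leaves no trace

# Demo in a throwaway repository.
repo = pathlib.Path(tempfile.mkdtemp())
git(repo, "init")
git(repo, "config", "user.email", "agent@example.com")
git(repo, "config", "user.name", "autoresearch-agent")

(repo / "train.py").write_text("lr = 3e-4\n")
keep_or_discard(repo, improved=True, message="baseline")

(repo / "train.py").write_text("lr = 1.0\n")  # a change that hurt the metric
keep_or_discard(repo, improved=False, message="unused")
```

After the failed experiment, `train.py` is back at the committed baseline and the log contains only the kept changes --- the audit trail described above.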
Beyond ML: The Pattern That Matters
Autoresearch is about training language models, but the pattern it introduces is universal:
Human writes Markdown instructions → AI agent executes autonomously → Results are measured and kept/discarded → Loop repeats
This pattern works for any domain where you can:
- Define clear goals in natural language
- Measure success automatically
- Keep or discard changes based on results
Companies are already applying this pattern beyond ML research --- to code optimization, marketing experiments, and product development.
The Markdown-First Approach
At the center of autoresearch is a Markdown file. Not Python. Not YAML. Not a GUI. A plain text file that anyone can read and edit.
This matters because it lowers the barrier to directing AI research. You don’t need to be an ML engineer to write a program.md. You need to understand the problem, the goals, and the constraints. The agent handles the implementation.
The skill shift is clear: from knowing how to write training code to knowing how to write effective agent instructions.
Getting Started
If you want to try the autoresearch pattern (even outside ML), start with these steps:
- Define your metric. What does “better” mean, and how do you measure it automatically?
- Write your program.md. Set goals, constraints, and strategy in clear Markdown.
- Keep the scope small. Like autoresearch's 630-line codebase, a system small enough for the agent to reason about end to end gives better results.
- Let it run. The point is autonomous operation. Resist the urge to intervene.
- Review the results. Check the git history to see what the agent tried and what worked.
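To make step 2 concrete, here is what a `program.md` might look like. This is a hypothetical example, not taken from the autoresearch repository:

```markdown
# Goal
Reduce validation loss on the language-model benchmark.

# Constraints
- Each run has a 5-minute budget; never change the budget or the eval code.
- Modify train.py only.
- One change per experiment, so each commit is attributable.

# Strategy hints
- Try learning-rate schedules, init scaling, and regularization first.
- Prefer small, reversible edits over rewrites.
```

Note the shape: a measurable goal, hard constraints the agent must not cross, and soft hints it can use or ignore.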
Building the Knowledge to Write Good Instructions
The quality of your program.md depends on your domain knowledge. The more you understand about the problem space, the better your instructions will be.
This is where having a curated library of reference material in Markdown format becomes valuable. Documentation, papers, blog posts, and examples --- all saved as clean Markdown, ready to inform your agent instructions.
Save converts any webpage to clean Markdown --- building the reference library you need to write effective AI agent instructions. Try Save free.