Survey 2026

What's Missing in Autonomous Research? A Systematization of Systems, Benchmarks, and Verification

Xingyu Ren, Youran Sun, Chugang Yi, Kejia Zhang, Jiaxuan Guo, Jianda Du, Haizhao Yang

Seven-axis coordinate system for autonomous research

Overview

In barely two years, autonomous research has jumped from isolated research assistants to systems that automate large stretches of the scientific workflow: drafting papers, searching over candidates, running code, and connecting to wet-lab and simulator readouts. But the same systems are described as AI scientists, research agents, deep-research agents, or agentic-science platforms, report wildly different kinds of evidence, and are scored on benchmarks that rarely line up. The field has outgrown its own map.

This survey draws that map. It systematizes public work on autonomous research through June 2026, and its organizing move is a single distinction the field usually treats as a footnote: it separates what a system can produce from what it can defend. Read along that axis, most systems can generate a research artifact, yet almost none can block a weak result before release. The survey makes that missing release architecture explicit, system by system.

Timeline of autonomous-research systems and evaluation signals. The upper track summarizes system and domain milestones; the lower track summarizes benchmarks, audits, and review studies that motivate evaluation and verification.

Key Contributions

The survey is built in two halves that answer two different questions. Part one maps the systems: what each one builds and how far it can defend it. Part two maps the evaluation and verification machinery around them: what benchmarks can make visible and what verifiers can actually block.

For part one, the survey hand-codes 56 autonomous-research systems along seven axes, each prying apart a capability that end-to-end "AI Scientist" claims tend to fuse into one:

| Axis | What it records |
| --- | --- |
| Loop topology (L) | Whether feedback is absent, single-loop, or multi-loop |
| Verifier gate scope (G) | What a verifier can actually block in practice |
| Orchestration mode (O) | Who advances the workflow: human, script, or agent |
| Portfolio parallelism (P) | One project, in-project branches, or multiple live projects |
| Artifact substrate (A) | Context, loose files, or protocolized records |
| Disciplinary coverage (DC) | Claimed scope vs. publicly demonstrated evidence |
| Lifecycle coverage (LC) | How far a thread reaches, from support module to publication cycle |

Every system also carries an Evidence Reliability Grade (R1 to R5) that records the strength of public evidence rather than capability: an infrastructure component with strong artifacts can outrank a flashy end-to-end system whose claims are thinly substantiated.
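
To make the coding scheme concrete, one row of the matrix can be represented as a small validated record. The sketch below is illustrative only: the class name `SystemCoding`, its field names, the string `"none"` for an absent gate, and the `claim_evidence_gap` helper are assumptions introduced here, not part of the survey. The allowed labels follow the survey's legend.

```python
from dataclasses import dataclass

# Allowed labels per axis, following the survey's legend.
# "none" is this sketch's own marker for "no verifier gate".
ALLOWED = {
    "loop": ("L0", "L1", "L2"),
    "gate": ("none", "G-task", "G-run", "G-code", "G-formal",
             "G-paper-partial", "G-paper-full"),
    "orchestration": ("O0", "O1", "O2"),
    "parallelism": ("P0", "P1", "P2"),
    "artifact": ("A0", "A1", "A2"),
    "lifecycle": ("LC0", "LC1", "LC2", "LC3", "LC4"),
    "reliability": ("R1", "R2", "R3", "R4", "R5"),
}


@dataclass(frozen=True)
class SystemCoding:
    """One hand-coded system: seven axes plus the reliability grade."""
    name: str
    loop: str
    gate: str
    orchestration: str
    parallelism: str
    artifact: str
    dc_claim: str      # claimed scope, e.g. "DC3" or "Infra"
    dc_evidence: str   # publicly evidenced scope, e.g. "DC1-CS/ML"
    lifecycle: str
    reliability: str

    def __post_init__(self):
        # Reject labels that are not in the legend.
        for axis, allowed in ALLOWED.items():
            value = getattr(self, axis)
            if value not in allowed:
                raise ValueError(f"{self.name}: {axis}={value!r} is not one of {allowed}")

    def claim_evidence_gap(self) -> bool:
        # Crude proxy (an assumption of this sketch): a gap exists whenever
        # the claimed and evidenced disciplinary coverage labels differ.
        return self.dc_claim != self.dc_evidence


# Values taken from the AI Scientist v1 row of the coding matrix.
ai_scientist_v1 = SystemCoding(
    "AI Scientist v1", "L1", "none", "O1", "P1", "A1",
    "DC3", "DC1-CS/ML", "LC3", "R4",
)
print(ai_scientist_v1.claim_evidence_gap())  # True
```

Keeping claimed and evidenced coverage as separate fields mirrors the survey's choice to record the claim and the evidence side by side rather than collapsing them into one label.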

Read together, the axes draw a sharp line. Production has scaled; defensibility has not. No reviewed public system reaches a full-manuscript release gate (G-paper-full), agent-dispatcher orchestration (O2), a multi-project portfolio (P2), or cross-disciplinary validation in the DC-Evidence column, and only four systems gate even selected paper-level claims. The result is a field map that locates each system by both what it builds and what it can stand behind.

Part One: The System Coding Matrix · All 56 Systems

This is the core of the survey: every reviewed autonomous-research system, hand-coded on all seven axes plus its Evidence Reliability Grade (R) from public evidence only. Where a paper claimed a capability without inspectable mechanism, artifact, or trace evidence, the lower label was assigned and the claim-to-evidence gap recorded. Sorted by first public date.

| System | Date (YY.MM) | L | G | O | P | A | DC-Claim | DC-Evidence | LC | R | Key note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AI Scientist v1 | 24.08 | L1 | none | O1 | P1 | A1 | DC3 | DC1-CS/ML | LC3 | R4 | Paper pipeline with soft LLM review; no release-blocking claim gate. |
| PaperQA2 | 24.09 | L0 | none | O1 | P0 | A2 | Infra | Infra | LC0 | R4 | Literature QA and citation grounding module. |
| CycleResearcher | 24.11 | L0 | none | O1 | P1 | A1 | DC3 | DC1-CS/ML | LC1 | R3 | Paper-shaped review loop has disclosed fabricated experiments, so it is treated as textual/proposal-stage coverage rather than LC3 evidence. |
| OpenScholar | 24.11 | L0 | none | O1 | P0 | A2 | Infra | Infra | LC0 | R5 | Citation-grounded literature infrastructure, not end-to-end research. |
| Agent Laboratory | 25.01 | L1 | none | O1 | P0 | A1 | DC3 | DC1-CS/ML | LC3 | R3 | Three-phase workflow with human checkpoints and judge/human score gap. |
| AIDE | 25.02 | L0 | G-task | O1 | P1 | A1 | DC1-ML | DC1-ML | LC2 | R4 | ML competition agent with external scorer. |
| MLGym | 25.02 | L0 | G-task | O1 | P0 | A2 | DC1-ML | DC1-ML | LC0 | R4 | Research-agent framework and benchmark; independent reruns do not show branch control. |
| Tree-of-Debate | 25.02 | L1 | none | O1 | P1 | A1 | Infra | Infra | LC0 | R3 | Retrieval-backed debate for paper comparison, not new evidence production. |
| Co-Scientist | 25.02 | L2 | none | O1 | P1 | A1 | DC3 | DC1-biomed | LC1 | R4 | Hypothesis loop proposes and ranks candidates by Elo without an enforceable task gate; selected wet-lab checks are external validation, not autonomous evidence production. |
| AgentRxiv | 25.03 | L1 | none | O1 | P1 | A2 | DC3 | DC1-CS/ML | LC3 | R3 | Collaborative autonomous-research target with a searchable preprint substrate and limited public gate evidence. |
| AI Scientist v2 | 25.04 | L2 | none | O1 | P1 | A1 | DC3 | DC1-CS/ML | LC3 | R4 | Tree search and visual critique add feedback; publication signal is reported, while release review remains soft. |
| Robin | 25.05 | L1 | G-run | O1 | P1 | A1 | DC1-biomed | DC1-biomed | LC2 | R4 | RNA-seq and assay evidence support a candidate, not an autonomous full paper. |
| R&D-Agent | 25.05 | L0 | G-task | O1 | P1 | A2 | DC1-ML/DS | DC1-ML/DS | LC2 | R4 | Industrial data-science loop with an exploration graph and MLE-Bench-style scoring. |
| NovelSeek | 25.05 | L1 | G-task | O1 | P1 | A1 | DC3 | DC2 | LC2 | R3 | Idea-branch evolution and hypothesis-to-verification utilities with multi-domain computational readouts. |
| AI-Researcher | 25.05 | L1 | none | O1 | P1 | A1 | DC3 | DC1-CS/ML | LC3 | R3 | Literature, math-to-code, and paper workflow; judge artifacts remain a boundary. |
| AlphaEvolve | 25.06 | L0 | G-code | O1 | P1 | A2 | DC2 | DC2 | LC2 | R3 | Protocolized program database with objective-scored code gates. |
| AIRA-dojo | 25.07 | L0 | G-task | O1 | P1 | A2 | DC1-ML | DC1-ML | LC2 | R4 | Iterative ML loop drafts, improves, debugs, executes notebooks, evaluates CV, and writes submissions; no manuscript workflow. |
| SWE-Debate | 25.07 | L1 | G-code | O1 | P1 | A1 | DC1-SE | DC1-SE | LC2 | R3 | Debate and MCTS around software patches; code gate is local. |
| MetaAgent | 25.08 | L0 | none | O1 | P0 | A2 | Infra | Infra | LC0 | R3 | Tool-interaction history feeds a persistent vector memory while reflection is injected into context separately; no enforceable gate or parallel research-branch evidence. |
| CMA | 25.08 | L0 | none | O1 | P0 | A2 | Infra | Infra | LC0 | R2 | Asynchronous modules and shared memory, but no qualifying research-review loop or verifier gate. |
| AInstein | 25.10 | L2 | G-task | O1 | P0 | A1 | DC1-CS/ML | DC1-CS/ML | LC1 | R4 | Generalizer and solver critique loops support pre-experimental problem-solution evidence. |
| Kosmos | 25.11 | L0 | none | O1 | P1 | A2 | DC3 | DC2 | LC3 | R4 | Sentence-to-evidence links help audit, but synthesis and interpretation remain weaker. |
| EviBound | 25.11 | L0 | G-run | O1 | P0 | A2 | Infra | DC1-ML | LC0 | R4 | Reusable verifier infrastructure for run contracts. |
| AgentEvolver | 25.11 | L0 | G-task | O1 | P1 | A2 | Infra | Infra | LC0 | R3 | Agent-controller evolution over held-out tool-use tasks. |
| OmniScientist | 25.11 | L2 | none | O1 | P0 | A2 | DC3 | DC1-CS/ML | LC3 | R2 | AI/CS case-study evidence supports a broader ideation, experiment, writing, and review workflow claim. |
| CFD-copilot | 25.12 | L0 | G-run | O1 | P0 | A1 | DC1-CFD | DC1-CFD | LC2 | R4 | Simulation automation with CFD readouts. |
| PhysMaster | 25.12 | L1 | G-run | O1 | P1 | A2 | DC1-physics | DC1-physics | LC2 | R2 | Literature retrieval and MCTS trajectories over physics cases with simulator checks. |
| ASG-SI | 25.12 | L0 | G-run | O1 | P0 | A2 | Infra | Infra | LC0 | R2 | Audited skill-graph proposal; O/A and run-gate evidence are architectural claims. |
| ML-Master 2.0 | 26.01 | L0 | G-task | O1 | P1 | A2 | DC1-ML | DC1-ML | LC2 | R2 | MLE-Bench medal loop with hierarchical cognitive cache. |
| InternAgent-1.5 | 26.02 | L0 | G-task | O1 | P1 | A2 | DC3 | DC2 | LC2 | R2 | Generation-to-verification-to-evolution over long-horizon scientific tasks; public evidence is claim-level. |
| ClawdLab | 26.02 | L0 | none | O0 | P0 | A2 | Infra | Infra | LC0 | R2 | Governance and harness proposal with PI oversight; public evidence is architectural, not a demonstrated gate. |
| EvoScientist | 26.03 | L1 | none | O1 | P1 | A2 | DC3 | DC1-CS/ML | LC3 | R2 | Idea generation, code execution, and reported full papers support LC3; venue acceptance does not by itself imply LC4. |
| OpenResearcher | 26.03 | L0 | G-task | O1 | P1 | A2 | Infra | Infra | LC0 | R5 | Open search/open/find trajectories for evidence seeking. |
| AEGIS | 26.03 | L1 | G-code | O1 | P0 | A2 | DC1-security | DC1-security | LC0 | R3 | Dialectical verifier and meta-audit over code evidence. |
| Bilevel Autoresearch | 26.03 | L0 | G-task | O1 | P0 | A2 | Infra | DC1-CS/ML | LC2 | R3 | Sequential trace-mediated inner loop and outer mechanism search over scoreable code tasks. |
| SEVerA | 26.03 | L0 | G-formal | O1 | P0 | A2 | Infra | DC1-formal | LC0 | R4 | Reusable verifier infrastructure for formal guards. |
| TianJi | 26.03 | L1 | G-run | O1 | P0 | A1 | DC1-atmos. | DC1-atmos. | LC2 | R3 | Meteorology hypotheses with numerical-model checks. |
| Medical AI Scientist | 26.03 | L2 | G-task | O1 | P0 | A2 | DC1-Med | DC1-Med | LC3 | R3 | Multi-stage medical-AI manuscript workflow with idea gates and structured experimental records. |
| Multi-Agent Collab | 26.03 | L0 | G-task | O1 | P1 | A2 | Infra | DC1-ML | LC2 | R3 | Worktree patches, global memory, preflight, training, and val-bpb acceptance form a code-optimization evidence loop. |
| CORAL | 26.04 | L0 | G-task | O1 | P1 | A2 | Infra | DC1-CS/optim. | LC2 | R4 | Hidden graders and worktrees provide strong task gates; heartbeat reflection is same-agent control, not a reviewer loop. |
| AutoSOTA | 26.04 | L0 | G-code | O1 | P1 | A2 | DC1-CS/ML | DC1-CS/ML | LC2 | R4 | Repository-grounded optimization with executable score improvement and run records. |
| A-Lab | 26.04 | L0 | G-run | O1 | P1 | A2 | DC1-materials | DC1-materials | LC2 | R3 | Robotic synthesis and characterization provide local physical readouts. |
| AgentV-RL | 26.04 | L1 | G-task | O1 | P0 | A1 | Infra | Infra | LC0 | R4 | Forward/backward LLM verifier loop over decomposable answers; support module rather than research lifecycle. |
| Debate as Reward | 26.04 | L0 | G-task | O1 | P1 | A1 | DC1-CS/ML | DC1-CS/ML | LC1 | R4 | Ideation reward signal; narrow task verifier rather than publication gate. |
| Knows | 26.04 | L0 | none | O0 | P0 | A2 | Infra | Infra | LC0 | R4 | YAML sidecar serializes claims, evidence, relations, and actions. |
| ARA | 26.04 | L0 | G-run | O0 | P0 | A2 | Infra | Infra | LC0 | R4 | Agent-native package improves reproduction and rigor audit. |
| SciCrafter | 26.04 | L0 | G-task | O1 | P1 | A2 | DC1-embodied | DC1-embodied | LC2 | R4 | MCP traces and structured knowledge books support controlled redstone discovery gates. |
| NORA | 26.05 | L2 | G-paper-partial | O0 | P0 | A2 | DC1-spatial | DC1-spatial | LC4 | R3 | Target venue, paper writing, review/revise, submit-check, and submission packaging support LC4; final authority remains human. |
| ARIS | 26.05 | L2 | G-paper-partial | O0 | P0 | A2 | DC3 | DC1-CS/ML | LC4 | R4 | Skill suite spans idea discovery, experiment bridge, paper writing, rebuttal, venue templates, and pre-submission audits; evidence remains observational. |
| PARNESS | 26.05 | L0 | none | O1 | P0 | A2 | Infra | Infra | LC0 | R3 | Declarative DAG and cross-run knowledge support coding agents; no demonstrated P2 controller or release gate. |
| SkillFlow | 26.05 | L0 | G-task | O1 | P1 | A2 | Infra | Infra | LC0 | R4 | Skill-library controller trained on verifiable task rewards. |
| Qumus | 26.05 | L0 | G-run | O1 | P0 | A2 | DC1-materials | DC1-materials | LC2 | R3 | Sequential robotic synthesis and device evidence support local physical claims. |
| AutoResearchClaw | 26.05 | L2 | G-paper-partial | O1 | P1 | A2 | DC3 | DC2 | LC3 | R4 | Discovery-experiment-writing workflow and export checks support LC3; public evidence does not show rebuttal or camera-ready workflow. |
| HANA | 26.05 | L0 | none | O1 | P1 | A1 | DC1-networks | DC1-networks | LC0 | R2 | Hierarchical network proposal without demonstrated context-separated reviewer-loop evidence. |
| ScientistOne | 26.05 | L2 | G-paper-partial | O1 | P1 | A2 | DC3 | DC1-CS/systems | LC3 | R4 | Chain-of-Evidence checks paper integrity, with semantic support still partial. |
| AutoScientists | 26.05 | L1 | G-task | O1 | P1 | A2 | DC2 | DC2 | LC2 | R4 | Shared champions, logs, forums, and dead-end records gate benchmark progress. |

L loop topology (L0 none → L2 nested reviewer loops) · G verifier gate scope (none, then G-task / G-run / G-code / G-formal / G-paper-partial / G-paper-full by release relevance) · O orchestration (O0 human, O1 script, O2 agent-dispatcher) · P portfolio parallelism (P0 one project, P1 in-project branches, P2 multi-project) · A artifact substrate (A0 context, A1 loose files, A2 protocolized records) · DC disciplinary coverage, claimed vs. evidenced (DC1 single community, DC2 multi-domain computational, DC3 cross-disciplinary, Infra domain-agnostic) · LC lifecycle coverage (LC0 support module → LC4 publication cycle) · R Evidence Reliability Grade (R1 unsubstantiated → R5 verified). No reviewed public system reaches G-paper-full, O2, or P2.
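
Because the gate labels are ordered by release relevance, the headline gap can be checked mechanically once the matrix is in machine-readable form. The following is a minimal sketch over a hand-picked subset of rows from the matrix; the tuple layout, the `"none"` marker, and the function names are assumptions of this example rather than anything the survey ships.

```python
# Gate scopes in the order the legend lists them (lowest release relevance first).
GATE_ORDER = ["none", "G-task", "G-run", "G-code", "G-formal",
              "G-paper-partial", "G-paper-full"]

# (system, gate, orchestration, parallelism): a subset of matrix rows.
ROWS = [
    ("AI Scientist v1", "none", "O1", "P1"),
    ("AIDE", "G-task", "O1", "P1"),
    ("AlphaEvolve", "G-code", "O1", "P1"),
    ("SEVerA", "G-formal", "O1", "P0"),
    ("NORA", "G-paper-partial", "O0", "P0"),
    ("ARIS", "G-paper-partial", "O0", "P0"),
    ("AutoResearchClaw", "G-paper-partial", "O1", "P1"),
    ("ScientistOne", "G-paper-partial", "O1", "P1"),
]


def gate_rank(gate: str) -> int:
    """Position of a gate label in the release-relevance ordering."""
    return GATE_ORDER.index(gate)


def systems_at_or_above(rows, gate: str) -> list:
    """Names of systems whose gate scope is at least `gate`."""
    threshold = gate_rank(gate)
    return [name for name, g, _, _ in rows if gate_rank(g) >= threshold]


paper_gated = systems_at_or_above(ROWS, "G-paper-partial")
full_gate = systems_at_or_above(ROWS, "G-paper-full")
frontier = [name for name, _, o, p in ROWS if o == "O2" or p == "P2"]

print(paper_gated)  # ['NORA', 'ARIS', 'AutoResearchClaw', 'ScientistOne']
print(full_gate)    # []
print(frontier)     # []
```

On this subset the query reproduces the survey's reading: four systems gate selected paper-level claims, and none reaches G-paper-full, O2, or P2.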

Part Two: Benchmarks and Verification

The second half of the survey turns from the systems to the machinery that judges them, and it keeps two jobs separate. Evaluation asks which failures a system makes visible (failed execution, weak retrieval, a fragile process, or an unsupported claim), and benchmarks supply that visibility. Verification asks what can actually block a generated artifact from being accepted, submitted, or released. A benchmark can expose a failure without holding any release authority; a verifier can block a narrow failure without seeing the wider research process. Each is necessary, and neither substitutes for the other.

Evaluation and verification map of benchmark targets and verifier routes

The evaluation-and-verification map: benchmark targets (what is made visible) on one side, verifier routes (what can be blocked) on the other.

What benchmarks reveal: four targets

The survey catalogs benchmarks by the failure each one is built to expose. Every target measures one slice of the problem and hides another, and none yields a paper-level pass/fail that a gate could enforce on a generated manuscript.

| Target | What it tests, and what it cannot see |
| --- | --- |
| Reconstruction | Can an agent recover an executable or paper-level artifact from a published record? Running the artifact is measurable; validating every claim attached to it is not. |
| Optimization | Research as search under an external scorer (competition metric, validation loss, task reward). Winning the scorer is not the same as making a defensible claim. |
| Process quality | Does the workflow resemble research (rigor, retrieval, synthesis), not just a local metric? But final scoring leans on model judges, inheriting the LLM-as-judge ceiling. |
| Soundness | Do plausible claims survive evidence checks? The strongest warning signal: e.g. SPOT reports ~21% recall on author-confirmed paper errors; SoundnessBench ~26% on low-soundness proposals. |

A cross-cutting validity threat sits underneath all four: contamination, leaderboard overfitting, and disclosure gaps can inflate scores even when the rest of the protocol is sound.

What verifiers block: four routes

Every system has a verifier slot. The survey identifies what fills it across the corpus, ordered by how much external evidence each route brings to the gate, because if generator and verifier share the same context and incentives, verification collapses into style-sensitive self-review.

| Verifier route | What it can block |
| --- | --- |
| Execution & deployment | Competition metrics, unit tests, exploits, deployment endpoints. The clearest antidote to circularity, but authority binds to a specific task object, not a paper's claims. |
| Formal & statistical gates | Accept or reject when the target is a contract or calibrated test (run records, metric ranges, logic specifications). Strong where the target is explicit, silent otherwise. |
| Evidence bundles | Bind numbers, citations, runs, and artifacts to declared sources (chain-of-evidence, numeric registries). Reaches closest to the full paper, yet still misses causal, interpretive, novelty, and limitation claims. |
| LLM-as-judge | The most common route, and the structurally weakest. Reads text without external evidence; stacking more judges or denser rubrics does not lift the ceiling. |
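
The evidence-bundle route can be illustrated with a toy numeric registry: every number quoted in a manuscript sentence must match a value recorded under a declared source. This is a deliberately simplified sketch, not the mechanism of any system in the survey; the function name `unbound_numbers`, the registry keys, and the example sentence and values are all hypothetical.

```python
import re

# Matches plain integers and decimals such as "30" or "91.4".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unbound_numbers(sentence: str, registry: dict, tol: float = 1e-9) -> list:
    """Return the numbers in `sentence` that match no value in `registry`."""
    recorded = list(registry.values())
    missing = []
    for token in NUMBER.findall(sentence):
        value = float(token)
        if not any(abs(value - r) <= tol for r in recorded):
            missing.append(token)
    return missing


# Hypothetical run records keyed by a declared source path.
registry = {
    "run_a/val_accuracy": 91.4,
    "run_a/epochs": 30.0,
}

sentence = "Our model reaches 91.4 accuracy after 30 epochs, a 2.3 point gain."
print(unbound_numbers(sentence, registry))  # ['2.3']
```

The check flags the unsupported "2.3" but has nothing to say about whether the gain is causal, novel, or correctly interpreted, which is the limit the table attributes to this route.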

The same boundary repeats across six scientific domains: biology, chemistry and materials, physics and astronomy, medicine, engineering and computer science, and quantum. In every one, a local readout (an assay, a sample, a reproduced number, a code patch, a circuit) verifies a local piece without covering the surrounding paper. The missing instrument sits at the architecture level: a release gate would need claim extraction, independent counterevidence search, evidence attached to specific claim spans, a release-blocking rule, and human adjudication where artifacts cannot settle the dispute.
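
The components listed above can be sketched as a single decision function. This is one possible reading under stated assumptions, not an architecture proposed by the survey: claim extraction and counterevidence search are assumed to have run upstream and filled in each `Claim`, and the names `Claim`, `Verdict`, and `release_gate`, along with the specific blocking rule, are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Verdict(Enum):
    RELEASE = "release"
    BLOCK = "block"
    ESCALATE = "escalate"  # hand the dispute to a human adjudicator


@dataclass
class Claim:
    span: Tuple[int, int]   # character offsets of the claim in the manuscript
    text: str
    supporting: List[str] = field(default_factory=list)  # ids of attached evidence
    counter: List[str] = field(default_factory=list)     # ids of counterevidence found
    artifact_checkable: bool = True  # False for interpretive or novelty claims


def release_gate(claims: List[Claim]) -> Tuple[Verdict, List[str]]:
    """Assumed rule: block on counterevidence or on a checkable claim with
    no evidence; escalate claims that artifacts cannot settle."""
    blocking, escalated = [], []
    for claim in claims:
        if claim.counter:
            blocking.append(f"counterevidence against: {claim.text}")
        elif not claim.artifact_checkable:
            escalated.append(f"needs human adjudication: {claim.text}")
        elif not claim.supporting:
            blocking.append(f"no evidence attached to: {claim.text}")
    if blocking:
        return Verdict.BLOCK, blocking
    if escalated:
        return Verdict.ESCALATE, escalated
    return Verdict.RELEASE, []


claims = [
    Claim((0, 40), "Method improves the metric", supporting=["run_a"]),
    Claim((41, 90), "The approach is novel", artifact_checkable=False),
]
verdict, reasons = release_gate(claims)
print(verdict.value)  # escalate
```

The ordering matters: a blocking finding takes precedence over escalation, so human adjudication is reserved for manuscripts whose checkable claims have already passed.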

Citation

@article{ren2026autonomousresearch,
  title={What's Missing in Autonomous Research? A Systematization of Systems, Benchmarks, and Verification},
  author={Ren, Xingyu and Sun, Youran and Yi, Chugang and Zhang, Kejia and Guo, Jiaxuan and Du, Jianda and Yang, Haizhao},
  year={2026},
  month={June},
  url={https://www.researchgate.net/publication/406952713}
}