AI for Auto-Research: A Survey

Awesome AI Auto-Research Team
A comprehensive practitioner's guide for using AI-assisted tools across the complete academic research lifecycle.

Crossing a threshold

AI-assisted research is crossing a threshold. Fully automated systems can now generate research papers for as little as $15, while long-horizon agents execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet under scientific pressure, even frontier LLMs still fabricate experimental results, miss hidden errors, and fail to judge novelty reliably.

AI auto-research across the complete lifecycle. Four phases and eight stages: Creation (ideation, literature, code & experiments, tables & figures), Writing (paper writing), Validation (peer review, rebuttal & revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive paper agents).

Across this lifecycle we identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy. The core challenge is no longer whether AI can produce the forms of research, but whether it can preserve the substance — evidence, judgment, provenance, and accountability. We argue that human-governed collaboration is the most credible deployment paradigm, and that effective systems converge on layered architectures combining exploration, execution, and verification.

250+ Papers Surveyed
8 Research Stages
4 Lifecycle Phases
5 Central Findings


The Academic Research Lifecycle

Four Phases, Eight Stages

We organize the academic research lifecycle as eight interconnected stages grouped into four epistemological phases. Each phase serves a distinct function in producing, scrutinizing, and communicating scientific knowledge.
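
The grouping described above can be written down as a small data structure. The following is a minimal sketch in Python: the names `LIFECYCLE`, `phase_of`, and `stage_number` are illustrative assumptions and not part of the survey, while the phase and stage names are the ones used on this page.

```python
# Illustrative encoding of the four-phase, eight-stage lifecycle.
# LIFECYCLE, phase_of, and stage_number are hypothetical names, not from the survey.
LIFECYCLE = {
    "Creation": [
        "Idea Generation",
        "Literature Review",
        "Coding & Experiments",
        "Tables & Figures",
    ],
    "Writing": ["Paper Writing"],
    "Validation": ["Peer Review", "Rebuttal & Revision"],
    "Dissemination": ["Paper2X"],
}


def phase_of(stage: str) -> str:
    """Return the phase that contains the given stage name."""
    for phase, stages in LIFECYCLE.items():
        if stage in stages:
            return phase
    raise KeyError(f"unknown stage: {stage}")


def stage_number(stage: str) -> int:
    """Return the 1-based stage index (S1..S8) in lifecycle order."""
    ordered = [s for stages in LIFECYCLE.values() for s in stages]
    return ordered.index(stage) + 1
```

For example, `phase_of("Peer Review")` returns `"Validation"`, and `stage_number("Paper2X")` returns `8`. The sketch relies on Python dictionaries preserving insertion order, so the phases must be listed in lifecycle order.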

Phase 1: Creation (Stages 1–4)

The stages through which a research contribution is materially produced.

From spark to substance: ideation grounded in the literature, executed in code, and rendered as figures and tables. This is where AI's promise — and its capability boundary — is most visible.

S1: Idea Generation

Generating, refining, and evaluating research hypotheses. Systems span direct LLM prompting, retrieval-augmented and knowledge-graph generation, multi-agent collaboration, and learned quality signals. The central challenge: LLMs can produce ideas that appear novel and well-motivated, yet often struggle to generate ones that remain feasible, distinctive, and impactful after execution.

LLM-based generation · KG-driven · Trend-driven · Multi-agent ideation · Novelty assessment · Human-AI co-ideation
AI Scientist · ResearchAgent · SciMON · VirSci · SciAgents · IdeaSynth · MOOSE-Chem

S2: Literature Review

Retrieving, synthesizing, and organizing prior work into coherent research contexts. Compared with idea generation, this stage is more grounded and externally verifiable, making it one of the fastest-maturing areas in AI-assisted research. Yet faithful citation, coverage completeness, and multi-paper relational reasoning remain difficult.

Semantic retrieval · Survey generation · Deep Research · Citation graph · Hierarchical org. · Related work gen.
PaperQA2 · AutoSurvey · STORM · SurveyX · SurveyForge · OpenScholar

S3: Coding & Experiments

Translating ideas into executable code, running experiments, and analyzing empirical results. The challenge is not whether LLMs can write plausible code, but whether they can produce semantically correct research implementations, execute meaningful experiments, and interpret results reliably — performance still drops sharply on genuinely novel research code.

Paper-to-Code · Experiment design · Auto execution · Result analysis · Lab automation · Benchmarking
PaperCoder · MLAgentBench · AIDE · R&D-Agent · ChemCrow · Coscientist

S4: Tables & Figures

Constructing method diagrams, result plots, comparison tables, mathematical formulas, and algorithmic illustrations. Despite the importance of these artifacts in daily research practice, this stage remains comparatively underdeveloped: current tools serve as assistants rather than autonomous producers, and AI-generated figures frequently require human modification for domain-specific symbols and paper-specific visual languages.

Method diagrams · Result plots · Table generation · LaTeX / TikZ · SVG generation · Visual feedback
AutoFigure · MatPlotAgent · PlotGen · ChartGPT · AutomaTikZ · DeTikZify

Phase 2: Writing (Stage 5)

Organizing the outputs of Creation into a formal scholarly manuscript for communication and external scrutiny.

Writing is not a formatting step; it is a process of rhetorical and evidential organization that requires AI capabilities distinct from those used to produce code, experiments, or figures.

S5: Paper Writing

Drafting, editing, polishing, and structuring academic manuscripts. AI assistance ranges from grammar correction and citation support to section-level drafting and full-paper generation. The central failure mode is no longer ungrammatical prose but the gap between fluency and argumentative depth; end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards.

Semi-auto assist · Full-auto generation · Citation insertion · AI text detection · LaTeX generation · Style & polishing
CycleResearcher · ScholarCopilot · CoAuthor · TeXpert · FutureGen · FARS

Phase 3: Validation (Stages 6–7)

External scrutiny — how the community challenges and refines a manuscript.

Together, peer review and rebuttal form the community-facing mechanism through which scientific claims are challenged, defended, and revised — not a bureaucratic afterthought, but the verification layer of research.

S6: Peer Review

Generating structured reviews, matching reviewers to manuscripts, assessing review quality, and supporting meta-review decisions. These systems aim to assist, not replace, the community's evaluative process — standalone AI review remains unsafe, with persistent leniency bias and vulnerability to adversarial manipulation.

Review generation · Meta-review · Reviewer matching · Quality assessment · Bias detection · Review feedback
DeepReviewer · MARG · REMOR · AgentReview · OpenReviewer

S7: Rebuttal & Revision

Analyzing reviewer comments, identifying required evidence, drafting evidence-grounded responses, and supporting manuscript revision. This stage connects external critique with additional analysis, clarification, and experimental follow-up. It remains under-served relative to its practical importance, even though it is where authors directly negotiate reviewer concerns before final decisions.

Comment analysis · Response generation · Evidence retrieval · Strategy planning · Effectiveness analysis
RebuttalAgent · Paper2Rebuttal · ReviewerToo · ReviewMT · Re² · Author-in-Loop

Phase 4: Dissemination (Stage 8)

Translating the manuscript into formats accessible to broader audiences.

Posters, slides, videos, project pages, social-media summaries, and interactive agents are independent knowledge artifacts with their own fidelity and trust requirements — not appendages to the paper.

S8: Paper2X

Converting papers into posters, slides, videos, social media posts, project pages, and interactive agents. The core bottleneck is trust: researchers may use AI to draft public-facing materials but remain reluctant to delegate final communication to systems that may distort results, overstate claims, or omit important limitations.

Paper→Poster · Paper→Slides · Paper→Video · Paper→Social · Paper→Web · Paper→Agent
Paper2Poster · PPTAgent · DeepPresenter · PresentAgent · Paper2Video · Paper2Web · Paper2Agent

S8.1: Paper2Video

Generating explanatory videos from papers by synchronizing script, narration, animations, and subtitles across multiple output channels. Video is the most demanding Paper2X format — it must coordinate visual, auditory, and temporal channels simultaneously while remaining faithful to the paper's claims.

Script generation · Narration synthesis · Slide animation · Subtitle alignment · Talking-head video · Visual storytelling
Paper2Video · Preacher · AutoPresent · ScholarCast

S8.2: Paper2Slides

Automatically distilling paper content into structured presentation decks. Systems must extract key contributions, arrange them into a logical slide flow, and generate visual representations — bridging the gap between dense manuscript prose and audience-ready slides.

Content extraction · Slide structuring · Layout generation · Figure repurposing · Key message distillation · Design automation
PPTAgent · DeepPresenter · PresentAgent · Paper2Slides · SlideCrafter · D2S

S8.3: Paper2Poster

Condensing papers into visually compelling research posters, requiring simultaneous summarization and visual design. The core challenge is balancing information density, spatial layout, and aesthetic coherence within a single high-resolution canvas.

Layout planning · Content prioritization · Figure adaptation · Typography · Visual hierarchy · Template generation
Paper2Poster · PosterLLaVA · AutoPoster · PosterAgent

S8.4: Paper2Social

Crafting platform-adapted social media posts from academic papers. Each platform (X, LinkedIn, Reddit, Mastodon) has distinct tone, length, and audience expectations — requiring format-aware generation that simplifies results without distorting them.

Audience adaptation · Tone calibration · Thread generation · Platform formatting · Claim simplification · Hashtag generation
Paper2Social · TweetPaper · PaperDigest · SciComm

S8.5: Paper2Agent

Building interactive agents that allow readers to query a paper's methodology, results, and implications through conversation. These systems ground responses in specific paper content, enabling exploration beyond the static manuscript.

Paper grounding · Conversational QA · MCP integration · Citation tracing · Demo interfaces · Fidelity evaluation
Paper2Agent · PaperCopilot · DocAgent · ScholarBot


Paper Collection

Per-Stage Inventory of Surveyed Methods

A curated, expandable inventory of the methods discussed in the survey, organized by stage. Click a stage to expand its full table; use the search box to filter by method name, venue, or category.
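
Filtering the inventory by method name, venue, or category can be sketched offline as a simple substring match. This is a minimal sketch in Python, not a description of how the page's search box is implemented: the `Method` record, the `INVENTORY` sample, and the `search` function are hypothetical names, and the sample holds only a handful of rows copied from the tables in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    """One inventory row (hypothetical record type for illustration)."""
    stage: str
    name: str
    venue: str
    category: str


# A few rows copied from the per-stage tables; the full inventory is much longer.
INVENTORY = [
    Method("S1", "Chain of Ideas", "ACL'24", "LLM Internal"),
    Method("S1", "MOOSE-Chem", "ICLR'25", "External Signal"),
    Method("S2", "PaperQA2", "arXiv'24", "Retrieval"),
    Method("S3", "PaperBench", "arXiv'25", "Paper-to-Code"),
    Method("S6", "MARG", "arXiv'24", "Review Gen."),
]


def search(query: str, methods=INVENTORY):
    """Case-insensitive substring match on method name, venue, or category."""
    q = query.strip().lower()
    return [
        m for m in methods
        if q in m.name.lower() or q in m.venue.lower() or q in m.category.lower()
    ]
```

With this sample, `search("iclr")` returns only the MOOSE-Chem row, and `search("retrieval")` returns only PaperQA2. An empty query matches every row, which mirrors the usual behavior of an unfiltered list.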

S1 Idea Generation (26 methods)
# | Method | Venue | Category | Evaluation
LLM Internal Knowledge-Based Generation
1 | Chain of Ideas | ACL'24 | LLM Internal | Comparable to human quality; ~$0.50/idea minimum cost.
2 | ResearchAgent | NAACL'25 | LLM Internal | Human + model evaluation; academic graph feedback for refinement.
3 | SciMON | ACL'24 | LLM Internal | Mitigates shallow novelty via iterative refinement.
4 | Idea Gen Agent | arXiv'24 | LLM Internal | 100+ NLP researchers; LLM ideas higher novelty (p<0.05).
5 | IRIS | ACL'25 | LLM Internal | MCTS adaptive reasoning; human-in-the-loop platform.
6 | Spark | ICCC'25 | LLM Internal | Judge model trained on 600K OpenReview reviews.
24 | Rubric Rewards | arXiv'25 | LLM Internal | 70% expert preference; RL with rubric self-grading.
25 | DeepInnovator | arXiv'26 | LLM Internal | 80–94% win rates vs. frontier models; 14B parameters.
External Signal-Driven Generation
7 | MOOSE-Chem | ICLR'25 | External Signal | Rediscovers hypotheses from 51 high-impact papers.
8 | Nova | arXiv'24 | External Signal | 3.4× more novel ideas; 2.5× more top-rated.
9 | SciAgents | arXiv'24 | External Signal | Multi-agent reasoning over scientific knowledge graphs.
10 | SciPIP | arXiv'24 | External Signal | Multi-domain; paper-anchored idea generation.
11 | IdeaSynth | CHI'25 | External Signal | 20-user study; more alternatives explored vs. baseline.
12 | MOOSE-Chem2 | NeurIPS'25 | External Signal | Fine-grained, experimentally actionable hypotheses.
26 | FlowPIE | arXiv'26 | External Signal | Higher novelty, feasibility, and diversity vs. baselines.
Multi-Agent Collaborative Generation
13 | Combi. Creativity | arXiv'24 | Multi-Agent | +7–10% similarity scores; cross-domain composition.
14 | Deep Ideation | arXiv'25 | Multi-Agent | +10.67% quality; surpasses conference acceptance levels.
15 | VirSci | ACL'25 | Multi-Agent | Outperforms single-agent on novelty.
16 | Multi-Agent Dial. | SIGDIAL'25 | Multi-Agent | Optimal at 3 critique-revision rounds.
17 | Artificial Hivemind | NeurIPS'25 | Multi-Agent | 26K queries; diversity collapse across models.
Novelty and Feasibility Assessment
18 | IdeaBench | KDD'25 | Evaluation | 2,374 papers; 8 domains; novelty >0.6, feasibility <0.5.
19 | LiveIdeaBench | arXiv'24 | Evaluation | 40+ models; 1,180 keywords; 22 scientific domains.
20 | AI Idea Bench 2025 | arXiv'25 | Evaluation | 3,495 papers; alignment + general reference evaluation.
21 | HeurekaBench | ICLR'26 | Evaluation | +22% with critic module; open-ended science tasks.
22 | ResearchBench | ACL'26 | Evaluation | 12 disciplines; inspiration retrieval + ranking.
23 | HindSight | arXiv'26 | Evaluation | LLM novelty negatively correlated with impact (ρ=−0.29).
S2 Literature Review (35 methods)
# | Method | Venue | Category | Evaluation
Literature Retrieval
1 | CiteME | arXiv'24 | Retrieval | Citation fidelity benchmark.
2 | LitLLM | arXiv'24 | Retrieval | LLM + academic database integration.
3 | LitSearch | arXiv'24 | Retrieval | Retrieval precision benchmark.
4 | PaperQA2 | arXiv'24 | Retrieval | Matches/exceeds expert on 3 tasks; 70% contradiction validation.
5 | OpenResearcher | EMNLP'24 | Retrieval | RAG + graph traversal for literature exploration.
6 | PaSa | arXiv'25 | Retrieval | Agentic multi-step iterative retrieval.
Survey & Related Work Generation
7 | ChatPaper | GitHub'23 | Generation | 19K+ GitHub stars; arXiv summarization tool.
8 | PaperQA | arXiv'23 | Generation | 8K+ GitHub stars; RAG for scientific Q&A.
9 | AutoSurvey | arXiv'24 | Generation | First end-to-end LLM survey drafting system.
10 | GPT Researcher | GitHub'24 | Generation | 26K+ GitHub stars; comprehensive report generation.
11 | LLMs for Lit. Review | arXiv'24 | Generation | Hallucination analysis; models still generate errors.
12 | STORM | arXiv'24 | Generation | Multi-perspective question-asking for outlines.
13 | Agentic AutoSurvey | arXiv'25 | Generation | Multi-agent role decomposition.
14 | Citegeist | arXiv'25 | Generation | Dynamic RAG pipeline on arXiv corpus.
15 | IterSurvey | arXiv'25 | Generation | Iterative outline planning with stability checks.
16 | LiRA | arXiv'25 | Generation | Multi-agent retrieval + verification + narrative.
17 | SurveyForge | arXiv'25 | Generation | Outperforms AutoSurvey on outline quality.
18 | SurveyG | arXiv'25 | Generation | Three-layer citation graph (Foundation/Dev/Frontier).
19 | SurveyX | arXiv'25 | Generation | +0.259 content quality improvement; near expert level.
20 | InteractiveSurvey | arXiv'25 | Generation | User-customizable reference categorization + outlines.
21 | CiteLLM | arXiv'26 | Generation | Hallucination-free via trusted repository routing.
Deep Research Agents
22 | ASReview | Nature MI'21 | Deep Research | Active learning; up to 95% effort reduction.
23 | CHIME | arXiv'24 | Deep Research | Hierarchical organization of scientific studies.
24 | DeepResearch-Agent | GitHub'25 | Deep Research | Hierarchical multi-agent; planner + sub-agents.
25 | DeerFlow | GitHub'25 | Deep Research | Sub-agents with shared memory; sandboxed execution.
26 | OpenScholar | Nature'26 | Deep Research | 45M papers; +6.1% over GPT-4o, +5.5% over PaperQA2.
27 | AutoAgent | arXiv'25 | Deep Research | Universal LLM compatibility; GAIA benchmark.
28 | Tongyi DeepResearch | GitHub'25 | Deep Research | 30.5B params (3.3B activated); SOTA on Deep Research.
29 | O-Researcher | arXiv'26 | Deep Research | Multi-agent distillation + agentic RL.
30 | OpenResearcher (2026) | arXiv'26 | Deep Research | 54.8% BrowseComp-Plus; 97K+ trajectories.
Retrieval and Synthesis Quality Assessment
31 | DeepScholar-Bench | arXiv'25 | Evaluation | Coverage, coherence, factual accuracy benchmark.
32 | ReportBench | arXiv'25 | Evaluation | 100-prompt benchmark from 678 filtered survey papers.
33 | IDRBench | arXiv'26 | Evaluation | 100 tasks; interactive Deep Research evaluation.
34 | ScholarGym | arXiv'26 | Evaluation | 2,536 queries; query planning + tool invocation.
35 | SciNetBench | arXiv'26 | Evaluation | 18M papers; relation-aware retrieval <20%.
S3 Coding & Experiments (38 methods)
# | Method | Venue | Category | Evaluation
Code Generation
1 | SWE-bench | ICLR'24 | Code Gen. | 2,294 real GitHub issues; Verified split (500 problems).
2 | SWE-agent | arXiv'24 | Code Gen. | Agent–computer interface paradigm for coding.
3 | OpenHands | ICLR'25 | Code Gen. | Open platform for generalist coding agents.
4 | SWE-bench Pro | arXiv'25 | Code Gen. | 1,865 enterprise problems; best score 23%.
5 | SWE-EVO | arXiv'25 | Code Gen. | Software evolution benchmark; best score 25%.
Paper-to-Code
6 | FunSearch | Nature'24 | Paper-to-Code | New cap-set solutions; evolutionary program search.
7 | SciCode | arXiv'24 | Paper-to-Code | Research-level coding across math, physics, chemistry.
8 | PaperBench | arXiv'25 | Paper-to-Code | 20 ICML'24 papers; 8,316 gradable subtasks.
9 | PaperCoder | arXiv'25 | Paper-to-Code | 3-stage multi-agent; ML papers to code repos.
10 | ResearchCodeBench | arXiv'25 | Paper-to-Code | 212 novel ML tasks; best 37.3% (Gemini-2.5-Pro).
11 | SciReplicate-Bench | arXiv'25 | Paper-to-Code | 100 tasks from 36 NLP papers; 39% ceiling.
Experiment Execution & Orchestration
12 | BioPlanner | arXiv'23 | Execution | Biological protocol planning evaluation.
13 | CRISPR-GPT | arXiv'24 | Execution | Gene-editing experiment design assistance.
14 | DS-Agent | arXiv'24 | Execution | End-to-end data science workflow automation.
15 | MLE-Bench | arXiv'24 | Execution | 75 Kaggle competitions benchmark.
16 | MLAgentBench | arXiv'24 | Execution | 13 ML experimentation tasks benchmark.
17 | MLR-Copilot | arXiv'24 | Execution | IdeaAgent + ExperimentAgent dual-agent pipeline.
18 | AIDE | arXiv'25 | Execution | SOTA on MLE-Bench + RE-Bench; tree search in code space.
19 | AlphaEvolve | arXiv'25 | Execution | LLM-generated mutations + automated evaluators.
20 | AutoReproduce | arXiv'25 | Execution | Paper lineage algorithm for experiment reproduction.
21 | CURIE | arXiv'25 | Execution | Rigorous automated experimentation framework.
22 | MLGym | arXiv'25 | Execution | AI research agent gym benchmark.
23 | MLR-Bench | arXiv'25 | Execution | 201 tasks (NeurIPS/ICLR/ICML); 80% fabrication rate.
24 | Execution-Grounded | arXiv'26 | Execution | 69.4% vs 48.0% GRPO; parallel GPU search.
25 | Learn to Discover | arXiv'26 | Execution | Test-time training + RL; math, GPU kernel, biology.
26 | SciNav | arXiv'26 | Execution | Pairwise tree-search branch selection.
27 | FrontierScience | arXiv'26 | Execution | Expert-level tasks; Olympiad + PhD difficulty.
Code Correctness and Reproducibility Assessment
28 | DiscoveryBench | arXiv'24 | Analysis | Data-driven insight extraction benchmark.
29 | DiscoveryWorld | arXiv'24 | Analysis | 120 tasks; 8 topics; 3 difficulty levels.
30 | InfiAgent-DABench | arXiv'24 | Analysis | End-to-end data analysis workflow benchmark.
31 | ScienceAgentBench | arXiv'24 | Analysis | Rigorous data-driven scientific discovery assessment.
32 | LAB-Bench | arXiv'24 | Execution | Multi-domain biology research task benchmark.
33 | KernelBench | arXiv'25 | Execution | GPU kernel generation benchmark.
34 | TritonBench | arXiv'25 | Execution | Triton operator generation benchmark.
35 | AstaBench | arXiv'25 | Execution | 2,400+ problems; multi-domain scientific research.
36 | ResearchClawBench | arXiv'25 | Execution | Scientist-aligned workflow benchmark.
37 | EXP-Bench | ICLR'26 | Execution | 461 tasks from 51 AI papers.
38 | PostTrainBench | arXiv'26 | Execution | LLM post-training automation benchmark.
S4 Tables & Figures (26 methods)
# | Method | Venue | Category | Evaluation
Scientific Figure Generation
1 | ChartGPT | arXiv'23 | Data Viz | 6-step reasoning for chart generation.
2 | MatPlotAgent | arXiv'24 | Data Viz | +12.3 over GPT-4 base; VLM visual feedback.
3 | CoDA | arXiv'25 | Data Viz | +41.5% over baselines; multi-agent collaboration.
4 | PlotGen | arXiv'25 | Data Viz | 4–6% improvement over baselines.
5 | VIS-Shepherd | arXiv'25 | Figure Editing | Constructive critique feedback framework.
6 | DiagramAgent | CVPR'25 | Data Viz | 4 specialized agents; 8 diagram categories.
7 | StarVector | CVPR'25 | Method Diagrams | Scalable SVG generation from descriptions.
8 | VisCoder | EMNLP'25 | Data Viz | VisCode-200K dataset; 90%+ execution pass rate.
9 | AI-Generated Figures | arXiv'26 | Policy | Publisher policy survey (Nature, Science, etc.).
10 | AutoFigure-Edit | arXiv'26 | Method Diagrams | Editable text-to-SVG scientific illustrations.
11 | AutoFigure | ICLR'26 | Method Diagrams | FigureBench (3,300 pairs); publication-ready illustrations.
12 | PaperBanana | arXiv'26 | Method Diagrams | 292 test cases; outperforms baselines.
13 | SAIL | arXiv'26 | Figure Editing | Domain logic / code syntax separation.
Table Understanding & Generation
14 | ArxivDIGESTables | EMNLP'24 | Table Gen. | Literature comparison table synthesis.
15 | Chain-of-Table | ICLR'24 | Table Reasoning | Multi-step table reasoning chains.
16 | ShowTable | CVPR'26 | Table Viz | Collaborative reflection and refinement.
17 | Table2LaTeX-RL | arXiv'25 | Table Conversion | Image-to-LaTeX via reinforced multimodal LM.
Mathematical Formulas & TikZ
18 | AutomaTikZ | ICLR'24 | TikZ | DaTikZ: first large-scale dataset (120K drawings).
19 | DeTikZify | NeurIPS'24 | TikZ | 360K TikZ graphics; MCTS iterative refinement.
20 | TikZilla | arXiv'26 | TikZ | 3B/8B matches GPT-5; SFT+RL on expanded DaTikZ.
Visual Fidelity and Scientific Accuracy Assessment
21 | PlotCraft | arXiv'25 | Benchmark | 1K-task benchmark; 48 chart types.
22 | TeXpert | SDP'25 | Benchmark | 3-level difficulty; 78.8%/58.7%/17.5%.
23 | AbGen | ACL'25 | Benchmark | 1,500 ablation studies; 807 NLP papers.
24 | SciFig | arXiv'26 | Benchmark | Rubric-based evaluation; 2K+ pipeline figures.
25 | SciFlow-Bench | arXiv'26 | Benchmark | 500 framework figures; inverse-parsing evaluation.
26 | FigureBench | ICLR'26 | Benchmark | 3,300 text-figure pairs; publication-ready evaluation.
S5 Paper Writing (22 methods)
# | Method | Venue | Category | Evaluation
Semi-Automated Writing Assistance
1 | CoAuthor | arXiv'22 | Collaborative | Human–AI collaborative writing workflows.
2 | Script&Shift | CHI'25 | Source Transform | CHI Honorable Mention; preserves cognitive engagement.
3 | AI Writing Study | AIED'25 | Empirical Study | 90-student RCT; purposeful AI fosters writing.
4 | OpenDraft | GitHub'25 | Full Draft Gen. | 19 agents; 20K+ words in 10 min; verified citations.
5 | DraftMarks | arXiv'25 | Transparency | Skeuomorphic visual traces for AI process transparency.
6 | PaperDebugger | arXiv'25 | In-Editor Assist | Multi-agent Overleaf plugin (Reviewer+Enhancer+Scorer).
7 | ScholarCopilot | arXiv'25 | Citation Assist | 40.1% top-1 citation accuracy (vs 15.0% E5-Mistral).
8 | XtraGPT | arXiv'25 | Post-Writing | 1.5B–14B models; 7K papers; 140K revision pairs.
9 | LimAgents | arXiv'26 | Limitations Gen. | OpenReview comments + citation network integration.
Fully Automated Paper Generation
10 | CycleResearcher | ICLR'25 | E2E Gen. | 5.36 ICLR scale (vs 5.24 preprint, 5.69 accepted).
11 | Agent Laboratory | EMNLP'25 | E2E Gen. | $2–13/paper; 84% cost reduction; 3.5–4.0 score.
12 | FutureGen | arXiv'25 | Section Gen. | RAG-based Future Work section generation.
13 | AI Scientist | Nature'26 | E2E Gen. | $15/paper; end-to-end across 3 ML subfields.
14 | APRES | arXiv'26 | Rubric Revision | 79% expert preference; citation-predictive rubrics.
Societal Analysis
15 | AI Writing Adoption | Nature'26 | Measurement | 41.3M papers; AI expands impact but contracts focus.
16 | Nature AI Survey | Nature'26 | Survey | 57% of researchers use AI in peer review.
Writing Quality and AI Detection Assessment
17 | Mapping LLM Use | arXiv'24 | Detection | Up to 17.5% of CS papers AI-modified.
18 | CycleReviewer | ICLR'25 | AI Judge | 26.89% MAE reduction vs individual human reviewers.
19 | Stanford Agentic | Web'25 | AI Judge | ρ=0.42 vs human ρ=0.41; matches consistency.
20 | SciIG | arXiv'25 | Writing Bench | NAACL/ICLR 2025 introduction writing benchmark.
21 | Watermarking | arXiv'25 | Detection | Near-zero false-positive rate under controlled conditions.
22 | PaperWritingBench | arXiv'26 | Benchmark | 200 reverse-engineered top-tier conference papers.
S6 Peer Review (31 methods)
# | Method | Venue | Category | Evaluation
Automated Review Generation
1 | ChatReviewer | GitHub'23 | Review Gen. | ChatGPT-based strengths/weaknesses analysis.
2 | AI-Peer-Review | GitHub'24 | Review Gen. | Multi-LLM independent reviews + meta-review synthesis.
3 | MARG | arXiv'24 | Review Gen. | 3.7 good comments/paper (2.2× over baseline).
4 | Reviewer2 | arXiv'24 | Review Gen. | Two-stage prompt-based review aspect modeling.
5 | ReviewRL | EMNLP'25 | Review Gen. | RL + RAG; factually grounded reviews.
6 | DeepReviewer | arXiv'25 | Review Gen. | 88.21% win rate vs GPT-o1; 64% accept/reject acc.
7 | OpenReviewer | NAACL'25 | Review Gen. | Llama-8B fine-tuned on 79K expert reviews.
8 | REMOR | arXiv'25 | Review Gen. | Multi-objective RL review optimization.
9 | ScholarPeer | arXiv'26 | Review Gen. | Context-aware multi-agent; literature verification.
Meta-Review & Reviewer Matching
10 | AgentReview | EMNLP'24 | Meta-Review | Full review lifecycle simulation; social/authority bias.
11 | Meta-Review LLMs | NAACL'25 | Meta-Review | 40 ICLR papers; GPT-3.5/LLaMA-2/PaLM-2 compared.
12 | RATE | arXiv'26 | Matching | Expertise-based matching via profile distillation.
Adversarial Attacks & Bias Analysis
13 | Raina et al. | EMNLP'24 | Adversarial | Benign adjectives as universal adversarial triggers.
14 | AI Review Lottery | arXiv'24 | Bias Analysis | 15.8% ICLR reviews AI-assisted; +4.9pp borderline.
15 | Ye et al. | arXiv'24 | Adversarial | Scores inflated to ~8; 5% manipulation flips 12%.
16 | Breaking the Reviewer | arXiv'25 | Adversarial | Systematic adversarial robustness evaluation.
17 | LLM Reviewer Bias | arXiv'25 | Bias Analysis | 1,441 papers; 95.8% rejected misclassified as acceptable.
18 | Prompt Injection | arXiv'25 | Adversarial | White-text injection; up to 100% acceptance scores.
19 | Sahoo et al. | arXiv'25 | Adversarial | +13.95 on Mistral; 13 LLMs; 15 attack strategies.
20 | Zhou et al. | arXiv'25 | Adversarial | +1.24 to +2.80 from hype prose; 10.00 under iteration.
Detection & Policy
21 | AI Detection | arXiv'25 | Detection | 788,984 AI-written reviews; 18 detection algorithms.
22 | AI Use Rejects | Nature'26 | Policy | 497 papers rejected (~2% of submissions).
23 | Nature AI Survey | Nature'26 | Survey | 1,600 academics; 57% use AI in peer review.
24 | Policy Enforcement | arXiv'26 | Policy | All 5 SOTA detectors misclassify LLM-polished reviews.
25 | Reviewer Feedback | CHI'26 | Empirical | ICLR 2025 live process; reviewer engagement study.
Review Consistency and Bias Assessment
26 | Review Survey | IF'25 | Survey | Comprehensive taxonomy of review methods.
27 | Stanford Agentic | Web'25 | Quality | ρ=0.42 vs human ρ=0.41; matches consistency.
28 | ClaimCheck | EMNLP'25 | Quality | LLM critique grounding; gaps in factual basis.
29 | ReViewGraph | AAAI'26 | Quality | +15.73% avg improvement via heterogeneous graph.
30 | ReviewAgents | arXiv'25 | Quality | 37,403 papers; 142,324 reviews; Review-CoT dataset.
31 | ICLR 2025 Study | NMI'26 | Quality | 22,467 reviews; 89% quality improved; no acceptance effect.
S7 Rebuttal & Revision (10 methods)
# | Method | Venue | Category | Evaluation
Reviewer Comment Analysis
1 | ReviewMT | arXiv'24 | Analysis | 26,841 papers; 92,017 reviews; multi-turn dialogue.
2 | ICLR Rebuttal Study | arXiv'25 | Analysis | ICLR 2024–2025; score transition analysis.
Automated Rebuttal Generation
3 | ReviewerToo | arXiv'25 | Modular Pipeline | 81.8% accept/reject accuracy (vs 83.9% human).
4 | RebuttalAgent | ICLR'26 | Rebuttal Gen. | 18.3% avg improvement; ToM-grounded.
5 | Author-in-the-Loop | ACL'26 | Author-Aware | Integrates author expertise and intent.
6 | DRPG | arXiv'26 | Rebuttal Gen. | 98%+ planning accuracy; surpasses avg human quality.
7 | Paper2Rebuttal | arXiv'26 | Rebuttal Gen. | Evidence-centric rebuttal planning.
Rebuttal Effectiveness Assessment
8 | Re² | arXiv'25 | Dataset | 19,926 submissions; 70,668 reviews; 53,818 rebuttals.
9 | Commitment Checklist | arXiv'26 | Benchmark | 11.8 commitments/paper; ~25% unfulfilled.
10 | Re³Align | ACL'26 | Dataset | First large-scale aligned review–response–revision triplets.
S8 Paper2X (Dissemination) (24 methods)
# | Method | Venue | Category | Evaluation
Paper2Poster
1 | P2P | ICLR'26 | Paper2Poster | P2PInstruct 30K+ examples; 3 specialized agents.
2 | Paper2Poster | NeurIPS'25 | Paper2Poster | $0.005/poster; 87% fewer tokens vs GPT-4o.
3 | PosterForest | arXiv'25 | Paper2Poster | Hierarchical multi-agent collaboration.
4 | PosterGen | arXiv'25 | Paper2Poster | Aesthetic-aware multi-agent generation.
5 | APEX | arXiv'26 | Paper2Poster | First agentic interactive poster editing.
6 | PosterOmni | arXiv'26 | Paper2Poster | 6 unified poster tasks; outperforms open-source.
Paper2Slides
7 | DOC2PPT | AAAI'22 | Paper2Slides | 5,873 paired document–slide decks.
8 | PPTAgent | EMNLP'25 | Paper2Slides | PPTEval benchmark; 10,448 curated presentations.
9 | AutoPresent | CVPR'25 | Paper2Slides | 8B Llama model; SlidesBench (7K train, 585 test).
10 | Paper2Slides | GitHub'25 | Paper2Slides | 4-stage RAG pipeline; one-click conversion.
11 | Auto-Slides | arXiv'25 | Paper2Slides | Multi-agent Beamer generation; interactive editing.
12 | PASS | arXiv'25 | Paper2Slides | First combined slides + AI audio delivery.
13 | SlideGen | arXiv'25 | Paper2Slides | Multi-agent VLM coordination; editable PPTX output.
14 | Talk to Your Slides | arXiv'25 | Paper2Slides | +34% instruction fidelity; 87% lower cost vs GUI.
15 | SlideTailor | AAAI'26 | Paper2Slides | User-preference conditioned; chain-of-speech.
16 | DeepPresenter | arXiv'26 | Paper2Slides | 9B model competitive with frontier at lower cost.
17 | Office Raccoon | Web'26 | Paper2Slides | Page-level editing; template/brand-guideline learning.
Paper2Video
18 | Preacher | ICCV'25 | Paper2Video | Top-down decomposition; 5 research fields.
19 | Paper2Video | arXiv'25 | Paper2Video | 101 paper–video pairs; +10% PresentQuiz accuracy.
20 | PresentAgent | EMNLP'25 | Paper2Video | PresentEval benchmark; approaches human-level.
Paper2Web & Social Media
21 | Paper2Web | arXiv'25 | Paper2Web | 10,716 papers; multimedia-rich academic homepages.
Fidelity and Adoption Assessment
22 | PPTEval | EMNLP'25 | Benchmark | Content, design, coherence; 10,448 presentations.
23 | PresentQuiz | arXiv'25 | Benchmark | 101 paper–video pairs; +10% over human on comprehension.
24 | PresentEval | EMNLP'25 | Benchmark | End-to-end narrated video quality; near human-level.

Call for Contributions

This collection is a living resource. If you have a new method, benchmark, or tool that fits one of the eight stages, submit a pull request with the paper title, venue, link, and a brief evaluation note.




Exhibition

AI-Generated Dissemination Artifacts
[Figure: AI-generated poster, full view]

Citation

BibTeX

If you find this survey or the reading list useful, please consider citing our paper:

@article{survey-ai-auto-research,
  title   = {{AI} for {Auto-Research}: Roadmap \& User Guide},
  author  = {Kong, Lingdong and Sun, Xian and Chow, Wei
             and Li, Linfeng and Lin, Kevin Qinghong and Zhang, Xuan Billy
             and Wang, Song and Li, Rong and Wu, Qing
             and Gao, Wei and Wang, Yingshuo and Xie, Shaoyuan
             and Liu, Jiachen and Qu, Leigang and Li, Shijie
             and Ng, Lai Xing and Cottereau, Benoit R.
             and Liu, Ziwei and Chua, Tat-Seng and Ooi, Wei Tsang},
  journal = {arXiv preprint arXiv:2605.18661},
  year    = {2026}
}


Responsible Use

Scope & Caveats

This work is intended to inform responsible use of AI-assisted research tools, not to endorse replacing human scientific judgment with full automation. Current systems are most reliable when used to assist retrieval, drafting, coding, visualization, review support, and dissemination, while humans retain responsibility for novelty, interpretation, verification, authorship, and accountability.

Because the field evolves rapidly, this paper should be read as a structured snapshot through our search cutoff, and AI-generated research outputs should be independently verified before scholarly use.


Contributors

Lingdong Kong*, National University of Singapore
Xian Sun*, Duke University
Wei Chow*, National University of Singapore
Linfeng Li, National University of Singapore
Kevin Qinghong Lin, University of Oxford
Xuan Billy Zhang, National University of Singapore
Song Wang, Zhejiang University
Rong Li, HKUST(GZ)
Qing Wu, Nanyang Technological University, Singapore
Wei Gao, Northeastern University
Yingshuo Wang, University of California, Berkeley
Shaoyuan Xie, University of California, Irvine
Jiachen Liu, Orchestra Research
Leigang Qu, National University of Singapore
Shijie Li, A*STAR, Singapore
Lai Xing Ng, A*STAR, Singapore
Benoit R. Cottereau, CNRS & IPAL
Ziwei Liu, Nanyang Technological University, Singapore
Tat-Seng Chua, National University of Singapore
Wei Tsang Ooi, National University of Singapore

* Equal Contributions