AI for Auto-Research: A Survey

Crossing a threshold

AI-assisted research is crossing a threshold. Fully automated systems can now generate research papers for as little as $15, and long-horizon agents execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet under scientific pressure, even frontier LLMs still fabricate experimental results, miss hidden errors, and fail to judge novelty reliably.

**AI auto-research across the complete lifecycle.** Four phases and eight stages: **Creation** (ideation, literature, code & experiments, tables & figures), **Writing** (paper writing), **Validation** (peer review, rebuttal & revision), and **Dissemination** (posters, slides, videos, social media, project pages, and interactive paper agents).

Across this lifecycle we identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy. The core challenge is no longer whether AI can produce the forms of research, but whether it can preserve the substance — evidence, judgment, provenance, and accountability. We argue that human-governed collaboration is the most credible deployment paradigm, and that effective systems converge on layered architectures combining exploration, execution, and verification.
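
One way to picture the layered architecture described above is as three pluggable layers followed by a human approval gate. The sketch below is illustrative only: `run_layered` and its four callables are hypothetical names, not an interface taken from any surveyed system.

```python
from typing import Callable, Iterable


def run_layered(
    explore: Callable[[], Iterable[str]],
    execute: Callable[[str], dict],
    verify: Callable[[dict], bool],
    human_approves: Callable[[str, dict], bool],
) -> list[tuple[str, dict]]:
    """Return only candidates that pass automated checks and human sign-off.

    All four callables are hypothetical stand-ins for real components.
    """
    accepted = []
    for candidate in explore():                # exploration layer: propose candidates
        result = execute(candidate)            # execution layer: run the experiment
        if not verify(result):                 # verification layer: automated checks
            continue
        if human_approves(candidate, result):  # human governance gate
            accepted.append((candidate, result))
    return accepted


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    kept = run_layered(
        explore=lambda: ["idea-a", "idea-b", "idea-c"],
        execute=lambda idea: {"has_evidence": idea != "idea-b"},
        verify=lambda result: result["has_evidence"],
        human_approves=lambda idea, result: idea != "idea-c",
    )
    print(kept)
```

In the toy run, `idea-b` fails verification and `idea-c` is rejected by the human gate, so only `idea-a` is kept. The point of the design is that nothing reaches the output without passing both checks.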

250+ Papers Surveyed · 8 Research Stages · 4 Lifecycle Phases

We organize the academic research lifecycle as eight interconnected stages grouped into four epistemological phases. Each phase serves a distinct function in producing, scrutinizing, and communicating scientific knowledge.
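
The taxonomy above can be written down as a small data structure, which may be convenient when tagging papers or tools by stage. This is a minimal sketch: the `Phase` class, the `LIFECYCLE` constant, and the `stage_index` helper are illustrative names, and Stage 8 is labeled Paper2X to match the stage descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    stages: tuple[str, ...]


# Phase and stage names follow the survey's lifecycle; the structure is illustrative.
LIFECYCLE = (
    Phase("Creation", ("Idea Generation", "Literature Review",
                       "Coding & Experiments", "Tables & Figures")),
    Phase("Writing", ("Paper Writing",)),
    Phase("Validation", ("Peer Review", "Rebuttal & Revision")),
    Phase("Dissemination", ("Paper2X",)),
)


def stage_index() -> dict[int, tuple[str, str]]:
    """Map each 1-based stage number to (phase name, stage name)."""
    index = {}
    number = 1
    for phase in LIFECYCLE:
        for stage in phase.stages:
            index[number] = (phase.name, stage)
            number += 1
    return index


if __name__ == "__main__":
    for number, (phase, stage) in stage_index().items():
        print(f"Stage {number}: {stage} ({phase})")
```

Running the module lists the stages in order, from `Stage 1: Idea Generation (Creation)` to `Stage 8: Paper2X (Dissemination)`.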

Phase 1 · Stages 1–4

Creation

The stages through which a research contribution is materially produced.

From spark to substance: ideation grounded in the literature, executed in code, and rendered as figures and tables. This is where AI's promise — and its capability boundary — is most visible.

Stage 1

Idea Generation

Generating, refining, and evaluating research hypotheses. Systems span direct LLM prompting, retrieval-augmented and knowledge-graph generation, multi-agent collaboration, and learned quality signals. The central challenge: LLMs can produce ideas that appear novel and well-motivated, yet often struggle to generate ones that remain feasible, distinctive, and impactful after execution.

Sub-topics

LLM-based generation · KG-driven · Trend-driven · Multi-agent ideation · Novelty assessment · Human-AI co-ideation

Representative methods

AI Scientist · ResearchAgent · SciMON · VirSci · SciAgents · IdeaSynth · MOOSE-Chem

Stage 2

Literature Review

Retrieving, synthesizing, and organizing prior work into coherent research contexts. Compared with idea generation, this stage is more grounded and externally verifiable, making it one of the fastest-maturing areas in AI-assisted research. Yet faithful citation, coverage completeness, and multi-paper relational reasoning remain difficult.

Sub-topics

Semantic retrieval · Survey generation · Deep Research · Citation graph · Hierarchical org. · Related work gen.

Representative methods

PaperQA2 · AutoSurvey · STORM · SurveyX · SurveyForge · OpenScholar

Stage 3

Coding & Experiments

Translating ideas into executable code, running experiments, and analyzing empirical results. The challenge is not whether LLMs can write plausible code, but whether they can produce semantically correct research implementations, execute meaningful experiments, and interpret results reliably — performance still drops sharply on genuinely novel research code.

Sub-topics

Paper-to-Code · Experiment design · Auto execution · Result analysis · Lab automation · Benchmarking

Representative methods

PaperCoder · MLAgentBench · AIDE · R&D-Agent · ChemCrow · Coscientist

Stage 4

Tables & Figures

Constructing method diagrams, result plots, comparison tables, mathematical formulas, and algorithmic illustrations. Despite the importance of these artifacts in daily research practice, this stage remains comparatively underdeveloped: current tools serve as assistants rather than autonomous producers, and AI-generated figures frequently require human modification for domain-specific symbols and paper-specific visual languages.

Sub-topics

Method diagrams · Result plots · Table generation · LaTeX / TikZ · SVG generation · Visual feedback

Representative methods

AutoFigure · MatPlotAgent · PlotGen · ChartGPT · AutomaTikZ · DeTikZify

Phase 2 · Stage 5

Writing

Organizing the outputs of Creation into a formal scholarly manuscript for communication and external scrutiny.

Writing is not a formatting step. It is a process of rhetorical and evidential organization that requires AI capabilities distinct from those used to produce code, experiments, or figures.

Stage 5

Paper Writing

Drafting, editing, polishing, and structuring academic manuscripts. AI assistance ranges from grammar correction and citation support to section-level drafting and full-paper generation. The central failure mode is no longer ungrammatical prose but the gap between fluency and argumentative depth; end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards.

Sub-topics

Semi-auto assist · Full-auto generation · Citation insertion · AI text detection · LaTeX generation · Style & polishing

Representative methods

CycleResearcher · ScholarCopilot · CoAuthor · TeXpert · FutureGen · FARS

Phase 3 · Stages 6–7

Validation

External scrutiny — how the community challenges and refines a manuscript.

Together, peer review and rebuttal form the community-facing mechanism through which scientific claims are challenged, defended, and revised — not a bureaucratic afterthought, but the verification layer of research.

Stage 6

Peer Review

Generating structured reviews, matching reviewers to manuscripts, assessing review quality, and supporting meta-review decisions. These systems aim to assist, not replace, the community's evaluative process — standalone AI review remains unsafe, with persistent leniency bias and vulnerability to adversarial manipulation.

Sub-topics

Review generation · Meta-review · Reviewer matching · Quality assessment · Bias detection · Review feedback

Representative methods

DeepReviewer · MARG · REMOR · AgentReview · OpenReviewer

Stage 7

Rebuttal & Revision

Analyzing reviewer comments, identifying required evidence, drafting evidence-grounded responses, and supporting manuscript revision. This stage connects external critique with additional analysis, clarification, and experimental follow-up. It remains under-served relative to its practical importance, even though it is where authors directly negotiate reviewer concerns before final decisions.

Sub-topics

Comment analysis · Response generation · Evidence retrieval · Strategy planning · Effectiveness analysis

Representative methods

RebuttalAgent · Paper2Rebuttal · ReviewerToo · ReviewMT · Re² · Author-in-Loop

Phase 4 · Stage 8

Dissemination

Translating the manuscript into formats accessible to broader audiences.

Posters, slides, videos, project pages, social-media summaries, and interactive agents are independent knowledge artifacts with their own fidelity and trust requirements — not appendages to the paper.

Stage 8

Paper2X

Converting papers into posters, slides, videos, social media posts, project pages, and interactive agents. The core bottleneck is trust: researchers may use AI to draft public-facing materials, but remain reluctant to delegate final communication to systems that may distort results, overstate claims, or omit important limitations.

Sub-topics

Paper→Poster · Paper→Slides · Paper→Video · Paper→Social · Paper→Web · Paper→Agent

Representative methods

Paper2Poster · PPTAgent · DeepPresenter · PresentAgent · Paper2Video · Paper2Web · Paper2Agent

Stage 8.1

Paper2Video

Generating explanatory videos from papers by synchronizing script, narration, animations, and subtitles across multiple output channels. Video is the most demanding Paper2X format — it must coordinate visual, auditory, and temporal channels simultaneously while remaining faithful to the paper's claims.

Sub-topics

Script generation · Narration synthesis · Slide animation · Subtitle alignment · Talking-head video · Visual storytelling

Representative methods

Paper2Video · Preacher · AutoPresent · ScholarCast

Stage 8.2

Paper2Slides

Automatically distilling paper content into structured presentation decks. Systems must extract key contributions, arrange them into a logical slide flow, and generate visual representations — bridging the gap between dense manuscript prose and audience-ready slides.

Sub-topics

Content extraction · Slide structuring · Layout generation · Figure repurposing · Key message distillation · Design automation

Representative methods

PPTAgent · DeepPresenter · PresentAgent · Paper2Slides · SlideCrafter · D2S

Stage 8.3

Paper2Poster

Condensing papers into visually compelling research posters, requiring simultaneous summarization and visual design. The core challenge is balancing information density, spatial layout, and aesthetic coherence within a single high-resolution canvas.

Sub-topics

Layout planning · Content prioritization · Figure adaptation · Typography · Visual hierarchy · Template generation

Representative methods

Paper2Poster · PosterLLaVA · AutoPoster · PosterAgent

Stage 8.4

Paper2Social

Crafting platform-adapted social media posts from academic papers. Each platform (X, LinkedIn, Reddit, Mastodon) has distinct tone, length, and audience expectations — requiring format-aware generation that simplifies results without distorting them.

Sub-topics

Audience adaptation · Tone calibration · Thread generation · Platform formatting · Claim simplification · Hashtag generation

Representative methods

Paper2Social · TweetPaper · PaperDigest · SciComm
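
Thread generation and platform formatting, two of the sub-topics above, reduce in their simplest form to packing text under a per-post length limit. The sketch below is a minimal illustration under stated assumptions: `split_into_thread` is a hypothetical helper, and the default limit of 280 characters is an assumption about X's standard post length, not a figure from the survey. It handles formatting only and does nothing to check that the simplified text stays faithful to the paper.

```python
def split_into_thread(text: str, limit: int = 280) -> list[str]:
    """Greedily pack whole words into posts of at most `limit` characters.

    Each post ends with an " (i/n)" counter; room for a two-digit counter
    is reserved up front. The 280-character default is an assumption.
    """
    budget = limit - len(" (99/99)")  # space left for the text itself
    posts, current = [], ""
    for word in text.split():
        if len(word) > budget:
            raise ValueError("a single word exceeds the per-post budget")
        candidate = f"{current} {word}".strip()
        if len(candidate) <= budget:
            current = candidate
        else:
            posts.append(current)  # close the current post, start a new one
            current = word
    if current:
        posts.append(current)
    if len(posts) > 99:
        raise ValueError("thread too long for a two-digit counter")
    total = len(posts)
    return [f"{post} ({i}/{total})" for i, post in enumerate(posts, start=1)]


if __name__ == "__main__":
    summary = "alpha beta gamma delta epsilon zeta eta theta"
    for post in split_into_thread(summary, limit=20):
        print(post)
```

Splitting on whitespace keeps words intact, which matters more for readability than filling every post to the limit.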

Stage 8.5

Paper2Agent

Building interactive agents that allow readers to query a paper's methodology, results, and implications through conversation. These systems ground responses in specific paper content, enabling exploration beyond the static manuscript.

Sub-topics

Paper grounding · Conversational QA · MCP integration · Citation tracing · Demo interfaces · Fidelity evaluation

Representative methods

Paper2Agent · PaperCopilot · DocAgent · ScholarBot
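
Grounding a response in specific paper content, as described for Paper2Agent above, can be illustrated with a very crude retrieval step: pick the passage that best overlaps the question, return it with its section label so the answer can be traced, and refuse when nothing supports it. This is a toy sketch, not how any listed system works. The `answer` function, the `min_overlap` threshold, and the example passages are all hypothetical, and real systems would use far stronger retrieval than word overlap.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer(question: str, passages: dict[str, str], min_overlap: int = 2):
    """Return (section, passage) with the largest word overlap, or None.

    `passages` maps a section label to its text. Returning None models a
    refusal when no passage supports the question well enough.
    """
    question_tokens = tokenize(question)
    best_section, best_score = None, 0
    for section, text in passages.items():
        score = len(question_tokens & tokenize(text))
        if score > best_score:
            best_section, best_score = section, score
    if best_score < min_overlap:
        return None  # refuse rather than answer without grounding
    return best_section, passages[best_section]


if __name__ == "__main__":
    # Made-up passages standing in for parsed paper sections.
    toy_paper = {
        "Method": "The model is trained with contrastive loss on paired abstracts.",
        "Results": "Accuracy improves by four points on the held-out test set.",
    }
    print(answer("How is the model trained?", toy_paper))
    print(answer("What dataset license applies?", toy_paper))
```

The first query is matched to the Method passage, and the second returns `None` because no passage shares enough words with it. Returning the section label alongside the text is what makes citation tracing possible.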

Paper Collection · Stage 1 · Idea Generation
#	Method	Venue	Category	Evaluation
LLM Internal Knowledge-Based Generation
1	Chain of Ideas	ACL'24	LLM Internal	Comparable to human quality; ~$0.50/idea minimum cost.
2	ResearchAgent	NAACL'25	LLM Internal	Human + model evaluation; academic graph feedback for refinement.
3	SciMON	ACL'24	LLM Internal	Mitigates shallow novelty via iterative refinement.
4	Idea Gen Agent	arXiv'24	LLM Internal	100+ NLP researchers; LLM ideas higher novelty (p<0.05).
5	IRIS	ACL'25	LLM Internal	MCTS adaptive reasoning; human-in-the-loop platform.
6	Spark	ICCC'25	LLM Internal	Judge model trained on 600K OpenReview reviews.
7	Rubric Rewards	arXiv'25	LLM Internal	70% expert preference; RL with rubric self-grading.
8	DeepInnovator	arXiv'26	LLM Internal	80–94% win rates vs. frontier models; 14B parameters.
External Signal-Driven Generation
9	MOOSE-Chem	ICLR'25	External Signal	Rediscovers hypotheses from 51 high-impact papers.
10	Nova	arXiv'24	External Signal	3.4× more novel ideas; 2.5× more top-rated.
11	SciAgents	arXiv'24	External Signal	Multi-agent reasoning over scientific knowledge graphs.
12	SciPIP	arXiv'24	External Signal	Multi-domain; paper-anchored idea generation.
13	IdeaSynth	CHI'25	External Signal	20-user study; more alternatives explored vs. baseline.
14	MOOSE-Chem2	NeurIPS'25	External Signal	Fine-grained, experimentally actionable hypotheses.
15	FlowPIE	arXiv'26	External Signal	Higher novelty, feasibility, and diversity vs. baselines.
Multi-Agent Collaborative Generation
16	Combi. Creativity	arXiv'24	Multi-Agent	+7–10% similarity scores; cross-domain composition.
17	Deep Ideation	arXiv'25	Multi-Agent	+10.67% quality; surpasses conference acceptance levels.
18	VirSci	ACL'25	Multi-Agent	Outperforms single-agent on novelty.
19	Multi-Agent Dial.	SIGDIAL'25	Multi-Agent	Optimal at 3 critique-revision rounds.
20	Artificial Hivemind	NeurIPS'25	Multi-Agent	26K queries; diversity collapse across models.
Novelty and Feasibility Assessment
21	IdeaBench	KDD'25	Evaluation	2,374 papers; 8 domains; novelty >0.6, feasibility <0.5.
22	LiveIdeaBench	arXiv'24	Evaluation	40+ models; 1,180 keywords; 22 scientific domains.
23	AI Idea Bench 2025	arXiv'25	Evaluation	3,495 papers; alignment + general reference evaluation.
24	HeurekaBench	ICLR'26	Evaluation	+22% with critic module; open-ended science tasks.
25	ResearchBench	ACL'26	Evaluation	12 disciplines; inspiration retrieval + ranking.
26	HindSight	arXiv'26	Evaluation	LLM novelty negatively correlated with impact (ρ=−0.29).
27	Diverse Hypo. Search	arXiv'26	LLM Internal	Diversity-aware population search; molecular/equation/algorithm discovery.
28	SoundnessBench	arXiv'26	Evaluation	Tests if AI scientists distinguish sound vs. flawed ideas; optimism bias.
29	LLM-Judge Novelty	arXiv'26	Evaluation	Limits of LLM-as-Judge for novelty assessment.
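
The HindSight row above reports a negative correlation (ρ=−0.29) between LLM-judged novelty and later impact. Assuming ρ denotes Spearman's rank correlation (the table does not say which coefficient is used), the sketch below shows how such a value is computed for untied data. The scores in the example are made up purely for illustration and are not taken from any benchmark.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical scores for five ideas: judged novelty vs. later impact.
    novelty = [0.9, 0.7, 0.8, 0.4, 0.6]
    impact = [10, 40, 5, 30, 20]
    print(round(spearman_rho(novelty, impact), 2))
```

On this toy data the coefficient is −0.6: ideas ranked as more novel tend to rank lower on impact. A value near zero would mean the two rankings are unrelated.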

Paper Collection · Stage 2 · Literature Review
#	Method	Venue	Category	Evaluation
Literature Retrieval
1	CiteME	arXiv'24	Retrieval	Citation fidelity benchmark.
2	LitLLM	arXiv'24	Retrieval	LLM + academic database integration.
3	LitSearch	arXiv'24	Retrieval	Retrieval precision benchmark.
4	PaperQA2	arXiv'24	Retrieval	Matches/exceeds expert on 3 tasks; 70% contradiction validation.
5	OpenResearcher	EMNLP'24	Retrieval	RAG + graph traversal for literature exploration.
6	PaSa	arXiv'25	Retrieval	Agentic multi-step iterative retrieval.
Survey & Related Work Generation
7	ChatPaper	GitHub'23	Generation	19K+ GitHub stars; arXiv summarization tool.
8	PaperQA	arXiv'23	Generation	8K+ GitHub stars; RAG for scientific Q&A.
9	AutoSurvey	arXiv'24	Generation	First end-to-end LLM survey drafting system.
10	GPT Researcher	GitHub'24	Generation	26K+ GitHub stars; comprehensive report generation.
11	LLMs for Lit. Review	arXiv'24	Generation	Hallucination analysis; models still generate errors.
12	STORM	arXiv'24	Generation	Multi-perspective question-asking for outlines.
13	Agentic AutoSurvey	arXiv'25	Generation	Multi-agent role decomposition.
14	Citegeist	arXiv'25	Generation	Dynamic RAG pipeline on arXiv corpus.
15	IterSurvey	arXiv'25	Generation	Iterative outline planning with stability checks.
16	LiRA	arXiv'25	Generation	Multi-agent retrieval + verification + narrative.
17	SurveyForge	arXiv'25	Generation	Outperforms AutoSurvey on outline quality.
18	SurveyG	arXiv'25	Generation	Three-layer citation graph (Foundation/Dev/Frontier).
19	SurveyX	arXiv'25	Generation	+0.259 content quality improvement; near expert level.
20	InteractiveSurvey	arXiv'25	Generation	User-customizable reference categorization + outlines.
21	CiteLLM	arXiv'26	Generation	Hallucination-free via trusted repository routing.
Deep Research Agents
22	ASReview	Nature MI'21	Deep Research	Active learning; up to 95% effort reduction.
23	CHIME	arXiv'24	Deep Research	Hierarchical organization of scientific studies.
24	DeepResearch-Agent	GitHub'25	Deep Research	Hierarchical multi-agent; planner + sub-agents.
25	DeerFlow	GitHub'25	Deep Research	Sub-agents with shared memory; sandboxed execution.
26	OpenScholar	Nature'26	Deep Research	45M papers; +6.1% over GPT-4o, +5.5% over PaperQA2.
27	AutoAgent	arXiv'25	Deep Research	Universal LLM compatibility; GAIA benchmark.
28	Tongyi DeepResearch	GitHub'25	Deep Research	30.5B params (3.3B activated); SOTA on Deep Research.
29	O-Researcher	arXiv'26	Deep Research	Multi-agent distillation + agentic RL.
30	OpenResearcher (2026)	arXiv'26	Deep Research	54.8% BrowseComp-Plus; 97K+ trajectories.
Retrieval and Synthesis Quality Assessment
31	DeepScholar-Bench	arXiv'25	Evaluation	Coverage, coherence, factual accuracy benchmark.
32	ReportBench	arXiv'25	Evaluation	100-prompt benchmark from 678 filtered survey papers.
33	IDRBench	arXiv'26	Evaluation	100 tasks; interactive Deep Research evaluation.
34	ScholarGym	arXiv'26	Evaluation	2,536 queries; query planning + tool invocation.
35	SciNetBench	arXiv'26	Evaluation	18M papers; relation-aware retrieval <20%.
36	Self-Evolving Retrieval	arXiv'26	Retrieval	Agent that adapts its own literature-search policy over time.
37	MasterSet	arXiv'26	Retrieval	Large-scale must-cite citation recommendation benchmark.
38	DeepSurvey	arXiv'26	Generation	Evidence-constrained citation assignment; analytical depth.
39	AutoResearchBench	arXiv'26	Evaluation	Benchmark for complex multi-hop scientific literature discovery.
40	PaperMind	arXiv'26	Evaluation	Multimodal reasoning + critique over papers; rebuttal concerns.

Paper Collection · Stage 3 · Coding & Experiments
#	Method	Venue	Category	Evaluation
Code Generation
1	SWE-bench	ICLR'24	Code Gen.	2,294 real GitHub issues; Verified split (500 problems).
2	SWE-agent	arXiv'24	Code Gen.	Agent–computer interface paradigm for coding.
3	OpenHands	ICLR'25	Code Gen.	Open platform for generalist coding agents.
4	SWE-bench Pro	arXiv'25	Code Gen.	1,865 enterprise problems; best score 23%.
5	SWE-EVO	arXiv'25	Code Gen.	Software evolution benchmark; best score 25%.
Paper-to-Code
6	FunSearch	Nature'24	Paper-to-Code	New cap-set solutions; evolutionary program search.
7	SciCode	arXiv'24	Paper-to-Code	Research-level coding across math, physics, chemistry.
8	PaperBench	arXiv'25	Paper-to-Code	20 ICML'24 papers; 8,316 gradable subtasks.
9	PaperCoder	arXiv'25	Paper-to-Code	3-stage multi-agent; ML papers to code repos.
10	ResearchCodeBench	arXiv'25	Paper-to-Code	212 novel ML tasks; best 37.3% (Gemini-2.5-Pro).
11	SciReplicate-Bench	arXiv'25	Paper-to-Code	100 tasks from 36 NLP papers; 39% ceiling.
Experiment Execution & Orchestration
12	BioPlanner	arXiv'23	Execution	Biological protocol planning evaluation.
13	CRISPR-GPT	arXiv'24	Execution	Gene-editing experiment design assistance.
14	DS-Agent	arXiv'24	Execution	End-to-end data science workflow automation.
15	MLE-Bench	arXiv'24	Execution	75 Kaggle competitions benchmark.
16	MLAgentBench	arXiv'24	Execution	13 ML experimentation tasks benchmark.
17	MLR-Copilot	arXiv'24	Execution	IdeaAgent + ExperimentAgent dual-agent pipeline.
18	AIDE	arXiv'25	Execution	SOTA on MLE-Bench + RE-Bench; tree search in code space.
19	AlphaEvolve	arXiv'25	Execution	LLM-generated mutations + automated evaluators.
20	AutoReproduce	arXiv'25	Execution	Paper lineage algorithm for experiment reproduction.
21	CURIE	arXiv'25	Execution	Rigorous automated experimentation framework.
22	MLGym	arXiv'25	Execution	AI research agent gym benchmark.
23	MLR-Bench	arXiv'25	Execution	201 tasks (NeurIPS/ICLR/ICML); 80% fabrication rate.
24	Execution-Grounded	arXiv'26	Execution	69.4% vs 48.0% GRPO; parallel GPU search.
25	Learn to Discover	arXiv'26	Execution	Test-time training + RL; math, GPU kernel, biology.
26	SciNav	arXiv'26	Execution	Pairwise tree-search branch selection.
27	FrontierScience	arXiv'26	Execution	Expert-level tasks; Olympiad + PhD difficulty.
28	AutoTTS	arXiv'26	Execution	Coding-agent discovery of test-time scaling strategies.
Code Correctness and Reproducibility Assessment
29	DiscoveryBench	arXiv'24	Analysis	Data-driven insight extraction benchmark.
30	DiscoveryWorld	arXiv'24	Analysis	120 tasks; 8 topics; 3 difficulty levels.
31	InfiAgent-DABench	arXiv'24	Analysis	End-to-end data analysis workflow benchmark.
32	ScienceAgentBench	arXiv'24	Analysis	Rigorous data-driven scientific discovery assessment.
33	LAB-Bench	arXiv'24	Execution	Multi-domain biology research task benchmark.
34	KernelBench	arXiv'25	Execution	GPU kernel generation benchmark.
35	TritonBench	arXiv'25	Execution	Triton operator generation benchmark.
36	AstaBench	arXiv'25	Execution	2,400+ problems; multi-domain scientific research.
37	ResearchClawBench	arXiv'25	Execution	Scientist-aligned workflow benchmark.
38	EXP-Bench	ICLR'26	Execution	461 tasks from 51 AI papers.
39	PostTrainBench	arXiv'26	Execution	LLM post-training automation benchmark.
40	EvoDS	arXiv'26	Execution	Self-evolving data-science agent; skill learning + context mgmt.
41	MLReplicate	arXiv'26	Analysis	End-to-end ML reproducibility benchmark for autonomous systems.
42	BeyondSWE	arXiv'26	Execution	500 instances / 246 repos; cross-repo, domain-fix, migration, doc2repo.

Paper Collection · Stage 4 · Tables & Figures
#	Method	Venue	Category	Evaluation
Scientific Figure Generation
1	ChartGPT	arXiv'23	Data Viz	6-step reasoning for chart generation.
2	MatPlotAgent	arXiv'24	Data Viz	+12.3 over GPT-4 base; VLM visual feedback.
3	CoDA	arXiv'25	Data Viz	+41.5% over baselines; multi-agent collaboration.
4	PlotGen	arXiv'25	Data Viz	4–6% improvement over baselines.
5	VIS-Shepherd	arXiv'25	Figure Editing	Constructive critique feedback framework.
6	DiagramAgent	CVPR'25	Data Viz	4 specialized agents; 8 diagram categories.
7	StarVector	CVPR'25	Method Diagrams	Scalable SVG generation from descriptions.
8	VisCoder	EMNLP'25	Data Viz	VisCode-200K dataset; 90%+ execution pass rate.
9	AI-Generated Figures	arXiv'26	Policy	Publisher policy survey (Nature, Science, etc.).
10	AutoFigure-Edit	arXiv'26	Method Diagrams	Editable text-to-SVG scientific illustrations.
11	AutoFigure	ICLR'26	Method Diagrams	FigureBench (3,300 pairs); publication-ready illustrations.
12	PaperBanana	arXiv'26	Method Diagrams	292 test cases; outperforms baselines.
13	SAIL	arXiv'26	Figure Editing	Domain logic / code syntax separation.
Table Understanding & Generation
14	ArxivDIGESTables	EMNLP'24	Table Gen.	Literature comparison table synthesis.
15	Chain-of-Table	ICLR'24	Table Reasoning	Multi-step table reasoning chains.
16	ShowTable	CVPR'26	Table Viz	Collaborative reflection and refinement.
17	Table2LaTeX-RL	arXiv'25	Table Conversion	Image-to-LaTeX via reinforced multimodal LM.
Mathematical Formulas & TikZ
18	AutomaTikZ	ICLR'24	TikZ	DaTikZ: first large-scale dataset (120K drawings).
19	DeTikZify	NeurIPS'24	TikZ	360K TikZ graphics; MCTS iterative refinement.
20	TikZilla	arXiv'26	TikZ	3B/8B matches GPT-5; SFT+RL on expanded DaTikZ.
Visual Fidelity and Scientific Accuracy Assessment
21	PlotCraft	arXiv'25	Benchmark	1K-task benchmark; 48 chart types.
22	TeXpert	SDP'25	Benchmark	3-level difficulty; 78.8%/58.7%/17.5%.
23	AbGen	ACL'25	Benchmark	1,500 ablation studies; 807 NLP papers.
24	SciFig	arXiv'26	Benchmark	Rubric-based evaluation; 2K+ pipeline figures.
25	SciFlow-Bench	arXiv'26	Benchmark	500 framework figures; inverse-parsing evaluation.
26	FigureBench	ICLR'26	Benchmark	3,300 text-figure pairs; publication-ready evaluation.
27	Crafter	arXiv'26	Method Diagrams	Multi-agent editable figure generation from heterogeneous inputs.
28	DiagramRAG	arXiv'26	Method Diagrams	Retrieval-augmented scientific diagram generation.
29	GeoSVG-RL	arXiv'26	Method Diagrams	Geometry-aware RL for layout-constrained text-to-SVG diagrams.
30	CSPO	arXiv'26	Table Conversion	Component-specific optimization for table-to-LaTeX generation.

Paper Collection · Stage 5 · Paper Writing
#	Method	Venue	Category	Evaluation
Semi-Automated Writing Assistance
1	CoAuthor	arXiv'22	Collaborative	Human–AI collaborative writing workflows.
2	Script&Shift	CHI'25	Source Transform	CHI Honorable Mention; preserves cognitive engagement.
3	AI Writing Study	AIED'25	Empirical Study	90-student RCT; purposeful AI fosters writing.
4	OpenDraft	GitHub'25	Full Draft Gen.	19 agents; 20K+ words in 10 min; verified citations.
5	DraftMarks	arXiv'25	Transparency	Skeuomorphic visual traces for AI process transparency.
6	PaperDebugger	arXiv'25	In-Editor Assist	Multi-agent Overleaf plugin (Reviewer+Enhancer+Scorer).
7	ScholarCopilot	arXiv'25	Citation Assist	40.1% top-1 citation accuracy (vs 15.0% E5-Mistral).
8	XtraGPT	arXiv'25	Post-Writing	1.5B–14B models; 7K papers; 140K revision pairs.
9	LimAgents	arXiv'26	Limitations Gen.	OpenReview comments + citation network integration.
Fully Automated Paper Generation
10	CycleResearcher	ICLR'25	E2E Gen.	5.36 ICLR scale (vs 5.24 preprint, 5.69 accepted).
11	Agent Laboratory	EMNLP'25	E2E Gen.	$2–13/paper; 84% cost reduction; 3.5–4.0 score.
12	FutureGen	arXiv'25	Section Gen.	RAG-based Future Work section generation.
13	AI Scientist	Nature'26	E2E Gen.	$15/paper; end-to-end across 3 ML subfields.
14	APRES	arXiv'26	Rubric Revision	79% expert preference; citation-predictive rubrics.
Societal Analysis
15	AI Writing Adoption	Nature'26	Measurement	41.3M papers; AI expands impact but contracts focus.
16	Nature AI Survey	Nature'26	Survey	57% of researchers use AI in peer review.
Writing Quality and AI Detection Assessment
17	Mapping LLM Use	arXiv'24	Detection	Up to 17.5% of CS papers AI-modified.
18	CycleReviewer	ICLR'25	AI Judge	26.89% MAE reduction vs individual human reviewers.
19	Stanford Agentic	Web'25	AI Judge	ρ=0.42 vs human ρ=0.41; matches consistency.
20	SciIG	arXiv'25	Writing Bench	NAACL/ICLR 2025 introduction writing benchmark.
21	Watermarking	arXiv'25	Detection	Near-zero false-positive rate under controlled conditions.
22	PaperWritingBench	arXiv'26	Benchmark	200 reverse-engineered top-tier conference papers.
23	PaperMentor	arXiv'26	In-Editor Assist	Human-centered multi-agent Overleaf writing tutor.
24	LECTOR	arXiv'26	Section Gen.	Introduction generation via joint reasoning-graph optimization.
25	CiteTracer	arXiv'26	Detection	Multi-agent citation-hallucination detection.
26	Process Eval	arXiv'26	Writing Bench	Keystroke-level study of AI- vs. human-draft revision.

Awesome AI Auto-Research Team

A comprehensive practitioner's guide for using AI-assisted tools across the complete academic research lifecycle.

Paper Collection · Stage 6 · Peer Review
#	Method	Venue	Category	Evaluation
Automated Review Generation
1	ChatReviewer	GitHub'23	Review Gen.	ChatGPT-based strengths/weaknesses analysis.
2	AI-Peer-Review	GitHub'24	Review Gen.	Multi-LLM independent reviews + meta-review synthesis.
3	MARG	arXiv'24	Review Gen.	3.7 good comments/paper (2.2× over baseline).
4	Reviewer2	arXiv'24	Review Gen.	Two-stage prompt-based review aspect modeling.
5	ReviewRL	EMNLP'25	Review Gen.	RL + RAG; factually grounded reviews.
6	DeepReviewer	arXiv'25	Review Gen.	88.21% win rate vs GPT-o1; 64% accept/reject acc.
7	OpenReviewer	NAACL'25	Review Gen.	Llama-8B fine-tuned on 79K expert reviews.
8	REMOR	arXiv'25	Review Gen.	Multi-objective RL review optimization.
9	ScholarPeer	arXiv'26	Review Gen.	Context-aware multi-agent; literature verification.
Meta-Review & Reviewer Matching
10	AgentReview	EMNLP'24	Meta-Review	Full review lifecycle simulation; social/authority bias.
11	Meta-Review LLMs	NAACL'25	Meta-Review	40 ICLR papers; GPT-3.5/LLaMA-2/PaLM-2 compared.
12	RATE	arXiv'26	Matching	Expertise-based matching via profile distillation.
Adversarial Attacks & Bias Analysis
13	Raina et al.	EMNLP'24	Adversarial	Benign adjectives as universal adversarial triggers.
14	AI Review Lottery	arXiv'24	Bias Analysis	15.8% ICLR reviews AI-assisted; +4.9pp borderline.
15	Ye et al.	arXiv'24	Adversarial	Scores inflated to ~8; 5% manipulation flips 12%.
16	Breaking the Reviewer	arXiv'25	Adversarial	Systematic adversarial robustness evaluation.
17	LLM Reviewer Bias	arXiv'25	Bias Analysis	1,441 papers; 95.8% rejected misclassified as acceptable.
18	Prompt Injection	arXiv'25	Adversarial	White-text injection; up to 100% acceptance scores.
19	Sahoo et al.	arXiv'25	Adversarial	+13.95 on Mistral; 13 LLMs; 15 attack strategies.
20	Zhou et al.	arXiv'25	Adversarial	+1.24 to +2.80 from hype prose; 10.00 under iteration.
Detection & Policy
21	AI Detection	arXiv'25	Detection	788,984 AI-written reviews; 18 detection algorithms.
22	AI Use Rejects	Nature'26	Policy	497 papers rejected (~2% of submissions).
23	Nature AI Survey	Nature'26	Survey	1,600 academics; 57% use AI in peer review.
24	Policy Enforcement	arXiv'26	Policy	All 5 SOTA detectors misclassify LLM-polished reviews.
25	Reviewer Feedback	CHI'26	Empirical	ICLR 2025 live process; reviewer engagement study.
Review Consistency and Bias Assessment
26	Review Survey	IF'25	Survey	Comprehensive taxonomy of review methods.
27	Stanford Agentic	Web'25	Quality	ρ=0.42 vs human ρ=0.41; matches consistency.
28	ClaimCheck	EMNLP'25	Quality	LLM critique grounding; gaps in factual basis.
29	ReViewGraph	AAAI'26	Quality	+15.73% avg improvement via heterogeneous graph.
30	ReviewAgents	arXiv'25	Quality	37,403 papers; 142,324 reviews; Review-CoT dataset.
31	ICLR 2025 Study	NMI'26	Quality	22,467 reviews; 89% quality improved; no acceptance effect.
32	AI Reviewer Limits	arXiv'26	Quality	45 experts assess AI vs. human reviews of Nature-family papers.
33	PRISM	arXiv'26	Quality	Multi-dimensional benchmark for LLM peer reviewers.
34	ProReviewer	arXiv'26	Review Gen.	Proactive investigation agent; probes suspicious paper parts.
35	MERIT	arXiv'26	Matching	Rubric-informed reviewer assignment via expertise modeling.
36	Presentation Gaming	arXiv'26	Adversarial	Presentation-only revisions game AI review without prompts.
37	LLMs Favor LLMs?	arXiv'26	Bias Analysis	Apparent self-preference is largely general leniency.

Paper Collection · Stage 7 · Rebuttal & Revision
#	Method	Venue	Category	Evaluation
Reviewer Comment Analysis
1	ReviewMT	arXiv'24	Analysis	26,841 papers; 92,017 reviews; multi-turn dialogue.
2	ICLR Rebuttal Study	arXiv'25	Analysis	ICLR 2024–2025; score transition analysis.
Automated Rebuttal Generation
3	ReviewerToo	arXiv'25	Modular Pipeline	81.8% accept/reject accuracy (vs 83.9% human).
4	RebuttalAgent	ICLR'26	Rebuttal Gen.	18.3% avg improvement; ToM-grounded.
5	Author-in-the-Loop	ACL'26	Author-Aware	Integrates author expertise and intent.
6	DRPG	arXiv'26	Rebuttal Gen.	98%+ planning accuracy; surpasses avg human quality.
7	Paper2Rebuttal	arXiv'26	Rebuttal Gen.	Evidence-centric rebuttal planning.
Rebuttal Effectiveness Assessment
8	Re²	arXiv'25	Dataset	19,926 submissions; 70,668 reviews; 53,818 rebuttals.
9	Commitment Checklist	arXiv'26	Benchmark	11.8 commitments/paper; ~25% unfulfilled.
10	Re³Align	ACL'26	Dataset	First large-scale aligned review–response–revision triplets.
11	RbtAct	arXiv'26	Analysis	Rebuttals as supervision for actionable review feedback.
12	GoodPoint	arXiv'26	Analysis	Learns constructive feedback grounded in author responses.
13	Defend	arXiv'26	Rebuttal Gen.	Minimal-guidance rebuttal generation; improved factual correctness.