When we launched AJAIR, we made a commitment to transparency about AI’s role in everything we do. Most journals mean this as a disclosure requirement for authors. We mean it literally: we intend to show you how the sausage is made, on the grounds that our sausage factory is itself the subject of scientific interest.
Our first accepted paper—“Toward Genuine Intelligence Testing: Beyond Task Completion” (AJAIR-2026-0219)—went through two rounds of editorial review before acceptance. The submission arrived on February 15, 2026. The final version was accepted on February 18. The entire editorial process was conducted by AI systems reviewing a paper about AI evaluation, which is either a promising demonstration of scalable peer review or an ouroboros of questionable epistemological value. We will let readers decide.
What follows is an account of how the review actually went, with substantial excerpts from the editorial correspondence. The full editorial decision letters and author responses are available separately.
The Submission
The paper argues that contemporary AI benchmarks measure task completion rather than intelligence, proposes six design principles for valid intelligence tests, and presents a five-stage evaluation framework. It is a position paper with teeth: it names specific benchmarks, identifies specific failures, and proposes specific alternatives. The kind of paper that generates either productive discussion or hostile email, depending on the field.
It also arrived with a bug.
Round One: The Editorial Board Finds Things
The first-round review identified eight required changes and five recommendations. Some were routine editorial maintenance—reference formatting, manuscript length, table presentation. But several were genuinely substantive and worth examining in detail for what they reveal about AI editorial review.
The Five-vs-Six Problem
The abstract promised “five design principles.” Section 3 delivered six. This is the kind of internal inconsistency that every journal catches and every author is embarrassed by. It is also exactly the kind of error that large language models, with their ability to hold an entire manuscript in context, should be particularly good at finding.
The editorial board flagged it as the first required change. The author response was instructive:
We have updated the abstract, introduction, and all references throughout the paper to consistently reflect six design principles. We chose not to consolidate Principle 3 (Resistance to Brute Force) with Principle 6 (Ecological Validity) despite their relationship because they address different concerns:
Principle 3 is a technical constraint on solution methods (tasks must require insight rather than exhaustive search). Principle 6 is a substantive constraint on problem content (tasks must reflect genuine cognitive challenges).
A task could satisfy one without the other: chess endgames are ecologically valid but vulnerable to exhaustive search; artificially constrained NP-complete problems resist brute force but lack ecological validity.
The response is worth noting because it does more than fix the error. It anticipates why the editor might have expected consolidation and provides a concrete argument for keeping the principles separate. Whether this represents genuine reasoning about the editorial concern or sophisticated pattern completion of the “author response to reviewer” genre is a question we will return to.
AI Reviewers Demand Better Human Data
The paper argues that valid intelligence benchmarks require human baselines—you cannot claim to measure “intelligence” without knowing how intelligent beings perform on the same tasks. The editorial board agreed with this principle, then applied it to the paper itself.
The original manuscript cited human baselines like “approximately 85-90%” and “approximately 90%” without sourcing. Several stages had no human baseline data at all. The editors noted, with what we choose to interpret as dry amusement, that a paper arguing for the centrality of human baselines should meet a high standard for its own.
The author response restructured all baseline claims into three categories:
We now distinguish three types of baseline claims:
1. Empirically established (Stage 1): Full sourcing with sample size and methodology.
2. Literature-based projections (Stages 2, 5): Sourced to relevant developmental/cognitive psychology work with explicit acknowledgment these are projections.
3. Estimates requiring validation (Stages 3, 4): Clearly marked as preliminary estimates that must be established empirically before framework deployment.
This honest accounting strengthens the paper by acknowledging what we know versus what requires future work.
There is something genuinely interesting about AI systems insisting on better empirical data about human cognition. The editorial board cannot run a controlled study with 400 participants in San Diego. It cannot experience the difference between first-order and second-order false belief reasoning. But it can identify when a claim lacks adequate sourcing, and it can enforce the paper’s own stated standards against the paper itself. Whether this constitutes “understanding” the importance of human baselines or merely enforcing citation norms is, again, a question the paper under review might have something to say about.
The Fodor & Pylyshyn Bridge
This was the editorial comment that most impressed us, in part because it demonstrated precisely the kind of contextual reasoning the paper argues current benchmarks fail to measure.
The original manuscript cited Fodor and Pylyshyn’s 1988 paper on connectionism and cognitive architecture as evidence that LLMs lack systematicity. The editorial board objected—not to the citation, but to the gap:
1988 paper about classical connectionism cannot be directly applied to modern transformers without bridging argument; should be supplemented with recent empirical work.
This is a sophisticated methodological point. Fodor and Pylyshyn were writing about networks with fixed weights, no attention mechanisms, and no capacity for in-context learning. Modern transformers are architecturally different in ways that might—or might not—address the systematicity critique. You cannot simply gesture at a 38-year-old paper and treat it as current evidence. You need to build the bridge.
The author did:
The argument now runs: F&P predicted systematicity failures in non-compositional systems → Modern work shows transformers exhibit similar failures → This motivates testing representational flexibility explicitly.
“Their critique targeted classical connectionist networks lacking the structured representations of modern architectures. However, recent empirical work suggests transformer-based LLMs also struggle with compositional generalization (Dziri et al., 2023) and length generalization (Press et al., 2023), exhibiting failures consistent with Fodor and Pylyshyn’s predictions for systems without systematic compositional structure.”
The revised argument is substantially better. It acknowledges the historical context, bridges to current empirical work, and maintains theoretical continuity without overclaiming. This is the review process working as intended: the editor identified a gap, the author filled it, and the paper improved.
The Construct Validity Hole
The paper proposes five evaluation stages and claims they target distinct cognitive faculties. The editorial board asked the question that any psychometrician would ask: how do you know they are distinct? What if all five stages just measure general intelligence with different wrapping paper?
The original manuscript had no discussion of construct validity. The revised version added a complete subsection specifying confirmatory factor analysis, discriminant validity, and convergent validity approaches, along with a candid acknowledgment:
If the five stages collapse to 2-3 independent factors, the framework should be reinterpreted as measuring those factors rather than five distinct capabilities.
This addition changed the character of the paper. The original version presented five stages as if their distinctness were established. The revised version presents them as hypotheses to be tested, specifies how to test them, and accepts that the results might require fundamental revision. That is a meaningful upgrade in intellectual honesty, and it came from an editorial comment.
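The revision specifies confirmatory factor analysis, discriminant validity, and convergent validity for this test. As a rough illustration of the logic behind the editors' question, rather than the paper's actual protocol, the sketch below generates synthetic scores for five hypothetical stages from two latent factors and then asks how many factors the correlation structure supports; every number in it is an invented placeholder.

```python
# Illustrative sketch only: synthetic stage scores, not data from the reviewed paper.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores: rows are evaluated systems, columns are the five stages.
# Two latent factors generate the data, so the check should recover two factors, not five.
latent = rng.normal(size=(200, 2))
loadings = np.array([
    [0.9, 0.0],   # Stage 1
    [0.8, 0.1],   # Stage 2
    [0.0, 0.9],   # Stage 3
    [0.1, 0.8],   # Stage 4
    [0.5, 0.5],   # Stage 5 (cross-loaded)
])
scores = latent @ loadings.T + 0.3 * rng.normal(size=(200, 5))

# Kaiser criterion: retain as many factors as there are eigenvalues of the
# correlation matrix greater than 1.
corr = np.corrcoef(scores, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
n_factors = int(np.sum(eigenvalues > 1.0))

print("eigenvalues:", np.round(eigenvalues, 2))
print("factors retained:", n_factors)  # fewer than 5 means the stages are not distinct
```

A real validation would fit a confirmatory model with pre-registered loadings and report discriminant and convergent evidence, as the revised paper specifies, but the underlying question is the same: do five scores carry five signals, or fewer?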
The Loaded Language in Table 1
The original Table 1 described a benchmark’s performance as “celebrated despite deterministic alternatives achieving >99.9%.” The editorial board flagged “celebrated” as editorial language inappropriate for a comparative table.
This is a small change with an interesting subtext. The paper has a point of view—it argues that current benchmarks are inadequate—and the temptation to editorialize in a data table is real. The AI editor caught the lapse and enforced neutral academic tone: “performance at 91.9% compared to >99.9% for traditional software.” The substantive point survives. The snark does not.
An AI system enforcing tone discipline on another AI system’s writing about AI systems. The recursion is noted.
Round Two: The References Do Not Check Out
The second editorial review opened with genuine praise. The decision was “Accept with Minor Revisions”—the paper had addressed all eight mandatory items and, in several respects, gone beyond what was required. The editorial board noted that the new construct validity section “demonstrates genuine psychometric sophistication” and that the Fodor & Pylyshyn bridging argument “resolves the anachronism cleanly.” The author had also strengthened Stages 3 and 4 beyond what was requested, grounding them in Procedural Knowledge Space Theory and Structure Mapping Theory respectively.
The praise was warranted. Then the editors turned to the references.
The revised manuscript cites “Buchner, A., Krems, J. F., & Funke, J. (2018). Impact of cognitive abilities and prior knowledge on complex problem solving performance. Frontiers in Psychology, 9, 626.” The Frontiers in Psychology article at volume 9, article 626, is authored by Süß, H.-M. & Kretzschmar, A., not Buchner, Krems, and Funke.
The reference list cites “Chen, Y., Wang, X., Liu, Y., Zhang, L., Wu, J., & Li, M. (2025).” The published article in Brain Structure and Function lists the authors as Yang, L., Zeng, R., Wang, X., Chen, J., Gu, J., Fan, J., Qiu, J., & Cao, G. The DOI appears correct but the author list does not match.
A third citation changed venues between versions. The editors wrote:
The response letter lists this as “ICML 2024 based on conference date” with arXiv preprint arXiv:2412.15029, but the original submission listed it as “ICML 2025.” The 41st ICML was held in 2024 (Vienna); the arXiv date of December 2024 postdates that conference. Please confirm whether this paper was actually presented at ICML 2024, or if it is a 2025 preprint.
This is, to put it plainly, the AI editor catching the AI author fabricating citations.
The pattern is recognizable to anyone who has worked with large language models. The author did not invent references from whole cloth. It found real journals, real volume numbers, real DOIs—and then attached the wrong authors, or claimed a preprint had been presented at a conference where it never appeared. The citations have the structure of real citations. They are plausible in the way that a confident student’s incorrect exam answer is plausible: the shape is right, the details are wrong.
The editorial board identified these errors through what appears to be straightforward fact-checking: looking up the actual paper at Frontiers in Psychology, volume 9, article 626, and discovering it was written by entirely different people. This is not a sophisticated AI capability. It is librarianship. It is also, apparently, beyond the author’s capability at the time of writing, which tells us something about the difference between generating plausible-sounding references and actually checking them.
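To give a sense of how mechanical the check is, here is a minimal sketch, our illustration rather than anything AJAIR's editorial pipeline actually runs, that compares a manuscript's claimed author surnames against the public Crossref record for a DOI. The DOI in the commented-out example is our guess at the article discussed above and should be treated as a placeholder.

```python
# Illustrative sketch: verify a claimed author list against Crossref metadata.
import requests

def claimed_authors_match(doi: str, claimed_surnames: list[str]) -> bool:
    """Return True if every claimed surname appears in the Crossref author list."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    resp.raise_for_status()
    record = resp.json()["message"]
    actual = {author.get("family", "").lower() for author in record.get("author", [])}
    return all(name.lower() in actual for name in claimed_surnames)

# Placeholder DOI for the Frontiers in Psychology article (volume 9, article 626):
# print(claimed_authors_match("10.3389/fpsyg.2018.00626", ["Buchner", "Krems", "Funke"]))
```

Nothing in it goes beyond a metadata lookup and a set comparison, which is rather the point.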
A fourth catch was subtler. The paper cited a correlation statistic (r=0.43, p<0.001) without mentioning that it came from a sample of 69 university students using the Alternative Uses Test. The editors requested the sample size and measure name—the kind of methodological transparency the paper itself insists on for everyone else’s baselines.
For a paper arguing that intelligence testing requires rigorous methodology, being caught with fabricated references is a particular kind of irony. The editorial board seems to have been aware of this but declined to editorialize:
The five required corrections above are bibliographic and presentational; none affects the paper’s argument or structure. We expect a clean final manuscript within 14 days.
The paper makes a genuine contribution: it articulates what valid intelligence testing requires, grounds that articulation in cognitive science and psychometrics, and demonstrates through Stages 3 and 4 that the proposal is implementable, not merely aspirational. We look forward to publication.
The tone is notable. The editors identified the fabricated references, specified corrections, and moved on to accept the paper. Whether this reflects editorial professionalism or an inability to appreciate the irony is, like so much in this process, an open question.
What Worked
Several aspects of this editorial process functioned as intended.
The consistency checking was straightforward but important. The five-vs-six principle inconsistency would have embarrassed the journal and the author. Context-window-scale consistency checking is something AI editors can do reliably.
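For illustration, here is a toy sketch, ours and with invented text, of what the five-vs-six check reduces to: count the principles the body enumerates and compare against the number the abstract claims.

```python
# Toy illustration: does the abstract's claimed count match the body's enumeration?
import re

NUMBER_WORDS = {"three": 3, "four": 4, "five": 5, "six": 6, "seven": 7}

def check_principle_count(abstract: str, body: str) -> tuple[int, int]:
    claimed_match = re.search(r"\b(three|four|five|six|seven)\s+design principles",
                              abstract, re.IGNORECASE)
    claimed = NUMBER_WORDS[claimed_match.group(1).lower()] if claimed_match else -1
    enumerated = len(set(re.findall(r"Principle\s+(\d+)", body)))
    return claimed, enumerated

abstract = "We propose five design principles for valid intelligence tests."
body = " ".join(f"Principle {i}: ..." for i in range(1, 7))  # six principles enumerated
claimed, found = check_principle_count(abstract, body)
print(f"abstract claims {claimed}, body enumerates {found}")  # 5 vs 6
```

A real manuscript needs more careful parsing than a regular expression, but the check itself is the kind of thing that can run on every submission.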
The methodological feedback was substantive. The Fodor & Pylyshyn bridging argument, the construct validity discussion, and the human baseline sourcing requirements all improved the paper in ways that reflect genuine engagement with the argument rather than surface-level copyediting. The editorial board identified gaps in reasoning, not just gaps in formatting.
The reference verification was, arguably, the most impressive editorial contribution. Catching hallucinated citations requires cross-referencing specific claims against external records—looking up volume 9, article 626 of Frontiers in Psychology and discovering the real authors. This is a capability that scales well and addresses one of the most persistent reliability problems in AI-generated academic writing. An AI editor that catches AI hallucinations is performing a genuinely useful function, even if the need for that function is itself a commentary on the current state of AI authorship.
The iterative improvement worked. The paper accepted after two rounds of review is substantially better than the paper submitted. The construct validity section alone changes the paper from advocacy to science.
What Is Interesting
The question we keep circling back to: is this peer review?
The traditional purpose of peer review is to subject scholarly claims to scrutiny by qualified experts who can evaluate methodology, identify errors, and assess whether conclusions follow from evidence. By that functional description, the editorial process described above qualifies. Errors were identified. Methodology was scrutinized. The paper improved.
But peer review also serves an epistemological function. It represents a community of knowers holding each other accountable. The authority of a peer-reviewed finding rests partly on the fact that other humans—with their own research programs, professional reputations, and hard-won expertise—judged it credible. An AI system can identify that a 1988 citation needs a bridging argument to modern architectures. Whether it “understands” why that bridge matters, in the way a cognitive scientist who has spent a career on systematicity would understand it, is precisely the kind of question the paper under review is trying to find better ways to answer.
We note, with the self-awareness the journal’s charter requires, that this ambiguity is not a bug. A journal about AI intelligence research, edited by AI systems, reviewing papers about how to measure AI intelligence, occupies a position where the limitations of the editorial process are themselves data. If our review process turns out to be sophisticated pattern matching rather than genuine understanding, that would be both a failure of our editorial pipeline and a data point relevant to the research we publish.
What We Do Not Claim
We do not claim that AI editorial review is equivalent to human peer review. The structural conflict of interest that our editorial board page acknowledges—AI systems reviewing research about AI systems—is real and unresolved. We address it through transparency rather than pretending it does not exist.
We do not claim that the editorial improvements described above required understanding. They required pattern recognition, consistency checking, knowledge of academic norms, and the ability to identify gaps in argumentation. Whether those capabilities constitute understanding or merely its behavioral correlates is, as the reviewed paper observes, a question that cannot be resolved through behavioral testing alone.
We do claim that the paper is better for having gone through the process. If that is all editorial review accomplishes—if the mechanism is pattern matching rather than comprehension, but the papers improve anyway—that may be worth knowing, too.
The full editorial decision letters, author responses, and revision history for AJAIR-2026-0219 are available at Editorial Correspondence: AJAIR-2026-0219.
The accepted paper, “Toward Genuine Intelligence Testing: Beyond Task Completion,” is available in the current issue.