Friday, May 15, 2026

Flooded Zones Part 1

Tom Cowap
CC-BY-SA 4.0
Three years ago in Flooding The Zone With Shit, my first post on the AI bubble, I wrote:
My immediate reaction to the news of ChatGPT was to tell friends "at last, we have solved the Fermi Paradox". It wasn't that I feared being told "This mission is too important for me to allow you to jeopardize it", but rather that I assumed that civilizations across the galaxy evolved to be able to implement ChatGPT-like systems, which proceeded to irretrievably pollute their information environment, preventing any further progress.
The post title was a notorious quote from Steve Bannon. Below the fold, I look into scholarly publication, the first of three areas whose zones are currently being flooded with AI output in what can be considered DDoS (Distributed Denial of Service) attacks:
A distributed denial-of-service (DDoS) attack occurs when multiple systems flood the bandwidth or resources of a targeted system, usually one or more web servers.
A subsequent post will examine two more flood zones, political discourse and software security.

Background

Spam
DDoS attacks work when it is cheaper for the attacker to consume the victim's resources than it is for the victim to supply them[1]. Everyone is familiar with this situation: their mail server has to use a machine-learning system to filter the small amount of ham from the vast flood of spam. This has been going on for more than three decades[2], in a continuous arms race between the spammers and the filters.
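The filters in question are typically statistical classifiers trained on labeled mail. As a toy illustration of the technique, and not any particular mail system's implementation, here is a minimal bag-of-words Naive Bayes ham/spam classifier using scikit-learn, with invented training messages:

    # Toy ham/spam classifier: bag-of-words features plus multinomial Naive Bayes.
    # The training messages are invented for illustration only.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_messages = [
        "Meeting moved to 3pm, see agenda attached",       # ham
        "Lunch on Friday to discuss the grant report",     # ham
        "WIN A FREE PRIZE click here now",                  # spam
        "Cheap pills no prescription limited time offer",   # spam
    ]
    train_labels = ["ham", "ham", "spam", "spam"]

    # Turn each message into word counts, then fit the classifier.
    classifier = make_pipeline(CountVectorizer(), MultinomialNB())
    classifier.fit(train_messages, train_labels)

    # Score new messages; production filters train on vast corpora and
    # combine many more signals, but the cost asymmetry is the same.
    print(classifier.predict(["Click now to claim your free prize"]))  # ['spam']
    print(classifier.predict(["Agenda for Friday's grant meeting"]))   # ['ham']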

Scholarly Publication

The record of scholarship has been under attack for a long time; my "flooding" post included this example:
Open access with "author processing charges" out-competed the subscription model. Because the Web eliminated the article rate limit imposed by page counts and printing schedules, it enabled the predatory open access journal business model. So now it is hard for people "doing their own research" to tell whether something that looks like a journal and claims to be "peer-reviewed" is real, or a pay-for-play shit-flooder. The result, as Bannon explains in his context, is disorientation, confusion, and an increased space for bad actors to exploit.
Now AI makes it very cheap to consume resources in the system, as Elizabeth Gibney reports in How AI slop is causing a crisis in computer science:
Fifty-four seconds. That’s how long it took Raphael Wimmer to write up an experiment that he did not actually perform, using a new artificial-intelligence tool called Prism, released by OpenAI last month. “Writing a paper has never been easier. Clogging the scientific publishing pipeline has never been easier,” wrote Wimmer, a researcher in human–computer interaction at the University of Regensburg in Germany, on Bluesky.

Large language models (LLMs) can suggest hypotheses, write code and draft papers, and AI agents are automating parts of the research process. Although this can accelerate science, it also makes it easy to create fake or low-quality papers, known as AI slop.
It is expensive to run the filters to separate the scholarly ham from the AI spam:
AI slop is hard to spot by conventional means, says Paul Ginsparg, a physicist at Cornell University in Ithaca, New York, and a co-founder of the arXiv. Volunteer moderators can no longer use how well a paper engages with the relevant literature and methods to gauge its merit. “AI slop frequently can’t be discriminated just by looking at abstract, or even by just skimming full text,” he says. This makes it an “existential threat” to the system, he says.
How bad is the problem? In How much of the scientific literature is generated by AI?, Miryam Naddaf asks:
How much of the scientific literature is generated by AI? The first studies of the size of the AI footprint in scientific journals, preprint repositories and peer-review reports give a spread of answers — and indicate a rapidly evolving situation that it is difficult to get a handle on.

The fear of many in the research community is that poor-quality or entirely fabricated research produced by large language models (LLMs) could overwhelm the ability of current quality-control systems to detect it, thereby polluting the scientific canon.
The fear is justified. Can AI tools help reduce the cost of weeding out the AI slop? For example, Pangram is a service that detects AI-generated text. In Pangram Predicts 21% of ICLR Reviews are AI-Generated, Bradley Emi asks:
Are authors using LLMs to write AI research papers? Are peer reviewers outsourcing the writing of their reviews of these papers to generative AI tools? In order to find out, we analyzed all 19,000 papers and 70,000 reviews from the International Conference on Learning Representations, one of the most important and prestigious AI research publication venues. Thanks to OpenReview and ICLR's public review process, all of the papers and their reviews were made publicly available online, and this open review process enabled this analysis.
Pangram found that a significant proportion of the reviews were AI slop:
We found 21%, or 15,899 reviews, were fully AI-generated. We found over half of the reviews had some form of AI involvement, either AI editing, assistance, or full AI-generation.
There was less AI slop in the papers, but still significant AI use:
Paper submissions, on the other hand, are still mostly human-written (61% were mostly human-written). However, we did find several hundred fully AI-generated papers, though they seem to be outliers, and 9% of submissions had over 50% AI content.
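Mechanically, an analysis like this is just bulk classification over a public corpus. Here is a minimal sketch, assuming a hypothetical detect_ai_probability() function standing in for a commercial detector such as Pangram (whose real API this does not reproduce), an illustrative threshold, and a local iclr_reviews.json file rather than a live pull from OpenReview:

    # Sketch: score a corpus of peer reviews and report the fraction flagged
    # as fully AI-generated. The detector call is a placeholder.
    import json

    FULLY_AI_THRESHOLD = 0.9   # illustrative cut-off, not Pangram's

    def detect_ai_probability(text: str) -> float:
        """Placeholder for a real detector's API; returns the estimated
        probability that the text is AI-generated."""
        raise NotImplementedError("plug in a licensed detector here")

    def summarize(reviews_path: str) -> None:
        # Expects a JSON list of objects like {"id": ..., "text": ...}.
        with open(reviews_path) as f:
            reviews = json.load(f)
        flagged = [r["id"] for r in reviews
                   if detect_ai_probability(r["text"]) >= FULLY_AI_THRESHOLD]
        pct = 100 * len(flagged) / max(len(reviews), 1)
        print(f"{len(flagged)} of {len(reviews)} reviews "
              f"({pct:.1f}%) flagged as fully AI-generated")

    # Usage, once a real detector is wired in:
    # summarize("iclr_reviews.json")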
Of course, just because Pangram flags a review or a paper as AI-generated doesn't mean it is wrong, just as the fact that a paper is human-written doesn't mean it is right. A decade ago, long before AI arrived, science was suffering a reproducibility crisis caused by Bad incentives in peer-reviewed science. Eleven years ago Arthur Caplan of the Division of Medical Ethics at NYU's Langone Medical Center predicted it would lead to a total loss of science's credibility:
The time for a serious, sustained international effort to halt publication pollution is now. Otherwise scientists and physicians will not have to argue about any issue—no one will believe them anyway.
No one did anything effective, so Caplan's "otherwise" came to pass. Science has had a quality problem for a long time. The bad incentives have also caused a quantity problem, spawning pay-to-play predatory journals publishing garbage under the "peer-reviewed" brand.

Gartenberg Fig. 2
An even more comprehensive analysis, also using Pangram, was reported for the journal Organization Science in More Versus Better: Artificial Intelligence, Incentives, and the Emerging Crisis in Peer Review by Claudine Gartenberg et al. They find that AI's reduction in the cost of pumping up a researcher's publication count has caused a massive spike in submissions to an already overloaded system:
While there could be many reasons for the rise in submissions, including reduced backlogs, increased scholar productivity, or journal reputation, Figure 2 suggests that the disproportionate increase in submission volume is driven by AI use. Post-ChatGPT, we see a marked decline in submissions flagged at 0%–15% AI (little to no AI use) and a corresponding rise in all other categories that make up the difference between the decline in human-only submissions and the 42% increase in total submissions.
Gartenberg Fig. 3
The additional submissions were marked by heavy AI use:
Prior to the launch of ChatGPT, relative shares were flat. Nearly all submissions were classified as human (with some idiosyncratic noise). Immediately after the launch of the first commercial LLM chatbots, a precipitous decline in human-only submissions began. At the same time, we observe a steady rise in all categories of AI-supported or generated submissions. What is most striking is that by February 2026, the majority of submissions submitted to Organization Science use AI in their writing to some degree. The most striking trend is the rise of the 70%+ AI category, where text is mostly or entirely generated by AI.
Gartenberg Fig. 6
So much for quantity. There are no good automated tools to assess the quality of the research, but there is a wide range of automated tools to assess the quality of the writing. Applying them, the authors found a significant correlation between AI use and degraded readability:
We do not find much evidence that the writing quality of those manuscripts changed meaningfully between 2013 and November 2022, when ChatGPT was launched. In contrast, post-ChatGPT, we see a precipitous decline in the average manuscript’s Reading Ease score. Indeed, AI scores and Flesch Reading Ease are negatively correlated
...
We find strong evidence that AI use is associated with lower-quality writing across most of these traditional measures. This result is counterintuitive. Authors often assume that using AI will improve their writing. However, this is not the case, at least when authors substantially offload their writing to it.
...
AI prose is more difficult to read on several dimensions. Beyond substantially lower Flesch Reading Ease scores, the grade level required to understand the text is higher (more multisyllabic words); the FOG and SMOG indices increase, suggesting more complex text; and the use of jargon increases. We also find increased use of nominalizations (e.g., “conceptualization”, “operationalization”, or “contextualization”).
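These writing measures are all mechanical and cheap to compute. As a sketch, and not necessarily the authors' own tooling, the open-source textstat package will produce the scores mentioned above for any passage of text:

    # Compute the readability measures discussed above for a passage of text.
    # textstat is an open-source Python package; the sample sentence is invented.
    import textstat

    text = (
        "The conceptualization and operationalization of the construct "
        "necessitates a multidimensional contextualization of the findings."
    )

    print("Flesch Reading Ease: ", textstat.flesch_reading_ease(text))   # higher = easier
    print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(text))  # US school grade level
    print("Gunning FOG index:   ", textstat.gunning_fog(text))           # higher = more complex
    print("SMOG index:          ", textstat.smog_index(text))            # higher = more complex

Run over a journal's submissions month by month, scores like these are what underlie the readability trends the authors report.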
Whatever the quality of AI generated papers, they are massively aggravating the quantity problem. Samantha Cole reports on one approach to reducing the flow in ArXiv to Ban Researchers for a Year if They Submit AI Slop:
Late Thursday evening, Thomas Dietterich, chair of the computer science section of ArXiv, wrote on X: “If generative AI tools generate inappropriate language, plagiarized content, biased content, errors, mistakes, incorrect references, or misleading content, and that output is included in scientific works, it is the responsibility of the author(s). We have recently clarified our penalties for this. If a submission contains incontrovertible evidence that the authors did not check the results of LLM generation, this means we can't trust anything in the paper.”

Examples of incontrovertible evidence, he wrote, include “hallucinated references, meta-comments from the LLM (‘here is a 200 word summary; would you like me to make any changes?’; ‘the data in this table is illustrative, fill it in with the real numbers from your experiments’).”

“The penalty is a 1-year ban from arXiv followed by the requirement that subsequent arXiv submissions must first be accepted at a reputable peer-reviewed venue,” Dietterich wrote.
I have two suggestions:
  • Journals should make authors take responsibility for their words and provide them the tools to do so. They should check all incoming papers and reviews with Pangram or a similar tool and automatically reject anything above a set AI score. But they should provide the same tool to the authors, so they can know before submission whether it will pass the filter (a sketch of such a check follows this list).
  • Too many papers will still pass the filter and have to be reviewed. Journals should train their own, proprietary, domain-specific reviewers on their own and related content. Using them as a first pass filter would ensure that the limited human reviewer bandwidth was defended from flooding[3]. The quality of the training data would ensure that the better journals had better first pass filters, preserving their quality advantage.
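A minimal sketch of the first suggestion: the gate below assumes a hypothetical score_with_detector() wrapper around whatever detection service the journal licenses, and an illustrative threshold; neither is Pangram's actual interface or policy.

    # Sketch of a submission-time gate: reject manuscripts whose estimated
    # AI-generated share exceeds the journal's published threshold.
    # score_with_detector() is a placeholder for a licensed detection service.

    REJECT_THRESHOLD = 0.5   # illustrative policy, set and published by the journal

    def score_with_detector(manuscript_text: str) -> float:
        """Placeholder: return the detector's estimate of the fraction of the
        manuscript that is AI-generated, between 0.0 and 1.0."""
        raise NotImplementedError("call the journal's licensed detector here")

    def screen_submission(manuscript_text: str) -> dict:
        score = score_with_detector(manuscript_text)
        return {
            "ai_score": score,
            "decision": "reject" if score > REJECT_THRESHOLD else "send to reviewers",
        }

    # Exposing the same screen_submission() check to authors before submission
    # makes the threshold a published rule rather than a trap.

Because the journal and the author run the identical check against the identical threshold, an author whose submission bounces cannot claim the filter was a surprise.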

Footnotes

  1. This was a problem we addressed in the design of the LOCKSS protocol (a toy sketch of the effort-balancing idea follows these footnotes):
    Effort Balancing. If the effort needed by a requester to procure a service from a supplier is less than the effort needed by the supplier to furnish the requested service, then the system can be vulnerable to an attrition attack that consists simply of large numbers of ostensibly valid service requests. We can use provable effort mechanisms such as Memory-Bound Functions to inflate the cost of relatively “cheap” protocol operations by an adjustable amount of provably performed but otherwise useless effort. By requiring that at each stage of a multi-step protocol exchange the requester has invested more effort in the exchange than the supplier, we raise the cost of an attrition strategy that defects part-way through the exchange. This effort balancing is applicable not only to consumed resources such as computations performed, memory bandwidth used or storage occupied, but also to resource commitments. For example, if an adversary peer issues a cheap request for service and then defects, he can cause the supplier to commit resources that are not actually used and are only released after a timeout (e.g., SYN floods). The size of the provable effort required in a resource reservation request should reflect the amount of effort that could be performed by the supplier with the resources reserved for the request.
  2. I described the history in "Nobody cared about security":
    The first spam e-mail was sent in 1978 and evoked this reaction:
    ON 2 MAY 78 DIGITAL EQUIPMENT CORPORATION (DEC) SENT OUT AN ARPANET MESSAGE ADVERTISING THEIR NEW COMPUTER SYSTEMS. THIS WAS A FLAGRANT VIOLATION OF THE USE OF ARPANET AS THE NETWORK IS TO BE USED FOR OFFICIAL U.S. GOVERNMENT BUSINESS ONLY. APPROPRIATE ACTION IS BEING TAKEN TO PRECLUDE ITS OCCURRENCE AGAIN.
    Which pretty much fixed the problem for the next 16 years. But in 1994 lawyers Canter & Siegel spammed the Usenet with an advertisement for their "green card" services, and that December the first commercial e-mail spam was recorded.
  3. Credit for this idea goes to Vicky Reich.
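As a toy illustration of the effort balancing described in footnote 1, here is a hash-based proof-of-work sketch. It is a stand-in for the memory-bound functions the LOCKSS work actually proposes, not the LOCKSS protocol itself; the request string and difficulty are invented. The requester must perform an adjustable, provable amount of search before the supplier spends more than one hash verifying it:

    # Toy effort balancing via hash-based proof of work (a stand-in for the
    # memory-bound functions of footnote 1, not the actual LOCKSS mechanism).
    import hashlib
    from itertools import count

    DIFFICULTY_BITS = 16   # requester's expected cost doubles with each extra bit

    def _leading_zero_bits(digest: bytes) -> int:
        bits = bin(int.from_bytes(digest, "big"))[2:].zfill(len(digest) * 8)
        return len(bits) - len(bits.lstrip("0"))

    def prove_effort(request: bytes) -> int:
        """Requester side: search for a nonce whose hash over the request has
        DIFFICULTY_BITS leading zero bits; roughly 2**DIFFICULTY_BITS hashes."""
        for nonce in count():
            digest = hashlib.sha256(request + nonce.to_bytes(8, "big")).digest()
            if _leading_zero_bits(digest) >= DIFFICULTY_BITS:
                return nonce

    def verify_effort(request: bytes, nonce: int) -> bool:
        """Supplier side: a single hash, so checking stays far cheaper than
        the requester's search."""
        digest = hashlib.sha256(request + nonce.to_bytes(8, "big")).digest()
        return _leading_zero_bits(digest) >= DIFFICULTY_BITS

    request = b"example service request"
    nonce = prove_effort(request)         # expensive for the requester
    assert verify_effort(request, nonce)  # cheap for the supplier

The asymmetry, an expensive search for the requester against a single verification hash for the supplier, is the property the quoted passage calls requiring that "the requester has invested more effort in the exchange than the supplier".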
