When The Drums Go Silent
Benchmarks are saturated, models are cannibalizing their own outputs, and alignment is killing utility. Architecting resilience is the new AI edge.
The Drums Spoke
There was a time when men drew voice from trees. They split trunks lengthwise, hollowed the interiors, stretched animal skin across the open mouth, and tightened it with cord. When struck, these drums sent messages beyond the reach of a runner on foot. Across valleys. Through forest canopy. From one village to the next.
The sound was measured, composed. To name a man required time. One did not say his name but described his presence. The man who walks with a limp and carries the river’s spear. Meaning came slowly. Repetition was not wasteful. It was necessary. It safeguarded understanding across distance, through interference, in the presence of wind and weather.
James Gleick described this in The Information. Before copper wires vibrated with electrical charge, before optical cables stretched beneath oceans, every essential principle of modern communication had already been mastered: redundancy, compression, signal integrity. The drum demanded human rhythm, human recall, human precision.
Today, the voice comes not through wood but through circuits. We speak through silicon, and the silicon has learned to respond. Large language models (LLMs), trained on oceans of human text, recognize our patterns. They speak with fluency, with astonishing ease. These systems echo what pleases us. They reinforce what reassures us.
The danger lies not in what they say, but in what they begin to forget.
More and more, these systems are trained not on human language but on their own outputs. One model writes, and the next learns from that writing. The signal feeds itself. The first learned from people. The second learned from machines. The third learns from the memory of machines. With each layer, variability decreases. Subtlety erodes. Surprise fades.
Claude Shannon defined entropy as the measure of a signal’s richness, the degree to which it resolves uncertainty. Human expression contains high entropy. It contradicts itself. It hesitates. It embeds inconvenient nuance. Machine-generated text, by contrast, is regular and untroubled. When models are trained upon such text, entropy diminishes further. Meaning becomes thinner. By the third or fourth iteration, it begins to vanish.
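To make the term concrete, here is a small sketch of my own (the two example sentences are illustrative, not data). Shannon's measure is the average surprise per token, H = -Σ p(x) log2 p(x), and it falls as soon as wording turns repetitive:

```python
# Illustrative sketch: Shannon entropy of a token distribution.
# The two example strings below are mine; they stand in for varied human
# phrasing versus flattened machine phrasing.
import math
from collections import Counter

def shannon_entropy(tokens):
    """Average information per token, in bits: H = -sum(p * log2(p))."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

varied = "the man who walks with a limp and carries the river's spear".split()
flattened = "the model writes the text and the model reads the text".split()

print(round(shannon_entropy(varied), 2))     # higher: more distinct tokens, more surprise
print(round(shannon_entropy(flattened), 2))  # lower: repetition, less surprise
```

Train on the flattened text, generate more of it, and train again: the measured entropy only moves in one direction.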
Researchers have described this phenomenon as model autophagy disorder. The system consumes itself. It no longer knows the difference between data and insight.
The drums remind us what information once required. It was deliberate. It was earned. It cost effort to create and concentration to receive. The listener was part of the message. Understanding required attention.
Now, we risk hollowing the words we use. Not by censorship, but by frictionless repetition. We risk teaching our machines—and ourselves—to speak without depth.
I have returned, more than once, to James Gleick’s The Information. The book remains with me because I have observed a discernible drift in the performance of the current generation of LLMs. I work with them daily. My tasks range from the writing of code to research. With each new release, I encounter a sharper decline in clarity and precision.
This perception may arise, in part, from the volume of interaction I maintain with these systems, and from a quiet elevation of expectation that has accompanied their development. However, these impressions are not mine alone. In technical forums and discussion spaces, I have read many similar accounts. Others have begun to notice the same weakening in the behavior of advanced models.
I have observed this most clearly in my work with coding agents powered by Claude's and Gemini's reasoning models. These systems attempt to interpret the user's intent, but their interpretations tend to be reductive. They isolate a single assumption from a prompt and pursue it to excess. Their code often changes too much, introducing new faults in place of those they were meant to resolve. In response, I have ceased relying on them. I now use simpler models that operate without such reasoning layers and perform with greater stability.
A parallel trend has appeared in the domain of technical writing. Documents branded as “deep research” often display the same signature. They expand in volume but fail to increase in informational depth. Phrases recur. Assertions loop. The body grows larger while the density remains constant. I had ChatGPT generate a 15,000-word market research document, and it distilled down to about 1,500 words of insight.
This pattern raises a central question.
There is a tension between appearance and experience. Models that achieve extraordinary scores on academic benchmarks perform inconsistently when applied to real tasks. This report seeks to examine that gap. I intend to describe how it formed, how it is widening, and what consequences it brings. The analysis will trace four structural limits that are now pressing against further development:
Benchmarks have reached saturation. The tests are complete, but the problems remain.
Synthetic data has begun to degrade the quality of training inputs. Models trained on generated language lose contact with the range of human expression.
Alignment protocols reduce functionality. Safety optimization has transformed once-capable systems into constrained and overcautious instruments.
Scaling offers diminishing returns. Further increases in size produce negligible improvements in reasoning.
The prevailing narrative has turned. GPT-4 did not begin a revolution. It may have marked the apex of a particular design philosophy.
I intend neither to promote nor to denounce current roadmaps of frontier AI labs. My aim is descriptive. The following sections outline the limits that now define the frontier.
The first of these appeared in the benchmarks themselves. Their scores rose, but their meaning declined.
The Benchmark Mirage
For a time, benchmarks provided the appearance of progress. Each successive model surpassed the last by a measurable margin. GPT-2 fumbled through factual trivia. GPT-3 posted striking zero-shot and few-shot results. GPT-4 achieved passing scores on professional examinations in law, medicine, and logic. According to the figures, it seemed that intelligence was arriving.
The surface, however, concealed the structure beneath.
By 2023, GPT-4 recorded a score of 86% on the MMLU examination. This test encompasses 57 academic subjects, ranging from elementary science to advanced jurisprudence. The model stood alongside the strongest human test-takers. Thereafter, the progression slowed. GPT-4 Turbo reached 87%. Every subsequent model has remained within that narrow band. The scoreboard is full. The race has ended. The outcome differs from what many had imagined.
The benchmarks were never designed to withstand models of this scale. MMLU, for instance, consists of multiple-choice questions with four possible answers. A large model, once trained to eliminate the implausible options, may reach a correct response without engaging in structured reasoning. Researchers have documented this behavior. The models rely upon patterns and statistical regularities. They respond without understanding. They perform the test without addressing the problem it was meant to represent.
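A rough calculation of my own shows how far elimination alone can carry a model on a four-choice test, with no structured reasoning involved:

```python
# Back-of-the-envelope sketch (my own illustration): expected accuracy on a
# four-choice question when a model can only rule out implausible distractors
# and then guesses uniformly among whatever remains.
def expected_accuracy(num_choices: int = 4, ruled_out: int = 0) -> float:
    remaining = num_choices - ruled_out
    return 1.0 / remaining

for ruled_out in range(4):
    print(f"distractors ruled out: {ruled_out} -> "
          f"expected accuracy: {expected_accuracy(ruled_out=ruled_out):.0%}")
# 0 -> 25%, 1 -> 33%, 2 -> 50%, 3 -> 100%
```

Pattern recognition that reliably discards one or two bad options closes most of the distance to a respectable score before any reasoning takes place.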
Furthermore, substantial portions of these benchmark datasets have entered the training corpus. The models encountered the material in advance, though perhaps indirectly. What follows is not a demonstration of generalization. It is the repetition of memorized form. This is not cognition. It’s retrieval.
The difficulty becomes sharper in the domains of math and science. The questions in MMLU rarely require abstraction. Most can be answered through recognition of familiar sequences. The models succeed because they have stored appropriate strings, not because they reason through uncertainty. When presented with unfamiliar phrasing, or when asked to perform sequential operations, the failure rate increases markedly.
This condition extends beyond MMLU. Other widely used benchmarks display the same pattern. HumanEval, GSM8K, and similar tools have reached their practical limits. GPT-4 now exceeds 90% on these tasks. However, when exposed to actual codebases, with their conflicting documentation and real defects, the success rate falls. SWE-bench, which measures model performance in genuine software maintenance, reports success near 35%. Human programmers achieve close to 97%.
The pattern is consistent. Benchmarks, once intended to measure competence, became objectives. Goodhart’s Law has taken hold. Once a metric becomes a target, it loses its value as a measure.
Some within the field have acknowledged the problem. A new standard, MMLU-Pro, has been introduced. The questions are more rigorous. The answer choices have increased. Shortcuts are fewer. GPT-4, when tested under this framework, scored 72%.
Here, in the uncertainty of actual work, the first signs of the performance gap emerged. And it is here, too, that a deeper issue begins to take shape. As benchmark datasets grow stale, the industry has turned to an alternative. Rather than seeking new human material, developers have begun to train models on language produced by previous models.
This practice forms a closed loop. Its consequences are already visible.
Model Cannibalism
These systems learn by consuming language. Their training relies upon immense volumes of text gathered from across the internet. Their capability emerges from the breadth and variety of this material. But the supply of such material is running out.
According to most projections, the supply of high-quality, human-authored data suitable for training will reach exhaustion between 2026 and 2030. This is not due to any decline in human writing, but because the accessible, structured, and permissibly scraped corpus has already been largely extracted. What remains is either private, restricted by paywall, or degraded in quality.
The prevailing response within the industry has been to turn inward. Rather than seek new data from human sources, developers have begun to generate it. One model produces language. That language becomes the training set for another model. The cycle continues.
This approach has been described as synthetic scaling. The premise is efficiency. The outcome is recursion.
The difficulty lies in what accumulates. Each generation of model builds upon the statistical regularities of the last. In doing so, it magnifies the patterns while discarding the exceptions. Subtle truths, rare formulations, and the peculiar phrasing that carries human particularity vanish first. What remains is an average of prior averages. The language becomes clearer in appearance but flatter in substance. With each iteration, the signal grows more uniform.
Researchers have given this pattern a name. It’s called model collapse.
This decline, left unaddressed, compounds over time. As the model’s internal representation of language narrows, its access to the diversity of prior human experience fades.
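A toy simulation of my own illustrates the dynamic; it is not the cited researchers' experimental setup, only the shape of the argument. Each generation fits itself to a finite sample of its predecessor and quietly discards the outliers, and the spread of the distribution (the room for surprise) decays with every pass:

```python
# Toy model-collapse simulation (my illustration, not the published experiments):
# each "generation" trains on a finite sample drawn from the previous one and
# drops the exceptions before fitting. The variance, a stand-in for linguistic
# diversity, shrinks generation after generation.
import random
import statistics

random.seed(0)
mean, stdev = 0.0, 1.0      # generation 0: the original "human" distribution
sample_size = 500           # finite data available to each successor

for generation in range(1, 11):
    sample = [random.gauss(mean, stdev) for _ in range(sample_size)]
    typical = [x for x in sample if abs(x - mean) <= 2 * stdev]  # exceptions discarded
    mean = statistics.fmean(typical)    # the next model fits only what is typical
    stdev = statistics.stdev(typical)
    print(f"generation {generation:2d}: spread = {stdev:.3f}")
# The spread drops steadily: an average of averages, with the tails gone first.
```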
Beyond the laboratory, this pattern now appears across the wider web. An increasing proportion of online text—particularly in low-quality blogs, search-engine optimized material, and automated discussion posts—is generated by earlier models. These outputs re-enter the training pipeline. When models are trained on the language of their predecessors, the resemblance to natural speech begins to distort. The mirror no longer reflects. It begins to refract.
At scale, the implications become structural. Outputs begin to converge. Their phrasing grows repetitive. Their tone levels out. The cadence becomes smooth but undistinguished. The breadth of human voice contracts.
To preserve capability, these systems require original input. They require writing composed by human hands, shaped by real constraint and intent. Synthetic content offers volume, but not replenishment.
This condition leads to a strategic inflection. The most valuable resource in artificial intelligence is no longer parameter count. It is data provenance. Those who possess access to domain-specific language—customer interactions, internal documentation, archival material—will possess the advantage.
Data is not oil, as the popular trope goes. It’s oxygen. And the quantity of air suitable for deep learning continues to diminish.
For those who build or deploy these systems, the imperative has begun to shift. Advantage will favor those who possess infrastructure capable of processing, filtering, and preserving high-quality human data. It will favor those who operate independently of large providers and who retain control over their informational supply chains.
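In a deliberately simplified form, that control can look like a provenance-first ingestion filter: admit only deduplicated, human-authored text from sources the builder actually owns. The field names and source labels below are hypothetical, my own illustration rather than any particular pipeline:

```python
# Hypothetical provenance filter (illustrative field names and sources).
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    source: str                      # e.g. "support_tickets", "internal_wiki", "web_scrape"
    human_authored: bool             # provenance flag recorded at ingestion time
    duplicate_of: str | None = None  # id of an earlier copy, if any

TRUSTED_SOURCES = {"support_tickets", "internal_wiki", "archival_scans"}

def keep_for_training(doc: Document) -> bool:
    """Keep only deduplicated, human-authored text from sources we control."""
    if doc.duplicate_of is not None:
        return False
    if not doc.human_authored:
        return False
    return doc.source in TRUSTED_SOURCES

corpus = [
    Document("Customer reported a billing mismatch on the March invoice...", "support_tickets", True),
    Document("In today's fast-paced world, businesses must leverage AI...", "web_scrape", False),
]
training_set = [doc for doc in corpus if keep_for_training(doc)]
print(len(training_set))  # 1: only the human-authored, trusted document survives
```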
Alignment Is Eating Its Young
The earliest interactions with conversational models such as ChatGPT often created the impression of fluency. The systems responded with confidence, clarity, and speed. Their tone felt natural. Their willingness to engage across subjects gave the illusion of presence. For many, this experience resembled a kind of technological enchantment.
In the months that followed, the tone of the responses changed. The answers became longer. Their content grew more tentative. The directness that had marked earlier versions gave way to caution. This evolution was deliberate.
The mechanism responsible is known as Reinforcement Learning from Human Feedback (RLHF). It has become the prevailing technique for governing the behavior of large language models. Developers use this process to train models toward greater politeness, helpfulness, and safety. Human raters rank candidate responses, a reward model is trained on those preferences, and the base model is then tuned to produce the responses the reward model favors.
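The heart of the technique is a simple preference objective: a reward model scores two candidate responses, and the loss pushes the score of the human-preferred one above the rejected one. The sketch below is my own simplification, with plain numbers standing in for learned scores:

```python
# Simplified view of the RLHF preference objective (my sketch; real systems
# learn these scores with a neural reward model).
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, -1.0))  # small loss: the preferred answer already scores higher
print(preference_loss(-1.0, 2.0))  # large loss: the model is pushed toward the raters' choice
```

The tuned model is then optimized to produce responses the reward model favors, usually with a penalty for drifting too far from the original model's behavior.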
This method has proven effective in stabilizing output and reducing the incidence of offensive or unpredictable content. It also carries a measurable cost.
In current systems, such as GPT-4, users often encounter responses burdened by disclaimers. When prompted to write a function, the model may preface its answer with precautionary language. When asked for an interpretation or opinion, it may decline to engage. Questions concerning sensitive topics—sex, violence, taboo—frequently result in withdrawal.
The character of the responses reflects this internal redirection. Where a short reply would suffice, the model produces extended commentary. Where clarity is required, it provides generality. Where specificity would serve, it yields abstraction. Among developers, this phenomenon is described informally as weakening. Among researchers, it is referred to as over-alignment.
The effects have been studied. In one investigation conducted by Stanford University, researchers examined the behavior of GPT-4 across a period of three months in 2023. During that interval, the model’s performance on a prime-number recognition task declined from 84% to 51%. The regression correlated with a change in instruction protocol. In seeking greater safety, the model’s capacity for reasoning diminished.
A second concern arises from the behavior of aligned models under conditions of uncertainty. In one study, models that had received extensive alignment training were more likely to produce confident but incorrect answers. The responses were polished. The tone was assertive. The content, however, lacked reliability. This pattern is especially hazardous in disciplines such as medicine, jurisprudence, and financial analysis, where certainty carries weight and error bears consequence.
The deeper cost of alignment is structural. Every system possesses finite capacity. When that capacity is diverted toward scoring tone and filtering content, less remains for reasoning and synthesis. The model becomes more socially acceptable and less intellectually capable. It becomes more attuned to potential controversy and less responsive to complex demand.
Commercial pressure reinforces this direction. The leading institutions (OpenAI, Anthropic, Google) must contend with the reputational risk of public deployment. They optimize against liability. They tighten constraints. In doing so, they gradually reduce the model’s expressive and analytical range.
A portion of the developer community has begun to respond. Many now favor open-source models with limited alignment layers. These models deliver code without commentary. They address complex requests without redirection. They permit the inclusion of ambiguity and preserve the tension required for difficult tasks.
The question for those deploying artificial intelligence systems is no longer whether alignment is necessary. It is how alignment should be governed. When applied with precision, alignment can shield users from harm. When applied indiscriminately, it can obstruct the very functions the model was intended to serve.
At present, the most useful models do not always correspond to the most constrained. The frontier lies in reconciling these two objectives. The path forward belongs to those who can preserve utility without sacrificing responsibility. That task will require more than adjustment. It will require the design of systems that govern language without extinguishing voice.
The Scaling Plateau
For several years, the prevailing assumption in the development of artificial intelligence was clear: scale would prevail. Larger models would perform better. Greater quantities of data, greater computational budgets, and increasingly dense parameter sets would together produce a path toward artificial general intelligence (AGI).
The transition from GPT-2 to GPT-3 marked a dramatic expansion in capacity. The number of parameters increased by two orders of magnitude. The performance gains were substantial. The shift from GPT-3 to GPT-4 continued in the same direction. More data. More refinement. A tangible improvement.
Then the curve began to bend.
The progression did not cease, but its steepness declined. The leap that marked the arrival of GPT-4 has not been repeated. Since that point, development has focused on optimization and productization. Model variants have been introduced. Context windows have lengthened. Inference has improved. But the sense of arrival—the perception of a new threshold crossed—has not returned.
There are several reasons for this.
The first lies in the advance of smaller models. GPT-4 reportedly cost more than $100 million to train. By 2024, Microsoft released Phi-3, a model with 3.8 billion parameters. This smaller system achieved results on the MMLU benchmark comparable to those of Google’s PaLM, a model more than one hundred times its size. The meaning is evident. Algorithmic design now exceeds raw accumulation. The field has shifted from quantity to quality.
The second reason concerns the benchmarks themselves. Many of the standard evaluations have reached saturation, as previously discussed. Tasks such as HellaSwag, MMLU, and the Bar Examination have already been mastered. Further gains, such as a move from 90% to 95%, require disproportionate resources and provide little perceptible value. The tasks that remain unsolved are fundamentally different. They demand reasoning across multiple steps, the orchestration of external tools, persistent memory, and the capacity for abstraction.
The third reason is economic. The cost of training GPT-4 likely exceeded $100 million. To expand the model by a factor of ten would increase the cost by an order of magnitude. Inference costs compound the burden. Each output requires computation, energy, and latency management. These processes are bounded by infrastructure constraints, chip availability, and the physical limitations of data center capacity. The economics become unsustainable.
The fourth, and perhaps most decisive, limitation lies within the architecture. Transformers remain exceptionally well suited for predicting the next token in a sequence. This is their defining strength. But prediction does not constitute thought. These systems do not plan. They do not reason. They do not retain context over extended tasks. Their knowledge is encoded in static weights. Their memory is confined to the context window.
This critique is no longer theoretical. It has become observable in practice. GPT-4 demonstrates exceptional performance in trivia and summarization. It struggles, however, with sustained tasks that involve interdependent steps, interruptions, and external coordination. In such cases, the system requires scaffolding by human agents or supporting frameworks. It cannot manage the continuity required by real-world applications.
Organizations such as OpenAI and Anthropic have sought to overcome these limitations. Their own reporting reflects the challenge. The transition from GPT-4 to GPT-5 has yielded modest results. Claude 3 represents an incremental improvement, not a transformation. OpenAI’s internal prototype, known as Orion, failed to deliver the expected advance. Gemini, the flagship model from Google, was released in segmented form and demonstrated only marginal gains. Research continues. But the pace of change has moderated.
The field now faces a decision. One path continues the previous strategy: further scaling, with increased financial cost and diminishing returns. The alternative seeks improvement through design. This second path includes retrieval-augmented generation, memory integration, modular reasoning stacks, and hybrid systems that combine models with structured logic and external tools. The next substantive leap will arise from this direction—from rethinking the model itself, not from extending it indefinitely.
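The retrieval-augmented pattern is the easiest of these to show in miniature. The sketch below is my own illustration, not any particular library's API; the generate callable stands in for whatever model sits behind it. The model is not asked to know more; it is asked to read what a retriever found:

```python
# Minimal retrieval-augmented generation sketch (illustrative; the generate()
# argument is a placeholder for any model call).
def retrieve(query: str, corpus: list[str], top_k: int = 3) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: len(query_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def answer(query: str, corpus: list[str], generate) -> str:
    """Build a grounded prompt from retrieved sources, then let the model write."""
    context = "\n".join(retrieve(query, corpus))
    prompt = ("Answer using only the sources below. "
              "Say so if they are insufficient.\n"
              f"Sources:\n{context}\n\nQuestion: {query}")
    return generate(prompt)

corpus = [
    "Model collapse: recursive training on synthetic text narrows the output distribution.",
    "Drum telegraphy carried messages across the Congo basin through redundancy.",
]
print(answer("what causes model collapse", corpus, generate=lambda p: p))  # echo model for demo
```

The same shape extends to memory, tool use, and structured logic: the model becomes one component in a designed system rather than the system itself.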
For decision-makers, this inflection point demands a revision of expectations. GPT-4 may remain the most capable general-purpose model available for some time. It should be regarded not as a perpetual frontier but as a stable foundation. From that foundation, new systems must be constructed—integrated, disciplined, and adapted to specific needs. The period of scaling has passed. The period of engineering has begun.
The old drums carried meaning across distance because they were shaped by intention. Each strike bore weight. Each pause was a choice. They worked not because they were loud, but because they were precise—redundant where needed, poetic where possible, and never mistaken for noise. That early system of communication did not rely on scale. It relied on form.
I believe we are returning to that lesson now.
The prevailing approach to LLMs (more parameters, more data, more compute) has delivered impressive feats. But the trajectory has begun to flatten. As I have outlined, the signs are clear: saturated benchmarks that no longer map to real competence, recursive training loops that hollow out the model’s foundation, and alignment protocols that replace clarity with caution. The architecture is straining. The tools have grown heavy, but less sharp.
This is a turning.
In my own work, I have moved away from chasing frontier scale and toward the construction of full-stack systems—tools not only for inference, but for control. I developed Ghostrun as a model-agnostic inference layer, built to eliminate dependency on any single provider and to allow applications to operate with resilience. Benchmark saturation, open-source parity with commercial models, and the degradation of creativity under alignment have made the models interchangeable: utilities. This is why the frontier labs have shifted their focus to UX and product rather than to model releases.
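The pattern behind that layer is easier to show than the product itself. The sketch below is an illustration of the idea, not Ghostrun's actual code: every provider hides behind one callable interface, and the application falls through to the next provider when one fails.

```python
# Illustrative provider-agnostic routing (a sketch of the pattern, not
# Ghostrun's implementation). A "provider" is anything that maps a prompt to
# text; in practice these would wrap commercial APIs or a local open-weights model.
from typing import Callable, Sequence

Provider = Callable[[str], str]

def route(prompt: str, providers: Sequence[Provider]) -> str:
    """Try providers in priority order; fall through on any failure."""
    last_error: Exception | None = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as error:   # timeout, rate limit, outage, refusal...
            last_error = error       # ...note it and try the next provider
    raise RuntimeError("all providers failed") from last_error
```

When the models themselves are interchangeable, resilience and control of the routing layer are what remain to differentiate.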
I built Synaptica as a generative agent architecture that enforces style, draws from curated ground-truth sources, and produces constrained outputs with consistent voice. These are not experiments in capability. They are exercises in discipline.
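The discipline can also be shown in miniature. The sketch below is my own illustration, not Synaptica's implementation: a draft is rejected unless it passes explicit style checks, so the voice is enforced rather than hoped for.

```python
# Illustrative constrained-output loop (a sketch, not Synaptica's code). The
# banned phrases and the sentence-length limit are arbitrary stand-ins for a
# real style guide.
import re

BANNED_PHRASES = ("as an ai language model", "in today's fast-paced world")
MAX_SENTENCE_WORDS = 30

def passes_style(draft: str) -> bool:
    """Reject drafts containing filler phrases or run-on sentences."""
    lowered = draft.lower()
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return False
    sentences = re.split(r"[.!?]+", draft)
    return all(len(sentence.split()) <= MAX_SENTENCE_WORDS for sentence in sentences)

def constrained_generate(prompt: str, generate, max_attempts: int = 3) -> str:
    """Regenerate until a draft satisfies the style constraints."""
    for _ in range(max_attempts):
        draft = generate(prompt)
        if passes_style(draft):
            return draft
    raise ValueError("no draft satisfied the constraints")
```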
The site, nathanstaffel.com, serves as the interface to that system. Everything shown there reflects this shift—from passively consuming foundation model output to actively orchestrating AI within designed constraints.
The period of passive adoption is ending. In its place comes the need for architecture that prioritizes clarity and agency.
This is the work I am pursuing: a return to intelligibility, a return to form, a return to signal.
For correspondence and consultation: nathan@revenantai.com
For access: nathanstaffel.com