
I've yet to be convinced by any article, including this one, that attempts to draw boxes around what coding agents are and aren't good at in a way that is robust on a 6 to 12 month horizon.

I agree that the examples listed here are relatable, and I've seen similar in my uses of various coding harnesses, including, to some degree, ones driven by opus 4.5. But my general experience with using LLMs for development over the last few years has been that:

1. Initially, models could at best assemble simple procedural or compositional sequences of commands or functions to accomplish a basic goal, perhaps meeting tests or type checking, but with no overall coherence,

2. To being able to structure small functions reasonably,

3. To being able to structure large functions reasonably,

4. To being able to structure medium-sized files reasonably,

5. To being able to structure large files, and small multi-file subsystems, somewhat reasonably.

So the idea that they are now falling down at the multi-module or multi-file or multi-microservice level is both not particularly surprising to me and not particularly indicative of future performance. There is a hierarchy of scales at which abstraction can be applied, and it seems plausible to me that the march of capability improvement is a continuous push upwards in the scale at which agents can reasonably abstract code.

Alternatively, it could be that there is a legitimate discontinuity here, at which anything resembling current approaches will max out, but I don't see strong evidence for it here.





It feels like a lot of people keep falling into the trap of thinking we’ve hit a plateau, and that they can shift from “aggressively explore and learn the thing” mode to “teach people solid facts” mode.

A week ago Scott Hanselman went on the Stack Overflow podcast to talk about AI-assisted coding. I generally respect that guy a lot, so I tuned in and… well it was kind of jarring. The dude kept saying things in this really confident and didactic (teacherly) tone that were months out of date.

In particular I recall him making the “You’re absolutely right!” joke and asserting that LLMs are generally very sycophantic, and I was like “Ah, I guess he’s still on Claude Code and hasn’t tried Codex with GPT 5”. I haven’t heard an LLM say anything like that since October, and in general I find GPT 5.x to actually be a huge breakthrough in terms of asserting itself when I’m wrong and not flattering my every decision. But that news (which would probably be really valuable to many people listening) wasn’t mentioned on the podcast I guess because neither of the guys had tried Codex recently.

And I can’t say I blame them: It’s really tough to keep up with all the changes but also spend enough time in one place to learn anything deeply. But I think a lot of people who are used to “playing the teacher role” may need to eat a slice of humble pie and get used to speaking in uncertain terms until such a time as this all starts to slow down.


> in general I find GPT 5.x to actually be a huge breakthrough in terms of asserting itself when I’m wrong

That's just a different bias purposefully baked into GPT-5's engineered personality during post-training. It always tries to contradict the user, including in cases where it's confidently wrong, and keeps justifying the wrong result in a funny manner if pressed or argued with (as in, it would never have made that obvious mistake if it weren't bickering with the user). GPT-5.0 in particular was extremely strongly finetuned to do this. And in longer replies or multi-turn conversations, it falls into a loop of contradictory behavior far too easily. This is no better than sycophancy. LLMs need an order of magnitude better nuance/calibration/training, and that requires human involvement and scales poorly.

Fundamental LLM phenomena (ICL, repetition, serial position biases, consequences of RL-based reasoning, etc.) haven't really changed, and they're worth studying for a layman to build some intuition. However, they vary a lot from model to model due to subtle architectural and training differences, and it's impossible to keep up because there are so many models and so few benchmarks that measure these phenomena.


By the time I switched to GPT 5 we were already on 5.1, so I can't speak to 5.0. All I can say is that if the answer came down to something like "push the bias in the other direction and hope we land in the right spot"... well, I think they landed somewhere pretty good.

Don't get me wrong, I get a little tired of it ending turns with "if you want me to do X, say the word." But usually X is actually a good or at least reasonable suggestion, so I generally forgive it for that.

To your larger point: I get that a lot of this comes down to choices made about fine-tuning and can be easily manipulated. But to me that's fine. I care more about whether the resulting model is useful to me than about how they got there.


I find both are useful.

Claude is my loyal assistant who tries its best to do what I tell it to.

GPT-5 is the egotistical coworker who loves to argue and point out what I'm doing wrong. Sometimes it's right, sometimes it's confidently wrong. It's useful to be told I'm wrong even when I'm not. But I'm not letting it modify my code, it can look but not touch.


> That's just a different bias purposefully baked into GPT-5's engineered personality on post-training.

I want to highlight this realization! Just because a model says something cool doesn't mean it's an emergent behavior or realization; it's more likely the result of post-training.

My recent experience with claude code cli was exactly this.

It was so hyped here and elsewhere I gave it a try and I'd say it's almost arrogant/petulant.

When I pointed out bugs in long sessions, it tried to gaslight me that everything was alright and faked tests to prove its point.


By the time GPT 5.5 landed we were already on 5.1; honestly, they seem to converge on similar limitations around compositional reasoning.

"Still on Claude Code" is a funny statement, given that the industry is agreeing that Anthropic has the lead in software generation while others (OpenAI) are lagging behind or have significant quality issues (Google) in their tooling (not the models). And Anthropic frontier models are generally "You're absolutely right - I apologize. I need to ..." everytime they fuck something up.

Why is it every time anyone has a critique someone has to say “oh but you aren’t using model X, which clearly never has this problem and is far better”?

Yet the data doesn’t show all that much difference between SOTA models. So I have a hard time believing it.


GP here: My problem with a lot of studies and data is that they seem to measure how good LLMs are at a particular task, but often don't account for "how good the LLM is to work with". The latter feels extremely difficult to quantify, but matters a lot when you're having a couple dozen turns of conversation with an LLM over the course of a project.

Like, I think there's definitely value in prompting a dozen LLMs with a detailed description of a CMS you want built with 12 specific features, a unit testing suite and mobile support, and then timing them to see how long they take and grading their results. But that's not how most developers use an LLM in practice.

Until LLMs become reliable one-shot machines, the thing I care most about is how well they augment my problem solving process as I work through a problem with them. I have no earthly idea of how to measure that, and I'm highly skeptical of anyone who claims they do. In the absence of empirical evidence we have to fall back on intuition.


A friend recommended having a D&D-style roleplay with a few different engines, to see which you vibe with. I thought this sounded crazy, but I took their advice.

I found this worked surprisingly well. I was certain Claude was best, while they liked Grok and someone else liked ChatGPT. Some AIs just end up fitting best with how you like to chat, I think. I definitely also find Claude best for coding.


Because they are getting better. They're still far from perfect/AGI/ASI, but when was the last time you saw the word "delve"? The models are clearly changing; the question is why the data doesn't show that they're actually better.

Thing is, everyone knows the benchmarks are being gamed. Exactly how is beside the point. In practice, anecdotally, Opus 4.5 is noticeably better than 4, and GPT 5.2 has also noticeably improved. So maybe the real question is why you believe this data when it seems at odds with observations by humans in the field.

> Jeff Bezos: When the data and the anecdotes disagree, the anecdotes are usually right.

https://articles.data.blog/2024/03/30/jeff-bezos-when-the-da...


"They don't use delve anymore" is not really a testament that they became better.

Most of what I can do now with them I could do half a year to a year ago. And all the mistakes and fail loops are still there, across all models.

What changed is the number of magical incantations we throw at these models in the form of "skills" and "plugins" and "tools" hoping that this will solve the issue at hand before the context window overflows.


"They dont say X as often anymore" is just a distraction, it has nothing to do with actual capability of the model.

Unfortunately, I think the overlap between actual model improvements and what people perceive as "better" is quite small. Combine this with the fact that most people desperately want to have a strong opinion even though the factual basis is very weak... "But I can SEE it is X now".


The type of person who outsources their thinking to the news stories in their social media feed, and isn't intellectually curious enough to explore the models deeply enough for them to display their increase in strength, isn't going to be able to tell this themselves.

I would think this also correlates with the type of person who hasn't done enough data analysis themselves to understand all the lies and misleading half-truths "data" often tells. Conversely, experience with data inoculates one to some degree against a bullshitting LLM, so it is probably easier to get value from the model.

I would imagine there are all kinds of factors like this that multiply, so some people are really having vastly different experiences with the models than others.


Because the answer to the question, “Does this model work for my use case?” is subjective.

People desperately want 'the plateau' to be true because it would mean our jobs are safe and we could call ourselves experts again. If the ground keeps moving, then no one is truly an expert. There is just not enough time to achieve expertise when the paradigm shifts every six months.

That statement is only true if you're ignoring higher order patterns. I called the orchestration trend and the analytic hurdle trends back in April of last year.

Claude is still just like that once you're deep enough in the valley of the conversation. Not exactly that phrase, but things like "that's the smoking gun" and so on. Nothing has changed.

> Claude is still just like that once you’re deep enough in the valley of the conversation

My experience is that Claude (but probably other models as well) does indeed resort to all sorts of hacks once the conversation has gone on for too long.

Not sure if it's an emergent behavior or something done in later stages of training to prevent it from wasting too many tokens when things are clearly not going well.


> I haven’t heard an LLM say anything like that since October, and in general I find GPT 5.x

It said precisely that to me 3 or 4 days ago when I questioned its labelling of algebraic terms (even though it was actually correct).


I don't see a reason to think we're not going to hit a plateau sooner or later (and probably sooner). You can't scale your way out of hallucinations, and you can't keep raising tens of billions to train these things without investors wanting a return. Once you use up the entire internet's worth of Stack Overflow responses and public GitHub repositories, you run into the fact that these things aren't good at doing things outside their training dataset.

Long story short, predicting perpetual growth is also a trap.


> You can't scale your way out of hallucinations

You can only scale your way out in verifiable domains, like code, math, optimization, games, and simulations. In all the other domains, the AI developers still get billions (trillions) of tokens daily, which are validated by follow-up messages, minutes or even days later. If you can study things longitudinally, you can get feedback signals, such as when people apply the LLM's idea in practice and come back to iterate later.


> Once you use up the entire internets worth of stack overflow responses and public github repositories you run into the fact that these things aren't good at doing things outside their training dataset.

I think the models reached that human training data limitation a few generations ago, yet they still clearly improve via various other techniques.


On balance, there’s far more evidence to support the conclusion that language models have reached a plateau.

I'm not sure I agree. It doesn't feel like we're getting superlinear growth year over year, but Claude Opus 4.5 is able to do useful work over meaningful timescales without supervision. Is the code perfect? No, but that was certainly not true of model generations a year or two ago.

To me this seems like a classic LLM defense.

A doesn't work. You must use frontier model 4.

A works on 4, but B doesn't work on 4. You're doing it wrong; you must use frontier model 5.

Ok, now I use 5: A and B work, but C doesn't. Fool, you must use frontier model 6.

Ok, I'm on 6, but now A doesn't work as well as it did on 4. Only fools are still trying to do A.


Opus 4.5 seems to be better than GPT 5.2 or 5.2 Codex at using tools and working for long stretches on complex tasks.

I agree with a lot of what you've said, but I completely disagree that LLMs are no longer sycophantic. GPT-5 is definitely still very sycophantic; 'You're absolutely right!' still happens, etc. It's true it happens far less in a pure coding context (Claude Code / Codex), but I suspect that's only because of the system prompts, and those tools are by far in the minority of LLM usage.

I think it's enlightening to open up ChatGPT on the web with no custom instructions and just send a regular request and see the way it responds.


I used to get made up APIs in functions, now I get them in modules. I used to get confidently incorrect assertions in files now I get them across codebases.

Hell, I get poorly defined APIs across files and still get them between functions. LLMs aren't good at writing well-defined APIs at any level of the stack. They can attempt it at levels of the stack they couldn't a year ago, but they're still terrible at it unless the problem is well known enough that they can regurgitate well-reviewed code.


I still get made-up Python types all the time with Gemini. Really quite distracting when your codebase is massive and triggers a type error, and Gemini says

"To solve it you just need to use WrongType[ThisCannotBeUsedHere[Object]]"

and then I spend 15 minutes running in circles, because everything from there on is just a downward spiral, until I shut off the AI noise and just read the docs.


Gemini unfortunately sucks at calling tools, including ‘read the docs’ tool… it’s a great model otherwise. I’m sure Hassabis’ team is on it since it’s how the model can ground itself in non-coding contexts, too.

Yeah I've been trying Claude Code for a week (mostly Opus) and in a C++ Juce project it kept hallucinating functions for a simple task ("retrieve DAW track name if available") and actually never got it right.

It also failed a lot to modify a simple Caddyfile.

On the other hand it sometimes blows me away and offers to correct mistakes I coded myself. It's really good on web code I guess as that must be the most public code available (Vue3 and elixir in my case).


This is the right answer. Unless there is some equivalent of it on the open internet which their search engine can find you should not expect a good outcome.

"good outcome" is pretty subjective, I do get useful productivity gains from some LLM work, but the issues are the same as they always have been.

That's probably b/c you know how to write code & have enough of an understanding about the fundamentals to know when the LLM is bullshitting or when it is actually on the right track.

LLMs have been bad at creating abstraction boundaries since inception. People have been calling it out since inception. (Heck, even I have a Twitter post somewhere >12 months old calling that out, and I'm not exactly a leading light of the effort.)

It is in no way size-related. The technology cannot create new concepts/abstractions, and so fails at abstraction. Reliably.


> The technology cannot create new concepts/abstractions, and so fails at abstraction. Reliably.

That statement is way too strong, as it implies either that humans cannot create new concepts/abstractions, or that magic exists.


I think both your statement and their statement are too strong. There is no reason to think LLMs can do everything a human can do, which seems to be your implication. On the other hand, the technology is still improving, so maybe it’ll get there.

My take is that:

1) LLMs cannot do everything humans can, but

2) There's no fundamental reason preventing some future technology to do everything humans can, and

3) LLMs are explicitly designed and trained to mimic human capabilities in fully general sense.

Point 2) is the "or else magic exists" bit; point 3) says you need a more specific reason to justify the assertion that LLMs can't create new concepts/abstractions, given that they're trained precisely in order to achieve that.

Note: I read OP as saying they fundamentally can't and thus never will. If they meant just that the current breed can't, I'm not going to dispute it.


> 3) LLMs are explicitly designed and trained to mimic human capabilities in fully general sense.

This is wrong: LLMs are trained to mimic human writing, not human capabilities. Writing is just the end result, not the inner workings of a human; most of what we do happens before we write anything down.

You could argue that writing captures everything about humans, but that is another belief you have to add to your takes: first that LLMs are explicitly designed to mimic human writing, and then that human writing captures human capabilities in a fully general sense.


It's more than that. The overall goal function in LLM training is judging predicted text continuations by whether they look ok to humans, in the fully general sense of that statement. This naturally captures all human capabilities that are observable through textual (and now multimodal) communication, including creating new abstractions and concepts, as well as thinking, reasoning, even feeling.

Whether or not they're good at it or have anything comparable to our internal cognitive processes is a different, broader topic - but the goal function on the outside, applying tremendous optimization pressure to a big bag of floats, is both beautifully simple and unexpectedly powerful.


Humans are trained on the real world. With real world sensors and the ability to act on their world. A baby starts with training hearing, touching (lots of that), smelling, tasting, etc. Abstract stuff comes waaayyyyy later.

LLMs are trained on our intercepted communication - and even then only the formal part that uses words.

When a human forms sentences, it is from a deep model of the real world. Okay, people are also capable of talking about things they don't actually know and have only read about, in which case they have a superficial understanding and unwarranted confidence similar to AI...


All true, but note I didn't make any claims on internal mechanics of LLMs here - only on the observable, external ones, and the nature of the training process.

Do consider however that even the "formal part that uses words" of human communication, i.e. language, is strongly correlated with our experience of the real world. Things people write aren't arbitrary. Languages aren't arbitrary. The words we use, their structure, similarities across languages and topics, turns of phrases, the things we say and the things we don't say, even the greatest lies, they all carry information about the world we live in. It's not unreasonable to expect the training process as broad and intense as with LLMs to pick up on that.

I said nothing about internals earlier, but I'll say it now: LLMs do actually form a "deep model of the real world", at least in terms of concepts and abstractions. That was already empirically demonstrated ~2 years ago; there's e.g. research by Anthropic where they literally find distinct concepts within the neural network, observe their relationships, and even suppress and amplify them on demand. So that ship has already sailed; it's surprising to see people still think LLMs don't do concepts or don't have internal world models.


> but note I didn't make any claims on internal mechanics of LLMs here

Great - neither did I!

Not a single word about any internals anywhere in sight in my comment!!


Most humans can't. Some humans do by process of hallucination.

That’s a straw man argument if I’ve ever seen one. He was talking about technology. Not humans.

I believe his argument is that now that you've defined the limitation, it's a ceiling that will likely be cracked in the relatively near future.

Well, hallucinations have been identified as an issue since the inception of LLMs, so this doesn’t appear true.

Hallucinations are more or less a solved problem for me ever since I made a simple harness to have Codex/Claude check its work by using static typechecking.
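
A minimal sketch of what such a loop can look like (agent.write_code / agent.fix_code are hypothetical stand-ins for however you invoke Codex/Claude, and it assumes mypy is installed):

  import subprocess

  def typecheck(path: str) -> str:
      # Run mypy on the generated file; return "" if clean, else the error output.
      result = subprocess.run(
          ["mypy", "--strict", path],
          capture_output=True, text=True,
      )
      return "" if result.returncode == 0 else result.stdout

  def generate_with_check(agent, task: str, path: str, max_rounds: int = 3) -> bool:
      agent.write_code(task, path)
      for _ in range(max_rounds):
          errors = typecheck(path)
          if not errors:
              return True               # typechecker is happy, accept the change
          agent.fix_code(path, errors)  # feed the errors back verbatim
      return False                      # still failing after N rounds, hand it to a human

The same idea works with tsc, cargo check, etc. It doesn't remove hallucinations; it just catches the type-shaped ones automatically.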

But there aren’t very many domains where this type of verification is even possible.

Then you apply LLMs in domains where things can be checked

Indeed, I expect to see a huge push into formally verified software, just because sound mathematical proofs provide an excellent verifier to put into an LLM harness. Just see how successful Aristotle has been at math; it could be applied to coding too.

Maybe Lean will become the new Python

https://harmonic.fun/news#blog-post-verina-bench-sota
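
To make the verifier-in-the-loop point concrete, here's a toy Lean 4 example (mine, not from the linked post): the kernel either accepts the proof or rejects it, and that binary signal is exactly what a harness can iterate against.

  -- A model-proposed lemma: it only gets accepted if the proof actually checks.
  theorem add_comm_example (a b : Nat) : a + b = b + a := by
    exact Nat.add_comm a b

  -- A wrong attempt (say, `exact rfl`) is rejected by the kernel, giving the
  -- harness an unambiguous failure message to feed back to the model.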


  "LLMs reliably fail at abstraction."
  "This limitation will go away soon."
  "Hallucinations haven't."
  "I found a workaround for that."
  "That doesn't work for most things."
  "Then don't use LLMs for most things."

Um, yes? Except ‘most things’ are not much at all by volume.

I mean, hallucinations are 95% better now than the first time I heard the term and experienced them in this context. To claim otherwise is simply shifting goalposts. No one is saying it's perfect or will be perfect, just that there has been steady progression and likely will continue to be for the foreseeable future.

There's only one way to implement a mission, an algorithm, a task. But there's an infinity of paths: inconsistent, fuzzy, and always subjective ways to live. That's our lives, and that's the code LLMs are trained on. I don't think it will ever change much, and I hope it won't.

I feel like the main challenge is where to be "loose" and where to be "strict"; Claude takes too much liberty too often: assuming things, adding mock data to make it work, using local storage because there is no db. This makes it work well out of the box and means I can prompt half-assed and get great results, but it also causes issues long term. It can be prompted away, but it needs constant reminders. This seems like a hard problem to solve. I feel like it can already almost do everything if you have the correct vision/structure in mind and the patience to prompt properly.

Its worst feature is debugging hard errors; it will just keep trying everything and can get pretty wild instead of entering plan mode and really discussing and thinking things through.


Claude is an overrated, premium piece of developer tech. I have produced equally good results with Gemini and way better ones with GPT medium, and GPT medium is a really good model at assembling and debugging stuff compared to Claude. Claude hallucinates when asked why something is correct or should be done. All models fail equally in some aspect or another, which points to the fact that these models have strengths and weaknesses, and GPT just happens to be a good overall model. But the dev community is stuck on Claude for no good reason other than shiny tooling ("Claude Code"); beyond that, the models can be just as bad as the competition. The benchmarks do not tell the full story. In general, though, the rule of thumb is: if the model says "You are brilliant", "That's genius", or "Now that's a deep and insightful question"... it's time to start a new session.

The article is mostly reporting on the present. (Note the "yet" in the title.)

There's only one sentence where it handwaves about the future. I do think that line should have been cut.



