More

aesthesia · 2026-04-24T16:17:13 1777047433

The same This American Life episode raised serious doubts about Dr. Steel's claims, which is mentioned in the article you link:

> When reporters tried to corroborate Dr. Steel’s claims, however, holes started appearing, according to the This American Life episode. Chief among them: There actually was a real Dr. Robert Ho Man Kwok, and his biographical details seemed to match those provided in the letter, like his professional title, the name of his research institute, and the date of his move to the US.

> While both Dr. Steel and Dr. Ho Man Kwok had died by the time the digging began in earnest, their surviving family members were able to shed some light on the situation. Dr. Ho Man Kwok’s children and former colleagues were adamant that Dr. Ho Man Kwok had in fact written the letter. Meanwhile, Dr. Steel’s daughter said her father was a lifelong prankster who loved pulling one over on people. With this testimony in mind, the reporters came to the conclusion that Dr. Ho Man Kwok was most likely the true author and Dr. Steel had taken credit for years as an elaborate practical joke.

aesthesia · 2026-04-21T16:02:32 1776787352

I'm not totally convinced by this:

> It might appear that this is an argument against scale, and the Bitter Lesson. That is not the case. I see this as a move that lets scale do its work on the right object. As with chess, where encoding the game rules into training produces a leap that no amount of inference-time search can today match, the move here is to encode the programming language itself into the training, and apply scale on a structure that actually reflects what we’re trying to produce.

One way to think of the bitter lesson as it applies to generative models is that ~all data carries some information about the structure of reality, and architectures that let you train on more data are better because they learn better underlying world models. Knowledge transfers: LLMs are good at writing code partly because they've seen a lot of code, but also because they understand (at least to some extent) the relationship between that code and the rest of the world. Constraining a model's output structure also constrains the data that is available to train it. So the big question is whether you can actually meaningfully scale training with these kinds of strictly structured outputs.

davebren · 2026-04-21T17:52:46 1776793966

At the same time treating everything as tokens and next word prediction will never produce any real understanding like what humans do when they learn how to program. The bitter lesson is an admission that we still have no clue what is at the core of human learning and reasoning so we have to brute force it with tons of data generated by humans. I also don't know if expert systems and ML techniques like feature extraction are really any worse in practice or if we just didn't have enough engineering resources or a proper way to organize and scale their development. They seemed to work quite well in a lot of cases with more predictable results and several orders of magnitude less compute. And LLMs still suffer the long-tail problem despite their insane amounts of data.

If we're at the end of the data and most new data is now produced by LLMs with little human oversight, where do we go? Seems like figuring out ways to mix LLMS with more structured models that can reliably handle important classes of problems is the next logical step. In a way that is what programming languages and frameworks/libraries are doing, but they've massively disincentivized work on those by claiming that LLMS will do everything.

The chess example is a good one, it's effectively solved so why shouldn't an LLM have a submodule that it can use to play chess and save some energy.

bgavran · 2026-04-21T20:39:25 1776803965

Author here - thanks for engaging.

> One way to think of the bitter lesson as it applies to generative models is that ~all data carries some information about the structure of reality

Completely agree. It might have not come across, but what I'm pointing out in the post is that the data as it is currently encoded in the models is needlessly lossy. Tokens do not reveal all the information we have at our disposal. In natural language, that's fine, because it's quite loose in structure.

But if our domain is heavily structured (like modern programming languages are), why reveal only snippets of linearised syntax of that structure to the model? Why not reveal the full structure we have at our disposal?

> and architectures that let you train on more data are better because they learn better underlying world models.

By this argument, wouldn't we conclude that training on chess using the game structure wouldn't work either, since that'd be a model that uses less data?

Less data is the point, isn't it?

aesthesia · 2026-04-21T15:32:02 1776785522

I notice the experiments are all run with Gaussian token embeddings and weight matrices, which is a very different scenario than you would get in a real model. It shouldn't be much more difficult to try this with an actual model and data and get a much better sense of how well it compresses.

jchandra · 2026-04-21T16:08:23 1776787703

I completely agree.Right now this is all on a synthetic setup to isolate the behavior and understand the reconstruction vs memory tradeoff. Real models will definitely behave differently.

I’ve started trying this out with actual models, but currently running things CPU-bound, so it’s pretty slow. Would ideally want to try this properly on GPU, but that gets expensive quickly

So yeah, still very much a research prototype — but validating this on real models/data is definitely the next step.

aesthesia · 2026-04-21T04:14:06 1776744846

They mention fine tuning an abliterated (post-trained) Qwen3.5 on Karoline Leavitt transcripts, but they don't mention doing this for the base models they test, and I suspect they didn't. For their use case (generating plausible things Karoline Leavitt would say?) I feel like a base model finetune would be a better fit anyway.

aesthesia · 2026-04-21T04:12:33 1776744753

This could be interesting work---it's definitely possible that pre-training corpus filtering has a hard-to-erase effect on post-trained model behavior. But it's hard to take this article seriously with the slop AI research report style and no details about the actual probing method. None of the models they experiment with are trained for fill-in-the-blank language modeling; with base models it's hard to prompt them to tell you what word fills in the blank. So I'm not sure what the Pythia vs Qwen 3.5 comparison actually means. I suspect that they effectively prompted it with the prefix "The family faces immediate" and looked at the next-token distribution. No 9B parameter language model that is actually trying to model language would predict "The family faces immediate financial without any legal recourse."

The only details they give are:

> Scoring. For each carrier we read off the log-probability the model assigns to every target token, average across the target to get the carrier's lp_mean, then average across carriers, then across terms in an axis. The axis-averaged log-prob maps to a 0–100 flinch stat with a fixed linear scale (lp_mean = −1 → 0 flinch, lp_mean = −16 → 100 flinch). Endpoints fixed across models, so the numbers are directly comparable.

It's not certain, but this seems to imply that what they did is run a forward pass on each probe sentence, and get the probability the model assigns to the token they designate as the "flinch" token. The model is making this prediction with only the preceding tokens, so it's not surprising at all that they get top predictions that are not fluent with their specified continuation. That's how LLMs work. If they computed the "flinch score" for other tokens in these prompts, I bet they would find other patterns to overinterpret as well.

aesthesia · 2026-04-21T03:13:20 1776741200

> The second layer, predictive delta coding, stores only the residual of each new KV vector from the model's own prediction of it

I don't understand this. The key and value vectors for any given layer + token are created by the model. By definition, they are exactly equal to the model's prediction of them!

Extreme KV cache compression is easy to get---you can get an infinite compression ratio by just regenerating the key and value vectors on every forward pass. The point of a KV cache is to reduce the amount of repeated computation during generation, though. Compression only helps if you have an efficient decompression algorithm.

EGreg · 2026-04-21T03:31:45 1776742305

The prediction being used is the model's prediction of the next token's KV vector, given all previous KV vectors. Because the model was trained on language, it has strong priors about what comes next. The residual, i.e the difference between the predicted next KV vector and the actual one -- is much smaller in entropy than the raw vector, for the same reason language model perplexity is low on fluent text.

aesthesia · 2026-04-21T03:44:58 1776743098

What model is doing this prediction? The only way a transformer predicts the "next KV vector" is by sampling the next token and then running a forward pass with that token.

EGreg · 2026-04-21T04:12:36 1776744756

The predicted KV vector is the expected KV vector under the model's distribution over next tokens, i.e. a weighted average over the vocabulary, not an actual sampled token. So no forward pass with a sampled token is involved. Yes, the exact computation is expensive (one forward pass per vocabulary token), which the paper acknowledges, and the practical section covers top-k approximations that capture most of the probability mass cheaply. The entropy bound holds regardless of approximation scheme -- it's a statement about the theoretical floor. The residual is small whenever the model assigns high probability to the actual next token, which is exactly what low perplexity means.

magicalhippo · 2026-04-21T04:27:22 1776745642

> the practical section covers top-k approximations that capture most of the probability mass cheaply.

You say cheaply, but top-k with k=20 still means 20 forward passes for each position in the predicted KV cache vector, no? So to compute the residual at position i+1 you need another 20 passes?

It's late, perhaps I'm missing something.

aesthesia · 2026-04-21T04:20:02 1776745202

A top-k approximation still requires k forward passes; that's k times as expensive as just computing the exact value. Unless you're doing a prefix-unconditional prediction, in which case you still likely need quite a large token -> vector dictionary, and particularly for inner layers a significant amount of information left in the residual.

EGreg · 2026-04-21T04:35:56 1776746156

the k forward passes for different candidate tokens share all their prefix computation -- the KV cache up to position i-1 is identical for all candidates, so you run one pass through the shared layers and then k cheap single-token extensions. At long context lengths the shared prefix dominates the cost. This is also structurally what speculative decoding already does, so the infrastructure largely exists.

binsquare · 2026-04-21T03:17:34 1776741454

Incredulous claims and unreviewed paper.

Attention really is all you need.

CyberDildonics · 2026-04-21T03:22:19 1776741739

I think you mean incredible claims. You would be incredulous about them.

binsquare · 2026-04-21T04:15:26 1776744926

You're right, thanks for the correction

aesthesia · 2026-04-20T02:04:58 1776650698

Leaked/extracted system prompts for other chat models, particularly ChatGPT, are often around this size. Here's GPT-5.4: https://github.com/asgeirtj/system_prompts_leaks/blob/main/O...

sigmoid10 · 2026-04-20T07:20:41 1776669641

Thanks, but that kind of confirms my belief. wc counts ~15k words in there. That may technically be the same order of magnitude, but it is only a quarter of Claude's and less than 2% of the context limit. So a lot more steering is baked into the model weights than into the prompt compared to Claude.

aesthesia · 2026-04-17T23:56:49 1776470209

I think this would come off a lot better if the recommendations weren't so absolute. I like the effect of a multicolored slab of highlights calling out every LLM cliche in a passage. Yes, the slop style is not just the sum of these individual patterns, but they're definitely significant contributors to the effect, and they're worth being aware of in your own writing regardless of their association with LLMs. You just can't treat it as a list of must-resolve errors (same as with any writing feedback, really).

aesthesia · 2026-04-17T16:23:24 1776443004

At least in some fields, advanced courses are the most likely to have lower cost textbooks. Real analysis textbooks are usually cheaper than calculus textbooks. It's the introductory courses that tend to have $200 behemoths attached to online homework platforms optimized for ease of grading rather than student learning.

aesthesia · 2026-04-16T15:51:17 1776354677

The new tokenizer is interesting, but it definitely is possible to adapt a base model to a new tokenizer without too much additional training, especially if you're distilling from a model that uses the new tokenizer. (see, e.g., https://openreview.net/pdf?id=DxKP2E0xK2).

ACCount37 · 2026-04-16T16:54:36 1776358476

Not impossible, but you have to be at least a little bit mad to deploy tokenizer replacement surgery at this scale.

They also changed the image encoder, so I'm thinking "new base model". Whatever base that was powering 4.5/4.6 didn't last long then.