But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submission"
0.191 * 0.594 ≈ 0.113 > 0.064 = 1 - 0.936
Does this mean that the audited subset wasn't representative? Or that Anthropic is getting high answers through some shady means?
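To spell out the arithmetic, a back-of-the-envelope sketch in Python (I'm assuming 93.6% is the headline pass rate being questioned; the other two figures are from the quote above):

    # Back-of-the-envelope check of the audit numbers.
    # Assumes 93.6% is the headline pass rate under discussion.
    failed_subset = 0.191  # fraction of all problems models often failed
    flawed_rate = 0.594    # fraction of the audited subset with flawed tests

    lower_bound_flawed = failed_subset * flawed_rate  # ~0.113 of all problems
    reported_failures = 1 - 0.936                     # 0.064 of all problems

    print(lower_bound_flawed > reported_failures)  # True: hence the question above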
I suggest reading the Mythos report's discussion on SWE-bench and contamination. I think it's fairly convincing that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.
Cost is like 90-99% of what matters. Last year, China installed 300GW of new renewables and 0GW of geothermal, despite geothermal being "an inexhaustible 24/7 production capable".
Geothermal will compete with solar if they can get the cost low enough. I hope they succeed!
> Each time there's a new model release a few more get solved.
I'm no expert, but based on the commentary from mathematicians, this Erdős proof is a unique milestone because the problem had received serious prior attention from multiple professional mathematicians, and the proof was surprising, elegant, and revealed some new connections.
The previous ChatGPT Erdős proofs have been qualitatively less impressive, more akin to literature search or solving easier problems that have been neglected.
Reading the prompt[1], one wonders if stoking the model to be unconventional is part of the success: "this ... may require non-trivial, creative and novel elements"
>one wonders if stoking the model to be unconventional is part of the success
I've long suspected that a lot of these models' real capabilities are still locked behind certain prompts, despite the big labs spending tons of effort on making default responses to simple prompts better. Even really dumb shit like "Answer this: ..." vs "Question: ..." vs "... you'll be judged by <competitor>", which should have zero impact in an ideal world, can significantly impact benchmark results. The problem is that you can waste a ton of time hunting for the right prompt with these "dumb" approaches, when in many day-to-day situations the model actually just needed some very specific context that was obvious to you but not to it. My go-to method is still to have the model ask me questions as the very first step on any of these problems. They kind of tried that with deep research since the early o-series, but it still needs improvement.
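As a toy illustration of how you might measure that sensitivity (a sketch only; ask_model is a hypothetical stand-in for whatever chat API you use, and check is your grader):

    # Toy prompt-sensitivity check: same questions, different framings.
    # ask_model() and check() are hypothetical stand-ins, not a real API.
    FRAMINGS = [
        "Answer this: {q}",
        "Question: {q}",
        "{q}\nYour answer will be judged against a competitor's.",
    ]

    def pass_rate(questions, framing, ask_model, check):
        hits = sum(check(q, ask_model(framing.format(q=q))) for q in questions)
        return hits / len(questions)

    # for f in FRAMINGS:
    #     print(repr(f), pass_rate(benchmark_questions, f, ask_model, check))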
Just the right "prompt" is exactly what happened here: Lean has been developed and incorporated into its data set. Also, token outputs only loosely correspond to "human language"; it's been shown that transformers develop their own internal representations, which has spawned a whole field called mechanistic interpretability. Being able to "parse" more correctly, i.e. using Lean and the right prompts, insights, and suggestions, will take on a whole new meaning in the future.
Awesome term/info, and (completely orthogonal to whether they’ll take err jerbs): I’m really excited about the social/civic picture that might be enabled by a defined and verifiable ontological and taxonomical foundation shared across humanity, particularly coupled with potential ‘legislation as code’ or ‘legal system as code’ solutions.
I’m thinking on a time horizon a bit past my own lifespan, but: even the possibility of objectively mapping out some specific aspect of a regional approach to social rights in a given time period and comparing it against another social framework, alongside automated & verifiable execution of policy, irrespective of the language of origin, is incredible.
Instead of hundreds and thousands of incommensurate legislative silos we might create a bazaar of shared improvement and governance efficiency. Turnkey mature governance and anti-corruption measures for newborn nations and countries trying to break out of vicious historical exploitation cycles. Fingers crossed.
Do you think the root cause of social/civic failures has been an inadequate policy repository and lack of a map between policy representations? If so, I have a bridge in Alaska for you to encode into your representation scheme.
I consider the scene with Dr. Chandra and SAL 9000 to be a fairly realistic predictive description of how experts interact with LLMs. SAL even has a somewhat obsequious personality.
Model output reflects your input, and the effect is self-reinforcing over the course of a whole conversation. The color you add around a problem influences the model's behavior.
A "dumber"/vague framing will get a less insightful solution, or possibly no solution at all.
I don't even necessarily think this is a critical flaw - in general it's just the model tuning its responses to your style of prompt. People utilize LLMs for all kinds of different tasks, and the "modes of thought" for responding to an Erdős problem versus software engineering versus a more human/soft-skills topic are all very different. I think the "prompt sensitivity" issue just comes bundled with this general behavior.
Keeping a pristine context is so important that I use two separate conversations whenever I'm doing something meaningful. One is the main task executor, and the other is for bouncing random problems, thoughts, and ideas off of, so the executor instance's context stays pristine.
It's sort of an agentic loop where I am one of the agents
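Roughly like this, if you squint (a sketch; Conversation is a hypothetical wrapper over a chat API, not any particular library):

    # Sketch of the two-conversation workflow: one pristine executor,
    # one messy scratchpad. Conversation is a hypothetical wrapper that
    # just keeps its own message history; only distilled results cross over.
    class Conversation:
        def __init__(self, system):
            self.messages = [{"role": "system", "content": system}]

        def send(self, text, ask_model):
            self.messages.append({"role": "user", "content": text})
            reply = ask_model(self.messages)  # hypothetical API call
            self.messages.append({"role": "assistant", "content": reply})
            return reply

    executor = Conversation("You execute the main task. Stay focused.")
    scratchpad = Conversation("Brainstorm freely; nothing here is binding.")

    # idea = scratchpad.send("Weigh approach X against Y.", ask_model)
    # executor.send(f"Use this approach: {idea}", ask_model)  # distilled result only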
They're tuned to target a certain customer demographic solving certain problems. I've seen standard AI models do absolutely brilliant things sometimes. But the prompts needed to get them to perform like they did in the GPT-3 days seem to get lengthier and lengthier over time. At some point we'll probably just snip out smaller, specialized models to do certain things.
Yes, it's extremely awkward! Why is a model that can solve problems in scientific literature the same model that can generate random code, write poems in pirate speech, and do all sorts of other random tasks?
It feels like there is a lot of untapped power for specialized LLM tasks if they were created for specialists instead of the general populace prompting from a smartphone.
> “The raw output of ChatGPT’s proof was actually quite poor. So it required an expert to kind of sift through and actually understand what it was trying to say,” Lichtman says. But now he and Tao have shortened the proof so that it better distills the LLM’s key insight.
Interestingly, it was an elegant technique, but the proof still required a lot of work.
- It's not gradient boosting per se that's good on tabular data, it's trees. Other fitting methods with trees as the model are also usually superior to NNs on tabular data.
- Trees are better on tabular data because they encode a useful inductive bias that NNs currently do not. Just like CNNs or ViTs are better on images because they encode spatial locality as an inductive bias.
Absolutely agree on both counts. Gradient boosting is the most commonly known and most successful variant, but it's the decision tree structure that is the underlying architecture there. Decision trees don't have the same "implicit training bias" phenomenon that neural networks have though, so all of this is just model bias in the classical statistical sense.
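If you want to see the gap for yourself, here's a quick scikit-learn sketch (results will vary by dataset; the MLP gets no feature scaling on purpose, since scale-invariance is part of the tree inductive bias being described):

    # Sketch: tree ensemble vs. plain NN on a small tabular dataset.
    # Trees are invariant to feature scaling and monotone transforms,
    # which is part of the inductive bias discussed above.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    gbt = HistGradientBoostingClassifier(random_state=0)
    mlp = MLPClassifier(max_iter=2000, random_state=0)  # unscaled features on purpose

    print("trees:", cross_val_score(gbt, X, y, cv=5).mean())
    print("mlp:  ", cross_val_score(mlp, X, y, cv=5).mean())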
> They are just ordinary gambling unless you allow insider trading and manipulation, because that’s the only way the market can acquire and represent novel useful information.
Representing only public information without an agenda is useful in itself. Words are cheap, and which words you get to see and which you don't is determined by some non-truth incentive. Prediction markets say "you get to make money if you know what the truth actually is". Media says "you get to make money if you entertain people".
It's unfortunate that there are also significant negative side effects to financialized prediction markets. I'm more favorable to non-financial prediction markets like Manifold, which say "you get to have social status if you know what the truth is". That seems like the right balance, although you can see how such non-financial markets could be more easily defeated by dedicated non-truth actors if they became prominent in the public conversation.
Yes, sort of. Generally you can measure the pass rate on a benchmark given a fixed compute budget. A sufficiently smart model can hit a high pass rate with fewer tokens/compute. Check out the cost efficiency on https://artificialanalysis.ai/ (saw this posted here the other day, pretty neat charts!)
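In pseudo-ish Python, the measurement looks something like this (solve is a hypothetical call returning whether the model got it right and how many tokens it spent):

    # Toy sketch: pass rate under a fixed per-problem token budget.
    # solve() is a hypothetical call returning (is_correct, tokens_used).
    def pass_rate_at_budget(problems, solve, budget_tokens):
        passed = 0
        for p in problems:
            correct, used = solve(p, max_tokens=budget_tokens)
            if correct and used <= budget_tokens:
                passed += 1
        return passed / len(problems)

    # A "smarter" model is one whose pass rate climbs faster as
    # budget_tokens grows, i.e. it needs fewer tokens per solve.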
It's much easier to measure a language model's intelligence than a human's because you can take as many samples as you want without affecting its knowledge. And we do measure human intelligence.
That's a stale insight from an old era of warfare. The purpose of quality is to remove quantity. Iran is the case study. A large stockpile of munitions counts for something, but once the factories are gone, you're on a 3-month clock. Deleting factories can only be achieved with quality (expensive stand-off munitions + F-35s for SEAD, then missile trucks with cheap JDAMs to take out the factories).
30-50 years ago you just couldn't do this kind of warfare; the technology and intelligence didn't exist. Now you can. People haven't updated on this paradigm shift.
People are over-learning the wrong lessons from Ukraine. That is a unique war with air parity. That's why the Ukraine war is shaped the way it is. Not because this is how wars ought to be fought.
This is not to discount quantity. But you can't have only quantity unless you want to fight an attritional war for 10 years (or worse, lose your own industrial production to an enemy that achieves air superiority over your skies because they had the foresight to invest in quality).
The Iran war isn't over yet. Plenty of time for it to become attritional, especially if the people who want Big Gaza / "mowing the nuclear lawn" to become the status quo are in charge. After all, Afghanistan was a quick victory, 20 years of attrition, and eventual exit.
Without factories? I doubt it. I'm not saying the US is going to win (in the sense of achieving objectives), but it's not going to be an attrition war like in WW2 or Ukraine. Japan had factories. Ukraine has factories. You can't sustain a modern war without factories.
Afghanistan wasn't an attrition war (where the outcome is the collapse of one side). The CENTCOM commander best explains why the US lost, and it comes down to sanctuary:
> The core of the Taliban’s command and control was in the mountainous town of Quetta in southern Pakistan, and the most violent branch of the movement, the Haqqanis, were safely ensconced farther north, also in Pakistan. All were off limits to our forces. Occasionally, Pakistan would apply some pressure, but it was never enough to reduce their ability to operate. I came to see this as the absolutely critical failure of all our plans, and I grew to believe that there weren’t enough U.S. forces in all the world to establish order in Afghanistan, so long as Pakistan was open to the Taliban. It was a logical error in our approach to counterinsurgency that could not be papered over or compensated for.
> You can't sustain a modern war without factories
No, but somehow Iranian-backed Hamas and Hezbollah forces manage it from factoryless regions of Palestine and Lebanon. That's what I meant by "big Gaza": a region that's substantially damaged but still capable of fighting, where US/Israeli forces have to keep bombing militants in civilian areas forever. Every few weeks, a new pile of dead kids for social media. Is that the plan for Iran?
> US/Israeli forces have to keep bombing militants in civilian areas forever
It's not forever. A common misconception about insurgencies is that they're impossible to defeat because they're an "ideology". But it's more about sanctuary and state sponsorship. Afghanistan was a loss because of sanctuary, as per my quote above. This article provides quantitative analysis on that:
Hezbollah had sanctuary in Syria before Assad's collapse, and their state sponsorship is under strain because their supply route through Syria has been cut off and their state sponsor in Iran has degraded industrial production and finances.
> Is that the plan for Iran?
The plan for Iran is to prevent a fait accompli, defined as 10000 ballistic missiles (exceeding interceptor stockpiles) or a nuclear weapon. The best case scenario is regime change. The second best case scenario is coercing them into terms. The worst case scenario is to degrade their power projection capabilities without a negotiated agreement. But all three scenarios are considered better than the status quo trajectory by the belligerents. The status quo trajectory is seen as leading to a bigger war later (e.g. once they reach 9000 ballistic missiles instead of 5000), or worse.