But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submission"
0.191 * 0.594 ≈ 0.113 > 0.064 = 1 - 0.936
Does this mean that the audited subset wasn't representative? Or that Anthropic is getting high answers through some shady means?
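To spell out the arithmetic, a back-of-the-envelope sketch in Python (I'm assuming 93.6% is the headline pass rate being questioned; the other two figures are from the quote above):

    # Back-of-the-envelope check of the audit numbers.
    # Assumes 93.6% is the headline pass rate under discussion.
    failed_subset = 0.191  # fraction of all problems models often failed
    flawed_rate = 0.594    # fraction of the audited subset with flawed tests

    lower_bound_flawed = failed_subset * flawed_rate  # ~0.113 of all problems
    reported_failures = 1 - 0.936                     # 0.064 of all problems

    print(lower_bound_flawed > reported_failures)  # True: hence the question above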
I suggest reading the Mythos report's discussion on SWE-bench and contamination. I think it's fairly convincing that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.
Cost is like 90-99% of what matters. Last year, China installed 300GW of new renewables and 0GW of geothermal, despite geothermal being "an inexhaustible 24/7 production capable".
Geothermal will compete with solar if they can get the cost low enough. I hope they succeed!
> Each time there's a new model release a few more get solved.
I'm no expert, but based on the commentary from mathematicians, this Erdős proof is a unique milestone because the problem had received serious prior attention from multiple professional mathematicians, and the proof was surprising, elegant, and revealed some new connections.
The previous ChatGPT Erdős proofs have been qualitatively less impressive, more akin to literature search or solving easier problems that have been neglected.
Reading the prompt[1], one wonders if stoking the model to be unconventional is part of the success: "this ... may require non-trivial, creative and novel elements"
>one wonders if stoking the model to be unconventional is part of the success
I've long suspected that a lot of these models' real capabilities are still locked behind certain prompts, despite the big labs spending tons of effort on making default responses to simple prompts better. Even really dumb shit like "Answer this: ..." vs "Question: ..." vs "... you'll be judged by <competitor>", which should have zero impact in an ideal world, can significantly impact benchmark results. The problem is that you can waste a ton of time hunting for the right prompt with these "dumb" approaches, when in many day-to-day situations the model actually just needed some very specific context that was obvious to you but not to it. My go-to method is still to have the model ask me questions as the very first step on any of these problems. They kind of tried that with deep research since the early o-series, but it still needs improvement.
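As a toy illustration of how you might measure that sensitivity (a sketch only; ask_model is a hypothetical stand-in for whatever chat API you use, and check is your grader):

    # Toy prompt-sensitivity check: same questions, different framings.
    # ask_model() and check() are hypothetical stand-ins, not a real API.
    FRAMINGS = [
        "Answer this: {q}",
        "Question: {q}",
        "{q}\nYour answer will be judged against a competitor's.",
    ]

    def pass_rate(questions, framing, ask_model, check):
        hits = sum(check(q, ask_model(framing.format(q=q))) for q in questions)
        return hits / len(questions)

    # for f in FRAMINGS:
    #     print(repr(f), pass_rate(benchmark_questions, f, ask_model, check))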
Just the right "prompt" is exactly what happened here: Lean has been developed and incorporated into its data set. Also, token outputs only loosely correspond to "human language"; it's been shown that transformers develop their own internal representations, which has spawned a whole field called mechanistic interpretability. Being able to "parse" more correctly, i.e. using Lean and the right prompts, insights, and suggestions, will take on a whole new meaning in the future.
Awesome term/info, and (completely orthogonal to whether they’ll take err jerbs): I’m really excited about the social/civic picture that might be enabled by a defined and verifiable ontological and taxonomical foundation shared across humanity, particularly coupled with potential ‘legislation as code’ or ‘legal system as code’ solutions.
I’m thinking on a time horizon a bit past my own lifespan, but: even the possibility of objectively mapping out some specific aspect of a regional approach to social rights in a given time period and comparing it against another social framework, alongside automated & verifiable execution of policy, irrespective of the language of origin, is incredible.
Instead of hundreds and thousands of incommensurate legislative silos we might create a bazaar of shared improvement and governance efficiency. Turnkey mature governance and anti-corruption measures for newborn nations and countries trying to break out of vicious historical exploitation cycles. Fingers crossed.
Do you think the root cause of social/civic failures has been an inadequate policy repository and lack of a map between policy representations? If so, I have a bridge in Alaska for you to encode into your representation scheme.
I consider the scene with Dr. Chandra and SAL 9000 to be a fairly realistic predictive description of how experts interact with LLMs. SAL even has a somewhat obsequious personality.
Model output reflects your input, and the effect is self-reinforcing over the course of a whole conversation. The color you add around a problem influences the model's behavior.
A "dumber"/vague framing will get a less insightful solution, or possibly no solution at all.
I don't even necessarily think this is a critical flaw - in general it's just the model tuning its responses to your style of prompt. People utilize LLMs for all kinds of different tasks, and the "modes of thought" for responding to an Erdős problem versus software engineering versus a more human/soft-skills topic are all very different. I think the "prompt sensitivity" issue just comes bundled with this general behavior.
Keeping a pristine context is so important that I use two separate conversations whenever I'm doing something meaningful. One is the main task executor, and the other is for bouncing random problems, thoughts, and ideas off of, so the executor instance's context stays pristine.
It's sort of an agentic loop where I am one of the agents
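Roughly like this, if you squint (a sketch; Conversation is a hypothetical wrapper over a chat API, not any particular library):

    # Sketch of the two-conversation workflow: one pristine executor,
    # one messy scratchpad. Conversation is a hypothetical wrapper that
    # just keeps its own message history; only distilled results cross over.
    class Conversation:
        def __init__(self, system):
            self.messages = [{"role": "system", "content": system}]

        def send(self, text, ask_model):
            self.messages.append({"role": "user", "content": text})
            reply = ask_model(self.messages)  # hypothetical API call
            self.messages.append({"role": "assistant", "content": reply})
            return reply

    executor = Conversation("You execute the main task. Stay focused.")
    scratchpad = Conversation("Brainstorm freely; nothing here is binding.")

    # idea = scratchpad.send("Weigh approach X against Y.", ask_model)
    # executor.send(f"Use this approach: {idea}", ask_model)  # distilled result only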
They're tuned to target a certain customer demographic solving certain problems. I've seen standard AI models do absolutely brilliant things sometimes. But the prompts needed to get them to perform like they did in the GPT-3 days seem to get lengthier and lengthier over time. At some point we'll probably just snip out smaller, specialized models to do certain things.
Yes, it's extremely awkward! Why is a model that can solve problems in scientific literature the same model that can generate random code, write poems in pirate speech, and do all sorts of other random tasks?
It feels like there is a lot of untapped power for specialized LLM tasks if they were created for specialists instead of the general populace prompting from a smartphone.
> “The raw output of ChatGPT’s proof was actually quite poor. So it required an expert to kind of sift through and actually understand what it was trying to say,” Lichtman says. But now he and Tao have shortened the proof so that it better distills the LLM’s key insight.
Interestingly, it was an elegant technique, but the proof still required a lot of work.
- It's not gradient boosting per se that's good on tabular data, it's trees. Other fitting methods with trees as the model are also usually superior to NNs on tabular data.
- Trees are better on tabular data because they encode a useful inductive bias that NNs currently do not. Just like CNNs or ViTs are better on images because they encode spatial locality as an inductive bias.
Absolutely agree on both counts. Gradient boosting is the most commonly known and most successful variant, but it's the decision tree structure that is the underlying architecture there. Decision trees don't have the same "implicit training bias" phenomenon that neural networks have though, so all of this is just model bias in the classical statistical sense.
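If you want to see the gap for yourself, here's a quick scikit-learn sketch (results will vary by dataset; the MLP gets no feature scaling on purpose, since scale-invariance is part of the tree inductive bias being described):

    # Sketch: tree ensemble vs. plain NN on a small tabular dataset.
    # Trees are invariant to feature scaling and monotone transforms,
    # which is part of the inductive bias discussed above.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    gbt = HistGradientBoostingClassifier(random_state=0)
    mlp = MLPClassifier(max_iter=2000, random_state=0)  # unscaled features on purpose

    print("trees:", cross_val_score(gbt, X, y, cv=5).mean())
    print("mlp:  ", cross_val_score(mlp, X, y, cv=5).mean())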
> They are just ordinary gambling unless you allow insider trading and manipulation, because that’s the only way the market can acquire and represent novel useful information.
Representing only public information without an agenda is useful in itself. Words are cheap, and which words you get to see and which you don't is determined by some non-truth incentive. Prediction markets say "you get to make money if you know what the truth actually is". Media says "you get to make money if you entertain people".
It's unfortunate that there are also significant negative side effects to financialized prediction markets. I'm more favorable to non-financial prediction markets like Manifold, which say "you get to have social status if you know what the truth is". That seems like the right balance, although you can see how such non-financial markets could be more easily defeated by dedicated non-truth actors if they became prominent in the public conversation.
Yes, sort of. Generally you can measure the pass rate on a benchmark given a fixed compute budget. A sufficiently smart model can hit a high pass rate with fewer tokens/compute. Check out the cost efficiency on https://artificialanalysis.ai/ (saw this posted here the other day, pretty neat charts!)
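In pseudo-ish Python, the measurement looks something like this (solve is a hypothetical call returning whether the model got it right and how many tokens it spent):

    # Toy sketch: pass rate under a fixed per-problem token budget.
    # solve() is a hypothetical call returning (is_correct, tokens_used).
    def pass_rate_at_budget(problems, solve, budget_tokens):
        passed = 0
        for p in problems:
            correct, used = solve(p, max_tokens=budget_tokens)
            if correct and used <= budget_tokens:
                passed += 1
        return passed / len(problems)

    # A "smarter" model is one whose pass rate climbs faster as
    # budget_tokens grows, i.e. it needs fewer tokens per solve.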
It's much easier to measure a language model's intelligence than a human's because you can take as many samples as you want without affecting its knowledge. And we do measure human intelligence.
That's a stale insight from an old era of warfare. The purpose of quality is to remove quantity. Iran is the case study. A large stockpile of munitions counts for something, but once the factories are gone, you're on a 3-month clock. Deleting factories can only be achieved with quality (expensive stand-off munitions + F-35s for SEAD, then missile trucks with cheap JDAMs to take out the factories).
30-50 years ago you just couldn't do this kind of warfare; the technology and intelligence didn't exist. Now you can. People haven't updated on this paradigm shift.
People are over-learning the wrong lessons from Ukraine. That is a unique war with air parity. That's why the Ukraine war is shaped the way it is. Not because this is how wars ought to be fought.
This is not to discount quantity. But you can't have only quantity unless you want to fight an attritional war for 10 years (or worse, lose your own industrial production to an enemy that achieves air superiority over your skies because they had the foresight to invest in quality).
The Iran war isn't over yet. Plenty of time for it to become attritional, especially if the people who want Big Gaza / "mowing the nuclear lawn" to become the status quo are in charge. After all, Afghanistan was a quick victory, 20 years of attrition, and eventual exit.
Without factories? I doubt it. I'm not saying the US is going to win (in the sense of achieving objectives), but it's not going to be an attrition war like in WW2 or Ukraine. Japan had factories. Ukraine has factories. You can't sustain a modern war without factories.
Afghanistan wasn't an attrition war (where the outcome is the collapse of one side). The CENTCOM commander best explains why the US lost, and it comes down to sanctuary:
> The core of the Taliban’s command and control was in the mountainous town of Quetta in southern Pakistan, and the most violent branch of the movement, the Haqqanis, were safely ensconced farther north, also in Pakistan. All were off limits to our forces. Occasionally, Pakistan would apply some pressure, but it was never enough to reduce their ability to operate. I came to see this as the absolutely critical failure of all our plans, and I grew to believe that there weren’t enough U.S. forces in all the world to establish order in Afghanistan, so long as Pakistan was open to the Taliban. It was a logical error in our approach to counterinsurgency that could not be papered over or compensated for.
> You can't sustain a modern war without factories
No, but somehow Iranian-backed Hamas and Hezbollah forces manage it from factoryless regions of Palestine and Lebanon. That's what I meant by "big Gaza": a region that's substantially damaged but still capable of fighting, where US/Israeli forces have to keep bombing militants in civilian areas forever. Every few weeks, a new pile of dead kids for social media. Is that the plan for Iran?
> US/Israeli forces have to keep bombing militants in civilian areas forever
It's not forever. A common misconception about insurgencies is that they're impossible to defeat because they're an "ideology". But it's more about sanctuary and state sponsorship. Afghanistan was a loss because of sanctuary, as per my quote above. This article provides quantitative analysis on that:
Hezbollah had sanctuary in Syria before Assad's collapse, and their state sponsorship is under strain because their supply route through Syria has been cut off and their state sponsor in Iran has degraded industrial production and finances.
> Is that the plan for Iran?
The plan for Iran is to prevent a fait accompli, defined as 10000 ballistic missiles (exceeding interceptor stockpiles) or a nuclear weapon. The best case scenario is regime change. The second best case scenario is coercing them into terms. The worst case scenario is to degrade their power projection capabilities without a negotiated agreement. But all three scenarios are considered better than the status quo trajectory by the belligerents. The status quo trajectory is seen as leading to a bigger war later (e.g. once they reach 9000 ballistic missiles instead of 5000), or worse.