It doesn't, I get that it's _a_ benchmark. It's just not a good or insightful one, and having it posted so often on HN feels like low-quality spam at this point.
The issue is that benchmarks that look insightful end up being gamed by labs quickly (Goodhart's law).
The best LLM benchmarks test around the margins of those behaviors: tasks that are difficult, that correlate with usefulness, and that are removed enough from training targets to stay unpolluted.
"Waterfall" got a bad rep because it meant "we stay months in the requirements gathering, then months design phase, then months in development, then months in validation". If you compress "months" to days/hours, what you obtain is something that nobody from the 90s would recognize as "waterfall"; it is not the end of agility, far from it.
"They’re writing TypeScript that compiles to JavaScript that runs in a V8 engine written in C++ that’s making system calls to an OS kernel that’s scheduling threads across cores they’ve never thought about, hitting RAM through a memory controller with caching layers they couldn’t diagram, all while npm pulls in 400 packages they’ve never read a line of."
https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/