Furiosa: 3.5x efficiency over H100s (furiosa.ai)
206 points by written-beyond 1 day ago | 153 comments




I am of the opinion that Nvidia's hit the wall with their current architecture in the same way that Intel has historically with its various architectures - their current generation's power and cooling requirements force the construction of entirely new datacenters with different architectures, which is going to blow out the economics on inference (GPU + datacenter + power plant + nuclear fusion research division + lobbying for datacenter land + water rights + ...).

The story with Intel at moments like these was usually that AMD or Cyrix or ARM or Apple or someone else would come around with a new architecture that was a clear generation jump past Intel's, and most importantly seemed to break the thermal and power ceilings of the Intel generation (at which point Intel typically fired their chip design group, hired everyone from AMD or whoever, and came out with Core or whatever). Nvidia effectively has no competition, or hasn't had any - nobody's actually broken the CUDA moat, so neither Intel nor AMD nor anyone else is really competing for the datacenter space, and Nvidia hasn't faced any real competitive pressure over things like multi-kilowatt power draws for the Blackwells.

The reason this matters is that LLMs are incredibly nifty, often useful tools that are not AGI and also seem to be hitting a scaling wall, and the only way to make the economics of, eg, a Blackwell-powered datacenter make sense is to assume that the entire economy is going to be running on it, as opposed to some useful tools and some improved interfaces. Otherwise, the investment numbers just don't make sense - the gap between the real but limited value LLMs actually add on the ground and the full cost of providing that service with a brand new single-purpose "AI datacenter" is just too great.

So this is a press release, but any time I see something that looks like an actual new hardware architecture for inference, and especially one that doesn't require building a new building or solving nuclear fusion, I'll take it as a good sign. I like LLMs, I've gotten a lot of value out of them, but nothing about the industry's finances adds up right now.


> I am of the opinion that Nvidia's hit the wall with their current architecture

Based on what?

Their measured performance on things people care about keeps going up, and their software stack keeps getting better and unlocking more performance on existing hardware.

Inference tests: https://inferencemax.semianalysis.com/

Training tests: https://www.lightly.ai/blog/nvidia-b200-vs-h100

https://newsletter.semianalysis.com/p/mi300x-vs-h100-vs-h200... (only H100, but vs AMD)

> but nothing about the industry's finances adds up right now

Is that based just on the HN "it is lots of money so it can't possibly make sense" wisdom? Because the released numbers seem to indicate that inference providers and Anthropic are doing pretty well, and that OpenAI is really only losing money on inference because of the free ChatGPT usage.

Further, I'm sure most people heard the mention of an unnamed enterprise paying Anthropic $5000/month per developer on inference(!!) If a company is that cost-insensitive, is there any reason why Anthropic would bother to subsidize them?


> Their measured performance on things people care about keeps going up, and their software stack keeps getting better and unlocking more performance on existing hardware.

I'm more concerned about fully-loaded dollars per token - including datacenter and power costs - than about "does the chip go faster." If Nvidia couldn't make the chip go faster, there wouldn't be any debate; the question right now is "what is the cost of those improvements." I don't have the answer to that number, but the numbers going around for the costs of new datacenters don't give me a lot of optimism.

> Is that based just on the HN "it is lots of money so it can't possibly make sense" wisdom?

OpenAI has $1.15T in spend commitments over the next 10 years: https://tomtunguz.com/openai-hardware-spending-2025-2035/

As far as revenue, the released numbers from nearly anyone in this space are questionable - they're not public companies, we don't actually get to see inside the box. Torture the numbers right and they'll tell you anything you want to hear. What we _do_ get to see is, eg, Anthropic raising billions of dollars every ~3 months or so over the course of 2025. Maybe they're just that ambitious, but that's the kind of thing that makes me nervous.


> OpenAI has $1.15T in spend commitments over the next 10 years

Yes, but those aren't contracted commitments, and we know some of them are equity swaps. For example "Microsoft ($250B Azure commitment)" from the footnote is an unknown amount of actual cash.

And I think it's fair to point out the other information in your link "OpenAI projects a 48% gross profit margin in 2025, improving to 70% by 2029."


> "OpenAI projects a 48% gross profit margin in 2025, improving to 70% by 2029."

OpenAI can project whatever they want, they're not public.


They still have shareholders who can sue for misinformation.

Private companies do not have a license to lie to their shareholders.


Sounds like the railway boom... I mean bond scams.

> Yes, but those aren't contracted commitments, and we know some of them are equity swaps.

It's worse than not contracted. Nvidia said in their earnings call that their OpenAI commitment was "maybe".


The fact that there's an incestuous circle between OpenAI, Microsoft, NVidia, AMD, etc., where they provide massive promises to each other for future business, is nothing short of hilarious.

The economics of the entire setup are laughable and it's obvious that it's a massive bubble. The profit that'd need to be delivered to justify the current valuations is far beyond what is actually realistic.

What moat does OpenAI have? I'd argue basically none. They make extremely lofty forecasts and project an image of crazy growth opportunities, but is that going to ever survive the bubble popping?


I still don't really understand this "circle" issue. If I fix your bathroom and in return you make me a new table, is that an incestuous circle? Haven't we both just exchanged value?

The circle allows you to put an arbitrary "price" on those services. You could say that the bathroom and table are $100 each, so your combined work was $200. Or you could claim that each of you did $1M work. Without actual money flowing in/out of your circle, your claims aren't tethered to reality.

You don’t think real money is changing hands when Microsoft buys Nvidia GPUs?

What about when Nvidia sells GPUs to a client and then buys 10% of their shares?

Their shares will be based on the client's valuation, which in public markets is externally priced. If not in public markets it is murkier, but will be grounded in some sort of reality so Nvidia gets the right amount of the company.

It's a soft version of money printing basically. These firms are clearly inflating each other's valuations by making huge promises of future business to each other. Naively, one would look at the headlines and draw the conclusion that much more money is going to flow into AI in the near future.

Of course, a rational investor looks at this and discounts the fact that most of those promises are predicated on insane growth that has no grounding in reality.

However, there are plenty of greedy or irrational investors, whose recklessness will affect everyone, not just them.


For Nvidia shares: converting cash into shares in a speculative business while guaranteeing increasing demand for your product is a pretty good idea, and probably doesn't have any downsides.

For the AI company being bought: I wouldn't trust these shares or valuations, because the money invested is going on GPUs and back to Nvidia.


GPUs are supply constrained and the price isn't declining that fast, so why do you expect the token price to decrease? I think the supply issue will resolve in 1-2 years now that they have a good prediction of how fast the market will grow.

Nvidia is literally selling GPUs with a 90% profit margin and still everything is out of stock, which is unheard of.


> Further, I'm sure most people heard the mention of an unnamed enterprise paying Anthropic $5000/month per developer on inference

I haven't and I'd like to know more.


>Further, I'm sure most people heard the mention of an unnamed enterprise paying Anthropic $5000/month per developer on inference

Companies have wasted more money on dumber things so spending isn't a good measure.

And what about the countless other AI companies? Anthropic has one of the top models for coding, so that's like saying there wasn't a problem pre-dot-com bubble because Amazon was doing fine.

The real effect of AI is measured in the rising profits of the customers of those AI companies; otherwise you're just looking at the shovel sellers.


> Is that based just on the HN "it is lots of money so it can't possibly make sense" wisdom?

I mean the amount of money invested across just a handful of AI companies is currently staggering and their respective revenues are nowhere near where they need to be. That’s a valid reason to be skeptical. How many times have we seen speculative investment of this magnitude? It’s shifting entire municipal and state economies in the US.

OpenAI alone is currently projected to burn over $100 billion by what? 2028 or 2029? Forgot what I read the other day. Tens of billions a year. That is a hell of a gamble by investors.


The flip side is that these companies seem to be capacity constrained (although that is hard to confirm). If you assume the labs are capacity constrained, which seems plausible, then building more capacity could pay off by allowing labs to serve more customers and increase revenue per customer.

This means the bigger questions are whether you believe the labs are compute constrained, and whether you believe more capacity would allow them to drive actual revenue. I think there is a decent chance of this being true, and under this reality the investments make more sense. I can especially believe this as we see higher-cost products like Claude Code grow rapidly with much higher token usage per user.

This all hinges on demand materialising when capacity increases, and margins being good enough on that demand to get a good ROI. But that seems like an easier bet for investors to grapple with than trying to compare future investment in capacity with today's revenue, which doesn't capture the whole picture.


I am not someone who would ever be considered an expert on factories/manufacturing of any kind, but my (insanely basic) understanding is that typically a “factory” making whatever widgets or doodads is outputting at a profit or has a clear path to profitability in order to pay off a loan/investment. They have debt, but they’re moving towards the black in a concrete, relatively predictable way - no one speculates on a factory anywhere near the degree they do with AI companies currently. If said factory’s output is maxed and they’re still not making money, then it’s a losing investment and they wouldn’t expand.

Basically, it strikes me as not really apples to apples.


Consensus seems to be that the labs are profitable on inference. They are only losing money on training and free users.

The competition requiring them to spend that money on training and free users does complicate things. But when you just look at it from an inference perspective, looking at these data centres like token factories makes sense. I would definitely pay more to get faster inference of Opus 4.5, for example.

This is also not wholly dissimilar to other industries where companies spend heavily on R&D while running profitable manufacturing. Pharma, semiconductors, and hardware companies like Samsung or Apple all do this. The unusual part with AI labs is the ratio and the uncertainty, but that's a difference of degree, not kind.


> But when you just look at it from an inference perspective, looking at these data centres like token factories makes sense.

So if you ignore the majority of the costs, then it makes sense.

Opus 4.5 was released on November 25, 2025. That is less than 2 months ago. When they stop training new models, then we can forget about training costs.


I'm not taking a side here - I don't know enough - but it's an interesting line of reasoning.

So I'll ask, how is that any different than fabs? From what I understand R&D is absurd and upgrading to a new node is even more absurd. The resulting chips sell for chump change on a per unit basis (analogous to tokens). But somehow it all works out.

Well, sort of. The bleeding edge companies kept dropping out until you could count them on one hand at this point.

At first glance it seems like the analogy might fit?


Someone else mentioned it elsewhere in this thread, and I believe this is the crux of the issue: this is all predicated on the actual end users finding enough benefit in LLM services to keep the gravy train going. It's irrelevant how scalable and profitable the shovel makers are; to keep this business afloat long term, the shovelers - ie the end users - have to make money using the shovels. Those expectations are currently ridiculously inflated. Far beyond anything in the past.

Invariably, there's going to be a collapse in the hype, the bubble will burst, and an investment deleveraging will remove a lot of money from the space in a short period of time. The bigger the bubble, the more painful and less survivable this event will be.


Inference costs scale linearly with usage. R&D expenses do not.

That's not to mention that Dario Amodei has said that their models actually have a good return, even when accounting for training costs [0].

[0] https://youtu.be/GcqQ1ebBqkc?si=Vs2R4taIhj3uwIyj&t=1088


> Inference costs scale linearly with usage. R&D expenses do not.

Do we know this is true for AI?


It’s pretty much the definition of fixed costs versus variable costs.

You spend the same amount on R&D whether you have one hobbyist user or 90% market share.
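
A toy model of that split (every number below is made up purely for illustration) - inference, the variable term, scales with usage, while the fixed R&D term does not, so R&D shrinks as a share of cost once volume is large enough:

    # Toy fixed-vs-variable cost model; every number here is invented for illustration.
    RND_PER_YEAR = 5_000_000_000      # fixed: spent whether you have 1 user or millions
    COST_PER_MTOK = 2.00              # variable: serving cost per million output tokens

    def total_cost(million_tokens_served: float) -> float:
        return RND_PER_YEAR + COST_PER_MTOK * million_tokens_served

    for mtok in (1e3, 1e7, 1e10):     # hobbyist scale -> mass-market scale
        cost = total_cost(mtok)
        print(f"{mtok:>16,.0f} Mtok/year: R&D is {RND_PER_YEAR / cost:.1%} of total cost")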


Yes. R&D is guaranteed to fall as a percentage of costs eventually. The only question is when, and there is also a question of who is still solvent when that time comes. It is competition and an innovation race that keeps it so high, and it won't stay so high forever. Either rising revenues or falling competition will bring R&D costs down as a percentage of revenue at some point.

Yes, but eventually may be longer than the market can hold out. So far R&D expenses have skyrocketed and it does not look like that will be changing anytime soon.

That's why it is a bet, and not a sure thing.

>Consensus seems to be that the labs are profitable on inference. They are only losing money on training and free users.

That sounds like “we’re profitable if you ignore our biggest expenses.” If they could be profitable now, we’d see at least a few companies just be profitable and stop the heavy expenses. My guess is it’s simply not the case or everyone’s trapped in a cycle where they are all required to keep spending too much to keep up and nobody wants to be the first to stop. Either way the outcome is the same.


This is just not true. Plenty of companies will remain unprofitable for as long as they can in the name of growth, market share, and beating their competition. At some point it will level out, but while they can still raise cheap capital and spend it to grow, they will.

OpenAI could put in ads tomorrow and make tons of money overnight. The only reason they don't is competition. But when they start to find it harder to raise capital to fund their growth, they will.


> I mean the amount of money invested across just a handful of AI companies is currently staggering and their respective revenues are nowhere near where they need to be. That’s a valid reason to be skeptical.

Yes and no. Some of it just claims to be "AI". Like the hyperscalers are building datacenters and ramping up but not all of it is "AI". The crypto bros have rebadged their data centers into "AI".


> The crypto bros have rebadged their data centers into "AI"

That the previous unsustainable bubble is rebranding into the new one, is maybe not the indicator of stability we should be hoping for


> (at which point Intel typically fired their chip design group, hired everyone from AMD or whoever, and came out with Core or whatever)

Didn't the Core architecture come from the Intel Pentium M Israeli team? https://en.wikipedia.org/wiki/Intel_Core_(microarchitecture)...


Correct. Core came from Pentium M, which actually came from the Israeli team who took the Pentium 3 architecture, and coupled this with the best bits from the Pentium 4

Yeah, that bit was pure snark - point was Intel’s gotten caught resting on their laurels a couple times when their architectures get a little long in the tooth, and often it’s existential enough that the team that pulls them out of it isn’t the one that put them in it.

I think that's an overly reductive view of a very complicated problem space, with the benefit of hindsight.

If you wanted to make that point, Itanium or 64-bit/multi-core desktop processing would be better examples than Core.


Yes, and the newest Panther Lake too!

https://techtime.news/2025/10/10/intel-25/


What about TPUs? They are more efficient than nvidia GPUs, a huge amount of inference is done with them, and while they are not literally being sold to the public, the whole technology should be influencing the next steps of Nvidia just like AMD influenced Intel

TPUs can be more efficient, but are quite difficult to program for efficiently (difficult to saturate). That is why Google tends to sell TPU-services, rather than raw access to TPUs, so they can control the stack and get good utilization. GPUs are easier to work with.

I think the software side of the story is underestimated. Nvidia has a big moat there and huge community support.


My understanding is all of Google's AI is trained and run on quite old but well designed TPUs. For a while the issue was that developing these AI models still needed flexibility and customised hardware like TPUs couldn't accommodate that.

Now that the model architecture has settled into something a bit more predictable, I wouldn't be surprised if we saw a little more specialisation in the hardware.


> and the only way to make the economics of, eg, a Blackwell-powered datacenter make sense is to assume that the entire economy is going to be running on it, as opposed to some useful tools and some improved interfaces.

And I'm still convinced we're not paying real prices anywhere. Everyone is still trying to get market share, so prices are going to go up when this all needs to sustain itself. At that point, which use cases become too expensive, and does that shrink its applicability?


> The reason this matters is that LLMs are incredibly nifty, often useful tools that are not AGI and also seem to be hitting a scaling wall

I don't know who needs to hear this, but the real breakthrough in AI that we have had is not LLMs, but generative AI. LLMs are but one specific case. Furthermore, we have hit absolutely no walls. Go download a model from Jan 2024, another from Jan 2025 and one from this year and compare. The difference in how good they have gotten is exponential.


> exponential

Is this the second most abused english word (after 'literally')?

> a model from Jan 2024, another from Jan 2025 and one from this year

You literally can't tell whether the difference is 'exponential', quadratic, or whatever from three data points.

Plus it's not my experience at all. Since Deepseek I haven't found models that one can run on consumer hardware get much better.


I’ve heard “orders of magnitude” used more than once to mean 4-5 times

In binary 2x is one order of magnitude
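
For the arithmetic, a 4-5x jump is well under one decimal order of magnitude, though more than two binary ones - a quick check:

    import math

    ratio = 4.5  # "4-5 times"
    print(f"base 10: {math.log10(ratio):.2f} orders of magnitude")  # ~0.65
    print(f"base 2:  {math.log2(ratio):.2f} orders of magnitude")   # ~2.17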

exactly!

I've been wondering about this for quite a while now. Why does everybody automatically assume that I'm using the decimal system when saying "orders of magnitude"?!


I'd argue that 100% of all humans use the decimal system, most of the time. Maybe 1 to 5% of all humans use another system some of the time.

Anyway, there are 10 types of people, those who understand binary and those who don't.


Because, as xkcd 169 says, communicating badly and then acting smug when you're misunderstood is not cleverness. "Orders of magnitude" refers to a decimal system in the vast majority of uses (I must admit I have no concrete data on this, but I can find plenty of references to it being base-10 and only a suggestion that it could be something else).

Unless you've explicitly stated that you mean something else, people have no reason to think that you mean something else.


There is a lot of talking past each other when discussing LLM performance. The average person whose typical use case is asking ChatGPT how long they need to boil an egg for hasn't seen improvements for 18 months. Meanwhile if you're super into something like local models for example the tangible improvements are without exaggeration happening almost monthly.

Random trivia are answered much better in my case.

> The average person whose typical use case is asking ChatGPT how long they need to boil an egg for hasn't seen improvements for 18 months

I don’t think that’s true. I think both my mother and my mother-in-law would start to complain pretty quickly if they got pushed back to 4o. Change may have felt gradual, but I think that’s more a function of growing confidence in what they can expect the machine to do.

I also think “ask how long to boil an egg” is missing a lot here. Both use ChatGPT in place of Google for all sorts of shit these days, including plenty of stuff they shouldn’t (like: “will the city be doing garbage collection tomorrow?”). Both are pretty sharp women but neither is remotely technical.


>go download a model

GP was talking about commercially hosted LLMs running in datacenters, not free Chinese models.

Local is definitely still improving. That’s another reason the megacenter model (NVDA’s line-go-up-forever plan) is either a financial catastrophe about to happen, or the biggest bailout ever.


GPT 5.2 is an incredible leap over 5.1 / 5

5.2 is great if you ask it engineering questions, or questions an engineer might ask. It is extremely mid, and actually worse than the o3/o4 era models, if you start asking it trivia like whether the I-80 tunnel on the Bay Bridge (Yerba Buena Island) is the largest bore in the world. Don't even get me started on whatever model is wired up to the voice chat button.

But yes it will write you a flawless, physics accurate flight simulator in rust on the first try. I've proven that. I guess what I'm trying to say is Anthropic was eating their lunch at coding, and OpenAI rose to the challenge, but if you're not doing engineering tasks their current models are arguably worse than older ones.


But how many are willing to fork over $20 or so a month to ask simple trivia questions?

In addition to engineering tasks, it's an ad-free answer box; outside of cross-checking things or browsing search results, it's totally replaced Google/search engine use for me. I also pay for Kagi for search. In the last year I've been able to fully divorce myself from the Google ecosystem besides Gmail and Maps.

My impression is that software developers are the lion's share of people actually paying for AI, but perhaps that's just my bubble world view.

According to OpenAI it's something like 4.2% of the use. But this data is from before Codex added subscription support and I think only covers ChatGPT (back when most people were using ChatGPT for coding work, before agents got good).

https://i.imgur.com/0XG2CKE.jpeg


The execs I've talked to, they are paying for it to answer capex questions, as a sounding board for decision making, and perhaps most importantly, crafting/modifying emails for tone/content. In the bay area particularly a lot of execs are foreign with english as their second language and LLMs can cut email generation time in half.

I'd believe that but I was commenting on who actually pays for it. My guess is that most individuals using AI in their personal lives are using some sort of free tier.

Yes 95% are unpaid

how is “GPT 5.2 is good” a response to “downloadable models aren’t relevant”?

> Go download a model from Jan 2024, another from Jan 2025 and one from this year and compare.

I did. The old one is smarter.

(The newer ones are more verbose, though. If that impresses you, then you probably think members of parliament are geniuses.)


Yeah agreed, there were some minor gains, but new releases are mostly benchmark-overfit sycophantic bullshit that are only better on paper and horrible to use. The more synthetic data they add, the less world knowledge the model has and the more useless it becomes. But at least they can almost mimic a basic calculator now /s

For API models, OpenAI's releases have regularly not been an improvement for a long while now. Is Sonnet 4.5 better than 3.5 outside the pretentious agentic workflows it's been trained for? Basically impossible to tell; they make the same braindead mistakes sometimes.


> I am of the opinion that Nvidia's hit the wall with their current architecture

Google presented TPUs in 2015. NVIDIA introduced Tensor Cores in 2018. Both utilize systolic arrays.

And last month NVIDIA pseudo-acquired Groq including the founder and original TPU guy. Their LPUs are way more efficient for inference. Also of note Groq is fully made in USA and has a very diverse supply chain using older nodes.

NVIDIA architecture is more than fine. They have deep pockets and very technical leadership. Their weakness lies more with their customers, lack of energy, and their dependency on TSMC and the memory cartel.


Underrated acquisition. Gives NVIDIA a whole lineup of inference-focused hardware that iirc can retrofit into existing air cooled data centres without needing cooling upgrades. Great hedge against the lower-end $$$-per-watt and watt-per-token competition that has been focused purely at inference.

Also a hedge from the memory cartel as Groq uses SRAM. And a reasonable hedge in case Taiwan gets blockaded or something.

Thanks for this. It put into words a lot of the discomfort I’ve had with the current AI economics.


We've seen this before.

In 2001, there were something like 50+ OC-768 hardware startups.

At the time, something like 5 OC-768 links could carry all the traffic in the world. Even exponential doubling every 12 months wasn't going to get enough customers to warrant all the funding that had poured into those startups.

When your business model bumps into "All the <X> in the world," you're in trouble.


Especially when your investors are still expecting exponential growth rates.

What do I care if there's no profit in LLMs...

I just want to buy DDR5 and not pay an arm and a leg for my power bill!


> which is going to blow out the economics on inference

At this point, I don't even think they do the envelope math anymore. However much money investors will be duped into giving them, that's what they'll spend on compute. Just gotta stay alive until the IPO!


Remember that without real competition, Nvidia has little incentive to release something 16x faster when they could release something 2x faster 4 times.

You’re right but Nvidia enjoys an important advantage Intel had always used to mask their sloppy design work: the supply chain. You simply can’t source HBMs at scale because Nvidia bought everything, TSMC N3 is likewise fully booked and between Apple and Nvidia their 18A is probably already far gone and if you want to connect your artisanal inference hardware together then congratulations, Nvidia is the leader here too and you WILL buy their switches.

As for the business side, I’ve yet to hear of a transformative business outcome due to LLMs (it will come, but not there yet). It’s only the guys selling the shovels that are making money.

This entire market runs on sovereign funds and cyclical investing. It’s crazy.


For instance, I believe call centers are in big trouble, and so are specialized contractors (like those prepping for an SOC submission etc).

It is, however, actually funny how bad e.g. the amazon chatbot (Rufus) is on amazon.com. When asked where a particular CC charge comes from, it does all sorts of SQL queries into my account, but it can't be bothered to give me the link to the actual charges (the page exists and solves the problem trivially).

So, maybe, the call-center troubles will take some time to materialize.


Based on conversations I've had with some people managing GPUs at scale in the datacenters, inference is an afterthought. There is a gold rush for training right now, and that's where these massive clusters are being used.

LLMs are probably a small fraction of the overall GPU compute in use right now. I suspect in the next 5 years we'll have full Hollywood movies being completely generated (at least the special effects) entirely by AI.


Hollywood studios are breathing their last gasps now. Anyone will be able to use AI to create blockbuster type movies, Hollywood's moat around that is rapidly draining.

Have you... used any of the video generators? Nothing they create makes any goddamn sense; they're a step above those fake acid trip simulators.

> Nothing they create makes any goddamn sense,

I wouldn’t be that dismissive. Some have managed to make impressive things with them (although nothing close to an actual movie, even a short).

https://www.youtube.com/watch?v=ET7Y1nNMXmA

A bit older: https://www.youtube.com/watch?v=8OOpYvxKhtY

Compared to two years ago: https://www.youtube.com/watch?v=LHeCTfQOQcs


The problem with all of these, even the most recent one, is that they have the "AI look". People have tired of this look already, even for short adverts; if they don't want five minutes of it, they really won't like two hours of it. There is no doubt the quality has vastly improved over time, but I see no sign of progress in removing the "AI look" from these things.

My feeling is the definition of the "AI look" has evolved as these models progressed.

It used to mean psychedelic weird things worthy of the strangest dreams or an acid trip.

Then it meant strangely blurry with warped alien script and fifteen fingers, including one coming out of another’s second phalanx

Now it means something odd, off, somewhat both hard to place and obvious, like the CGI "transparent" car (is it that the 3D model is too simple, looks like a bad glass sculpture, and refracts light in squares?) and ice cliffs (I think the lighting is completely off, and the colours are wrong) in Die Another Day.

And if that’s the case, then these models have covered far more ground in far less time than it took computer graphics and CGI.


What changed my whole perspective on this a few months ago was Google's Genie 3 demo: https://www.youtube.com/watch?v=PDKhUknuQDg

They have really advanced the coherency of real-time AI generation.


Have you seen https://www.youtube.com/watch?v=SGJC4Hnz3m0

It's not a feature-length movie, but I'm not sure there's any reason why it couldn't be, and it's not technically perfect but pretty damn good.


Anybody has had the ability to write the next great novel for a while, but few succeed.

There are lots of very good relatively recent novels on the shelf at the bookstore. Certainly orders of magnitude more than there are movies.

The other thing to compare is the narrative quality. I find even middling books to be of much higher quality than blockbuster movies on average. Or rather I'm constantly appalled at what passes for a decent script. I assume that's due to needing to appeal to a broad swath of the population because production is so expensive, but understanding the (likely) reason behind it doesn't do anything to improve the end result.

So if "all" we get out of this is a 1000x reduction in production budgets which leads to a 100x increase in the amount of media available I expect it will be a huge win for the consumer.


Anyone with a $200M marketing budget.

Throw it on YouTube and get a few key TikTokers to promote it.

It's so weird how they spend all this money to train new models and then open source them. It's a gold rush, but Nvidia is getting all the gold.

> I am of the opinion that Nvidia's hit the wall with their current architecture

Not likely since TSMC has a new process with big gains.

> The story with Intel

Was that their fabs couldn't keep up, not their designs.


If Intel's original 10nm process and Cannon Lake had launched within Intel's original timeframe of 2016/17, it would have been class leading.

Instead, they couldn't get 10nm to work and launched one low-power SKU in 2018 that had almost half the die disabled, and stuck to 14nm from 2014-2021.


> nothing about the industry's finances adds up right now

Nothing about the industry’s finances, or about Anthropic and OpenAI’s finances?

I look at the list of providers on OpenRouter for open models, and I don’t believe all of them are losing money. FWIW Anthropic claims (iirc) that they don’t lose money on inference. So I don’t think the industry or the model of selling inference is what’s in trouble there.

I am much more skeptical of Anthropic and OpenAI’s business model of spending gigantic sums on generating proprietary models. Latest Claude and GPT are very very good, but not better enough than the competition to justify the cash spend. It feels unlikely that anyone is gonna “winner takes all” the market at this point. I don’t see how Anthropic or OpenAI’s business model survive as independent entities, or how current owners don’t take a gigantic haircut, other than by Sam Altman managing to do something insane like reverse acquiring Oracle.

EDIT: also feels like Musk has shown how shallow the moat is. With enough cash and access to exceptional engineers, you can magic a frontier model out of the ether, however much of a douche you are.


It's become rather clear from the local LLM communities catching up that there is no moat. Everyone is still just barely figuring out how these nifty data structures produce such powerful emergent behavior; there isn't any truly secret sauce yet.

> local LLM communities catching up that there is no moat.

They use Chinese open LLMs, but the Chinese companies do have a moat: training datasets, some non-open-source tech, and salaried talent, which one would need serious investment for if deciding to bootstrap a competitive frontier model today.


I’d argue there’s a _bit_ of secret sauce here, but the question is if there’s enough to justify valuations of the prop-AI firms, and that seems unlikely.

> but nothing about the industry's finances adds up right now.

The acquisitions do. Remember Groq?


That may not be a good example because everyone is saying Groq isn't worth $20B.

They were valued at $6.9B just three months before Nvidia bought them for $20B, triple the valuation. That figure seems to have been pulled out of thin air.

Speaking generally: it makes sense for an acquisition price to be at a premium to valuation, between the dynamics where you have to convince leadership it's better to be bought than to keep growing, and the expected risk posed by them as competition.

Most M&As aren't done by value investors.


Maybe it was worth the other $13.1B to make sure their competitors couldn't get them?

What can it actually run? The fact their benchmark plot refers to Llama 3.1 8b signals to me that it's hand implemented for that model and likely can't run newer / larger models. Why else would you benchmark such an outdated model? Show me a benchmark for gpt-oss-120b or something similar to that.

Looking at their blog, they in fact ran gpt-oss-120b: https://furiosa.ai/blog/serving-gpt-oss-120b-at-5-8-ms-tpot-...

I think Llama 3 focus mostly reflects demand. It may be hard to believe, but many people aren't even aware gpt-oss exists.


Many are aware, just can’t offload it onto their hardware.

The 8B models are easier to run on an RTX to compare it to local inference. What llama does on an RTX 5080 at 40t/s, Furiosa should do at 40,000t/s or whatever… it’s an easy way to have a flat comparison across all the different hardware llama.cpp runs on.


> we demonstrated running gpt-oss-120b on two RNGD chips [snip] at 5.8 ms per output token

That's 86 token/second/chip

By comparison, an H100 will do 2390 tokens/second/GPU

Am I comparing the wrong things somehow?

[1] https://inferencemax.semianalysis.com/


I think you are comparing latency with throughput. You can't take the inverse of latency to get throughput because concurrency is unknown. But then, the RNGD result is probably with concurrency=1.
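
A rough sketch of that arithmetic, assuming the quoted 5.8 ms TPOT is for a single request stream served across the two chips (the concurrency values below are purely illustrative):

    # Why per-stream latency can't simply be inverted into benchmark throughput.
    TPOT_S = 5.8e-3   # seconds per output token for one request stream (from the blog post)
    CHIPS = 2         # gpt-oss-120b was reportedly served on two RNGD chips

    single_stream_tps = 1 / TPOT_S                                     # ~172 tok/s for that one stream
    print(f"single stream: ~{single_stream_tps / CHIPS:.0f} tok/s/chip")  # ~86

    # Aggregate throughput is roughly concurrency / TPOT until the hardware saturates,
    # so the number comparable to a throughput benchmark depends entirely on batch size.
    for concurrency in (1, 8, 64):
        aggregate_tps = concurrency / TPOT_S
        print(f"concurrency={concurrency:3d}: ~{aggregate_tps / CHIPS:,.0f} tok/s/chip (if TPOT held)")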

I thought they were saying it was more efficient, as in tokens per watt. I didn’t see a direct comparison on that metric but maybe I didn’t look well enough.

Probably. Companies sell on efficiency when they know they lose on performance.

If you have an efficient chip you can just have more of them and come out ahead. This isn't a CPU where single core performance is all that important.

Only if the price is right...

Eh, if there's a human on the other side, single-stream performance is going to matter to them.

Right, but datacenters also very much operate on electrical cost so it’s not without merit.

Now I'm interested ...

It still kind of makes the point that you are stuck with a very limited range of models that they are hand implementing. But at least it's a model I would actually use. Give me that in a box I can put in a standard data center with normal power supply and I'm definitely interested.

But I want to know the cost :-)


The fact that so many people are focusing solely on massive LLMs is an oversight; they're narrowly focusing on a tiny (but very lucrative) subdomain of AI applications.

Namely killing people or surveilling people, dealer's choice.

For those wondering how this differs from Nvidia GPUs:

Nvidia = flexible, general-purpose GPUs that excel at training and mixed workloads. Furiosa = purpose-built inference ASICs that trade flexibility for much better cost, power efficiency, and predictable latency at scale.


These things never pan out.

The reasons why this almost never works is one of the following:

- They assume they can move hardware complexity (scheduling, access patterns, etc.) into software. The magic compiler/runtime never arrives.

- They assume their hard-to-program but faster architecture will get figured out by devs. It won't.

- They assume a certain workload. The workload changes, and their arch is no longer optimal or possibly even workable.

- But most importantly, they don't understand the fundamental bottleneck, which is usually memory bandwidth. Even if you increase the paper specs, like total FLOPS, FLOPS/W, etc., you're usually limited by how much you can read from memory - which is exactly as much as their competitors can. The way you can overcome this is with cleverness and complexity (cache lines, smarter algorithms, acceleration structures, etc.), but all of these require a complex computer to run, with all those coherent cache hierarchies, branching and synchronization logic. Which is why folks like NVIDIA keep going despite facing this constant barrage of would-be disruptors.

In fact this continues to become more and more true - memory bandwidth relies on transceivers at the chip edge, and if the size of the chip doesn't increase, bandwidth doesn't increase automatically on newer process nodes. Latency doesn't improve at all. But you get more transistors to play with, which you can use to run your workload more cleverly.
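
A minimal roofline-style sketch of that argument; the peak-compute and bandwidth figures are placeholders, not any particular chip's specs:

    # Roofline model: attainable throughput = min(peak compute, bandwidth * arithmetic intensity).
    PEAK_TFLOPS = 1000.0   # placeholder "paper spec" peak, in TFLOP/s
    BANDWIDTH_TBS = 3.0    # placeholder memory bandwidth, in TB/s

    def attainable_tflops(flops_per_byte: float) -> float:
        """You can't compute faster than you can feed the ALUs from memory."""
        return min(PEAK_TFLOPS, BANDWIDTH_TBS * flops_per_byte)

    # LLM decode reuses each weight byte only a handful of times per token,
    # so arithmetic intensity is low and the memory roof, not peak FLOPS, is the limit.
    for ai in (1, 10, 100, 1000):
        print(f"{ai:4d} FLOP/byte -> {attainable_tflops(ai):7.1f} TFLOP/s attainable")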

In fact I don't rule out the possibility of CPU based massively parallel compute making a comeback.


> - They assume their hard-to-program but faster architecture will get figured out by devs. It won't.

Or it will get figured out in the niche fields where people are willing to figure out really hard stuff to squeeze out max performance (PE, hedge funds, intelligence)

Either way agree, it's hard to get mass adoption without the software ecosystem feeding back in


And when you layer on top networking, it's another level of sw/hw complexity.

really weird graph where they're comparing to 3x H100 PCI-E which is a config I don't think anyone is using.

they're trying to compare at iso-power? I just want to see their box vs a box of 8 h100s b/c that's what people would buy instead, and they can divide tokens and watts if that's the pitch.


> they're trying to compare at iso-power?

Yeah, they are defining a "rack" as 15kW, though 3x H100 PCIe is only a bit over 1kW. So they are assuming GPUs are <10% of rack power usage, which sounds suspiciously low.
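
Rough arithmetic behind that, assuming the ~350 W TDP of an H100 PCIe card:

    H100_PCIE_W = 350       # approximate TDP of one H100 PCIe card
    GPUS = 3
    RACK_BUDGET_W = 15_000  # the 15 kW "rack" used in the comparison

    gpu_power_w = GPUS * H100_PCIE_W   # 1050 W, i.e. "a bit over 1 kW"
    print(f"{gpu_power_w} W of GPUs = {gpu_power_w / RACK_BUDGET_W:.0%} of a 15 kW rack")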


It would also depend on the purchase cost and cooling infrastructure cost. If this costs what a 3x H100 box costs then it’s a fair comparison even if not a direct comparison to what customers currently buy.

What's a more realistic config?

8x GPUs per box. This has been the data center standard for the last 8ish years.

furthermore usually NVLink connected within the box (SXM instead of PCIe cards, although the physical data link is still PCIe.)

this is important because the daughter board provides PCIe switches which usually connect NVMe drives, NICs and GPUs together such that within that subcomplex there isn't any PCIe oversubscription.

Since last year, for a lot of providers the standard is the GB200, I'd argue.


Fascinating! So each GPU is partnered with disk and NICs such that there's no oversubscription for bandwidth within its 'slice'? (idk what the word is) And each of these 8 slices wires up via NVLink back to the host?

Feels like there's some amount of (software) orchestration for making data sit on the right drives or traverse the right NICs; guess I never really thought about the complexity of this kind of scale.

I googled GB200, it's cool that Nvidia sells you a unit rather than expecting you to DIY a PC yourself.


Got excited, then I saw it was for inference. yawns

Seems like it would obviously be in TSMC's interest to give preferential tape-out and capacity access to Nvidia competitors; they benefit from having a less consolidated customer base bidding up their prices.


Everything is currently pointing towards inference being the main cost driver for LLMs in the future. Test-time-compute requires huge amounts of tokens in inference and makes providing frontier models as services unprofitable.

Anyone not under some kind of export restrictions can scrounge together some GPUs to train a frontier model (hell, even DeepSeek which is under these restrictions could) but providing a service that can compete with OpenAI et al. will prove to be quite costly. 3x improvements in inference are therefore nothing to sneeze at IMO.


My best guess after dipping my toe into semiconductor fabrication a decade ago is that there is a mysterious guru in a cave under a volcano who decides which customers get access to which nodes at which prices.

I think it's actually really cool to focus on efficiency over just raw performance! The page for the cards themselves goes into more detail and has a pretty nice graph: https://furiosa.ai/rngd

You can see them admit that RNGD will be slower than a setup with H100 SXM cards, but at the same time the tokens per second per watt is way better!

Actually, I wonder how different that is from Cerebras chips, since they're very much optimized for speed and one would think that'd also affect the efficiency a whole bunch: https://www.cerebras.ai/


Having only 48GB of RAM per card seems low. The full server system with 8 cards barely has enough RAM to run modern large open models. And batching together user requests eats quite a lot of memory, too. Curious to see how these machines and cards are received by the market.

Is it reasonable for me not to be able to read a single word of a text-based blog post because I don't have WebGL enabled?

you are not the target audience

whatever runs on typical investor/C-suite laptops and phones (so new iPhone/MacBook with "stock" Safari, maybe in corporate some cursed Windows setup with Chrome) is okay, and obviously they need to maxx out the glitter, it's the 2020s


I know people with iPhone 17 Pros who do not have WebGL enabled for sanitary infosec reasons :)

probably they don't want this site to be scraped by LLMs which would be kinda ironic


A fix for me in FF was toggling 'reader view'. They might be reasonable and it could be a bug.

It misses the most important information: price and how quickly they can ship. If they can actually deliver and take a slice of market share from Nvidia, that would make me happy.

The positioning makes sense, but I’m still somewhat skeptical.

Targeting power, cooling, and TCO limits for inference is real, especially in air-cooled data centers.

But the benchmarks shown are narrow, and it’s unclear how well this generalizes across models and mixed production workloads. GPUs are inefficient here, but their flexibility still matters.


Is this from 2024? It mentions "With global data center demand at 60 GW in 2024"

Also, there is no mention of the latest-gen NVDA chips: 5 RNGD servers generate tokens at 3.5x the rate of a single H100 SXM at 15 kW. This is reduced to 1.5x if you instead use 3 H100 PCIe servers as the benchmark.


The title sounds interesting but I get errors and no content on my iPhone 15 because it is unable to initialize WebGL. Why do people still link content to such capabilities? Where has simple HTML / CSS gone these days?

Edit: from comments and reading the one page that loads, this is still the 5nm tech they announced in 2024, hence the H100 comparison, which feels dated given the availability of GB300.


How usable is this in practice for the average non AI organization? Are you locked into a niche ecosystem that limits the options of what models you can serve?

Yes, but in principle it isn't that different from running on Trainium or Inferentia (it's a matter of degree), and plenty of non-AI organizations adopted Trainium/Inferentia.

The server seems cool but the networking seems insufficient for data centers.

How is this possible? Doing AI with "dual AMD EPYC processors". I thought you needed to have GPUs or something like that to do the matrix multiplications needed to train LLMs? Is that conventional wisdom wrong?

It uses its own chip under the hood; see the accelerator mentioned in the spec.

This is from September 2025, what's new?

> We are taking inquiries and orders for January 2026.

Hence the relevance, maybe.


What's new is HN discovered it. It wasn't posted in September 2025.

100%

People forget this is also a place of discussion and the comment section is usually peak value as opposed to the article itself.


So inference only and slower than B200s?

Maybe they are cheap.


Why is their website demanding WebGL?

that's a nice rack


