
I'm curious and not an expert here, do you know why the TTFT is so much worse on Mac? To elaborate, the article just says that this step is compute bound, but I'm wondering whether it is just that simple or if it might also be less optimised in MLX?


Prefill (prompt processing) is compute bound doing large matrix operations. Token generation (aka tokens/s) is memory bandwidth bound.
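A rough roofline-style sketch of that distinction (not from the article): compare the FLOPs per byte of one weight matmul against the FLOPs per byte the hardware can sustain. The 4096 x 4096 layer, the fp16 storage, and the 100 TFLOP/s and 1 TB/s hardware figures are illustrative assumptions, not measurements of any chip discussed here.

```python
def arithmetic_intensity(n_tokens, d_in, d_out, bytes_per_weight=2, bytes_per_act=2):
    """FLOPs per byte moved for one [n_tokens, d_in] x [d_in, d_out] matmul."""
    flops = 2 * n_tokens * d_in * d_out              # one multiply + one add per term
    weight_bytes = d_in * d_out * bytes_per_weight   # weights are read once
    act_bytes = n_tokens * (d_in + d_out) * bytes_per_act  # inputs read, outputs written
    return flops / (weight_bytes + act_bytes)


def limiting_resource(n_tokens, d_in, d_out, peak_flops, bandwidth_bytes_per_s):
    """Return 'compute' or 'memory' depending on which side of the roofline we land."""
    machine_balance = peak_flops / bandwidth_bytes_per_s  # FLOPs the chip can do per byte
    intensity = arithmetic_intensity(n_tokens, d_in, d_out)
    return "compute" if intensity > machine_balance else "memory"


if __name__ == "__main__":
    # Hypothetical hardware: 100 TFLOP/s of matmul throughput, 1 TB/s of bandwidth.
    peak, bw = 100e12, 1000e9
    for label, n in [("decode (1 token)", 1), ("prefill (16k tokens)", 16384)]:
        print(label,
              round(arithmetic_intensity(n, 4096, 4096), 2), "FLOP/byte ->",
              limiting_resource(n, 4096, 4096, peak, bw), "bound")
```

With one token per step the weights are read once for about one FLOP per byte, so bandwidth sets the pace. With thousands of prompt tokens the same weights are reused across the whole batch, and matmul throughput becomes the limit.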

The RTX 5090 has an incredible amount of compute performance for matrix operations and a lot of memory bandwidth. The Apple Silicon parts have unusually high memory bandwidth for general purpose compute chips, which is why they can generate tokens so fast. Their raw matrix compute performance is amazing for their power envelope but not nearly as fast as a dedicated GPU consuming 400-500W.

Apple added tensor cores on the M5 generation which help with those matrix operations, which is why the M5 performs so much better than the M4 Max in that article.

Dedicated GPUs like the RTX 5090 are in another league, though.

You can see the divergence in the high resolution gaming benchmarks, too. Once he starts benchmarking at 4K or 6K where the CPU emulation stops being a bottleneck, the raw compute of the 5090 completely crushes any of the Apple Silicon GPUs.


The TTFT benchmarks don’t look right to me. I don’t use vLLM, but at 16k pre-fill, the M5 Max is 3.6 times faster than the M4 Max. The 5090 is surely faster, but the numbers in the article are not reflecting what I have seen thus far. Perhaps vLLM hasn’t been updated to use the new tensor APIs for Metal?

My point is this: The M5 should have reflected this in the charts, but it doesn’t. The situation on pre-fill is not nearly as bad as in the M4 generation.


Apple GPUs didn’t have tensor cores until the M5 (aka “a neural accelerator in each core”), and the article’s charts show that an M5 Pro significantly beats an M4 Max (while in other workloads it would be much smaller, since Pro is ~1/2 Max).

EDIT: since Aurornis beat me by 3 minutes, I’ll add another interesting tidbit instead :)

NVIDIA tensor cores on consumer GPUs are massively less powerful per SM core than on their datacenter counterparts (which also makes them easier to get to peak efficiency on consumer GPUs because the rest of the pipeline is much more quickly a bottleneck as per Amdahl’s Law).

This is potentially changing with Vera Rubin CPX, which looks an awful lot like an RTX 5090 replacement but with the full-blown datacenter tensor cores (that won’t be available unless you pay for the datacenter SKU) - so it will have very high TFLOPS relative to its bandwidth.

The target market for the CPX is exactly this: prefill and Time To First Token. You can basically just throw compute at the problem for (parts of) prefill performance (but it won’t help anything else past a certain point) and the 5090/M5 are nowhere near that limit.

So the design choice for NVIDIA/Apple/etc of how much silicon to spend for this on consumer GPUs is mostly dictated by economics and how much they can reuse the same chips for the different markets.


@Ademeure Where do you think the market will be, say, a year from now, when Apple has rolled out its M6 generation? Do you think one more process node and architecture revision will be enough to tip the balance so that local LLMs start to go mainstream?


Does that include stuff like the Pro Blackwell 6000? Or are the tensor cores as good per SM comparably? They perform quite well on many tests


Pro Blackwell 6000 is just a 5090 with more VRAM. It does not have the tcgen05 (5th gen tensor core) instructions despite the "5th gen tensor core" branding and thus does not support any optimized Blackwell (sm100) kernels.

Every Blackwell card other than the (G)B100, (G)B200, (G)B300 and Jetson Thor uses the Ampere tensor core instruction (mma.sync) but with fp4/6/8 added on. Beyond that, the DGX Spark (which is advertised as having the same architecture as B200) has especially weak (not tcgen05) tensor cores that have a very narrow operating window and low utilization.


> I'm curious and not an expert here, do you know why the TTFT is so much worse on Mac?

because the GPUs aren't as fantastic as everyone assumes?

> might also be less optimised in MLX?

prefill has gotta be one of the most optimized paths in MLX...


No you don't understand, on Apple Silicon my CPU has comparable memory bandwidth to a $400 Pascal-era GPU. With the unified memory architecture, that means my iGPU gets 2016-levels of DDR transfer speed with none of the upsides of CUDA. It's the most cutting-edge hardware ever put in a personal computer, without a doubt.


Please show me on the 2016-era $400 Pascal GPU where you can install the 256 GB of VRAM.


We're quite lucky that Nvidia didn't ship a 256gb system at sub-500gb/s transfer rate, is my point.


> Nvidia didn't ship a 256gb system at sub-500gb/s transfer rate

DGX Spark has 128 GB and only 273 GB/s BW. Are we lucky that NVIDIA did ship something even worse than what you specified? I'm confused.

People have been complaining [1] about how little VRAM NVIDIA ships with their GPUs for decades. Their whole game has been "oh, you want more VRAM? Buy more or pay us 50x for server grade with 10x as much VRAM. The more you buy, the more you save."

Apple did everyone a solid by shipping something way out of that distribution. We now know more than we did before! We know that a 284B parameter model with 13B active params (or 35B with 3B active, or 671B with 37B active) can outperform a 2T model and draw a fraction as much power. How can you think that's a bad thing?
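A back-of-the-envelope sketch of why the active parameter count matters so much on bandwidth-limited machines: each generated token has to stream roughly the active weights from memory once, so bandwidth divided by active weight bytes gives a ceiling on decode speed. The 273 GB/s and 13B-active figures are the ones quoted in this thread. The 4-bit (0.5 bytes per parameter) quantization is an assumption, and the estimate ignores KV-cache reads and other overhead, so real throughput will be lower.

```python
def decode_tokens_per_s_ceiling(bandwidth_bytes_per_s, active_params, bytes_per_param):
    """Upper bound on decode speed if each token streams all active weights once."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


if __name__ == "__main__":
    bandwidth = 273e9        # DGX Spark figure quoted in the thread
    bytes_per_param = 0.5    # assumed 4-bit quantization
    # 13B active (MoE) vs. a hypothetical dense model that reads all 284B weights.
    for label, params in [("13B active", 13e9), ("284B dense", 284e9)]:
        ceiling = decode_tokens_per_s_ceiling(bandwidth, params, bytes_per_param)
        print(f"{label}: at most ~{ceiling:.1f} tokens/s")
```

The same bandwidth gives a ceiling about 22 times higher when only 13B of the 284B parameters are touched per token, which is the whole appeal of pairing MoE models with large, moderately fast memory.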

You could point out that Apple didn't invent the idea of MoE. Everyone knows that. But other than Macs, there simply were no machines with >100GB VRAM directly coupled to ~50 TFLOP/s of compute until the DGX Spark last Dec. If you wanted to run a model with more than 32 GB of weights, you had to either pay up for dozens of GPUs idling at hundreds of watts or really pay up for some $50,000 server GPUs idling at... also 100-200W each.

I feel lucky to have a $3k machine on my shelf that can run DS4-Flash with 1M context at 20t/s while drawing ~150W and making very little noise. The best part? It idles at 30W with DS4 loaded, dropping to 6W after a reboot. There isn't a single GPU on the market that can match that in the same shoebox volume.

[1] https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRlOW0N...


The DGX Spark is also a niche, arbitrarily limited machine that will not displace serious datacenter workloads. It's targeted directly at the homelab LARPers and arguably a waste of money versus similarly priced GPU clusters. A 256gb Spark at LPDDR5X transfer rates would be a genuine travesty.

You can try to weasel out any sort of edge justification you want - these are not industry-grade machines. They are slow, expensive, bandwidth-constrained SOCs that don't hold a candle to either datacenter GPUs or even decade-old gaming GPUs. It's worth criticizing when Apple does it, and also worth criticism when Nvidia does it. The only difference being that Nvidia has natural datacenter buy-in, while Apple can't even justify their own hardware in the face of TPU inference costs: https://9to5mac.com/2026/03/02/some-apple-ai-servers-are-rep...


What even is an industry grade machine?

Would you own a computer if the smallest computer you were allowed to buy was a $27,000 Supermicro rack that draws 900 watts all the time?



