Hacker News

I currently run qwen3.5-122B (Q4) on a Strix Halo (Bosgame M5) and am pretty happy with it. It's obviously much slower than hosted models: I get ~20 t/s with an empty context, dropping to about 14 t/s with 100k of context filled.

No tuning at all: just apt install rocm and rebuild llama.cpp every week or so.
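For anyone curious, a setup like the above might look roughly as follows. This is a hedged sketch, not the commenter's exact commands: it assumes a Debian/Ubuntu-style system where a rocm package is installable via apt, and it uses llama.cpp's HIP CMake flag and llama-server options as of recent releases (flag names have changed between versions). The model filename is a placeholder.

```shell
# Install the ROCm runtime and HIP toolchain (package name/availability
# varies by distro; assumption for this sketch).
sudo apt install rocm

# Build llama.cpp with the ROCm/HIP backend enabled.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Run the quantized model; -ngl 99 offloads all layers to the GPU,
# -c sets the context window. "model-q4.gguf" is a hypothetical filename.
./build/bin/llama-server -m model-q4.gguf -ngl 99 -c 100000
```

The weekly rebuild matters because llama.cpp's ROCm backend gets frequent performance fixes, so tracking master tends to pay off on new hardware like Strix Halo.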



