
So I can use this in claude code with `ollama run claude`?




Thank you, I had no idea ollama was so shady! I will start using llama.cpp directly.

More like `ollama launch claude --model qwen3.6:latest`

Also check your context size: Ollama defaults to 4K if you have <24 GB of VRAM, and you need 64K minimum if you want Claude Code to be able to at least lift a finger.
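You can raise that default either globally or per model. A sketch, assuming a reasonably recent ollama build (the env var and the Modelfile `PARAMETER num_ctx` directive are standard ollama features; the model tag and the derived model name are just illustrative):

```shell
# Option 1: raise the context window server-wide via an env var
OLLAMA_CONTEXT_LENGTH=65536 ollama serve

# Option 2: bake a 64K context into a derived model with a Modelfile
cat > Modelfile <<'EOF'
FROM qwen3.6:latest
PARAMETER num_ctx 65536
EOF
ollama create qwen3.6-64k -f Modelfile
ollama run qwen3.6-64k
```

Note that a bigger context window eats more VRAM for the KV cache, so on a small GPU this can push layers off to the CPU.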


If you're on a Mac, use the MLX backend versions, which are considerably faster than the GGML-based versions (including llama.cpp), and you don't need to fiddle with the context size. The models are `qwen3.6:35b-a3b-nvfp4`, `qwen3.6:35b-a3b-mxfp8`, and `qwen3.6:35b-a3b-mlx-bf16`.

I was comparing various models on an M5 Pro with 48GB RAM, MLX vs GGUF, and found that MLX models have a higher time to first token (sometimes by an order of magnitude), while tokens/sec and memory usage are the same as GGUF.

Gemma 3 27B q4:

* MLX: 16.7 t/s, 1220ms ttft

* GGUF: 16.4 t/s, 760ms ttft

Gemma 4 31B q8:

* MLX: 8.3 t/s, 25000ms ttft

* GGUF: 8.4 t/s, 1140ms ttft

Gemma 4 A4B q8:

* MLX: 52 t/s, 1790ms ttft

* GGUF: 51 t/s, 380ms ttft

All comparisons done in LM Studio, all versions of everything are the latest.
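Since t/s is basically equal in those numbers, the ttft gap is what decides end-to-end latency, and it only matters when ttft is large relative to decode time. A quick sketch of the arithmetic, using the Gemma 4 31B q8 row above and an assumed 500-token response:

```python
def total_seconds(ttft_ms: float, tps: float, n_tokens: int) -> float:
    """End-to-end generation time: time to first token plus decode time."""
    return ttft_ms / 1000 + n_tokens / tps

# Gemma 4 31B q8, generating 500 tokens (numbers from the table above):
mlx = total_seconds(25000, 8.3, 500)   # ~85.2 s
gguf = total_seconds(1140, 8.4, 500)   # ~60.7 s
print(f"MLX {mlx:.1f}s vs GGUF {gguf:.1f}s")
```

For the other rows, where ttft differs by only ~1s, the same arithmetic says the backends are effectively tied on anything longer than a short reply.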


I only have 16GB VRAM, and my system uses ~4GB from that. What are my options? I got this one: `Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf`

My system has 16 GB VRAM / 32 GB RAM, and ollama runs qwen3.6:latest at a decent speed just fine. The 35B model is a MoE, so I guess the whole model isn't fully offloaded to the GPU.
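A rough back-of-the-envelope for whether a given quant fits in ~12 GB of free VRAM. This is a sketch, not an exact formula: real GGUF files mix tensor precisions and carry overhead, the KV cache adds more on top, and a MoE only activates a few experts per token but the whole model still has to sit in memory. The bits-per-weight values are approximate figures I'm assuming for the quant families, not published numbers:

```python
def model_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a quantized model in GiB:
    parameter count (billions) times assumed average bits per weight."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

# 35B at ~2.1 bpw (roughly IQ2_XXS territory) vs ~4.5 bpw (roughly Q4 territory)
print(f"IQ2_XXS-ish: {model_gib(35, 2.1):.1f} GiB")  # fits in ~12 GB free VRAM
print(f"Q4-ish:      {model_gib(35, 4.5):.1f} GiB")  # needs partial CPU offload
```

Which is consistent with the thread: the ~2-bit quant fits on the GPU, while a 4-bit 35B spills into system RAM and leans on the MoE's small active-parameter count to stay fast.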

have you found a model that does this with usable speeds on an M2/M3?

On an M4 MBP, ollama's qwen3.5:35b-a3b-coding-nvfp4 runs incredibly fast in the claude/codex harness. M2/M3 should be similar.

It's incomparably faster than any other model (i.e. it's actually usable without cope). Caching makes a huge difference.




