I agree totally. My only problem is that local models running on my old Mac mini are much slower than, for example, Gemini 2.5 Flash. I have my Emacs setup so I can switch between a local model and one of the much faster commercial models.
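Not my exact Emacs config, but the trick that makes switching easy is that both ends can speak the same OpenAI-style chat-completions API, so it's mostly a matter of swapping the base URL and model name. A minimal Python sketch of that idea (the endpoint URLs, key handling, and model names are my assumptions, not taken from anyone's actual setup):

    # Sketch: one client, two backends. Assumes ollama's OpenAI-compatible
    # endpoint on localhost and Google's OpenAI-compatible Gemini endpoint;
    # URLs, key handling, and model tags are illustrative assumptions.
    from openai import OpenAI

    BACKENDS = {
        "local": {"base_url": "http://localhost:11434/v1",
                  "api_key": "ollama",            # ollama ignores the key
                  "model": "gpt-oss:20b"},
        "fast":  {"base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
                  "api_key": "YOUR_GEMINI_KEY",   # hypothetical placeholder
                  "model": "gemini-2.5-flash"},
    }

    def ask(backend: str, prompt: str) -> str:
        cfg = BACKENDS[backend]
        client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
        resp = client.chat.completions.create(
            model=cfg["model"],
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    print(ask("local", "One-line summary of mixture-of-experts models."))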
Someone else responded to you about working for a financial organization and not using public APIs - another great use case.
These being mixture-of-experts (MoE) models should help. The 20B model has only 3.6B params active at any one time, so, minus a bit of overhead, the speed should be like running a 3.6B model (while still requiring the RAM of a 20B model).
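A rough back-of-envelope of why that works out, using the ~20B-total / 3.6B-active figures above and the ~4.6-bit quant mentioned below (just arithmetic in Python, not a measurement):

    # Back-of-envelope MoE arithmetic with the figures from this thread:
    # total params set the RAM footprint, active params set per-token compute.
    total_params = 20e9         # all experts must be resident in memory
    active_params = 3.6e9       # only these are touched per generated token
    bytes_per_param = 4.6 / 8   # ~0.58 bytes/weight at a ~4.6-bit quant

    ram_gb = total_params * bytes_per_param / 1e9
    print(f"weights resident in RAM: ~{ram_gb:.1f} GB (plus KV cache and runtime overhead)")

    # Decode is largely memory-bandwidth bound, so per-token cost roughly
    # tracks the active parameters, not the total.
    print(f"rough decode speedup vs. a dense 20B model: ~{total_params / active_params:.1f}x")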
Here's the ollama version (a 4.6-bit quant, I think?) run with --verbose:
    total duration:       21.193519667s
    load duration:        94.88375ms
    prompt eval count:    77 token(s)
    prompt eval duration: 1.482405875s
    prompt eval rate:     51.94 tokens/s
    eval count:           308 token(s)
    eval duration:        19.615023208s
    eval rate:            15.70 tokens/s
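Those rate lines are just count divided by duration; a quick sanity check in Python with the numbers above:

    # Reproduce ollama's reported rates from the raw counts and durations.
    prompt_tokens, prompt_seconds = 77, 1.482405875
    gen_tokens, gen_seconds = 308, 19.615023208

    print(f"prompt eval rate: {prompt_tokens / prompt_seconds:.2f} tokens/s")  # ~51.94
    print(f"eval rate:        {gen_tokens / gen_seconds:.2f} tokens/s")        # ~15.70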
15 tokens/s is pretty decent for a low-end MacBook Air (M2, 24GB of RAM). Yes, it's not the ~250 tokens/s of 2.5 Flash, but for my use case anything above 10 tokens/s is good enough.