> Also I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help

Not sure about ollama, but llama-server does have a transparent kv cache.

You can run it with

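    # What the flags do:
    #   -hf: download and cache the model from Hugging Face
    #   -c 0: use the model's full trained context length
    #   -fa: enable flash attention
    #   --jinja: use the model's built-in chat template
    #   --reasoning-format none: leave reasoning tokens inline in the output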
    llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja --reasoning-format none

Web UI at http://localhost:8080 (it also serves an OpenAI-compatible API)
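A quick sketch of the API side (the prompts and max_tokens here are made up; the point is that the two requests share a prompt prefix, which the server's kv cache can reuse on the second call):

    # Two calls whose prompts share a prefix; the server can reuse
    # its kv cache for the common tokens on the second request.
    curl http://localhost:8080/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"prompt": "You are a poet. Write a haiku about the sea.", "max_tokens": 64}'
    curl http://localhost:8080/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"prompt": "You are a poet. Write a haiku about the moon.", "max_tokens": 64}'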

