Hacker News

It pretty much just works: run the unsloth quant in llama.cpp and hook it up to pi. There are a few minor annoyances, like no support for setting thinking effort. It also defaults to "interleaved thinking" (thinking blocks get stripped from context); set `"chat_template_kwargs": {"preserve_thinking": True},` if you interrupt the model often and don't want it to forget what it was thinking.
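For anyone wiring this up by hand: `chat_template_kwargs` is passed per-request on the OpenAI-compatible chat completions endpoint that llama-server exposes. A minimal sketch of building that request body in Python, assuming the default llama-server port (8080) and a placeholder model name; note that once serialized to JSON the Python `True` becomes lowercase `true`:

```python
import json

# Hypothetical local llama-server endpoint (8080 is the default port).
url = "http://localhost:8080/v1/chat/completions"

# OpenAI-compatible chat request; "local" is a placeholder model name.
payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "Hello"}],
    # Keep thinking blocks in context across turns instead of stripping them.
    "chat_template_kwargs": {"preserve_thinking": True},
}

# Serialize for POSTing; JSON renders the flag as lowercase `true`.
body = json.dumps(payload)
print(body)
```

From here you would POST `body` to `url` with your HTTP client of choice; the point is just that the flag rides along inside the request JSON rather than being a server launch flag.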