Speechmatics - it is on the expensive side, but it supports a wide range of languages and the accuracy is phenomenal on all of them - even with multiple speakers.
Really great deep dive into a subtle yet impactful problem in voice AI. Turn detection is one of those things users only notice when it goes wrong, and this post does a brilliant job of showing how traditional VAD-based approaches fall short.
Loved the explanation of using instruction-tuned SLMs for <|im_end|> probability - elegant, efficient, and practical. The code examples are very handy too!
This is one of those posts I’ll be coming back to when thinking about latency-sensitive voice interfaces in my own projects.
There are three fantastic niche players in the speech-to-text market right now that you should check out:
- Deepgram (cheap and dirty, but accuracy is quite poor)
- Speechmatics (a bit pricier, but fantastic accuracy)
- AssemblyAI (just announced Series C funding of $50m)
The problem with this is Deepgram's accuracy (though I agree their speed/latency is excellent).
We used to use them too, but eventually we got so frustrated with the poor accuracy that we switched to Speechmatics - would definitely recommend checking them out.