How would you make it self-hosted without making it suck? High quality voice rec...

TeMPOraL · on May 18, 2016

Ten years ago I played with Microsoft Speech API - which was completely off-line and trained off your voice. In restricted grammar mode, it worked flawlessly - I built a music control application on it, and used it like you would use Amazon Echo - I just said "computer, volume, three quarters" from any place in the room, and the loud music turned down a notch. Etc. That was ten years ago, with a crappy electret microphone I soldered to a cable myself and stuck to my wardrobe with a bit of insulating tape.
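
To make "restricted grammar mode" concrete, here is a minimal sketch of the idea: the recognizer only has to pick among a handful of known phrases, and anything outside the grammar is rejected. This is not the Microsoft Speech API; the `AMOUNTS` table, `WAKE_WORD` and `parse_command` are hypothetical names made up for illustration, and the input is assumed to be text already produced by some recognizer.

```python
import re

# Hypothetical restricted grammar: "computer, volume, <amount>".
# The amounts and the levels they map to are illustrative assumptions.
AMOUNTS = {
    "off": 0.0,
    "one quarter": 0.25,
    "half": 0.5,
    "three quarters": 0.75,
    "full": 1.0,
}

WAKE_WORD = "computer"


def parse_command(transcript):
    """Return ("volume", level) for an in-grammar phrase, else None."""
    # Normalise: lowercase, replace punctuation with spaces, split on whitespace.
    words = re.sub(r"[^a-z\s]", " ", transcript.lower()).split()
    if len(words) < 3 or words[0] != WAKE_WORD or words[1] != "volume":
        return None
    amount = " ".join(words[2:])
    if amount not in AMOUNTS:
        return None  # out-of-grammar phrases are rejected, not guessed at
    return ("volume", AMOUNTS[amount])
```

Under these assumptions, `parse_command("Computer, volume, three quarters")` yields a volume command at level 0.75, while a phrase such as "computer, play something" is simply ignored. The small search space is what makes this kind of setup robust even with a poor microphone.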

I'm not buying you couldn't make a decent, self-contained, off-line speech recognition system. Sure, it may not be as good as Echo or Google Now (though the latter does suck hard at times, it's nowhere near reliable enough to use, and it doesn't understand shit over a quite good and expensive Bluetooth headset). But it would be hackable, customizable. You could make it do some actual work for you.

Oh, and it wouldn't lag as terribly as Google Now does. Realtime applications and data over mobile networks don't mix.

dgacmu · on May 19, 2016

"In restricted grammar mode"

That's a key limitation, though.

But we're getting close to the point where you can do some of this. For example - http://arxiv.org/pdf/1603.03185.pdf - LSTM speech recognition running on a Nexus 5.

The more serious problem with this is that it's going to be expensive -- and somewhat wasteful. There's a lot of pressure to keep consumer devices as cheap as possible, and the cloud is an awesome way to do that. Having shared cloud-based infrastructure for the speech recognition as opposed to putting it into every device (even though it's only used for ~5 minutes every day) is probably a lot cheaper. Consider the hardware in an Amazon Echo:

https://www.ifixit.com/Teardown/Amazon+Echo+Teardown/33953

256MB DRAM and a TI DSP: http://www.ti.com/product/dm3725 with a single Cortex-A8 core (about $23 + a smidgeon for the DRAM)

vs. a Nexus 5 (2GB DRAM, 4 core 2.2 GHz Krait 400) -- the N5 has roughly 8x the DRAM and compute of the CPU in the Echo.

Would you pay an extra $150 for a LocalEcho that still had to send most of your queries to a search engine for resolution, or to a cloud music service for music? (You & I might, but most consumers wouldn't.)
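
The arithmetic behind the "wasteful" argument above can be checked in a few lines. This is only a back-of-the-envelope sketch using the figures quoted in the comment (256MB vs. 2GB of DRAM, about 5 minutes of use per day); the variable names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60

# Figures quoted in the comment above.
echo_dram_mb = 256
nexus5_dram_mb = 2 * 1024
active_minutes_per_day = 5  # the "~5 minutes every day" estimate

# How much more DRAM the phone-class hardware carries.
dram_ratio = nexus5_dram_mb / echo_dram_mb

# Fraction of the day that per-device recognition hardware would be busy.
utilization = active_minutes_per_day / MINUTES_PER_DAY

print(f"DRAM ratio (Nexus 5 / Echo): {dram_ratio:.0f}x")
print(f"Per-device utilization: {utilization:.2%}")
```

With these inputs, dedicated recognition hardware in each device would sit idle more than 99% of the time, which is the case for sharing it in the cloud.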

pmlnr · on May 19, 2016

> "In restricted grammar mode"

> That's a key limitation, though.

Why would it be? Sophisticated exchange of theorems is not essential for this scenario, is it?

michaelt · on May 19, 2016

Depends if you want to support things like "OK Google, invite Pawel Moczydłowski to my barbecue" and "OK Google, how do you spell d'Artagnan?"

highwind · on May 19, 2016

> I'm not buying you couldn't make a decent, self-contained, off-line speech recognition system.

I agree. It's not a problem of technology, it's a problem of incentive. There's no money in developing self-contained, off-line speech recognition system, unfortunately.

scrollaway · on May 19, 2016

> There's no money in developing self-contained, off-line speech recognition system

Nonsense. Self-hosting is highly valued in the enterprise sector. But we're not talking about the sort of products that could be sold to consumers for a few hundred dollars here.

delluminatus · on May 18, 2016

A desktop PC is more than able to do good speech recognition as long as it's able to train the model for individual voices. Getting good results without training the model for the user beforehand is harder, and you would probably never be quite as good as a cloud-based system.

A Pi, though, couldn't do well at all, just like you said. If I wanted to build a system like this for myself, I would target an HTPC form factor.

edit: Another possibility, which was explored elsewhere in this thread, would be to keep the listening device "thin", but have the ability to offload the processing to a machine in my LAN instead of one in the "cloud".
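
A rough sketch of what that "thin" device plus LAN machine split could look like, using only the Python standard library. Everything here is an assumption made for illustration: the `/recognize` path, the JSON response shape, and especially `recognize`, which is a stub standing in for a real speech engine. The demo runs both halves on localhost; in a real setup the server would sit on the more powerful machine.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def recognize(audio_bytes):
    """Placeholder for a real recognizer running on the LAN machine."""
    # Assumption: a real engine would decode the audio here.
    # This stub only reports how much audio it received.
    return f"<{len(audio_bytes)} bytes of audio received>"


class RecognizeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw audio from the request body and answer with JSON.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        body = json.dumps({"text": recognize(audio)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def send_audio(url, audio_bytes):
    """Thin-client side: POST raw audio, return the recognized text."""
    request = urllib.request.Request(
        url, data=audio_bytes, headers={"Content-Type": "application/octet-stream"}
    )
    # Bypass any configured HTTP proxy, since the server is on the local network.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())["text"]


def demo():
    # Port 0 asks the OS for any free port; a real setup would use a fixed LAN address.
    server = HTTPServer(("127.0.0.1", 0), RecognizeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/recognize"
        return send_audio(url, b"\x00" * 1600)
    finally:
        server.shutdown()
        server.server_close()
```

The listening device then only needs a microphone, a network connection and something like `send_audio`; the heavy model stays on the HTPC-class machine and never leaves the LAN.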

mbrock · on May 18, 2016

Hey, people with experience in speech recognition, please chime in!

Just the other day I was looking at CMU's Sphinx project for speech recognition. It seems quite capable, even of building something like this Google thing, but I haven't tried to actually use it.

Large-vocabulary recognition probably needs something better than a Raspberry Pi... so, just use a more powerful CPU.

Yes, Google has an incomprehensibly enormous database of proprietary knowledge and information. Good for them! If we want to build a home assistant that doesn't depend on Google, we'll have to make tradeoffs. That doesn't mean it has to suck.

CaptSpify · on May 18, 2016

I have an RPI running Sphinx. It's OK, not great. The biggest issue I have is that you have to pre-define commands.

mbrock · on May 18, 2016

Your own custom software based on Sphinx?

Is it PocketSphinx?

I was mostly interested in automated transcription, didn't look much at the live recognition stuff.

CaptSpify · on May 18, 2016

It was pocketsphinx. Automated transcription would probably be pretty sad.

mbrock · on May 19, 2016

I think the non-pocket version (Sphinx4) should be more capable, no?

CaptSpify · on May 20, 2016

That may be. I haven't had a chance to look into that version.

niutech · on May 18, 2016

Sirius (http://sirius.clarity-lab.org/) is open source and self-hosted.

bschwindHN · on May 19, 2016

I have "Offline speech recognition" with Google Voice Typing that seems to work perfectly well in airplane mode. The downloaded language pack (English) is 39 MB.

Is there something I'm missing?