How would you make it self-hosted without making it suck? High quality voice recognition in a small box doesn't seem to be a thing that's even remotely possible today, let alone the query processing and knowledge database that comes with it.
You could build this on a pi with a mic, speakers, some foss stt and tts engines and some basic training data. But it'll suck.
Ten years ago I played with Microsoft Speech API - which was completely off-line and trained off your voice. In restricted grammar mode, it worked flawlessly - I built a music control application on it, and used it like you would use Amazon Echo - I just said "computer, volume, three quarters" from any place in the room, and the loud music turned down a notch. Etc. That was ten years ago, with a crappy electret microphone I soldered to a cable myself and stuck to my wardrobe with a bit of insulating tape.
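To illustrate why a restricted grammar is so much easier than open dictation, here is a minimal sketch of the matching step, assuming the speech engine has already produced a text transcript. The wake word, the single "volume" command, and the amount table are hypothetical names chosen for this example; this is not how Microsoft Speech API represents its grammars.

```python
import re

# Hypothetical fixed grammar: "<wake word> <command> <amount>".
WAKE_WORD = "computer"
AMOUNTS = {
    "zero": 0.0,
    "one quarter": 0.25,
    "half": 0.5,
    "three quarters": 0.75,
    "full": 1.0,
}

def parse_command(transcript):
    """Return (command, level) if the transcript fits the grammar, else None."""
    # Lowercase and replace punctuation with spaces so
    # "computer, volume, three quarters" splits into plain words.
    words = re.sub(r"[^a-z ]", " ", transcript.lower()).split()
    if len(words) < 3 or words[0] != WAKE_WORD:
        return None
    command = words[1]
    amount = " ".join(words[2:])
    # Anything outside the tiny grammar is rejected rather than guessed at.
    if command != "volume" or amount not in AMOUNTS:
        return None
    return command, AMOUNTS[amount]
```

Because the recognizer only has to choose between a handful of phrases, near-misses can simply be dropped, which is a large part of why this mode can feel reliable even with a poor microphone.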
I'm not buying you couldn't make a decent, self-contained, off-line speech recognition system. Sure, it may not be as good as Echo or Google Now (though the latter does suck badly at times, it's nowhere near reliable enough to use, and it doesn't understand shit over a quite good and expensive Bluetooth headset). But it would be hackable, customizable. You could make it do some actual work for you.
Oh, and it wouldn't lag as terribly as Google Now does. Realtime applications and data over mobile networks don't mix.
But we're getting close to the point where you can do some of this. For example - http://arxiv.org/pdf/1603.03185.pdf - LSTM speech recognition running on a Nexus 5.
The more serious problem with this is that it's going to be expensive -- and somewhat wasteful. There's a lot of pressure to keep consumer devices as cheap as possible, and the cloud is an awesome way to do that. Having shared cloud-based infrastructure for the speech recognition as opposed to putting it into every device (even though it's only used for ~5 minutes every day) is probably a lot cheaper. Consider the hardware in an Amazon Echo:
vs. a Nexus 5 (2 GB DRAM, 4-core 2.2 GHz Krait 400) -- the N5 has roughly 8x the DRAM and compute of the CPU in the Echo.
Would you pay an extra $150 for a LocalEcho that still had to send most of your queries to a search engine for resolution, or to a cloud music service for music? (You & I might, but most consumers wouldn't.)
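To put the "~5 minutes every day" figure in perspective, here is a quick back-of-the-envelope calculation. The only input is that rough 5-minute estimate; the sharing number is an idealized upper bound that ignores peak hours, latency headroom, and network costs.

```python
MINUTES_PER_DAY = 24 * 60

def utilization(active_minutes_per_day):
    """Fraction of the day a dedicated on-device recognizer is actually busy."""
    return active_minutes_per_day / MINUTES_PER_DAY

def devices_per_shared_unit(active_minutes_per_day):
    """Idealized count of devices one always-busy shared recognizer could serve."""
    return MINUTES_PER_DAY / active_minutes_per_day

print(f"utilization: {utilization(5):.2%}")
print(f"devices per shared unit: {devices_per_shared_unit(5):.0f}")
```

Real usage is bursty, so the practical sharing ratio would be lower, but the gap between a mostly idle local chip and pooled hardware is the point of the cost argument.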
> I'm not buying you couldn't make a decent, self-contained, off-line speech recognition system.
I agree. It's not a problem of technology, it's a problem of incentive. There's no money in developing a self-contained, off-line speech recognition system, unfortunately.
> There's no money in developing a self-contained, off-line speech recognition system
Nonsense. Self-hosting is highly valued in the enterprise sector. But we're not talking about the sort of products that could be sold to consumers for a few hundred dollars here.
A desktop PC is more than able to do good speech recognition as long as it's able to train the model for individual voices. Getting good results without training the model for the user beforehand is harder, and you would probably never be quite as good as a cloud-based system.
A Pi, though, couldn't do well at all, just like you said. If I wanted to build a system like this for myself, I would target an HTPC form factor.
edit: Another possibility, which was explored elsewhere in this thread, would be to keep the listening device "thin", but have the ability to offload the processing to a machine in my LAN instead of one in the "cloud".
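A rough sketch of what the "thin" device side of that could look like, assuming raw audio is shipped over a plain TCP connection with a 4-byte length prefix per chunk. The framing scheme and function names are made up for this example; the recognizer on the LAN machine is out of scope here.

```python
import socket
import struct

def send_audio(sock, pcm_bytes):
    """Send one chunk of raw audio, prefixed with its length (big-endian uint32)."""
    sock.sendall(struct.pack("!I", len(pcm_bytes)) + pcm_bytes)

def recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)

def recv_audio(sock):
    """Receive one length-prefixed audio chunk (what the LAN machine would run)."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

The listening device only needs a microphone and enough CPU to call `send_audio`; everything expensive happens on whichever machine calls `recv_audio` and feeds the bytes to a recognizer.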
Hey, people with experience in speech recognition, please chime in!
Just the other day I was looking at CMU's Sphinx project for speech recognition. It seems quite capable, even of building something like this Google thing, but I haven't tried to actually use it.
Large-vocabulary recognition probably needs something better than a Raspberry Pi... so, just use a more powerful CPU.
Yes, Google has an incomprehensibly enormous database of proprietary knowledge and information. Good for them! If we want to build a home assistant that doesn't depend on Google, we'll have to make tradeoffs. That doesn't mean it has to suck.
I have "Offline speech recognition" with Google Voice Typing that seems to work perfectly well in airplane mode. The downloaded language pack (English) is 39 MB.