Running pocketsphinx Speech Recognition on Ubuntu
I'm currently researching speech recognition for home automation projects. My primary requirement is an open source package that runs locally in a Linux environment. "Runs locally" means a cloud-based solution is not acceptable; I don't want random household chatter shipped off to the cloud.
I thought the pocketsphinx package looked promising. It's available in the Ubuntu repos, so I could try it on my desktop before setting up an embedded testbed.
Here's what I did to get it working.
First, install the following packages:
- pocketsphinx-utils -- the pocketsphinx runtime
- pocketsphinx-hmm-en-hub4wsj -- the "acoustic model"
- pocketsphinx-lm-en-hub4 -- the "language model"
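On Ubuntu, installing all three is a single command (package names may vary on newer releases):

```shell
sudo apt-get install pocketsphinx-utils \
    pocketsphinx-hmm-en-hub4wsj \
    pocketsphinx-lm-en-hub4
```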
For voice input, I used the microphone in the Logitech Webcam Pro 9000 connected to my system. It's a USB device, and with my Linux sound setup it gets routed to /dev/dsp.
To launch the speech recognition engine, run:
pocketsphinx_continuous \
    -hmm /usr/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k \
    -dict /usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic \
    -lm /usr/share/pocketsphinx/model/lm/en_US/hub4.5000.DMP
The engine will start and, once initialized, print READY.... and begin listening.
I said "hello world", and the output was:
Listening...
Stopped listening, please wait...
INFO: cmn_prior.c(121): cmn_prior_update: from < 56.00 -3.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 52.51 4.67 0.62 -2.06 -0.19 -0.28 -1.52 -0.17 -0.27 0.41 -0.86 0.20 -0.61 >
INFO: ngram_search_fwdtree.c(1549): 3967 words recognized (25/fr)
INFO: ngram_search_fwdtree.c(1551): 421042 senones evaluated (2665/fr)
INFO: ngram_search_fwdtree.c(1553): 601020 channels searched (3803/fr), 65303 1st, 106012 last
INFO: ngram_search_fwdtree.c(1557): 7675 words for which last channels evaluated (48/fr)
INFO: ngram_search_fwdtree.c(1560): 55724 candidate words for entering last phone (352/fr)
INFO: ngram_search_fwdtree.c(1562): fwdtree 0.17 CPU 0.107 xRT
INFO: ngram_search_fwdtree.c(1565): fwdtree 2.72 wall 1.719 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 120 words
INFO: ngram_search_fwdflat.c(937): 2878 words recognized (18/fr)
INFO: ngram_search_fwdflat.c(939): 137752 senones evaluated (872/fr)
INFO: ngram_search_fwdflat.c(941): 205146 channels searched (1298/fr)
INFO: ngram_search_fwdflat.c(943): 10852 words searched (68/fr)
INFO: ngram_search_fwdflat.c(945): 7092 word transitions (44/fr)
INFO: ngram_search_fwdflat.c(948): fwdflat 0.04 CPU 0.023 xRT
INFO: ngram_search_fwdflat.c(951): fwdflat 0.04 wall 0.023 xRT
INFO: ngram_search.c(1266): lattice start node <s>.0 end node </s>.148
INFO: ngram_search.c(1294): Eliminated 0 nodes before end node
INFO: ngram_search.c(1399): Lattice has 265 nodes, 1375 links
INFO: ps_lattice.c(1365): Normalizer P(O) = alpha(</s>:148:156) = -1053144
INFO: ps_lattice.c(1403): Joint P(O,S) = -1077247 P(S|O) = -24103
INFO: ngram_search.c(888): bestpath 0.00 CPU 0.003 xRT
INFO: ngram_search.c(891): bestpath 0.00 wall 0.003 xRT
000000000: hello world
READY....
That simple test worked, but unfortunately the speech recognition was extremely inaccurate. "Peter Piper picked a pack of pickled peppers" rendered as:
000000000: peter court for to see how are to people that were
If, however, I used a reduced language model instead of the large HUB4 language model above, recognition was very accurate.
I generated the language model files using the utility here: http://www.speech.cs.cmu.edu/tools/lmtool-new.html
The language model I generated contained just the phrases:
- hello computer
- turn light on
- turn light off
I downloaded the generated files, and restarted the engine with the -dict and -lm arguments updated to point to the newly generated files. Now, voice recognition was perfect, even when I muffled or slurred the words.
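For reference, the corpus I uploaded to lmtool was just those three phrases, one per line, and the restart looks like this. lmtool names its output files with a generated numeric ID; the "1234" below is a placeholder, yours will differ:

```shell
# corpus.txt uploaded to lmtool:
#   hello computer
#   turn light on
#   turn light off

# Restart with the downloaded dictionary and language model
# ("1234" is a placeholder for the ID lmtool assigns):
pocketsphinx_continuous \
    -hmm /usr/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k \
    -dict 1234.dic \
    -lm 1234.lm
```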
The next step is to try this on a Raspberry Pi. If that works, I'd like to have the Pi running continuously, decoding voice commands and sending triggers to a smart home hub.
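As a first sketch of that dispatcher, the idea is to read the hypothesis lines pocketsphinx_continuous prints (e.g. "000000000: turn light on"), strip the utterance index, and map each phrase to a trigger. The trigger names (greet, light_on, light_off) are hypothetical placeholders for whatever the hub expects:

```shell
#!/bin/sh
# Map a recognized phrase to a (hypothetical) home-automation trigger.
command_for() {
  case "$1" in
    "hello computer") echo "greet" ;;
    "turn light on")  echo "light_on" ;;
    "turn light off") echo "light_off" ;;
    *)                echo "unknown" ;;
  esac
}

# Hypotheses arrive as "000000000: turn light on"; strip everything up
# to the first ": " before dispatching. Here echo stands in for the
# engine's stdout; in practice this would be piped from
# pocketsphinx_continuous.
echo "000000000: turn light on" | while read -r line; do
  command_for "${line#*: }"
done
```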