Running pocketsphinx Speech Recognition on Ubuntu

I'm currently researching speech recognition for home automation projects. My primary requirement is an open-source package that runs locally in a Linux environment. "Runs locally" means a cloud-based solution is not acceptable; I don't want random household chatter shipped off to the cloud.

I thought the pocketsphinx package looked promising. It's available in the Ubuntu repos, so I could try it on my desktop before setting up an embedded testbed.

Here's what I did to get it working.

First, install the following packages (an example install command follows the list):

  • pocketsphinx-utils -- the pocketsphinx runtime
  • pocketsphinx-hmm-en-hub4wsj -- the "acoustic model"
  • pocketsphinx-lm-en-hub4 -- the "language model"
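
On Ubuntu this boils down to a single apt-get line. The package names are the ones listed above; they may differ slightly between Ubuntu releases, so treat this as a sketch:

sudo apt-get install pocketsphinx-utils \
    pocketsphinx-hmm-en-hub4wsj \
    pocketsphinx-lm-en-hub4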

For voice input, I used the microphone in the Logitech Webcam Pro 9000 connected to my system. It's a USB device, and with my Linux sound setup it gets routed to /dev/dsp.
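
Before launching the recognizer, it's worth confirming that the system actually sees the microphone and can record from it. A quick sanity check with the standard ALSA tools (assuming alsa-utils is installed; nothing here is pocketsphinx-specific):

arecord -l                                 # list capture devices; the webcam mic should appear
arecord -f S16_LE -r 16000 -d 5 test.wav   # record 5 seconds of 16 kHz, 16-bit mono audio
aplay test.wav                             # play it back to confirm the recording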

To launch the speech recognition engine, run:

pocketsphinx_continuous \
    -hmm /usr/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k \
    -dict /usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic \
    -lm /usr/share/pocketsphinx/model/lm/en_US/hub4.5000.DMP
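
If the webcam microphone isn't the default capture device, the pocketsphinx_continuous builds I've used also accept an -adcdev argument to select an input device explicitly. The device name below is only an illustration; substitute whatever your sound setup exposes:

pocketsphinx_continuous \
    -adcdev /dev/dsp1 \
    -hmm /usr/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k \
    -dict /usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic \
    -lm /usr/share/pocketsphinx/model/lm/en_US/hub4.5000.DMP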

The engine will start and, once initialized, will show:

READY....

I said "hello world", and the output was:

Listening...
Stopped listening, please wait...
INFO: cmn_prior.c(121): cmn_prior_update: from < 56.00 -3.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 >
INFO: cmn_prior.c(139): cmn_prior_update: to   < 52.51  4.67  0.62 -2.06 -0.19 -0.28 -1.52 -0.17 -0.27  0.41 -0.86  0.20 -0.61 >
INFO: ngram_search_fwdtree.c(1549):     3967 words recognized (25/fr)
INFO: ngram_search_fwdtree.c(1551):   421042 senones evaluated (2665/fr)
INFO: ngram_search_fwdtree.c(1553):   601020 channels searched (3803/fr), 65303 1st, 106012 last
INFO: ngram_search_fwdtree.c(1557):     7675 words for which last channels evaluated (48/fr)
INFO: ngram_search_fwdtree.c(1560):    55724 candidate words for entering last phone (352/fr)
INFO: ngram_search_fwdtree.c(1562): fwdtree 0.17 CPU 0.107 xRT
INFO: ngram_search_fwdtree.c(1565): fwdtree 2.72 wall 1.719 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 120 words
INFO: ngram_search_fwdflat.c(937):     2878 words recognized (18/fr)
INFO: ngram_search_fwdflat.c(939):   137752 senones evaluated (872/fr)
INFO: ngram_search_fwdflat.c(941):   205146 channels searched (1298/fr)
INFO: ngram_search_fwdflat.c(943):    10852 words searched (68/fr)
INFO: ngram_search_fwdflat.c(945):     7092 word transitions (44/fr)
INFO: ngram_search_fwdflat.c(948): fwdflat 0.04 CPU 0.023 xRT
INFO: ngram_search_fwdflat.c(951): fwdflat 0.04 wall 0.023 xRT
INFO: ngram_search.c(1266): lattice start node <s>.0 end node </s>.148
INFO: ngram_search.c(1294): Eliminated 0 nodes before end node
INFO: ngram_search.c(1399): Lattice has 265 nodes, 1375 links
INFO: ps_lattice.c(1365): Normalizer P(O) = alpha(</s>:148:156) = -1053144
INFO: ps_lattice.c(1403): Joint P(O,S) = -1077247 P(S|O) = -24103
INFO: ngram_search.c(888): bestpath 0.00 CPU 0.003 xRT
INFO: ngram_search.c(891): bestpath 0.00 wall 0.003 xRT
000000000: hello world
READY....

That simple test looked good, but unfortunately the speech recognition was extremely inaccurate. "Peter Piper picked a pack of pickled peppers" rendered as:

000000000: peter court for to see how are to people that were

If, however, I generated a reduced language model and used it in place of the large HUB4 language model above, recognition was very accurate.

I generated the language model files using the utility here: http://www.speech.cs.cmu.edu/tools/lmtool-new.html

The language model I generated contained just these phrases (the corpus file I uploaded is shown after the list):

  • hello computer
  • turn light on
  • turn light off
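
The lmtool page takes a plain-text corpus with one phrase per line, so the uploaded file was simply this (the filename is just my choice):

corpus.txt:
hello computer
turn light on
turn light off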

I downloaded the generated files and restarted the engine with the -dict and -lm arguments updated to point to them. Now voice recognition was perfect, even when I muffled or slurred the words.
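
For reference, the restart looked roughly like this. lmtool names its output files with a generated number, so the 1234 filenames below are placeholders for whatever you actually download:

pocketsphinx_continuous \
    -hmm /usr/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k \
    -dict ~/lm/1234.dic \
    -lm ~/lm/1234.lm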

The next step is to try this on a Raspberry Pi. If that works, I'd like to have the Pi running, decoding voice commands, and sending triggers to a smart home hub.