Raspberry Pi speech recognition

Posted on 8th January 2020

This is a follow-on post about speech recognition on a Raspberry Pi. Of course, Christmas and come and gone; perhaps I will finish this project for Christmas 2020! The original project was to make some voice activated Christmas tree lights. The original project had a push-button activation, but to compete with my son's new Alexa, I wanted to use a hotword wake-up instead.

The first attempt was to use Snowboy which is an open source, but slightly morribund project. The previous blog post details (with links) how to build a Python 3 compatible library.

If you create an account, you can train your own models. There is a crowd-sourcing aspect whereby with enough independent models, a "universal model" can be created. Unfortunately, it seems they didn't attract enough of a crowd, and so there are only a couple of universal models. I couldn't get the demos to run, which some debugging revealed to be due to the fact that my cheap USB microphone wouldn't record at 16Khz. The demo worked, but did nothing useful, if I changed this value. I experimented with the offline mode, by recording my own hotword, making my own model, and then resampling the inputs to 16Khz and feeding them back to the code. This worked! Some messing about with scipy eventually lead to some working code which on-the-fly resampled the audio.

  • My Raspberry Pi Zero is rather slow, and using resample_poly seemed to be faster, and yet sufficiently good.
  • An advantage of Snowboy is that it is quite fast, using less than 50% CPU load.
  • Another nice feature is that it classifies input audio into "silence", "talking" and "hotword" which will be useful once we try to "wake up" on the hotword, and then pass the next audio block to some general-purpose voice detection code / service.
  • A down-side is that there seems to be no "continuous" mode: you have to pass blocks of audio. I ended up with code which sent off 2 second windows, but overlapped these windows extensively, so I didn't miss a hotword. This has the downside of increasing the CPU usage.

A second attempt was to use Porcupine by Picovoice, a Canadian startup. This has the advantage of already being a PIP package, and so has a somewhat less painful installation process than Snowboy. Now forewarned about the need to resample to 16Khz, I adapted my code, and failed to achieve any working voice recognition. Only by carefully looking at the examples did I realise that the input needs to be list or tuple of Python ints each being a 16-bit sample. I had been passing bytes which instead need to be unpacked using the struct package. I sadly had no luck passing a numpy array directly (which would have been more elegant). I do need to learn more about how Python calls native code.

The end result works, but seems more CPU intensive than Snowboy, and while far from perfect, does seem a little more accurate.

The next steps would be to integrate the "wake up" code with some general speech recognition. Some links for the future:

  • https://pypi.org/project/SpeechRecognition/
  • https://github.com/Uberi/speech_recognition/blob/master/reference/pocketsphinx.rst

Recent posts