Posted on 8th January 2020
This is a follow-on post about speech recognition on a Raspberry Pi. Of course, Christmas has come and gone; perhaps I will finish this project for Christmas 2020! The original project was to make some voice-activated Christmas tree lights. It had push-button activation, but to compete with my son's new Alexa, I wanted to use a hotword wake-up instead.
The first attempt was to use Snowboy, which is an open source but slightly moribund project. The previous blog post details (with links) how to build a Python 3 compatible library.
If you create an account, you can train your own models.
There is a crowd-sourcing aspect whereby, with enough independent models, a "universal model" can be created. Unfortunately, it seems they didn't attract enough of a crowd, and so there are only a couple of universal models. I couldn't get the demos to run, which some debugging revealed was because my cheap USB microphone wouldn't record at 16 kHz. If I changed this value the demo ran, but did nothing useful. I experimented with the offline mode by recording my own hotword, making my own model, and then resampling the inputs to 16 kHz and feeding them back to the code. This worked! Some messing about with scipy eventually led to some working code which resampled the audio on the fly. The resample_poly function seemed to be faster, and yet sufficiently good.
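For illustration, here is a minimal sketch of that kind of resampling step using scipy.signal.resample_poly. It is not the code from the project: the helper name resample_to_16k, the mono int16 input and the 44.1 kHz source rate in the usage note are all assumptions made for the example.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16000  # sample rate the hotword detector expects


def resample_to_16k(samples, source_rate):
    """Resample a 1-D block of int16 samples from source_rate to 16 kHz.

    Hypothetical helper for illustration; assumes mono audio.
    """
    if source_rate == TARGET_RATE:
        return np.asarray(samples, dtype=np.int16)
    # Reduce the rate ratio to the smallest integer up/down factors.
    divisor = gcd(TARGET_RATE, source_rate)
    up = TARGET_RATE // divisor
    down = source_rate // divisor
    resampled = resample_poly(np.asarray(samples, dtype=np.float64), up, down)
    # Clip before converting back so filter overshoot cannot wrap around.
    return np.clip(np.round(resampled), -32768, 32767).astype(np.int16)
```

Called as resample_to_16k(block, 44100) on each block read from the microphone, this returns a shorter int16 array at 16 kHz. Resampling block by block like this ignores filter state at the block edges, which may or may not matter for hotword detection.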
A second attempt was to use Porcupine by Picovoice, a Canadian startup. This has the advantage of already being a pip package, and so has a somewhat less painful installation process than Snowboy. Now forewarned about the need to resample to 16 kHz, I adapted my code, and failed to achieve any working voice recognition. Only by carefully looking at the examples did I realise that the input needs to be a list or tuple of Python ints, each being a 16-bit sample. I had been passing bytes, which instead need to be unpacked using the struct package. I sadly had no luck passing a numpy array directly (which would have been more elegant). I do need to learn more about how Python calls native code.
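The bytes-to-ints conversion can be sketched with the standard struct module as follows. The helper name bytes_to_samples is made up for this example, and it assumes the raw buffer holds little-endian signed 16-bit mono samples, which is what a typical USB microphone on a Raspberry Pi would deliver but is not confirmed here.

```python
import struct


def bytes_to_samples(raw):
    """Unpack raw PCM bytes into a tuple of Python ints.

    Hypothetical helper; assumes little-endian signed 16-bit samples.
    """
    count = len(raw) // 2  # two bytes per 16-bit sample
    # "<" means little-endian, "h" is a signed 16-bit integer.
    return struct.unpack("<%dh" % count, raw[:count * 2])
```

The resulting tuple is the sort of value that could then be handed to the hotword detector in place of the raw bytes.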
The end result works, but seems more CPU intensive than Snowboy, and while far from perfect, does seem a little more accurate.
The next steps would be to integrate the "wake up" code with some general speech recognition. Some links for the future: