Skip to content

Speech To text

The speech to text area is composed by the following ROS nodes:

  1. AudioCapturer

    devices/AudioCapturer [python]: It is a node that captures the audio using PyAudio and publishes it to the topic rawAudioChunk.

  2. GetUsefulAudio

    devices/UsefulAudio [python]: A node that takes the chunks of audio and, using webrtcvad, checks for a voice, cut the audio and publishes it to the topic UsefulAudio. Webrtcvad approach was made as an alternative that donĀ“t remove silence but obtains the pieces of audio when someone speaks perfectly, it has a very good performance.

  3. Engine Selector

    action_selectors/hear [python]: This node receives the requests of STT. It checks if there is an internet connection, to know whether to call the offline or online engine; this can be overridden with FORCE_ENGINE parameter. * Online engine: it is in AzureSpeechToText node. For that, this node processes the audio of UsefulAudio do a resample of 16KHz and publishes it in a new topic called UsefulAudioAzure to relay it to that node. * Offline engine: it is in DeepSpeech node. For that, this node redirect the audio of UsefulAudio to a new topic called UsefulAudioDeepSpeech to relay it to that node.

  4. Speech to text Engines Both are nodes that receive the audio and convert it to text. The difference is that one is online and the other is offline.

    action_selectors/AzureSpeechToText [c++]: A node that takes the audio published in the topic UsefulAudioAzure and send it to the Azure SpeechToText API, receives the text and publishes it to the topic RawInput.

    action_selectors/DeepSpeech [python]: A node that takes the audio published in the topic UsefulAudioDeepSpeech and calls DeepSpeech2, converts it to text, and publishes it to the topic RawInput.