Automatic Speech Recognition

Modeling classes for automatic speech recognition (ASR)

class pororo.tasks.automatic_speech_recognition.PororoAsrFactory(task: str, lang: str, model: Optional[str])[source]

Bases: pororo.tasks.utils.base.PororoFactoryBase

Recognizes speech in an audio file using a trained model. English, Korean, and Chinese are currently supported.

English (wav2vec.en)

  • dataset: LibriSpeech

  • metric: WER (clean: 1.9 / other: 4.3)

Korean (wav2vec.ko)

  • dataset: KsponSpeech

  • metric: CER (clean: 4.9 / other: 5.4)

Chinese (wav2vec.zh)

  • dataset: AISHELL-1

  • metric: CER (6.9)
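
The WER and CER figures above are edit-distance error rates. They are presumably percentages, although the source does not say so. The following is a minimal sketch of how such rates are commonly computed, not the evaluation script used for these models. The helper names edit_distance, wer, and cer are hypothetical, and removing spaces before computing CER is an assumption (conventions differ between benchmarks).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substituted word and one dropped word over six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
# One substituted character over five reference characters.
print(cer("가치가 없어", "가치가 업어"))
```

Multiplying the returned fraction by 100 gives a percentage that can be compared with figures reported in that form.
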

Parameters
  • audio_path (str) – path to the audio file to transcribe (WAV, FLAC, MP3, and PCM formats are supported)

  • top_db (int) – the threshold (in decibels) below reference to consider as silence

  • vad (bool) – flag indicating whether to use voice activity detection (VAD). If False, the audio is split on a decibel threshold before recognition. Applies only when the audio is longer than 50 seconds.

  • batch_size (int) – inference batch size

Returns

result of speech recognition

Return type

dict

Examples

>>> asr = Pororo(task='asr', lang='ko')
>>> asr('korean_speech.wav')
{
    'audio': 'example.wav',
    'duration': '0:00:03.297250',
    'results': [
        {
            'speech_section': '0:00:00 ~ 0:00:03',
            'length_ms': 3300.0,
            'speech': '이 책은 살 만한 가치가 없어'
        }
    ]
}
>>> asr = Pororo(task='asr', lang='en')
>>> asr('english_speech.wav')
{
    'audio': 'english_speech.flac',
    'duration': '0:00:12.195000',
    'results': [
        {
            'speech_section': '0:00:00 ~ 0:00:12',
            'length_ms': 12200.0,
            'speech': 'WELL TOO IF HE LIKE LOVE WOULD FILCH OUR HOARD WITH PLEASURE TO OURSELVES SLUICING
                       OUR VEIN AND VIGOUR TO PERPETUATE THE STRAIN OF LIFE BY SPILTH OF LIFE WITHIN US STORED'
        }
    ]
}
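
The returned dict can be post-processed with plain Python. The sketch below assumes the result has exactly the shape shown in the examples above: a 'duration' string of the form H:MM:SS.ffffff and a 'results' list whose items carry a 'speech' key. The helper names parse_duration and full_transcript are hypothetical and are not part of Pororo.

```python
from datetime import timedelta


def parse_duration(text):
    """Parse an 'H:MM:SS[.ffffff]' string into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=float(seconds))


def full_transcript(result):
    """Join the recognized text of every speech section, in order."""
    return " ".join(section["speech"] for section in result["results"])


# Result dict copied from the Korean example above.
result = {
    "audio": "example.wav",
    "duration": "0:00:03.297250",
    "results": [
        {
            "speech_section": "0:00:00 ~ 0:00:03",
            "length_ms": 3300.0,
            "speech": "이 책은 살 만한 가치가 없어",
        }
    ],
}

print(full_transcript(result))
print(parse_duration(result["duration"]).total_seconds())
```

Joining on 'results' matters for long recordings, where the audio may be split into several sections and each section contributes its own 'speech' entry.
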
static get_available_langs()[source]

static get_available_models()[source]

load(device: str)[source]

Load user-selected task-specific model

Parameters

device (str) – device information

Returns

User-selected task-specific model

Return type

object

class pororo.tasks.automatic_speech_recognition.PororoASR(model, config)[source]

Bases: pororo.tasks.utils.base.PororoSimpleBase

predict(audio_path: str, **kwargs) → dict[source]

Conduct speech recognition for audio in a given path

Parameters
  • audio_path (str) – path to the audio file

  • top_db (int) – the threshold (in decibels) below reference to consider as silence (default: 48)

  • batch_size (int) – inference batch size (default: 1)

  • vad (bool) – flag indicating whether to use voice activity detection (VAD). If False, the audio is split on a decibel threshold before recognition. Applies only when the audio is longer than 50 seconds.

Returns

result of speech recognition

Return type

dict
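
To illustrate what splitting on a decibel threshold means for top_db, here is a simplified pure-Python sketch. It is not Pororo's implementation. The function name split_on_silence, the frame_length parameter, and the choice of the loudest frame as the reference level are all assumptions made for illustration. A frame counts as silence when its RMS level lies more than top_db decibels below the reference.

```python
import math


def split_on_silence(samples, frame_length=4, top_db=48):
    """Return (start, end) sample-index pairs for the non-silent regions."""
    if not samples:
        return []

    # RMS energy of each non-overlapping frame.
    rms = []
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))

    reference = max(rms)  # loudest frame serves as the 0 dB reference
    if reference == 0:
        return []         # the whole signal is digital silence

    intervals = []
    region_start = None
    for i, value in enumerate(rms):
        level_db = 20 * math.log10(value / reference) if value > 0 else -math.inf
        is_speech = level_db > -top_db
        if is_speech and region_start is None:
            region_start = i * frame_length
        elif not is_speech and region_start is not None:
            intervals.append((region_start, i * frame_length))
            region_start = None
    if region_start is not None:
        intervals.append((region_start, len(samples)))
    return intervals


# Toy signal: silence, a loud burst, silence, then a quieter burst.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 4 + [0.3] * 4
print(split_on_silence(signal, frame_length=4, top_db=48))
print(split_on_silence(signal, frame_length=4, top_db=3))
```

In this sketch, a smaller top_db treats more of the quiet audio as silence: with top_db=3 the quieter burst (about 4.4 dB below the loudest frame) is dropped, while the default of 48 keeps both bursts.
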