Automatic Speech Recognition
Modeling classes for automatic speech recognition
class pororo.tasks.automatic_speech_recognition.PororoAsrFactory(task: str, lang: str, model: Optional[str])
Bases: pororo.tasks.utils.base.PororoFactoryBase
Recognizes speech in an audio file using a trained model. English, Korean, and Chinese are currently supported.
English (wav2vec.en)
dataset: LibriSpeech
metric: WER (clean: 1.9 / other: 4.3)
Korean (wav2vec.ko)
dataset: KsponSpeech
metric: CER (clean: 4.9 / other: 5.4)
Chinese (wav2vec.zh)
dataset: AISHELL-1
metric: CER (6.9)
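The metrics above are word error rate (WER) and character error rate (CER). As a rough illustration of how such scores are usually computed, the sketch below divides the edit distance between a reference and a hypothesis by the reference length and reports a percentage. The function names are hypothetical, and the tokenization (whitespace splitting for WER, spaces removed for CER) is an assumption; the source does not say how the reported figures were normalized.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent (assumes whitespace tokenization)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent (assumes spaces are ignored)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

With these definitions, one wrong word in a three-word reference gives a WER of about 33.3, and one wrong character in a four-character reference gives a CER of 25.0.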
- Parameters
audio_path (str) – path to the audio file to transcribe (supports WAV, FLAC, MP3, and PCM formats)
top_db (int) – threshold (in decibels) below the reference level at which audio is considered silence
vad (bool) – whether to use voice activity detection. If False, the audio is split on the dB threshold before speech recognition. Applies only when the audio is longer than 50 seconds.
batch_size (int) – inference batch size
- Returns
result of speech recognition
- Return type
dict
Examples
>>> asr = Pororo(task='asr', lang='ko')
>>> asr('korean_speech.wav')
{
    'audio': 'example.wav',
    'duration': '0:00:03.297250',
    'results': [
        {
            'speech_section': '0:00:00 ~ 0:00:03',
            'length_ms': 3300.0,
            'speech': '이 책은 살 만한 가치가 없어'
        }
    ]
}
>>> asr = Pororo(task='asr', lang='en')
>>> asr('english_speech.wav')
{
    'audio': 'english_speech.flac',
    'duration': '0:00:12.195000',
    'results': [
        {
            'speech_section': '0:00:00 ~ 0:00:12',
            'length_ms': 12200.0,
            'speech': 'WELL TOO IF HE LIKE LOVE WOULD FILCH OUR HOARD WITH PLEASURE TO OURSELVES SLUICING OUR VEIN AND VIGOUR TO PERPETUATE THE STRAIN OF LIFE BY SPILTH OF LIFE WITHIN US STORED'
        }
    ]
}
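The returned dictionary can be post-processed with plain Python. The following sketch joins the per-section transcripts and parses the duration string, using the Korean output above as sample data. The helper names `transcript` and `parse_duration` are hypothetical and not part of Pororo, and the `H:MM:SS.ffffff` duration format is inferred from the two examples only.

```python
from datetime import timedelta

# Sample result copied from the Korean example above.
result = {
    "audio": "example.wav",
    "duration": "0:00:03.297250",
    "results": [
        {
            "speech_section": "0:00:00 ~ 0:00:03",
            "length_ms": 3300.0,
            "speech": "이 책은 살 만한 가치가 없어",
        }
    ],
}


def transcript(asr_result: dict) -> str:
    """Join the text of every recognized section into one string."""
    return " ".join(section["speech"] for section in asr_result["results"])


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '0:00:03.297250' (assumed H:MM:SS[.ffffff])."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


full_text = transcript(result)
total_seconds = parse_duration(result["duration"]).total_seconds()
```

For long recordings that are split into several sections, `transcript` returns the sections in the order they appear in `results`.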
class pororo.tasks.automatic_speech_recognition.PororoASR(model, config)
Bases: pororo.tasks.utils.base.PororoSimpleBase
predict(audio_path: str, **kwargs) → dict
Conduct speech recognition for the audio file at the given path.
- Parameters
audio_path (str) – path to the WAV file
top_db (int) – threshold (in decibels) below the reference level at which audio is considered silence (default: 48)
batch_size (int) – inference batch size (default: 1)
vad (bool) – whether to use voice activity detection. If False, the audio is split on the dB threshold before speech recognition. Applies only when the audio is longer than 50 seconds.
- Returns
result of speech recognition
- Return type
dict
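To make the role of top_db more concrete, here is a minimal, dependency-free sketch of splitting audio on a dB threshold: frames whose level lies more than top_db decibels below the loudest frame are treated as silence. This is an illustration only, not Pororo's implementation. The function name `split_on_silence`, the non-overlapping framing, and the choice of the loudest frame as the reference level are all assumptions; only the default of 48 comes from the parameter list above.

```python
import math
from typing import List, Sequence, Tuple


def split_on_silence(
    samples: Sequence[float],
    top_db: float = 48.0,   # default taken from the predict() parameter list
    frame_length: int = 4,  # hypothetical frame size, tiny for demonstration
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges that are not silent."""
    # Root-mean-square level of each non-overlapping frame.
    rms = []
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))

    ref = max(rms, default=0.0)  # assumed reference: the loudest frame
    if ref == 0.0:
        return []  # empty or completely silent input

    intervals = []
    seg_start = None
    for i, value in enumerate(rms):
        # Level relative to the reference; the floor avoids log10(0).
        db = 20.0 * math.log10(max(value, 1e-10) / ref)
        voiced = db > -top_db
        if voiced and seg_start is None:
            seg_start = i * frame_length
        elif not voiced and seg_start is not None:
            intervals.append((seg_start, i * frame_length))
            seg_start = None
    if seg_start is not None:
        intervals.append((seg_start, len(samples)))
    return intervals
```

A larger top_db keeps quieter audio: a frame about 54 dB below the loudest one is dropped at the default of 48 but kept when top_db is raised to 60.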