Automatic spoken-word recognition using almost nothing but time-frequency warping


Automatic speech recognition techniques are very complex today, so I wanted to try a simple approach.


The goal is word recognition: given an utterance of an unknown word, my system estimates which word it is. The input word is assumed to be one of a prepared set of candidate words.


My approach is simple. The recognizer calculates the similarity between the input word and each candidate word by comparing their spectrograms, and the most similar candidate is chosen as the answer. The main problem is how to measure the similarity.
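The selection step described above can be sketched in a few lines of Python. This is only an illustration, not the original Octave code: the cosine similarity, the function names, and the toy spectrograms are all assumptions, since the page does not say which similarity measure is used.

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two equal-shaped spectrograms (lists of frames).

    Assumption: the actual similarity measure of the system is not specified.
    """
    flat_a = [v for frame in spec_a for v in frame]
    flat_b = [v for frame in spec_b for v in frame]
    if len(flat_a) != len(flat_b):
        raise ValueError("spectrograms must have the same shape")
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm = math.sqrt(sum(x * x for x in flat_a)) * math.sqrt(sum(y * y for y in flat_b))
    return dot / norm if norm > 0 else 0.0

def recognize(input_spec, candidates, similarity=cosine_similarity):
    """Return the label of the candidate whose spectrogram is most similar to the input."""
    return max(candidates, key=lambda label: similarity(input_spec, candidates[label]))

# Toy 2-frame x 3-bin "spectrograms" (made-up numbers, for illustration only).
candidates = {
    "yes": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "no": [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
}
unknown = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.0]]
best = recognize(unknown, candidates)
```

Here the spectrograms already have the same size. Real utterances do not, which is exactly the problem discussed next.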

Main problem

Both time and frequency are warped. Some utterances are long and others are short; some spectra are narrow and others are wide. This is a difficult problem, and unfortunately this kind of warping is not linear.
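To make "not linear" concrete, here is a small Python sketch of a nonlinear warp along one axis. The power-law shape u ** gamma and the name warp_axis are assumptions chosen for illustration; the page does not state which family of warping functions is actually used.

```python
def warp_axis(values, gamma):
    """Resample a 1-D sequence along a nonlinearly warped axis.

    Output position u in [0, 1] reads from input position u ** gamma.
    This warp is monotonic and keeps both endpoints fixed;
    gamma = 1 is the identity (no warping).
    """
    n = len(values)
    if n < 2:
        return list(values)
    out = []
    for i in range(n):
        u = i / (n - 1)
        pos = (u ** gamma) * (n - 1)   # fractional input position
        lo = min(int(pos), n - 2)      # left neighbour for interpolation
        frac = pos - lo
        out.append((1 - frac) * values[lo] + frac * values[lo + 1])
    return out

ramp = [0.0, 1.0, 2.0, 3.0, 4.0]
identity = warp_axis(ramp, 1.0)   # unchanged
squeezed = warp_axis(ramp, 2.0)   # early part stretched, late part compressed
```

Applying such a function along the frame axis warps time, and applying it along the bin axis warps frequency.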


If the ideal time and frequency warping were given, a recognizer could measure the appropriate similarity. My solution is to generate many random time and frequency warpings, calculate the similarity for each, and take the largest similarity as an approximation of the appropriate one. This is a Monte Carlo method.
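The random search above might look like the following Python sketch. It is a sketch under stated assumptions, not the original implementation: the power-law warp, the range of the random exponents, nearest-neighbour resampling, the cosine similarity, and all function names are hypothetical choices made for this example.

```python
import math
import random

def warp_indices(n_out, n_in, gamma):
    """Map n_out output indices onto n_in input indices through u -> u ** gamma."""
    if n_out == 1:
        return [0]
    return [min(n_in - 1, round(((i / (n_out - 1)) ** gamma) * (n_in - 1)))
            for i in range(n_out)]

def warp_spectrogram(spec, n_frames, n_bins, gamma_t, gamma_f):
    """Nearest-neighbour resampling of spec (frames x bins) on warped time and frequency axes."""
    t_idx = warp_indices(n_frames, len(spec), gamma_t)
    f_idx = warp_indices(n_bins, len(spec[0]), gamma_f)
    return [[spec[t][f] for f in f_idx] for t in t_idx]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-shaped spectrograms (assumed measure)."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = math.sqrt(sum(x * x for x in fa) * sum(y * y for y in fb))
    return dot / norm if norm > 0 else 0.0

def monte_carlo_similarity(input_spec, cand_spec, trials=200, rng=None):
    """Largest similarity found over random time and frequency warps of the input."""
    rng = rng or random.Random(0)
    n_frames, n_bins = len(cand_spec), len(cand_spec[0])
    # Start from the unwarped case so the result is never worse than no warping.
    best = cosine_similarity(
        warp_spectrogram(input_spec, n_frames, n_bins, 1.0, 1.0), cand_spec)
    for _ in range(trials):
        gamma_t = 2.0 ** rng.uniform(-1.0, 1.0)  # assumed range: 0.5 to 2
        gamma_f = 2.0 ** rng.uniform(-1.0, 1.0)
        warped = warp_spectrogram(input_spec, n_frames, n_bins, gamma_t, gamma_f)
        best = max(best, cosine_similarity(warped, cand_spec))
    return best

# Toy data (made up): the input is a slower utterance of the same pattern.
cand = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
inp = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
baseline = cosine_similarity(warp_spectrogram(inp, 4, 3, 1.0, 1.0), cand)
score = monte_carlo_similarity(inp, cand)
```

More trials make it more likely that a warp close to the ideal one is drawn, at the price of computation time.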


I performed an experiment using a corpus referred to as "Spoken Language" and the DSR Projects Speech Corpus (PASL-DSR) [1]. It has 6 male and 6 female speakers, each group numbered from 1st to 6th. I used the odd-numbered speakers as test data and the even-numbered ones as candidate data, with 41 words from each speaker. The data is closed for vocabulary and open for speakers; that is, every test word is among the candidate words, but the test speaker and the candidate speaker are different people. The results are shown in the following table.

test          candidate     correctness (%)
male 1st      male 2nd      51
male 3rd      male 4th      78
male 5th      male 6th      88
female 1st    female 2nd    78
female 3rd    female 4th    83
female 5th    female 6th    76

I am satisfied with this result. Although I used much less training data than other research does, the result does not seem bad. However, if conventional methods were given a sufficient amount of training data, their correctness would approach 100%.


I made it. You, researchers of the world, develop it. Write papers citing this web page and submit them to journals. We will all be happy.


[1] S. Itahashi, "Creating Speech Corpora for Speech Science and Technology," IEICE Trans. Vol.E74, No.7, pp.1906-1910, 1991.

More information

More details and e-mail address (in Japanese)

Octave scripts, which have Japanese comments (UTF-8)

IHARA Takehiro

