Comparing human and machine recognition performance on a VCV corpus
Human listeners outperform automatic speech recognition (ASR) systems in every speech recognition task. However, it is not clear where this human advantage originates. This paper investigates the role
of acoustic feature representations. We test four acoustic
representations (MFCCs, PLPs, Mel filterbanks, and rate maps),
each with and without 'pitch' information, using the same SVM backend.
The results are compared with listener results at the level
of articulatory feature classification. While no acoustic
feature representation reached the level of human
performance, both MFCCs and rate maps achieved good
scores, with rate maps nearing human performance on the
classification of voicing. A comparison of the most
difficult articulatory features to classify revealed similarities
between the humans and the SVMs: e.g., 'dental' was by far
the least well identified by both groups. Overall, adding pitch
information seemed to hamper classification performance.
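To make the experimental pipeline concrete, the following minimal sketch (not the authors' code) shows how two of the tested front-ends, MFCCs and Mel filterbanks, could be extracted with librosa and passed to a common SVM backend to classify one articulatory feature such as voicing. The file list, voicing labels, feature dimensions, and SVM hyperparameters are illustrative assumptions, not values taken from the paper.

import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def extract_features(wav_path, kind="mfcc", sr=16000):
    # Load one VCV token and return a fixed-length feature vector.
    y, sr = librosa.load(wav_path, sr=sr)
    if kind == "mfcc":
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    elif kind == "melfb":
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=26)
        feats = librosa.power_to_db(mel)
    else:
        raise ValueError("unknown front-end: " + kind)
    # Average over time so every token yields a vector of equal length.
    return feats.mean(axis=1)

def classify_voicing(wav_paths, voicing_labels, kind="mfcc"):
    # wav_paths and voicing_labels (1 = voiced, 0 = voiceless) are assumed
    # to come from the VCV corpus annotations.
    X = np.stack([extract_features(p, kind) for p in wav_paths])
    y = np.asarray(voicing_labels)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = SVC(kernel="rbf", C=1.0)  # the same SVM backend for every front-end
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

Swapping the kind argument (or adding further extractors) keeps the backend fixed, so any performance difference can be attributed to the acoustic representation itself.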