Modular Recurrent Neural Networks-based Mandarin Speech Recognition
|Keywords:||Mandarin speech recognition; modular recurrent neural network|
In this dissertation, three recurrent neural network (RNN)-based speech recognition schemes are proposed. The first is an RNN-based pre-classifier for improving the recognition speed of the HMM method. It first pre-classifies the input speech frames into three stable states (initial, final, and silence) and a transient state. It then sets more restrictive constraints in the recognition search for frames in these three stable states in order to prune unlikely paths. Experimental results confirmed that it can be used in conjunction with the beam search algorithm. The computational cost of the beam search algorithm is further reduced by dropping an additional 38.7% of the search states and by eliminating the likelihood calculations for an additional 35.1% of the Gaussian components, at the cost of a 0.1% degradation in the recognition rate. This confirms the efficiency of the proposed fast recognition method.
The second is a modular RNN (MRNN)-based method for isolated Mandarin syllable recognition. It first employs the "divide-and-conquer" principle to divide the complicated task of recognizing 1280 syllables into five subtasks: three discrimination subtasks, for 100 initials, 39 finals, and 5 tones, respectively, and two broad-class classification subtasks, for the three broad speech classes of initial, final, and silence, and for 9 initial sub-classes, respectively. It then uses five RNNs, one for each subtask. The outputs of these five RNNs are directly combined to form the discriminant functions for all 1280 syllables. The recognizer is further extended to include two MRNNs, one for forward time and one for backward time. Experimental results on a multi-speaker syllable recognition task confirmed that it outperformed the MCE/GPD-trained HMM method in both recognition complexity and accuracy. The MRNN system improved the base-syllable and syllable recognition rates from the 76.8% and 70.8% of the MCE/GPD-trained HMM to 82.8% and 76.3%, respectively.
The third is an MRNN-based method for continuous Mandarin base-syllable recognition. It extends the preceding MRNN method for isolated Mandarin syllable recognition with a syllable boundary detection module and a multi-level pruning recognition search algorithm. Experimental results on a speaker-dependent speech recognition task showed that the proposed method also outperformed the MCE/GPD-trained HMM method. The MRNN system achieved a base-syllable recognition rate of 85.8%, compared with 80.9% and 84.3% for the ML-trained and MCE/GPD-trained HMMs, respectively. In addition, only 53.5% of the surviving base-syllable states and 25.3% of the surviving base-syllable transitions needed to be considered in the multi-level pruning search, with no degradation in recognition accuracy. From the above discussion, we can therefore conclude that the RNN-based speech recognition approach is very promising for both isolated and continuous Mandarin speech.
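As an illustration of how the outputs of separate modules might be combined into syllable discriminant functions in the isolated-syllable MRNN method, the following is a minimal sketch. It is not the dissertation's implementation. The abstract does not give the exact combination rule, so the sum of log outputs used here is an assumption. The sketch also assumes that each module's outputs have already been accumulated over the utterance into one score per class, and it omits the two broad-class modules. The names `syllable_scores`, `recognize`, and the toy lexicon are hypothetical, and the toy sizes (3 initials, 2 finals, 2 tones) stand in for the 100 initials, 39 finals, and 5 tones of the real task.

```python
import math

EPS = 1e-12  # guards against log(0); an assumption of this sketch


def syllable_scores(initial_out, final_out, tone_out, lexicon):
    """Score every valid syllable from the per-module outputs.

    Assumption: the discriminant function of a syllable is the sum of
    the log outputs of its initial, final, and tone classes.
    """
    scores = {}
    for name, (i, f, t) in lexicon.items():
        scores[name] = (
            math.log(initial_out[i] + EPS)
            + math.log(final_out[f] + EPS)
            + math.log(tone_out[t] + EPS)
        )
    return scores


def recognize(initial_out, final_out, tone_out, lexicon):
    """Return the syllable with the largest discriminant score."""
    scores = syllable_scores(initial_out, final_out, tone_out, lexicon)
    return max(scores, key=scores.get)


# Toy lexicon (hypothetical): syllable -> (initial, final, tone) indices.
# Only valid combinations are listed, just as only 1280 of the possible
# initial/final/tone combinations are valid syllables in the real task.
LEXICON = {
    "ba1": (0, 0, 0),
    "ba4": (0, 0, 1),
    "ma1": (1, 0, 0),
    "mi4": (1, 1, 1),
}

# Toy module outputs for one utterance (made-up numbers).
INITIAL_OUT = [0.2, 0.7, 0.1]
FINAL_OUT = [0.8, 0.2]
TONE_OUT = [0.3, 0.7]

best = recognize(INITIAL_OUT, FINAL_OUT, TONE_OUT, LEXICON)
print(best)
```

Scoring only the entries of the lexicon keeps the search over valid syllables rather than over all initial/final/tone combinations, which is one reason a modular decomposition can be cheaper than training a single classifier with one output per syllable.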
|Appears in Collections:||Thesis|