Improved Techniques for Mandarin Spoken Language Processing
Dr. Sin-Horng Chen
|Keywords:||prosodic modeling; recurrent neural network; speech-to-text; duration modeling; speaking rate; signal bias removal; orthogonal transform; multi-keyword spotting|
This dissertation addresses several issues in Mandarin spoken language processing. First, it discusses how to incorporate prosodic modeling into the integration of acoustic decoding and linguistic decoding for speech-to-text conversion. Three prosodic modeling approaches are proposed and evaluated: two unsupervised training methods, based on vector quantization (VQ) and on the self-organizing feature map (SOFM), and one supervised training method based on a recurrent neural network (RNN). Prosodic modeling is performed in the post-processing stage of acoustic decoding and aims to detect word-boundary cues in the input utterance, which help the subsequent linguistic decoder resolve word-boundary ambiguity. The model can effectively detect word-boundary information from prosodic features extracted from the test utterance, using the syllable boundaries determined by the preceding acoustic decoder. The detected word-boundary information is then used in linguistic decoding to help determine the best word (or character) sequence. Experimental results showed that the proposed prosodic model effectively reduced the computational complexity of the recognition search while slightly improving the recognition rate. Second, the use of prosodic information in duration modeling for speech recognition is studied. To obtain a precise duration model, a new duration modeling method that accounts for three factors (speaking rate, lexical tone, and prosodic state) is proposed for HMM-based speech recognition. The study first introduces a maximum-likelihood (ML) estimated syllable duration model and then applies the same technique to model initial/final durations and HMM state durations. Experimental results confirmed that the method could isolate the effects of these three major factors. The incorporation of the proposed duration modeling method into continuous Mandarin speech recognition has also been studied.
Experimental results showed that it was a promising approach. Last, the issue of minimizing the acoustic mismatch between training and testing environments is discussed. A novel orthogonal transform-based signal bias removal (OTSBR) approach is proposed to improve the accuracy of bias estimation for Mandarin speech recognition in adverse environments. Instead of the conventional frame-based process, this method uses a segment-by-segment process. Because the orthogonal transform-based bias-resistant features obtained by this method are bias-free, they are very useful for bias estimation. Experimental results confirmed that the proposed OTSBR method outperformed the conventional SBR method in both recognition rate and computational complexity. In addition, the OTSBR method is integrated with the previously proposed RNN-based pre-segmentation method for noisy speech to further improve noisy speech recognition. Experimental results confirmed that it outperformed the RRSBR method in low-SNR environments. Furthermore, the OTSBR method is incorporated into a multi-keyword spotting system to improve keyword detection performance. Its effectiveness was confirmed by experimental results.
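The claim that orthogonal transform-based features can be bias-free may be illustrated with a small numerical sketch. The abstract does not name the transform or the feature set used in OTSBR, so the following is only an assumption-laden illustration: it takes a DCT-II along the time axis of one segment as a stand-in orthogonal transform, and the function names `dct_over_time` and `add_bias` and the sample values are hypothetical. It shows that a constant additive bias on every frame of a segment changes only the zeroth-order coefficient, while all higher-order coefficients are unaffected.

```python
import math


def dct_over_time(segment):
    """DCT-II along the time axis of one segment (assumed stand-in transform).

    segment: list of frames, each a list of feature values (e.g. cepstra).
    Returns coeffs[k][d], the k-th order coefficient of feature dimension d.
    """
    n = len(segment)
    dims = len(segment[0])
    coeffs = []
    for k in range(n):
        # k-th basis function sampled at the n frame positions
        basis = [math.cos(math.pi * k * (t + 0.5) / n) for t in range(n)]
        coeffs.append(
            [sum(basis[t] * segment[t][d] for t in range(n)) for d in range(dims)]
        )
    return coeffs


def add_bias(segment, bias):
    """Add the same bias vector to every frame (a stationary channel offset)."""
    return [[x + b for x, b in zip(frame, bias)] for frame in segment]


# Hypothetical 4-frame segment with 2 feature dimensions.
segment = [[1.0, 0.2], [1.5, 0.1], [2.5, -0.3], [2.0, 0.0]]
bias = [0.7, -0.4]

clean = dct_over_time(segment)
biased = dct_over_time(add_bias(segment, bias))

# Largest change in any coefficient of order k >= 1 (zero up to rounding).
max_change_higher_orders = max(
    abs(clean[k][d] - biased[k][d])
    for k in range(1, len(segment))
    for d in range(len(bias))
)
# Change in the zeroth-order coefficient: segment length times the bias.
zeroth_order_shift = [biased[0][d] - clean[0][d] for d in range(len(bias))]
```

The higher-order basis functions each sum to zero over the segment, which is why a frame-independent offset cancels out of those coefficients. Whether OTSBR uses this particular transform, or which coefficients it retains, is not stated in the abstract.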
|Appears in Collections:||Thesis|