Title: 中文口述語言處理之進一步技術
Improved Techniques for Mandarin Spoken Language Processing
Authors: 王文俊
Wern-Jun Wang
Dr. Sin-Horng Chen
Keywords: 音韻模式;遞迴式類神經網路;音轉字;音長模式;說話速度;信號偏差移除;正交轉換;多關鍵詞偵測;prosodic modeling;recurrent neural network;speech-to-text;duration modeling;speaking rate;signal bias removal;orthogonal transform;multi-keyword spotting
Issue Date: 2000
Abstract: 本篇論文係針對中文口述語言處理所面臨的幾個關鍵問題提出進一步的解決方法。首先討論如何藉由音韻模式整合聲學解碼與語言解碼,以改進音轉字之正確率。對於音韻模式的分析研究,我們提出屬於非監督式訓練的向量量化(VQ)與自組特徵對應(SOFM),以及屬於監督式訓練的遞迴式類神經網路(RNN)等三個方法。在音轉字的實驗中,音韻模式是介於聲學解碼與語言解碼之間的一項處理,利用聲學解碼提供的音節邊界產生音韻特徵以偵測出詞邊界訊息,這些詞邊界訊息有助於語言解碼處理時產生最佳之詞串或字串輸出。實驗結果顯示利用音韻模式確實可有效地提高音轉字正確率及降低搜尋複雜度。其次,在利用音韻資訊進行音長模式之分析研究中,我們藉由考慮說話速度、音調與韻律狀態等音韻資訊以建立正確的音長模式。最佳相似值(maximum likelihood)之估計方式首先被運用來產生音節的音長模式,再進而延伸至以聲母/韻母與HMM狀態為基礎的音長模式。實驗結果顯示此方法可有效地分離上述三種影響因素。另外,整合所提出的音長模式於連續語音辨認的實驗結果也顯示此模式可有效地提高辨認率。最後,對於普遍存在於語音辨認應用中,因訓練與測試環境不匹配所導致的辨認正確率下降的問題,我們提出以正交轉換方法作頻道偏差之估計。此方法是以音段為基礎,有別於傳統方法的以音框為基礎之估計方式。由於正交轉換方法可產生不受頻道偏差影響之特徵,因此對頻道偏差之估計有很大的助益。實驗結果顯示此方法不僅提高了頻道偏差估計之正確率,更可以有效地降低系統複雜度。另外此方法與RNN切割方法結合的實驗結果更展示了此方法在高雜訊環境下的明顯效益,而此方法與RNN音韻模式結合並運用於多關鍵詞偵測的實驗結果也證實其有效性。
This dissertation addresses several issues in Mandarin spoken language processing. First, it discusses how prosodic modeling can be incorporated into the integration of acoustic decoding and linguistic decoding for speech-to-text conversion. Three prosodic modeling approaches are proposed and evaluated: two unsupervised training methods based on vector quantization (VQ) and the self-organizing feature map (SOFM), and a supervised training method based on a recurrent neural network (RNN). Prosodic modeling is performed as a post-processing stage of acoustic decoding. It detects word-boundary cues from prosodic features extracted from the test utterance, using the syllable boundaries determined by the preceding acoustic decoder, to help the subsequent linguistic decoder resolve word-boundary ambiguity. The detected word-boundary information is then used in linguistic decoding to help determine the best word (or character) sequence. Experimental results showed that the proposed prosodic model effectively reduced the computational complexity of the recognition search while slightly improving the recognition rate. Second, the use of prosodic information in duration modeling for speech recognition is studied. To build a precise duration model, a new duration modeling method that considers three affecting factors, namely speaking rate, lexical tone, and prosodic state, is proposed for HMM-based speech recognition. The study first introduces a maximum-likelihood (ML) estimated syllable duration model and then applies the same technique to model initial/final durations and HMM state durations. Experimental results confirmed that the method could isolate the effects of these three factors. The proposed duration modeling method was also incorporated into continuous Mandarin speech recognition, and experimental results showed it to be a promising approach. Lastly, the issue of minimizing the acoustic mismatch between training and testing environments is discussed. A novel orthogonal transform-based signal bias removal (OTSBR) approach is proposed to improve the accuracy of bias estimation for Mandarin speech recognition in adverse environments. Instead of the conventional frame-based process, this method uses a segment-by-segment process. Because the orthogonal transform-based bias-resistant features obtained by this method are bias-free, they are very useful for bias estimation. Experimental results confirmed that the proposed OTSBR method outperformed the conventional SBR method in both recognition rate and computational complexity. In addition, the OTSBR method is integrated with the previously proposed RNN-based pre-segmentation method for noisy speech to further improve noisy speech recognition. Experimental results confirmed that the combined method outperformed the RRSBR method in low-SNR environments. Finally, the OTSBR method is incorporated into a multi-keyword spotting system to improve keyword detection performance, and experimental results confirmed its effectiveness.
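Illustrative note: the abstract states that segment-based orthogonal-transform features are free of channel bias but does not give the transform. The following minimal sketch assumes an orthonormal DCT-II basis applied along time to each feature trajectory of a segment, and assumes the channel bias is a constant additive offset on every frame (as for cepstral features). The function names, segment length, and number of coefficients are hypothetical and are not taken from the dissertation. Under these assumptions, the constant offset changes only the zeroth-order coefficient, so the higher-order coefficients are unaffected by the bias.

```python
import numpy as np


def dct_basis(num_frames: int, num_coeffs: int) -> np.ndarray:
    """Return the first num_coeffs orthonormal DCT-II basis vectors as rows.

    Row 0 is constant; every other row sums to zero over the segment.
    Requires num_coeffs <= num_frames.
    """
    n = np.arange(num_frames)
    k = np.arange(num_coeffs)[:, None]
    basis = np.cos(np.pi * (n + 0.5) * k / num_frames)
    basis[0] *= np.sqrt(1.0 / num_frames)
    basis[1:] *= np.sqrt(2.0 / num_frames)
    return basis


def segment_transform(segment: np.ndarray, num_coeffs: int = 3) -> np.ndarray:
    """Project a (frames x dims) segment onto the basis, one trajectory per dim."""
    return dct_basis(segment.shape[0], num_coeffs) @ segment


# Synthetic segment: 12 frames of 4-dimensional features (stand-in for cepstra).
rng = np.random.default_rng(0)
clean = rng.normal(size=(12, 4))

# Assumed channel effect: the same additive offset on every frame.
bias = np.array([0.8, -0.3, 0.1, 0.5])
biased = clean + bias

c_clean = segment_transform(clean)
c_biased = segment_transform(biased)

# Coefficients of order >= 1 are identical with and without the bias.
higher_order_unchanged = np.allclose(c_clean[1:], c_biased[1:])

# The whole bias lands in the order-0 coefficient, scaled by sqrt(num_frames).
recovered_shift = (c_biased[0] - c_clean[0]) / np.sqrt(clean.shape[0])
```

In practice the clean segment is not available at test time. How the dissertation uses the bias-free coefficients to estimate the bias is not described in the abstract, so this sketch only demonstrates the invariance property.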
Appears in Collections: Thesis