Title: 針對聲學訊號的調變處理與學習
Modulation Processing and Learning for Acoustic Signals
Authors: 徐忠謙
Hsu, Chung-Chien
冀泰石
Chi, Tai-Shih
Institute of Communications Engineering
Keywords: dictionary learning;discriminative learning;DL-NMF;hierarchical structure;high-dimensional Wiener filter;modulation SNR;L-NMF;M-NMF;NMF;speech quality;speech intelligibility;spectro-temporal modulation;speech enhancement;speech separation;Wiener filter
Issue Date: 2015
Abstract: This thesis first proposes a multiresolution spectro-temporal modulation process for the Fourier spectrogram. The process analyzes and characterizes, at different resolutions, the temporal dynamics and spectral structures that pertain to speech properties (e.g., pitch, harmonicity, formants, amplitude modulation (AM), frequency modulation (FM), and onsets/offsets), and it generates a multidimensional spectro-temporal representation of the sound. Based on this representation, a single-channel high-dimensional Wiener filter in the spectro-temporal modulation domain is presented to enhance speech. The proposed method reduces noise and enhances the “textures” of the speech signal simultaneously. Unlike traditional noise reduction techniques, it improves not only speech quality but also speech intelligibility. In addition, a voice activity detection (VAD) algorithm is proposed that is based on a specific spectro-temporal modulation component, namely the frequency modulation of harmonics. It uses a functional SNR instead of the conventional SNR to detect active speech regions. Simulation results demonstrate that, under nonstationary noise, the proposed VAD performs significantly better than three standard VADs in terms of receiver operating characteristic (ROC) curves and the recognition rates of a practical distributed speech recognition (DSR) system.
Nonnegative matrix factorization (NMF) has been successfully adopted in many speech-related applications. It derives a parts-based representation of observed data using a shallow underlying structure. Three new hierarchical NMF-based methods, the layered NMF (L-NMF), the modulation NMF (M-NMF), and the discriminative layered NMF (DL-NMF), are proposed in this thesis to represent spectral data. The proposed L-NMF consists of several layers of standard NMF blocks and is fine-tuned by minimizing the propagated reconstruction error. By combining the parts-based bases extracted by single-layer NMF, L-NMF builds more complex bases that interpret the data differently. L-NMF is evaluated in a supervised source separation task and outperforms standard NMF in terms of the source-to-distortion ratio (SDR). The M-NMF is developed by incorporating the NMF technique into the modulation process: non-negative sparse coding (NNSC) is applied in each modulation subband to decompose the resolved spectrogram and suppress noise. Because noise does not degrade the speech signal uniformly across the modulation domain, it is reasonable to learn a dictionary in each modulation subband. Simulation results show that the proposed M-NMF outperforms standard NNSC in terms of PESQ scores in semi-supervised noise reduction experiments. The proposed modulation process, L-NMF, and M-NMF all behave like generative models, which attempt to find good representations of the data. However, generative models, while universal, do not guarantee good performance in every task. For further improvement, task-dependent discriminative criteria must be incorporated to adapt the generative-like model directly. Therefore, a discriminative L-NMF (DL-NMF) algorithm is proposed to test this idea, and it improves on L-NMF in the supervised source separation task.
Finally, we revisit estimation of the ideal binary mask (IdBM), which is considered the primary goal of computational auditory scene analysis (CASA) systems, using enhanced modulation features. These features take human sensitivity to different modulations into account and are built from modulation SNRs estimated directly from noisy signals. Experiments demonstrate that the proposed algorithm outperforms the AMS-GMM system in estimating the IdBM in terms of the HIT-FA rate.
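The HIT-FA rate named at the end of the abstract can be illustrated with a minimal Python sketch. It assumes the definitions commonly used in the binary-mask literature, which the abstract does not spell out: a time-frequency unit of the ideal mask is 1 when its local SNR exceeds a local criterion, HIT is the fraction of target-dominant units labeled 1 by the estimated mask, and FA is the fraction of noise-dominant units wrongly labeled 1. The function names, the toy energies, and the 0 dB criterion are illustrative assumptions and are not taken from the thesis.

```python
import math


def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Return a 0/1 mask: 1 where the local SNR in dB exceeds lc_db.

    target_energy and noise_energy are equally shaped lists of rows holding
    nonnegative per-unit energies (hypothetical toy inputs, not thesis data).
    """
    mask = []
    for t_row, n_row in zip(target_energy, noise_energy):
        row = []
        for t, n in zip(t_row, n_row):
            if t <= 0.0:
                snr_db = -math.inf      # no target energy in this unit
            elif n <= 0.0:
                snr_db = math.inf       # target present, no noise
            else:
                snr_db = 10.0 * math.log10(t / n)
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


def hit_fa(ideal, estimated):
    """Return (HIT, FA, HIT - FA) for an estimated mask against the ideal one."""
    hits = misses = false_alarms = correct_rejections = 0
    for i_row, e_row in zip(ideal, estimated):
        for i, e in zip(i_row, e_row):
            if i == 1 and e == 1:
                hits += 1
            elif i == 1:
                misses += 1
            elif e == 1:
                false_alarms += 1
            else:
                correct_rejections += 1
    n_target = hits + misses
    n_noise = false_alarms + correct_rejections
    hit_rate = hits / n_target if n_target else 0.0
    fa_rate = false_alarms / n_noise if n_noise else 0.0
    return hit_rate, fa_rate, hit_rate - fa_rate


# Toy 2x3 example: per-unit energies of the premixed target and noise.
target = [[4.0, 1.0, 0.5], [9.0, 0.1, 2.0]]
noise = [[1.0, 2.0, 1.0], [1.0, 1.0, 1.0]]
ideal = ideal_binary_mask(target, noise)

# A hypothetical estimated mask with one false alarm and one miss.
estimated = [[1, 0, 1], [1, 0, 0]]
hit_rate, fa_rate, score = hit_fa(ideal, estimated)
print(ideal, hit_rate, fa_rate, score)
```

Subtracting FA from HIT penalizes an estimator that labels every unit as speech, which would otherwise reach a perfect HIT rate.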
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079713550
http://hdl.handle.net/11536/127261
Appears in Collections: Thesis