A Study of Frequency Band and Wavelet Analysis for Robust Voice Activity Detection
|Keywords:||voice activity detection; wavelet analysis; adaptive noise estimator; subband decomposition|
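|Abstract:||This dissertation examines the problems that voice activity detection (VAD) systems face under low signal-to-noise ratio (SNR) and sharply varying noise levels. VAD systems proposed to date assume that the ambient noise level is stationary. Because the feature parameters of conventional algorithms depend on energy estimation, however, their performance is easily affected by changes in the actual noise level. In a car, for example, abrupt noise variations occur frequently because of movement, engine operation, vehicle speed, braking, and door slams. To address this problem, we propose two VAD systems in turn, each based on robust feature parameters. In the first method, we observe that the banded lines that formant frequencies produce on the voice spectrogram indicate the presence of time-varying speech simply and effectively. Using frequency band analysis, we propose an entropy-based VAD system. The signal is first split into 32 uniform subbands to separate out the distribution of the formant frequencies. The dissertation defines a banded spectrum entropy (BSE) on these subbands to make full use of the inherent properties of the banded lines on the voice spectrogram. Because the resulting subbands may be corrupted by noise, we use an adaptive threshold method to build a subband self-extraction procedure that picks out the useful subbands on the fly, which improves the noise robustness of the BSE parameter. In practice, however, the banded lines on the voice spectrogram are suitable only for characterizing voiced speech. To strengthen the unvoiced part of the speech signal, the ratio of low-band to full-band energy (RLF) is used to distinguish unvoiced speech from background noise. Experimental results show that, compared with other methods, the BSE and RLF feature parameters used to build the robust VAD system characterize speech successfully and are not easily affected by changes in noise level. VAD also plays a very important role in noise estimators, which generally use it as the indicator of when to track changes in the noise spectrum. To handle rapidly changing noise levels, the proposed noise estimator incorporates the entropy-based VAD and is built on recursive averaging with an adaptive smoothing factor.
In the other VAD system, we use the transient and non-stationary properties of speech as the basis for extracting the speech signal and analyze the signal with wavelets. First, the discrete wavelet transform splits the input signal into four non-uniform subbands, and a non-linear Teager energy operator (TEO) is applied in each subband to suppress the effect of noise; a further advantage is that it benefits the result of the subband auto-correlation function (SACF). To quantify the auto-correlation function on each subband, a Mean-Delta (MD) operation estimates the periodicity strength of each band, and the MDSACF parameters of all subbands are then summed to form a robust wavelet-based feature parameter. To complete the VAD system, we adopt an adaptive threshold method as the decision mechanism. Experimental results confirm that, compared with other methods, the wavelet-based VAD is robust under variable noise levels, efficient, and easy to implement.|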
This dissertation addresses the failure of voice activity detection (VAD) under poor signal-to-noise ratio (SNR) and dynamically time-varying background noise. Commonly used VAD algorithms assume that the background noise level is stationary. Because the features extracted by conventional algorithms depend closely on the estimated energy level, their performance degrades easily when the noise level varies. In a car, for example, such variation commonly arises from movement, engine running, speed changes, braking, door slams, and so on. To solve this problem, VAD algorithms based on two types of robust feature parameters are proposed in turn. The first approach builds on the observation that the banded lines produced on the voice spectrogram by formant frequencies are a highly efficient, compact representation of the time-varying characteristics of speech signals. For frequency band analysis, an entropy-based VAD is presented. First, the input signal is decomposed into 32 uniform subbands to locate the formant frequency bands. An entropy measure defined in the subband domain, called the banded spectrum entropy (BSE) parameter, is then proposed to exploit the inherent nature of the banded lines on the voice spectrogram. Because some of the decomposed subbands can be contaminated by noise, a subband self-extraction (SSE) strategy based on an adaptive threshold is presented to extract the useful subbands over time and thereby make the BSE robust against noise. In practice, the banded lines on the voice spectrogram are suitable only for characterizing voiced speech. To improve the detection of unvoiced speech, the ratio of low-band energy to full-band energy (RLF) is introduced to discriminate unvoiced sounds from background noise.
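As a rough illustration of the two frequency-band features described above, the following Python sketch computes a Shannon entropy over 32 uniform subband energies and a low-band to full-band energy ratio for a single frame. It is not the dissertation's BSE or RLF definition, which the abstract does not spell out. The naive DFT, the 64-sample frame used in the usage note, the function names, and the choice of the lowest 8 subbands as the "low band" are all assumptions made for illustration, and the subband self-extraction (SSE) step is omitted.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum over bins 0 .. N/2 - 1 (adequate for short frames)."""
    n = len(frame)
    spec = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2)
    return spec


def subband_energies(spec, num_bands=32):
    """Sum the power spectrum into num_bands uniform subbands."""
    width = len(spec) // num_bands
    if width == 0:
        raise ValueError("frame too short for the requested number of subbands")
    return [sum(spec[b * width:(b + 1) * width]) for b in range(num_bands)]


def spectral_entropy(energies):
    """Shannon entropy in bits of the normalized subband energy distribution."""
    total = sum(energies)
    if total <= 0.0:
        return 0.0
    probs = [e / total for e in energies]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def low_to_full_ratio(energies, low_bands=8):
    """Energy in the lowest low_bands subbands divided by total energy.

    low_bands=8 is an illustrative assumption, not a value from the source.
    """
    total = sum(energies)
    return sum(energies[:low_bands]) / total if total > 0.0 else 0.0
```

With a 64-sample frame, a single low-frequency tone concentrates its energy in one subband and gives an entropy close to 0 and a ratio close to 1, while a unit impulse has a flat spectrum and gives the maximum entropy of log2(32) = 5 bits. A practical detector would compare such features against a threshold that adapts to the noise level, as the abstract describes.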
Experimental results show that, compared with other VAD approaches, the BSE and RLF parameters used for determining voice activity successfully exploit the characteristics of the speech signal and are largely robust against variable noise levels. In practice, VAD also plays an essential role in noise spectrum estimation, where it is frequently used as the indicator of when to update the noise spectrum. The proposed noise spectrum estimator uses the entropy-based VAD described above as this indicator. In addition, a recursive-averaging formula and an adaptive smoothing factor are used to adapt quickly to variable noise levels. In the alternative VAD method, wavelet analysis is used to extract speech signals by exploiting their transient components and non-stationary property. First, the input signal is divided into four non-uniform subbands via the discrete wavelet transform (DWT). A nonlinear Teager energy operator (TEO) is then applied to each subband signal. We show that the TEO significantly decreases the influence of noise on the subbands. A further advantage is that it benefits the result of the subband auto-correlation function (SACF). To measure the amount of periodicity, a Mean-Delta (MD) operator is then applied to the SACF of each subband. Summing the MDSACFs derived from all decomposed subbands yields a robust wavelet-based feature parameter. Finally, an adaptive threshold method is adopted as the VAD decision to form a complete VAD. Simulation results show that the wavelet-based VAD is robust against changing noise levels and is efficient and simple compared with other methods.
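Two of the building blocks named above have simple textbook forms, sketched here in Python under stated assumptions. The discrete Teager energy operator is the standard definition, psi[n] = x[n]^2 - x[n-1] * x[n+1]. The noise update is a generic VAD-gated recursive average with a fixed smoothing factor `alpha`; the dissertation uses an adaptive smoothing factor whose formula is not given in the abstract, so the fixed value and the function names are assumptions. The DWT decomposition, the SACF, and the Mean-Delta operator are not shown.

```python
import math


def teager_energy(x):
    """Discrete Teager energy, psi[n] = x[n]^2 - x[n-1] * x[n+1], for interior samples."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def update_noise_spectrum(noise, frame_power, is_speech, alpha=0.9):
    """Recursive averaging of the noise power spectrum, gated by a VAD decision.

    The estimate is left unchanged in frames labeled as speech and moves toward
    the current frame power otherwise. A fixed alpha is an illustrative
    assumption; the source describes an adaptive smoothing factor.
    """
    if is_speech:
        return list(noise)
    return [alpha * n + (1.0 - alpha) * p for n, p in zip(noise, frame_power)]
```

For a pure sinusoid A cos(w n + phi), the operator returns the constant A^2 sin^2(w), so it responds to both amplitude and frequency. In the gated update, a smaller `alpha` tracks noise changes faster at the cost of a noisier estimate.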
|Appears in Collections:||Thesis|