# 臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)


### Detailed Record

• Cited: 0
• Views: 141
• Rating:
• Downloads: 0
• Bookmarked: 0
**Abstract (translated from Chinese):** This thesis proposes a technique for speaker-independent continuous Mandarin digit speech recognition. The method commonly used in continuous speech recognition is the one-stage algorithm based on hidden Markov models, originally developed for connected-word recognition. In practice, this method suffers from two problems: the lack of robust acoustic models for speaker-independent recognition, and the fact that the one-stage algorithm considers only the temporal structure of the training data, not that of the test utterance. This thesis applies principal component analysis (PCA) to address both problems. First, a generalized common vector method is developed. Built on the eigenanalysis of covariance matrices, it extracts a common vector from the speech features of different speakers or different environments. The generalized common vector method is then integrated into the conventional hidden Markov model to form a new acoustic model, the generalized common vector based hidden Markov model (GCVHMM), for speaker-independent recognition. To overcome the second problem, we propose a new feature that represents the temporal information of the test utterance, called the principal component variance (PCV). When used in the one-stage algorithm, this feature indicates the transition state of the current word. Finally, all the proposed methods are applied to continuous Mandarin digit recognition; the experimental results show a 20.5% improvement in recognition rate over the original system. The test sentences consist of long strings of Mandarin digits. Tone, indicated by the variation of the fundamental frequency over a syllable, plays a very important role in Mandarin. This thesis therefore also proposes a spectral-structure analysis method for estimating the fundamental frequency of speech. First, we propose a method for measuring pitch, called the pitch measure, which detects the harmonic characteristics in the spectrum of voiced speech. It exploits the properties that distinct impulses appear at the fundamental frequency and its harmonics in the spectrum of voiced speech, and that the energy of voiced speech is dominated by these distinct harmonic impulses. The spectrum can be obtained by the fast Fourier transform (FFT); however, the resulting spectrum is vulnerable to noise. To improve the robustness of the proposed method under noise, we apply the joint time-frequency analysis (JTFA) technique to obtain an adaptive representation of the speech spectrum. The adaptive representation can accurately extract the important harmonic structure of noise-corrupted speech, but at a high computational cost. To overcome this, we further propose a fast adaptive representation (FAR) algorithm to reduce the required computation; analysis shows that it cuts the computation by 50%. A database was prepared to evaluate the performance of the FAR algorithm and to compare it with other methods. The experimental results show that, with or without noise interference, the FAR algorithm outperforms the other methods.
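The "scan coarsely, then refine" idea behind the FAR algorithm can be illustrated with a short sketch. This is not the thesis's algorithm (which operates on the adaptive time-frequency representation of speech); it is a generic coarse-to-fine search, and the function name and parameters are invented for this example.

```python
import numpy as np

def coarse_to_fine_argmax(score, lo, hi, coarse_step=8.0, levels=3):
    """Hypothetical coarse-to-fine search: scan [lo, hi] at a coarse step,
    then repeatedly halve the step and rescan only a narrow region around
    the current best point.  `score` maps a frequency to a goodness value."""
    step = coarse_step
    best = None
    for _ in range(levels):
        grid = np.arange(lo, hi + step / 2, step)
        best = grid[np.argmax([score(f) for f in grid])]
        # Focus the next, finer scan around the current best candidate.
        lo, hi = best - step, best + step
        step /= 2.0
    return best
```

Because each refinement level evaluates only a handful of candidates around the running best, far fewer score evaluations are needed than an exhaustive fine-grained scan, which is the spirit of the 50% reduction claimed for FAR.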
This thesis proposes a new speech recognition technique for speaker-independent recognition of continuously spoken Mandarin digits. One popular tool for such a problem is the HMM-based one-stage algorithm, a connected-word pattern-matching method. However, two problems prevent this conventional method from practical use on our target problem. One is the lack of a proper mechanism for selecting robust acoustic models for speaker-independent recognition. The other is that the one-stage algorithm considers the temporal structure of the continuous speech signal only in the training phase, not in the testing phase. In this thesis, we adopt the principal component analysis (PCA) technique to solve these two problems. First, a generalized common vector (GCV) approach is developed, based on the eigenanalysis of the covariance matrix, to extract a feature that is invariant over different speakers as well as over acoustic-environment effects and phase or temporal differences. The GCV scheme is then integrated into the conventional HMM to form a new GCV-based HMM, called GCVHMM, which is well suited to speaker-independent recognition. To overcome the second problem of the one-stage algorithm, we propose a new temporal feature, the principal component variance (PCV), to characterize the temporal information of a test speech signal; the PCV is a good indicator of word transitions for the one-stage algorithm. In our experiments on speaker-independent continuous speech, with each test sentence composed of a long, randomly generated, variable-length string of Mandarin digits, the proposed scheme increases the average recognition rate of the conventional HMM-based one-stage algorithm by over 20\% without using any grammar or lexical information. Tone, which is indicated by contrastive variations of the fundamental frequency $F_{0}$ at the syllable level, is an important part of a speech understanding system for Mandarin.
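The common-vector idea underlying GCV can be sketched in a few lines. This is a simplified, hypothetical illustration of the eigenanalysis step only (projecting onto the least-variance eigenvectors of a class covariance matrix to keep the component shared across a speaker's utterances), not the full GCV derivation or its integration into the HMM; the function name and the choice of projecting the class mean are this example's own.

```python
import numpy as np

def common_vector(feature_vectors, k):
    """Sketch of the common-vector idea: keep the component of a class
    that lies in the k directions of *least* variance, i.e. the part
    that changes least from sample to sample (speaker to speaker)."""
    X = np.asarray(feature_vectors, dtype=float)   # shape (n_samples, dim)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)                  # class covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    B = eigvecs[:, :k]                             # k least-variance directions
    # Project the class mean onto this low-variance ("common") subspace.
    return B @ (B.T @ mean)
```

Directions of large eigenvalue carry the speaker- and environment-dependent variation, so discarding them leaves an invariant representative of the class, which is the intuition the abstract describes.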
In this thesis, we also propose a new scheme to analyze the spectral structure of speech signals for fundamental frequency estimation \cite{Liu}. First, we propose a {\em pitch measure} to detect the harmonic characteristics of voiced sounds in the spectrum of a speech signal. This measure utilizes the properties that distinct impulses are located at the positions of the fundamental frequency and its harmonics, and that the energy of voiced sound is dominated by the energy of these distinct harmonic impulses. The spectrum can be obtained by the fast Fourier transform (FFT); however, it may be corrupted when the speech is contaminated by additive noise. To enhance the robustness of the proposed scheme in noisy environments, we apply the joint time-frequency analysis (JTFA) technique to obtain an adaptive representation of the spectrum of speech signals. The adaptive representation can accurately extract the important harmonic structure of noisy speech signals, at the expense of a high computation cost. To solve this problem, we further propose a fast adaptive representation (FAR) algorithm, which reduces the computational complexity of the original algorithm by 50\%. The basic concept of the FAR algorithm is to search for the adaptive representation of the speech signal at a low frequency resolution over the full frequency range, and then to increase the search resolution step by step over progressively narrower regions until the desired resolution is reached. This "divide-and-conquer" approach substantially reduces the computational complexity. The performance of the proposed fundamental-frequency estimation scheme is evaluated on a large database, with and without additive noise, and compared to that of other approaches on the same database. The experimental results show that the proposed scheme performs well on clean speech and is robust in noisy environments.
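The harmonic property the pitch measure exploits (spectral energy concentrated at multiples of $F_0$) can be illustrated with a toy harmonic-summation estimator. This is not the thesis's pitch measure or its JTFA-based version; it is a minimal sketch, assuming a plain windowed FFT and a made-up candidate grid and harmonic count.

```python
import numpy as np

def estimate_f0(signal, fs, f0_min=60.0, f0_max=400.0):
    """Toy F0 estimator: score each candidate F0 by the mean spectral
    magnitude at its first 10 harmonics; voiced speech concentrates
    energy at those harmonic positions, so the true F0 scores highest."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    bin_width = fs / len(signal)                    # Hz per FFT bin
    best_f0, best_score = 0.0, -np.inf
    for f0 in np.arange(f0_min, f0_max, 1.0):
        harmonics = np.arange(f0, fs / 2.0, f0)[:10]
        bins = np.round(harmonics / bin_width).astype(int)
        score = spectrum[bins].mean()               # mean harmonic energy
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0
```

Note why the mean (rather than the sum) matters: a subharmonic candidate such as $F_0/2$ places half of its hypothesized harmonics between the true peaks, so averaging penalizes it, matching the abstract's point that voiced energy is *dominated* by the true harmonic impulses.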
Contents

- Abstract in Chinese
- Abstract in English
- Contents
- List of Figures
- List of Tables
- 1 Introduction
  - 1.1 Motivation
    - 1.1.1 Literature Survey
    - 1.1.2 Research Objectives and Organization of Thesis
- 2 Hidden Markov Model
  - 2.1 General Structure of HMM
    - 2.1.1 Markov Chain
    - 2.1.2 HMMs
    - 2.1.3 The Output Probability Distribution
    - 2.1.4 Elements of an HMM
  - 2.2 Three Basic Issues for HMMs
    - 2.2.1 Issue 1: Probability Evaluation
      - 2.2.1.1 The Forward Procedure
      - 2.2.1.2 The Backward Procedure
    - 2.2.2 Issue 2: "Optimal" State Sequence
      - 2.2.2.1 Viterbi Algorithm
      - 2.2.2.2 Alternative Viterbi Implementation
    - 2.2.3 Issue 3: Parameter Estimation
      - 2.2.3.1 Auxiliary Function and Reestimation Algorithm
      - 2.2.3.2 Maximization of the Auxiliary Function
- 3 Generalized Common-Vector-based HMM for Continuous Speaker-Independent Mandarin Digits Recognition
  - 3.1 Introduction
  - 3.2 Review of Common Vector Approach
    - 3.2.1 Common vector approach
    - 3.2.2 Relationship of CVA to Eigenanalysis
      - 3.2.2.1 Eigenanalysis
      - 3.2.2.2 Principal component analysis
      - 3.2.2.3 CVA by eigenanalysis
  - 3.3 Generalization of CVA: Generalized Common Vector (GCV)
  - 3.4 Generalized Common-Vector-based HMM (GCVHMM)
    - 3.4.1 Structure of GCVHMM
    - 3.4.2 Reestimation algorithm for the parameters of GCVHMM
    - 3.4.3 Experiments and discussions on performance evaluation of GCVHMM
  - 3.5 Principal Component Variance (PCV) Method
    - 3.5.1 PCV parameter for finding stationary and non-stationary parts of speech
    - 3.5.2 The one-stage algorithm with the embedded PCV information
  - 3.6 Experiments of Speaker-Independent Continuous Mandarin Digits Recognition
  - 3.7 Summary
- 4 Fundamental Frequency Estimation Based on the Joint Time-Frequency Analysis of Harmonic Spectral Structure
  - 4.1 Introduction
  - 4.2 Detection of Harmonic Spectral Structure
    - 4.2.1 Spectral Analysis
    - 4.2.2 Continuous Pitch-Tracking Algorithm with Voiced/Unvoiced Decision
    - 4.2.3 Determination of Window Widths and Threshold Values
    - 4.2.4 Experiments
  - 4.3 Adaptive Representation of Speech Spectrum
    - 4.3.1 Adaptive Representation
    - 4.3.2 Fast Adaptive Representation (FAR) Algorithm
    - 4.3.3 Fundamental Frequency Estimation Based on Adaptive Representation
  - 4.4 Experimental Results and Comparisons
    - 4.4.1 Testing Database
    - 4.4.2 Error Measurements
    - 4.4.3 Performance Evaluation and Discussion
  - 4.5 Summary
- 5 Conclusion
- Appendix
  - A Proof for the properties of energy measure
  - B Proof for the properties of impulse measure
  - C Proof for the properties of pitch measure
  - D Details of FAR algorithm
- Bibliography
1. D. J. Liu and C. T. Lin, "Fundamental Frequency Estimation Based on the Joint Time-Frequency Analysis of Harmonic Spectral Structure," IEEE Trans. Speech Audio Processing, vol. 9, pp. 609-621, 2001.
2. L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
3. L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257-286, Feb. 1989.
4. T. K. Vintsyuk, "Element-wise recognition of continuous speech consisting of words from a specified vocabulary," Kibernetika (Cybernetics), vol. 7, no. 2, pp. 133-143, March-April 1971.
5. J. S. Bridle, M. D. Brown, and R. M. Chamberlain, "An algorithm for connected word recognition," in Proc. Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), Paris, pp. 899-902, May 1982.
6. J. S. Bridle, M. D. Brown, and R. M. Chamberlain, "Continuous connected word recognition using whole word templates," The Radio and Electronic Engineer, vol. 53, no. 4, pp. 167-175, April 1983.
7. H. Ney, "The use of a one-stage dynamic programming algorithm for connected word recognition," IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP-32, no. 2, pp. 263-271, April 1984.
8. C. H. Lee and L. R. Rabiner, "A frame-synchronous network search algorithm for connected word recognition," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 37, no. 11, pp. 1649-1658, November 1989.
9. D. Burshtein, "Robust parametric modeling of durations in hidden Markov models," IEEE Trans. Speech and Audio Processing, vol. 4, no. 3, pp. 240-242, May 1996.
10. S. Ramachandrula and S. Thippur, "Connected phoneme HMMs with implicit duration modelling for better speech recognition," in Proc. 1997 International Conference on Information, Communications and Signal Processing (ICICS), vol. 2, pp. 1024-1028, 1997.
11. P. Ramesh and J. G. Wilpon, "Modeling state durations in hidden Markov models for automatic speech recognition," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 381-384, 1992.
12. B. Logan and P. Moreno, "Factorial HMMs for acoustic modeling," in Proc. Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), vol. 2, pp. 813-816, 1998.
13. Z. Ghahramani and M. Jordan, "Factorial hidden Markov models," Computational Cognitive Science Technical Report 9502, July 1996.
14. M. Brand, "Coupled hidden Markov models for modeling interacting processes," MIT Media Lab Perceptual Computing/Learning and Common Sense Technical Report 405, June 1997.
15. T. Hazen, "The use of speaker correlation information for automatic speech recognition," Ph.D. dissertation, Mass. Inst. Technol., Cambridge, Jan. 1998.
16. C. H. Lee, C. H. Lin, and B. H. Juang, "A study on speaker adaptation of the parameters of continuous density hidden Markov models," IEEE Trans. Signal Processing, vol. 39, pp. 806-814, 1991.
17. Y. Zhao, "An acoustic-phonetic-based speaker adaptation technique for improving speaker-independent continuous speech recognition," IEEE Trans. Speech and Audio Processing, vol. 2, no. 3, pp. 380-394, July 1994.
18. A. Sankar and C. H. Lee, "A maximum-likelihood approach to stochastic matching for robust speech recognition," IEEE Trans. Speech and Audio Processing, vol. 4, no. 3, pp. 190-202, May 1996.
19. M. Bilginer G\"{u}lmezo\u{g}lu, Vakif Dzhafarov, Mustafa Keskin, and Atalay Barkana, "A novel approach to isolated word recognition," IEEE Trans. Speech and Audio Processing, vol. 7, no. 6, pp. 620-628, Nov. 1999.
20. M. Bilginer G\"{u}lmezo\u{g}lu, Vakif Dzhafarov, and Atalay Barkana, "The common vector approach and its relation to principal component analysis," IEEE Trans. Speech and Audio Processing, vol. 9, no. 6, pp. 655-662, Nov. 2001.
21. H. Y. Gu, C. Y. Tseng, and L. S. Lee, "Isolated-utterance speech recognition using hidden Markov models with bounded state durations," IEEE Trans. Signal Processing, vol. 39, no. 8, pp. 1743-1752, Aug. 1991.
22. C. H. Edwards and D. E. Penney, Elementary Linear Algebra. Englewood Cliffs, NJ: Prentice-Hall, 1988.
23. L. Knockaert, "An order-recursive algorithm for estimating pole-zero models," IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP-35, pp. 154-157, Feb. 1987.
24. S. Haykin, Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, 1994, pp. 363-370.
25. D. F. Morrison, Multivariate Statistical Methods. NY: McGraw-Hill, 1967, pp. 156-195.
26. A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statist. Soc., vol. 39, pp. 1-38, 1977.
27. B. H. Juang, "Maximum-likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT\&T Tech. J., vol. 64, no. 6, pp. 1235-1249, 1985.
28. L. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Statist., vol. 41, no. 1, pp. 164-171, 1970.
29. L. R. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Trans. Inform. Theory, vol. IT-28, pp. 729-734, September 1982.
30. B. H. Juang and L. R. Rabiner, "Mixture autoregressive hidden Markov models for speech signals," IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, pp. 1404-1413, 1985.
31. L. R. Rabiner, B. H. Juang, S. E. Levinson, and M. M. Sondhi, "Recognition of isolated digits using hidden Markov models with continuous mixture densities," AT\&T Tech. J., vol. 64, no. 6, pp. 1211-1234, July-Aug. 1985.
32. L. R. Rabiner, J. G. Wilpon, and F. K. Soong, "High performance connected digit recognition using hidden Markov models," in Proc. ICASSP, pp. 119-122, 1988.
33. M. J. Russell and R. K. Moore, "Explicit modeling of state occupancy in hidden Markov models for automatic speech recognition," in Proc. ICASSP, pp. 5-8, 1985.
34. S. E. Levinson, "Continuously variable duration hidden Markov models for speech analysis," in Proc. ICASSP, pp. 1241-1244, 1986.
35. S. E. Levinson, A. Ljolje, and L. G. Miller, "Large vocabulary speech recognition using a hidden Markov model for acoustic/phonetic classification," in Proc. ICASSP, pp. 505-508, 1988.
36. J. L. Flanagan, Speech Analysis, Synthesis, and Perception. NY: Springer-Verlag, 1972.
37. A. V. McCree and T. P. Barnwell III, "A mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Trans. Speech Audio Processing, vol. 3, pp. 242-250, July 1995.
38. S. H. Chen and Y. R. Wang, "Tone recognition of continuous Mandarin speech based on neural networks," IEEE Trans. Speech Audio Processing, vol. 3, pp. 146-150, 1995.
39. T. Lee, P. C. Ching, L. W. Chan, Y. H. Cheng, and B. Mak, "Tone recognition of isolated Cantonese syllables," IEEE Trans. Speech Audio Processing, vol. 3, pp. 204-209, May 1995.
40. L. S. Lee, C. Y. Tseng, H. Y. Gu, F. H. Liu, C. H. Chang, Y. H. Lin, Y. Lee, S. L. Tu, S. H. Hsieh, and C. H. Chen, "Golden Mandarin (I) - a real-time Mandarin speech dictation machine for Chinese language with very large vocabulary," IEEE Trans. Speech Audio Processing, vol. 1, pp. 158-178, April 1993.
41. S. Potisuk, M. P. Harper, and J. Gandour, "Classification of Thai tone sequences in syllable-segmented speech using the analysis-by-synthesis method," IEEE Trans. Speech Audio Processing, vol. 7, pp. 95-102, January 1999.
42. L. R. Rabiner, M. J. Cheng, A. E. Rosenberg, and C. A. McGonegal, "A comparative performance study of several pitch detection algorithms," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 399-417, October 1976.
43. S. Ahmadi and A. S. Spanias, "Cepstrum-based pitch detection using a new statistical V/UV classification algorithm," IEEE Trans. Speech Audio Processing, vol. 7, pp. 333-338, May 1999.
44. J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 367-377, December 1972.
45. B. Boashash, Time-Frequency Signal Analysis. NY: John Wiley \& Sons, 1992.
46. S. Qian and D. Chen, Joint Time-Frequency Analysis. NJ: Prentice-Hall, 1996.
47. S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Processing, vol. 41, pp. 3397-3415, December 1993.
48. S. Qian and D. Chen, "Signal representation using adaptive normalized Gaussian functions," Signal Processing, vol. 36, pp. 1-11, 1994.
49. L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. NJ: Prentice-Hall, 1993.
50. J. G. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals. NY: Macmillan Publishing Co., 1993.
51. T. A. C. M. Claasen and W. F. G. Mecklenbr\"{a}uker, "The Wigner distribution - a tool for time-frequency signal analysis - Parts I, II, III," Philips J. Res., vol. 35, pp. 217-250, 276-300, 372-389, 1980.
52. A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989.
53. R. W. Schafer and L. R. Rabiner, "Digital representations of speech signals," Proceedings of the IEEE, vol. 63, no. 4, pp. 662-677, 1975.
54. R. W. Schafer and L. R. Rabiner, "System for automatic formant analysis of voiced speech," J. Acoust. Soc. Amer., vol. 47, pp. 634-648, Feb. 1970.
55. A. M. Noll, "Cepstrum pitch determination," J. Acoust. Soc. Amer., vol. 41, pp. 293-309, Feb. 1967.
56. B. G. Secrest and G. R. Doddington, "Postprocessing techniques for voice pitch trackers," in Proc. IEEE ICASSP'82, pp. 172-175, 1982.
57. D. Talkin, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds. NY: Elsevier, 1995.



