臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)


Detailed Record

Author: 劉得正 (Der-Jenq Liu)
Title (Chinese): 以GCVHMM及FAR演算法為基礎之不特定語者中文連續數字語音辨識研究
Title (English): A Study on Continuous Speaker-Independent Mandarin Digits Recognition based on GCVHMM and FAR Algorithm
Advisor: 林進燈 (Chin-Teng Lin)
Degree: Doctoral (Ph.D.)
Institution: 國立交通大學 (National Chiao Tung University)
Department: 電機與控制工程系 (Electrical and Control Engineering)
Discipline: Engineering
Field: Electrical and Information Engineering
Document type: Academic thesis
Year of publication: 2002
Graduation academic year: 90 (2001-2002)
Language: English
Keywords (Chinese): common vector; generalized common vector; generalized-common-vector-based hidden Markov model (GCVHMM); principal component analysis; adaptive representation; partial Fourier transform
Keywords (English): Common vector approach; Generalized common vector (GCV); GCVHMM; Principal component variance; Adaptive representation; pitch measure; partial FFT
Usage statistics:
  • Cited by: 0
  • Views: 141
  • Rating: (none)
  • Downloads: 0
  • Bookmarked: 0
This thesis presents techniques for speaker-independent continuous Mandarin digit speech recognition. A method commonly used for continuous speech recognition is the HMM-based one-stage algorithm, which was originally developed for connected-word recognition. In practice, however, this method has two problems: it lacks an acoustic model robust enough for speaker-independent recognition, and the one-stage algorithm considers only the temporal structure of the training data while ignoring that of the test speech. This thesis therefore adopts principal component analysis (PCA) techniques to solve both problems. First, a generalized common vector (GCV) approach is developed. Built on the eigenanalysis of the covariance matrix, it extracts the common vector of speech features taken from different speakers or different environments. The GCV approach is then integrated into the conventional hidden Markov model to form a new acoustic model, the generalized common vector based hidden Markov model (GCVHMM), for speaker-independent recognition. To overcome the second problem, we propose a new feature, called the principal component variance (PCV), that represents the temporal information of the test speech; when used in the one-stage algorithm, it indicates the temporal state of the current word. Finally, all of the proposed methods are applied to continuous Mandarin digit recognition; on test sentences composed of long strings of Mandarin digits, the experimental results show a 20.5% improvement in recognition rate over the original recognition system.
Tone, indicated by contrasting variations of the fundamental frequency at the syllable level, plays a very important role in Mandarin. This thesis also proposes a method for analyzing the spectral structure of speech in order to estimate the fundamental frequency. First, we propose a pitch measure that detects the harmonic characteristics in the spectrum of voiced speech. It exploits the facts that the spectrum of voiced speech contains distinct impulses at the fundamental frequency and its harmonic positions, and that the energy of voiced speech is dominated by these distinct harmonic impulses. The spectrum can be obtained by the fast Fourier transform (FFT); however, the resulting spectrum is susceptible to noise. To improve the performance of the proposed method under noise, we apply the joint time-frequency analysis (JTFA) technique to obtain an adaptive representation of the speech spectrum. The adaptive representation can accurately extract the important harmonic structure of noisy speech, but at a high computational cost. To overcome this, we further propose a fast adaptive representation (FAR) algorithm that lowers the required computation; analysis shows that it reduces the computation by 50%. A database is prepared to evaluate the performance of the FAR algorithm and to compare it with other methods. The experimental results show that, with or without noise interference, the FAR algorithm outperforms the other methods.

This thesis proposes a new speech recognition technique for continuous speaker-independent recognition of spoken Mandarin digits. One popular tool for solving such a problem is the HMM-based one-stage algorithm, which is a connected-word pattern-matching method. However, two problems in this conventional method prevent it from practical use on our target problem. One is the lack of a proper selection mechanism for robust acoustic models for speaker-independent recognition. The other is that the one-stage algorithm considers the temporal structure of the continuous speech signals only in the training phase, not in the testing phase. In this thesis, we adopt the principal component analysis (PCA) technique to solve these two problems. First, a generalized common-vector (GCV) approach is developed, based on the eigenanalysis of the covariance matrix, to extract a feature that is invariant over different speakers as well as acoustic environment effects and phase or temporal differences. The GCV scheme is then integrated into the conventional HMM to form a new GCV-based HMM, called GCVHMM, which is well suited to speaker-independent recognition. To overcome the second problem of the one-stage algorithm, we propose a new temporal-information feature called the principal component variance (PCV) to characterize the temporal information of a test speech signal. The PCV is a good indicator of word transitions for the one-stage algorithm. In our experiments on the recognition of speaker-independent continuous speech sentences, each composed of long, randomly generated strings of Mandarin digits of variable length, the proposed scheme is shown to increase the average recognition rate of the conventional HMM-based one-stage algorithm by over 20\% without using any grammar or lexical information.
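The speaker-invariant modeling idea behind GCVHMM can be made concrete with a small numerical sketch. The following NumPy snippet is an illustration only, not the thesis's GCVHMM or its reestimation formulas: it performs an eigenanalysis of the within-class covariance of feature vectors for one word class and removes the leading-eigenvector component, which is taken here to model speaker and environment variation, leaving an approximately invariant common vector. The keep_ratio threshold, the use of the class mean, and the random stand-in features are assumptions made for the example.

    import numpy as np

    def common_vector(features, keep_ratio=0.95):
        # features: (n_samples, dim) feature vectors of one word class,
        # pooled over different speakers.  Speaker/environment variation is
        # assumed to lie in the leading eigenvectors of the within-class
        # covariance matrix.
        mean = features.mean(axis=0)
        centered = features - mean
        cov = centered.T @ centered / len(features)   # within-class covariance

        eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]             # reorder: largest first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]

        # Leading eigenvectors explaining keep_ratio of the variance are taken
        # to model the variation between speakers and recording conditions.
        cum = np.cumsum(eigvals) / eigvals.sum()
        k = int(np.searchsorted(cum, keep_ratio)) + 1
        variation_basis = eigvecs[:, :k]              # (dim, k)

        # Common vector = class mean minus its projection onto the variation
        # subspace, i.e. its projection onto the remaining minor eigenvectors.
        projection = variation_basis @ (variation_basis.T @ mean)
        return mean - projection

    # Toy usage with random data standing in for cepstral feature vectors.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(50, 12))
    print(common_vector(feats).shape)                 # -> (12,)

In the thesis itself this eigenanalysis is generalized (Chapter 3) and embedded into the HMM output distributions, which is what the GCVHMM structure and reestimation algorithm of Section 3.4 cover.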
Tone, which is indicated by contrasting variations in the fundamental frequency $F_{0}$ at the syllable level, is an important part of a speech understanding system for Mandarin. In this thesis, we also propose a new scheme to analyze the spectral structure of speech signals for fundamental frequency estimation \cite{Liu}. First, we propose a {\em pitch measure} to detect the harmonic characteristics of voiced sounds in the spectrum of a speech signal. This measure exploits the properties that distinct impulses are located at the positions of the fundamental frequency and its harmonics, and that the energy of voiced sound is dominated by the energy of these distinct harmonic impulses. The spectrum can be obtained by the fast Fourier transform (FFT); however, it may be corrupted when the speech is interfered with by additive noise. To enhance the robustness of the proposed scheme in noisy environments, we apply the joint time-frequency analysis (JTFA) technique to obtain an adaptive representation of the spectrum of speech signals. The adaptive representation can accurately extract the important harmonic structure of noisy speech signals, at the expense of a high computational cost. To solve this problem, we further propose a fast adaptive representation (FAR) algorithm, which reduces the computational complexity of the original algorithm by 50\%. The basic idea of the FAR algorithm is to search for the adaptive representation of the speech signal at a low frequency resolution over the full frequency range, and then to increase the search resolution step by step on progressively narrower regions until the desired resolution is reached. This ``divide-and-conquer'' approach substantially reduces the computational complexity. The performance of the proposed fundamental-frequency estimation scheme is evaluated on a large database with and without additive noise, and is compared to that of other approaches on the same database. The experimental results show that the proposed scheme performs well on clean speech and is robust in noisy environments.
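To make the pitch-measure idea concrete, the short sketch below scores candidate fundamental frequencies by how much spectral energy falls on their harmonic positions in an FFT power spectrum. It is a hedged illustration under stated assumptions (Hann window, a 5 Hz candidate grid, only the first five harmonics, normalization by total frame energy); it is not the thesis's exact energy, impulse, and pitch measures (Appendices A-C), and it includes neither the JTFA adaptive representation nor the coarse-to-fine FAR search.

    import numpy as np

    def pitch_measure(frame, fs, f0_range=(60.0, 400.0), step=5.0, n_harmonics=5):
        # For each candidate F0, sum the spectral power at the candidate and
        # its first few harmonics and normalize by the total frame energy.
        # Voiced frames score highest near the true F0 because their energy
        # is concentrated in the harmonic impulses.
        windowed = frame * np.hanning(len(frame))
        power = np.abs(np.fft.rfft(windowed)) ** 2
        df = fs / len(frame)                            # frequency resolution (Hz)
        total = power.sum() + 1e-12

        candidates = np.arange(f0_range[0], f0_range[1], step)
        scores = np.empty(len(candidates))
        for i, f0 in enumerate(candidates):
            harmonics = f0 * np.arange(1, n_harmonics + 1)   # f0, 2*f0, ...
            bins = np.rint(harmonics / df).astype(int)
            bins = bins[bins < len(power)]                   # stay below Nyquist
            scores[i] = power[bins].sum() / total
        best = int(np.argmax(scores))
        return candidates[best], scores[best]

    # Synthetic voiced-like frame: a 150 Hz fundamental and three harmonics.
    fs = 8000
    t = np.arange(0, 0.04, 1.0 / fs)                    # 40 ms frame
    frame = (1.0 * np.sin(2 * np.pi * 150 * t)
             + 0.7 * np.sin(2 * np.pi * 300 * t)
             + 0.5 * np.sin(2 * np.pi * 450 * t)
             + 0.3 * np.sin(2 * np.pi * 600 * t))
    f0, score = pitch_measure(frame, fs)
    print(f0)                                           # -> 150.0 for this frame

In the full scheme described above, the plain FFT spectrum used here would be replaced by the adaptive representation, with the FAR algorithm supplying its coarse-to-fine, divide-and-conquer search over frequency resolution.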

Contents
Abstract in Chinese
Abstract in English
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.1.1 Literature Survey
1.1.2 Research Objectives and Organization of Thesis
2 Hidden Markov Model
2.1 General Structure of HMM
2.1.1 Markov Chain
2.1.2 HMMs
2.1.3 The Output Probability Distribution
2.1.4 Elements of an HMM
2.2 Three Basic Issues for HMMs
2.2.1 Issue 1: Probability Evaluation
2.2.1.1 The Forward Procedure
2.2.1.2 The Backward Procedure
2.2.2 Issue 2: "Optimal" State Sequence
2.2.2.1 Viterbi Algorithm
2.2.2.2 Alternative Viterbi Implementation
2.2.3 Issue 3: Parameter Estimation
2.2.3.1 Auxiliary Function and Reestimation Algorithm
2.2.3.2 Maximization of the Auxiliary Function
3 Generalized Common-Vector-based HMM for Continuous Speaker-Independent Mandarin Digits Recognition
3.1 Introduction
3.2 Review of Common Vector Approach
3.2.1 Common vector approach
3.2.2 Relationship of CVA to Eigenanalysis
3.2.2.1 Eigenanalysis
3.2.2.2 Principal component analysis
3.2.2.3 CVA by eigenanalysis
3.3 Generalization of CVA: Generalized Common Vector (GCV)
3.4 Generalized Common-Vector-based HMM (GCVHMM)
3.4.1 Structure of GCVHMM
3.4.2 Reestimation algorithm for the parameters of GCVHMM
3.4.3 Experiments and discussions on performance evaluation of GCVHMM
3.5 Principal Component Variance (PCV) Method
3.5.1 PCV parameter for finding stationary and non-stationary parts of speech
3.5.2 The one-stage algorithm with the embedded PCV information
3.6 Experiments of Speaker-Independent Continuous Mandarin Digits Recognition
3.7 Summary
4 Fundamental Frequency Estimation Based on the Joint Time-Frequency Analysis of Harmonic Spectral Structure
4.1 Introduction
4.2 Detection of Harmonic Spectral Structure
4.2.1 Spectral Analysis
4.2.2 Continuous Pitch-Tracking Algorithm with Voiced/Unvoiced Decision
4.2.3 Determination of Window Widths and Threshold Values
4.2.4 Experiments
4.3 Adaptive Representation of Speech Spectrum
4.3.1 Adaptive Representation
4.3.2 Fast Adaptive Representation (FAR) Algorithm
4.3.3 Fundamental Frequency Estimation Based on Adaptive Representation
4.4 Experimental Results and Comparisons
4.4.1 Testing Database
4.4.2 Error Measurements
4.4.3 Performance Evaluation and Discussion
4.5 Summary
5 Conclusion
Appendix
A Proof for the properties of energy measure
B Proof for the properties of impulse measure
C Proof for the properties of pitch measure
D Details of FAR algorithm
Bibliography

1
D. J. Liu and C. T. Lin, ``Fundamental Frequency Estimation Based
on the Joint Time-Frequency Analysis of Harmonic Spectral
Structure," IEEE Tran. Speech Audio Processing, vol. 9, pp.
609-621, 2001.
2
L. Rabiner and B. H. Juang, Fundamentals of Speech
Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
3
L. R. Rabiner, ``A tutorial on hidden Markov models and selected
applications in speech recognition," Proc. IEEE, vol. 77,
pp. 257-286, Feb. 1989.
4
T. K. Vintsyuk, ``Element-wise recognition of continuous speech
consisting of words from a specified vocabulary," Kibernetika
(Cybernetics), vol. 7, no. 2, pp. 133-143, March-April 1971.
5
J. S. Bridle, M. D. Brown, and R. M. Chamberlain, ``An algorithm
for connected word recognition," in Proc. Int. Conf.
Acoustics, Speech, Signal Processing (ICASSP), Paris, pp.
899-902, May 1982.
6
J. S. Bridle, M. D. Brown, and R. M. Chamberlain, ``Continuous
connected word recognition using whole word templates," The Radio
and Electronic Engineer, vol. 53, no. 4, pp. 167-175, April 1983.
7
H. Ney, ``The use of a one-stage dynamic programming algorithm for
connected word recognition," IEEE Trans. Acoustic, Speech,
Signal Processing, vol. ASSP-32, no. 2, pp. 263-271, April 1984.
8
C. H. Lee, and L. R. Rabiner, ``A frame-synchronous network search
algorithm for connected word recognition," IEEE Trans.
Acoust., Speech, Signal Processing, vol. 37, no. 11, pp.
1649-1658, November 1989.
9
D. Burshtein, ``Robust parametric modeling of durations in hidden
Markov models," IEEE Trans. Speech and Audio Processing,
vol. 4, no. 3, pp. 240 -242, May 1996.
10
S. Ramachandrula, and S. Thippur, ``Connected phoneme HMMs with
implicit duration modelling for better speech recognition," in
Proceedings of 1997 International Conference on Information,
Communications and Signal Processing (ICICS), vol. 2, pp.
1024-1028, 1997.
11
P. Ramesh, and J.G. Wilpon, ``Modeling state durations in hidden
Markov models for automatic speech recognition," in Proc.
Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP),
vol. 1, pp. 381-384, 1992.
12
B. Logan, and P. Moreno, ``Factorial HMMs for acoustic modeling,"
in Proc. Int. Conf. Acoustics, Speech and Signal Processing
(ICASSP), vol. 2, pp. 813-816, 1998.
13
Z. Ghahramani, and M. Jordan, ``Factorial hidden Markov models,"
Computational Cognitive Science Technical Report 9502, July 1996.
14
M. Brand, ``Coupled hidden Markov models for modeling interacting
processes," MIT Media Lab Perceptual Computing/Learning and
Common Sense Technical Report 405, June 1997.
15
T. Hazen, ``The use of speaker correlation information for
automatic speech recognition," Ph.D. diss., Mass. Inst. Technol.,
Cambridge, Jan. 1998.
16
C. H. Lee, C. H. Lin, and B. H. Juang, ``A study on speaker
adaptation of the parameters of continuous density hidden Markov
models," IEEE Trans. Signal Processing, vol. 39, pp.
806-814, 1991.
17
Y. Zhao, ``An acoustic-phonetic-based speaker adaptation technique
for improving speaker-independent continuous speech recognition,"
IEEE Trans. Speech and Audio Processing, vol. 2, no. 3, pp.
380-394, July 1994.
18
A. Sankar, and C.H. Lee, ``A maximum-likelihood approach to
stochastic matching for robust speech recognition," IEEE Trans.
Speech and Audio Processing, vol. 4, no. 3, pp. 190-202, May
1996.
19
M. Bilginer G\"{u}lmezo\u{g}lu, Vakif Dzhafarov, Mustafa Keskin,
and Atalay Barkana, ``A novel approach to isolated word
recognition," IEEE Trans. Speech and Audio Processing, vol.
7, no. 6, pp. 620-628, Nov. 1999.
20
M. Bilginer G\"{u}lmezo\u{g}lu, Vakif Dzhafarov, and Ataiay
Barkana, ``The common vector approach and its relation to
principal component analysis," IEEE Trains. Speech and Audio
Processing, vol. 9, no. 6, pp. 655-662, Nov. 2001.
21
H. Y. Gu, C. Y. Tseng, and L. S. Lee, ``Isolated-utterance speech
recognition using hidden Markov models with bounded state
durations," IEEE Trans. Signal Processing, vol. 39, no. 8,
pp. 1743-1752, Aug. 1991.
22
C. H. Edwards, and D. E. Penney, Elementary Linear Algebra,
Englewood Cliffs, NJ: Prentice-Hall, 1988.
23
L. Knockaert, ``An order-recursive algorithm for estimating
pole-zero models," IEEE Trans. Acoustic, Speech, Signal
Processing, vol. ASSP-35, pp. 154-157, Feb. 1987.
24
S. Haykin, Neural Networks: A Comprehensive Foundation,
Macmillan College Publishing Company, Inc., 1994, pp. 363-370.
25
D. F. Morrison, Multivariate Statistical Methods. NY:
McGraw-Hill, 1967, pp. 156-195.
26
A. Dempster, N. Laird, and D. Rubin, ``Maximum likelihood from
incomplete data via the EM algorithm," J. Royal Statist.
Soc., vol.39, pp. 1-38, 1977.
27
B. H. Juang, ``Maximum-likelihood estimation for mixture
multivariate stochastic observations of Markov chains," AT\&T
Tech. J., vol. 64, no. 6, pp. 1235-1249, 1985.
28
L. Baum, T. Petrie, G. Soules, and N. Weiss, ``A maximization
technique occurring in the statistical analysis of probabilistic
functions of Markov chains," Ann. Math. Statist., vol. 41,
no. 1, pp. 164-171, 1970.
29
L. R. Liporace, ``Maximum likelihood estimation for multivariate
observations of Markov sources" IEEE Trans. Inform. Theorey,
IT-28, pp. 729-734, September, 1982.
30
B. H. Juang and L. R. Rabiner, ``Mixture autoregressive hidden
Markov Models for speech signals," IEEE Trans. Acoust.,
Speech, Signal Processing, vol. 33, pp. 1404-1413, 1985.
31
L. R. Rabiner, B. H. Juang, S. E. Levinson, and M. M. Sondhi,
``Recognition of isolated digits using hidden Markov models with
continuous mixture densities," AT\&T Tech. J., vol. 64, no.
6, pp. 1211-34, July-Aug. 1985.
32
L. R. Rabiner, J. G. Wilpon, and F. K. Soong, ``High performance
connected digit recognition using hidden Markov models," in
Proc. ICASSP, pp. 119-122, 1988.
33
M. J. Russel and R. K. Moore, ``Explicit modeling of state
occupancy in hidden Markov models for automatic speech
recognition," in Proc. ICASSP, pp. 5-8, 1985.
34
S. E. Levinson, ``Continuously variable duration hidden Markov
models for speech analysis," in Proc. ICASSP, pp. 1241-1244,
1986.
35
S. E. Levinson, A. Ljolje, and L. G. Miller, ``Large vocabulary
speech recognition using a hidden Markov Model for
acoustic/phonetic classification," in Proc. ICASSP, pp.
505-508, 1988.
36
J. L. Flanagan, Speech Analysis, Synthesis, and Perception,
NY: Springer-Verlag, 1972.
37
A. V. McCree and T. P. Barnwell III, ``A Mixed Excitation LPC
Vocoder Model for Low Bit Rate Speech Coding," IEEE Trans.
Speech Audio Processing, vol. 3, pp. 242-250, July, 1995.
38
S. H. Chen and Y. R. Wang, ``Tone Recognition of Continuous
Mandarin Speech Based on Neural Networks," IEEE Tran. Speech
Audio Processing, vol. 3, pp. 146-150, 1995.
39
T. Lee, P. C. Ching, L. W. Chan, Y. H. Cheng, and B. Mak, ``Tone
Recognition of Isolated Cantones Syllables," IEEE Tran.
Speech Audio Processing, vol. 3, pp. 204-209, May, 1995.
40
L. S. Lee, C. Y. Tseng, H. Y. Gu, F. H. Liu, C. H. Chang, Y. H.
Lin, Y. Lee, S. L. Tu, S. H. Hsieh, and C. H. Chen, ``Golden
Mandarin (I) - A Real-Time Mandarin Speech Dictation Machine for
Chinese Language with Very Large Vocabulary," IEEE Trans.
Speech Audio Processing, vol. 1, pp. 158-178, April 1993.
41
S. Potisuk, M. P. Harper, and J. Gandour, ``Classification of Thai
Tone Sequences in Syllable-Segmented Speech Using the
Analysis-by-Synthesis Method," IEEE Trans. Speech Audio
Processing, vol. 7, pp. 95-102, January, 1999.
42
L. R. Rabiner, M. J. Cheng, A. E. Rosenberg, and C. A. McGonegal,
``A Comparative Performance Study of Several Pitch Detection
Algorithms," IEEE Trans. Acoust., Speech, Signal Processing,
vol. ASSP-24, pp. 399-417, October, 1976.
43
S. Ahmadi and A. S. Spanias, ``Cepstrum-Based Pitch Detection
Using a New Statistical V/UV Classification Algorithm," IEEE
Trans. Speech Audio Processing, vol. 7, pp. 333-338, May, 1999.
44
J. D. Markel, ``The SIFT Algorithm for Fundamental Frequency
Estimation," IEEE Trans. Audio Electroacoust., vol. AU-20,
pp. 367-377, December, 1972.
45
B. Boashash, Time-Frequency Signal Analysis, NY: John Wiley
\& Sons, 1992.
46
S. Qian and D. Chen, Joint Time-Frequency Analysis, NJ:
Prentice-Hall, 1996.
47
S. G. Mallat and Z. Zhang, ``Matching Pursuits with Time-Frequency
Dictionaries," IEEE Trans. Signal Processing, vol. 41, pp.
3397-3415, December, 1993.
48
S. Qian and D. Chen, ``Signal Representation using Adaptive
Normalized Gaussian Functions," Signal Processing, vol. 36,
pp. 1-11, 1994.
49
L. R. Rabiner and B. H. Juang, Fundamentals of Speech
Recognition, NJ: Prentice-Hall, 1993.
50
J. G. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, NY: Macmillan
Publishing Co., 1993.
51
T. A. C. M. Claasen and W. F. G. Mecklenbr\"{a}uker, ``The Wigner Distribution---A tool for time-frequency signal analysis---Parts I, II, III," Philips J. Res., vol. 35, pp. 217-250, pp. 276-300, pp. 372-389, 1980.
52
A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal
Processing, Englewood Cliffs, NJ: Prentice-Hall, 1989.
53
R. W. Schafer and L. R. Rabiner, ``Digital Representations of
Speech Signals," Proceedings of the IEEE, vol. 63, no. 4, pp.
662-677, 1975.
54
R. W. Schafer and L. R. Rabiner, ``System for Automatic Formant
Analysis of Voiced Speech," J. Acoust. Soc. Amer., vol. 47, pp.
634-648, Feb. 1970.
55
A. M. Noll, ``Cepstrum Pitch Determination," J. Acoust. Soc.
Amer., vol. 41, pp. 293-309, Feb. 1967.
56
B. G. Secrest and G. R. Doddington, ``Postprocessing Techniques
for Voice Pitch Trackers," in Proc. IEEE ICASSP'82, pp.
172-175, 1982.
57
D. Talkin, ``A Robust Algorithm for Pitch Tracking (RAPT)", In
Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal
Eds., NY: Elsevier, 1995.
