Author (English): Chih-Hsien Huang
Thesis title (English): Bayesian Discriminative Speaker Adaptation for Large Vocabulary Continuous Speech Recognition
Advisor (English): Jen-Tzung Chien
Keywords (English): LVCSR; Bayesian Learning; Discriminative Learning; Speaker Adaptation; Adaptive Duration Model; Quasi-Bayes Estimate; AAPLR
  High-performance automatic recognition and summarization of continuous speech is the aim of much speech-related research. In the laboratory, recognition performance is generally good because the training and test data come from the same environmental conditions. In real applications, however, there are many sources of mismatch between training and test data, such as speaking rate, environmental noise and channel effects. The most popular technique for dealing with these mismatches is speaker/environment adaptation. More specifically, it is desirable for a continuous speech recognition system to be equipped with rapid, sequential and discriminative adaptation capabilities.

  This dissertation proposes a series of Bayesian and discriminative adaptation methods for large vocabulary continuous speech recognition (LVCSR). We aim to adapt speaker-independent models to the test environment and speaker. The first method deals with the mismatch of duration models. Bayesian speech duration modeling and learning is presented for hidden Markov model (HMM) based speech recognition. We focus on sequential learning of HMM state durations using the quasi-Bayes (QB) estimate. The adapted duration models are robust to nonstationary speaking rates and noise conditions. In this study, Gaussian, Poisson and gamma distributions are investigated for characterizing the duration models. The maximum a posteriori (MAP) estimate of the gamma duration model is developed. To exploit sequential learning, we adopt the Poisson duration model with a gamma prior density, which belongs to the conjugate prior family. When the adaptation data are observed sequentially, the resulting gamma posterior density offers two advantages. One is that it determines the optimal QB duration parameter, which can be merged into the HMMs for speech recognition. The other is that it provides the updating mechanism for the gamma prior statistics in sequential learning. The EM algorithm is applied to carry out QB parameter estimation. The adaptation of the overall HMM parameters can be performed simultaneously.
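
The conjugate gamma-Poisson update that underlies this kind of sequential learning can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes the state durations are directly observed (for example from a forced alignment), whereas the abstract states that the EM algorithm is used for QB estimation, and it takes the posterior mode as the point estimate. The class name `GammaPrior` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GammaPrior:
    """Gamma(shape, rate) density over the Poisson mean duration (in frames)."""
    shape: float
    rate: float

    def update(self, durations):
        """Return the posterior after observing a batch of state durations.

        Conjugacy: a Gamma(a, b) prior combined with Poisson counts
        d_1..d_n gives a Gamma(a + sum(d), b + n) posterior.
        """
        return GammaPrior(self.shape + sum(durations), self.rate + len(durations))

    def mode(self):
        """Posterior mode, used here as the point estimate of the Poisson mean."""
        return max(self.shape - 1.0, 0.0) / self.rate


# Hypothetical prior with mean duration shape / rate = 5 frames.
prior = GammaPrior(shape=10.0, rate=2.0)

# Hypothetical adaptation batches from a slower speaker, arriving in sequence.
batches = [[8, 9, 7], [10, 8], [9, 9, 11, 8]]

estimates = []
for batch in batches:
    prior = prior.update(batch)  # the posterior becomes the next prior
    estimates.append(prior.mode())

print(estimates)
```

Because only the two hyperparameters are carried forward, each batch can be discarded after it is processed, and processing the batches one at a time yields the same posterior as processing them all at once.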

  We also present two approaches to improving the performance of eigenvoice-based speaker adaptation. First, we present maximum a posteriori eigen-decomposition (MAPED), in which the linear combination coefficients for eigenvector decomposition are estimated according to the MAP criterion. By incorporating prior knowledge of the decomposition through a Gaussian distribution, MAPED achieves better performance than maximum likelihood eigen-decomposition (MLED) when adaptation data are scarce. Second, we exploit the adaptation of HMM covariance matrices within the framework of eigenvoice speaker adaptation. Our method uses principal component analysis (PCA) to project the speaker-specific HMM parameters onto a smaller orthogonal model space. We then reliably calculate the HMM covariance matrices using the observations in the reduced space. The HMM covariance matrices are finally estimated by transforming the covariance matrices in the reduced space back to the original space.
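
The difference between an ML and a MAP estimate of the eigenvoice weights can be illustrated with a deliberately simplified sketch. It assumes orthonormal eigenvoices, isotropic Gaussian noise and a directly observed speaker offset (supervector minus the mean supervector), so that each weight decouples; the dissertation's MLED and MAPED estimates are computed from HMM adaptation statistics, which this sketch does not model. The function name `eigen_weights` and all values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def eigen_weights(offset, eigenvoices, noise_var, prior_means=None, prior_vars=None):
    """Estimate eigenvoice weights for a speaker offset.

    With no prior, each weight is the ML projection onto the eigenvoice.
    With a Gaussian prior N(prior_means[k], prior_vars[k]) on weight k,
    each weight is the precision-weighted MAP combination of the
    projection and the prior mean.
    """
    weights = []
    for k, eigenvoice in enumerate(eigenvoices):
        projection = dot(eigenvoice, offset)
        if prior_means is None:
            weights.append(projection)
            continue
        data_precision = 1.0 / noise_var
        prior_precision = 1.0 / prior_vars[k]
        weights.append(
            (data_precision * projection + prior_precision * prior_means[k])
            / (data_precision + prior_precision)
        )
    return weights


# Hypothetical 3-dimensional model space with two orthonormal eigenvoices.
eigenvoices = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
offset = [2.0, -1.0, 0.5]

ml_weights = eigen_weights(offset, eigenvoices, noise_var=1.0)
map_weights = eigen_weights(
    offset, eigenvoices, noise_var=1.0, prior_means=[0.0, 0.0], prior_vars=[1.0, 1.0]
)
print(ml_weights, map_weights)
```

In this sketch a large `noise_var`, which stands in for scarce adaptation data, pulls the MAP weights toward the prior means, while a small `noise_var` makes the MAP weights approach the ML projection.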

  To establish Bayesian discriminative adaptation, we also present a new linear regression adaptation algorithm for LVCSR. The cluster-dependent regression matrices are estimated from speaker-specific adaptation data by maximizing the aggregate a posteriori probability, which can be expressed as a classification error function that adopts the logarithm of the posterior distribution as the discriminant function. Accordingly, aggregate a posteriori linear regression (AAPLR) is developed for discriminative adaptation, in which the classification errors on the adaptation data are minimized. Because the prior distribution of the regression matrix is involved, AAPLR is equipped with Bayesian learning capability. We demonstrate that the difference between AAPLR discriminative adaptation and maximum a posteriori linear regression (MAPLR) adaptation lies in the treatment of the evidence. Unlike minimum classification error linear regression (MCELR), AAPLR has a closed-form solution, which enables rapid adaptation. The proposed AAPLR speaker adaptation is investigated by comparison with maximum likelihood linear regression (MLLR), MAPLR, MCELR and conditional maximum likelihood linear regression (CMLLR).
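
How cluster-dependent regression matrices act on the model, once they have been estimated, can be sketched as below. The sketch only applies the transforms; it does not implement the AAPLR criterion or its closed-form solution. It assumes the usual linear regression convention in which each Gaussian mean is extended with a leading 1 and multiplied by a D x (D + 1) matrix, and the cluster names, means and matrix values are hypothetical placeholders.

```python
def adapt_mean(mean, regression_matrix):
    """Return W applied to the extended mean [1, mu_1, ..., mu_D].

    The first column of W acts as a bias and the remaining columns as a
    linear transform of the original mean.
    """
    extended = [1.0] + list(mean)
    return [sum(w * x for w, x in zip(row, extended)) for row in regression_matrix]


# Hypothetical 2-dimensional Gaussian means, each tied to a regression cluster.
means = {
    "state_a": ([1.0, 2.0], "vowels"),
    "state_b": ([1.0, -1.0], "consonants"),
}

# Hypothetical regression matrices, one per cluster, laid out as [bias | A].
regression = {
    "vowels": [[0.5, 1.0, 0.0], [0.0, 0.0, 1.0]],      # shift dimension 0 by 0.5
    "consonants": [[0.0, 2.0, 0.0], [1.0, 0.0, 2.0]],  # scale by 2, shift dimension 1
}

adapted = {
    name: adapt_mean(mean, regression[cluster])
    for name, (mean, cluster) in means.items()
}
print(adapted)
```

Sharing one matrix across all Gaussians in a cluster is what lets a small amount of adaptation data move many model parameters at once.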

  In the experiments, we examine the incremental and discriminative speaker adaptation algorithms for large vocabulary continuous speech recognition. We adopt Mandarin broadcast news databases for system evaluation. Broadcast news speech is collected in a spontaneous speaking style, which is regarded as the most challenging task among speech recognition applications. In addition, the vocabulary is so large that a full search algorithm is impractical in a real implementation. We therefore apply the tree copy search algorithm to implement an LVCSR search that requires less computation and less storage. The experimental results show that recognition performance is improved by the proposed Bayesian duration adaptation, Bayesian eigenvoice adaptation and AAPLR discriminative adaptation.
Chinese Abstract iii
Abstract v
Acknowledgement viii

Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Related Works 2
1.3 Outline of This Dissertation 6
Chapter 2 Basics of Speech Recognition 7
2.1 Statistical Speech Recognition 7
2.1.1 Bayesian Theory 8
2.1.2 Preprocessing of Speech, Speech Units and Lexicons 9
2.2 Hidden Markov Models in Speech Recognition 10
2.3 Large Vocabulary Continuous Speech Recognition 10
2.3.1 Tree Organization of Lexicons 11
2.3.2 One Pass Search Algorithm 12
2.3.3 Tree Copy Search 14
2.3.4 Look-ahead and Pruning 16
2.4 Speaker Adaptation 18
Chapter 3 Bayesian Learning of Speech Duration Models 21
3.1 Parametric Duration Modeling 22
3.1.1 ML Parameter Estimation 23
3.1.2 ML Estimation for Different Duration Parameters 24
3.2 Bayesian Learning of Duration Models 26
3.2.1 MAP and QB Parameter Estimation 26
3.2.2 MAP Estimation for Gamma Duration Parameters 28
3.2.3 QB Estimation for Gaussian Duration Parameters 29
3.2.4 QB Estimation for Poisson Duration Parameters 30
3.3 Experiments 31
3.3.1 Experimental Setup 31
3.3.2 Implementation Issues 33
3.3.3 Evaluation of Different ML Duration Models 34
3.3.4 Evaluation of MAP Batch Learning for Different Duration Models 37
3.3.5 Evaluation of QB Sequential Learning for Different Duration Models 38
3.3.6 Evaluation of Recognition and Adaptation Time Costs for Different Duration Models 42
3.4 Summary 43
Chapter 4 A New Eigenvoice Approach to Speaker Adaptation 45
4.1 Eigenvoice 45
4.2 Maximum a Posteriori Eigen-decomposition 47
4.3 Eigenvoice-based Covariance Adaptation 49
4.4 Experiments 51
4.5 Summary 53
Chapter 5 Aggregate a Posteriori Linear Regression 55
5.1 Review of Discriminative Training and Linear Regression Algorithm 56
5.1.1 MCE and MMI Discriminative Training 56
5.1.2 MLLR and MAPLR Adaptation 58
5.1.3 MCELR, CMLLR and MPELR Adaptation 59
5.2 Aggregate a Posteriori Linear Regression Adaptation 61
5.2.1 AAP Probability 62
5.2.2 AAPLR Criterion 63
5.2.3 Relation between AAPLR and MAPLR Criteria 64
5.2.4 Derivation of AAPLR Solution 65
5.2.5 Comparison of Different Linear Regression Adaptation 67
5.3 Experiments 69
5.3.1 Speech Database and Experimental Setup 69
5.3.2 Implementation Issues 71
5.3.3 Linear Regression Adaptation Versus Recognition Performance and Adaptation Time 72
5.3.4 Evaluation of MLLR, MAPLR, MCELR, CMLLR and AAPLR for Unsupervised Adaptation 75
5.3.5 Evaluation of Speech Recognition Performance for Multiple Speaker Adaptation 77
5.4 Summary 78
Chapter 6 Conclusion 81
Bibliography 83
Author’s Biographical Notes 90
