|
When a speech recognition system operates over telephone networks, the acoustic mismatch between training and testing environments causes performance degradation. The mismatch in telephone environments is attributed to three sources: ambient noise, the channel effect, and variation among speakers. This dissertation describes a number of robust algorithms that improve recognition performance by compensating for these three mismatch factors. In experiments on hidden Markov model (HMM) based speech recognition, the proposed methods successfully overcome the mismatch problems in telephone environments. The effect of noise on the speech cepstral vector and its associated HMM acoustic parameters is investigated first. Because the cepstral vector shrinks in noisy environments, the projection-based likelihood measure, which uses an optimal equalization factor to adapt the cepstral mean vectors of the HMM parameters, is robust to noise contamination. This dissertation extends this measure by further compensating for the shrinkage of the covariance matrix and the bias of the mean vector. The compensation factors are obtained from a set of adaptation functions. Using this method, the recognition accuracy is remarkably improved. To overcome the channel effect in telephone speech, a channel-effect-cancellation method is developed. This approach estimates a channel-effect-cancellation filter as a convex combination of several reference filters. The reference filters, represented in the cepstral domain, are generated by clustering the cepstra of inverse telephone channels. The convex combination coefficients are calculated from the accumulated observation probabilities obtained when the testing utterance passes through the reference filters. Using this method, most of the channel effect can be canceled.
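The convex combination of reference filters described above can be illustrated with a minimal sketch. It is not the dissertation's implementation. It assumes that a single diagonal-covariance Gaussian stands in for the HMM observation model, that filtering is additive in the cepstral domain, and that the coefficients are obtained by normalizing the accumulated likelihoods with a softmax. The abstract states only that the coefficients come from accumulated observation probabilities, so the normalization and all function names here are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def accumulated_log_prob(frames, ref_filter, mean, var):
    """Accumulate the log observation probability of an utterance after it
    passes through one reference filter (additive in the cepstral domain)."""
    return sum(
        log_gauss_diag([f + c for f, c in zip(frame, ref_filter)], mean, var)
        for frame in frames
    )

def convex_weights(scores):
    """Assumed normalization: softmax of the accumulated log probabilities,
    giving non-negative weights that sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def estimate_cancellation_filter(frames, ref_filters, mean, var):
    """Return the convex combination of the reference filters and its weights."""
    scores = [accumulated_log_prob(frames, f, mean, var) for f in ref_filters]
    weights = convex_weights(scores)
    dim = len(ref_filters[0])
    combined = [
        sum(w * f[d] for w, f in zip(weights, ref_filters)) for d in range(dim)
    ]
    return combined, weights

# Toy data: clean speech centred on the model mean, shifted by a channel offset.
frames = [[1.0, -0.5], [1.1, -0.4], [0.9, -0.6]]
ref_filters = [[-1.0, 0.5], [0.0, 0.0], [1.0, -0.5]]
cancel_filter, weights = estimate_cancellation_filter(
    frames, ref_filters, mean=[0.0, 0.0], var=[1.0, 1.0]
)
```

In this toy setup the first reference filter undoes the simulated channel offset, so it receives the largest weight and the combined filter lies closest to it.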
Next, this dissertation presents two transformation-based adaptation approaches that adapt the HMM parameters so that they are acoustically close to the telephone environment. The bias and the affine transformations are examined. We apply the maximum a posteriori (MAP) estimation technique, which incorporates prior knowledge of the transformation, to estimate the transformation parameters. In our evaluation, transformation-based adaptation using MAP estimation outperforms that using maximum likelihood (ML) estimation. The affine transformation is also demonstrated to be superior to the bias transformation. Furthermore, a phone-dependent channel compensation (PDCC) technique is proposed for adapting the HMM parameters to a new channel environment by using some adaptation data. The HMM parameters are adapted by incorporating the corresponding PDCC vectors. To improve the performance, two extended PDCC techniques are presented. One is based on the refinement of PDCC using vector quantization. The other is based on the interpolation of compensation vectors. The PDCC approach is shown to be effective in telephone speech recognition as well as in speaker adaptation. In addition, we propose a hybrid algorithm for adapting the HMM parameters to a new speaker. This algorithm is constructed by iteratively and alternately combining three adaptation techniques. First, the clusters of HMM parameters are locally transformed through a group of transformation functions. Then, the transformed HMM parameters are globally smoothed via MAP adaptation. Within the MAP adaptation, the parameters of units unseen in the adaptation data are further adapted by applying the transfer vector interpolation scheme. Using this algorithm, the advantages of these three adaptation techniques can be captured simultaneously.
The resulting performance is consistently better than that of other methods for almost any practical amount of adaptation data.
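The difference between ML and MAP estimation of a bias transformation, mentioned above, can be sketched under strong simplifying assumptions. The sketch assumes a single Gaussian with known diagonal variance in place of the HMM, a known frame-to-model alignment, and an independent Gaussian prior on each bias component. The function names and the prior parameters are hypothetical and are not taken from the dissertation.

```python
def ml_bias(frames, mean):
    """ML estimate: average offset of the adaptation frames from the model mean."""
    n = len(frames)
    return [sum(f[d] - mean[d] for f in frames) / n for d in range(len(mean))]

def map_bias(frames, mean, var, prior_mean, prior_var):
    """MAP estimate under an assumed Gaussian prior N(prior_mean, prior_var)
    on each bias component; the prior pulls the estimate toward prior_mean."""
    n = len(frames)
    bias = []
    for d in range(len(mean)):
        offset_sum = sum(f[d] - mean[d] for f in frames)
        bias.append(
            (prior_var[d] * offset_sum + var[d] * prior_mean[d])
            / (n * prior_var[d] + var[d])
        )
    return bias

# Two one-dimensional adaptation frames and a zero-mean, unit-variance prior.
frames = [[1.0], [3.0]]
b_ml = ml_bias(frames, mean=[0.0])
b_map = map_bias(frames, mean=[0.0], var=[1.0], prior_mean=[0.0], prior_var=[1.0])
# With a very broad prior the MAP estimate approaches the ML estimate.
b_map_flat = map_bias(frames, mean=[0.0], var=[1.0], prior_mean=[0.0], prior_var=[1e9])
```

With only two frames the prior shrinks the MAP estimate toward zero relative to the ML estimate. As the amount of adaptation data grows, or as the prior becomes broader, the two estimates converge.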
|