National Digital Library of Theses and Dissertations in Taiwan
Detailed Record

Author: 林經展
Author (English): Jin-Jang Lin
Title: 音訊視訊轉換之研究
Title (English): A Study on Audio-to-Visual Conversion
Advisor: 鐘太郎
Advisor (English): Tai-Lang Jong
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Document type: Academic thesis
Publication year: 2002
Graduation academic year: 90 (ROC calendar)
Language: English
Pages: 56
Chinese keywords: 音訊視訊轉換 (audio-to-visual conversion); 適應共振理論 (adaptive resonance theory)
English keywords: Audio-to-Visual Conversion; ART2
Usage statistics:
  • Cited: 0
  • Views: 112
  • Rating: none
  • Downloads: 0
  • Bookmarked: 0
Abstract (translated from Chinese):
With the advances in computers, multimedia, and the Internet, interfaces for long-distance interpersonal communication have become a very popular research topic. Many studies address it, and the core difficulty is that bandwidth is limited while video data is very large. By using a virtual but highly lifelike human face model together with audio-to-visual conversion, we can avoid directly transmitting the bulky video data. Audio-to-visual conversion is therefore of real importance for real-time multi-party video conferencing.
In this thesis, we focus on the study of audio-to-visual conversion. For the single-user system, we design with two different approaches, GRBF and ART2. GRBF effectively improves the efficiency of classifier-based audio-to-visual conversion systems. ART2 is a neural network developed from ART; the difference is that ART2 can process analog (real-valued) signals, whereas the original ART handles only binary signals. The architecture and learning style of ART2 closely resemble the organization of human memory and learning, and, like the human brain, it has the advantages of sustained memory and continual learning. In our experiments we found that ART2 performs better than GMM in model building and test error rate. However, we also found that the drawback of ART2 is that its model parameters are very large, making it difficult to realize in a multi-user audio-to-visual conversion system. For the multi-user case, we use an established reference model together with an "audio adaptation and visual learning" scheme. The reference model contains a common speech model and a common mouth-parameter model. For the speech model, each user only needs to transmit the correction parameters of his or her speech model relative to the reference speech model, which effectively reduces the size of the model parameters. Our experimental results show that ART2 is indeed well suited to audio-to-visual conversion.

Research on creating friendly human interfaces between a human and a computer, or between humans in distant locations, has flourished lately, partly because of the advances in computer, multimedia, and Internet technologies. One style uses an avatar; others use a synthetic animated human face to provide an effective and efficient "face-to-face" multimodal communication channel in distributed collaboration environments. Most adopt a real-time speech-driven face animation technique to avoid directly transmitting the much larger video data, in order to meet real-time interpersonal communication requirements. Audio-to-visual conversion plays an important role in such real-time speech-driven face animation systems.
In this thesis, we focus on deriving the lip movements of a human user from the corresponding speech signal, for use in a speech-driven facial expression animation system. Two methods are proposed to design an audio-to-visual system for the single-user case, namely GRBF and ART2. GRBF improves the efficiency of a VQ-based audio-to-visual conversion system. Adaptive resonance theory 2 (ART2) extends the earlier ART model with the capability of handling real-valued input signals. ART works like human memory: it can learn new things quickly without forgetting what it learned in the past. In our experiments, ART2 learned faster than GMM with a comparable error rate. However, the large size of its model parameters is the main disadvantage of ART2. For the multi-user case, a framework utilizing a reference ART2 audio-to-visual conversion model and an audio-adaptation and visual-learning mechanism is proposed to handle multi-user adaptation. Since the same reference ART2 model serves every user, only the incremental differences between a new user and the reference model need to be transmitted, which partially overcomes the size disadvantage of ART2 in the multi-user case. Experiments support the suitability of ART2 for the audio-to-visual conversion problem.
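The thesis's actual ART2 equations (the F1 field dynamics and orienting subsystem listed in the table of contents below) are not reproduced on this page. As a rough illustration of the vigilance-gated category learning that the abstract describes, and of how each audio category can be tied to a lip-parameter vector, here is a minimal sketch. All names (`ART2Lite`, `learn_pair`, `convert`), the cosine-similarity match, and the running-average lip table are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

class ART2Lite:
    """Simplified ART2-style categorizer: unit-norm prototypes,
    a vigilance-gated match test, and incremental prototype update."""

    def __init__(self, vigilance=0.95, lr=0.5):
        self.rho = vigilance   # vigilance threshold in [0, 1]
        self.lr = lr           # prototype learning rate
        self.prototypes = []   # one unit vector per category

    @staticmethod
    def _norm(x):
        x = np.asarray(x, dtype=float)
        n = np.linalg.norm(x)
        return x / n if n > 0 else x

    def train(self, x):
        """Return the category index for x, creating a new one on mismatch."""
        x = self._norm(x)
        if self.prototypes:
            sims = [float(p @ x) for p in self.prototypes]
            j = int(np.argmax(sims))
            if sims[j] >= self.rho:          # resonance: update the winner
                p = self.prototypes[j] + self.lr * (x - self.prototypes[j])
                self.prototypes[j] = self._norm(p)
                return j
        self.prototypes.append(x)            # mismatch reset: new category
        return len(self.prototypes) - 1

# Audio-to-visual lookup: each audio category accumulates an average lip vector.
net = ART2Lite(vigilance=0.95)
lip_table = {}  # category index -> (sum of lip parameter vectors, count)

def learn_pair(audio_feat, lip_params):
    """Train on one (audio frame, lip parameters) pair."""
    lip_params = np.asarray(lip_params, dtype=float)
    j = net.train(audio_feat)
    s, c = lip_table.get(j, (np.zeros_like(lip_params), 0))
    lip_table[j] = (s + lip_params, c + 1)

def convert(audio_feat):
    """Map an audio frame to lip parameters via its best-matching category."""
    x = ART2Lite._norm(audio_feat)
    sims = [float(p @ x) for p in net.prototypes]
    j = int(np.argmax(sims))
    s, c = lip_table[j]
    return s / c
```

A low vigilance merges many audio frames into one coarse mouth shape, while a high vigilance grows many fine-grained categories (and hence a large parameter set), which mirrors the model-size disadvantage of ART2 noted in the abstract.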

Chapter 1 Introduction 1
1.1 History 2
1.2 Thesis Organization 3
Chapter 2 Background in Speech Processing and Audio Feature Extraction 4
2.1 Speech Production 4
2.1.1 Multitube Lossless Model of the Vocal Tract 5
2.2 Linear Prediction Analysis 8
2.2.1 Levinson-Durbin Recursion 10
2.3 Line Spectrum Pair 11
2.4 Mel-Scale Cepstrum 12
2.5 Audio Feature Extraction 16
Chapter 3 Single-User Audio-to-Visual Conversion 17
3.1 Radial-Basis Function Network (RBF) 18
3.1.1 Generalized Radial-Basis Function Network 18
3.1.2 Vector Quantization Algorithm 21
3.2 Adaptive Resonance Theory 2 23
3.2.1 ART2 Architecture 23
3.2.2 The Processing Equations of ART2 25
3.2.3 ART2 Orienting Subsystem 26
3.3 Database and Error Analysis 27
3.3.1 Database 27
3.3.2 Error Analysis 29
3.4 GRBF Simulations and Results 29
3.4.1 Difference in the Number of Centers 31
3.4.2 Difference in the Order of the Median Filter 31
3.4.3 Single-Frame and Multi-Frame 32
3.4.4 VQ vs. GRBF 33
3.5 ART2 System Simulation 35
3.5.1 Difference in Threshold Values 36
3.5.2 Single-Frame and Multi-Frame 37
3.5.3 Difference in Training Pattern Segmentation 37
3.5.4 ART2 vs. ART2 A_E 38
3.5.5 Simulation for Never-Trained Words 40
3.6 Performance and Conclusion for Single-User Audio-to-Visual Conversion 43
Chapter 4 Multi-User Audio-to-Visual Conversion 46
4.1 Audio-Visual Adaptation 46
4.2 Simulations and Results 49
4.3 Improvement of ART2 53
Chapter 5 Conclusions and Future Work 54
References 55

[1] D. G. Stork, G. Wolff, and E. Levine, "Neural network lipreading system for improved speech recognition," in Proc. Int. Joint Conf. Neural Networks, pp. 285-295, 1992.
[2] S. Haykin, Neural Networks, Prentice Hall, pp. 156-255, 466-473, 1999.
[3] R. R. Rao and T. Chen, "Audio-to-visual conversion for multimedia communication," IEEE Transactions on Industrial Electronics, vol. 45, no. 1, pp. 15-22, Feb. 1998.
[4] R. R. Rao and T. Chen, "Audio-to-visual integration in multimedia communication," Proceedings of the IEEE, vol. 86, no. 5, pp. 837-852, May 1998.
[5] Y.-J. Chang, C.-C. Chen, J.-C. Chou, and Y.-C. Chen, "Virtual Talk: a model-based virtual phone using a layered audio-visual integration," in Proc. IEEE Int. Conf. Multimedia and Expo (ICME 2000), vol. 1, pp. 415-418, 2000.
[6] F. Lavagetto, "Time-delay neural networks for estimating lip movements from speech analysis: a useful tool in audio-video synchronization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 5, pp. 786-800, 1997.
[7] R. Rao, R. Mersereau, and T. Chen, "Using HMM's in audio-to-visual conversion," in Proc. IEEE First Workshop on Multimedia Signal Processing, pp. 19-24, 1997.
[8] K. Choi and J.-N. Hwang, "Baum-Welch hidden Markov model inversion for reliable audio-to-visual conversion," in Proc. IEEE Third Workshop on Multimedia Signal Processing (MMSP'99), Copenhagen, Denmark, pp. 175-180, Sept. 1999.
[9] J. R. Deller, Jr., J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, ch. 3, Macmillan Publishing Company.
[10] J. R. Deller, Jr., J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, ch. 5, Macmillan Publishing Company.
[11] J. R. Deller, Jr., J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, ch. 6, Macmillan Publishing Company.
[12] G. A. Carpenter and S. Grossberg, "ART 2: self-organization of stable category recognition codes for analog input patterns," Applied Optics, vol. 26, no. 23, pp. 4919-4930, 1987.
[13] G. A. Carpenter and S. Grossberg, "The ART of adaptive pattern recognition by a self-organizing neural network," IEEE Computer, vol. 21, no. 3, pp. 77-88, March 1988.
[14] M. J. F. Gales, D. Pye, and P. C. Woodland, "Variance compensation within the MLLR framework for robust speech recognition and speaker adaptation," in Proc. Fourth Int. Conf. Spoken Language Processing, vol. 3, pp. 1832-1835, 1996.
[15] C. E. Priebe, "Adaptive mixtures," Journal of the American Statistical Association, vol. 89, no. 427, Sep. 1994.
[16] C.-C. Chen, "Adaptation of Gaussian Mixture Model for Multi-user Audio to Visual Conversion," Master's thesis, National Tsing Hua University, June 2000.
[17] R.-Y. Yu, "Frame Based Audio to Visual Conversion Using Line Spectrum Pairs," Master's thesis, National Tsing Hua University, June 2001.
