National Digital Library of Theses and Dissertations in Taiwan

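Author: 陳宗佑
Author (English): Zong-You Chen
Thesis title: 基於聲韻辨識之互動式即時語音驅動人臉系統
Thesis title (English): Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition
Advisor: 王駿發
Advisor (English): Jhing-Fa Wang
Degree: Master's
Institution: National Cheng Kung University
Department: Department of Electrical Engineering (master's and doctoral program)
Discipline: Engineering
Field: Electrical and information engineering
Thesis type: Academic thesis
Year of publication: 2009
Graduation academic year: 97 (ROC calendar)
Language: English
Number of pages: 51
Chinese keywords: 聲韻辨識; 語音驅動人臉
English keywords: Voice-Driven Human Talking Face; Phonetic Recognition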
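Technology always originates from human nature! Interactive multimedia applications, increasingly common in everyday life, still leave much room for research and improvement; improving this technology to bring people greater convenience has been our goal all along.
In this thesis, we propose a real-time voice-driven talking-face technique for use in audio-visual communication systems. The system applies pre-emphasis and a Hamming window to each received speech segment and then extracts 12th-order LPCCs (Linear Predictive Cepstral Coefficients) as feature parameters; Chinese finals (vowels) are recognized by classification with an SVM (Support Vector Machine).
Because speaking habits and accents differ from person to person, we take the mouth-shape pictures corresponding to the 16 Chinese single finals as spoken by each user and use SAD (Sum of Absolute Differences) to group the pictures with small differences into one class, finding the classification that best matches the individual's speech characteristics and thereby improving recognition rate and performance.
Finally, alpha blending is adopted to smooth the transition between two pictures: it mixes the pixels of the source and destination pictures by adjusting picture transparency, so that picture transitions appear as real-time animation. In the experiments, the phoneme error rate (PER) for Chinese single finals under this framework is 19.22%, which classification reduces to 8.78%; the word error rate (WER) is 27.65%; and the MOS ratings for single-sound recognition rate and for the naturalness and fluency of the animation average 3.43 points.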
Technology always comes from human nature. Interactive multimedia applications are increasingly common in daily life, yet they still leave considerable room for improvement. Our ongoing goal is to improve this technology and make it more convenient for people.
In this thesis, we propose a real-time voice-driven talking-face technique for digital home communication systems. Each speech segment is first pre-emphasized and Hamming-windowed. The 12th-order linear predictive cepstral coefficients (LPCCs) are then extracted as the feature vector for the segment. Chinese phonetic symbols are recognized with support vector machines (SVMs).
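The feature-extraction chain named in this paragraph (pre-emphasis, Hamming window, 12th-order LPCC) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the pre-emphasis coefficient 0.97, the autocorrelation method with the Levinson-Durbin recursion, and all function names are assumptions chosen for the example.

```python
import math

def pre_emphasis(signal, coeff=0.97):
    # y[n] = x[n] - coeff * x[n-1]; coeff=0.97 is an assumed, commonly used value.
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(frame):
    # Multiply the frame by w[n] = 0.54 - 0.46 * cos(2*pi*n / (N-1)).
    size = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)))
            for n, x in enumerate(frame)]

def lpc(frame, order):
    # Autocorrelation method solved with the Levinson-Durbin recursion.
    # Returns predictor coefficients a[1..order] for x[n] ~ sum_k a[k] * x[n-k].
    r = [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
         for k in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:  # degenerate (e.g. silent) frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1 - k * k
    return a[1:]

def lpcc(a):
    # LPC-to-cepstrum recursion:
    # c[m] = a[m] + sum_{k=1}^{m-1} (k/m) * c[k] * a[m-k], for 1 <= m <= p.
    p = len(a)
    c = [0.0] * p
    for m in range(1, p + 1):
        c[m - 1] = a[m - 1] + sum((k / m) * c[k - 1] * a[m - k - 1]
                                  for k in range(1, m))
    return c

def extract_features(frame, order=12):
    # Pre-emphasis -> Hamming window -> LPC -> LPCC, in the order the abstract lists.
    return lpcc(lpc(hamming(pre_emphasis(frame)), order))
```

Calling `extract_features(frame)` on one frame of samples returns a 12-element vector, which would then be the input to the classifier.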
The mouth-shape pictures of the 16 Chinese single vowels can be clustered into several groups according to shape similarity. Because every speaker has a distinct accent and speaking habits, we use the sum of absolute differences (SAD) as the shape-difference measure and cluster each user's mouth shapes into several categories. Since the categories are derived per user, they fit that user's speech characteristics, which improves the recognition rate and performance.
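As a rough illustration of SAD-based grouping, the sketch below computes the SAD between grayscale mouth-shape pictures and groups pictures whose SAD to a cluster's first member stays within a threshold. The abstract does not state the grouping rule, so the greedy threshold scheme, the `threshold` parameter, and the function names are assumptions for this example only.

```python
def sad(img_a, img_b):
    # Sum of absolute differences between two equally sized grayscale images,
    # each given as a list of rows of pixel intensities.
    return sum(abs(pa - pb)
               for row_a, row_b in zip(img_a, img_b)
               for pa, pb in zip(row_a, row_b))

def cluster_by_sad(images, threshold):
    # images: dict mapping a vowel label to its mouth-shape image.
    # Greedy grouping (an assumed rule): a picture joins the first cluster
    # whose first member differs from it by at most `threshold`;
    # otherwise it starts a new cluster.
    clusters = []
    for label, img in images.items():
        for cluster in clusters:
            if sad(images[cluster[0]], img) <= threshold:
                cluster.append(label)
                break
        else:
            clusters.append([label])
    return clusters
```

Running this once per user on that user's own pictures gives per-user categories, which matches the personalization the paragraph describes; a lower threshold yields more, finer categories.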
Finally, we use alpha blending, which mixes the pixels of the source and destination pictures by adjusting a picture's transparency level, to smooth the transition between two successive pictures. Experimental results show a phoneme error rate (PER) of 19.22%, which falls to 8.78% after phoneme clustering, and a word error rate (WER) of 27.65%. The mean opinion score (MOS) covering single-word recognition, delay, and naturalness of the whole system averages 3.43 points.
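The alpha-blending transition can be sketched as a per-pixel weighted mix of the two pictures. This is a minimal example under assumptions: grayscale images stored as lists of rows, `alpha` taken as the opacity of the destination picture, and a linear ramp of `alpha` across a hypothetical number of intermediate frames (`steps`).

```python
def alpha_blend(src, dst, alpha):
    # Per-pixel mix: out = (1 - alpha) * src + alpha * dst,
    # where alpha in [0, 1] is the opacity of the destination picture.
    return [[round((1 - alpha) * s + alpha * d) for s, d in zip(row_s, row_d)]
            for row_s, row_d in zip(src, dst)]

def transition_frames(src, dst, steps):
    # Intermediate frames for alpha = 1/steps, 2/steps, ..., 1 (linear ramp assumed),
    # so the last frame equals the destination picture.
    return [alpha_blend(src, dst, i / steps) for i in range(1, steps + 1)]
```

Showing the frames from `transition_frames` in sequence fades one mouth shape into the next rather than switching abruptly; more steps give a smoother transition at the cost of more frames to render per transition.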
Abstract (Chinese)
Abstract
Acknowledgements
Index of Figures
Index of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Thesis Organization
Chapter 2 Background and Related Works
2.1 Background
2.2 MPEG-4 Facial Animation
2.3 Audio-Visual Articulatory Model
2.4 Talking Head System Based on a Single Face Image
Chapter 3 Framework of Proposed System
3.1 Mandarin Vowels Classification Based on Mouth Shape
3.2 Feature Extraction
3.2.1 Frame Blocking
3.2.2 Energy
3.2.3 Pre-Emphasis
3.2.4 Hamming Window
3.2.5 Cepstrum Coefficient
3.2.6 Linear Predictive Cepstral Coefficients
3.3 Vowels Recognition Based on Support Vector Machine
3.3.1 Linear Classifier
3.3.2 Non-Separable Case
3.3.3 Kernel Function
3.3.4 Multi-Class SVMs
3.4 Alpha Blending
3.4.1 Introduction of Alpha Channel
3.4.2 Alpha Blending
Chapter 4 Experiment and Comparison
4.1 Experimental Setup
4.2 Experimental Results and Comparison
Chapter 5 Conclusion & Future Work
References