Researcher: 黃喆青 (Che-Ching Huang)
Thesis Title (Chinese): 應用語者和發音屬性特徵表示於語者切換偵測
Thesis Title (English): Speaker Change Detection using Speaker and Articulatory Feature Embeddings
Advisor: 吳宗憲 (Chung-Hsien Wu)
Degree: Master's
Institution: National Cheng Kung University
Department: Department of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Information Engineering
Thesis Type: Academic thesis
Year of Publication: 2019
Graduation Academic Year: 107 (2018–2019)
Language: English
Number of Pages: 49
Keywords (Chinese): 語者切換偵測、語者特徵、發音屬性特徵
Keywords (English): Speaker Change Detection, Speaker Representation, Articulatory Features
Abstract (Chinese):
With the improvement and advancement of many speech-processing technologies, voice-interactive software and products have become widespread. For multi-speaker dialogue speech, speaker change point detection is needed as a pre-processing step before further analysis and processing. Most previous work on speaker change point detection performs detection based on acoustic features alone; the method proposed in this thesis instead uses articulatory features to capture differences in speakers' pronunciation characteristics, improving the accuracy of speaker change point detection and complementing the acoustic information.
In this thesis, acoustic features are extracted from speech and a convolutional neural network is trained as a speaker representation model to obtain a vector representing the speaker's characteristics. A separate articulatory feature model is also trained, and a multilayer perceptron is used to extract an articulatory feature embedding from it. These two vectors are then used to train another multilayer perceptron as the speaker change detection model, which detects and determines the exact positions of the change points.
Two corpora are used in this thesis. The first is VoxCeleb2, a corpus widely used in speaker recognition; the second is Librispeech, a corpus often used in speaker recognition and speech recognition research. Because dialogue data are required, the dialogue corpus used in this thesis is composed from the VoxCeleb2 corpus. Three models are trained: a speaker embedding model, an articulatory feature model, and a speaker change point detection model.
The experimental results show that, on the speaker change point detection task, adding articulatory features reduces the false alarm rate by 1.94% and improves accuracy by 1.1%, precision by 2.04%, and F1 score by 0.16%. The results indicate that the proposed method outperforms the traditional approach and can be applied to products that require speaker change detection.
Abstract (English):
Nowadays, with the improvement and advancement of many speech-processing technologies, voice-interactive software and products have become increasingly popular. For multi-speaker dialogue speech, speaker change point detection is needed as a pre-processing step before further analysis and processing. Most previous work on speaker change point detection relies solely on acoustic features. The method proposed in this thesis instead exploits articulatory features, which capture differences in speakers' pronunciation characteristics, to improve the accuracy of speaker change point detection and to complement the acoustic information.
In this thesis, a convolutional neural network is trained as a speaker embedding model on acoustic features extracted from speech, yielding a vector that represents the speaker's characteristics. In addition, an articulatory feature (AF) model is trained, and a multilayer perceptron is used to extract an AF embedding from the speech features. Finally, these two vectors are fed to another multilayer perceptron, which is trained as the speaker change detection model and used to determine the exact position of each change point.
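The abstract does not give the exact network configurations, so the following PyTorch sketch only illustrates how the three components might be wired together; the module names (SpeakerCNN, AFEncoder, ChangeDetector), layer sizes, and the simple concatenation of the paired embeddings are assumptions for illustration, not the thesis's actual architecture.

```python
# Minimal sketch of the three-component pipeline (all shapes and sizes assumed).
import torch
import torch.nn as nn

class SpeakerCNN(nn.Module):
    """CNN that maps a log-mel spectrogram segment to a speaker embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # pool over time and frequency
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, x):                     # x: (batch, 1, n_mels, frames)
        return self.fc(self.conv(x).flatten(1))

class AFEncoder(nn.Module):
    """MLP that maps frame-level acoustic features to an articulatory-feature embedding."""
    def __init__(self, in_dim=40, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):                     # x: (batch, in_dim), e.g. averaged frames
        return self.net(x)

class ChangeDetector(nn.Module):
    """MLP that decides whether a speaker change occurs between two adjacent segments."""
    def __init__(self, spk_dim=128, af_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * (spk_dim + af_dim), 256), nn.ReLU(),
            nn.Linear(256, 1),                # logit: change vs. no change
        )

    def forward(self, spk_a, af_a, spk_b, af_b):
        pair = torch.cat([spk_a, af_a, spk_b, af_b], dim=-1)
        return self.net(pair)
```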
Two speech databases were used in this thesis. The first was VoxCeleb2, a corpus widely used in the field of speaker identification. The second was Librispeech, a corpus widely used in speech and speaker recognition research. Three models were trained: the speaker embedding model, the AF embedding model, and the speaker change detection model. The speaker embedding model was trained and evaluated on the VoxCeleb2 database, the articulatory feature embedding model on the Librispeech database, and the speaker change detection model on a dialogue corpus composed from the VoxCeleb2 database.
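The abstract states only that a dialogue corpus was composed from VoxCeleb2 and does not describe the procedure, so the snippet below is a minimal sketch of one plausible way to simulate two-speaker dialogues from single-speaker utterances; the pairing strategy, turn count, and labeling scheme are all assumed for illustration.

```python
# Sketch: simulate a two-speaker dialogue by alternating utterances from two
# speakers and recording the turn boundaries as change-point labels.
import random
import numpy as np

def compose_dialogue(utts_by_speaker, n_turns=4, sample_rate=16000):
    """utts_by_speaker: dict mapping speaker id -> list of 1-D waveform arrays."""
    spk_a, spk_b = random.sample(list(utts_by_speaker), 2)
    turns, change_points, t = [], [], 0.0
    for i in range(n_turns):
        wav = random.choice(utts_by_speaker[spk_a if i % 2 == 0 else spk_b])
        turns.append(wav)
        t += len(wav) / sample_rate
        if i < n_turns - 1:
            change_points.append(t)   # a speaker change occurs at each turn boundary
    return np.concatenate(turns), change_points
```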
On the speaker change detection task, the experimental results showed that the proposed method reduced the false alarm rate by 1.94% while increasing accuracy by 1.1%, precision by 2.04%, and F1 score by 0.16%. These results indicate that the proposed method outperforms the traditional method and can be applied to products that require speaker change detection.
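For reference, the metrics reported above can be computed from detection counts as sketched below. These are the standard definitions; the thesis's exact scoring protocol (for example, any tolerance window around a true change point) may differ.

```python
# Standard detection metrics from true/false positive and negative counts.
def scd_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_alarm_rate = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "false_alarm_rate": false_alarm_rate, "f1": f1}
```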
Table of Contents:
Abstract (Chinese)
Abstract (English)
Acknowledgements
Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Literature Review
1.3.1 Speaker Change Detection
1.3.2 Speaker Representations
1.3.3 Articulatory Representations
1.4 Problems and Proposed Methods
1.5 Research Framework
Chapter 2 Corpus
2.1 Corpus Introduction
2.1.1 VoxCeleb2 Corpus
2.1.2 Librispeech Corpus
2.2 Corpus Arrangement and Statistics
Chapter 3 Proposed Methods
3.1 Speaker Embedding
3.2 Articulatory Feature Embedding
3.3 Speaker Change Detection
Chapter 4 Experimental Results and Discussion
4.1 Speaker Embedding Model
4.2 Articulatory Feature Embedding Model
4.3 SCD Model
4.4 Discussion
Chapter 5 Conclusion and Future Work
References