[1] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. Proc. Interspeech 2017, pages 4006–4010, 2017.
[2] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
[3] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, and Rif A Saurous. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. In International Conference on Machine Learning, pages 4693–4702. PMLR, 2018.
[4] James A Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161, 1980.
[5] I-Thuan Kho-ki Iu-han Kong-si. Suí-siann dataset. https://suisiann-dataset.ithuan.tw/, 2019.
[6] Chao-Peng Liu. Empathetic generative-based chatbot with emotion understanding via reinforcement learning. Master's thesis, Department of Electrical Engineering, National Taiwan University, 2020.
[7] Jiun-Hao Jhan. Empathetic and retrieval-based chatbot using deep reinforcement learning. Master's thesis, Graduate Institute of Communication Engineering, National Taiwan University, 2020.
[8] Ch G Kratzenstein. Sur la formation et la naissance des voyelles. Journal de Physique, 21:358–380, 1782.
[9] John J Ohala. Christian Gottlieb Kratzenstein: Pioneer in speech synthesis. In ICPhS, pages 156–159, 2011.
[10] Homer Dudley. The carrier nature of speech. Bell System Technical Journal, 19(4):495–515, 1940.
[11] Andrew J Hunt and Alan W Black. Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 1, pages 373–376. IEEE, 1996.
[12] Heiga Zen, Keiichi Tokuda, and Alan W Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064, 2009.
[13] Heiga Zen, Andrew Senior, and Mike Schuster. Statistical parametric speech synthesis using deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7962–7966. IEEE, 2013.
[14] Sercan Ö Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep Voice: Real-time neural text-to-speech. In International Conference on Machine Learning, pages 195–204. PMLR, 2017.
[15] Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron C Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. In ICLR (Workshop), 2017.
[16] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech: Fast, robust and controllable text to speech. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 3171–3180, 2019.
[17] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[18] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[19] Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning, pages 5180–5189. PMLR, 2018.
[20] Wei-Ning Hsu, Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, et al. Hierarchical generative modeling for controllable speech synthesis. In International Conference on Learning Representations, 2018.
[21] Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, and Tom Bagby. Semi-supervised generative modeling for controllable speech synthesis. In International Conference on Learning Representations, 2019.
[22] Ye Jia, Yu Zhang, Ron J Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4485–4495, 2018.
[23] Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, and Junichi Yamagishi. Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6184–6188. IEEE, 2020.
[24] Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. Proc. Interspeech 2019, pages 2080–2084, 2019.
[25] Xuehao Zhou, Xiaohai Tian, Grandee Lee, Rohan Kumar Das, and Haizhou Li. End-to-end code-switching TTS with cross-lingual language model. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7614–7618. IEEE, 2020.
[26] Tao Li, Shan Yang, Liumeng Xue, and Lei Xie. Controllable emotion transfer for end-to-end speech synthesis. In 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), pages 1–5. IEEE, 2021.
[27] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883. IEEE, 2018.
[28] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329–5333. IEEE, 2018.
[29] Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, and Hiroshi Saruwatari. Cross-lingual text-to-speech synthesis via domain adaptation and perceptual similarity regression in speaker space. In INTERSPEECH, pages 2947–2951, 2020.
[30] Ralph Beebe Blackman and John Wilder Tukey. The measurement of power spectra from the point of view of communications engineering, Part I. Bell System Technical Journal, 37(1):185–282, 1958.
[31] Stanley Smith Stevens, John Volkmann, and Edwin Broomell Newman. A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America, 8(3):185–190, 1937.
[32] Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
[33] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
[34] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[35] Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378, 2017.
[36] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
[37] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[38] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[39] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
[40] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[41] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.
[42] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[43] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pages 577–585, 2015.
[44] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In International Conference on Machine Learning, pages 2410–2419. PMLR, 2018.
[45] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.
[46] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron C Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems, pages 14881–14892, 2019.
[47] Pengfei Wu, Zhenhua Ling, Lijuan Liu, Yuan Jiang, Hongchuan Wu, and Lirong Dai. End-to-end emotional speech synthesis using style tokens and semi-supervised training. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 623–627. IEEE, 2019.
[48] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[49] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR, 2015.
[50] Keith Ito and Linda Johnson. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
[51] Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li. Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 920–924. IEEE, 2021.
[52] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4052–4056. IEEE, 2014.
[53] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer. End-to-end text-dependent speaker verification. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5115–5119. IEEE, 2016.
[54] Manoj Kumar, Tae Jin Park, Somer Bishop, and Shrikanth Narayanan. Designing neural speaker embeddings with meta learning. arXiv preprint arXiv:2007.16196, 2020.
[55] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: A large-scale speaker identification dataset. Proc. Interspeech 2017, pages 2616–2620, 2017.
[56] Guillaume Chanel, Karim Ansari-Asl, and Thierry Pun. Valence-arousal evaluation using physiological signals in an emotion recall paradigm. In 2007 IEEE International Conference on Systems, Man and Cybernetics, pages 2662–2667. IEEE, 2007.
[57] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, volume 8, pages 18–25. Citeseer, 2015.
[58] Carnegie Mellon University. The CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict, 2014.
[59] Ministry of Education, Republic of China (Taiwan). 臺灣閩南語羅馬字拼音方案使用手冊 [User manual of the Taiwan Southern Min Romanization System]. https://ws.moe.edu.tw/001/Upload/FileUpload/3677-15601/Documents/tshiutsheh.pdf, 2007.
[60] Wen-Chin Huang, Yi-Chiao Wu, and Tomoki Hayashi. Any-to-one sequence-to-sequence voice conversion using self-supervised discrete speech representations. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5944–5948. IEEE, 2021.
[61] Po-chun Hsu and Hung-yi Lee. WG-WaveNet: Real-time high-fidelity speech synthesis without GPU. Proc. Interspeech 2020, pages 210–214, 2020.