[1] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards End-to-End Speech Synthesis,” arXiv preprint arXiv:1703.10135, 2017.
[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783, IEEE, 2018.
[3] S. Ö. Arık, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, et al., “Deep Voice: Real-Time Neural Text-to-Speech,” in International Conference on Machine Learning, pp. 195–204, PMLR, 2017.
[4] A. Gibiansky, S. Ö. Arık, G. F. Diamos, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep Voice 2: Multi-Speaker Neural Text-to-Speech,” in Advances in Neural Information Processing Systems (NIPS), 2017.
[5] W. Ping, K. Peng, A. Gibiansky, S. Ö. Arık, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: 2000-Speaker Neural Text-to-Speech,” Proc. ICLR, pp. 214–217, 2018.
[6] N. Perraudin, P. Balazs, and P. L. Søndergaard, “A Fast Griffin-Lim Algorithm,” in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4, IEEE, 2013.
[7] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “WaveNet Based Low Rate Speech Coding,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 676–680, IEEE, 2018.
[8] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient Neural Audio Synthesis,” in International Conference on Machine Learning, pp. 2410–2419, PMLR, 2018.
[9] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A Flow-Based Generative Network for Speech Synthesis,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621, IEEE, 2019.
[10] K. Azizah, M. Adriani, and W. Jatmiko, “Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages,” IEEE Access, vol. 8, pp. 179798–179812, 2020.
[11] T. Tu, Y.-J. Chen, C.-C. Yeh, and H.-Y. Lee, “End-to-End Text-to-Speech for Low-Resource Languages by Cross-Lingual Transfer Learning,” arXiv preprint arXiv:1904.06508, 2019.
[12] Y. Lee, S. Shon, and T. Kim, “Learning Pronunciation from A Foreign Language in Speech Synthesis Networks,” arXiv preprint arXiv:1811.09364, 2018.
[13] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, et al., “Transfer Learning from Speaker Verification to Multi-speaker Text-to-Speech Synthesis,” arXiv preprint arXiv:1806.04558, 2018.
[14] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis,” in International Conference on Machine Learning, pp. 5180–5189, PMLR, 2018.
[15] A. Prakash, A. L. Thomas, S. Umesh, and H. A. Murthy, “Building Multilingual End-to-End Speech Synthesisers for Indian Languages,” in Proc. of 10th ISCA Speech Synthesis Workshop (SSW’10), pp. 194–199, 2019.
[16] H. Ming, Y. Lu, Z. Zhang, and M. Dong, “A Light-weight Method of Building An LSTM-RNN-Based Bilingual TTS System,” in 2017 International Conference on Asian Language Processing (IALP), pp. 201–205, IEEE, 2017.
[17] B. Li and H. Zen, “Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN Based Statistical Parametric Speech Synthesis,” in Proc. Interspeech, 2016.
[18] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan, “Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5621–5625, IEEE, 2019.
[19] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran, “Learning to Speak Fluently in A Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning,” arXiv preprint arXiv:1907.04448, 2019.
[20] T. Nekvinda and O. Dušek, “One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech,” arXiv preprint arXiv:2008.00768, 2020.
[21] A. J. Hunt and A. W. Black, “Unit Selection in A Concatenative Speech Synthesis System Using A Large Speech Database,” in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 1, pp. 373–376, IEEE, 1996.
[22] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech Parameter Generation Algorithms for HMM-Based Speech Synthesis,” in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), vol. 3, pp. 1315–1318, IEEE, 2000.
[23] M. C. Orhan and C. Demiroğlu, “HMM-Based Text to Speech System with Speaker Interpolation,” in 2011 IEEE 19th Signal Processing and Communications Applications Conference (SIU), pp. 781–784, IEEE, 2011.
[24] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” arXiv preprint arXiv:1409.3215, 2014.
[25] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[26] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling,” arXiv preprint arXiv:1412.3555, 2014.
[27] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-Based Models for Speech Recognition,” arXiv preprint arXiv:1506.07503, 2015.
[28] F. Yu and V. Koltun, “Multi-Scale Context Aggregation by Dilated Convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[29] D. P. Kingma and P. Dhariwal, “Glow: Generative Flow with Invertible 1x1 Convolutions,” arXiv preprint arXiv:1807.03039, 2018.
[30] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. Weiss, R. Clark, and R. A. Saurous, “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,” in International Conference on Machine Learning, pp. 4693–4702, PMLR, 2018.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[32] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized End-to-End Loss for Speaker Verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879–4883, IEEE, 2018.
[33] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-Adversarial Training of Neural Networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
[34] E. A. Platanios, M. Sachan, G. Neubig, and T. Mitchell, “Contextual Parameter Generation for Universal Neural Machine Translation,” arXiv preprint arXiv:1808.08493, 2018.
[35] Y.-Y. Wang, A. Acero, and C. Chelba, “Is Word Error Rate A Good Indicator for Spoken Language Understanding Accuracy,” in 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721), pp. 577–582, IEEE, 2003.
[36] R. Kubichek, “Mel-Cepstral Distance Measure for Objective Speech Quality Assessment,” in Proceedings of IEEE Pacific Rim Conference on Communications Computers and Signal Processing, vol. 1, pp. 125–128, IEEE, 1993.
[37] Executedone. https://zhuanlan.zhihu.com/p/117634492.
[38] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A System for Large-Scale Machine Learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
[39] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” arXiv preprint arXiv:1912.01703, 2019.
[40] T. E. Oliphant, “Python for Scientific Computing,” Computing in Science & Engineering, vol. 9, no. 3, pp. 10–20, 2007.
[41] H. Nam and H.-E. Kim, “Batch-Instance Normalization for Adaptively Style-Invariant Neural Networks,” arXiv preprint arXiv:1805.07925, 2018.
[42] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” arXiv preprint arXiv:1406.2661, 2014.
[43] BiaoBei (Beijing) Technology Co., Ltd., “Chinese Standard Mandarin Speech Corpus.” https://www.data-baker.com/open_source.html.
[44] Association for Computational Linguistics and Chinese Language Processing, “NER-TRS-VOL1-4.” https://scidm.nchc.org.tw/zh_TW/dataset/ner-trs-vol1-text.
[45] S. Junyi, “Jieba.” https://github.com/fxsjy/jieba.
[46] H. Huang, “Python-pinyin.” https://github.com/mozillazg/python-pinyin.
[47] B. McFee et al., “Librosa.” https://github.com/librosa/librosa.
[48] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in International Conference on Machine Learning, pp. 448–456, PMLR, 2015.
[49] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density Estimation Using Real NVP,” arXiv preprint arXiv:1605.08803, 2016.
[50] M. Niedermayer, T. Gu, and BtbN, “FFmpeg.” https://github.com/FFmpeg/FFmpeg.
[51] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv preprint arXiv:1412.6980, 2014.
[52] Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines,” arXiv preprint arXiv:2010.11567, 2020.
[53] J. Yamagishi, C. Veaux, K. MacDonald, et al., “CSTR VCTK Corpus: English Multi-Speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92),” 2019.
[54] PyTorch, “torch.nn.Embedding.” https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html.
[55] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4784–4788, IEEE, 2018.
[56] K. Ito and L. Johnson, “The LJ Speech Dataset.” https://keithito.com/LJ-Speech-Dataset/, 2017.
[57] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance Normalization: The Missing Ingredient for Fast Stylization,” arXiv preprint arXiv:1607.08022, 2016.
[58] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, IEEE, 2015.
[59] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” arXiv preprint arXiv:1706.08612, 2017.