[1] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta and M. Shoeybi, "Deep Voice: Real-time Neural Text-to-Speech," arXiv:1702.07825 [cs.CL], 2017.
[2] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark and R. A. Saurous, "Tacotron: Towards End-to-End Speech Synthesis," arXiv:1703.10135v2 [cs.CL], 2017.
[3] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. C. Courville and Y. Bengio, "Char2Wav: End-to-End Speech Synthesis," in ICLR (Workshop), 2017.
[4] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis and Y. Wu, "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," arXiv:1712.05884 [cs.CL], 2017.
[5] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499 [cs.SD], 2016.
[6] S. O. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman and Y. Zhou, "Deep Voice 2: Multi-Speaker Neural Text-to-Speech," arXiv:1705.08947 [cs.CL], 2017.
[7] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman and J. Miller, "Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning," arXiv:1710.07654 [cs.SD], 2017.
[8] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia and R. A. Saurous, "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis," arXiv:1803.09017 [cs.CL], 2018.
[9] C.-C. Chiu and C. Raffel, "Monotonic Chunkwise Attention," arXiv:1712.05382v2 [cs.CL], 2018.
[10] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho and Y. Bengio, "Attention-Based Models for Speech Recognition," arXiv:1506.07503 [cs.CL], 2015.
[11] J.-X. Zhang, Z.-H. Ling and L.-R. Dai, "Forward Attention in Sequence-to-Sequence Acoustic Modeling for Speech Synthesis," arXiv:1807.06736 [cs.CL], 2018.
[12] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss and D. Eck, "Online and Linear-Time Attention by Enforcing Monotonic Alignments," arXiv:1704.00784v2 [cs.LG], 2017.
[13] H. Miao, G. Cheng, P. Zhang, T. Li and Y. Yan, "Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," in Interspeech 2019, 2019.
[14] F. Schroff, D. Kalenichenko and J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," arXiv:1503.03832 [cs.CV], 2015.
[15] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan and Z. Zhu, "Deep Speaker: an End-to-End Neural Speaker Embedding System," arXiv:1705.02304 [cs.CL], 2017.
[16] G. Heigold, I. Moreno, S. Bengio and N. Shazeer, "End-to-End Text-Dependent Speaker Verification," arXiv:1509.08062 [cs.LG], 2015.
[17] L. Wan, Q. Wang, A. Papir and I. L. Moreno, "Generalized End-to-End Loss for Speaker Verification," arXiv:1710.10467 [eess.AS], 2017.
[18] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno and Y. Wu, "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis," arXiv:1806.04558 [cs.CL], 2018.
[19] A. Tjandra, S. Sakti and S. Nakamura, "Machine Speech Chain with One-shot Speaker Adaptation," arXiv:1803.10525 [cs.CL], 2018.
[20] S. Bai, J. Z. Kolter and V. Koltun, "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling," arXiv:1803.01271v2 [cs.LG], 2018.
[21] K. Ito, "The LJ Speech Dataset," 2017. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/.
[22] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[23] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner and M. Sonderegger, "Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi," in Interspeech 2017, 2017.