References

[1] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[2] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, A. Rastrow, et al., “The subspace Gaussian mixture model—a structured model for speech recognition,” Computer Speech & Language, vol. 25, no. 2, pp. 404–439, 2011.
[3] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376, 2006.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014.
[5] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.
[6] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376, 2006.
[7] E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, “Transformer ASR with contextual block processing,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 427–433, 2019.
[8] E. Tsunoo, Y. Kashiwagi, and S. Watanabe, “Streaming Transformer ASR with blockwise synchronous beam search,” in 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 22–29, 2021.
[9] J. Sun, “Jieba Chinese word segmentation tool,” 2012.
[10] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Interspeech 2015, pp. 3586–3589, 2015.
[11] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
[12] “SoX, audio manipulation tool.” Available: http://sox.sourceforge.net, accessed: March 25, 2015.
[13] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[14] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[16] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964, 2016.
[17] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” Advances in Neural Information Processing Systems, vol. 28, 2015.
[18] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[19] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
[20] Y. Lu, Z. Li, D. He, Z. Sun, B. Dong, T. Qin, L. Wang, and T.-Y. Liu, “Understanding and improving Transformer from a multi-particle dynamic system point of view,” arXiv preprint arXiv:1906.02762, 2019.
[21] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017.
[22] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in International Conference on Machine Learning, pp. 933–941, 2017.
[23] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, pp. 448–456, 2015.
[24] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[25] S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839, 2017.
[26] T. Nakatani, “Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,” in Proc. Interspeech 2019, 2019.
[27] T. Hori, S. Watanabe, and J. R. Hershey, “Joint CTC/attention decoding for end-to-end speech recognition,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 518–529, 2017.
[28] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5–6, pp. 602–610, 2005.
[29] L. Dong, F. Wang, and B. Xu, “Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5656–5660, 2019.
[30] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” arXiv preprint arXiv:1712.05382, 2017.
[31] K. Kim, K. Lee, D. Gowda, J. Park, S. Kim, S. Jin, Y.-Y. Lee, J. Yeo, D. Kim, S. Jung, et al., “Attention based on-device streaming speech recognition with large speech corpus,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 956–963, 2019.
[32] H. Inaguma, Y. Gaur, L. Lu, J. Li, and Y. Gong, “Minimum latency training strategies for streaming sequence-to-sequence ASR,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6064–6068, 2020.
[33] D. Wang and T. F. Zheng, “Transfer learning for speech and language processing,” in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1225–1237, 2015.
[34] YouTube, LLC, “YouTube,” retrieved 2011.
[35] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020.
[36] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783, 2018.
[37] P. J. Huber, “Robust estimation of a location parameter,” in Breakthroughs in Statistics, pp. 492–518, Springer, 1992.
[38] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech: Fast, robust and controllable text to speech,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[39] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” arXiv preprint arXiv:1803.02155, 2018.
[40] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.
[41] M. Cuturi and M. Blondel, “Soft-DTW: A differentiable loss function for time-series,” in International Conference on Machine Learning, pp. 894–903, 2017.
[42] P. Blanchard, D. J. Higham, and N. J. Higham, “Accurately computing the log-sum-exp and softmax functions,” IMA Journal of Numerical Analysis, vol. 41, no. 4, pp. 2311–2330, 2021.
[43] E. L. Denton, S. Chintala, R. Fergus, et al., “Deep generative image models using a Laplacian pyramid of adversarial networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
[44] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[45] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative visual manipulation on the natural image manifold,” in European Conference on Computer Vision, pp. 597–613, 2016.
[46] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun, “Disentangling factors of variation in deep representation using adversarial training,” Advances in Neural Information Processing Systems, vol. 29, 2016.
[47] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, et al., “ESPnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
[48] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., “The Kaldi speech recognition toolkit,” in 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 1–4, 2011.
[49] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “PyTorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[50] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: A next-generation open source framework for deep learning,” in Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-Ninth Annual Conference on Neural Information Processing Systems (NIPS), vol. 5, pp. 1–6, 2015.
[51] R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: A modular machine learning software library,” tech. rep., Idiap, 2002.
[52] G. Van Rossum and F. L. Drake, Python 3 Reference Manual. CreateSpace, 2009.
[53] B. B. T. C. L., “Chinese Standard Mandarin Speech Corpus.” https://www.data-baker.com/opensource.html, 2017.
[54] H.-P. Lin, “Improving speech recognition systems for low-resource languages with hidden speaker information.” https://hdl.handle.net/11296/p2yng3, 2021.
[55] Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines,” arXiv preprint arXiv:2010.11567, 2020.
[56] C. Veaux, J. Yamagishi, K. MacDonald, et al., “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
[57] Y. Ren, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Almost unsupervised text to speech and automatic speech recognition,” in International Conference on Machine Learning, pp. 5410–5419, 2019.