|
[1] G. Chen, C. Parada, and G. Heigold, “Smallfootprint keyword spotting using deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091. [2] T. Sainath and C. Parada, “Convolutional neural networks for smallfootprint keyword spotting,” in Interspeech, 2015. [3] R. Tang and J. Lin, “Deep residual learning for smallfootprint keyword spotting,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5484–5488. [4] Y. Bai, J. Yi, J. Tao, Z. Wen, Z. Tian, C. Zhao, and C. Fan, “A time delay neural network with shared weight selfattention for smallfootprint keyword spotting.” in INTERSPEECH, 2019, pp. 2190–2194. [5] G. Chen, C. Parada, and T. N. Sainath, “Querybyexample keyword spotting using long shortterm memory networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5236–5240. [6] M. Weintraub, “Lvcsr loglikelihood ratio scoring for keyword spotting,” in 1995 International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 1995, pp. 297–300 vol.1. [7] H. Yan, Q. He, and W. Xie, “Crnnctc based mandarin keywords spotting,” in ICASSP 2020 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7489–7493. [8] H. Hermansky, “Perceptual linear predictive (plp) analysis of speech,” the Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990. [9] R. C. Rose and D. B. Paul, “A hidden markov model based keyword recognition system,” in International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1990, pp. 129–132. [10] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015. [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017. [12] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur, “A timerestricted selfattention layer for asr,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5874–5878. [13] J. Devlin, M.W. Chang, K. Lee, and K. Toutanova, “Bert: Pretraining of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018. [14] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel, “Selfattentional acoustic models,” arXiv preprint arXiv:1803.09519, 2018. [15] Y. Zhang and J. R. Glass, “Unsupervised spoken keyword spotting via segmental dtw on gaussian posteriorgrams,” in 2009 IEEE Workshop on Automatic Speech Recognition Understanding, 2009, pp. 398–403. [16] D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series.” in KDD workshop, vol. 10, no. 16. Seattle, WA, USA:, 1994, pp. 359–370. [17] M. C. Madhavi and H. A. Patil, “Vtlnwarped gaussian posteriorgram for qbestd,” in 2017 25th European Signal Processing Conference (EUSIPCO), 2017, pp. 563–567. [18] E. Eide and H. Gish, “A parametric approach to vocal tract length normalization,” in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 1. IEEE, 1996, pp. 346–348. [19] N. Sacchi, A. Nanchen, M. Jaggi, and M. Cernak, “Openvocabulary keyword spotting with audio and text embeddings,” in INTERSPEECH 2019IEEE International Conference on Acoustics, Speech, and Signal Processing, no. CONF, 2019. [20] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823. [21] G. D. Forney, “The viterbi algorithm,” Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, 1973. [22] D. R. Miller, M. Kleber, C.L. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz, and H. Gish, “Rapid and accurate spoken term detection,” in Eighth Annual Conference of the international speech communication association, 2007. [23] G. Chen, O. Yilmaz, J. Trmal, D. Povey, and S. Khudanpur, “Using proxies for oov keywords in the keyword search task,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 416–421. [24] Y. Wang and Y. Long, “Keyword spotting based on ctc and rnn for mandarin chinese speech,” in 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2018, pp. 374–378. [25] Z. Chen, Y. Qian, and K. Yu, “Sequence discriminative training for deep learning based acoustic keyword spotting,” Speech Communication, vol. 102, pp. 100–111, 2018. [26] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376. [27] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequencetrained neural networks for asr based on latticefree mmi.” in Interspeech, 2016, pp. 2751–2755. [28] A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, “Firstpass large vocabulary continuous speech recognition using bidirectional recurrent dnns,” arXiv preprint arXiv:1408.2873, 2014. [29] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015. [30] Y. Kim and A. M. Rush, “Sequencelevel knowledge distillation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 1317–1327. [31] J. Wong and M. J. F. Gales, “Sequence studentteacher training of deep neural networks,” in Interspeech. ISCA, September 2016, pp. 2761–2765. [32] R. Takashima, S. Li, and H. Kawai, “An investigation of a knowledge distillation method for ctc acoustic models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5809–5813. [33] R. Takashima, L. Sheng, and H. Kawai, “Investigation of sequencelevel knowledge distillation methods for ctc acoustic models,” in ICASSP 2019 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6156–6160. [34] G. Kurata and K. Audhkhasi, “Improved knowledge distillation from bidirectional to unidirectional lstm ctc for endtoend speech recognition,” in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 411–417. [35] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018. [36] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integerarithmeticonly inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713. [37] O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, “Q8bert: Quantized 8bit bert,” arXiv preprint arXiv:1910.06188, 2019. [38] (2020) Dynamic quantization. [Online]. Available: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html [39] Beijing DataTang Technology Co., Ltd., “aidatatang 200zh, a free chinese mandarin speech corpus,” www.datatang.com. [40] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell1: An opensource mandarin speech corpus and a speech recognition baseline,” in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (OCOCOSDA), 2017, pp. 1–5. [41] Magic Data Technology Co., Ltd., “Magicdata mandarin chinese read speech corpus,” http://www.imagicdatatech.com/index.php/home/dataopensource/data_info/id/101, 2019. [42] L. Primewords Information Technology Co., “Primewords chinese corpus set 1,” 2018, https://www.primewords.cn. [43] Surfingtech, “Stcmds20170001 1 free st chinese mandarin corpus.” [44] Z. Z. Dong Wang, Xuewei Zhang, “Thchs30 : A free chinese speech corpus,” 2015. [Online]. Available: http://arxiv.org/abs/1512.01882 [45] J. Hou, Y. Shi, M. Ostendorf, M. Hwang, and L. Xie, “Region proposal network based smallfootprint keyword spotting,” IEEE Signal Process. Lett., vol. 26, no. 10, pp. 1471–1475, 2019. [Online]. Available: https://doi.org/10.1109/LSP.2019.2936282 [46] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224. [47] I. Szöke, M. Skácel, L. Mošner, J. Paliesek, and J. Černocký, “Building and evaluation of a real room impulse response dataset,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 863–876, 2019. [48] D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” 2015, arXiv:1510.08484v1. [49] Y. Zhuang, X. Chang, Y. Qian, and K. Yu, “Unrestricted vocabulary keyword spotting using lstmctc.” in Interspeech, 2016, pp. 938–942. [50] S. Kim, T. Hori, and S. Watanabe, “Joint ctcattention based endtoend speech recognition using multitask learning,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 4835–4839. [51] R. Serizel and D. Giuliani, “Vocal tract length normalisation approaches to dnnbased children’s and adults’ speech recognition,” in 2014 IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 135–140. [52] K. Matsuura, M. Mimura, S. Sakai, and T. Kawahara, “Generative adversarial training data adaptation for very lowresource automatic speech recognition,” in Proc. Interspeech 2020, 2020, pp. 2737–2741. [53] B. Huang, D. Ke, H. Zheng, B. Xu, Y. Xu, and K. Su, “Multitask learning deep neural networks for speech feature denoising,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015. [54] G. Kurata and K. Audhkhasi, “Multitask ctc training with auxiliary feature reconstruction for endtoend speech recognition.” in INTERSPEECH, 2019, pp. 1636–1640.
|