
[1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013. [2] A. van den Oord and B. Schrauwen, “The studentt mixture as a natural image patch prior with application to image compression,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 2061–2086, 2014. [3] M. H. Law, M. A. Figueiredo, and A. K. Jain, “Simultaneous feature selection and clustering using mixture models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1154–1166, 2004. [4] D. M. Blei, A. Y. Ng, and M. I Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. 45, pp. 993–1022, 2003. [5] S. Kim, S. Narayanan, and S. Sundaram, “Acoustic topic model for audio information retrieval,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (ASPAA), 2009, pp. 37–40. [6] P. Hu, W. Liu, W. Jiang, and Z. Yang, “Latent topic model based on GaussianLDA for audio retrieval,” in Pattern Recognition, Springer Berlin Heidelberg, 2012, pp. 556–563. [7] ——, “Latent topic model for audio retrieval,” Pattern Recognition, vol. 47, no. 3, pp. 11381143, 2014. [8] N. Rasiwasia and N. Vasconcelos, “Latent Dirichlet allocation models for image classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2665–2679, 2013. [9] D. B. W. Chong and F. Li, “Simultaneous image classification and annotation,” in Proc. CVPR, IEEE, 2009, pp. 1903–1910. [10] L. J. Li, C. Wang, Y. Lim, D. M Blei, and F. F. Li, “Building and using a semantic visual image hierarchy,” in Proc. CVPR, IEEE, 2010, pp. 3336–3343. [11] F. F. Li and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proc. CVPR, IEEE, vol. 2, 2005, pp. 524–531. [12] Z. Lu, L. Wang, and J. R. Wen, “Image classification by visual bagofwords refinement and reduction,” Neurocomputing, vol. 173, pp. 373–384, 2016. [13] L. Su, C. C. M. Yeh, J. Y. Liu, J. C. Wang, and Y. H. Yang, “A systematic evaluation of the bagofframes representation for music information retrieval,” IEEE Transactions on Multimedia, vol. 16, no. 5, pp. 1188–1200, 2014. [14] T. Nakano, K. Yoshii, and M. Goto, “Vocal timbre analysis using latent dirichlet allocation and crossgender vocal timbre similarity,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 5202–5206. [15] S. Kim, S. Sundaram, P. Georgiou, and S. Narayanan, “Audio scene understanding using topic models,” in Proceedings of the Neural Information Processing Systems (NIPS) Workshop, 2009, pp. 1–4. [16] S. Kim, P. Georgiou, and S. Narayanan, “Supervised acoustic topic model with a consequent classifier for unstructured audio classification,” in Proceedings of the International Workshop on ContentBased Multimedia Indexing (CBMI), 2012, pp. 16. [17] P. Hu, W. Liu, W. Jiang, and Z. Yang, “Latent topic model based on gaussianlda for audio retrieval,” in Pattern Recognition: Chinese Conference, CCPR 2012, Beijing, China, September 2426, 2012. Proceedings. Springer Berlin Heidelberg, 2012, pp. 556–563. [18] R. Das, M. Zaheer, and C. Dyer, “Gaussian lda for topic models with word embeddings,” in Proc. ACL, 2015, pp. 795–804. [19] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” Journal of the American Statistical Association, vol. 101, 2004. [20] K. W. Lim, W. Buntine, C. Chen, and L. Du, “Nonparametric Bayesian topic modelling with the hierarchical Pitman¡vYor processes,” International Journal of Approximate Reasoning, vol. 78, pp. 172 –191, 2016. [21] S. J. Gershman and D. M. Blei, “A tutorial on bayesian nonparametric models,” Journal of Mathematical Psychology, vol. 56, no. 1, pp. 1 –12, 2012. [22] D. M. Blei, T. L. Griffiths, and M. I. Jordan, “The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies,” Journal of the ACM, vol. 57, no. 2, 7:1–7:30, 2010. [23] D. M. Blei, M. I. Jordan, T. L. Griffiths, and J. B. Tenenbaum, “Hierarchical topic models and the nested chinese restaurant process,” in Proceedings of the 16th International Conference on Neural Information Processing Systems, ser. NIPS’03, Whistler, British Columbia, Canada: MIT Press, 2003, pp. 17–24. [24] Y. L. Chang, J. J. Hung, and J. T. Chien, “Bayesian nonparametric modeling of hierarchical topics and sentences,” in Proc. MLSP, 2011, pp. 1–6. [25] J. T. Chien and Y. L. Chang, “Hierarchical theme and topic model for summarization,” in Proc. MLSP, 2013, pp. 1–6. [26] ——, “The nested Indian buffet process for flexible topic modeling,” in Proc. INTERSPEECH, 2014, pp. 1434–1437. [27] J. T. Chien, “Hierarchical theme and topic modeling,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 3, pp. 565–578, 2016. [28] ——, “Bayesian nonparametric learning for hierarchical and sparse topics,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 2, pp. 422–435, 2018. [29] I. T. Jolliffe, Principal Component Analysis. New York: SpringerVerlag, 1986. [30] G. W. Cottrell, P. Munro, and D. Zipser, “Learning internal representations from grayscale images: An example of extensional programming,” in Proc. CogSci, 1987. [31] M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” J. Roy. Statist. So. B (Statist. Methodol.), vol. 61, no. 3, pp. 611–622, 1999. [32] Y. S. F. Ju, J. Gao, Y. Hu, and B. Yin, “Image outlier detection and feature extraction via l1normbased 2d probabilistic pca,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 4834–4846, 2015. [33] X. Cui, M. Afify, and B. Zhou, “Stereobased stochastic mapping with context using probabilistic pca for noise robust automatic speech recognition,” in Proc. ICASSP, 2012, pp. 4705–4708. [34] N. D. Lawrence, “Gaussian process latent variable models for visualisation of high dimensional data,” in Proc. NIPS, 2004, pp. 329–336. [35] M. K. Titsias and N. Lawrence, “Bayesian Gaussian process latent variable model,” in AISTATS, 2010. [36] A. C. Damianou and N. D. Lawrence, “Deep Gaussian processes,” arXiv preprint arXiv:1211.0358v2, 2012. [37] G. Zhong, W. J. Li, D. Y. Yeung, X. Hou, and C. L. Liu, “Gaussian process latent random field,” in Proc. AAAI, 2010, pp. 679–684. [38] S. Eleftheriadis, O. Rudovic, and M. Pantic, “Discriminative shared Gaussian processes for multiview and viewinvariant facial expression recognition,” IEEE Trans. Image Process., vol. 24, no. 1, pp. 189–204, 2015. [39] R. Urtasun and T. Darrell, “Discriminative gaussian process latent variable model for classification,” in Proc. ICML, 2007, pp. 927–934. [40] J. Snoek, R. P. Adams, and H. Larochelle, “Nonparametric guidance of autoencoder representations using label information,” J. Mach. Learn. Res., vol. 13, pp. 2567–2588, 2012. [41] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. [42] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for deep learning of images, labels and captions,” arXiv preprint arXiv: 1609.08976v1, 2016. [43] S. Eleftheriadis, O. Rudovic, M. P. Deisenroth, and M. Pantic, “Variational Gaussian process autoencoder for ordinal prediction of facial action units,” arXiv preprint arXiv: 1608.04664v2, 2016. [44] Z. Dai, A. Damianou, J. González, and N. Lawrence, “Variational autoencoded deep Gaussian processes,” arXiv preprint arXiv:1511.06455v2, 2016. [45] D. P Kingma and M. Welling, “Autoencoding variational Bayes,” arXiv preprint arXiv:1312.6114v10, 2013. [46] T. Gerkmann, M. KrawczykBecker, and J. L. Roux, “Phase processing for single channel speech enhancement: History and recent advances,” IEEE Signal Process. Mag., vol. 32, no. 2, pp. 55–66, 2015. [47] E. Loweimi, S. M. Ahadi, and T. Drugman, “A new phasebased feature representation for robust speech recognition,” in Proc. ICASSP, 2013, pp. 7155–7159. [48] L. Su, L. F. Yu, and Y. H. Yang., “Sparse cepstral and phase codes for guitar playing technique classification,” in Proceedings of the International Society for Music Information Retrieval (ISMIR), 2014, pp. 9–14. [49] A. Diment, E. Cakir, T. Heittola, and T. Virtanen, “Automatic recognition of environmental sound events using allpole group delay features,” in Proc. EUSIPCO, 2015, pp. 734–738. [50] S. Liwicki, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “Euler principal component analysis,” Int. J. Comput. Vis., vol. 101, no. 3, pp. 498–518, 2013. [51] A. Fitch, A. Kadyrov, W. Christmas, and J. Kittler, “Fast robust correlation,” IEEE Trans. Image Process., vol. 14, no. 8, pp. 1063–1073, 2005. [52] J. D. Horel, “Complex principal component analysis: Theory and examples,” J. Climate Appl. Meteor., vol. 23, pp. 1660–1673, 1984. [53] S. S. P. Rattan and W. W. Hsieh, “Complexvalued neural networks for nonlinear complex principal component analysis,” Neural Networks, vol. 18, no. 1, pp. 61–69, 2005. [54] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999. [55] H. Kameoka, O. Nobutaka, K. Kunio, and S. Shigeki, “Complex NMF: A new sparse representation for acoustic signals,” in Proc. ICASSP, 2009, pp. 3437–3440. [56] V. H. Duong, Y. S. Lee, J. J. Ding, B. T. Pham, M. Q. Bui, P. T. Bao, and J. C. Wang, “Exemplarembed complex matrix factorization for facial expression recognition,” in Proc. ICASSP, 2017, pp. 1837–1841. [57] P. Baldi and Z. Lu, “Complexvalued autoencoders,” Neural Networks, vol. 33, pp. 136–147, 2012. [58] T. Nakashika, S. Takaki, and J. Yamagishi, “Complexvalued restricted Boltzmann machine for direct learning of frequency spectra,” in Proc. INTERSPEECH, 2017. [59] M. Schedl, E. Gömez, and J. Urbano, “Music information retrieval: Recent developments and applications,” Foundations and trends in information retrieval, vol. 8, no. 23, pp. 127–261, 2014. [60] K. Choi, G. Fazekas, K. Cho, and M. Sandler, “A tutorial on deep learning for music information retrieval,” arXiv preprint arXiv:1709.04396v1, 2017. [61] O. Lartillot and P. Toiviainen, “A matlab toolbox for musical feature extraction from audio,” in Proceedings of the International Conference on Digital Audio Effects, 2007, pp. 237–244. [62] M. Muller, D. P. W. Ellis, A. Klapuri, and G. Richard, “Signal processing for music analysis,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1088– 1110, 2011. [63] J. Nam, J. Herrera, M. Slaney, and J. Smith, “Learning sparse feature representations for music annotation and retrieval,” in Proceedings of the International Society for Music Information Retrieval (ISMIR), 2012. [64] K. O’Hanlon and M. D. Plumbley, “Automatic music transcription using row weighted decompositions,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 16–20. [65] K. Yazawa, K. Itoyama, and H. G. Okuno, “Automatic transcription of guitar tablature from audio signals in accordance with player’s proficiency,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3122–3126. [66] B. Logan, “Mel frequency cepstral coefficients for music modeling,” in Proceedings of the International Society of Music Information Retrieval (ISMIR), 2000. [67] J. Abeßer, H. Lukashevich, and G. Schuller, “Featurebased extraction of plucking and expression styles of the electric bass guitar,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 2290–2293. [68] A. Tindale, A. Kapur, G. Tzanetakis, and I. Fujinaga, “Retrieval of percussion gestures using timbre classification techniques,” in Proceedings of the International Society for Music Information Retrieval (ISMIR), 2004, pp. 541–545. [69] L. Su, H. M. Lin, and Y. H. Yang, “Sparse modeling of magnitude and phasederived spectra for playing technique classification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 2122–2132, 2014. [70] F. Auger and P. Flandrin, “Improving the readability of timefrequency and timescale representations by the reassignment method,” IEEE Transactions on Signal Processing, vol. 43, no. 5, pp. 1068–1089, 1995. [71] K. R. Fitz and S. A. Fulop, “A unified theory of timefrequency reassignment,” CoRR, vol. abs/0903.3080, 2009. [72] Y. P. Chen, L. Su, and Y. H. Yang, “Electric guitar playing technique detection in realworld recordings based on F0 sequence pattern recognition,” in Proceedings of the International Society for Music Information Retrieval (ISMIR), 2015, pp. 708–714. [73] P. Manzagol, T. BertinMahieux, and D. Eck, “On the use of sparse timerelative auditory codes for music,” in Proceedings of the International Society of Music Information Retrieval (ISMIR), 2008, pp. 14–18. [74] J. Nam, J. Herrera, M. Slaney, and J. Smith, “Learning sparse feature representations for music annotation and retrieval,” in Proceedings of the International Society for Music Information Retrieval (ISMIR), 2012, pp. 565–560. [75] E. J. Humphrey, J. P. Bello, and Y. LeCun, “Deep architectures and automatic feature learning in music informatics,” in Proceedings of the International Society for Music Information Retrieval (ISMIR), 2012, pp. 403–408. [76] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1991. [77] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online dictionary learning for sparse coding,” in Proceedings of the Annual International Conference on Machine Learning (ICML), 2009, pp. 689–696. [78] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd. Boca Raton, FL, USA: CRC Press, Inc., 2013. [79] Y. Ephraim and D. Malah, “Speech enhancement using a minimummean square error shorttime spectral amplitude estimator,” IEEE Trans. Audio, Speech, Language Process., vol. 32, no. 6, pp. 1109–1121, 1984. [80] ——, “Speech enhancement using a minimum meansquare error logspectral amplitude estimator,” IEEE Trans. Audio, Speech, Language Process., vol. 33, no. 2, pp. 443–445, 1985. [81] N. Lyubimov and M. Kotov, “Nonnegative matrix factorization with linear constraints for singlechannel speech enhancement,” arXiv preprint arXiv:1309.6047, 2013. [82] D. S. Williamson, Y. Wang, and D. Wang, “A sparse representation approach for perceptual quality improvement of separated speech,” in Proc. ICASSP, 2013, pp. 7015–7019. [83] ——, “A twostage approach for improving the perceptual quality of separated speech,” in Proc. ICASSP, 2014, pp. 7034–7038. [84] D. S. Williamson, Y. Wang, and D. L. Wang, “Reconstruction techniques for improving the perceptual quality of binary masked speech,” J Acoust Soc Am., vol. 136, no. 2, pp. 892–902, 2014. [85] J. C. Wang, Y. S. Lee, C. H. Lin, S. F. Wang, C. H. Shih, and C. H. Wu, “Compressive sensingbased speech enhancement,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 11, pp. 2122–2131, 2016. [86] S. Gonzalez and M. Brookes, “Maskbased enhancement for very low quality speech,” in Proc. ICASSP, 2014, pp. 7029–7033. [87] Y. Luo, G. Bao, Y. Xu, and Z. Ye, “Supervised monaural speech enhancement using complementary joint sparse representations,” IEEE Signal Process. Lett., vol. 23, no. 2, pp. 237–241, 2016. [88] G. Min, X. Zhang, J. Yang, W. Han, and X. Zou, “A perceptually motivated approach via sparse and lowrank model for speech enhancement,” in Proc. ICME, 2016, pp. 1–6. [89] J. F. Gemmeke, H. V. Hamme, B. Cranen, and L. Boves, “Compressive sensing for missing data imputation in noise robust speech recognition,” IEEE J. Sel. Topics Signal Process., vol. 4, no. 2, pp. 272–287, 2010. [90] L. Josifovski, M. Cooke, P. Green, and A. Vizinho, “State based imputation of missing data for robust speech recognition and speech enhancement,” in Proc. EUROSPEECH, 1999, pp. 2837–2840. [91] J. F. Gemmeke, T. Virtanen, and A. Hurmalainen, “Exemplarbased sparse representations for noise robust automatic speech recognition,” IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 7, pp. 2067–2080, 2011. [92] P. Magron, R. Badeau, and B. David, “Complex NMF under phase constraints based on signal modeling: Application to audio source separation,” in Proc. ICASSP, 2016, pp. 46–50. [93] F. J. RodriguezSerrano, S. Ewert, P. VeraCandeas, and M. Sandler, “A scoreinformed shiftinvariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings,” in Proc. ICASSP, 2016, pp. 61–65. [94] S. Souli and Z. Lachiri, “Environmental sound classification using loggabor filter,” in Proc. ICSP, 2012, pp. 144–147. [95] M. Zhang, W. Li, L. Wang, J. Wei, Z. Wu, and Q. Liao, “Sparse coding for sound event classification,” in Proc. APSIPA, 2013, pp. 1–5. [96] H. D. Tran and H. Li, “Probabilistic distance svm with hellingerexponential kernel for sound event classification,” in Proc. ICASSP, 2011, pp. 2272–2275. [97] A. Plinge, R. Grzeszick, and G. A. Fink, “A bagoffeatures approach to acoustic event detection,” in Proc. ICASSP, 2014, pp. 3704–3708. [98] J. Portelo, M. Bugalho, I. Trancoso, J. Neto, A. Abad, and A. Serralheiro, “Nonspeech audio event detection,” in Proc. ICASSP, 2009, pp. 1973–1976. [99] J. Li, L. Deng, D. Yu, Y. Gong, and A. Acero, “Hmm adaptation using a phasesensitive acoustic distortion model for environment robust speech recognition,” in Proc. ICASSP, 2008, pp. 4069–4072. [100] L. Wang, S. Ohtsuka, and S. Nakagawa, “High improvement of speaker identification and verification by combining mfcc and phase information,” in Proc. ICASSP, 2009, pp. 4529–4532. [101] I. McCowan, D. Dean, M. McLaren, R. Vogt, and S. Sridharan, “The deltaphase spectrum with application to voice activity detection and speaker recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2026–2038, 2011. [102] K. Paliwal, K. Wójcicki, and B. Shannon, “The importance of phase in speech enhancement,” Elsevier Speech Commun., vol. 53, no. 4, pp. 465–494, 2011. [103] H. Zhang, I. McLoughlin, and Y. Song, “Robust sound event recognition using convolutional neural networks,” in Proc. ICASSP, 2015, pp. 559–563. [104] G. Guo and S. Z. Li, “Contentbased audio classification and retrieval by support vector machines,” IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 209–215, Jan. 2003. [105] G. Shi, M. M. Shanechi, and P. Aarabi, “On the importance of phase in human speech recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 14, no. 5, pp. 1867–1874, 2006. [106] M. S. E. Langarani, H. Veisi, and H. Sameti, “The effect of phase information in speech enhancement and speech recognition,” in Proc. ISSPA, 2012, pp. 1446–1447. [107] L. Gelman and S. Braun, “The optimal usage of the fourier transform for pattern recognition,” Mech. Syst. Signal Process., vol. 15, no. 3, pp. 641–646, 2001. [108] L. German, “Signal recognition: Both components of the short time fourier transform vs. power spectral density,” Patt. Anal. Appl., vol. 6, no. 2, pp. 91–96, 2003. [109] C. Singh, E. Walia, and N. Mittal, “Rotation invariant complex zernike moments features and their applications to human face and character recognition,” IET Computer Vision, vol. 5, no. 5, pp. 255–265, 2011. [110] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in Proc. ICASSP, 1979, pp. 208–211. [111] C. Plapous, C. Marro, and P. Scalart, “Improved signaltonoise ratio estimation for speech enhancement,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 14, no. 6, pp. 2098–2108, 2006. [112] W. Nogueira, G. Roma, and P. Herrera, “Automatic event classification using front end single channel noise reduction, mfcc features and a support vector machine classifier,” Tech. Rep., 2013. [113] J. Schröder, B. Cauchi, M. R. Schädler, N. Moritz, K. Adiloglu, J. Anemüller, S. Doclo, B. Kollmeier, and S. Goetze, “Acoustic event detection using signal enhancement and spectrotemporal feature extraction,” Tech. Rep., 2013. [114] J. Dennis, H. D. Tran, and H. Li, “Spectrogram image feature for sound event classification in mismatched conditions,” IEEE Signal Process. Lett., vol. 18, no. 2, pp. 130–133, 2011. [115] I. McLoughlin, H. Zhang, Z. Xie, Y. Song, and W. Xiao, “Robust sound event classification using deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 3, pp. 540–552, 2015. [116] S. Chu, S. Narayanan, and C. C. J. Kuo, “Environmental sound recognition with timefrequency audio features,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, pp. 1142–1158, 2009. [117] S. Sivasankaran and K. Prabhu, “Robust features for environmental sound classification,” in Proc. CONECCT, 2013, pp. 1–6. [118] J. C. Wang, C. H. Lin, B. W. Chen, and M. K. Tsai, “Gaborbased nonuniform scale frequency map for environmental sound classification in home automation,” IEEE Trans. Autom. Sci. Eng., vol. 11, no. 2, pp. 607–613, 2014. [119] J. W. Hung, H. J. Hsieh, and B. Chen, “Robust speech recognition via enhancing the complexvalued acoustic spectrum in modulation domain,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 2, pp. 236–251, 2016. [120] H. Xu, Z. H. Tan, P. Dalsgaard, and B. Lindberg, “Robust speech recognition by nonlocal means denoising processing,” IEEE Signal Process. Lett., vol. 15, pp. 701–704, 2008. [121] Y. Zhang and Y. Zhao, “Spectral subtraction on real and imaginary modulation spectra,” in Proc. ICASSP, 2011, pp. 4744–4747. [122] J. C. Wang, C. H. Lin, E. Siahaan, B. W. Chen, and H. L. Chuang, “Mixed sound event verification on wireless sensor network for home automation,” IEEE Trans. Ind. Informat., vol. 10, no. 1, pp. 803–812, 2014. [123] J. C. Wang, H. P. Lee, J. F. Wang, and C. B. Lin, “Robust environmental sound recognition for home automation,” IEEE Transactions on Automation Science and Engineering, vol. 5, no. 1, pp. 25–31, Jan. 2008. [124] J. C. Wang, Y. S. Lee, C. H. Lin, E. Siahaan, and C. H. Yang, “Robust environmental sound recognition with fast noise suppression for home automation,” IEEE Trans. Autom. Sci. Eng., vol. 12, no. 4, pp. 1235–1242, 2015. [125] R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” IEEE Trans. Audio, Speech, Language Process., vol. 9, no. 5, pp. 504–512, 2001. [126] N. Lawrence, “Probabilistic nonlinear principal component analysis with Gaussian process latent variable models,” J Mach Learn Res., vol. 6, pp. 1783–1816, 2005. [127] R. BoloixTortosa, F. J. PayanSomet, E. AriasdeReyna, and J. J. MurilloFuentes, “Proper complex Gaussian processes for regression,” arXiv preprint arXiv abs/1502.04868, 2015. [128] E. A.d.R.J.J.M.F. R. BoloixTortosa F. J. PayánSomet, “Proper complex Gaussian processes for regression,” 2015. [129] L. J. M. Cooke P. Green and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Elsevier Speech Commun., vol. 34, pp. 267–285, 2001. [130] CHTTL, http://www.aclclp.org.tw/use_mat_c.php. [131] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)A new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, 2001, pp. 749–752. [132] C. C. Chang and L. C. Jen, “Libsvm: A library for support vector machines,” 2001. [133] X. Wei and C. T. Li, “Fixation and saccade based face recognition from single image per person with various occlusions and expressions,” in Proc. CVPR, 2013, pp. 70–75. [134] X. X. Li, D. Q. Dai, X. F. Zhang, and C. X. Ren, “Structured sparse error coding for face recognition with occlusion,” IEEE Trans. Image Process., vol. 22, no. 5, pp. 1889–1900, 2013. [135] Y. Wen, W. Liu, M. Yang, and M. Li, “Efficient misalignmentrobust face recognition via localityconstrained representation,” in Proc. ICIP, 2016, pp. 3021–3025. [136] E. J. He, J. A. Fernandez, B. V.K. V. Kumar, and M. Alkanhal, “Masked correlation filters for partially occluded face recognition,” in Proc. ICASSP, 2016, pp. 1293–1297. [137] R. Zhi, M. Flierl, Q. Ruan, and W. B. Kleijn, “Graphpreserving sparse nonnegative matrix factorization with application to facial expression recognition,” IEEE Trans. Syst. Man. Cybern. B, Cybern., vol. 41, no. 1, pp. 38–52, 2011. [138] J. A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, pp. 1161–1178, 1980. [139] B. Han, S. Rho, R. B. Dannenberg, and E. Hwang, “Smers: Music emotion recognition using support vector regression,” in Proc. Int. Conf. Music Information Retrieval, 2009. [140] Y. H. Yang, Y. C. Lin, Y. F. Su, and H. H. Chen, “A regression approach to music emotion recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, pp. 448–457, 2008. [141] Z. Chuang and C. Wu, “Emotion recognition using acoustic features and textual content,” in Proc. ICME, 2004, 5356. [142] B. Schuller, R. Müller, M. Lang, and G. Rigoll, “Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles,” in Proc. INTERSPEECH, 2005, pp. 805–808. [143] Y. H. Yang and H. H. Chen, “Machine recognition of music emotion: A review,” ACM Transactions on Intelligent system and Technology, vol. 3, no. 3, 2012. [144] Y. Chin, C. Lin, and J. Wang, “Robust emotion recognition in live music using noise suppression and a hierarchical sparse representation classifier,” in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 AsiaPacific, 2014, pp. 1–4. [145] S. H. Chen, Y. S. Lee, and J. C. Wang, “Phaseincorporating speech enhancement based on complexvalued Gaussian process latent variable model,” arXiv preprint arXiv:abs/1612.09150v2, 2016. [146] J. C. Lin, C. H. Wu, and W. L. Wei, “Error weighted semicoupled hidden Markov model for audiovisual emotion recognition,” IEEE Trans. Multimedia, vol. 14, no. 1, pp. 142–156, 2012. [147] P. Vincent, H. Larochelle, Y. Bengio, and P.A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. ICML, 2008, pp. 1096–1103. [148] All Music Guide, http://www.allmusic.com. [149] Last.fm, http://cn.last.fm/home. [150] V. Phongthongloa, S. Kamonsantiroj, and L. Pipanmaekaporn, “Learning high level features for chord recognition using autoencoder,” in Proc. First International Workshop Pattern Recognition, 2016, pp. 1 001 117–1 001 117. [151] N. Steenbergen, T. Gevers, and J. Burgoyne, “Chord recognition with stacked denoising autoencoders.,” Master’s thesis, University of Amsterdam, Amsterdam, Netherlands, 2014. [152] J. Schlüter, “Learning binary codes for efficient largescale music similarity search,” in Proc. ISMIR, 2013. [153] M. Defferrard, “Structured autoencoder with application to music genre recognition,” Tech. Rep., 2015. [154] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, “An empirical evaluation of deep architectures on problems with many factors of variation,” in Proc. ICML, 2007. [155] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980v9, 2014. [156] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Proc. INTERSPEECH, 2013. [157] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in arXiv preprint arXiv:1401.4082v3, 2014. [158] J. M. HernándezLobato and R. P. Adams, “Probabilistic backpropagation for scalable learning of bayesian neural networks,” arXiv preprint arXiv:1502.05336v2, 2015. [159] J. T. Rolfe, “Discrete variational autoencoders,” arXiv preprint arXiv:1609.02200, 2017. [160] J. Quiñonero Candela and C. E. Rasmussen, “A unifying view of sparse approximate gaussian process regression,” J. Mach. Learn. Res., vol. 6, pp. 1939–1959, Dec. 2005. [161] N. D. Lawrence, “Learning for larger datasets with the gaussian process latent variable model,” in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, M. Meila and X. Shen, Eds., ser. Proceedings of Machine Learning Research, vol. 2, San Juan, Puerto Rico: PMLR, 2007, pp. 243–250. [162] M. P. Deisenroth and J. W. Ng, “Distributed gaussian processes,” arXiv preprint arXiv:1502.02843, 2015.
