[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Developing a benchmark for emotional analysis of music. PLOS ONE, 12(3):1–22, 2017.
[2] S. Amiriparian, M. Gerczuk, E. Coutinho, A. Baird, S. Ottl, M. Milling, and B. Schuller. Emotion and themes recognition in music utilising convolutional and recurrent neural networks. 2019.
[3] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[4] R. Arandjelovic and A. Zisserman. Objects that sound. In The European Conference on Computer Vision (ECCV), September 2018.
[5] Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pages 892–900, 2016.
[6] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing, 6(1):43–55, 2015.
[7] J. Chao, H. Wang, W. Zhou, W. Zhang, and Y. Yu. TuneSensor: A semantic-driven music recommendation service for digital photo albums. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), 2011.
[8] M. Chmulik, R. Jarina, M. Kuba, and E. Lieskovska. Continuous music emotion recognition using selected audio features. In 2019 42nd International Conference on Telecommunications and Signal Processing (TSP), pages 589–592, 2019.
[9] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[10] A. J. Cohen. Congruence-association model of music and multimedia: Origin and evolution. In The Psychology of Music in Multimedia, pages 17–47, 2013.
[11] E. Coutinho, G. Trigeorgis, S. Zafeiriou, and B. Schuller. Automatically estimating emotion in music with deep long short-term memory recurrent neural networks. In CEUR Workshop Proceedings, volume 1436, 2015.
[12] T. Dahiru. P-value, a true test of statistical significance? A cautionary note. Annals of Ibadan Postgraduate Medicine, 6(1):21–26, 2008.
[13] E. Dellandréa, M. Huigsloot, L. Chen, Y. Baveye, Z. Xiao, and M. Sjöberg. The MediaEval 2018 Emotional Impact of Movies task. 2018.
[14] P. Ekman and W. V. Friesen. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1), 1969.
[15] M. B. Er and I. B. Aydilek. Music emotion recognition by using chroma spectrogram and deep visual features. International Journal of Computational Intelligence Systems, 12(2):1622–1634, 2019.
[16] F. Eyben, M. Wöllmer, and B. Schuller. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1459–1462, 2010.
[17] Y. Fan, X. Lu, D. Li, and Y. Liu. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, ICMI '16, pages 445–450, New York, NY, USA, 2016. Association for Computing Machinery.
[18] D. Gerónimo and H. Kjellström. Unsupervised surveillance video retrieval based on human action and appearance. In 2014 22nd International Conference on Pattern Recognition, pages 4630–4635. IEEE, 2014.
[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[20] J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[21] W. Gu, X. Gu, J. Gu, B. Li, Z. Xiong, and W. Wang. Adversary guided asymmetric hashing for cross-modal retrieval. In Proceedings of the 2019 International Conference on Multimedia Retrieval, ICMR '19, pages 159–167, New York, NY, USA, 2019. Association for Computing Machinery.
[22] V. N. Gudivada and V. V. Raghavan. Content-based image retrieval systems. Computer, 28(9):18–22, 1995.
[23] R. Gupta and S. S. Narayanan. Predicting affect in music using regression methods on low level features. In MediaEval, 2015.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[25] S. Hong, W. Im, and H. S. Yang. CBVMR: Content-based video-music retrieval using soft intra-modal structure constraint. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval, ICMR '18, pages 353–361, New York, NY, USA, 2018. Association for Computing Machinery.
[26] T.-H. Hsieh, L. Su, and Y.-H. Yang. A streamlined encoder/decoder architecture for melody extraction. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 156–160. IEEE, 2019.
[27] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras. DEAP: A database for emotion analysis; using physiological signals. IEEE Transactions on Affective Computing, 3(1):18–31, 2011.
[28] B. Kostiuk, Y. M. G. Costa, A. S. Britto, X. Hu, and C. N. Silla. Multi-label emotion classification in music videos using ensembles of audio and video features. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 517–523, 2019.
[29] B. Li, Z. Chen, S. Li, and W.-S. Zheng. Affective video content analyses by using cross-modal embedding learning features. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 844–849, 2019.
[30] B. Li and A. Kumar. Query by video: Cross-modal music retrieval. In ISMIR, pages 604–611, 2019.
[31] J.-C. Lin, W.-L. Wei, and H.-M. Wang. Automatic music video generation based on emotion-oriented pseudo song prediction and matching. In Proceedings of the 24th ACM International Conference on Multimedia, MM '16, pages 372–376, New York, NY, USA, 2016. Association for Computing Machinery.
[32] J.-C. Lin, W.-L. Wei, and H.-M. Wang. DEMV-matchmaker: Emotional temporal course representation and deep similarity matching for automatic music video generation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2772–2776. IEEE, 2016.
[33] C. Liu, T. Tang, K. Lv, and M. Wang. Multi-feature based emotion recognition for video clips. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, ICMI '18, pages 630–634, New York, NY, USA, 2018. Association for Computing Machinery.
[34] H. Liu, Y. Fang, and Q. Huang. Music emotion recognition using a variant of recurrent neural network. In 2018 International Conference on Mathematics, Modeling, Simulation and Statistics Application (MMSSA 2018), pages 15–18. Atlantis Press, 2019.
[35] X. Liu, Q. Chen, X. Wu, Y. Liu, and Y. Liu. CNN based music emotion classification. arXiv preprint arXiv:1704.05665, 2017.
[36] Y. Ma, X. Liang, and M. Xu. THUHCSI in MediaEval 2018 Emotional Impact of Movies task. 2018.
[37] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[38] M. Malik, S. Adavanne, K. Drossos, T. Virtanen, D. Ticha, and R. Jarina. Stacked convolutional and recurrent neural networks for music emotion recognition. arXiv preprint arXiv:1706.02292, 2017.
[39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[40] R. Orjesek, R. Jarina, M. Chmulik, and M. Kuba. DNN based music emotion recognition from raw audio signal. In 2019 29th International Conference Radioelektronika (RADIOELEKTRONIKA), pages 1–4. IEEE, 2019.
[41] R. Panda, R. M. Malheiro, and R. P. Paiva. Novel audio features for music emotion recognition. IEEE Transactions on Affective Computing, 2018.
[42] Y. Peng and J. Qi. CM-GANs: Cross-modal generative adversarial networks for common representation learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(1):1–24, 2019.
[43] S. Qiao, R. Wang, S. Shan, and X. Chen. Deep heterogeneous hashing for face video retrieval. IEEE Transactions on Image Processing, 29:1299–1312, 2020.
[44] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[46] G. Salton. Developments in automatic text retrieval. Science, 253(5023):974–980, 1991.
[47] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[48] E. Schubert. Modeling perceived emotion with continuous musical features. Music Perception, 21(4):561–585, 2004.
[49] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[50] J. J. Sun, T. Liu, and G. Prasad. GLA in MediaEval 2018 Emotional Impact of Movies task. arXiv preprint arXiv:1911.12361, 2019.
[51] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6398–6407, 2020.
[52] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[53] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[54] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, MM '17, pages 154–162, New York, NY, USA, 2017. Association for Computing Machinery.
[55] H. Wang, D. Sahoo, C. Liu, E.-P. Lim, and S. C. Hoi. Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11572–11581, 2019.
[56] X. Wu, Y. Qiao, X. Wang, and X. Tang. Bridging music and image via cross-modal ranking analysis. IEEE Transactions on Multimedia, 18(7):1305–1318, 2016.
[57] M. Xu, X. Li, H. Xianyu, J. Tian, F. Meng, and W. Chen. Multi-scale approaches to the MediaEval 2015 Emotion in Music task. In MediaEval, 2015.
[58] Y. Yu, S. Luo, S. Liu, H. Qiao, Y. Liu, and L. Feng. Deep attention based music genre classification. Neurocomputing, 372:84–91, 2020.
[59] Z. Yu, X. Xu, X. Chen, and D. Yang. Temporal pyramid pooling convolutional neural network for cover song identification. In IJCAI, pages 4846–4852, 2019.
[60] D. Zeng, Y. Yu, and K. Oyama. Audio-visual embedding for cross-modal music video retrieval through supervised deep CCA. In 2018 IEEE International Symposium on Multimedia (ISM), pages 143–150, 2018.
[61] K. Zhang, H. Zhang, S. Li, C. Yang, and L. Sun. The PMEmo dataset for music emotion recognition. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval, ICMR '18, pages 135–142, New York, NY, USA, 2018. Association for Computing Machinery.
[62] L. Zhen, P. Hu, X. Wang, and D. Peng. Deep supervised cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.