National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: 林意婕
Author (English): YI CHIEH LIN
Title: 跨語言主題模型之比較研究
Title (English): The research on the Comparisons of Cross-Lingual Topic models
Advisor: 黃三益
Advisor (English): Hwang, San-Yih
Degree: Master's
Institution: 國立中山大學 (National Sun Yat-sen University)
Department: 資訊管理學系研究所 (Department of Information Management)
Discipline: Computing
Field: General Computing
Thesis type: Academic thesis
Year of publication: 2021
Graduation academic year: 109 (2020–2021)
Language: Chinese
Pages: 51
Keywords (Chinese): 主題模型, 跨語言主題模型, 詞向量, 最大期望演算法, AEVB, 狄式分佈, 高斯分佈
Keywords (English): Topic Modeling, Cross-lingual topic modeling, Word vector, Expectation-maximization algorithm, Auto-Encoding Variational Bayes, Dirichlet distribution, Gaussian distribution
Statistics:
  • Cited: 0
  • Views: 237
  • Downloads: 4
  • Bookmarked: 0
Whereas a traditional topic model can handle only one language, a cross-lingual topic model can analyze texts in multiple languages at the same time, uncovering the latent topic distribution and the keywords of each topic in the different languages. Traditional cross-lingual topic models are mostly trained with statistical methods and require parallel corpora, but with the growth of the Internet, large-scale analysis of non-parallel text has become increasingly important. In recent years, because they remove the need for parallel corpora, word-vector representations have been widely adopted in topic models; by mapping words into a shared space, we can capture word semantics and the relationships between words more precisely.
Among word-embedding-based cross-lingual topic models, we compared the statistics-based center-based cross-lingual topic model (Cb-CLTM; Chang et al., 2021) with the deep-learning-based embedded topic model (ETM; Dieng et al., 2020), and found that the prior distribution and the inference algorithm are the biggest differences between them: Cb-CLTM uses the expectation-maximization algorithm with a Dirichlet distribution as the topic model's prior, whereas ETM uses auto-encoding variational Bayes (AEVB) as its inference algorithm and a Gaussian distribution as its prior. Experimental analysis shows that neither clearly outperforms the other; with the deep-learning method, however, we can analyze large numbers of cross-lingual documents more quickly.
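The contrast in priors described above can be sketched numerically. The toy example below (dimensions and values are illustrative, not taken from the thesis) draws document-topic proportions two ways: from a Dirichlet, as in Cb-CLTM, and from a Gaussian draw pushed through a softmax (a logistic-normal), the form that AEVB can reparameterize in ETM. Both land on the probability simplex, but only the second is differentiable with respect to the Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of topics (toy value)

# Cb-CLTM-style prior: topic proportions drawn from a Dirichlet.
theta_dirichlet = rng.dirichlet(alpha=np.ones(K))

# ETM-style prior: a Gaussian draw mapped through a softmax
# (logistic-normal), which AEVB reparameterizes for gradient training.
delta = rng.normal(size=K)
theta_logistic_normal = np.exp(delta - delta.max())
theta_logistic_normal /= theta_logistic_normal.sum()

# Both are valid topic proportions: nonnegative, summing to 1.
assert np.isclose(theta_dirichlet.sum(), 1.0)
assert np.isclose(theta_logistic_normal.sum(), 1.0)
```

The practical point of the Gaussian route is exactly the one the abstract makes: the softmax of a reparameterized Gaussian keeps gradients flowing, so the model trains with stochastic gradient methods instead of EM iterations.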
Cross-lingual topic modeling analyzes corpora across languages, uncovering latent topics and the keywords of those topics in different languages. Most traditional topic models are based on statistical training and require parallel corpora. However, with the development of the Internet, analysis of large-scale, non-parallel corpora is becoming essential. In recent years, word-embedding-based topic models, which do not require parallel corpora, have been widely used. By mapping words to a vector space, we capture semantic regularities and the relationships among words more precisely.
In this study, we compared two word-embedding-based topic models: the center-based cross-lingual topic model (Cb-CLTM; Chang et al., 2021) and the embedded topic model (ETM; Dieng et al., 2020). The main differences are that Cb-CLTM is based on EM and uses a Dirichlet distribution as its prior, whereas ETM uses neural networks whose inference algorithm is AEVB (auto-encoding variational Bayes) and applies a Gaussian prior. Our experiments found the performance of the two models to be comparable and nearly equal. However, with neural networks, we can analyze large-scale cross-lingual corpora more rapidly.
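The ETM side of the comparison ties topics directly to the embedding space: per Dieng et al. (2020), each topic's word distribution is a softmax over inner products between a topic embedding and all word embeddings. A minimal sketch, with random matrices standing in for trained (and, in the cross-lingual setting, aligned) embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
V, D, K = 1000, 300, 5  # vocab size, embedding dim, topics (toy values)

rho = rng.normal(size=(V, D))    # word embeddings (shared space across
                                 # languages once they are aligned)
alpha = rng.normal(size=(K, D))  # topic embeddings

# ETM: topic-word distribution beta_k = softmax(rho @ alpha_k).
logits = alpha @ rho.T                                   # shape (K, V)
beta = np.exp(logits - logits.max(axis=1, keepdims=True))
beta /= beta.sum(axis=1, keepdims=True)

assert beta.shape == (K, V)
assert np.allclose(beta.sum(axis=1), 1.0)
```

Because the word embeddings can come from a space shared by several languages, each row of `beta` naturally ranks words from every language under the same topic, which is what makes the model cross-lingual without parallel documents.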
Thesis Certification
Acknowledgements
Abstract (Chinese)
Abstract (English)
CHAPTER 1 - Introduction
CHAPTER 2 - Related Works
2.1. Cross-lingual Topic Modeling
2.1.1. Document linking
2.1.2. Vocabulary linking
2.1.3. Mixed linking
2.2. Word Embeddings
2.3. Continuous topic model
CHAPTER 3 - Comparisons of Cb-CLTM and ETM
3.1. Variational Autoencoder (VAE)
3.2. Preparations, Cb-CLTM and ETM
3.2.1. Cross-Lingual Alignments
3.2.2. Center-based cross-lingual topic model (Cb-CLTM)
3.2.3. Embedded Topic Model (ETM)
3.3. Comparisons
3.3.1. Difference in Prior Distributions
3.3.2. Difference in Inference Algorithms
CHAPTER 4 - Experiments and Results
4.1. Dataset Description
4.2. Evaluation Metrics
4.3. Parameters
4.4. Coherence Performance
4.5. Topic Diversity
4.6. Quality in Document Representation
CHAPTER 5 - Conclusion
References
1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
2. Boyd-Graber, J. L., Hu, Y., & Mimno, D. (2017). Applications of topic models (Vol. 11). Now Publishers Incorporated.
3. Yang, W., Boyd-Graber, J., & Resnik, P. (2019, November). A multilingual topic model for learning weighted topic links across corpora with low comparability. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 1243-1248).
4. Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439-453.
5. Yuan, M., Van Durme, B., & Ying, J. L. (2018, January). Multilingual Anchoring: Interactive Topic Modeling and Alignment Across Languages. In NeurIPS (pp. 8667-8677).
6. Heyman, G., Vulić, I., & Moens, M. F. (2016). C-BiLDA: extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content. Data Mining and Knowledge Discovery, 30(5), 1299-1323.
7. Hu, M., & Liu, B. (2004, August). Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 168-177).
8. Mimno, D., Wallach, H., Naradowsky, J., Smith, D. A., & McCallum, A. (2009, August). Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 880-889).
9. Chang, C. H., & Hwang, S. Y. (2021). A word embedding-based approach to cross-lingual topic modeling. Knowledge and Information Systems, 63(6), 1529-1555.
10. Koehn, P. (2005, September). Europarl: A parallel corpus for statistical machine translation. In MT Summit (Vol. 5, pp. 79-86).
11. Jagarlamudi, J., & Daumé, H. (2010). Extracting multilingual topics from unaligned comparable corpora. In Advances in Information Retrieval (pp. 444-456). Springer Berlin Heidelberg.
12. Hao, S., & Paul, M. J. (2018). Learning multilingual topics from incomparable corpora. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 2595-2609).
13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
14. Harris, Z. (1954). Distributional structure. Word, 10(2-3), 146-162.
15. Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
16. Sahlgren, M. (2008). The distributional hypothesis. Italian Journal of Linguistics, 20, 33-53.
17. Landauer, T. K. (1984). Statistical semantics: Analysis of the potential performance of keyword information systems, and a cure for an ancient problem. Journal of Psycholinguistic Research, 13(6), 495-496.
18. Xun, G., Li, Y., Zhao, W. X., Gao, J., & Zhang, A. (2017, August). A correlated topic model using word embeddings. In IJCAI (pp. 4207-4213).
19. Das, R., Zaheer, M., & Dyer, C. (2015, July). Gaussian LDA for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 795-804).
20. Batmanghelich, K., Saeedi, A., Narasimhan, K., & Gershman, S. (2016, August). Nonparametric spherical topic modeling with word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (p. 537).
21. Reisinger, J., Waters, A., Silverthorn, B., & Mooney, R. J. (2010, January). Spherical topic models. In ICML.
22. Srivastava, A., & Sutton, C. (2017). Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488.
23. Card, D., Tan, C., & Smith, N. A. (2017). A neural framework for generalized topic models. arXiv preprint arXiv:1705.09296.
24. Cong, Y., Chen, B., Liu, H., & Zhou, M. (2017, July). Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In International Conference on Machine Learning (pp. 864-873). PMLR.
25. Zhang, H., Chen, B., Guo, D., & Zhou, M. (2018). WHAI: Weibull hybrid autoencoding inference for deep topic modeling. arXiv preprint arXiv:1803.01328.
26. Mikolov, T., Le, Q. V., & Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.
27. Smith, S. L., Turban, D. H., Hamblin, S., & Hammerla, N. Y. (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
28. Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1006-1011).
29. Artetxe, M., Labaka, G., & Agirre, E. (2016). Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of EMNLP (pp. 2289-2294).
30. Zhang, Y., Gaddy, D., Barzilay, R., & Jaakkola, T. (2016b). Ten Pairs to Tag – Multilingual POS tagging via coarse mapping between embeddings. In Proceedings of NAACL-HLT (pp. 1307-1317).
31. Zhang, M., Liu, Y., Luan, H., & Sun, M. (2017, July). Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1959-1970).
32. Faruqui, M., & Dyer, C. (2014, April). Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (pp. 462-471).
33. Ruder, S., Vulić, I., & Søgaard, A. (2019). A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65, 569-631.
34. Tian, L., Wong, D. F., Chao, L. S., Quaresma, P., Oliveira, F., & Yi, L. (2014, May). UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation. In LREC (pp. 1837-1842).
35. Lazaridou, A., Dinu, G., & Baroni, M. (2015, July). Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 270-280).
36. Klementiev, A., Titov, I., & Bhattarai, B. (2012, December). Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012 (pp. 1459-1474).
37. Lewis, D. D., Yang, Y., Russell-Rose, T., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr), 361-397.
38. Schwenk, H., & Li, X. (2018). A corpus for multilingual document classification in eight languages. arXiv preprint arXiv:1805.09821.
39. Conneau, A., Lample, G., Ranzato, M. A., Denoyer, L., & Jégou, H. (2017). Word translation without parallel data. arXiv preprint arXiv:1710.04087.
40. Bischof, J., & Airoldi, E. M. (2012). Summarizing topical content with word frequency and exclusivity. In Proceedings of the 29th International Conference on Machine Learning (ICML-12) (pp. 201-208).
41. Hao, S., Boyd-Graber, J., & Paul, M. J. (2018). Lessons from the Bible on modern topics: Low-resource multilingual topic model evaluation. arXiv preprint arXiv:1804.10184.
42. Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.
43. Aletras, N., & Stevenson, M. (2013, March). Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers (pp. 13-22).
44. Fuglede, B., & Topsoe, F. (2004, June). Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory (ISIT 2004) Proceedings (p. 31). IEEE.
45. Niwattanakul, S., Singthongchai, J., Naenudorn, E., & Wanapu, S. (2013, March). Using of Jaccard coefficient for keywords similarity. In Proceedings of the International MultiConference of Engineers and Computer Scientists (Vol. 1, No. 6, pp. 380-384).
46. Boyd-Graber, J., & Blei, D. (2012). Multilingual topic models for unaligned text. arXiv preprint arXiv:1205.2657.
47. Ma, T., & Nasukawa, T. (2016). Inverted bilingual topic models for lexicon extraction from non-parallel data. arXiv preprint arXiv:1612.07215.
48. Gutiérrez, E. D., Shutova, E., Lichtenstein, P., de Melo, G., & Gilardi, L. (2016). Detecting cross-cultural differences using a multilingual topic model. Transactions of the Association for Computational Linguistics, 4, 47-60.
49. Liu, X., Duh, K., & Matsumoto, Y. (2015). Multilingual topic models for bilingual dictionary extraction. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 14(3), 1-22.
50. Guo, J., Che, W., Yarowsky, D., Wang, H., & Liu, T. (2015, July). Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1234-1244).
51. Ono, M., Miwa, M., & Sasaki, Y. (2015). Word embedding-based antonym detection using thesauri and distributional information. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 984-989).