臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

詳目顯示 (Detailed Record)

Author: 林哲毅
Author (English): Che-Yi Lin
Thesis Title: 以前後文增進主題模型之效能
Thesis Title (English): Use Context Information to Improve the Performance of Latent Dirichlet Allocation
Advisor: 鄭卜壬
Committee Members: 陳建錦、張嘉惠、陳信希、曾新穆
Oral Defense Date: 2014-07-21
Degree: Master's
University: 國立臺灣大學 (National Taiwan University)
Department: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis Type: Academic thesis
Publication Year: 2014
Graduation Academic Year: 102
Language: English
Pages: 39
Keywords (Chinese): 主題模型、隱含狄利克雷分布、前後文、意義向量、機器學習、隱含主題
Keywords (English): topic model; latent Dirichlet allocation; context; concept vector; machine learning; latent topic
Statistics:
  • Cited: 0
  • Views: 210
  • Rating: (none)
  • Downloads: 0
  • Bookmarked: 0
Chinese Abstract (translated):
Latent Dirichlet Allocation (LDA) is a topic model commonly used to discover the latent topics in documents. In some situations, such as corpora with too few documents, or words whose meaning can only be determined from their context, the traditional LDA model produces poor results. The main cause of this problem is that a single word can carry multiple meanings, and without its context it is hard to tell which meaning is intended. Some previous studies broke LDA's assumption that words are mutually independent and tried to add word-to-word relationships into their proposed topic models. In this work we propose a new model: Context LDA. Our model first converts the original corpus into a set of "concept vectors" that carry context information and identifies the equivalence relations among these vectors; the topic model then discovers the documents' latent topics from the vectors and their equivalence relations. Context LDA not only resolves the problems faced by traditional LDA but is also easy to parallelize and extend. Even when the corpus is too small, Context LDA can still achieve good results as long as some additional word-to-word relations are supplied. We ran a variety of experiments on the 20 Newsgroups corpus to evaluate our model's performance; the results show that our model indeed outperforms the original LDA model. Finally, we also list the latent topics that the two models discovered from the same corpus.

Latent Dirichlet Allocation (LDA) is a widely used topic model for discovering the topics in documents; however, it suffers from problems such as the lack of dependency between words and data sparsity. The main cause of these problems is word-sense ambiguity in natural language. Previous works drop the "bag of words" assumption and add dependencies between words. We take a different approach: to solve these problems, we propose a topic model called the context LDA (CLDA) model.
The CLDA model first builds concept vectors with context information at each position and uses these vectors to identify equivalence relationships between words; we then present a topic model that takes these relationships as input and models the words into latent topics. The CLDA model not only overcomes the word-sense ambiguity problem but can also be easily parallelized and extended. With some extra knowledge and a slight modification, we show that our model can solve the sparse-data problem easily. We conduct several experiments on the 20 Newsgroups dataset; the results show that our model indeed improves on the performance of the original LDA and fixes the imbalanced-topic problem by using the vectors and equivalence relationships. Finally, we show examples of the latent topics produced by the LDA model and by our model.
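The abstract describes CLDA's first stage: build a context-aware concept vector for each word occurrence, then treat two occurrences of the same word as equivalent when their vectors are similar. The thesis body is not included in this record, so the Gaussian distance kernel, the per-word embeddings, and the cosine threshold below are illustrative assumptions, not the thesis's actual definitions (the table of contents' Sections 3.1.1 and 3.1.2 define the real kernel function and equivalence test). A minimal sketch:

```python
import math
from collections import defaultdict

def concept_vector(tokens, pos, embed, window=3, sigma=1.0):
    """Concept vector for the occurrence at `pos`: a kernel-weighted
    average of the embeddings of surrounding words. The Gaussian
    kernel is an assumption standing in for the thesis's kernel."""
    dim = len(next(iter(embed.values())))
    vec, total = [0.0] * dim, 0.0
    for j in range(max(0, pos - window), min(len(tokens), pos + window + 1)):
        if j == pos or tokens[j] not in embed:
            continue
        w = math.exp(-((j - pos) ** 2) / (2 * sigma ** 2))  # decays with distance
        for d in range(dim):
            vec[d] += w * embed[tokens[j]][d]
        total += w
    return [v / total for v in vec] if total else vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def equivalence_pairs(tokens, embed, threshold=0.9):
    """Pairs of positions holding the same surface word whose concept
    vectors are similar enough to be treated as the same sense."""
    occ = defaultdict(list)
    for i, t in enumerate(tokens):
        occ[t].append((i, concept_vector(tokens, i, embed)))
    pairs = []
    for items in occ.values():
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                (i, vi), (j, vj) = items[a], items[b]
                if cosine(vi, vj) >= threshold:
                    pairs.append((i, j))
    return pairs
```

With toy two-dimensional embeddings, the two occurrences of "bank" in "river bank water stream bank shore" come out equivalent, while in "river bank water money bank loan" they do not, which is the disambiguating behavior the abstract claims for the concept vectors.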

Contents
Acknowledgements
Chinese Abstract
Abstract
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
2.1 Latent Dirichlet Allocation
2.2 Previous work on the disambiguation problem
3 Our proposed method
3.1 Concept vector
3.1.1 Kernel function
3.1.2 Equivalence between vectors
3.2 Context LDA
4 Inference of Gibbs Sampling
4.1 Inference
4.2 Algorithm
4.3 Complexity
5 Extension
6 Experiment
6.1 Dataset
6.2 Parameter measurement
6.3 Information rate
6.4 Clustering
6.5 Document Classification
6.6 Word example
7 Conclusions and Future work
Bibliography

