National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: 曾鵬元
Author (English): Peng-Yuan Tseng
Title: 潛在狄氏分段法於文件分類之研究
Title (English): Latent Dirichlet Segmentation for Text Categorization
Advisor: 簡仁宗
Advisor (English): Jen-Tzung Chien
Degree: Master's
Institution: National Cheng Kung University
Department: Department of Computer Science and Information Engineering (Graduate Program)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Year of Publication: 2009
Graduation Academic Year: 97 (2008-09)
Language: Chinese
Number of Pages: 68
Keywords (Chinese): 文件模型 (document model), 馬可夫鏈 (Markov chain)
Keywords (English): LDA, VSM, HMM, VB-EM, Dirichlet
Statistics:
  • Cited: 0
  • Views: 256
  • Rating:
  • Downloads: 0
  • Bookmarked: 0
Abstract:
With the progress of information technology, the amount of information keeps growing, and users face more and more of it. Often what a user really wants is a more precise portion of a document, such as a paragraph or a table; to the user, these document segments are the truly useful information. In information retrieval and classification research, the document model plays an important role: a robust document model lets users easily retrieve the content they actually want and improves system performance. In general, writing follows a structure of opening, development, turn, and conclusion (起承轉合); each segment has its own stylistic regularities and differs in meaning and importance. Modeling the words of different segments separately therefore makes the document model more precise and concrete, and makes different stylistic content and sentence structure easier to capture. Common document models include the vector space model, N-gram models, and topic models, but existing generative document models cannot fully serve this application. To address this problem, this study proposes a statistical latent topic segmentation model that incorporates a Markov chain into latent Dirichlet allocation (LDA) to capture segment characteristics. Each segment is described by its own topic model parameters, yielding a finer-grained document model and better support for reading comprehension. Unlike traditional methods that perform document segmentation and modeling as separate batch steps, the proposed latent Dirichlet segmentation (LDS) model integrates segmentation and topic modeling in a single framework, whose parameters are trained under one consistent objective function. A variational Bayes (VB) algorithm is used to estimate the model parameters, and a Viterbi VB algorithm that considers the best state sequence is developed to reduce the computational complexity of training. In the experiments, we evaluate and compare the proposed method on the 20 Newsgroups and ICASSP paper corpora; preliminary results show that the proposed LDS model achieves better document classification accuracy.
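To make the generative story above concrete, here is a minimal Python sketch of the core idea: a Markov chain over segment states, where each state carries its own Dirichlet prior over topic mixtures and word generation within a segment is LDA-style. The dimensions, variable names, and the exact factorization (shared topic-word distributions, a topic mixture redrawn at each segment boundary) are illustrative assumptions for exposition, not the thesis's actual parameterization.

import numpy as np

rng = np.random.default_rng(0)

V, K, S = 50, 4, 3                            # vocabulary size, topics, segment states

alpha = rng.gamma(2.0, 1.0, size=(S, K))      # per-state Dirichlet prior over topics
beta = rng.dirichlet(np.ones(V), size=K)      # shared topic-word distributions (as in LDA)
pi = rng.dirichlet(np.ones(S))                # initial segment-state probabilities
A = rng.dirichlet(np.ones(S), size=S)         # segment-state transition matrix

def generate_document(n_words=100):
    """Sample a state sequence and words; theta is redrawn at each segment change."""
    words, states = [], []
    s = rng.choice(S, p=pi)
    theta = rng.dirichlet(alpha[s])           # topic mixture for the current segment
    for _ in range(n_words):
        z = rng.choice(K, p=theta)            # topic assignment for this word
        words.append(rng.choice(V, p=beta[z]))
        states.append(s)
        s_next = rng.choice(S, p=A[s])        # Markov transition between words
        if s_next != s:
            theta = rng.dirichlet(alpha[s_next])   # new segment, new topic mixture
        s = s_next
    return words, states

words, states = generate_document()
print("words :", words[:15])
print("states:", states[:15])

Inference in the thesis runs the other way, via variational Bayes with a Viterbi pass over the state sequence; the sampler above only shows what the model assumes about how segments and words arise.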
Abstract (English):
As information technology grows rapidly, end users face an explosion of information spanning text, speech, and image media. In many cases, end users expect to extract useful information, or even to obtain summarized paragraphs and tables, from information providers. As a result, how to effectively extract coherent segments or paragraphs plays an important role in document modeling and text categorization. This study concerns the issues of text segmentation and topic-based modeling, and constructs a refined model for document representation and categorization. We present a new statistical approach that partitions text documents into coherent segments and simultaneously extracts the topic regularities. We use latent Dirichlet allocation (LDA) to extract the latent topics and incorporate a Markov chain to detect the stylistic segments within a document, thereby embedding the Markov model's capacity to represent time-varying word statistics. The latent Dirichlet segmentation (LDS) model is accordingly constructed and trained by a variational Bayesian inference procedure in which a Viterbi decoder carries out the text segmentation. Each segment is represented by a Markov state, so that word variations within a document are compensated and nonstationary stylistic and contextual information can be discovered. In experiments on the 20 Newsgroups corpus and an ICASSP proceedings corpus, we compare the performance of the hidden Markov model, LDA, and LDS in document modeling and classification, and illustrate the advantage of LDS in terms of perplexity and classification accuracy.
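Since the comparison above is reported in terms of perplexity, the following is a small self-contained sketch of the per-word perplexity computation. The add-one smoothed unigram scorer is only a stand-in so the numbers run end to end; LDA, HMM, and LDS differ precisely in how they assign the per-word probabilities.

import numpy as np

def perplexity(log_probs):
    """Per-word perplexity: exp of the negative mean per-word log-probability."""
    log_probs = np.asarray(log_probs, dtype=float)
    return float(np.exp(-log_probs.mean()))

# Stand-in scorer: add-one smoothed unigram estimates from a toy training text.
train = "the topic model segments the document into coherent segments".split()
test = "the model segments the text".split()

vocab = set(train) | set(test)
total = len(train) + len(vocab)               # denominator under add-one smoothing
log_p = [np.log((train.count(w) + 1) / total) for w in test]

print(f"per-word perplexity: {perplexity(log_p):.2f}")

Lower perplexity means the model finds held-out text less surprising; the same measure is used in the thesis to compare HMM, LDA, and LDS on the 20 Newsgroups and ICASSP corpora.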
Table of Contents:
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Background and Motivation
1.2 Objectives and Methods
1.3 Chapter Overview
Chapter 2: Related Work
2.1 Text Segmentation
2.1.1 TextTiling
2.1.2 Statistical Segmentation Models
2.1.3 Hidden Markov Segmentation Models
2.2 Document Models
2.2.1 Latent Dirichlet Allocation
2.2.2 Dynamic Topic Models
2.2.3 The HMM-LDA Model
Chapter 3: Latent Dirichlet Segmentation
3.1 The Latent Dirichlet Segmentation Model
3.2 Variational Bayesian Inference
3.3 Viterbi Approximation
Chapter 4: Experiments
4.1 Experimental Setup
4.2 Evaluation Measures
4.3 Results on 20 Newsgroups
4.3.1 20 Newsgroups Data
4.3.2 Perplexity
4.3.3 Log-Likelihood Convergence Curves
4.3.4 Most Probable Words per Segment
4.3.5 Per-Word Probabilities across Segments
4.3.6 Deviation of 20 Newsgroups Segmentation Results from the Reference
4.3.7 Classification Accuracy on 20 Newsgroups
4.4 Results on ICASSP
4.4.1 ICASSP Data
4.4.2 Perplexity (comparing LDA, HMM, and LDS with different numbers of states)
4.4.3 Log-Likelihood Convergence Curves
4.4.5 Per-Word Probabilities across Segments
Chapter 5: Conclusions and Future Directions
Appendix
References:
[1] D. Beeferman, A. Berger, and J. D. Lafferty, "Statistical models for text segmentation", Machine Learning, vol. 34, nos. 1-3, pp. 177-210, 1999.
[2] Y. Bestgen, "Improving text segmentation using latent semantic analysis: a reanalysis of Choi, Wiemer-Hastings, and Moore (2001)", Computational Linguistics, vol. 32, no. 1, pp. 5-12, 2006.
[3] T. Brants, F. Chen, and I. Tsochantaridis, "Topic-based document segmentation with probabilistic latent semantic analysis", in Proceedings of the International Conference on Information and Knowledge Management, pp. 211-218, 2002.
[4] D. Blei and J. D. Lafferty, "Correlated topic models", in Advances in Neural Information Processing Systems, Cambridge, MA: MIT Press, pp. 17-35, 2006.
[5] D. Blei and J. D. Lafferty, "Dynamic topic models", in Proceedings of the International Conference on Machine Learning, pp. 113-120, 2006.
[6] D. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation", Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[7] D. Blei and P. J. Moreno, "Topic segmentation with an aspect hidden Markov model", in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 343-348, 2001.
[8] D. Blei, M. I. Jordan, and A. Y. Ng, "Hierarchical Bayesian models for applications in information retrieval", Bayesian Statistics, vol. 7, pp. 25-43, 2003.
[9] J. Boyd-Graber and D. Blei, "Syntactic topic models", in Advances in Neural Information Processing Systems, 2008.
[10] J.-T. Chien and C.-H. Chueh, "Latent Dirichlet language model for speech recognition", in Proceedings of the IEEE Workshop on Spoken Language Technology, pp. 201-204, 2008.
[11] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977.
[13] G. E. Forsythe, M. A. Malcolm, and C. B. Moler, Computer Methods for Mathematical Computations (Chapter 9: Least squares and the singular value decomposition), Englewood Cliffs, NJ: Prentice-Hall, 1977.
[14] M. Girolami and A. Kaban, "On an equivalence between PLSI and LDA", in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 433-434, 2003.
[15] T. L. Griffiths, M. Steyvers, D. Blei, and J. B. Tenenbaum, "Integrating topics and syntax", in Advances in Neural Information Processing Systems, vol. 17, pp. 537-544, 2004.
[16] T. L. Griffiths and M. Steyvers, "Finding scientific topics", Proceedings of the National Academy of Sciences, vol. 101, pp. 5228-5235, 2004.
[17] M. A. Hearst, "TextTiling: segmenting text into multi-paragraph subtopic passages", Computational Linguistics, vol. 23, no. 1, pp. 33-64, 1997.
[18] M. A. Hearst, "TextTiling: a quantitative approach to discourse segmentation", Technical report, Computer Science Division, University of California, Berkeley, 1997.
[19] M. Hirohata, Y. Shinnaka, K. Iwano, and S. Furui, "Sentence-extractive automatic speech summarization and evaluation techniques", Speech Communication, vol. 48, no. 9, pp. 1151-1161, 2006.
[20] T. Hofmann, "Probabilistic latent semantic indexing", in Proceedings of ACM SIGIR, pp. 35-44, 1999.
[21] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis", Machine Learning, vol. 42, no. 1, pp. 177-196, 2001.
[22] T. Hofmann, "Unsupervised learning from dyadic data", in Advances in Neural Information Processing Systems, vol. 11, Cambridge, MA: MIT Press, 1999.
[23] B.-J. Hsu and J. Glass, "Style & topic language model adaptation using HMM-LDA", in Proceedings of Empirical Methods in Natural Language Processing, pp. 373-381, 2006.
[24] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, "An introduction to variational methods for graphical models", Machine Learning, vol. 37, pp. 183-233, 1999.
[25] T. Koshinaka, K. Iso, and A. Okumura, "An HMM-based text segmentation method using variational Bayes approach and its application to LVCSR for broadcast news", in Proceedings of ICASSP, pp. 485-488, 2005.
[26] L. Azzopardi, M. Girolami, and C. J. van Rijsbergen, "Topic based language models for ad hoc information retrieval", in Proceedings of the International Joint Conference on Neural Networks, pp. 3281-3286, 2004.
[27] W. Li and A. McCallum, "Pachinko allocation: DAG-structured mixture models of topic correlations", in Proceedings of the International Conference on Machine Learning, pp. 577-584, 2006.
[28] T. Minka and J. Lafferty, "Expectation-propagation for the generative aspect model", in Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pp. 352-359, 2002.
[29] T. Minka, "Estimating a Dirichlet distribution", Technical report, MIT, 2000.
[30] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.
[31] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983.
[33] Y.-C. Tam and T. Schultz, "Correlated latent semantic model for unsupervised LM adaptation", in Proceedings of ICASSP, pp. 41-44, 2007.
[34] P. van Mulbregt, I. Carp, L. Gillick, S. Lowe, and J. Yamron, "Text segmentation and topic tracking on broadcast news via a hidden Markov model approach", in Proceedings of ICSLP, pp. 2519-2522, 1998.
[35] H. M. Wallach, "Topic modeling: beyond bag-of-words", in Proceedings of the International Conference on Machine Learning, pp. 977-984, 2006.
[36] C. Wang, B. Thiesson, C. Meek, and D. Blei, "Markov topic models", in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
[37] C. Wang, D. Blei, and D. Heckerman, "Continuous time dynamic topic models", in Proceedings of Uncertainty in Artificial Intelligence, 2008.
[38] X. Wei and W. Croft, "LDA-based document models for ad-hoc retrieval", in Proceedings of ACM SIGIR, pp. 178-185, 2006.