National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

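Graduate student: 陳榮蓬
Graduate student (English): Chen, Jung-Peng
Thesis title: 基於語言模板之分散式表示法於社群媒體主題分類之研究
Thesis title (English): Linguistic Pattern-based Distributed Representation Method for Topic Detection on Social Media
Advisor: 許聞廉
Advisor (English): Hsu, Wen-Lian
Oral defense committee: 馬偉雲; 張詠淳
Oral defense committee (English): Ma, Wei-Yun; Chang, Yung-Chun
Oral defense date: 2018-12-05
Degree: Master's
Institution: National Tsing Hua University
Department: Institute of Information Systems and Applications
Discipline: Computer Science
Sub-discipline: Systems Design
Thesis type: Academic thesis
Year of publication: 2018
Academic year of graduation: 107
Language: Chinese
Number of pages: 48
Keywords (Chinese): 主題偵測 (topic detection); 語言模板 (linguistic pattern); 深度學習 (deep learning); 分散式向量表示法 (distributed vector representation); 知識表示 (knowledge representation); 卷積神經網路 (convolutional neural network); 文本分類 (text classification); 語意分析 (semantic analysis)
Keywords (English): Topic Detection; Linguistic Pattern; Frequent Pattern Mining; Distributed Representation; Knowledge Representation; Deep Learning; Convolutional Neural Network; Text Classification; Semantic Analysis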
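Abstract: To date, most research in natural language processing has relied on machine learning. The drawbacks of machine learning are also evident, however: models are hard to read and hard to maintain. We therefore aim to use an intuitive, human-like way of reasoning as the research method and to combine it with state-of-the-art machine learning practice, so that the result is both maintainable and accurate. This study proposes a classification method that matches human intuition, built on the concept of linguistic patterns. Starting from a linguistic pattern generation model proposed in 2017, we convert patterns into vectors that represent a document with respect to each topic, so that a deep learning model can receive richer semantic information, and we combine this with the high performance of machine learning to refine the overall classification model. We use this method to build a topic detection model for Chinese electronic media and compare it with methods commonly used in semantic analysis to observe its effectiveness. The method is built from an ontology and keywords, and then applies frequent pattern mining to find the key semantic patterns. Besides matching the intuition behind human classification, the method lets us understand the semantic structure of each topic through the generation and matching of linguistic patterns, and every stage produces results that can be used to improve the model.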
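The pattern-mining step described above can be illustrated with a small sketch. It is not the thesis's implementation: it assumes each clause has already been reduced to a sequence of labels (the labels shown are invented for illustration, standing in for keyword and ontology tags), and it uses a brute-force enumeration of ordered label subsequences in place of whatever frequent pattern mining algorithm the thesis actually uses. The function name `mine_frequent_patterns` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_patterns(sequences, min_support=2, max_len=3):
    """Return ordered label patterns (gaps allowed) that occur in at
    least `min_support` of the input sequences."""
    support = Counter()
    for seq in sequences:
        seen = set()  # count each pattern at most once per sequence
        for n in range(2, max_len + 1):
            # Choosing increasing index tuples preserves label order.
            for idx in combinations(range(len(seq)), n):
                seen.add(tuple(seq[i] for i in idx))
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}


# Hypothetical clauses, already labeled (assumption: labeling is done upstream).
labeled = [
    ["PERSON", "WIN", "GAME"],
    ["PERSON", "LOSE", "GAME"],
    ["TEAM", "WIN", "GAME"],
]
patterns = mine_frequent_patterns(labeled)
```

With these three toy clauses, only `("PERSON", "GAME")` and `("WIN", "GAME")` reach a support of 2. Brute-force enumeration is exponential in clause length, so it is only suitable for short label sequences such as these.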
Abstract (English): Most current research in Natural Language Processing is based on Machine Learning. However, Machine Learning approaches clearly lack readability and maintainability. In this thesis, we propose an intuitive classifier based on a recent Linguistic Pattern generator, combined with a Deep Learning model that learns semantic information from topical distributed representations of documents. Our Linguistic Pattern generator is based on Keywords and Ontology and uses Frequent Pattern Mining to extract critical patterns. The interpretable results of every step help us grasp the syntactic and semantic information of each class and guide the maintenance of our system. In addition, the three levels of distributed representation in our model (pattern, sentence, and document) are flexible enough to be used in other tasks and with other Machine Learning models. We evaluate our model on a topic detection dataset extracted from Chinese social media and find that it outperforms other popular classifiers. Moreover, we combine our document representation with SVM, FastText, and CNN and find that it provides much more semantic information than common document representations and thus improves classification.
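As a rough illustration of the three representation levels named in the abstract (pattern, sentence, document), the sketch below composes them by simple averaging. The abstract does not say how the thesis combines the levels, so the averaging, the two-dimensional per-topic pattern vectors, the labels, and all function names here are assumptions made for illustration only.

```python
def is_subsequence(pattern, seq):
    """True if the labels of `pattern` appear in `seq` in order (gaps allowed)."""
    it = iter(seq)
    return all(label in it for label in pattern)


def mean_vectors(vectors, dim):
    """Element-wise mean of equal-length vectors; zero vector if none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sentence_vector(seq, pattern_vectors, dim):
    """Average the vectors of every pattern that matches the sentence."""
    matched = [vec for pat, vec in pattern_vectors.items()
               if is_subsequence(pat, seq)]
    return mean_vectors(matched, dim)


def document_vector(sentences, pattern_vectors, dim):
    """Average the sentence vectors of a document."""
    return mean_vectors(
        [sentence_vector(s, pattern_vectors, dim) for s in sentences], dim)


# Hypothetical pattern vectors: one weight per topic (e.g. sports, finance).
pattern_vectors = {
    ("PERSON", "GAME"): [1.0, 0.0],
    ("PRICE", "RISE"): [0.0, 1.0],
}
doc = [["PERSON", "WIN", "GAME"], ["PRICE", "RISE"]]
doc_vec = document_vector(doc, pattern_vectors, dim=2)
```

A vector like `doc_vec` could then be passed as features to a downstream classifier such as the SVM, FastText, or CNN models mentioned in the abstract.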
Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Motivation and Purposes
1.2 Thesis Structure
Chapter 2 Related Work
2.1 Keyword Extraction
2.2 Ontology
2.3 Frequent Pattern Mining
2.4 Word Embedding
2.5 Topic Detection
Chapter 3 Linguistic Pattern-based Distributed Representation for Text Classification
3.1 Linguistic Pattern Generation for Domain Knowledge Representation
3.1.1 Critical Element Labeling (CEL)
3.1.2 Frequent Pattern Mining
3.1.3 Linguistic Pattern Determination
3.2 Linguistic Pattern-based Distributed Representation Method (Doc2LPVec)
3.2.1 Linguistic Pattern Representation
3.2.2 Sentence Vector
3.2.3 Document Vector
3.3 Text Classification
Chapter 4 Evaluation
4.1 Data Collection and Data Preprocessing
4.2 Evaluation on Topic Detection
Chapter 5 Conclusion and Future Work