National Digital Library of Theses and Dissertations in Taiwan

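Graduate student: 曹又心
Graduate student (English name): You-Hsin Tsao
Thesis title: 結合搭配詞與主題概念改善中文口碑分類
Thesis title (English): Integration of collocation and concepts for improvement of word of mouth classification
Advisor: 洪智力
Advisor (English name): Chihli Hung
Degree: Master's
Institution: Chung Yuan Christian University
Department: Graduate Institute of Information Management
Discipline: Computer Science
Subdiscipline: General Computer Science
Thesis type: Academic thesis
Year of publication: 2015
Graduation academic year: 103 (ROC calendar)
Language: Chinese
Number of pages: 98
Keywords (Chinese): 情感分析; 語料庫; CKIP; 關聯規則; 交互資訊量
Keywords (English): Sentiment Analysis; Corpus; CKIP; Association Rules; Mutual Information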
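In this era of big data, people habitually share their experiences with products and services online. The sheer volume of information means that users must spend considerable time finding and digesting the content that meets their needs. Sentiment analysis, a text mining technique, can determine from electronic text whether an article's sentiment orientation is positive (recommending) or negative (not recommending). To analyze and classify unstructured articles effectively, researchers often use a sentiment corpus as the basis for sentiment classification. At present, however, most academic research on sentiment analysis targets English and relies on existing sentiment corpora such as SentiWordNet and SenticNet to assist feature extraction and weighting. For Chinese there is as yet no sentiment corpus annotated with sentiment polarity and scores, and the corpora mentioned above assign static scores that cannot be adjusted for different domains or as usage evolves over time. This study therefore builds adaptive Chinese sentiment corpora, one per domain, from the content of Chinese domain articles. Using users' experiences with products on review websites together with the ratings they give those products, it determines the sentiment orientation and score of each word from three kinds of relationships: between words, between words and domains, and between words and user ratings. Relationships between words are computed sentence by sentence to identify feature words and opinion words: association rules and mutual information are used to extract each feature word and its paired opinion word, and the combination of the two is treated as a collocation. Relationships between words and domains and between words and user ratings are computed using article probability, the correlation coefficient, and TF-ICF (Term Frequency-Inverse Class Frequency) to set word polarity scores. The experimental results show that word usage and distribution vary by domain, so the suitable sentiment tagging method differs as well. This study developed several sentiment tagging techniques that can adapt to word-of-mouth articles from different domains.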
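The TF-ICF weighting named in the abstract can be illustrated with a small sketch. It assumes one common formulation, in which a term's count within a class is multiplied by the logarithm of the number of classes divided by the number of classes containing the term. The thesis's exact formula is not given in this record, and the function name `tf_icf` and the toy data are hypothetical.

```python
import math
from collections import Counter


def tf_icf(class_docs):
    """Return {class label: {term: weight}} for tokenized documents.

    class_docs maps a class label (e.g. a rating class) to a list of
    documents, each document being a list of tokens.
    Assumed formula: weight = tf(term, class) * ln(n_classes / cf(term)).
    """
    # Term frequency per class: occurrences of each term in that class.
    tf = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in class_docs.items()
    }
    n_classes = len(tf)
    # Class frequency: number of classes in which the term occurs.
    cf = Counter(term for counts in tf.values() for term in counts)
    return {
        label: {
            term: count * math.log(n_classes / cf[term])
            for term, count in counts.items()
        }
        for label, counts in tf.items()
    }


# Toy data (hypothetical): two rating classes of segmented reviews.
reviews = {
    "positive": [["great", "recommend"], ["great"]],
    "negative": [["boring", "recommend"]],
}
weights = tf_icf(reviews)
```

Under this formulation a term that occurs in every class (here `recommend`) receives a weight of zero, while terms concentrated in one class keep a positive weight.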

With the rapid development of big data, users post their experiences and opinions about brands, products, services, and companies on the Internet. This produces a great amount of information, and processing and analyzing it has become a major challenge. In text mining, sentiment analysis can determine whether the sentiment of an electronic text is positive or negative. To classify the sentiment of unstructured data effectively, scholars often use sentiment corpora, which define a fixed sentiment score for each word. Most such research targets English and relies on English corpora (e.g., SentiWordNet, SenticNet). Two main problems remain. First, no Chinese sentiment corpus defines sentiment polarity and scores. Second, the sentiment score of each word in these corpora is fixed, so it cannot adapt to different domains or to changes over time. In this study, we propose a way to build an adaptive Chinese sentiment corpus based on Chinese word-of-mouth documents. Using product review websites, which contain users' experiences with products and their product ratings, we define sentiment tendencies and sentiment scores by analyzing the relationships between words, between words and domains, and between words and user ratings. The relationships between words are calculated sentence by sentence to identify feature words and opinion words. We use association rules and mutual information to extract the feature words and their associated opinion words; each such pair is treated as a collocation. Three approaches, i.e., article probability, correlation coefficient, and term frequency-inverse class frequency (TF-ICF), are used to compute word sentiment scores.
Experimental results show that the usage and distribution of words vary across domains, and thus the suitable sentiment tagging approaches differ. In this research, we have developed several sentiment tagging techniques that are able to adapt to word-of-mouth documents from various domains.
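The sentence-level pairing of feature words and opinion words described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it uses the textbook definitions of association-rule support and confidence (for the rule feature -> opinion) and of pointwise mutual information, and the function name `collocation_scores`, its threshold parameters, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter
from itertools import product


def collocation_scores(sentences, min_support=0.0, min_confidence=0.0):
    """Score (feature word, opinion word) pairs that co-occur in a sentence.

    sentences: list of (feature_words, opinion_words) tuples, one per sentence.
    Returns {(feature, opinion): {"support", "confidence", "pmi"}} for pairs
    that meet both thresholds.
    """
    n = len(sentences)
    feat_count, op_count, pair_count = Counter(), Counter(), Counter()
    for features, opinions in sentences:
        f_set, o_set = set(features), set(opinions)
        feat_count.update(f_set)                 # sentences containing each feature
        op_count.update(o_set)                   # sentences containing each opinion
        pair_count.update(product(f_set, o_set)) # sentences containing both

    scores = {}
    for (f, o), c in pair_count.items():
        support = c / n                # P(feature and opinion)
        confidence = c / feat_count[f] # P(opinion | feature)
        if support < min_support or confidence < min_confidence:
            continue
        # Pointwise mutual information: log2 of P(f, o) / (P(f) * P(o)).
        pmi = math.log2(support / ((feat_count[f] / n) * (op_count[o] / n)))
        scores[(f, o)] = {"support": support, "confidence": confidence, "pmi": pmi}
    return scores


# Toy input (hypothetical): words already segmented and filtered by part of speech.
sentences = [
    (["plot"], ["moving"]),
    (["plot"], ["moving"]),
    (["acting"], ["stiff"]),
    (["plot"], ["stiff"]),
]
scores = collocation_scores(sentences)
filtered = collocation_scores(sentences, min_confidence=0.5)
```

Raising `min_confidence` to 0.5 drops the weakly associated pair `("plot", "stiff")`, which is the kind of threshold filtering an association-rule step performs before pairs are kept as collocations.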

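Table of Contents
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
1.1 Research Background and Motivation
1.2 Research Questions
1.3 Research Objectives
1.4 Research Contributions
1.5 Research Process
Chapter 2. Literature Review
2.1 Sentiment Classification
2.2 Sentiment Corpus Construction
2.3 Feature Extraction
2.3.1 Association Rules
2.3.2 Mutual Information
2.3.3 Correlation Coefficient
2.3.4 TF-IDF
2.3.5 TF-ICF
Chapter 3. Research Method
3.1 Research Framework
3.2 Data Collection Module
3.2.1 Web Crawler
3.2.2 HTML Tag Removal
3.3 Article Preprocessing
3.3.1 Sentence Segmentation
3.3.2 CKIP Word Segmentation
3.3.3 Word-Meaning Reversal
3.3.4 Key Part-of-Speech Selection
3.4 Building the Content Sentiment Vector Representation Matrix
3.4.1 Building the Domain Concept Representation Matrix
3.4.2 Building the Word-of-Mouth Rating Representation Matrix
3.5 Collocation Extraction
3.5.1 First-Stage Collocation Extraction
3.5.2 Second-Stage Collocation Extraction
3.5.3 Summary
3.6 Building Article Vectors
3.6.1 Article Vectors
3.6.2 Score Calculation
3.7 Training the Classifier
3.8 Article Classification Evaluation
Chapter 4. Experimental Results and Evaluation
4.1 Experiment Description
4.1.1 Adaptive Chinese Sentiment Corpus
4.1.2 Experimental Article Test Sets
4.1.3 Classifier
4.2 Experimental Results
4.2.1 Movies
4.2.2 Food
4.2.3 Cosmetics
4.3 Analysis of Experimental Results
Chapter 5. Conclusions and Future Work
5.1 Conclusions
5.2 Future Research Directions
References
Appendix 1. Academia Sinica Balanced Corpus Part-of-Speech Tag Set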


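List of Figures
Figure 1-1 Research process flowchart
Figure 2-1 Distance measurement between the adjectives "good" and "bad"
Figure 2-2 Cosmetics: association rule filtering
Figure 2-3 Cosmetics: correlation coefficient
Figure 3-1 Research framework
Figure 3-2 Data collection module
Figure 3-3 Article preprocessing
Figure 3-4 Word-of-mouth rating representation matrix (occurrence counts)
Figure 3-5 Rating representation matrix after article-probability processing (article counts)
Figure 3-6 Collocation concept
Figure 3-7 Collocation extraction
Figure 3-8 Structure of feature words and opinion words
Figure 3-9 Collocation noise
Figure 3-10 Experimental combinations
Figure 3-11 Building the sentiment space model
Figure 3-12 Building significance-vector article vectors
Figure 3-13 Article probability
Figure 3-14 Correlation coefficient
Figure 4-1 Comparison chart 1 of experimental groups on test set 1 across the three domains (SSM)
Figure 4-2 Comparison chart 2 of experimental groups on each test set in the movie domain (SSM)
Figure 4-3 Comparison chart 3 of experimental groups on each test set in the food domain (SSM)
Figure 4-4 Comparison chart 4 of experimental groups on each test set in the cosmetics domain (SSM)
Figure 4-5 Movies: analysis chart for groups 1 to 3
Figure 4-6 Food: analysis chart for groups 1 to 3
Figure 4-7 Cosmetics: analysis chart for groups 1 to 3
Figure 4-8 Movies: analysis chart for groups 1, 4, and 5
Figure 4-9 Food: analysis chart for groups 1, 4, and 5
Figure 4-10 Cosmetics: analysis chart for groups 1, 4, and 5
Figure 4-11 Movies: analysis chart for groups 1, 6, and 7
Figure 4-12 Food: analysis chart for groups 1, 6, and 7
Figure 4-13 Cosmetics: analysis chart for groups 1, 6, and 7
Figure 4-14 Comparison chart 5 of experimental groups on test set 1 across the three domains (significance vector)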

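List of Tables
Table 2-1 Overview of Chinese corpus usage
Table 3-1 Rating datasets by domain
Table 3-2 Category datasets by domain
Table 3-3 Domain concept representation matrix (occurrence counts)
Table 3-4 Domain concept representation matrix NVI
Table 3-5 Word-of-mouth rating representation matrix (occurrence counts / article counts)
Table 4-1 Word-of-mouth rating corpus datasets
Table 4-2 Domain concept corpus datasets
Table 4-3 Test sets by domain
Table 4-4 SVM parameter settings
Table 4-5 Numerical information for experimental groups 1 to 13
Table 4-6 Movies: classification accuracy, experimental group 1
Table 4-7 Movies: association rule threshold settings
Table 4-8 Movies: classification accuracy, experimental group 2
Table 4-9 Movies: classification accuracy, experimental group 3
Table 4-10 Movies: mutual information threshold settings
Table 4-11 Movies: classification accuracy, experimental group 4
Table 4-12 Movies: classification accuracy, experimental group 5
Table 4-13 Movies: classification accuracy, experimental groups 6 and 7
Table 4-14 Movies: classification accuracy, experimental group 8
Table 4-15 Movies: classification accuracy, experimental group 9
Table 4-16 Movies: classification accuracy, experimental group 10
Table 4-17 Movies: classification accuracy, experimental group 11
Table 4-18 Movies: classification accuracy, experimental groups 12 and 13
Table 4-19 Food: classification accuracy, experimental group 1
Table 4-20 Food: association rule threshold settings
Table 4-21 Food: classification accuracy, experimental group 2
Table 4-22 Food: classification accuracy, experimental group 3
Table 4-23 Food: mutual information threshold settings
Table 4-24 Food: classification accuracy, experimental group 4
Table 4-25 Food: classification accuracy, experimental group 5
Table 4-26 Food: classification accuracy, experimental groups 6 and 7
Table 4-27 Food: classification accuracy, experimental group 8
Table 4-28 Food: classification accuracy, experimental group 9
Table 4-29 Food: classification accuracy, experimental group 10
Table 4-30 Food: classification accuracy, experimental group 11
Table 4-31 Food: classification accuracy, experimental groups 12 and 13
Table 4-32 Cosmetics: classification accuracy, experimental group 1
Table 4-33 Cosmetics: association rule threshold settings
Table 4-34 Cosmetics: classification accuracy, experimental group 2
Table 4-35 Cosmetics: classification accuracy, experimental group 3
Table 4-36 Cosmetics: mutual information threshold settings
Table 4-37 Cosmetics: classification accuracy, experimental group 4
Table 4-38 Cosmetics: classification accuracy, experimental group 5
Table 4-39 Cosmetics: classification accuracy, experimental groups 6 and 7
Table 4-40 Cosmetics: classification accuracy, experimental group 8
Table 4-41 Cosmetics: classification accuracy, experimental group 9
Table 4-42 Cosmetics: classification accuracy, experimental group 10
Table 4-43 Cosmetics: classification accuracy, experimental group 11
Table 4-44 Cosmetics: classification accuracy, experimental groups 12 and 13
Table 4-45 Number of words per part of speech in the corpora of the three domains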


References
English-language references:
Altuntas, S., Dereli, T., & Kusiak, A. (2015). Analysis of patent documents with weighted association rules. Technological Forecasting and Social Change, 92, 249-262.
Baccianella, S., Esuli, A., & Sebastiani, F. (2010). SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In International Conference on Language Resources and Evaluation, 2200-2204.
Bai, Y., Guo, L., Liu, L., Cai, D., & Zhou, B. (2008). KECIR Question Answering System at NTCIR7 CCLQA. In The proceedings of the 7th NTCIR workshop meeting, 54-59.
Benamara, F., Cesarano, C., Picariello, A., Reforgiato, D. and Subrahmanian, V. S. (2007). Sentiment Analysis: Adjectives and Adverbs are better than Adjectives Alone. Proceedings of the International Conference on Weblogs and Social Media (ICWSM).
Cambria, E., Olsher, D., & Rajagopal, D. (2010). SenticNet3: a common and common-sense knowledge base for cognition-driven sentiment analysis. In Eighth AAAI Conference on Artificial Intelligence, 1515-1521.
Chen, W. T., Lin, S. C., Huang, S. L., Chung, Y. S., & Chen, K. J. (2010, August). E-HowNet and automatic construction of a lexical ontology. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, 45-48.
Cen, S., Mao, Y., Li, R. & Wang, X. (2008). Credit Distribution: A Graph-Based Approach to Extract Product Description Words. International Symposium on Knowledge Acquisition and Modeling, IEEE, 398-402.
Dave, K., Lawrence, S., & Pennock, D. (2003). Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. Proceedings of the 12th International Conference on World Wide Web, ACM, pp. 519-528.
Denecke, K. (2008, April). Using sentiwordnet for multilingual sentiment analysis. IEEE 24th International Conference on Data Engineering Workshop 2008. 507-512.
Dragut, E.C., Yu, C., & Meng, W. (2010). Construction of a sentimental word dictionary. Proceedings of CIKM, 1761-1764.
Duan H, Bao S, & Yu. (2007). CCRM: An Effective Algorithm for Mining Commodity Information from Threaded Chinese Customer Review. Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data, 473-480.
Esuli, A., & Sebastiani, F. (2006). SentiWordNet: a publicly available lexical resource for opinion mining. Proceedings of LREC, 417-422.
Feldman, R., W. Klosgen, B. Y. Yaniv, G. Kedar and V. Reznikov (1997) Pattern based browsing in document collections. Proceedings of First European Symposium on Principles of Data Mining and Knowledge Discovery, London, UK.
Gilly, M. C., Graham, J. L., Wolfinbarger, M. F., & Yale, L. J. (1998). A dyadic study of interpersonal information search. Journal of Academy of Marketing Science, 26(2), 83-100.
Haddad, H., Chevallet, J., and Bruandet, M. (2000). Relations between terms discovered by association rules.
Hatzivassiloglou, V. and McKeown, K. R. (1997). Predicting the Semantic Orientation of Adjectives. Proceedings of the 8th conference on European chapter of the Association for Computational Linguistics.
Herr, P. M., Kardes, F. R., & Kim, J. (1991). Effects of word-of-mouth and product-attribute information on persuasion: An accessibility-diagnosticity perspective. Journal of Consumer Research, 17(3), 454-462.
Hu, M., & Liu, B. (2004a). Mining and summarizing customer reviews. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 168-177.
Hu, M., & Liu, B. (2004b). Mining opinion features in customer reviews. Proceedings of 9th National Conference on Artificial Intelligence, 755-760.
Hung, C. (2008). A personalized word of mouth recommender model. Webology, vol. 5, no. 3. http://www.webology.ir/2008/v5n3/toc.html
Ma, Wei-Yun and Chen, Keh-Jiann (2003). Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff. Proceedings of ACL, Second SIGHAN Workshop on Chinese Language Processing, 168-171.
Kamps, J., Marx, M., Mokken, R.J., & De Rijke, M. (2004). Using WordNet to measure semantic orientation of adjectives. Proceedings of LREC, 1115-1118.
Luhn, H. P. (1957). A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM Journal of Research and Development, 1(4), 309-317.
Mahmood, S., Shahbaz, M., & Guergachi, A. (2014). Negative and Positive Association Rules Mining from Text Using Frequent and Infrequent Itemsets. The Scientific World Journal, 2014.
Miao, Q., Li, Q., and Dai, R. (2009). AMAZING: A Sentiment Mining and Retrieval System. Expert Systems with Applications: An International Journal, 7192-7198.
Miller, G.A. (1985). WordNet: a dictionary browser. Proceedings of the First International Conference on Information in Data, 25-28.
Ohana, B., & Tierney, B. (2009). Sentiment classification of reviews using SentiWordNet. Proceedings of the 9th IT & T Conference, pp. 13.
Pak, A., & Paroubek, P. (2010, May). Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In LREC, 1320-1326.
Pak, C.W., Whitney, P., Thomas, J. (1999). Visualizing association rules for text mining. IEEE Symposium on Information Visualization, 120-123.
Poria, S., Gelbukh, A., Cambria, E., Yang, P., Hussain, A., & Durrani, T. (2012, October). Merging SenticNet and WordNet-Affect emotion lists for sentiment analysis. IEEE 11th International Conference on Signal Processing, 2, 1251-1255.
Reed, J. W., Jiao, Y., Potok, T. E., Klump, B., Elmore, M. T., & Hurson, A. R. (2006, December). TF-ICF: A new term weighting scheme for clustering dynamic data streams. In Machine Learning and Applications, 2006. ICMLA'06. 5th International Conference on (pp. 258-263). IEEE.
Rodionov, S. (2015). A sequential method of detecting abrupt changes in the correlation coefficient and its application to Bering Sea climate. arXiv preprint arXiv:1504.07536.
Sample, C. D. F. A. S. (1921). correlation coefficients covering the cases (i)“The frequency dis-tribution of the values of the correlation coefficient in samples from an indefinitely large population,” Biometrika, Vol. 10, pp. 507v521, 1915. Here the method of defining the sample by the coordinates of.
Scaffidi, C., Bierhoff, K., Chang, E., Felker, M., Ng, H., and Jin, C. (2007). Red Opal: Product-Feature Scoring from Reviews. Proceedings of the 8th Annual Conference on Electronic Commerce, 182-191.
Shih, Y. Y., Huang, S. L., & Chen, K. J. (2006). Semantic representation and composition for unknown compounds in E-HowNet. In Proceedings of PACLIC, 20, 378-382.
Singh, L., B. Chen, R. Haight and P. Scheuermann (1999) An algorithm for constrained association rule mining in semi-structured data. Proceedings of the Third Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining, London, UK.
Stone, P.J., Dunphy, D.C., Smith, M.S., & Oglivie, D.M. (1996). The General Enquirer: A Computer Approach to Content Analysis. Cambridge MA: MIT Press.
Strapparava, C., & Valitutti, A. (2004). WordNet-Affect: an affective extension of WordNet. Proceedings of LREC, 1083-1086.
Tang, H., Tan, S., & Cheng, X. (2009). A survey on sentiment detection of reviews. Expert Systems with Applications, 36(7), 10760-10773.
Wang, G. and Araki, K. (2008a). A Graphic Reputation Analysis System for Mining Japanese Weblog Based on both Unstructured and Structured Information. Proceedings of the 22nd International Conference on Advanced Information Networking and Applications, IEEE Computer Society, 1240-1245.
Wei, W., Liu, H., He, J., Yang, H., and Du, X. (2008). Extracting Feature and Opinion Words Effectively from Chinese Product Reviews. Proceedings of the 5th International Conference on Fuzzy Systems and Knowledge Discovery, 170-174.
Westbrook, R. A. (1987). Product/consumption-based affective responses and postpurchase processes. Journal of Marketing Research, 24(3), 258-270.
Wermter S., Panchev C. & Arevian G. (1999). Hybrid neural plausibility networks for news agents. Proceedings of the National Conference on Artificial Intelligence AAAI, 93-98.
Qiu, G., Liu, K., Bu, J., Chen, C., & Kang, Z. (2007). Extracting Opinion Topics for Chinese Opinions using Dependence Grammar. Proceedings of the 1st International Workshop on Data Mining and Audience Intelligence for Advertising (ADKDD), 40-44.
Yu, S., Zhou, W., Jia, W., Guo, S., Xiang, Y., & Tang, F. (2012). Discriminating DDoS attacks from flash crowds using flow correlation coefficient. Parallel and Distributed Systems, IEEE Transactions on, 23(6), 1073-1080.
Yuan, B., Liu, Y., Li, H., PHAN, T. T. T., Kausar, G., Sing-Bik, C. N., & Wahi, W. (2013). Sentiment Classification in Chinese Microblogs: Lexicon-based and Learning-based Approaches. International Proceedings of Economics Development and Research (IPEDR), 68, 1-5.
Zebende, G. F. (2011). DCCA cross-correlation coefficient: quantifying level of cross-correlation. Physica A: Statistical Mechanics and its Applications, 390(4), 614-618.
Zhang, Z., Li, Y., Ye, Q., and Law, R. (2008). Sentiment Classification for Chinese Product Reviews Using an Unsupervised Internet-based Method. Proceedings of the 15th Annual Conference on International Conference on Management Science and Engineering, 3-9.
Zhang, G., Zhang, W., Bai, Y., Kang, S., & Wang, P. (2010). An Open-domain Question Answering System for NTCIR-8 CC Task. In Proceedings of NTCIR-8 Workshop Meeting, 25-30.
Chinese-language references:
羅佳玲(2009)。同步式關鍵字萃取方法應用於美妝評論。元智大學資訊管理學系學位論文,1-45。
簡之文(2012)。部落格文章情感分析之研究。淡江大學資訊管理學系學位論文,1-59。
洪智力(2013)。運用適應性內容情感語料庫改善口碑品質分類。國科會專題研究計畫,NSC 102-2410-H-033-033-MY2。
戚玉樑與蔡明宏(2007)。以文件為對象的概念萃取程序建立知識本體的雛型架構。資訊管理學報,14(3),47-66。
周立柱、贺宇凯、王建勇(2008)。 情感分析研究综述。計算機應用,28(11),2725-2728。
周濟群、戚玉樑、曾建勛(2012)。以詞彙表為基礎的知識本體雛型建構研究──以「公司治理」領域知識為例。圖書資訊學研究,6(2),37-81。
顏國偉與譚慧敏(1999)。基於知網的常識知識標注。中文計算語言學期刊,4(2),39-85。
陳鳳儀、蔡碧芳、陳克健、黃居仁(1999)。中文句結構樹資料庫的構建。中文計算語言學期刊,4(2),87-104。
孫瑛澤、陳建良、劉峻杰、劉昭麟、蘇豐文(2010)。中文短句之情緒分類。自然語言與語音處理研討會,184-198。
李政儒、游基鑫、陳信希(2012)。廣義知網詞彙意見極性的預測。中文計算語言學期刊,17(2),21-36。
呂菁菁、呂俊宏、呂明蓁、胡志偉、許聞廉(2010)。從詞類標記探討唐詩詩句中語詞的詞類分佈及語法對稱情形。文學與資訊學術研討會論文集,5,139-161。
趙杰與李衛華(2014)。基於知網的矛盾問题語意二異性研究。廣東工業大學學報,31(2),21-26。
Online resources:
Ethnologue (2005). 15th edition. http://www.ethnologue.com/

