
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 施文烽
Author (English): SHIH, WEN-FENG
Title: 聊天機器人對話中的情緒表現之評估方法初探
Title (English): A Study on the Evaluation Method of the Emotional Expression in Chatbot Dialogue
Advisor: 吳世弘
Advisor (English): WU, SHIH-HUNG
Committee Members: 盧文祥、蔡宗翰
Committee Members (English): LU, WEN-HSIANG; TSAI, TZONG-HAN
Oral Defense Date: 2018-07-12
Degree: Master
Institution: 朝陽科技大學 (Chaoyang University of Technology)
Department: 資訊工程系 (Department of Computer Science and Information Engineering)
Discipline: Engineering
Academic Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2018
Graduation Academic Year: 106
Language: Chinese
Number of Pages: 45
Keywords (Chinese): 對話系統、評估方法、情緒表現、NTCIR、聊天機器人
Keywords (English): Dialogue system; Evaluation method; Emotional expression; NTCIR; Chatbot
Record statistics:
  • Cited by: 1
  • Views: 1065
  • Downloads: 314
  • Bookmarked: 2
With the advance and spread of technology, instant-messaging software has changed the way people communicate, and chatbots keep evolving alongside it. Most current chatbot research is technology-oriented; studies on evaluation are comparatively rare. How to assess, along different aspects, the quality of a chatbot's replies is therefore an important issue.
There are many evaluation methods, broadly divided into automatic scoring and human scoring. Automatic scoring can grade a dialogue system quickly, but in many cases its scores suffer because of the data it depends on, whereas human scoring avoids this problem; incorporating human scoring therefore makes the evaluation more reliable. This study adopts human scoring to evaluate chatbot dialogue systems.
In the Short Text Conversation (STC) task at NTCIR, we observed that the official evaluation rules score replies in a generic way and thus cannot measure how well a reply performs along individual aspects. Designing an evaluation that can effectively judge a dialogue system is the core of this study. We devised a new questionnaire-design procedure, analyzed the official evaluation method, identified multi-aspect factors in the official data, and added different multi-aspect evaluation questions for different question-answer types. Inter-rater agreement scores are used to judge whether the designed evaluation questions are sound.
This study focuses on emotion-venting dialogues, which we filtered from the official data for analysis. We found that replies to such dialogues cover several aspects, including sympathy, comfort, and ridicule, so reply evaluation should take these aspects into account. We combined the official evaluation questions with these aspects into an evaluation questionnaire and surveyed whether each aspect question readily reaches a consistent opinion. The survey results show that our evaluation questions obtain higher agreement; hence the proposed questionnaire-design procedure can produce a questionnaire that evaluates a dialogue system along multiple aspects, and the procedure also applies to other corpora.
With the advancement and popularization of technology, the way people communicate is shaped by instant messaging, and chatbot technology is rising with it. Most chatbot studies focus on the engineering aspect, that is, how to build a better chatbot, while research on evaluation is scarce. Yet it is equally important to evaluate the degree to which the responses a chatbot produces succeed along different aspects.
There are two major evaluation approaches: automatic metrics and human judgment. Although automatic metrics can score a dialogue system quickly, in many cases they correlate poorly with human judgment; human judgment avoids this problem. Evaluation therefore yields more reliable results when human judgment is included, and we adopt human judgment in this research.
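The weak link between automatic metrics and human judgment noted above (cf. [4], [27]) is usually quantified with a correlation coefficient. As a minimal sketch, with entirely hypothetical score lists standing in for a metric's output and mean human ratings, the Pearson correlation can be computed as:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for five replies: an automatic metric vs. mean human rating.
auto_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
human_scores = [2.0, 1.0, 2.0, 0.0, 1.0]
print(round(pearson(auto_scores, human_scores), 3))  # prints 0.615
```

A correlation this far below 1.0 illustrates the mismatch the abstract describes: a metric can rank replies quite differently from human raters, which is why human judgment is retained here.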
The evaluation rules of the STC task at NTCIR are designed as a generic reply-scoring scheme, so they cannot measure the degree of expression in each aspect of a response. Designing an evaluation method that can effectively judge a dialogue system is our research objective. By observing and analyzing the evaluation method of the NTCIR STC task, we designed a questionnaire-design process, identified factors for different aspects, and created aspect-specific evaluation questions for different question-answer types. We use inter-rater consistency to judge our evaluation questions.
In the experiments, we selected and analyzed the dialogue data of the emotional-expression type, in which the speaker is generally venting emotion. People may react to venting in different ways, so the responses contain aspects such as sympathy, comfort, and ridicule, and the evaluation method needs to consider these different aspects. We designed a new evaluation questionnaire covering these aspects and verified whether each aspect readily reaches consensus. The survey results show that the aspect questions do reach consensus easily. Therefore, the proposed questionnaire-design process can produce a new questionnaire that evaluates a dialogue system along multiple aspects, and the process is also applicable to other corpora.
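The inter-rater consistency used above is measured in the thesis with Fleiss' kappa [26] (and Cohen's kappa for rater pairs). A minimal sketch of Fleiss' kappa, using a hypothetical ratings table in which every item is rated by the same number of raters:

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table where table[i][j] is the number of raters
    who assigned item i to category j (same rater count for every item)."""
    n_items = len(table)
    n_raters = sum(table[0])
    n_cats = len(table[0])
    # Marginal proportion of all assignments falling in each category.
    p_j = [sum(row[j] for row in table) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement per item, averaged over items.
    p_bar = sum((sum(c * c for c in row) - n_raters) /
                (n_raters * (n_raters - 1)) for row in table) / n_items
    # Agreement expected by chance from the marginal proportions.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 replies, 3 raters, 3 answer categories.
ratings = [
    [3, 0, 0],  # all three raters agree on category 0
    [0, 3, 0],
    [2, 1, 0],  # one rater disagrees
    [0, 0, 3],
]
print(round(fleiss_kappa(ratings), 3))  # prints 0.745
```

A kappa near 1 indicates that an evaluation question yields consistent opinions across raters, which is the criterion the questionnaire-design process uses to judge whether an aspect question is well designed.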

Table of Contents
Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1  Introduction
1.1 Problem: automatic vs. human evaluation
1.2 Research objective
1.3 Research method
1.4 Thesis organization
Chapter 2  Background
2.1 Human-computer interaction scenarios and applications
2.2 Chatbots
2.3 Agreement-based evaluation methods
Chapter 3  Research Framework
3.1 IR system
3.1.1 System architecture
3.1.2 Preprocessing
3.1.3 Word segmentation
3.1.4 Indexing and retrieval
3.1.5 STC evaluation metrics
3.2 Questionnaire design
3.2.1 Design process
3.2.2 Aspect analysis
3.2.3 Questionnaire construction
3.2.4 Evaluation computation
Chapter 4  Experiments and Results
4.1 Experiment 1
4.2 Experiment 2
4.2.1 STC competition data
4.2.2 Experiment 2 results
Chapter 5  Conclusion and Future Work
References
Appendix 1
Appendix 2

List of Tables
Table 1. Differences between automatic and human evaluation [4]
Table 2. STC dialogue examples
Table 3. Examples containing special characters
Table 4. Word-segmentation examples
Table 5. Example system results in the formal test phase
Table 6. Chat-type questionnaire survey results
Table 7. Filtered question-answer examples
Table 8. Types of venting question-answer pairs
Table 9. Evaluation questions in the new questionnaire
Table 10. Fleiss' kappa example table
Table 11. Cohen's kappa example table
Table 12. Interpretation of kappa value ranges [32]
Table 13. Fleiss' kappa results of Experiment 1
Table 14. Distribution of our evaluation questions (questionnaires 40 and 41)
Table 15. Distribution of the official evaluation questions (questionnaires 40 and 41)
Table 16. Cohen's kappa results (questionnaires 40 and 41)
Table 17. STC competition data [2]
Table 18. Distribution of our evaluation questions (raters 1 and 2)
Table 19. Distribution of the official evaluation questions (raters 1 and 2)
Table 20. Cohen's kappa results (raters 1 and 2)
Table 21. Distribution of our evaluation questions (raters 2 and 3)
Table 22. Distribution of the official evaluation questions (raters 2 and 3)
Table 23. Cohen's kappa results (raters 2 and 3)
Table 24. Distribution of our evaluation questions (raters 1 and 3)
Table 25. Distribution of the official evaluation questions (raters 1 and 3)
Table 26. Cohen's kappa results (raters 1 and 3)
Table 27. Fleiss' kappa results of Experiment 2
Table 28. Opinion distribution of the three raters in Experiment 2

List of Figures
Figure 1. STC evaluation rules
Figure 2. IR system architecture
Figure 3. Questionnaire design process



[1] Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao. "Overview of the NTCIR-12 Short Text Conversation Task." In Proceedings of NTCIR-12, 2016, pp. 473-484.
[2] Lifeng Shang, Tetsuya Sakai, Hang Li, Ryuichiro Higashinaka, Yusuke Miyao, Yuki Arase, and Masako Nomoto. "Overview of the NTCIR-13 Short Text Conversation Task." In Proceedings of NTCIR-13, 2017, pp. 194-210.
[3] Kishore Papineni et al. "BLEU: A Method for Automatic Evaluation of Machine Translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311-318.
[4] C.-W. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau. "How Not to Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation." arXiv preprint arXiv:1603.08023, 2016.
[5] Shih-Hung Wu, Wen-Feng Shih, Liang-Pu Chen, and PingChe Yang. "CYUT Short Text Conversation System for NTCIR-12 STC." In Proceedings of NTCIR-12, June 7-10, 2016, Tokyo, Japan, pp. 541-546.
[6] Shih-Hung Wu, Wen-Feng Shih, Che-Cheng Yu, Liang-Pu Chen, and PingChe Yang. "CYUT-III Short Text Conversation System at NTCIR-13 STC-2 Task." In Proceedings of NTCIR-13, December 5-8, 2017, Tokyo, Japan, pp. 289-294.
[7] J. van den Berg, S. Miller, D. Duckworth, H. Hu, A. Wan, X. Y. Fu, ..., and P. Abbeel. "Superhuman Performance of Surgical Tasks by Robots Using Iterative Learning from Human-Guided Demonstrations." In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2010, pp. 2074-2081.
[8] Manabu Gouko and Chyon Hae Kim. "Can Object-Exclusion Behavior of Robot Encourage Human to Tidy Up Tabletop?" In Proceedings of the IEEE International Conference on Robotics and Biomimetics (ROBIO), 2016, pp. 1838-1844.
[9] Ahmed R. Sadik, Andrei Taramov, and Bodo Urban. "Optimization of Tasks Scheduling in Cooperative Robotics Manufacturing via Johnson's Algorithm, Case-Study: One Collaborative Robot in Cooperation with Two Workers." In Proceedings of the IEEE Conference on Systems, Process and Control (ICSPC), 2017, pp. 36-41.
[10] Brian M. Scassellati. Foundations for a Theory of Mind for a Humanoid Robot. Ph.D. dissertation, Massachusetts Institute of Technology, 2001.
[11] Cynthia Breazeal. "Function Meets Style: Insights from Emotion Theory Applied to HRI." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2004, vol. 34, no. 2, pp. 187-194.
[12] Ali Sekmen and Prathima Challa. "Assessment of Adaptive Human-Robot Interactions." Knowledge-Based Systems, 2013, vol. 42, pp. 49-59.
[13] Joseph Weizenbaum. "ELIZA—A Computer Program for the Study of Natural Language Communication Between Man and Machine." Communications of the ACM, 1966, vol. 9, no. 1, pp. 36-45.
[14] Alan M. Turing. "Computing Machinery and Intelligence." In Parsing the Turing Test, Springer, Dordrecht, 2009, pp. 23-65.
[15] Richard Wallace. "The Elements of AIML Style." Alice AI Foundation, 2003.
[16] W. Zadrozny, M. Budzikowska, J. Chai, N. Kambhatla, S. Levesque, and N. Nicolov. "Natural Language Dialogue for Personalized Interaction." Communications of the ACM, 2000, vol. 43, no. 8, pp. 116-120.
[17] Ashay Argal et al. "Intelligent Travel Chatbot for Predictive Recommendation in Echo Platform." In Proceedings of the IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), 2018, pp. 176-183.
[18] H. Zhao, Y. Du, H. Li, Q. Qian, H. Zhou, M. Huang, and J. Xu. "SG01 at the NTCIR-13 STC-2 Task." In Proceedings of NTCIR-13, December 5-8, 2017, Tokyo, Japan, pp. 313-316.
[19] X. Liu, X. Wu, R. Chen, Z. Zhao, H. Lin, and K. Yu. "splab at the NTCIR-13 STC-2 Task." In Proceedings of NTCIR-13, December 5-8, 2017, Tokyo, Japan, pp. 282-288.
[20] L. Yihan, J. Shanshan, D. Lei, T. Yixuan, and D. Bin. "SRCB at the NTCIR-13 STC-2 Task." In Proceedings of NTCIR-13, December 5-8, 2017, Tokyo, Japan, pp. 237-240.
[21] H. Nakatani, S. Nishiumi, T. Maeda, and M. Araki. "KIT Dialogue System for NTCIR-13 STC Japanese Subtask." In Proceedings of NTCIR-13, December 5-8, 2017, Tokyo, Japan, pp. 257-264.
[22] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks." In Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[23] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation." arXiv preprint arXiv:1406.1078, 2014.
[24] R. Zhang, J. Guo, Y. Fan, Y. Lan, J. Xu, and X. Cheng. "Learning to Control the Specificity in Neural Response Generation." In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1108-1117.
[25] Joseph L. Fleiss and Jacob Cohen. "The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability." Educational and Psychological Measurement, 1973, vol. 33, no. 3, pp. 613-619.
[26] Joseph L. Fleiss. "Measuring Nominal Scale Agreement Among Many Raters." Psychological Bulletin, 1971, vol. 76, no. 5, pp. 378-382.
[27] J. Benesty, J. Chen, Y. Huang, and I. Cohen. "Pearson Correlation Coefficient." In Noise Reduction in Speech Processing, Springer, Berlin, Heidelberg, 2009, pp. 1-4.
[28] Jerome L. Myers and Arnold D. Well. Research Design and Statistical Analysis, 2nd ed. Lawrence Erlbaum, 2003, p. 508.
[29] Apache Lucene. [Online]. Available: https://lucene.apache.org/ [Accessed 12 Mar. 2018].
[30] O. Chapelle, S. Ji, C. Liao, E. Velipasaoglu, L. Lai, and S.-L. Wu. "Intent-Based Diversification of Web Search Results: Metrics and Algorithms." Information Retrieval, 2011, vol. 14, no. 6, pp. 572-592.
[31] Tetsuya Sakai. "Bootstrap-Based Comparisons of IR Metrics for Finding One Relevant Document." In Proceedings of the Asia Information Retrieval Symposium, Springer, Berlin, Heidelberg, 2006, pp. 374-389.
[32] Mary L. McHugh. "Interrater Reliability: The Kappa Statistic." Biochemia Medica, 2012, vol. 22, no. 3, pp. 276-282.
[33] Rosa Falotico and Piero Quatto. "Fleiss' Kappa Statistic Without Paradoxes." Quality & Quantity, 2015, vol. 49, no. 2, pp. 463-470.
