
臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Detailed Record

Author: 李岳樵
Author (English): Yueh-Chyau Lee
Thesis title: 應用深度學習遞歸神經網路於部落格業配文辨識
Thesis title (English): Application of Recurrent Neural Network on Identification of Blogger's Sponsored Article
Advisor: 呂永和
Advisor (English): Yung-Ho Leu
Committee members: 呂永和、楊維寧、陳雲岫
Committee members (English): Yung-Ho Leu, Wei-Ning Yang, Yun-Shiow Chen
Oral defense date: 2017-07-29
Degree: Master's
Institution: 國立臺灣科技大學 (National Taiwan University of Science and Technology)
Department: 資訊管理系 (Department of Information Management)
Discipline: Computer Science
Academic field: General Computer Science
Thesis type: Academic thesis
Publication year: 2017
Graduation academic year: 105 (2016-2017)
Language: Chinese
Pages: 61
Keywords (Chinese): 深度學習、文字探勘、遞歸神經網路、長短期記憶神經網路、業配文
Keywords (English): Deep Learning, Sponsored Articles, Recurrent Neural Network, Long-Short Term Memory, Text Mining
Record statistics:
  • Cited: 0
  • Views: 1751
  • Downloads: 0
  • Bookmarked: 0
Due to the rapid development of computer networks, social media, blogs, and bulletin board systems (BBS) have become important channels for disseminating information. People publish articles online in large numbers, accumulating a considerable body of information for others to consult. In an era of free speech, posts are not strictly regulated, so fake articles can easily pass themselves off as genuine ones, and readers who believe them may be harmed. Many people share product experiences online, originally so that prospective buyers would have a reference; however, many merchants exploit this channel by paying writers to produce fake articles that favor the merchant. Such fake articles are called "sponsored articles" (業配文); their exaggerated, untruthful content deceives consumers. The main purpose of this study is to help determine whether an online article is a sponsored article, so that such articles have nowhere to hide. This study applies Chinese word segmentation to extract key terms from the training corpus, then computes each term's occurrence frequency in the training documents as article features for building a prediction model that infers the category of a test document. Classification models were built with a Naïve Bayes classifier, a recurrent neural network (RNN), and a long short-term memory network (RNN-LSTM), and 5-fold cross-validation was used to compute each model's accuracy. The results show that the LSTM model performs best at predicting whether an article is a sponsored article, with a sensitivity of 94% and precision, accuracy, and F1-measure all above 95%; the Naïve Bayes classifier in turn outperforms the RNN.
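The abstract quotes sensitivity, precision, accuracy, and the F1-measure; these follow the standard binary confusion-matrix definitions. A minimal Python sketch, with invented counts (the thesis reports only the final percentages):

```python
# Standard binary confusion-matrix metrics as named in the abstract.
# The counts below are invented for illustration only; the thesis
# reports just the resulting percentages (sensitivity 94%, the rest > 95%).
def metrics(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)           # recall: sponsored articles caught
    precision = tp / (tp + fp)             # flagged articles that are truly sponsored
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, precision, accuracy, f1

# Example: 94 of 100 sponsored articles detected, 3 normal articles flagged.
sens, prec, acc, f1 = metrics(tp=94, fp=3, fn=6, tn=97)
print(sens, round(prec, 3), acc, round(f1, 3))  # → 0.94 0.969 0.955 0.954
```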
Due to the advance of computer networks, social networks, blogs, and bulletin board systems (BBS) have become important media for information dissemination. The myriad of information posted on the web allows people to acquire information for their daily lives. However, owing to the lack of regulation of articles posted on the web, many fake articles have appeared. Among them, articles written by sponsored authors may mislead customers into buying products that do not meet their expectations. An article of this kind is called a "sponsored article". This thesis aims to detect sponsored articles in a set of blog articles. To this end, we wrote a crawler to collect 10000 articles from a dining forum on PTT, a famous BBS in Taiwan. The articles were carefully classified into two categories: 5000 normal articles and 5000 sponsored articles. To build a classifier, we first find the important terms in the dining articles using a Chinese word-segmentation tool. Then, we count the frequency of each important term in an article to transform the article into an occurrence vector over about 2000 distinct terms. Using the occurrence vectors as the features of the article set, we built three classifiers using the Naïve Bayes, recurrent neural network (RNN), and long short-term memory (RNN-LSTM) algorithms. The experimental results show that the RNN-LSTM offers the highest prediction accuracy, with an average sensitivity of 94% and both average precision and F1-measure above 95%. The results also show that the RNN without regularization is inferior to the Naïve Bayes algorithm.
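As a concrete sketch of the pipeline the abstract describes (segment, count term occurrences, classify), here is a minimal, self-contained Python version. It is not the thesis code: a toy English corpus with whitespace tokenization stands in for the 10000-article PTT corpus and the Chinese segmenter, a from-scratch multinomial Naïve Bayes stands in for the thesis's Naïve Bayes and TensorFlow models, and 5-fold cross-validation and the LSTM are omitted for brevity.

```python
# Minimal sketch of the term-frequency + Naive Bayes baseline (toy data).
from collections import Counter
import math

def featurize(tokens, vocab):
    """Map a tokenized document to a term-occurrence vector over vocab."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing, trained on count vectors."""
    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_terms = len(X[0])
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            totals = [sum(col) for col in zip(*rows)]      # per-term counts in class c
            denom = sum(totals) + n_terms                  # Laplace smoothing
            self.log_like[c] = [math.log((t + 1) / denom) for t in totals]
        return self

    def predict(self, x):
        def score(c):
            return self.log_prior[c] + sum(n * ll for n, ll in zip(x, self.log_like[c]))
        return max(self.classes, key=score)

# Toy labeled corpus: 1 = sponsored, 0 = normal (invented examples).
docs = [("best shop ever highly recommend discount", 1),
        ("amazing must buy limited discount offer", 1),
        ("visited last week food was ordinary", 0),
        ("long queue average taste fair price", 0)]
tokenized = [(text.split(), label) for text, label in docs]
vocab = sorted({t for tokens, _ in tokenized for t in tokens})
X = [featurize(tokens, vocab) for tokens, _ in tokenized]
y = [label for _, label in tokenized]

model = NaiveBayes().fit(X, y)
test = featurize("highly recommend this discount".split(), vocab)
print(model.predict(test))  # → 1 (the ad-like text is classified as sponsored)
```

In the thesis the occurrence vectors span roughly 2000 key terms; the same `featurize`-then-classify shape applies, only with a real Chinese segmenter producing the tokens.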
Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
1.1  Research Background
1.2  Research Motivation
1.3  Research Objectives
1.4  Research Contributions
1.5  Thesis Organization
Chapter 2  Literature Review
2.1  Sponsored Advertising
2.2  Text Mining
2.3  Research on Chinese Word Segmentation
2.3.1  The N-Gram Algorithm
2.3.2  The TF-IDF Algorithm
2.3.3  Cohesion
2.3.4  The Entropy Method
2.4  Classification Models
2.4.1  The Naïve Bayes Algorithm
2.4.2  Recurrent Neural Networks
2.4.3  Long Short-Term Memory
2.5  The TensorFlow Open-Source Library
Chapter 3  Research Method
3.1  Research Framework
3.2  Research Subjects and Tools
3.2.1  Research Subjects
3.2.2  Research Tools
3.3  Feature Definition
3.3.1  Term Frequency
3.3.2  Term Probability
3.4  Method Workflow
3.4.1  Block 1: Crawling Articles from the PTT Food Board
3.4.2  Block 2: Chinese Word Segmentation
3.4.3  Block 3: Building the Classification Models
3.5  Evaluation Metrics for the Classification Models
3.5.1  The Confusion Matrix
3.5.2  Evaluation Metrics
Chapter 4  Implementation and Results
4.1  Implementation Environment
4.2  Building the Key-Term Lexicon
4.3  Classification Results and Evaluation
4.4  Analysis of Misclassified Predictions
Chapter 5  Conclusions and Suggestions
5.1  Conclusions
5.2  Directions for Future Research
References