Graduate student (English name): Yueh-Chyau Lee
Thesis title (English): Application of Recurrent Neural Network on Identification of Blogger’s Sponsored Article
Advisor (English name): Yung-Ho Leu
Oral defense committee (English names): Yung-Ho Leu, Wei-Ning Yang, Yun-Shiow Chen
Keywords (English): Deep Learning, Sponsored Articles, Recurrent Neural Network, Long-Short Term Memory, Text Mining
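Owing to the rapid development of computer networks, social media, blogs, and bulletin board systems (BBS) have become important channels for disseminating information. People publish large numbers of articles online, accumulating a considerable body of information for others to consult. In an era of free speech, people can say what they like, and because speech is not strictly regulated, fake articles easily blend in among genuine ones, and readers who trust them are harmed. People write many product-experience posts online, originally to give those who have not yet used a product a point of reference, but many businesses exploit this channel and pay writers to produce fake articles that favor the business. Such fake articles are called "sponsored articles" (業配文); their content is exaggerated and untrue, and consumers are deceived by them. The main purpose of this study is to help determine whether an online article is a sponsored article, so that sponsored articles have nowhere to hide. This study uses Chinese word segmentation to extract the key terms of the training texts, then computes the frequency of each key term in the training texts as the article features of the training data set, which are used to build prediction models that infer the category of the test texts. This study uses a Naïve Bayes classifier, a recurrent neural network (RNN), and a long short-term memory network (RNN-LSTM) to build classification models, and uses 5-fold cross validation to compute the accuracy of each model. The results show that the long short-term memory classification model performs very well at predicting whether an article is a sponsored article: its sensitivity is 94%, and its precision, accuracy, and F1-measure are all above 95%, while the Naïve Bayes classifier outperforms the RNN.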
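As an illustration of the four measures reported in this abstract, the sketch below computes sensitivity, precision, accuracy, and F1-measure from the cells of a binary confusion matrix using their standard definitions. The function name and the counts are hypothetical and chosen only for the example; they are not the thesis's code or its experimental results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts.

    tp/fn: sponsored articles predicted as sponsored / as normal.
    tn/fp: normal articles predicted as normal / as sponsored.
    """
    sensitivity = tp / (tp + fn)            # also called recall
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and sensitivity
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Made-up counts for illustration only (not results from the thesis)
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under 5-fold cross validation, one would typically compute these measures on each of the five held-out folds and report their averages.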
Due to advances in networking, social networks, blogs, and bulletin board systems (BBS) have become important media for information dissemination. The vast amount of information posted on the web and on blogs lets people find information for their daily lives. However, because posting articles on the web is largely unregulated, many fake articles have been posted. Among them, articles written by sponsored authors may mislead customers into buying products that do not meet their expectations. An article of this kind is called a "sponsored article". This thesis aims to detect sponsored articles in a set of blog articles. To this end, we wrote a crawler to collect 10,000 articles from a dining forum on PTT, a well-known BBS in Taiwan. The articles were carefully classified into two categories: 5,000 normal articles and 5,000 sponsored articles. To build a classifier, we first identify the important terms in the dining articles using a Chinese word segmentation tool. We then count the frequency of each important term in an article to transform the article into an occurrence vector over about 2,000 different terms. Using the occurrence vectors as the features of the article set, we built three classifiers using the Naïve Bayes, recurrent neural network (RNN), and long short-term memory (RNN-LSTM) algorithms. The experimental results showed that the RNN-LSTM offers the highest prediction accuracy, with an average sensitivity of 94% and both average precision and F1-measure above 95%. The results also showed that the RNN without regularization is inferior to the Naïve Bayes algorithm.
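The transformation of an article into an occurrence vector, as described in the abstract, can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes the article has already been segmented into tokens, and the function name, the four-term vocabulary, and the sample tokens are all hypothetical (the thesis reports a vocabulary of about 2,000 terms).

```python
from collections import Counter


def occurrence_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in a segmented article.

    tokens: list of words produced by a word segmentation step.
    vocabulary: ordered list of key terms; position i of the result
    holds the count of vocabulary[i]. Tokens outside the vocabulary
    are ignored.
    """
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Hypothetical key terms: "tasty", "recommend", "invitation", "free tasting"
vocabulary = ["好吃", "推薦", "邀約", "試吃"]
# Hypothetical segmented article
tokens = ["這家", "好吃", "推薦", "好吃", "試吃"]

vector = occurrence_vector(tokens, vocabulary)
print(vector)  # [2, 1, 0, 1]
```

Each article then becomes a fixed-length numeric vector, which is the form a Naïve Bayes or neural network classifier expects as input.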
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Research Contributions
1.5 Thesis Organization
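Chapter 2 Literature Review
2.1 Sponsored Advertising
2.2 Text Mining
2.3 Research on Chinese Word Segmentation
2.3.1 N-Gram Algorithm
2.3.2 TF-IDF Algorithm
2.3.3 Degree of Cohesion
2.3.4 Entropy Method
2.4 Classification Models
2.4.1 Naïve Bayes Algorithm
2.4.2 Recurrent Neural Network
2.4.3 Long Short-Term Memory
2.5 TensorFlow Open-Source Library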
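Chapter 3 Research Method
3.1 Research Framework
3.2 Research Subjects and Tools
3.2.1 Research Subjects
3.2.2 Research Tools
3.3 Feature Definition
3.3.1 Term Frequency
3.3.2 Term Probability
3.4 Research Method Workflow
3.4.1 Block 1: Crawling Articles from the PTT Food Board
3.4.2 Block 2: Chinese Word Segmentation
3.4.3 Block 3: Building the Classification Models
3.5 Evaluation Metrics for Classification Models
3.5.1 Confusion Matrix
3.5.2 Evaluation Metrics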
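Chapter 4 Implementation and Results
4.1 Implementation Environment
4.2 Building the Key-Term Lexicon
4.3 Classification Model Results and Evaluation
4.4 Analysis of Misclassified Predictions
Chapter 5 Conclusions and Suggestions
5.1 Conclusions
5.2 Future Research Directions
References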
