National Digital Library of Theses and Dissertations in Taiwan

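Graduate student: 吳誌航
Graduate student (English): Chih-Hang Wu
Thesis title: 從二階段分群萃取輿情事件
Thesis title (English): Extracting the Opinion Events from Two-Stage Clustering
Advisor: 洪智力
Advisor (English): Chihli Hung
Degree: Master's
Institution: Chung Yuan Christian University (中原大學)
Department: Graduate Institute of Information Management
Discipline: Computer Science
Field: General Computer Science
Thesis type: Academic thesis
Publication year: 2016
Graduation academic year: 104 (ROC calendar)
Language: Chinese
Pages: 57
Keywords (Chinese): 主題偵測與追蹤; 自我組織類神經網路; K平均法; 輿情分析; 隱含狄利克雷分配模型
Keywords (English): Topic detection and tracking; Self-organizing map; K-means; Opinion analysis; Latent Dirichlet allocation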
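Online news is a common source from which people gather and receive information. By reading online news, many people learn about current social issues and events, then leave word-of-mouth comments and thoughts that form online public opinion. Online news is both timely and continuous: to understand an issue or event in full, a reader must look back through a large volume of earlier news and also keep tracking how the event develops. Collecting, summarizing, and analyzing the public issues that circulate widely on the Internet is called public opinion mining. The technique used for public opinion mining in the literature is topic detection and tracking (TDT), which automatically identifies and analyzes possible topics in information on the Internet. The clustering models used in TDT are unsupervised learning methods. A common one is K-means, whose main advantages are that it is easy to understand and apply and that it separates clusters clearly; however, when clustering large-scale data it has difficulty handling overlapping data. Another common method is the self-organizing map (SOM), which quickly reveals how clusters are distributed in TDT, but its graphical output and its inability to delimit clusters automatically make topic events hard to extract. Finally, latent Dirichlet allocation (LDA) is a method commonly used to extract topic-event keywords; it finds latent semantics in documents and extracts representative topic words. This study combines these methods: SOM produces the initial keyword clusters, K-means then produces the final keyword clusters, and finally each cluster is treated as a bag of words from which LDA extracts keywords. The experimental results indicate that the proposed method, by using two-stage clustering to reduce the first-stage output and extract the important opinion keywords, outperforms traditional single-stage clustering on both the macro-average and micro-average measures, whereas clustering with SOM alone presents the topic keywords better.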
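
The two clustering stages described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the thesis's implementation: the toy 2-D vectors, the 3x3 map size, the decay schedules, and the farthest-first K-means initialisation are all hypothetical choices made here. The sketch also clusters the SOM prototype vectors directly, which is a simplification, since the record says the thesis builds a VSM over initial keyword clusters before K-means, and it leaves out the LDA keyword-extraction step entirely.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vectors, x):
    """Index of the vector in `vectors` that is closest to x."""
    return min(range(len(vectors)), key=lambda i: sq_dist(vectors[i], x))


def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, seed=0):
    """Stage 1: fit a rows x cols SOM and return its prototype vectors."""
    rng = random.Random(seed)
    # Initialise every unit from a randomly chosen training vector.
    units = [list(rng.choice(data)) for _ in range(rows * cols)]
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)    # shrinking neighbourhood
        for x in rng.sample(data, len(data)):
            bmu = nearest(units, x)                  # best-matching unit
            for i, w in enumerate(units):
                # Gaussian neighbourhood measured on the map grid.
                h = math.exp(-sq_dist(grid[i], grid[bmu]) / (2.0 * radius ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
    return units


def kmeans(points, k, iters=20):
    """Stage 2: Lloyd's K-means; returns one cluster label per point."""
    # Farthest-first initialisation (an assumption, chosen for determinism).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [nearest(centroids, p) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def two_stage(data, k=2):
    """Label each input with the K-means cluster of its best-matching SOM unit."""
    units = train_som(data)
    unit_labels = kmeans(units, k)
    return [unit_labels[nearest(units, x)] for x in data]


# Two made-up, well-separated groups of toy vectors.
group_a = [(0.0, 0.2), (0.3, 0.0), (0.1, 0.4), (0.5, 0.3)]
group_b = [(9.8, 10.1), (10.2, 9.9), (10.0, 10.4), (10.5, 10.0)]
labels = two_stage(group_a + group_b)
print(labels)
```

The point of the second stage is that K-means runs on the handful of SOM units instead of on every input, so the first stage's output is reduced before the final clusters are formed.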

Internet news is a source from which people collect and receive information. People read Internet news to learn about social events and leave word-of-mouth comments, which become public opinion. Internet news is broadcast continuously. To get the full picture of a topic event, a reader must review a large amount of earlier news and keep tracking how the event develops. The process of gathering, extracting, summarizing, and analyzing popular news events on the Internet is the task of public opinion mining. Traditional opinion mining usually uses topic detection and tracking (TDT) as its main method, which automatically identifies and analyzes possible topics in the information. TDT usually uses clustering-based methods, which are unsupervised. The most common model is K-means, which is easy to use and clusters information efficiently. However, it has difficulty handling overlapping data when the data set is large. Another method is the self-organizing map (SOM), which reveals clusters quickly, but its graphical results and its inability to partition clusters automatically make topic events harder to extract. The last method is latent Dirichlet allocation (LDA), which finds latent semantics in documents and extracts topic keywords. This thesis proposes a two-stage clustering approach that combines these methods. The first step produces the initial keyword clusters with SOM. K-means then produces the final keyword clusters. Finally, each cluster is treated as a bag of words and the final keywords are extracted by LDA. According to the experiments, the two-stage model performs better than traditional one-stage clustering on both the Macro-average and Micro-average criteria, but one-stage clustering with SOM alone is better at presenting news events visually.
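
The difference between the two evaluation criteria named above can be shown with a small calculation. The record does not say which underlying score the thesis averages, so the sketch below assumes precision, and the per-cluster counts are made up for illustration. Macro-averaging computes the score per cluster and then takes the unweighted mean, while micro-averaging pools the counts over all clusters first.

```python
def macro_micro_precision(per_cluster):
    """Return (macro, micro) precision.

    per_cluster is a list of (true_positive, false_positive) counts,
    one pair per cluster. Precision as the averaged score is an
    assumption made for this example.
    """
    # Macro: average the per-cluster precisions, each cluster counting equally.
    precisions = [tp / (tp + fp) for tp, fp in per_cluster if tp + fp > 0]
    macro = sum(precisions) / len(precisions)
    # Micro: pool the counts across clusters, then compute one precision.
    total_tp = sum(tp for tp, _ in per_cluster)
    total_fp = sum(fp for _, fp in per_cluster)
    micro = total_tp / (total_tp + total_fp)
    return macro, micro


# Hypothetical counts: one large, accurate cluster and one small, noisy one.
counts = [(9, 1), (1, 1)]
macro, micro = macro_micro_precision(counts)
print(round(macro, 4), round(micro, 4))  # 0.7 0.8333
```

With these counts the small cluster pulls the macro-average down to 0.7, while the micro-average stays at 10/12 because the large cluster dominates the pooled counts.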

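Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
1.1 Research Background and Motivation
1.2 Research Problem
1.3 Research Objectives
1.4 Research Contributions
1.5 Research Process
Chapter 2. Literature Review
2.1 Topic Detection and Tracking
2.2 Two-Stage Clustering
2.3 Vector Space Model
2.4 TF-IDF
2.5 Comparison of Clustering Methods
Chapter 3. Research Method
3.1 Research Framework
3.2 Text Preprocessing Module
3.2.1 Stop-Word Removal
3.2.2 Data Set VSM
3.3 First-Stage Clustering Module
3.3.1 SOM Algorithm
3.3.2 Initial Keyword Clusters
3.4 Second-Stage Clustering Module
3.4.1 Initial Keyword Cluster VSM
3.4.2 K-means Algorithm
3.5 LDA
3.6 Evaluation Module
Chapter 4. Experimental Results and Evaluation
4.1 Experimental Setup
4.1.1 Experimental Data Set
4.1.2 Research Tools
4.2 Experimental Results
4.3 Experimental Evaluation
4.4 Sensitivity Analysis
Chapter 5. Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References
Appendix 1. Stop-Word List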

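Figure 1.1 Research process flow diagram
Figure 2.1 Topic Detection and Tracking architecture
Figure 2.2 Operation of two-stage clustering with SOM and K-means
Figure 2.3 Conceptual diagram of the vector space model
Figure 3.1 Research framework
Figure 3.2 Schematic of the self-organizing map
Figure 3.3 Schematic of the output neurons for initial keywords
Figure 3.4 Steps of the K-means algorithm
Figure 3.5 LDA model diagram
Figure 3.6 News article title and content
Figure 4.1 Data set contents
Figure 4.2 SOM evaluation method
Figure 4.3 Two-stage evaluation method
Figure 4.4 Results for each SOM output size with 10 keywords and K-means = 5
Figure 4.5 Results for each SOM output size with 10 keywords and K-means = 10
Figure 4.6 Results for each SOM output size with 5 keywords and K-means = 5
Figure 4.7 Results for each SOM output size with 5 keywords and K-means = 10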

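Table 3.1 Data set VSM
Table 3.2 SOM input vectors
Table 3.3 SOM output vectors
Table 3.4 Initial keyword cluster VSM
Table 3.5 SOM output results
Table 3.6 K-means output results
Table 3.7 Representation of the final keyword clusters
Table 3.8 Bag-of-words representation
Table 3.9 Representation of the final representative keywords extracted by LDA
Table 3.10 Macro-Average and Micro-Average calculation of experimental results and evaluation
Table 4.1 Data set attributes
Table 4.2 SOM parameter settings
Table 4.3 K-means parameter settings
Table 4.4 First-stage SOM results (2015/05/17)
Table 4.5 Attributes of the initial keyword cluster VSM (2015/05/17)
Table 4.6 K-means results (2015/05/17)
Table 4.7 LDA results (2015/05/17)
Table 4.8 Final experimental results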





