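Author: 甘子典
Author (English): Zih-Dian Gan
Thesis title: 一個文件相似度測量方法及其應用
Thesis title (English): A Document Similarity Measure and Its Applications
Advisor: 李錫智
Advisor (English): Shie-Jue Lee
Degree: Master's
Institution: National Sun Yat-sen University
Department: Department of Electrical Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2011
Graduation academic year: 99 (ROC calendar)
Language: Chinese
Pages: 50
Keywords (translated from Chinese): similarity measure; document similarity; multi-label; single-label; accuracy; text classification; entropy; document clustering
Keywords (English): k-means; document similarity; similarity measure; BEP; F1; single-label; multi-label; accuracy; text classification; entropy; document clustering; k-NN; ML-KNN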
In this thesis, we propose a novel document similarity measure and apply it to text classification and clustering. When measuring the similarity between two documents, the measure distinguishes three cases for each term feature: (a) the feature appears in both documents, (b) the feature appears in only one document, and (c) the feature appears in neither document. In case (a), the similarity contribution has a fixed lower bound and decreases as the difference between the two feature values grows. In case (b), the contribution is a fixed negative value, regardless of the magnitude of the feature value. In case (c), the feature is ignored and has no effect on the similarity. We apply the measure to the similarity-based single-label classifier k-NN and the multi-label classifier ML-KNN. We also extend it to measure the similarity between a document and a set of documents, and use this extension for document clustering with a k-means-like algorithm. Experimental results show that the proposed measure performs better than the other measures compared.
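The three cases described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's actual formula, which this record does not give. The Gaussian-shaped decay, the lower bound of 0.5, the parameters `lam` and `sigma`, the rescaling into [0, 1], and the function names `feature_score`, `similarity`, and `knn_predict` are all assumptions made for illustration. Feature values are assumed to be non-negative term weights, with 0 meaning the term is absent.

```python
import math
from collections import Counter


def feature_score(x, y, lam=1.0, sigma=1.0):
    """Score one feature of two documents; returns (score, counted).

    The decay shape, the 0.5 lower bound, lam, and sigma are assumptions.
    """
    if x > 0 and y > 0:
        # Case (a): present in both. Score lies in (0.5, 1] and shrinks
        # as the two feature values move apart.
        return 0.5 * (1.0 + math.exp(-((x - y) / sigma) ** 2)), True
    if x > 0 or y > 0:
        # Case (b): present in only one. Fixed negative value,
        # independent of the feature's magnitude.
        return -lam, True
    # Case (c): absent from both. Ignored entirely.
    return 0.0, False


def similarity(d1, d2, lam=1.0, sigma=1.0):
    """Average the per-feature scores over the counted features and
    rescale from [-lam, 1] into [0, 1] (the rescaling is an assumption)."""
    if len(d1) != len(d2):
        raise ValueError("documents must share one feature space")
    total, counted = 0.0, 0
    for x, y in zip(d1, d2):
        score, used = feature_score(x, y, lam, sigma)
        total += score
        counted += used
    if counted == 0:
        return 0.0  # assumption: two empty documents get similarity 0
    return (total / counted + lam) / (1.0 + lam)


def knn_predict(query, train_docs, train_labels, k=3, lam=1.0, sigma=1.0):
    """Single-label k-NN: majority vote among the k most similar documents."""
    ranked = sorted(
        zip(train_docs, train_labels),
        key=lambda pair: similarity(query, pair[0], lam, sigma),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    docs = [[1, 0, 2], [1, 0, 3], [0, 4, 0], [0, 5, 0]]
    labels = ["a", "a", "b", "b"]
    print(similarity(docs[0], docs[1]))
    print(knn_predict([1, 0, 2], docs, labels, k=3))
```

Because case (c) features are skipped rather than scored, padding both vectors with terms that neither document contains leaves the similarity unchanged in this sketch. This matters for sparse text vectors, where most terms are absent from any given pair of documents.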
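Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Literature Review
2.1 Traditional Similarity Measures
2.2 Related Classification and Clustering Algorithms
2.2.1 k-NN Single-Label Classification Algorithm
2.2.2 ML-KNN Multi-Label Classification Algorithm
2.2.3 k-means-like Clustering Algorithm
Chapter 3 Methodology
3.1 Motivation
3.2 Our Method
3.2.1 Overview
3.2.2 Justification
3.3 Illustrative Examples
Chapter 4 Experimental Results and Analysis
4.1 Experimental Data
4.2 Single-Label Document Classification
4.3 Multi-Label Document Classification
4.4 Document Clustering
Chapter 5 Conclusion
References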
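List of Figures
Figure 2.1 Illustration of cosine similarity
Figure 2.2 Illustration of the k-NN single-label classification algorithm
Figure 4.1 Classification results of our method for different values of λ
Figure 4.2 Classification results of each method on the WebKB collection
Figure 4.3 Classification results of each method on the 8 Reuters-21578 document sets
Figure 4.4 Results of our method for different values of λ
Figure 4.5 Comparison of the methods on RCV1
Figure 4.6 Accuracy of our method for different values of λ
Figure 4.7 Comparison of document clustering results on WebKB
Figure 4.8 Comparison of document clustering results on Reuters-21578 Top 8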
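List of Tables
Table 4.1 Document distribution of the WebKB collection
Table 4.2 Document distribution of the top 8 Reuters document sets
Table 4.3 The 5 subsets of RCV1
Table 4.4 Classification accuracy of each method on the WebKB collection
Table 4.5 Classification accuracy of each method on the 8 Reuters-21578 document sets
Table 4.6 Statistical significance tests of accuracy on WebKB
Table 4.7 F1 of each method on the RCV1 collection
Table 4.8 BEP of each method on the RCV1 collection
Table 4.9 Statistical significance test results for F1 on RCV1
Table 4.10 Statistical significance tests of Accuracy and Entropy on Reuters-21578 Top 8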