National Digital Library of Theses and Dissertations in Taiwan

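Graduate student: 曾有德
Graduate student (romanized): Yu-De Tseng
Thesis title (Chinese): 以Web2.0概念建構自動化文件分群與內容相似性比對之研究
Thesis title (English): Automatic Clustering and Similarity Measurement for Documents Based on the Web 2.0 Concept
Advisor: 曾守正
Advisor (romanized): Frank Shou-Cheng Tseng
Degree: Master's
Institution: National Kaohsiung First University of Science and Technology
Department: Graduate Institute of Information Management
Discipline: Computer Science
Sub-discipline: General Computer Science
Thesis type: Academic thesis
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar)
Language: Chinese
Number of pages: 101
Keywords (Chinese): Web 2.0; 文件分群 (document clustering); 相似度測量 (similarity measurement); 抄襲比對 (plagiarism comparison)
Keywords (English): Document Clustering; Document Similarity Measurement; Plagiarism Detection; Web 2.0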
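Abstract (translated from the Chinese original):
The rapid growth of the Internet has enabled large volumes of electronic documents to circulate and be shared quickly. Traditionally, finding highly similar documents has meant reading and judging each document manually, one by one. This consumes considerable time and labor and yields poor results, and because the final judgment is hard to quantify, it often depends on the reviewer's subjective view.
This study proposes a systematic method and framework to help people examine the similarity of document contents. Document contents are extracted automatically and compared for similarity in order to determine how the documents relate to one another. The approach is applied to plagiarism detection among academic papers and to document content indexing, and it serves as the basis for a "Document Query-By-Example" (DQBE) user interface. To improve overall efficiency, before any comparison the document collection is passed through a hierarchical clustering mechanism that automatically identifies documents whose text belongs to the same domain and groups them. The similarity of the different documents within each group is then analyzed step by step, which improves comparison quality and performance.
The system design incorporates the Web 2.0 concept: users supply the document sources themselves, access to thesis retrieval is open, and the comparison parameters can be customized. With simple adjustments, the framework can be extended with additional computation methods to handle different document categories, languages, and text formats, and to measure similarity across domains, which gives it considerable flexibility.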
Abstract (English):
With the rapid proliferation of the Internet, a large volume of electronic documents is shared in cyberspace. Traditional approaches to identifying similar documents require reading and interpreting every document manually, which is time-consuming, error-prone, and labor-intensive. Even so, the results are poor, and the final judgement may rest mainly on the reviewer's subjective perception because no quantitative analysis is available to support an objective decision.
This study proposes an approach for examining the content similarity of documents by incorporating systematic methods into a unified framework. The approach first extracts the document contents, then evaluates their similarities to determine how the documents relate to one another, and finally applies a plagiarism detection process. It can also be used to build a Document Query-By-Example (DQBE) interface for flexible document retrieval. To improve overall efficiency, before content similarities are examined, a hierarchical document clustering algorithm organizes the document set into a hierarchical tree, a layered theme structure organized by keywords. Similarities are then evaluated only among documents in the same cluster. This clustering step improves both the quality and the performance of the system.
The whole system is built on the Web 2.0 concept: users provide the candidate document sources and freely set the parameters for the whole process. The system can be extended to handle different types of documents in different languages or formats, which makes it flexible enough to test document similarity across disciplines.
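The cluster-then-compare idea in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes whitespace-tokenized text (Chinese input would first need word segmentation), uses plain TF-IDF weights with cosine similarity, and swaps in simple single-link agglomerative clustering as a stand-in for the hierarchical clustering step. All function names and both thresholds are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = TF * log(N / DF)."""
    # Assumption: whitespace tokenization; segmented text is expected as input.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative_clusters(vectors, threshold):
    """Single-link clustering: repeatedly merge the two most similar clusters
    until the best available similarity drops below the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            sim = max(cosine(vectors[i], vectors[j])
                      for i in clusters[x] for j in clusters[y])
            if best is None or sim > best[0]:
                best = (sim, x, y)
        if best[0] < threshold:
            break
        _, x, y = best  # x < y, so deleting index y leaves x valid
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def similar_pairs(docs, cluster_threshold=0.1, report_threshold=0.5):
    """Compare documents only within their own cluster and report close pairs."""
    vectors = tfidf_vectors(docs)
    pairs = []
    for cluster in agglomerative_clusters(vectors, cluster_threshold):
        for i, j in combinations(sorted(cluster), 2):
            score = cosine(vectors[i], vectors[j])
            if score >= report_threshold:
                pairs.append((i, j, score))
    return pairs
```

Restricting the pairwise comparison to documents in the same cluster is what saves work: pairs that fall in different clusters are never scored in the final step.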
Chinese Abstract
English Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
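1. Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Research Scope and Limitations
1.3.1 Language Scope
1.3.2 Document File Format Scope
1.3.3 Document Content
1.3.4 System Design Environment
1.4 Research Contributions
1.5 Research Process
1.6 Thesis Structure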
2. Related Work
2.1 Metadata
2.2 Document Warehouse
2.3 Document Preprocessing
2.3.1 English Stemming
2.3.2 Chinese Word Segmentation
2.3.2.1 N-Gram Segmentation
2.3.2.2 PAT-Tree-Based Segmentation
2.3.2.3 Chinese Word Segmentation Rules
2.3.3 Keyword Selection
2.3.3.1 Document Frequency Method (DF-Threshold Method)
2.3.3.2 Mutual Information Method
2.3.3.3 Information Gain Method
2.3.3.4 Chi-Square Test Method (χ² Statistic Method)
2.3.4 Keyword Weighting
2.3.4.1 TF-IDF (Term Frequency–Inverse Document Frequency)
2.4 Clustering Techniques
2.4.1 Partitional Clustering Algorithms
2.4.2 Hierarchical Clustering Algorithms
2.4.3 Heuristic Clustering Methods
2.4.4 Comparison of Clustering Algorithms
2.4.5 K-means Clustering
2.4.6 FHTC (High Frequent Itemset Clustering)
2.4.7 FIHC (Frequent Itemset Hierarchical Clustering)
2.5 Similarity Measurement
2.5.1 Similarity Measurement Between Strings
2.5.1.1 Edit Distance
2.5.1.2 Longest Common Subsequence (LCS)
2.5.1.3 Jaro-Winkler Method
2.6 What Is Plagiarism
2.6.1 Types of Plagiarism
2.6.2 Plagiarism Detection
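3. Research Method
3.1 System Architecture
3.2 System Workflow
3.2.1 Document Extraction
3.2.2 Document Preprocessing
3.2.3 Document Clustering
3.2.4 Document Similarity Comparison
3.3 System Programs
3.3.1 Stored Procedures
3.3.2 Extracting Text from Document Files
3.3.2.1 Extracting Text from WORD Files
3.3.2.2 Extracting Text from PDF Files
3.3.2.3 Extracting Text from TXT Files
3.3.3 Reading Metadata
3.3.4 Splitting Document Text into Sentences
3.3.5 Word Segmentation
3.3.6 Keyword Selection Computation
3.3.7 Document Clustering Computation
3.3.8 Computing Document Similarity
3.3.9 OverPower Automatic Charting Component
3.3.10 Implementing the Document Warehouse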
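4. System Implementation and Evaluation
4.1 System Design
4.2 User Interface
4.3 System Evaluation
4.3.1 Sources of Sample Data
4.3.2 System Performance Evaluation
4.3.3 Document Computation Results
4.3.4 Plagiarism Computation Evaluation
5. Conclusions and Future Research
5.1 Conclusions
5.2 Future Research
References
Appendix A: Keyboard-to-ASCII Code Mapping
Appendix B: Chinese Stop Word List
Appendix C: English Stop Word List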
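Section 2.5.1 of the outline lists edit distance and longest common subsequence (LCS) among the string similarity measures reviewed. As a rough illustration, the textbook dynamic-programming versions of both can be sketched as follows. This is not code from the thesis; the function names and the normalization in `edit_similarity` are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

Because Python strings are sequences of Unicode characters, the same functions work on Chinese text character by character, for example `edit_distance("文件分群", "文件分類")`, which differs in a single character.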