National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

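Graduate student: 陳寶燦
Graduate student (English name): Antony Chen
Thesis title: 應用分群技術於同義書目之過濾與最佳化
Thesis title (English): Synonymous Book Records Filtering and Optimization Using Clustering Techniques
Advisor: 楊燕珠
Advisor (English name): Yen-Ju Yang
Degree: Master's
Institution: Tatung University (大同大學)
Department: Department of Information Management (資訊經營學系)
Discipline: Business and Management
Field: General Business
Thesis type: Academic thesis
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Pages: 46
Chinese keywords: 同義書目過濾; 重複書目; 中國機讀編目格式; 機讀編目格式; 動態資料分群
English keywords: Synonymous Book Records Filtering; CMARC (Chinese MARC); Dynamic Data Clustering; MARC21 (MAchine-Readable Cataloging for the 21st; Duplicate Bibliographic Records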
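The MARC21 machine-readable cataloging format (hereafter "the bibliographic format") is the standard specification with which libraries worldwide build bibliographic databases, and it also serves to record and describe the content of books and documents. Library automation systems therefore mostly use it as their storage standard and as the basis for document retrieval and the exchange of bibliographic data. Because local conditions differ, the National Central Library established the Chinese MARC Format (CMARC) in 1982 (ROC year 71) as the national standard for bibliographic development. New publications appear constantly and in large numbers, so most bibliographic data is shared through interlibrary cooperation. Cataloging is done by hand, however, so input errors and differing interpretations of the cataloging standard among catalogers are inevitable. As a result, the same book can end up with several different bibliographic records, which muddles the bibliographic data and sharply reduces its reference value.
Because bibliographic data is voluminous and its format is specialized, using information technology to help organize it is a major challenge. This thesis therefore proposes assigning different weights to the fields of a bibliographic record according to their importance, converting the records into vectors, and then performing dynamic data clustering in the vector space, so that records in the same cluster represent similar bibliographic records. Similarity is then computed among the records within each cluster, and records whose similarity exceeds a set threshold are selected as likely duplicate (synonymous) records of the same book. Finally, a score is computed for each candidate, the poorer records are filtered out, and the best record is kept.
The experimental results show that the proposed method applies clustering, selects the key discriminating fields according to the characteristics of bibliographic data, and gives those fields heavier weights as the basis for comparison. In filtering and optimizing synonymous records, it is more accurate than earlier rule-based filtering and also greatly reduces comparison time. It offers a new direction for cleaning up duplicate records, and with further fine-tuning it could be put to practical use in libraries.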
MARC21 (MAchine-Readable Cataloging for the 21st century), the standard specification for bibliographic databases in libraries worldwide, was developed to describe the content of books. Library automation systems therefore generally adopt this format as their storage standard, so that bibliographic records can be retrieved and exchanged. The bibliographic format used in Taiwan is CMARC (Chinese MARC), which is adapted to Chinese-language materials. Because so many books are published, most bibliographic records are exchanged through interlibrary cooperation. However, bibliographic records are compiled manually, so errors and inconsistencies are inevitable, and the same book ends up with several different bibliographic records. The resulting confusion greatly reduces the reference value of the bibliographic data.
Because bibliographic records are numerous and a single book may have several of them, using information technology to help consolidate them is a major challenge. This research therefore presents the following approach to identify duplicate bibliographic records. First, Feature Selection: the words in the important fields of CMARC are chosen as features and given weights according to the importance of their fields. Second, Vector Construction: the feature weights are combined with tf-idf values, and every book record is represented as one vector. Third, Dynamic Data Clustering: grouping is performed on the vector space, so that book records in the same cluster are similar bibliographic records. Fourth, Synonymous Book Records Filtering: the similarity between each pair of vectors in the same cluster is computed, and all vectors with similarity above the threshold are treated as duplicate bibliographic records. Fifth, Book Records Optimization: a score is calculated for each duplicate bibliographic record, and the best one is retained as the standard bibliographic record for the book.
According to the experimental results, the presented method is more accurate and faster than previous rule-based methods. With further fine-tuning, it could be put to practical use in libraries.
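The filtering steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the field names, the weights in `FIELD_WEIGHTS`, the smoothed idf formula, the 0.8 threshold, and the "most non-empty fields" quality score are all assumptions chosen for illustration, and the dynamic clustering step is omitted, so every pair of records is compared instead of only pairs within a cluster.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical field weights; the thesis's actual fields and weights are not given here.
FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "publisher": 1.0}


def tokenize(text):
    """Lowercase whitespace tokenizer (Chinese text would need word segmentation)."""
    return text.lower().split()


def build_vectors(records):
    """Represent each record as a sparse {term: weight} vector.

    Term frequency is accumulated with the weight of the field the term
    appears in, then multiplied by a smoothed idf (an assumed variant).
    """
    weighted_tfs = []
    for rec in records:
        tf = Counter()
        for field, weight in FIELD_WEIGHTS.items():
            for term in tokenize(rec.get(field, "")):
                tf[term] += weight
        weighted_tfs.append(tf)

    # Document frequency: number of records containing each term.
    df = Counter()
    for tf in weighted_tfs:
        df.update(tf.keys())

    n = len(records)
    return [
        {t: w * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, w in tf.items()}
        for tf in weighted_tfs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_duplicates(records, threshold=0.8):
    """Return index pairs whose similarity meets the (assumed) threshold."""
    vectors = build_vectors(records)
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


def best_record(records, indices):
    """Pick the record with the most non-empty fields (an assumed quality score)."""
    return max(indices, key=lambda i: sum(1 for v in records[i].values() if v.strip()))
```

Calling `find_duplicates(records)` on a list of dictionaries keyed by the field names above returns candidate duplicate pairs, and `best_record(records, pair)` chooses which record of a pair to keep. Restricting the pairwise comparison to records within the same cluster, as the abstract describes, is what reduces the comparison time on a large catalog.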
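Acknowledgments
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Chinese Cataloging Standards
2.2 Common Problems in Bibliographic Records
2.3 Bibliographic Quality and Common Solutions
2.3.1 Bibliographic Quality
2.3.2 Common Solutions
2.4 Feature Selection
2.4.1 Keywords
2.4.2 Inverse Document Frequency (idf)
2.5 Vector Space Model
2.6 Document Clustering
2.6.1 Agglomerative Hierarchical Method (Unweighted Pair Group Method with Arithmetic Mean, UPGMA)
2.6.2 Bisecting K-means Clustering
2.6.3 Hierarchical Frequent Term-Based Clustering (HFTC)
2.6.4 Unsupervised Clustering Based on Keyword Clusters
2.6.5 Two-Phase Clustering Algorithm Based on Density and Similarity
Chapter 3 Research Method
3.1 Research Process
3.2 Feature Field Selection
3.3 Bibliographic Record Preprocessing
3.3.1 Stop-Word Removal
3.3.2 Stemming
3.3.3 Chinese Word Segmentation and Part-of-Speech Tagging
3.4 Weight Calculation
3.5 Bibliographic Record Clustering
3.6 Extraction of Synonymous Records Within Clusters
3.7 Determining the Best Record
4.1 Bibliographic Data
4.2 Clustering Quality Evaluation Criteria
4.3.1 Experiment 1: Feature Selection
4.3.2 Experiment 2: Bibliographic Record Clustering
4.3.3 Experiment 3: Within-Cluster Record Similarity Comparison
4.4 Analysis and Discussion
4.4.2 Error Analysis
Chapter 5 Conclusions and Future Directions
References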