Author (English name): Antony Chen
Thesis title (English): Synonymous Book Records Filtering and Optimization Using Clustering Techniques
Advisor (English name): Yen-Ju Yang
Keywords (English): Synonymous Book Records Filtering; CMARC (Chinese MARC); Dynamic Data Clustering; MARC21 (MAchine-Readable Cataloging for the 21st century); Duplicate Bibliographic Records
MARC21 (MAchine-Readable Cataloging for the 21st century) is the standard bibliographic format used by libraries worldwide to describe the content of books. Library automation systems therefore adopt it as their storage standard so that bibliographic records can be retrieved and exchanged. The bibliographic format used in our country, CMARC (Chinese MARC), is adapted for Chinese-language materials. Because so many books are published, most bibliographic records are exchanged through interlibrary cooperation. However, the records are compiled manually, so errors and inconsistencies are inevitable, and the same book often ends up with several different bibliographic records. The resulting confusion greatly reduces the reference value of the bibliographic data.
Because a single book may have many records, using information technology to assist bibliographic reconciliation is a major challenge. This research therefore presents the following approach to identify duplicate bibliographic records. First, feature selection: the words in the important CMARC fields are chosen as features and weighted according to the importance of each field. Second, vector construction: the field weights are combined with tf-idf values, and every book record is represented as one vector. Third, dynamic data clustering: the vectors are grouped in the vector space, so that book records in the same cluster have similar bibliographic content. Fourth, synonymous book records filtering: the similarity between each pair of vectors in the same cluster is computed, and all vectors whose similarity exceeds the threshold are regarded as duplicate bibliographic records. Fifth, book records optimization: a score is calculated for each duplicate record, and the best record is retained as the standard bibliographic record for the book.
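The steps above can be illustrated with a minimal Python sketch. It is not the implementation used in this research: the field names, the field weights, the smoothed idf formula, the 0.8 threshold, and the completeness-based score are all assumptions chosen for illustration. Whitespace tokenization stands in for the Chinese word segmentation that real CMARC records would need, and the dynamic data clustering step is omitted, so all records are treated as a single cluster.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical field weights; the abstract does not state the actual values.
FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "publisher": 1.0}


def build_vectors(records):
    """Represent each record (dict of field -> text) as a field-weighted tf-idf vector."""
    # Field-weighted term frequency: a term counts more in an important field.
    weighted_tf = []
    for rec in records:
        tf = Counter()
        for field, text in rec.items():
            for term in text.lower().split():
                tf[term] += FIELD_WEIGHTS.get(field, 1.0)
        weighted_tf.append(tf)

    # Document frequency: number of records that contain each term.
    df = Counter(term for tf in weighted_tf for term in tf)
    n = len(records)
    # Smoothed idf (an assumed variant) keeps weights positive for common terms.
    return [
        {t: w * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, w in tf.items()}
        for tf in weighted_tf
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_duplicates(records, threshold=0.8):
    """Return index pairs whose similarity reaches the (assumed) threshold."""
    vectors = build_vectors(records)
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


def best_record(records, indices):
    """Pick the record with the most non-empty fields (an assumed quality score)."""
    return max(indices, key=lambda i: sum(1 for v in records[i].values() if v.strip()))


sample = [
    {"title": "Data Mining Concepts and Techniques", "author": "Han Kamber",
     "publisher": "Morgan Kaufmann"},
    {"title": "Data Mining Concepts and Techniques", "author": "Han Kamber",
     "publisher": ""},
    {"title": "Introduction to Algorithms", "author": "Cormen Leiserson",
     "publisher": "MIT Press"},
]
pairs = find_duplicates(sample)
print(pairs)                            # [(0, 1)]
print(best_record(sample, [0, 1]))      # 0
```

In this sketch the first two sample records differ only in the missing publisher, so they exceed the threshold and the more complete one is kept. Restricting the pairwise comparison to records within the same cluster, as the approach describes, would avoid comparing every pair in a large collection.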
According to the experimental results, the presented methods are more accurate and faster than previous rule-based methods. With further fine-tuning, the methods are expected to be usable in practice by libraries.
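Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Chinese Cataloging Standards
2.2 Common Problems in Bibliographic Records
2.3 Bibliographic Quality and General Solutions
2.3.1 Bibliographic Quality
2.3.2 General Solutions
2.4 Feature Selection
2.4.1 Keywords
2.4.2 Inverse Document Frequency (idf)
2.5 Vector Space Model
2.6 Document Clustering
2.6.1 Agglomerative Hierarchical Clustering (Unweighted Pair Group Method with Arithmetic Mean, UPGMA)
2.6.2 Bisecting K-means Clustering
2.6.3 Hierarchical Frequent Term-Based Clustering (HFTC)
2.6.4 Unsupervised Clustering Based on Keywords
2.6.5 Two-Phase Clustering Algorithm Based on Density and Similarity
Chapter 3 Research Methods
3.1 Research Process
3.2 Feature Field Selection
3.3 Bibliographic Record Preprocessing
3.3.1 Stop Word Removal
3.3.2 Stemming
3.3.3 Chinese Word Segmentation and Part-of-Speech Tagging
3.4 Weight Computation
3.5 Bibliographic Record Clustering
3.6 Extraction of Synonymous Records within Clusters
3.7 Selection of the Best Record
4.1 Bibliographic Data
4.2 Clustering Quality Evaluation Criteria
4.3.1 Experiment 1: Feature Selection
4.3.2 Experiment 2: Bibliographic Record Clustering
4.3.3 Experiment 3: Comparison of Record Similarity within Clusters
4.4 Analysis and Discussion
4.4.2 Error Analysis
Chapter 5 Conclusions and Future Directions
References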
