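Graduate student: Chi-Nan Hsieh (謝其男)
Thesis title: Ambiguity Resolution of Author Names for Bibliographic Data (書目資料中著者姓名歧異性之解析)
Advisor: 陳光華
Oral defense committee: 唐牧群, 黃乾綱
Oral defense date: 2011-07-21
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Library and Information Science
Discipline: Communication
Field: Library, Information and Archival Studies
Year of publication: 2011
Graduation academic year: 99 (ROC calendar)
Language: English
Number of pages: 47
Keywords (Chinese): 著者歧義性 (author ambiguity), 書目資料 (bibliographic data), 機器學習 (machine learning)
Keywords (English): Author Disambiguation, Bibliographic Data, Machine Learning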
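When searching large volumes of scholarly information, users often run into author name ambiguity, which makes resolving groups of same-named authors an important research topic. Unlike previous studies, this study makes full use of the information in bibliographic records for identification and uses no information beyond them. We therefore use five features: co-author names (C), article title (T), journal title (J), publication year (Y), and number of pages (P), of which publication year and number of pages have not been used in any earlier study. Using supervised learning methods and an unsupervised method, the study examines the accuracy of author name disambiguation for a total of 28 feature combinations.
The study finds that journal title (J) and co-author (C) are especially effective features. Journal title (J) is important under every method, whereas co-author (C) stands out mainly with the Support Vector Machine (SVM). In addition, publication year (Y) and number of pages (P) clearly raise disambiguation accuracy when combined with other features; of the two, publication year (Y) helps more (an average gain of about 2.5%). The effect of publication year and number of pages is especially pronounced with K-means clustering (about 5%).
The feature combination CTJ, commonly used in previous studies, does not necessarily achieve the best accuracy; across the different classification methods, other combinations such as JYP, JY, and CJ can also reach the best accuracy. Finally, comparing results by dataset size and complexity shows that as the test datasets grow larger and more heterogeneous, the bibliographic data of cited documents alone cannot provide sufficient disambiguation performance. This indicates that future work aiming to resolve personal name ambiguity effectively must link and map bibliographic information outward to other sources in order to obtain more distinctive author features.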

Resolving author name ambiguity is indispensable when retrieving academic information. In contrast to previous work, this study addresses the problem using only the information contained in bibliographic data. Five features are extracted from bibliographic data to disambiguate author names: co-author (C), article title (T), journal title (J), year (Y), and number of pages (P). Features Y and P have not been used in earlier studies. Two supervised learning methods (Naive Bayes and Support Vector Machine) and one unsupervised learning method (K-means) are employed to explore 28 different feature combinations.
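As a rough illustration of how the feature combinations can be enumerated: five features yield 31 non-empty subsets, while the abstract reports 28. One reading, which is an assumption here and not stated in the abstract, is that the three subsets consisting only of Y and/or P (Y, P, and YP) are left out because those two features act as auxiliary information. The function names below are hypothetical.

```python
from itertools import combinations

# Feature codes as named in the abstract.
FEATURES = ["C", "T", "J", "Y", "P"]
# Assumption (not stated in the abstract): Y and P are auxiliary features,
# so subsets made up only of Y and/or P are skipped.
AUXILIARY = {"Y", "P"}


def all_combinations(features):
    """Return every non-empty subset of features as a string such as 'CTJ'."""
    return [
        "".join(subset)
        for size in range(1, len(features) + 1)
        for subset in combinations(features, size)
    ]


def without_auxiliary_only(combos, auxiliary):
    """Drop combinations that contain no feature outside the auxiliary set."""
    return [combo for combo in combos if not set(combo) <= auxiliary]


every_subset = all_combinations(FEATURES)
candidates = without_auxiliary_only(every_subset, AUXILIARY)
print(len(every_subset), len(candidates))  # prints "31 28"
```

Under this assumption the count matches the 28 combinations mentioned in the abstract, and combinations such as CTJ, JYP, and JY are all among the candidates.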
The findings show that journal title (J) and co-author (C) are very effective features. Feature J plays an important role in all three approaches, while feature C stands out mainly with SVM. In addition, year (Y) and number of pages (P) clearly improve accuracy when added to other feature combinations, and the average improvement from including feature Y is larger than that from feature P. The effect is notably stronger in K-means clustering (+4.98% on average) than in the Naive Bayes model (+0.90% on average) or the Support Vector Machine (+0.15% on average).
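An "average improvement" of this kind could be computed by pairing each feature combination with the same combination plus the added feature and averaging the accuracy differences. The sketch below shows that calculation only; the function name is hypothetical, and the accuracy values are placeholders chosen for illustration, not results from the thesis.

```python
from statistics import mean

# Canonical feature order, used to build combination names such as "CJY".
ORDER = "CTJYP"


def average_gain(accuracy, feature):
    """Mean accuracy change from adding `feature` to each combination lacking it.

    `accuracy` maps a combination name (e.g. "CJ") to accuracy in percent.
    Returns None when no combination has a counterpart with the feature added.
    """
    gains = []
    for combo, base in accuracy.items():
        if feature in combo:
            continue
        extended = "".join(sorted(combo + feature, key=ORDER.index))
        if extended in accuracy:
            gains.append(accuracy[extended] - base)
    return mean(gains) if gains else None


# Placeholder numbers for illustration only (NOT taken from the thesis).
placeholder = {"J": 80.0, "JY": 83.0, "CJ": 85.0, "CJY": 86.0}
print(average_gain(placeholder, "Y"))  # prints 2.0
```

With the placeholder values, adding Y changes J by 3.0 points and CJ by 1.0 point, so the average gain is 2.0 points.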
The traditionally used feature combination CTJ does not outperform JYP, and the combinations CJY, JY, and J also perform well across the three methods. Finally, disambiguation accuracy on larger datasets is 10% lower than on smaller ones, which indicates the limits of what bibliographic data alone can achieve in a “numerous and jumbled” real world. A promising direction for future work is therefore to build an intelligent mechanism that accurately maps other information onto bibliographic information, so that enough information is available for author disambiguation.

Table of Contents


Abstract (Chinese)
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Objectives of Research
1.3 Restriction of Research
1.4 Definition of Terms
1.4.1 Bibliographic data
1.4.2 Ambiguity Resolution
Chapter 2 Literature Review
2.1 Name Disambiguation
2.2 Ambiguity Resolution for Author
2.3 Machine Learning
2.3.1 Supervised Learning Methods
2.3.2 Unsupervised Learning Methods
Chapter 3 Research Design
3.1 Data Collection
3.2 Feature Combinations
3.3 Data Processing
3.4 Machine Learning
3.5 Performance Evaluation
3.6 Settings for Year and Number of Pages
Chapter 4 Experimental Results
4.1 Common Feature Combinations
4.2 Features Year (Y) and Number of Pages (P)
4.3 Complexity of Datasets
4.4 Top One Feature Combinations
Chapter 5 Conclusions and Suggestions
5.1 Conclusions
5.2 Suggestions for Future Studies
References
Appendix


