Graduate Student (English): Chang, Yu-Jen
Thesis Title (English): A Research of Performance Evaluation of Mandarin Chinese Full-Text Information Retrieval--Full-Text Scan Model vs. Cluster Indexing Model
Advisors (English): Chung, Kuo-Kuei; Huang, Yun-Long
Keywords (English): Information Retrieval; Effectiveness Evaluation; Full-Text Scan Model; Cluster Indexing Model; Singular Value Decomposition
Full-text information retrieval is attracting interdisciplinary interest. Mandarin Chinese full-text information retrieval faces more basic difficulties than its English counterpart, both because research on it started later and because of the nature of the language. The fundamental issue is the lack of an objective test collection and a standard effectiveness evaluation for information retrieval experiments. This thesis introduces two systems: the Chinese Text Processor (CTP), developed by Academia Sinica in 1996, and the Cluster Indexing Model (CIM), developed by Huang Yun-Long in 1997. Both systems are evaluated on the same corpus (document set).
Given the state of research on Chinese, this study makes three contributions. First, it analyzes which full-text information retrieval method is better suited to a given corpus (document set). Second, it develops a mature Cluster Indexing Model as a foundation for advanced application research. Finally, it constructs a test collection and a standard effectiveness evaluation for Chinese full-text information retrieval research.
The experiments use medicine-related articles from the Children’s Daily News (502 documents) and 21 queries. A series of experiments led to the following findings:
1. The average recall of CTP is 99.02%, and its average precision is 17.72%.
2. With automatic term segmentation, at index dimension 100 and similarity threshold 0.3:
(1) The recall of CIM-IDF is 80.73%, and the precision is 45.09%.
(2) The recall of CIM-TF is 65.97%, and the precision is 43.52%.
3. With manual term segmentation, at index dimension 100 and similarity threshold 0.3:
(1) The recall of CIM-IDF is 82.81%, and the precision is 47.11%.
(2) The recall of CIM-TF is 64.81%, and the precision is 42.72%.
4. Based on the results of the experiments above, the following conclusions are drawn:
(1) CIM-IDF performs better than CTP under both automatic and manual term segmentation.
(2) CIM-IDF performs better than CIM-TF under both automatic and manual term segmentation.
(3) For CIM-IDF, when the index dimension is greater than 80, automatic and manual term segmentation perform similarly. This indicates that automatic term segmentation can substitute for manual segmentation.
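The role of the weighting scheme and the similarity threshold in the results above can be illustrated with a minimal sketch. It is not the thesis's CIM implementation: the dimension reduction implied by "index dimension" is omitted, the IDF formula log(N/df) is an assumption, and the function names and the toy English-token corpus are hypothetical.

```python
import math
from collections import Counter


def weight(terms, idf, use_idf):
    """Weight known terms by raw term frequency, optionally times IDF."""
    counts = Counter(t for t in terms if t in idf)
    return {t: c * (idf[t] if use_idf else 1.0) for t, c in counts.items()}


def build_index(docs, use_idf):
    """Build sparse weighted vectors for already-segmented documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    # Assumed IDF variant: log(N / df); the thesis may use another form.
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    vectors = {doc_id: weight(terms, idf, use_idf) for doc_id, terms in docs.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_terms, docs, use_idf, threshold=0.3):
    """Return ids of documents whose similarity to the query meets the threshold."""
    vectors, idf = build_index(docs, use_idf)
    query = weight(query_terms, idf, use_idf)
    return sorted(d for d, vec in vectors.items() if cosine(query, vec) >= threshold)


# Hypothetical toy corpus: "child" occurs in every document.
TOY_DOCS = {
    "d1": ["child", "fever", "medicine"],
    "d2": ["child", "school", "lunch"],
    "d3": ["child", "vaccine", "clinic"],
}

print(retrieve(["child", "fever"], TOY_DOCS, use_idf=False))  # ['d1', 'd2', 'd3']
print(retrieve(["child", "fever"], TOY_DOCS, use_idf=True))   # ['d1']
```

In this toy case, TF weighting lets the ubiquitous term "child" pull every document over the 0.3 threshold, while IDF weighting gives that term zero weight and keeps only the document that shares the distinctive term. Whether the same effect accounts for the gap between CIM-TF and CIM-IDF reported above is not something this sketch can establish.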
Researchers have long been devoted to developing information retrieval systems. They have drawn on different theories to find new approaches and improve system performance, but no single system is fully satisfactory. An IR system should therefore support different retrieval models, and in the future relevance feedback could be applied across different models.
In addition, research on Mandarin Chinese full-text information retrieval has covered many topics, but effectiveness evaluation across the different retrieval approaches has been lacking. If a standard evaluation environment could be constructed (e.g., a large corpus, queries, relevance judgments, and a standard evaluation method), it would contribute to improving system performance.
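As an illustration of what such an evaluation environment computes, the following sketch scores a system's output against relevance judgments and averages precision and recall over queries. It assumes macro-averaging (each query counts equally); the abstract does not state which averaging the thesis uses. The names `qrels`, `run_scan`, and `run_index` and all data are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(run, qrels):
    """Macro-averaged (precision, recall) over every query in qrels.

    run:   dict mapping query id -> iterable of retrieved document ids
    qrels: dict mapping query id -> set of relevant document ids
    """
    scores = [precision_recall(run.get(q, []), rel) for q, rel in qrels.items()]
    avg_precision = sum(p for p, _ in scores) / len(scores)
    avg_recall = sum(r for _, r in scores) / len(scores)
    return avg_precision, avg_recall


# Hypothetical relevance judgments and two hypothetical system outputs.
qrels = {"q1": {"d1", "d2"}, "q2": {"d3"}}
run_scan = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d1", "d2", "d3", "d4"]}
run_index = {"q1": ["d1"], "q2": ["d3"]}

print(evaluate(run_scan, qrels))   # (0.375, 1.0)
print(evaluate(run_index, qrels))  # (1.0, 0.75)
```

Because both runs are scored against the same queries and judgments, their averages are directly comparable, which is the point of a shared test collection.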
Chinese Abstract
English Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Section 1 Research Background
Section 2 Research Motivation
I. The Knowledge Era and Digital Information Retrieval
II. Applications and Development of Information Retrieval
III. Building an Objective Experimental Platform
IV. Performance Evaluation of Chinese Full-Text Information Retrieval
Section 3 Research Objectives
Section 4 Thesis Organization
Chapter 2 Literature Review
Section 1 Concepts of Information Retrieval
I. Information Retrieval Systems
II. Information Needs
III. Document Organization
Section 2 The Concept of Relevance
I. Definitions of Relevance
II. Effect of Relevance Judgment Scales on Retrieval Performance
III. The Relevance Judgment Process and Practical Testing
Section 3 Retrieval Models
I. Full-Text Scan Model
II. Cluster Indexing Model
Section 4 Performance Evaluation
I. Evaluation Criteria
II. Presentation of Retrieval Results
Chapter 3 Research Design
Section 1 Research Scope
Section 2 Research Framework
Section 3 Research Procedure
Section 4 Research Limitations
I. Selection of the Test Corpus
II. User Information Need Descriptions and Queries
III. Relevance Judgments of Documents
Chapter 4 Analysis of Experimental Results
Section 1 Overview of the Experiments
I. Experimental Environment
II. Evaluation Variables
III. Evaluation and Presentation Methods
Section 2 Analysis of the Baseline Retrieval Environment
I. Corpus Analysis
II. Index Term Selection Analysis
III. Query Analysis
Section 3 Experimental Results
I. Manual Term Selection: Performance of TF vs. IDF Weighting in Cluster Indexing
II. Automatic Term Selection: Performance of TF vs. IDF Weighting in Cluster Indexing
III. Cluster Indexing with TF and IDF Weighting: Performance of Automatic vs. Manual Term Selection
IV. Performance of Cluster Indexing with IDF Weighting vs. CTP
Section 4 Error Analysis and Discussion of Experiments
I. Error Analysis
II. Discussion of Experiments
Chapter 5 Conclusions and Suggestions for Future Research
Section 1 Conclusions
I. Index Term Selection
II. Singular Values and the Optimal Index Dimension
III. Similarity Values and Thresholds
IV. Discussion of Experimental Conclusions
Section 2 Suggestions for Future Research
I. Suggestions for Future Experiments
II. Issues in Evaluating Information Retrieval Performance
III. Blueprint for a Cluster Indexing System
IV. Conclusion
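References
Appendix 1 Operating Instructions for the CTP System and Numerical Computation in the Cluster Indexing Model
Appendix 2 Query Contents, Search Terms, and Number of Relevant Documents
Appendix 3 Query Contents, Search Terms, and Number of Relevant Documents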
