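Author: 梁婷
Author (English): Tyne Liang
Title: 中文文件擷取中以字為基礎的特徵法之研究
Title (English): The Study of Character-based Signature Methods in Chinese Text Retrieval
Advisors: 李素瑛, 楊維邦
Advisors (English): Suh-Yin Lee, Wei-Pang Yang
Degree: Doctoral
Institution: National Chiao Tung University (國立交通大學)
Department: Institute of Computer Science and Information Engineering (資訊工程研究所)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 1995
Academic year of graduation: 83 (ROC calendar)
Language: English
Keywords (translated from Chinese): Chinese; retrieval; false hits; text; signature
Keywords (English): Chinese; indexing; false hits; text; signature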
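Most Chinese full-text retrieval systems use single characters
rather than words as the basic retrieval unit. As a result, when
a polysyllabic word is searched for and the order of its
characters is not taken into account, the retrieved documents may
contain only the characters of the word rather than the word
itself. In this thesis we call such documents adjacency false
hits. In seeking an effective Chinese text retrieval method, we
first examine the relation between the structure of Chinese
polysyllabic words and adjacency false hits. We also estimate how
much extra space and processing time the commonly used retrieval
methods would need if character order were stored to reduce
adjacency false hits. The estimates of search time and storage
space show that the signature method has better potential than
the inversion method, both in handling Chinese character order
and in storage space. Retrieval with signature files, however,
produces so-called random false hits. For the disyllabic and
trisyllabic words commonly used in Chinese text retrieval, we
therefore present a theoretical calculation of random false hits
that is closer to the actual values. When constructing a Chinese
signature file, we additionally use a transformation function to
generate a signature code for each pair of consecutive characters
(bigram) in a document, and store these codes in the document
signature by superimposition. We call this the combined scheme.
We also give a theoretical formula for the false hit probability
when this scheme is applied to disyllabic queries. For the
monogram and bigram signatures, we propose the optimal weight
assignment that minimizes the false hit rate for disyllabic
queries. The design of this assignment takes into account the
occurrence frequency of the different keys in documents, the
semantic association between a disyllabic word and its
constituent characters, the signature length, and the storage
space.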
Many Chinese text access methods use characters instead of
words as the basic search units and treat polysyllabic queries
as conjunctive combinations of their constituent characters.
Therefore, if no character sequence information is incorporated
in the search algorithm, one may retrieve an adjacency false
hit, which is a document containing all the characters of a
polysyllabic query but not in the exact character sequence of
the query itself. In search of a good character-based
Chinese text retrieval method, the relation of adjacency false
hits to the construction of polysyllabic words in Chinese is
examined. In addition, the extra storage overhead and
processing time needed to eliminate adjacency false hits for
commonly-used character-based text access methods (inversion
and signature) are estimated. It turns out that the signature
method is more promising than the inversion method because of its
lower space overhead and easy support for adjacency operations in
Chinese text retrieval. However, signature-based access may
retrieve documents that do not contain all the keys of the
search term; such documents are random false hits. In this
thesis, the origin of random false hits is investigated and a
more realistic estimate of the random false hit
probability is derived for Chinese disyllabic and trisyllabic
terms. To construct a Chinese signature file, a special scheme
(the combined scheme) is proposed in which every character
(monogram) and character pair (bigram) in the document is
hashed into the document signature. For disyllabic queries, an
analytical expression of the false hit rate is found. With this
expression, the optimal monogram and bigram weight assignments
are obtained in terms of the signature length, the storage
overhead, as well as the occurrence frequency and the
association value of the query.
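
The combined scheme described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation (its program listings are not reproduced on this page): the function names `key_bits`, `signature` and `may_contain`, the 1024-bit signature length, the weights 2 and 3, and the use of SHA-256 as the hash are all assumptions chosen for illustration.

```python
import hashlib

SIG_BITS = 1024  # signature length in bits (illustrative value, not from the thesis)


def key_bits(key: str, weight: int, sig_bits: int = SIG_BITS) -> int:
    """Hash one key (a monogram or a bigram) to a mask of at most `weight` bits."""
    mask = 0
    for i in range(weight):
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % sig_bits)
    return mask


def signature(text: str, mono_weight: int = 2, bi_weight: int = 0) -> int:
    """Superimpose (bitwise OR) the codes of every character and, when
    bi_weight > 0, of every pair of adjacent characters in `text`."""
    sig = 0
    for ch in text:
        sig |= key_bits(ch, mono_weight)
    if bi_weight > 0:
        for a, b in zip(text, text[1:]):
            sig |= key_bits(a + b, bi_weight)
    return sig


def may_contain(doc_sig: int, query_sig: int) -> bool:
    """Signature test: every bit set in the query must be set in the document."""
    return (doc_sig & query_sig) == query_sig


doc = "他在國中教書"  # contains the characters 國 and 中, but not the word 中國
query = "中國"

# Monogram-only signatures carry no order information, so the document
# always passes the test: an adjacency false hit.
mono_hit = may_contain(signature(doc), signature(query))

# With bigram codes superimposed, the query also requires the bits of "中國".
# The document sets them only through a chance collision (a random false hit).
combined_hit = may_contain(signature(doc, bi_weight=3),
                           signature(query, bi_weight=3))

print(mono_hit, combined_hit)
```

In this sketch the monogram-only test necessarily accepts the document, while the combined test rejects it unless the bits of the bigram happen to be set by other keys. The trade-off between the monogram weight and the bigram weight for a given signature length is the quantity the thesis optimizes.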
Cover
Abstract (in Chinese)
Abstract (in English)
Acknowledgment
Contents
List of Figures
List of Tables
1. INTRODUCTION
1.1 General Features of Text Retrieval
1.2 Overview of Text Access Methods
1.3 Problem with Chinese Text Retrieval
1.4 Synopsis of this Dissertation
2. CHINESE TEXT RETRIEVAL AND ITS ANALYSIS
2.1 Properties of Chinese Texts and Query Terms
2.1.1 Word Construction
2.1.2 Word Identification
2.2 Adjacency False Hits
2.2.1 Adjacency False Hit Probability
2.2.2 Retrieval Accuracy Rate
2.2.3 Experiments and Analysis
2.3 Character-based Indexing Methods
2.3.1 Inversion Method
2.3.2 Cluster Method
2.3.3 Signature Method
2.3.4 Speed and Space Overhead
2.4 Concluding Remarks
3. RANDOM FALSE HITS AND CHINESE SIGNATURE SCHEMES
3.1 Random False Hits for Polysyllabic Queries
3.1.1 Disyllabic Queries
3.1.2 Trisyllabic Queries
3.2 Chinese Signature Schemes
3.2.1 Monogram and Combined Schemes
3.2.2 Fixed Size Block and Fixed Weight Block Schemes
3.3 Concluding Remarks
4. OPTIMAL WEIGHT ASSIGNMENTS
4.1 Survey of Previous Works
4.2 Optimal Weight Assignments for Combined Schemes
4.3 Experiments and Analysis
4.4 Concluding Remarks
5. CONCLUSION AND FUTURE WORK
APPENDIX: PROGRAM LISTINGS
A.1 Programs for IBM/PC Compatible Computers
A.1.1 K6.PAS
A.1.2 SIGUNIT.PAS
A.2 Programs for Workstations
A.2.1 k0.h
A.2.2 k1.c
A.2.3 k2.c
A.2.4 k3.c
REFERENCES