跳到主要內容

臺灣博碩士論文加值系統

(34.204.172.188) 您好!臺灣時間:2023/09/22 23:49
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

我願授權國圖
: 
twitterline
研究生:黃俊誌
研究生(外文):Chun-Chih Huang
論文名稱:改善以序列為基礎之文件檢索系統之有效性與彈性
論文名稱(外文):Improving the Effectiveness and Scalability of a Sequence-Based Text Retrieval System
指導教授:蔡益坤
指導教授(外文):Yih-Kuen Tsay
學位類別:碩士
校院名稱:國立臺灣大學
系所名稱:資訊管理學研究所
學門:電算機學門
學類:電算機一般學類
論文種類:學術論文
論文出版年:2005
畢業學年度:93
語文別:中文
論文頁數:56
中文關鍵詞:累加式更新索引切割設計資訊檢索平行化反轉索引平行化處理文件檢索
外文關鍵詞:Incremental UpdateIndex Partitioning SchemesInformation RetrievalParallel Inverted IndexParallel ProcessingText Retrieval
相關次數:
  • 被引用被引用:0
  • 點閱點閱:128
  • 評分評分:
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:3
The purpose of a text retrieval system is to locate documents from a large, textual
document collection that meet a user’s needs. The SIR system is such a system that is
based on the sequence model. As it was designed and implemented as a sequential, rather
than a parallel application, it becomes less efficient when the size of the data collection
gets larger. Another drawback of the SIR system is that the index must be rebuilt entirely
when the data collections are modified. Also, compared with other models, the query
evaluation process of the sequence model is time consuming. In this thesis, we seek to
make improvements that address these problems.
To facilitate parallel query processing, we implement three kinds of index partitioning
schemes in the system, and evalauete their load balancing characteristics. To improve the
scalability of index building, we design and implement a mechanism that allows the SIR
system to support incremental index updates. We also make other improvements such as
support of queries with homophones and support of more types of token, that make the
system more flexible.
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Text Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Chinese IR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Effectiveness and Scalability of IR systems . . . . . . . . . . . . . 3
1.2 Motivation and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related Work 7
2.1 Inverted Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Parallel Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Parallel Processing in IR . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Incremental Index Updating . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 The Indexing and Retrieval Process of SIR . . . . . . . . . . . . . . . . . 14
2.5.1 Indexing Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5.2 Retrieval Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Scalability and Effectiveness of SIR 19
3.1 Retrieval Process Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Index Partitioning Mechanism . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Index Updating Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Other Modifications to the SIR system . . . . . . . . . . . . . . . . . . . 26
4 The New Components of SIR 29
4.1 The Broker Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 Indexing Building . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.2 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 The Coordinated Collection Management Component . . . . . . . . . . . 35
5 Experimental Results and Analysis 38
5.1 The Effectiveness of the SIR system . . . . . . . . . . . . . . . . . . . . . 38
5.1.1 Topic Containing Continuous, Related Semantic Blocks . . . . . . 39
5.1.2 Topics Containing Distant Semantic Blocks . . . . . . . . . . . . . 40
5.1.3 Adjusting the Weight of the Three Scoring . . . . . . . . . . . . . 41
5.1.4 The Effect of Multi-phase Query . . . . . . . . . . . . . . . . . . 43
5.1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 The Effect of Parallel Processing . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.1 The Load Balancing Analysis . . . . . . . . . . . . . . . . . . . . 46
5.2.2 The Scalability Analysis of the SIR System . . . . . . . . . . . . . 49
6 Conclusion 51
6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Bibliography. . . . . . . . . . . . . . . . . . . . . . . . . . 54
Bibliography
[1] Ricardo Baeza-Yates and Bertheir Ribeiro-Neto. Modern Information Retrieval.
Addison-Wesley, 1999.
[2] Eric W. Brown, James P. Callan, and W. Bruce Croft. Fast incremental indexing for
full-text information retrieval. In VLDB ’94: Proceedings of the 20th International
Conference on Very Large Data Bases, pages 192–202. Morgan Kaufmann Publishers
Inc., 1994.
[3] B. Ribeiro-Neto C. Badue, R. Baeza-Yates and N. Ziviani. Distributed query processing
using partitioned inverted files. In Proc. Symp. String Processing and Information
Retrieval, pages 10–20. IEEE CS Press, 2001.
[4] James P. Callan, W. Bruce Croft, and Stephen M. Harding. The INQUERY retrieval
system. In Proceedings of DEXA-92, 3rd International Conference on Database and
Expert Systems Applications, pages 78–83, 1992.
[5] Yu-Fang Chen. A Text Retrieval System Based on Sequence Similarity. Master’s
thesis, Department of Information Management, National Taiwan University, 2003.
[6] Doug Cutting and Jan Pedersen. Optimizations for dynamic inverted index maintenance.
In Proceedings of the 13th International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 405–411, 1990.
[7] Michael J. Flynn. Very high-speed computing systems. pages 519–527, 2000.
[8] O. Frieder, D. Grossman, A. Chowdhury, and G. Frieder. Efficiency considerations for
scalable information retrieval servers. Journal of Digital Information, 1(5), January
2000.
[9] Byeong-Soo Jeong and Edward Omiecinski. Inverted file partitioning schemes in
multiple disk systems. IEEE Trans. Parallel Distrib. Syst., 6(2):142–153, 1995.
[10] Lung-Chi Lin. A Preliminary Study of Text Retrieval Techniques Utilizing Character/
Word Positions. Master’s thesis, Department of Information Management,
National Taiwan University, 2000.
[11] Mauricio Marin. Parallel Text Query Processing using Composite Inverted Lists.
Master’s thesis, Computing Department, University of Magallanes, Chile, 2003.
[12] J. Eliot B. Moss. Design of the mneme persistent object store. ACM Trans. Inf.
Syst., 8(2):103–139, 1990.
[13] M. F. Porter. An algorithm for suffix stripping. pages 313–316, 1997.
[14] E. Rasmussen. Parallel information processing. In Annual Review of Information
Science and Technology, 27:99–130, 1992.
[15] Gerard Salton, Edward A. Fox, and Harry Wu. Extended boolean information retrieval.
Commun. ACM, 26(11):1022–1036, 1983.
[16] David B. Skillicorn, Jonathan M. D. Hill, and W. F. McColl. Questions and Answers
about BSP. Scientific Programming, 6(3):249–274, Fall 1997.
[17] Ohm Sornil. Hybrid Partition Inverted Files for Large-Scale Digital Libraries. Master’s
thesis, Virginia Polytechnic and State University(Virginia Tech), 2001.
[18] Ohm Sornil. Parallel Inverted Index for Large-Scale, Dynamic Digital Libraries.
Master’s thesis, Blacksburg, Virginia, 2001.
[19] C. Stanfill. Partitioned posting files: a parallel inverted file structure for information
retrieval. In SIGIR ’90: Proceedings of the 13th annual international ACM SIGIR
conference on Research and development in information retrieval, pages 413–428.
ACM Press, 1990.
[20] C. Stanfill, R. Thau, and D. Waltz. A parallel indexed algorithm for information
retrieval. In SIGIR ’89: Proceedings of the 12th annual international ACM SIGIR
conference on Research and development in information retrieval, pages 88–97. ACM
Press, 1989.
[21] Anthony Tomasic and Hector Garcia-Molina. Performance of inverted indices in
shared-nothing distributed text document informatioon retrieval systems. In PDIS
’93: Proceedings of the second international conference on Parallel and distributed
information systems, pages 8–17. IEEE Computer Society Press, 1993.
[22] Anthony Tomasic, Hector Garcia-Molina, and Kurt Shoens. Incremental updates of
inverted lists for text document retrieval. In SIGMOD ’94: Proceedings of the 1994
ACM SIGMOD international conference on Management of data, pages 289–300.
ACM Press, 1994.
[23] Anthony Slavko Tomasic. Distributed queries and incremental updates in information
retrieval systems. PhD thesis, 1994.
[24] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM,
33(8):103–111, 1990.
[25] Ching-Lin Yu. Sequence-Based Text Retrieval: Design and Implementation. Master’s
thesis, Department of Information Management, National Taiwan University, 2002.
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top