跳到主要內容

臺灣博碩士論文加值系統

(18.97.14.87) 您好!臺灣時間:2025/03/17 13:43
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

: 
twitterline
研究生:鄭哲聖
研究生(外文):Cher-Sheng Cheng
論文名稱:大型資訊檢索系統之轉置檔案設計
論文名稱(外文):Inverted File Design for Large-Scale Information Retrieval System
指導教授:單智君
指導教授(外文):Jean Jyh-Jiun Shann
學位類別:博士
校院名稱:國立交通大學
系所名稱:資訊工程系所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2005
畢業學年度:93
語文別:英文
論文頁數:135
中文關鍵詞:資訊檢索轉置檔案轉置檔案壓縮搜尋引擎查詢處理平行資訊檢索
外文關鍵詞:Information RetrievalInverted FileInverted File CompressionSearch EngineQuery EvaluationParallel IR
相關次數:
  • 被引用被引用:0
  • 點閱點閱:237
  • 評分評分:
  • 下載下載:20
  • 收藏至我的研究室書目清單書目收藏:4
本論文主要在探討各種改善資訊檢索效率的技術。最近幾年來,資訊檢索系統已廣泛地使用於各種應用中,例如:搜尋引擎、數位圖書館、基因序列分析等。為了在大量的資料中有效率地搜尋,資訊檢索系統採用已壓縮的轉置檔案來迅速地找到使用者所需要的資料。在轉置檔案中,每一個字彙都有一個相對應的文件編號串列(稱為轉置串列)來指示那一個文件包含這個字彙。大型資訊檢索系統的查詢處理時間大多花在讀取與解壓縮各個出現在查詢中的字彙所對應到的轉置串列。由於每新增一個文件就會使得出現在文件中的字彙所對應的轉置串列長度增加,因此轉置串列的長度與文件數目呈現正比關係。這意味著查詢處理時間與文件數目亦呈現正比關係。所以,發展有效率的演算法來降低轉置串列的處理(讀取、解壓縮、與合併)時間就成了設計大型資訊檢索系統的成功關鍵。

本論文將探討下列的研究議題:
(1) 發展一個有效率的編碼方法來縮減轉置檔案所佔用的空間
在這個議題中我們透過轉置檔案的壓縮來減少磁碟的輸出與輸入所需的時間並藉以改善查詢處理時間,但這卻會帶來額外的解壓縮時間。本議題的目標即是設計一個可節省空間並可快速解碼的方法來壓縮轉置檔案。我們以壓縮率最高的內插編碼法為基礎,採用遞迴移除與迴圈展開的技巧來加速內插編碼法的編碼與解碼速度。實驗顯示與其他已知的編碼法比較,我們所提的方法提供了快速的解碼與良好的壓縮效能。
(2) 發展雙層可跳躍式轉置檔案來除去多餘的解碼
在這個議題中我們提出一個雙層可跳躍式轉置檔案來減少查詢處理時所需的轉置串列解壓縮時間。這個議題所面臨的困難是利用目前的可跳躍式機制來實作雙層可跳躍式轉置檔案所需耗用的空間太大。為了設計一個節省空間並可有效加速查詢處理的雙層可跳躍式轉置檔案,在以區塊空間預估為基礎,我們發展了一個新的跳躍機制。這個機制可以搭配目前已知的可跳躍式機制在相當小的空間耗用下實作雙層可跳躍式轉置檔案。實驗顯示我們所提的雙層可跳躍式轉置檔案可以同時加速連接式布林查詢與排名查詢。
(3) 利用文件編號來使得轉置檔案最佳化
在這個議題中我們提出一個文件編號演算法以加速查詢地處理速度。我們觀察到透過指派合適的編號給文件可以讓轉置串列在使用相同的編碼方法下被壓縮的更好,並提升查詢處理的速度。本議題我們提出一個新的演算法,稱為Partition-based document identifier assignment (PBDIA) 演算法,來為文件產生合適的編號。這個演算法可以有效率地指派連續的編號給那些包含有經常被查詢的字彙之文件,使得經查被查詢的字彙之轉置串列可以被壓縮得更好。實驗顯示我們所提的PBDIA演算法可以有效縮短查詢處理時間。
(4) 發展平行資訊檢索系統上的轉置檔案切割方法
在這個議題中我們針對平行資訊檢索系統提出一個轉置檔案切割方法。叢集系統利用分散在各工作站上的轉置檔案,以平行計算的方式處理查詢。此演算法的目的是降低處理查詢地平均時間。我們首先採用PBDIA演算法讓包含經查被查詢的字彙之文件可以被指派連續的編號。然後,再採用交錯式切割方案來分配轉置檔案到各工作站上。實驗顯示利用此步驟切割轉置檔案,可以達到負載與儲存量的平衡,以及幾近理想的速度提升。

本論文之研究成果包括:
• 在縮減轉置檔案所佔用的空間方面,我們所提出的編碼方法除了可提供優越的壓縮效果外,在查詢處理速度上也比目前最常使用的Golomb coding還快了大約30%。
• 在除去多餘的解碼方面,我們所提出的雙層可跳躍式轉置檔案比起目前的單層可跳躍式轉置檔案在連接式布林查詢的處理速度上最高可提升16%,而在排名查詢的處理速度上最高可提升44%。
• 在轉置檔案最佳化方面,我們所提出的PBDIA演算法可以在數秒的時間內為1GB大小的文件集產生合適的文件編號並使得查詢處理速度最高可提升25%。
• 在平行資訊檢索方面,我們所提出轉置檔案切割步驟可以改善只使用交錯式切割方案的平行查詢處理速度達14%到17%,無論一個叢集有多少台工作站。
This dissertation investigates a variety of techniques to improve efficiency in information retrieval (IR). Information retrieval systems (IRSs) are widely used in many applications, such as search engines, digital libraries, genomic sequence analyses, etc. To efficiently search vast amount of data, a compressed inverted file is used in an IRS to locate the desired data quickly. An inverted file contains, for each distinct term in the collection, a posting list. The query processing time of a large-scale IRS is dominated by the time needed to read and decompress the posting list for each query term. Moreover, adding a document into the collection is to add one document identifier into the posting list for each term appearing in the document, hence the length of a posting list increases with the size of document collection. This implies that the time needed to process posting lists increase as the size of document collection grows. Therefore, efficient approaches to reduce the time needed to read, decompress, and merge the posting lists are the key issues in designing a large-scale IRS. Research topics to be studied in this dissertation are
(1) Efficient coding method for inverted file size reduction
The first topic is to propose a novel size reduction method for compressing inverted files. Compressing an inverted file can greatly improve query performance by reducing disk I/Os, but this adds to the decompression time required. The objective of this topic is to develop a method that has both the advantages of compression ratio and fast decompression. Our approach is as follows. The foundation is interpolative coding, which compresses the document identifiers with a recursive process taking care of clustering property and yields superior compression. However, interpolative coding is computationally expensive due to a stack required in its implementation. The key idea of our proposed method is to facilitate coding and decoding processes for interpolative coding by using recursion elimination and loop unwinding. Experimental results show that our method provides fast decoding speed and excellent compression.
(2) Two-level skipped inverted file for redundant decoding elimination
The second topic is to propose a two-level skipped inverted file, in which a two-level skipped index is created on each compressed posting list, to reduce decompression time. A two-level skipped index can greatly reduce decompress time by skipping over unnecessary portions of the list. However, well-known skipping mechanisms are unable to efficiently implement the two-level skipped index due to their high storage overheads. The objective of this topic is to develop a space-economical two-level skipped inverted file to eliminate redundant decoding and allow fast query evaluation. For this purpose, we propose a novel skipping mechanism based on block size calculation, which can create a skipped index on each compressed posting list with very little or no storage overhead, particularly if the posting list is divided into very small blocks. Using a combination of our skipping mechanism and well-known skipping mechanisms can implement a two-level skipped index with very little storage overheads. Experimental results showed that using such a two-level skipped index can simultaneously allow extremely fast query processing of both conjunctive Boolean queries and ranked queries.
(3) Document identifier assignment algorithm design for inverted file optimization
The third topic is to propose a document identifier assignment (DIA) algorithm for fast query evaluation. We observe that a good DIA can make the document identifiers in the posting lists more clustered, and result in better compression as well as shorter query processing time. The objective of this topic is to develop a fast algorithm that finds an optimal DIA to minimize the average query processing time in an IRS. In a typical IRS, the distribution of query terms is skewed. Based on this fact, we propose a partition-based DIA (PBDIA) algorithm, which can efficiently assign consecutive document identifiers to those documents containing frequently used query terms. Therefore, the posting lists for frequently used query terms can be compressed better without increasing the complexity of decoding processes. This can result in reduced query processing time.
(4) Inverted file partitioning for parallel IR
The fourth topic is to propose an inverted file partitioning approach for parallel IR. The inverted file is generally partitioned into disjoint sub-files, each for one workstation, in an IRS that runs on a cluster of workstations. When processing a query, all workstations have to consult only their own sub-files in parallel. The objective of this topic is to develop an inverted file partitioning approach that minimizes the average query processing time of parallel query processing. Our approach is as follows. The foundation is interleaving partitioning scheme, which generates a partitioned inverted file with interleaved mapping rule and produces a near-ideal speedup. The key idea of our proposed approach is to use the PBDIA algorithm to enhance the clustering property of posting lists for frequently used query terms before performing the interleaving partitioning scheme. This can aid the interleaving partitioning scheme to produce superior query performance.

The results of this dissertation include:
• For inverted file size reduction, the proposed coding method allows query throughput rate of approximately 30% higher than well-known Golomb coding and still provides superior compression.
• For redundant decoding elimination, the proposed two-level skipped inverted file improves the query speed for conjunctive Boolean queries by up to 16%, and for ranked queries up to 44%, compared with the conventional one-level skipped inverted file.
• For inverted file optimization, the PBDIA algorithm only takes a few seconds to generate a DIA for a collection of 1GB, and improves query speed by up to 25%.
• For parallel IR, the proposed approach can further improve the parallel query speed for interleaving partitioning scheme by 14% to 17% no matter how many workstations are in the cluster.
Contents
Abstract in Chinese 9
Abstract 11
Contents 15
List of Figures 17
List of Tables 18
Chapter 1 Introduction 19
1.1 Background: IRS Runs on Cluster of Workstations 19
1.2 Objective: Inverted File Design for Large-Scale Information Retrieval System 22
1.3 Research Topics 25
1.4 Dissertation Organization 28
Chapter 2 Inverted File Size Reduction 29
2.1 Well-known Interpolative Coding 31
2.1.1 Algorithm description 31
2.1.2 Observation and improvement 33
2.1.3 Remarks 36
2.2 Proposed Method: Unique-Order Interpolative Coding 36
2.2.1 The coding method 37
2.2.2 Illustration 39
2.2.3 Implementation optimization 42
2.3 Quantitative Analysis 43
2.4 Performance Evaluation 48
2.4.1 Document collections and queries 49
2.4.2 Performance results 50
2.5 Other Application 57
2.6 Summary 58
Chapter 3 Redundant Decoding Elimination 59
3.1 Two Well-known Skipping Mechanisms and Their Posting List Structures 62
3.1.1 Skipped inverted file 63
3.1.2 Blocked inverted file 64
3.1.3 Remarks 65
3.2 Test Data 66
3.2.1 Conjunctive Boolean queries 66
3.2.2 Ranked queries 67
3.3 Proposed Two-level Skipped Inverted Files 67
3.3.1 Framework of proposed approach 67
3.3.2 Proposed skipping mechanism 69
3.4 Performance Evaluation 75
3.4.1 Sizes for various inverted file organizations 75
3.4.2 Elapsed time required to process queries 77
3.5 Summary 81
Chapter 4 Inverted File Optimization 82
4.1 General Framework 83
4.2 Document Identifier Assignment Problem and Its Algorithm 87
4.2.1 Problem mathematical formulation 87
4.2.2 Solving DIA problem via the well-known Greedy-NN algorithm 89
4.3 Partition-based Document Identifier Assignment Algorithm 94
4.3.1 Generating an optimal DIA for a single query term 94
4.3.2 Efficient PBDIA algorithm for DIA problem 96
4.4 Performance Evaluation 102
4.4.1 Document collections and queries 102
4.4.2 Performance results 104
4.5 Summary 111
Chapter 5 Parallel IR 112
5.1 Inverted File Partitioning Problem 113
5.2 Fundamental: Interleaving Partitioning Scheme 114
5.2.1 Algorithm description 114
5.2.2 How to improve parallel query processing through document identifier assignment 116
5.3 Framework of Proposed Approach 118
5.4 Performance Evaluation 119
5.4.1 Test collection and query set 119
5.4.2 Performance results 120
5.5 Summary 122
Chapter 6 Conclusions 124
6.1 Dissertation Summary 124
6.2 Contribution and Suggested Work 128
References 130
[1] Aalbersberg, I.J. & Sijstermans, F. (1990). InfoGuide: A full-text document retrieval system. In A.M. Tjoa & R. Wagner (Eds.), Proceedings of the international conference of database and expert systems applications (DEXA'90), (pp.12-21). Berlin: Springer-Verlag.
[2] Anh, V.N. & Moffat, A. (1998). Compressed inverted files with reduced decoding overheads. In R. Wilkinson, B. Croft, and C.V. Rijsbergen (Eds.), Proceedings of the 21st annual international ACM SIGIR conference on Research and Development in Information Retrieval, (pp. 290-297), Melbourne. New York: ACM Press.
[3] Anh, V.N. & Moffat, A. (2005). Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1), 151-166.
[4] Bell, T.C., Moffat, A., Nevill-Manning, C.G., Witten, I.H., and Zobel, J. (1993). Data compression in full-text retrieval systems. Journal of the American Society for Information Science, 44(9), 508-531.
[5] Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. (1999). Web caching and zipf-like distributions: evidence and implications. In Proceedings of Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE INFOCOM '99), (pp. 126-134), New York, Mar. Los Alamitos, CA: IEEE Computer Society Press.
[6] Brown, E.W., Callan, J.P., and Croft, W.B. (1994). Fast incremental indexing for full-text information retrieval. In Proceedings of the 20th Very Large Data Base Conference (VLDB'94) , (pp. 192-202).
[7] Cheng, C.S., Shann, J.J.J., and Chung, C.P. (2005). Unique-order interpolative coding for fast querying and space-efficient indexing in information retrieval systems. To appear in Information Processing and Management.
[8] Cheng, C.S., Shann, J.J.J., and Chung, C.P. (2004). A Unique-Order Interpolative Code for Fast Querying and Space-Efficient Indexing in Information Retrieval Systems. In P.K. Srimani et al. (Eds.), Proceedings of ITCC 2004 International Conference on Information Technology: Coding and Communications Volume 2, (pp. 229-235), Las Vegas, Nevada, Apr. Los Alamitos, CA: IEEE Computer Society Press.
[9] Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2), 194-203.
[10]Faloutsos, C. (1985). Access methods for text. ACM Computing Surveys, 17(1), 49-74.
[11]Fraenkel, A.S. & Klein, S.T. (1985). Novel Compression of sparse bit-string-Preliminary report. In A. Apostolico & Z. Galil (Eds.) Combinatorial Algorithms on Words: Vol. 12, NATO ASI Serials F. (pp. 169-183). Berlin: Springer-Verlag.
[12]Frakes, W.B. & Baeza-Yates, R. (1992). Information Retrieval: Data Structures and Algorithms. Upper Saddle River, NJ: Prentice Hall.
[13]Gallager, R.G. & Van Voorhis, D.C. (1975). Optimal source codes for geometrically distributed alphabets. IEEE Transactions on Information Theory, IT-21(2), 228-230.
[14]Gelbukh, A., Han, S.Y., and Sidorov, G. (2003). Compression of boolean inverted files by document ordering. In Proceedings of 2003 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLPKE-2003), (pp. 244-249), Beijing, China, Oct. Los Alamitos, CA: IEEE Computer Society Press.
[15]Golomb, S.W. (1966). Run Length Encoding. IEEE Transactions on Information Theory, IT-12(3), 399-401.
[16]Hawking, D. (1996). Document retrieval performance on parallel systems. In H.R. Arabnial, ed, Proceedings of the 1996 International Conference on Parallel and Distributed Processing Techniques and Applications, Sunnyvale, (pp. 1354-1365), California, August. Athens: CSREA Press.
[17]Hollaar, L.A. (1991). Special-purpose hardware for text searching: past experience, future potential. Information Processing & Management, 27 (4): 371-378.
[18]Janson, B.J., Spink, A., Bateman, J., and Saracevic, T. (1998). Real life information retrieval: a study of user queries on the Web. SIGIR Forum, 32(1), 5-17.
[19]Kobayashi, M. & Takeda, K. (2000). Information retrieval on the web. ACM Computing Surveys, 32(2), 144-173.
[20]Lawrence, S. & Giles, C. (1999). Accessibility of information on the web. Nature, 400, 107-109.
[21]Lovins, J.B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 22-31.
[22]Ma, Y.C., Chen, T.F., and Chung, C.P. (2002). Posting file partitioning and parallel information retrieval. Journal of Systems and Software, 63(2), 113-127.
[23]MacFarlane, A. (2000). Distributed inverted files and performance: a study of parallelism and data distribution methods in IR (Ph.D. thesis). London: City University.
[24]Mcllroy, M.D. (1982). Development of a spelling list. IEEE Transactions on Communications, COM-30(1), 91-99.
[25]Moffat, A. & Stuiver, L. (2000). Binary interpolative coding for effective index compression. Information Retrieval, 3(1), 25-47.
[26]Moffat, A. & Zobel, J. (1992). Parameterised compression for sparse bitmaps. In N. Belkin, P. Ingwersen, and A.M. Pejtersen (Eds.), Proceedings of 15th annual international ACM-SIGIR Conference on Research and Development in Information Retrieval, (pp. 274-285), Copenhagen, Jun. New York: ACM Press.
[27]Moffat, A. & Zobel J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4), 349-379.
[28]Moffat, A., Zobel, J., and Klein, S.T. (1995). Improved inverted file processing for large text databases. In R. Sacks-Davis & J. Zobel (Eds.), Proceedings of 6th Australasian Database Conference, (pp. 162-171), Adelaide, Australia, Jan.
[29]Olken, F. & Rotem, D. (1986). Rearranging data to maximize the efficiency of compression. In Proceedings of the fifth ACM SIGACT-SIGMOD symposium on Principles of database systems, (pp. 78-90), Cambridge, Massachusetts, United States, Mar. New York: ACM Press.
[30]Reddaway, S.F. (1991). High speed text retrieval from large databases on a massively parallel processor. Information Processing & Management, 27 (4): 311-316.
[31]Ribeiro-Neto, B., Moura, E.S., Neubert, M.S., and Ziviani, N. (1999). Efficient distributed algorithms to build inverted files. In M. Hearst, F. Gey, and R. Tong (Eds.), Proceedings for the 22nd International Conference on the Research and Development in Information Retrieval (SIGIR'99), (pp. 105-112). New York: ACM Press.
[32]Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Reading, Mass: Addison-Wesley.
[33]Salton, G. & McGill, M.J. (1983). Introduction to Modern Information Retrieval. New York: McGraw-Hill.
[34]Scholer, F., Williams, H.E., Yiannis, J., and Zobel, J. (2002). Compression of inverted indexes for fast query evaluation. In M. Beaulieu, R. Baeza-Yates, S.H. Myaeng, and K. Järvelin (Eds.), Proceedings of the 25th annual international ACM SIGIR conference on Research and Development in Information Retrieval, (pp. 222-229), Tampere, Finland. New York: ACM Press.
[35]Shieh, W.Y., Chen, T.F., Shann, J.J., and Chung, C.P. (2003). Inverted file compression through document identifier reassignment. Information Processing and Management, 39(1), 117-131.
[36]Stanfill, C., Thau, R., and Waltz, D. (1989). A parallel Indexed algorithm for Information Retrieval. In N.J. Belkin & C.J. Van Rijsbergen (Eds.), Proceedings of the 12th annual conference on research and development in Information Retrieval (SIGIR'89), (pp. 88-97). New York:ACM Press.
[37]Stanfill, C. & Thau, R. (1991). Information retrieval on the connection machine: 1 to 8192 Gigabytes. Information Processing & Management, 27 (4): 285-310.
[38]Tenenbaum, A.M., Langsam, Y., and Augenstein, M.J. (1990). Data structures using C. Englewood CLiffs, N.J. 07632: Prentice-Hall.
[39]Teuhola, J. (1978). A Compression method for clustered bit-vectors. Information Processing Letters, 7(6), 308-311.
[40]Trotman, A. (2003). Compressing inverted files. Information Retrieval, 6(1), 5-19.
[41]Turpin, A. (1998). Efficient prefix coding (Ph.D. thesis). Melbourne: University of Melbourne.
[42]Turtle, H. & Flood, J. (1995). Query evaluation: strategies and optimizations. Information Processing & Management, 31(6): 831-850.
[43]Voorhees, E. & Harman, D. (1997). Overview of the sixth text retrieval conference (TREC-6). In E.M. Voorhees & D.K. Harman (Eds.), Proceedings of the Sixth Text REtrieval Conference (TREC-6), (pp. 1-24). Gaithersburg, MD: NIST.
[44]Williams, H.E. & Zobel, J. (2002). Indexing and retrieval for genomic databases. IEEE Transactions on Knowledge and Data Engineering, 14(1), 63-78.
[45]Williams, H.E. & Zobel, J. (1999). Compressing integers for fast file access. The Computer Journal, 42(3), 193-201.
[46]Witten, I.H., Moffat, A., and Bell, T.C. (1999). Managing Gigabytes: Compressing and Indexing on Documents and Images, Second Edition. San Francisco, CA: Morgan Kaufmann Publishers.
[47]Wolfram, D. (1992). Applying informetric characteristics of databases to ir system file design, part i: informetric models. Information Processing and Management, 28(1), 121-133.
[48]Xie, Y. & O’Hallaron, D. (2002). Locality in search engine queries and its implications for caching. In P. Kermani, F. Bauer, and P. Morreale (Eds.), Proceedings of the 21th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM'02), (pp. 1238-1247), New York, Jun.
[49]Zipf G. (1949). Human Behavior and the Principle of Least Effort. New York: Addison-Wesley.
[50]Zobel, J. & Moffat, A. (1995). Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8), 891-903.
[51]Zobel, J., Moffat, A., and Ramamohanarao, K. (1998). Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4), 453-490.
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top