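Author: 欒富安 (Fuan Luan)
Title (Chinese): 結合動態分群與字詞類型權重觀念的分散式新聞查詢模型
Title (English): A Distributed News Retrieval Model by Integrating Dynamical Clustering with Term Weighting Concepts
Advisor: 洪智力 (Chihli Hung)
Degree: Master's
Institution: Chung Yuan Christian University (中原大學)
Department: Graduate Institute of Information Management
Discipline: Computing
Field: General computing
Thesis type: Academic thesis
Year of publication: 2009
Graduation academic year: 97 (ROC calendar)
Language: Chinese
Pages: 80
Keywords (translated from Chinese): Distributed computing; Grid; Dynamic clustering
Keywords (English): Dynamic Clustering; Grid; Distributed Computing; EVSM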
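Traditional single-machine processing is limited by the computing capacity of the hardware. Faced with a large volume of data, it can only support keyword retrieval or retrieval over data that has been classified in advance. To build a concept-based retrieval environment that can handle large volumes of data, this thesis develops an environment for document storage and concept-based retrieval under a grid computing architecture, improving the usefulness of the returned results. For the returned results, the study applies term-type weighting (Extended Significance Vector Model; ESVM) and uses X-means for dynamic clustering, improving the readability of what the system returns. When the data are held centrally, the matrix formed by the large number of documents and the terms occurring in them is too large for the memory of most computers. Partitioning the data sources across a distributed environment greatly reduces the dimensionality of the term matrix, which makes concept-based retrieval feasible in a distributed setting. The experiments compare different types, different numbers of clusters, and centralized versus distributed modes. They record the time from the start of retrieval to the completion of clustering and analyze the time spent on text preprocessing and on clustering. The results show that the number of clusters has little effect on text preprocessing time but a considerable effect on clustering time: with fewer clusters the time drops markedly, although the similarity of documents within a cluster also drops slightly. Differences in type size have little effect on time or on intra-cluster similarity. Distributed clustering effectively reduces processing time, but intra-cluster document similarity decreases further.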
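The dimensionality argument above can be illustrated with a minimal sketch. It assumes a plain term-count matrix, a round-robin split across two hypothetical nodes, and a four-document toy corpus. None of these come from the thesis, whose actual partitioning scheme and weighting are not described in this record.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, rows); each row holds one document's term counts."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc.split()).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows


def partition(docs, n_nodes):
    """Round-robin split of the collection across hypothetical nodes (an assumption)."""
    return [docs[i::n_nodes] for i in range(n_nodes)]


# Toy corpus, invented for illustration only.
docs = [
    "oil prices rise on supply fears",
    "central bank holds interest rates",
    "oil supply cut lifts prices",
    "bank shares fall as rates rise",
]

# Centralized case: one matrix over the whole collection.
full_vocab, full_rows = term_document_matrix(docs)
full_cells = len(full_rows) * len(full_vocab)

# Distributed case: each node builds a matrix over its own share only.
node_cells = []
for node_docs in partition(docs, 2):
    vocab, rows = term_document_matrix(node_docs)
    node_cells.append(len(rows) * len(vocab))

print("centralized cells:", full_cells)
print("cells per node:", node_cells)
```

Each node sees fewer documents and a smaller vocabulary, so the matrix any single machine must hold shrinks in both dimensions. The trade-off is that nodes no longer share one vocabulary, which may be one reason similarity within clusters can suffer in the distributed mode.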
The storage capacity required for data is growing rapidly as more information becomes available in electronic form. A traditional data retrieval system compares word forms to produce search results. Although this approach can handle huge amounts of information, it still suffers from the semantic problem. A conceptual retrieval system, on the other hand, can improve on it by using the vector space model (VSM), but such a system is subject to the curse of dimensionality. To handle large amounts of data, we first develop a conceptual retrieval environment on a grid computing structure to improve the effectiveness of a search system based on the vector space model. Next, we cluster the results with the X-means dynamical clustering model. The document space vector model and the extended significance vector model (ESVM) are used to improve the readability of the system's search results.

In this research, we evaluate our models across different retrieved types, numbers of clusters, and centralized versus distributed modes, in terms of time and clustering similarity. Our experiments show that the number of clusters has little impact on text preprocessing time but a large impact on clustering time: the time is significantly reduced when the number of clusters decreases. In that case, however, the similarity between documents in the same cluster is slightly reduced. The size of the retrieved type has only a small effect on clustering time and similarity. Distributed clustering can greatly improve processing efficiency.
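One way to read the "clustering similarity" measure is as the mean pairwise cosine similarity of documents within each cluster, averaged over clusters. This record does not state the exact formula, so the metric, the function names, and the toy vectors below are assumptions for illustration only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_intra_cluster_similarity(vectors, labels):
    """Average, over clusters, of the mean pairwise cosine similarity inside each cluster.

    Clusters with a single member have no pairs and are skipped.
    """
    clusters = {}
    for vec, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vec)
    per_cluster = []
    for members in clusters.values():
        pairs = list(combinations(members, 2))
        if pairs:
            per_cluster.append(sum(cosine(u, v) for u, v in pairs) / len(pairs))
    return sum(per_cluster) / len(per_cluster) if per_cluster else 0.0


# Toy document vectors: two documents per topic (invented for illustration).
vectors = [[1, 0], [1, 0], [0, 1], [0, 1]]

two_clusters = mean_intra_cluster_similarity(vectors, [0, 0, 1, 1])
one_cluster = mean_intra_cluster_similarity(vectors, [0, 0, 0, 0])
print(two_clusters, one_cluster)
```

In this toy case, merging everything into one cluster lowers the score, which matches the direction of the reported finding that fewer clusters give slightly lower similarity within a cluster.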
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Problem
1.3 Research Objectives
1.4 Research Structure
Chapter 2 Literature Review
2.1 Information Retrieval
2.1.1 Development of Information Retrieval
2.1.2 TF * IDF
2.1.3 Probabilistic Computation
2.1.4 Vector Space Model
2.1.5 Latent Semantic Index
2.1.6 TREC
2.1.7 Lucene IR Tools
2.2 Distributed Computing
2.2.1 GRID
2.2.2 Globus Toolkit
2.3 Integration of Retrieved Data
2.3.1 Ranking Using Index Files
2.3.2 Returning All Data and Then Ranking
2.3.3 Clustering
2.3.4 Using Data Mining Techniques
2.3.5 Query Expansion
2.3.6 Others
2.4 Clustering Methods and X-means
2.4.1 Clustering Methods
2.4.2 WEKA
2.4.3 Dynamic Clustering (X-means)
2.5 Term-Type Weighting (Extended Significance Vector Model)
Chapter 3 Research Method
3.1 Research Process
3.1.1 Data Preparation
3.1.2 Screening and Filtering
3.1.3 Building the Prototype System
3.1.4 Experimental Testing
3.1.5 Parameter Tuning
3.1.6 Evaluating Results
3.1.7 Conclusions and Recommendations
3.2 Description of Experimental Data
3.3 Distributed News Retrieval Model
3.3.1 Term-Type Weighting
3.3.2 Re-ranking by Clustering
Chapter 4 Implementation of the System Architecture
4.1 Prototype System Development Environment
4.2 Prototype System Architecture
4.3 Prototype System Operating Workflow
4.4 Prototype System Program Flow
4.4.1 Data Retrieval Provider
4.4.2 Grid Service Provider
4.4.3 Clustering Result Verifier
Chapter 5 System Testing and Evaluation
5.1 Clustering Analysis Method
5.2 System Performance Analysis
5.2.1 Effect of the Number of Types on Time
5.2.2 Effect of Distributed Computing on Time
5.2.3 Effect of Cluster Size on Time
5.3 Cluster Similarity Analysis
Chapter 6 Conclusion
6.1 Research Results
6.2 Research Limitations
6.3 Future Work
6.3.1 Revising the Workflow to Improve System Performance
6.3.2 Improving the Clustering Input Variables
6.3.3 Adding an Automatic Summarization Mechanism
6.3.4 Improving News Clustering Precision
References


List of Figures

Figure 1-1 Research flowchart
Figure 2-1 Vector representation
Figure 2-2 Conceptual diagram of Latent Semantic Index
Figure 2-3 Conceptual diagram of grid formation
Figure 2-4 Globus ToolKit 4 architecture
Figure 2-5 Distributed index and centralized index
Figure 2-6 pSearch using LSI to generate an index
Figure 2-7 Flowchart of the clustering approach
Figure 2-8 Types of clustering methods
Figure 2-9 Term-type frequency
Figure 2-10 ESVM computation process
Figure 3-1 Experiment flowchart
Figure 3-2 Hierarchical structure of the 135 subcategories in TOPICS
Figure 3-3 System workflow
Figure 3-4 Term-type-weighted news information retrieval model
Figure 3-5 Flowchart of term-type-weighted news information retrieval
Figure 3-6 Illustration of the term-type frequency matrix
Figure 3-7 Illustration of term-type weighting
Figure 3-8 Dynamic clustering method
Figure 4-1 Experimental environment design
Figure 4-2 Data retrieval provider flowchart in centralized clustering mode
Figure 4-3 Data retrieval provider flowchart in distributed clustering mode
Figure 4-4 Grid service provider flowchart in centralized clustering mode
Figure 4-5 Grid service provider flowchart in distributed clustering mode
Figure 4-6 Clustering result verifier flowchart
Figure 5-1 Centralized clustering and distributed clustering
Figure 5-2 Error termination screen from the single-machine experiment
Figure 5-3 Computation time for each mode
Figure 5-4 Cumulative computation time for each mode
Figure 5-5 Average document similarity within clusters

List of Tables

Table 3-1 Description of the Reuters dataset contents
Table 4-1 Experimental environment requirements
Table 5-1 Types of computation modes