National Digital Library of Theses and Dissertations in Taiwan

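Graduate student: 周政瑩
Graduate student (English name): Cheng-Ying Chou
Thesis title (Chinese): 經由資料分析及評分來改善搜尋引擎
Thesis title (English): Search Engine Improvement through Data Analysis and Scoring
Advisor: 吳昇
Degree: Master's
Institution: National Chung Cheng University
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Sub-discipline: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Publication year: 2005
Graduation academic year: 93 (ROC calendar)
Language: English
Number of pages: 48
Keywords (Chinese): 搜尋引擎, 冗餘排除, 資料品質, 排名, 評分, 搜尋品質
Keywords (English): search engine, redundancy elimination, data quality, ranking, scoring, search quality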
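Abstract (translated from Chinese): With the rapid development of the Internet, data on the Web is growing very quickly. However, this data is full of useless information such as copies and advertisements, so search engines need more space to store data and indices. In this study, besides computing the degree of overlap between web pages from the number of repeated sentences, we also filter redundant content within the same site. The results show that 32% of the data was useless and was removed; the precision of the search engine improved noticeably, and the recall of the top ten search results was almost unaffected.

To improve the ranking quality of the search engine, we compute a page score and a site score. Page scoring: we define three word-count levels (low, middle, high), assign each good page to its level, analyze the three levels through data analysis, and then derive a scoring method for each page according to the level it belongs to. Site scoring: we compute a value for each site based on how it is referenced by web pages. Experimental results show that ranking is much improved over the original.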
Owing to the rapid development of the World Wide Web, web data has been growing very quickly. However, much of it is useless data such as copies and advertisements, so more disk space is needed to store data and indices. In this thesis, we not only compute textual overlap by counting the number of sentences that pages share, but also filter out redundant data shared by pages of the same site, such as navigation bars. Experiments show that redundant data accounts for about 32% of the text size. After it is removed, the precision of the search engine increases, while the recall of the top ten search results stays almost the same.
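The two redundancy checks described above (sentence overlap between pages, and filtering of content repeated across one site) can be illustrated with a minimal sketch. This is not the thesis's implementation: the punctuation-based sentence splitter, the function names, and the `min_share` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split text into normalized sentences (assumed: a simple punctuation-based splitter)."""
    parts = re.split(r"[.!?。!?\n]+", text)
    # Lowercase and collapse whitespace so trivially different copies compare equal.
    return [" ".join(p.lower().split()) for p in parts if p.strip()]


def overlap_ratio(page_a, page_b):
    """Fraction of page_a's distinct sentences that also occur in page_b."""
    a = set(split_sentences(page_a))
    b = set(split_sentences(page_b))
    if not a:
        return 0.0
    return len(a & b) / len(a)


def strip_site_boilerplate(pages, min_share=0.8):
    """Drop sentences found on at least min_share of one site's pages.

    min_share is a hypothetical threshold; the abstract does not state one.
    """
    sentence_lists = [split_sentences(p) for p in pages]
    # Count each sentence once per page, so repeats inside one page do not inflate it.
    counts = Counter(s for sents in sentence_lists for s in set(sents))
    threshold = min_share * len(pages)
    return [[s for s in sents if counts[s] < threshold] for sents in sentence_lists]
```

For example, `overlap_ratio` compares the distinct sentences of one page against another, while `strip_site_boilerplate` takes all pages of a single site and removes sentences, such as a navigation bar, that recur on most of them.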

To improve the order of the search results, we calculate a page score and a site score for each page. We define three levels of word count (low, middle, high) and analyze the characteristics of good pages at each level; each page is then scored according to the level it belongs to. The site score is computed according to how many pages reference the site. The experiments show that ranking with scoring is superior to ranking without scoring.
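As a rough illustration of the two inputs to scoring, the sketch below assigns a page to a word-count level and counts how many pages reference each site. The level boundaries (`LOW_MAX`, `MIDDLE_MAX`), the function names, and the choice to ignore links within the same site are assumptions; the abstract does not give the actual thresholds or weighting used in the thesis.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical word-count boundaries; the abstract does not give the real ones.
LOW_MAX = 200
MIDDLE_MAX = 1000


def word_count_level(text):
    """Classify a page as 'low', 'middle', or 'high' by its word count."""
    n = len(text.split())
    if n <= LOW_MAX:
        return "low"
    if n <= MIDDLE_MAX:
        return "middle"
    return "high"


def site_reference_counts(links):
    """Count distinct referring pages per target site.

    links is an iterable of (source_page_url, target_page_url) pairs.
    Links within the same site are ignored (an assumption made here).
    """
    referrers = defaultdict(set)
    for src, dst in links:
        src_site = urlparse(src).netloc
        dst_site = urlparse(dst).netloc
        if src_site and dst_site and src_site != dst_site:
            referrers[dst_site].add(src)
    return {site: len(pages) for site, pages in referrers.items()}
```

A ranking function could then combine a per-level page score with the site's reference count; how the thesis weights the two is not stated in the abstract, so that step is left out here.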
ABSTRACT (CHINESE)
ABSTRACT (ENGLISH)
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
2. RELATED WORKS
3. REDUNDANCY ELIMINATION
3.1 Record Format
3.2 Sentence identification and sorting
3.3 Redundancy detection and removal
3.3.1 Cross-Comparison Detection Module
3.3.2 Intra-Site Detection Module
3.4 Elimination Ratio
4. SCORING
4.1 Page score
4.1.1 Word count definition
4.1.2 Page metrics
4.1.3 Profiles of the standard page
4.1.4 Good page criteria
4.1.5 Other page metrics
4.1.6 Page score calculation
4.2 Site score
4.2.1 Reference weight
4.2.2 Adjust site score
5. EXPERIMENTS
5.1 Data sets and evaluation measures
5.2 Experiments Results
5.2.1 Results of the redundancy elimination
5.2.2 Results of the scoring
6. CONCLUSIONS AND FUTURE WORK
6.1 Conclusions
6.2 Future work
REFERENCES
APPENDIX A
A.1 Record field illustration
APPENDIX B
B.1 Good page selection system
B.2 Data analysis of good pages
APPENDIX C
C.1 SE-comp interface
[1] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York, 1994.
[2] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. In Proc. of the Sixth International World Wide Web Conference (WWW6), pages 391-404.
[3] K. Bharat and A. Z. Broder. A study of host pairs with replicated content. In Proc. of 8th International Conference on World Wide Web (WWW99), May 1999.
[4] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proc. of 2nd International Conference in Theory and Practice of Digital Libraries, June 1995.
[5] N. Shivakumar and H. Garcia-Molina. Building a scalable and accurate copy detection mechanism. In Proc. of 1st ACM Conference on Digital Libraries (DL'96), March 1996.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International World Wide Web Conference, pages 107-117, 1998.
[7] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies, Working Paper 1999-0120, 1998.
[8] B. Yuwono and D. L. Lee. Server ranking for distributed text retrieval systems on the Internet. In Proceedings of the 5th International Conference on Data Engineering (ICDE), pages 164-171, New Orleans, USA, 1996.
[9] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 668-677, January 1998.
[10] GAIS. http://gais.cs.ccu.edu.tw
[11] M. O. Rabin. Fingerprinting by random polynomials. Center for Research in Computing Technology, Harvard University, Report TR-15-81, 1981.
[12] D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems (TODS), 8(2): 255-265, June 1983.
[13] M. Y. Ivory, R. Sinha, and M. A. Hearst. Preliminary findings on quantitative measures for distinguishing highly rated information-centric web pages. In Proceedings of the 6th Conference on Human Factors and the Web, 2000.
[14] S.-H. Lin and J.-M. Ho. Discovering informative content blocks from Web documents. KDD-02, 2002.
[15] L. Yi and B. Liu. Eliminating noisy information in Web pages for data mining. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 296-305, ACM Press, 2003.
[16] U. Y. Nahm, M. Bilenko, and R. J. Mooney. Two approaches to handling noisy variation in text mining. ICML-2002 Workshop on Text Learning, 2002.
[17] E. H. Chi, P. Pirolli, and J. Pitkow. The scent of a site: A system for analyzing and predicting information scent, usage, and usability of a web site. In Proceedings of ACM CHI 00 Conference on Human Factors in Computing Systems, 2000.
[18] M. C. Drott. Using web server logs to improve site design. In ACM 16th International Conference on Systems Documentation, Getting Feedback on your Web Site, pages 43-50, 1998.
[19] J. Scholtz and S. Laskowski. Developing usability tools and techniques for designing and testing web sites. In Proceedings of the 4th Conference on Human Factors and the Web, 1998.
[20] Y. L. Theng and G. Marsden. Authoring tools: Towards continuous usability testing of web documents. In Proceedings of the 1st International Workshop on Hypermedia Development, 1998.
[21] H. Thimbleby. Gentler: A tool for systematic web authoring. International Journal of Human-Computer Studies, 47(1): 139-168, 1997.