National Digital Library of Theses and Dissertations in Taiwan

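Author: Cheng-Chieh Chien (簡政傑)
Thesis title: 404 Error: Oh, not again! (噢!別再出現404錯誤訊息了!)
Advisor: Chiang Lee (李強)
Degree: Master's
Institution: National Cheng Kung University
Department: Department of Computer Science and Information Engineering (master's and doctoral program)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2002
Graduation academic year: 90
Language: Chinese
Pages: 67
Keywords: page comparison, information retrieval, search engine, lost link, 404 error, World Wide Web (WWW)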
Over the past decade, the World Wide Web (WWW) has become the most important medium for retrieving all kinds of information on the Internet. However, web pages may be moved or deleted, so that the URLs recorded on our personal computers or in search engine databases become obsolete.

This problem is called a "lost link", and it produces the "404" status code defined by the HTTP 1.1 protocol. At present, "404 error" messages occur frequently while browsing web pages. In this thesis, we address the problem by proposing a novel ranking technique called 2-dimensional distance.

Unlike previously proposed page comparison techniques, which consider only the text distance between pages, our 2-dimensional distance between two pages considers the style distance and the text distance simultaneously. Our experiments show that the 2-dimensional distance mechanism finds more accurate results. We also designed a prototype of a lost-link search engine based on 2-dimensional distance.
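The abstract does not give the thesis's actual definitions of text distance and style distance, so the following is only a minimal sketch of the general idea of ranking candidate pages by two distances at once. Everything in it is an assumption made for illustration: text distance is taken as one minus the Jaccard resemblance of word shingles, style distance as one minus the Jaccard resemblance of HTML tag paths, and the two are combined with a Euclidean norm. The function names (`two_dimensional_distance`, `rank_candidates`) are hypothetical and do not come from the thesis.

```python
import math
import re
from html.parser import HTMLParser


class _PageFeatures(HTMLParser):
    """Collects visible words and root-to-element tag paths from a page."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags
        self.paths = set()   # e.g. "html/body/p" (assumed style feature)
        self.words = []      # lower-cased words in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        self.words.extend(re.findall(r"\w+", data.lower()))


def shingles(words, w=3):
    """Set of contiguous w-word sequences (the shingle size is an assumption)."""
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|, defined as 0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def features(html, w=3):
    parser = _PageFeatures()
    parser.feed(html)
    parser.close()
    return shingles(parser.words, w), parser.paths


def two_dimensional_distance(html_a, html_b):
    """Returns (text distance, style distance, combined distance)."""
    text_a, style_a = features(html_a)
    text_b, style_b = features(html_b)
    text_d = jaccard_distance(text_a, text_b)
    style_d = jaccard_distance(style_a, style_b)
    # Assumed combination: Euclidean length of the (text, style) vector.
    return text_d, style_d, math.hypot(text_d, style_d)


def rank_candidates(lost_page, candidates):
    """Orders candidate pages from closest to farthest from the lost page."""
    return sorted(
        candidates,
        key=lambda page: two_dimensional_distance(lost_page, page)[2],
    )
```

In this sketch, a candidate that keeps both the wording and the markup of the cached copy of a lost page ranks ahead of one that keeps only the wording, which in turn ranks ahead of an unrelated page. A different weighting of the two dimensions would change the ordering, and the thesis may well use one.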
Abstract
Acknowledgements
Table of Contents
Table of Figures
Table of Tables
Table of Algorithms
1 Introduction
2 Related Work
2.1 Overlap of shingles
2.2 Improve Performance of Shingle-method
2.3 Mirror Site and Web Collection
2.4 Drawback
2.5 Pages-clustering Base on Suffix Tree
2.6 Our work and the difference
3 Main Work
3.1 Definitions and Design Concepts
3.2 Phase 1: Text Comparison Phase
3.3 Phase 2: Style Comparison Phase
3.3.1 Adjust Topology of PST
3.3.2 Adjust Order of Paths
3.4 Short Summary
4 Performance Study
4.1 Environment
4.2 Experiment 1: original data
4.3 Experiment 2: modified data
5 Implementation Issue
5.1 Related Technique
5.2 Program Manual
6 Conclusions
6.1 Conclusions
6.2 Future Work
A Appendix
Biography