National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 張育榮
Author (English): Yu-Jung Chang
Title: 運用模糊排序分析法作網頁自動分類之可調適特徵選取
Title (English): Scalable Feature Selection for Web Page Classification by Fuzzy Ranking Analysis
Advisor: 李漢銘
Advisor (English): Hahn-Ming Lee
Degree: Master's
Institution: National Taiwan University of Science and Technology
Department: Electronic Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 1999
Graduation Academic Year: 87
Language: English
Pages: 120
Keywords (Chinese): 特徵選取、網頁分類、排序分析
Keywords (English): feature selection; Web page classification; ranking analysis
Usage statistics:
  • Cited by: 3
  • Views: 271
  • Downloads: 0
  • Bookmarked: 5
Abstract (translated from the Chinese): Current Web search tools cannot simultaneously satisfy users' demands for both quality and quantity, and automatic Web page classifiers are an effective way to address this problem. The enormous scale of Web resources, however, confronts such classifiers with a severe challenge: Web page classification routinely involves millions of documents, tens of thousands of input dimensions (features), and hundreds of categories, so effective dimensionality-reduction techniques are the key to the problem. Among dimensionality-reduction techniques, we focus on feature selection and propose a new fuzzy ranking analysis paradigm. In addition to a ranking analysis evaluation method that makes the classification behavior of features easy to analyze, we propose two novel ways to improve feature selection: a two-level promotion technique based on fuzzy set theory, and a feature discriminating power measure designed specifically for scalable feature selection. Experimental results on 1,000 news articles from the China Times Online show that, with our method, the classifier maintains a 100% recognition rate and good accuracy (80.41%) on the test corpus even when the input dimensionality is reduced from 10,427 to 200.
To satisfy, both efficiently and effectively, the qualitative and quantitative demands of information requests on the WWW, automatic Web page classifiers are urgently needed. Such classifiers, however, suffer from the huge scale of the Web: they must handle millions of Web pages, tens of thousands of features, and hundreds of categories. In any practical implementation, dimensionality reduction is therefore indispensable, and it is also the major challenge. In this thesis, we propose a fuzzy ranking analysis paradigm accompanied by a novel relevance measure, feature discriminating power (FDP), to reduce the input dimensionality from tens of thousands to a few hundred with a zero rejection rate and only graceful degradation in accuracy. The method also supports scalable feature selection with a reasonable trade-off between accuracy and efficiency, and it is useful for evaluating relevance measures in general. Furthermore, a two-level promotion method is proposed to improve the behavior of each relevance measure and to combine several measures into a better evaluation of features. According to the experimental results, the FDP measure successfully reduces the input dimensionality from 10,427 to 200 with a zero rejection rate and less than a 5% drop (from 84.5% to 80.4%) in test accuracy. The results also show that the FDP measure reduces not only redundancy but also noise.
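The filter-style selection process both abstracts describe — score every feature with a relevance measure, rank the whole vocabulary, and keep only the top k features as classifier inputs — can be sketched as below. This is a minimal illustration only: it uses information gain as a stand-in relevance measure, not the thesis's FDP measure or the fuzzy two-level promotion, which are defined in Chapter 4 of the thesis itself.

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """Relevance of `term`: how much knowing its presence in a document
    reduces uncertainty about the document's class."""
    classes = sorted(set(labels))
    prior = entropy([labels.count(c) for c in classes])
    with_t = [lab for d, lab in zip(docs, labels) if term in d]
    without_t = [lab for d, lab in zip(docs, labels) if term not in d]
    cond = 0.0
    for subset in (with_t, without_t):
        if subset:
            cond += (len(subset) / len(docs)
                     * entropy([subset.count(c) for c in classes]))
    return prior - cond

def select_top_k(docs, labels, k):
    """Rank every term in the vocabulary by relevance, keep the best k.

    docs: list of token sets (one per document); labels: parallel class list.
    """
    vocab = sorted(set().union(*docs))  # sort first for deterministic ties
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]
```

On a toy corpus such as `docs = [{"stock", "market"}, {"stock", "price"}, {"goal", "match"}, {"team", "match"}]` with `labels = ["finance", "finance", "sports", "sports"]`, `select_top_k(docs, labels, 2)` keeps the two terms that perfectly separate the classes ("stock" and "match") and discards the rest — the same kind of ranking-and-truncation step that the thesis scales from 10,427 features down to 200.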
Acknowledgements iii
Contents iv
List of Figures vii
List of Tables ix
Nomenclature x
Notations xii
CHAPTER 1 Introduction 1
1.1 Motivation 1
1.1.1 The Demand for Retrieving High-Quality Information from Massive Web Documents 1
1.1.2 Dimensionality Reduction: The Major Challenge of Automatic Web Page Classifiers 2
1.2 Input-Dimensionality Reduction by Feature Selection 3
1.3 Our Goals and Design 4
1.3.1 Automatic Hierarchical Classification for Web Pages 4
1.3.2 Scalable Feature Selection for Both Accuracy and Efficiency 4
1.4 Organization 6
CHAPTER 2 Background 7
2.1 The Information-Overloading Problem on the Web 7
2.1.1 Explosive Growth of Web Pages 7
2.1.2 Problems of Current Search Tools on the Web 9
2.1.3 Automatic Web Page Classification: An Effective Remedy to Heal Information Overloading 10
2.2 Fuzzy Decision Making 11
2.2.1 Fuzzy Sets 11
2.2.2 Fuzzy Decision Making 12
2.2.3 Multicriteria Decision Making 13
2.3 Pattern Classification 16
2.3.1 Pattern Recognition 16
2.3.2 Interpretations and Types of Pattern Classification 16
2.3.3 Practical Steps of Pattern Classifiers 24
CHAPTER 3 Automatic Web Page Classifiers 27
3.1 Purpose and Considerations of Automatic Web Page Classifiers 27
3.2 Practical Design Issues 30
3.2.1 Preparation of Dataset 31
3.2.2 Representation of Web Pages 33
3.3 Feature Engineering 34
3.3.1 Feature Generation Stage 35
3.3.2 Feature Refinement Stage 36
3.3.3 Feature Utilization Stage 37
CHAPTER 4 Scalable Feature Selection 42
4.1 Feature Selection 42
4.1.1 The Quality Consideration of Feature Sets 42
4.1.2 Definitions 44
4.1.3 Methods and Approaches 48
4.2 Evaluation of Feature Relevance 51
4.2.1 The Behavior of Relevance Measures 51
4.2.2 Relevance Measures in Text Categorization 54
4.3 Feature Selection in Text Categorization 56
4.3.1 The Explicit-Filter Method 56
4.3.2 Problems in the Explicit-Filter Method 57
4.4 The Scalability of Input Features 59
4.4.1 The Meaning of Scalability 59
4.4.2 The Reason for Scalable Feature Selection 60
4.4.3 The Difficulty and Methods of Scalable Feature Selection 61
4.5 Fuzzy Ranking Analysis 62
4.5.1 Introduction 62
4.5.2 Ranking Analysis: The Scalability Test 64
4.5.3 Two-Level Promotion Based on Fuzzy Sets 65
4.6 A Novel Relevance Measure: Feature Discriminating Power 69
4.6.1 Characteristics of FDP 70
CHAPTER 5 Experiments 71
5.1 Experimental Design 71
5.1.1 Overview 71
5.1.2 Terminology and Evaluation Methods 72
5.2 The China Times Dataset 73
5.2.1 Overview 73
5.2.2 Classification Results without Feature Selection 74
5.2.3 Results of Scalability Test 75
5.2.4 Summary 86
CHAPTER 6 Discussion and Conclusion 87
6.1 Methods of Input-Dimensionality Reduction 87
6.2 Challenges of Automatic Content Analyzers on the Web 90
6.3 Conclusion 91
6.4 Further Work 92
REFERENCES 93
APPENDIX A Top 200 Selected Features Using FDP in China Times Dataset 102
APPENDIX B Results of Scalability Test Using FDP in China Times Dataset 104