National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 張育榮
Author (English): Yu-Jung Chang
Title: 運用模糊排序分析法作網頁自動分類之可調適特徵選取
Title (English): Scalable Feature Selection for Web Page Classification by Fuzzy Ranking Analysis
Advisor: 李漢銘
Advisor (English): Hahn-Ming Lee
Degree: Master's
Institution: National Taiwan University of Science and Technology
Department: Electronic Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 1999
Graduation Academic Year: 87
Language: English
Pages: 120
Keywords (Chinese): 特徵選取、網頁分類、排序分析
Keywords (English): feature selection; Web page classification; ranking analysis
Usage statistics:
  • Cited by: 3
  • Views: 271
  • Downloads: 0
  • Bookmarked: 5
Abstract (translated from the Chinese): Current Web search tools cannot simultaneously satisfy users' demands for both quality and quantity, and automatic Web page classifiers are an effective way to address this problem. The enormous scale of Web resources, however, confronts such classifiers with a severe challenge: Web page classification routinely involves millions of documents, tens of thousands of input dimensions (features), and hundreds of categories, so effective dimensionality-reduction techniques are the key to the problem. Among dimensionality-reduction techniques, we focus on feature selection and propose a new fuzzy ranking analysis paradigm. In addition to a ranking analysis evaluation method that makes the classification behavior of features easy to analyze, we propose two novel ways to improve feature selection: a two-level promotion technique based on fuzzy set theory, and a feature discriminating power measure designed specifically for scalable feature selection. Experimental results on 1,000 news articles from the China Times Online show that, with our method, the classifier maintains a 100% recognition rate and good accuracy (80.41%) on the test corpus even when the input dimensionality is reduced from 10,427 to 200.
To satisfy, both efficiently and effectively, the qualitative and quantitative demands of information requests on the WWW, automatic Web page classifiers are urgently needed. Such classifiers, however, suffer from the huge scale of the Web: they must handle millions of Web pages, tens of thousands of features, and hundreds of categories. In any practical implementation, dimensionality reduction is therefore indispensable, and it is also the major challenge. In this thesis, we propose a fuzzy ranking analysis paradigm accompanied by a novel relevance measure, feature discriminating power (FDP), to reduce the input dimensionality from tens of thousands to a few hundred with a zero rejection rate and only graceful degradation in accuracy. The method also supports scalable feature selection with a reasonable trade-off between accuracy and efficiency, and it is useful for evaluating relevance measures in general. Furthermore, a two-level promotion method is proposed to improve the behavior of each relevance measure and to combine several measures into a better evaluation of features. According to the experimental results, the FDP measure successfully reduces the input dimensionality from 10,427 to 200 with a zero rejection rate and less than a 5% drop (from 84.5% to 80.4%) in test accuracy. The results also show that the FDP measure reduces not only redundancy but also noise.
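The filter-style selection process both abstracts describe — score every feature with a relevance measure, rank the whole vocabulary, and keep only the top k features as classifier inputs — can be sketched as below. This is a minimal illustration only: it uses information gain as a stand-in relevance measure, not the thesis's FDP measure or the fuzzy two-level promotion, which are defined in Chapter 4 of the thesis itself.

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """Relevance of `term`: how much knowing its presence in a document
    reduces uncertainty about the document's class."""
    classes = sorted(set(labels))
    prior = entropy([labels.count(c) for c in classes])
    with_t = [lab for d, lab in zip(docs, labels) if term in d]
    without_t = [lab for d, lab in zip(docs, labels) if term not in d]
    cond = 0.0
    for subset in (with_t, without_t):
        if subset:
            cond += (len(subset) / len(docs)
                     * entropy([subset.count(c) for c in classes]))
    return prior - cond

def select_top_k(docs, labels, k):
    """Rank every term in the vocabulary by relevance, keep the best k.

    docs: list of token sets (one per document); labels: parallel class list.
    """
    vocab = sorted(set().union(*docs))  # sort first for deterministic ties
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]
```

On a toy corpus such as `docs = [{"stock", "market"}, {"stock", "price"}, {"goal", "match"}, {"team", "match"}]` with `labels = ["finance", "finance", "sports", "sports"]`, `select_top_k(docs, labels, 2)` keeps the two terms that perfectly separate the classes ("stock" and "match") and discards the rest — the same kind of ranking-and-truncation step that the thesis scales from 10,427 features down to 200.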
Acknowledgements iii
Contents iv
List of Figures vii
List of Tables ix
Nomenclature x
Notations xii
CHAPTER 1 Introduction 1
1.1 Motivation 1
1.1.1 The Demand for Retrieving High-Quality Information from Massive Web Documents 1
1.1.2 Dimensionality Reduction: The Major Challenge of Automatic Web Page Classifiers 2
1.2 Input-Dimensionality Reduction by Feature Selection 3
1.3 Our Goals and Design 4
1.3.1 Automatic Hierarchical Classification for Web Pages 4
1.3.2 Scalable Feature Selection for Both Accuracy and Efficiency 4
1.4 Organization 6
CHAPTER 2 Background 7
2.1 The Information-Overloading Problem on the Web 7
2.1.1 Explosive Growth of Web Pages 7
2.1.2 Problems of Current Search Tools on the Web 9
2.1.3 Automatic Web Page Classification: An Effective Remedy to Heal Information Overloading 10
2.2 Fuzzy Decision Making 11
2.2.1 Fuzzy Sets 11
2.2.2 Fuzzy Decision Making 12
2.2.3 Multicriteria Decision Making 13
2.3 Pattern Classification 16
2.3.1 Pattern Recognition 16
2.3.2 Interpretations and Types of Pattern Classification 16
2.3.3 Practical Steps of Pattern Classifiers 24
CHAPTER 3 Automatic Web Page Classifiers 27
3.1 Purpose and Considerations of Automatic Web Page Classifiers 27
3.2 Practical Design Issues 30
3.2.1 Preparation of Dataset 31
3.2.2 Representation of Web Pages 33
3.3 Feature Engineering 34
3.3.1 Feature Generation Stage 35
3.3.2 Feature Refinement Stage 36
3.3.3 Feature Utilization Stage 37
CHAPTER 4 Scalable Feature Selection 42
4.1 Feature Selection 42
4.1.1 The Quality Consideration of Feature Sets 42
4.1.2 Definitions 44
4.1.3 Methods and Approaches 48
4.2 Evaluation of Feature Relevance 51
4.2.1 The Behavior of Relevance Measures 51
4.2.2 Relevance Measures in Text Categorization 54
4.3 Feature Selection in Text Categorization 56
4.3.1 The Explicit-Filter Method 56
4.3.2 Problems in the Explicit-Filter Method 57
4.4 The Scalability of Input Features 59
4.4.1 The Meaning of Scalability 59
4.4.2 The Reason for Scalable Feature Selection 60
4.4.3 The Difficulty and Methods of Scalable Feature Selection 61
4.5 Fuzzy Ranking Analysis 62
4.5.1 Introduction 62
4.5.2 Ranking Analysis: The Scalability Test 64
4.5.3 Two-Level Promotion Based on Fuzzy Sets 65
4.6 A Novel Relevance Measure: Feature Discriminating Power 69
4.6.1 Characteristics of FDP 70
CHAPTER 5 Experiments 71
5.1 Experimental Design 71
5.1.1 Overview 71
5.1.2 Terminology and Evaluation Methods 72
5.2 The China Times Dataset 73
5.2.1 Overview 73
5.2.2 Classification Results without Feature Selection 74
5.2.3 Results of Scalability Test 75
5.2.4 Summary 86
CHAPTER 6 Discussion and Conclusion 87
6.1 Methods of Input-Dimensionality Reduction 87
6.2 Challenges of Automatic Content Analyzers on the Web 90
6.3 Conclusion 91
6.4 Further Work 92
REFERENCES 93
APPENDIX A Top 200 Selected Features Using FDP in China Times Dataset 102
APPENDIX B Results of Scalability Test Using FDP in China Times Dataset 104