National Digital Library of Theses and Dissertations in Taiwan

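Graduate student: Yen-Wei Su (蘇彥暐)
Thesis title: Using feature selection and classification approaches in credit score and spam filtering (結合特徵擷取與分類技術於信用評分與過濾垃圾郵件之應用)
Advisor: Ping-Feng Pai (白炳豐)
Degree: Master's
Institution: National Chi Nan University (國立暨南國際大學)
Department: Department of Information Management
Discipline: Computer Science
Field: General Computer Science
Thesis type: Academic thesis
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Pages: 61
Keywords (Chinese): 信用評分, 垃圾郵件, 支援向量機, 倒傳遞網路
Keywords (English): Credit Score, Spam, Support Vector Machine, Back-Propagation Network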
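Data mining techniques are now widely used, for example in spam filtering, credit risk assessment, medical diagnosis, financial analysis, and industry. Applying them lets each field predict likely outcomes effectively and accurately in support of decision-making and analysis. Within data mining, classification is an essential capability. This study applies several common classification techniques to credit card data and spam data, and combines them with feature selection so that the raw data are screened before classification, with the aim of improving classification accuracy. Credit scoring is the indicator many banks use to gauge an applicant's risk of default or late payment; its purpose is to determine whether a customer carries such risk and so avoid losses to the company. Customer data, however, may contain unimportant features that lead to misjudgment. This study therefore analyzes credit card data with a hybrid approach: factor analysis first selects the attributes that matter, and classification is then performed on the reduced data to raise classification accuracy.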
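The screen-then-classify idea described above can be illustrated with a minimal sketch. It is not the procedure used in the thesis: ranking features by their correlation with the label stands in for factor analysis, a 1-nearest-neighbour rule stands in for the classifiers, and the data, function names, and column meanings are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def select_features(rows, labels, k):
    """Indices of the k columns most correlated (in absolute value) with the label.

    A simple stand-in for the screening step; the thesis uses factor analysis.
    """
    n_cols = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def project(row, cols):
    """Keep only the selected columns of one record."""
    return [row[j] for j in cols]

def predict_1nn(train_rows, train_labels, query):
    """Label of the training record nearest to query (Euclidean distance)."""
    nearest = min(range(len(train_rows)),
                  key=lambda i: math.dist(train_rows[i], query))
    return train_labels[nearest]

# Hypothetical toy data: columns 0 and 2 track the label, column 1 is noise.
rows = [
    [0.1, 5.0, 0.2],
    [0.2, 1.0, 0.1],
    [0.9, 4.0, 0.8],
    [0.8, 2.0, 0.9],
]
labels = [0, 0, 1, 1]
query = [0.85, 5.0, 0.85]

selected = select_features(rows, labels, k=2)
raw_prediction = predict_1nn(rows, labels, query)
filtered_prediction = predict_1nn(
    [project(r, selected) for r in rows], labels, project(query, selected)
)
print(selected, raw_prediction, filtered_prediction)
```

On this toy data the noisy column pulls the unscreened query toward a record of the wrong class, while the screened query lands next to records of the matching class. A real experiment would evaluate on held-out data rather than a single query.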
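The second focus of this study is spam filtering. In recent years spam has plagued Internet users, whose mailboxes often receive large volumes of it. Many anti-spam techniques have therefore been proposed, such as back-propagation networks, decision trees, Bayesian filtering, and support vector machines. Filtering mechanisms also differ widely in the features they use as criteria, for example blacklists and whitelists, keywords in the subject line, or keywords in the message body. This study combines feature selection with classification techniques to filter spam, in the hope that the hybrid approach judges messages more accurately than a single filtering technique. As for the criteria used to decide whether a message is spam, the study examines whether the choice of message-content features alone affects filtering accuracy. Three feature sets are compared: (1) keywords in the message; (2) the number of occurrences of specific symbols in the message and the total length of the message text; and (3) the combination of the two. Experiments in the two domains confirm that combining feature selection with classification effectively improves classification accuracy and also identifies the more representative key attributes.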
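The three content-based feature sets compared above can be sketched as follows. The keyword list and symbol list are hypothetical placeholders, since the record does not say which keywords or symbols the thesis used, and plain substring counting is an assumption made only to keep the example short.

```python
# Hypothetical vocabularies; the thesis's actual keywords and symbols are not given here.
KEYWORDS = ("free", "winner", "offer")
SYMBOLS = ("!", "$")

def keyword_features(body):
    """Feature set 1: occurrences of each keyword (case-insensitive substring count)."""
    text = body.lower()
    return [text.count(word) for word in KEYWORDS]

def heuristic_features(body):
    """Feature set 2: occurrences of each symbol, followed by the total text length."""
    return [body.count(symbol) for symbol in SYMBOLS] + [len(body)]

def combined_features(body):
    """Feature set 3: keyword features followed by heuristic features."""
    return keyword_features(body) + heuristic_features(body)

body = "FREE offer!!! Win $100 now"
print(keyword_features(body))    # counts for "free", "winner", "offer"
print(heuristic_features(body))  # counts for "!", "$", then len(body)
print(combined_features(body))
```

Each function returns a fixed-length numeric vector, so any of the three can be fed to the same classifier and the resulting accuracies compared, which is the comparison the abstract describes.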
Data mining has recently become a popular technology. It can be applied to spam filtering, credit risk, medicine, financial forecasting, industry, and other areas, and it can forecast various situations precisely and effectively. Classification is an important area of data mining. In this research, we use classification tools to filter spam and credit card data, and combine them with feature selection to find the important attributes in the original data and improve accuracy. A credit score is an index for judging an applicant's risk of default; its purpose is to avoid losses to companies or banks. In a large customer database, however, noise (irrelevant attributes) may cause errors. We therefore propose a hybrid method that uses factor analysis to filter out the noise before classification.
The second major topic is spam filtering. Because spam constantly besets users, many filtering techniques, such as back-propagation networks, decision trees, Bayesian filtering, and support vector machines, have been used against it. Filtering targets also differ, for example blacklists and whitelists, mail headings, or keywords in the mail content. We therefore combine feature selection with filtering methods to improve classification accuracy. The content of the mail is the target in this experiment, and three feature sets are considered: (1) the keywords in the mail, (2) heuristic features, and (3) both. We compare the three conditions across different filtering approaches to find the best classification result. The results of the two experiments indicate that using feature selection together with classification approaches can improve accuracy and reduce noise.
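Chinese Abstract
Abstract
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Applications of Data Mining Techniques
2.2 Credit Scoring
2.2.1 Definition and Function of Credit Scoring
2.2.2 Literature on Classification Techniques Applied to Credit Scoring
2.3 Spam
2.3.1 Definition and Origin of Spam
2.3.2 Literature Review of Classification Techniques Applied to Spam
2.3.3 E-mail Content Format and Keyword Extraction
2.4 Literature on Feature Selection
Chapter 3 Research Methods and Framework
3.1 Factor Analysis
3.2 Back-Propagation Neural Network
3.3 Support Vector Machine
3.4 Discriminant Analysis
3.5 Rough Set Theory
3.6 Nearest Neighbor Algorithm
3.7 Methods and Framework Used in This Study
Chapter 4 Experimental Results and Analysis
4.1 Credit Card Experiment Setup and Flowchart
4.1.1 Credit Card Data Analysis and Preprocessing
4.1.2 Feature Selection for Credit Data
4.1.4 Credit Scoring Experimental Results and Analysis
4.2 Spam Experiment Setup and Flowchart
4.2.2 Spam Data Analysis and Preprocessing
4.2.3 Feature Selection for Spam
4.2.4 Analysis of Spam Experimental Results
Chapter 5 Conclusions and Suggestions
5.1 Conclusions
5.2 Future Directions
Chinese References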