
National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

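Graduate student: Kuo-Ho Wang (王國河)
Thesis title: A New Method for Handling Missing Values in Large Databases by Integrating Clustering and Regression Techniques
Thesis title (Chinese): 整合叢集與迴歸技術以處理大型資料庫遺失值問題之新方法
Advisor: Shin-Mu Tseng (曾新穆)
Degree: Master's
Institution: National Cheng Kung University (國立成功大學)
Department: Department of Computer Science and Information Engineering (Master's and Doctoral Program)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2002
Academic year of graduation: 90 (ROC calendar, i.e. 2001-2002)
Language: Chinese
Number of pages: 72
Keywords: data mining (資料探勘), missing value (遺失值), clustering analysis (叢集分析), regression analysis (迴歸分析), data cleaning (資料清理)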
Data mining is currently a very active research area, concerned mainly with extracting useful knowledge from large databases. When a database contains missing values, however, the quality of the data mining analysis can be seriously degraded, so handling missing values properly is an important issue. Although many methods for handling missing values have been proposed, none of them handles every type of missing value well, because different types of datasets may call for different solutions. This thesis proposes a new method for handling missing values in datasets that exhibit cluster characteristics. The method integrates clustering analysis and regression analysis so that missing values can be recovered suitably when the dataset has some kind of cluster property. Empirical evaluation shows that the proposed method recovers missing values better than previous methods across various types of datasets.
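The general idea of combining clustering with regression to recover missing values can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the thesis's RC method: the thesis outline lists the CAST algorithm for clustering, whereas this sketch swaps in a plain k-means; it assumes a single numeric column with missing entries while all other columns are complete; and the names `kmeans`, `nearest`, and `impute_cluster_regression` are hypothetical.

```python
import numpy as np


def nearest(points, centroids):
    """Index of the closest centroid for every row of `points`."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means (a stand-in for CAST, which is NOT implemented here)."""
    # Farthest-point initialisation: deterministic, no random seed needed.
    centroids = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        labels = nearest(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def impute_cluster_regression(data, target, k):
    """Fill NaNs in column `target`; assumes every other column is complete."""
    data = np.asarray(data, dtype=float).copy()
    features = np.delete(data, target, axis=1)
    missing = np.isnan(data[:, target])
    # Cluster the complete records using the fully observed columns only.
    centroids = kmeans(features[~missing], k)
    labels = nearest(features, centroids)
    for j in range(k):
        train = (labels == j) & ~missing  # complete records in cluster j
        fill = (labels == j) & missing    # records to recover in cluster j
        if not fill.any():
            continue
        if train.sum() < features.shape[1] + 1:
            # Too few complete records to fit a regression: use the observed mean.
            data[fill, target] = data[~missing, target].mean()
            continue
        # Least-squares fit of the target on the features (with intercept).
        X = np.column_stack([np.ones(train.sum()), features[train]])
        coef, *_ = np.linalg.lstsq(X, data[train, target], rcond=None)
        Xf = np.column_stack([np.ones(fill.sum()), features[fill]])
        data[fill, target] = Xf @ coef
    return data
```

Fitting one regression per cluster, rather than one for the whole table, lets each group of similar records have its own relationship between the observed columns and the missing one, which is the intuition the abstract describes for datasets with cluster properties.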
English Abstract
Chinese Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures

Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Method
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Definition and Classification of Missing Values
2.2 Methods for Handling Missing Values
2.2.1 Deleting Entire Records with Missing Values
2.2.2 Deleting Fields with Missing Values
2.2.3 Mean Imputation
2.2.4 Within-Group Imputation
2.2.5 Out-of-Group Imputation
2.2.6 Hot-Deck Imputation
2.2.7 Cold-Deck Imputation
2.2.8 Substitution Imputation
2.2.9 Regression Imputation
2.2.10 Expectation-Maximization Likelihood Imputation
2.2.11 Composite Imputation
2.2.12 Multiple Imputation
2.2.13 Principal Component Analysis
2.2.14 Decision Trees
2.2.15 Association Rules
2.2.16 Machine Learning (機械式學習)
2.2.17 Cluster Analysis
2.2.18 Summary
2.3 Similarity Measures
2.3.1 Distance Measures
2.3.2 Correlation Coefficient
Chapter 3 Research Method and Design
3.1 Concept of the Method
3.1.1 Handling Missing Values with Cluster Analysis
3.1.2 Handling Missing Values with Multiple Imputation
3.2 The CAST Algorithm
3.3 Multiple Regression Analysis
3.4 The New Method RC
Chapter 4 Experimental Design
4.1 Clustered Data Generator
4.2 Evaluation Metrics
4.2.1 Theil's Prediction Measure (賽爾預測法)
4.2.2 Root Mean Square Error
4.2.3 Mean Absolute Error
4.2.4 Relative Absolute Error
4.2.5 Closeness Rate (接近率法)
Chapter 5 Experimental Results and Discussion
5.1 Experimental Settings for Clustered Datasets
5.1.1 Experiment 1: Base Model
5.1.2 Experiment 2: Varying the Data Size
5.1.3 Experiment 3: Varying the Number of Fields
5.1.4 Experiment 4: Varying the Dispersion
5.1.5 Experiment 5: Varying the Number of Clusters
5.1.6 Experiment 6: Execution Efficiency Comparison
5.1.7 Conclusions of the Clustered Dataset Experiments
5.2 Experiments on Randomly Generated Datasets
5.3 Experiments on Real Datasets
5.4 Experimental Conclusions
Chapter 6 Conclusions and Future Research Directions
6.1 Contributions and Conclusions
6.2 Future Research Directions
References