National Digital Library of Theses and Dissertations in Taiwan

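Author: Wen-yu Feng (馮文榆)
Thesis title: Detecting the Data Group most Prone to a Specific Disguise Value
Thesis title (Chinese): 特定偽缺漏值好發生之資料群集之偵測
Advisor: Wen-yang Lin (林文揚)
Degree: Master's
Institution: National University of Kaohsiung
Department: Department of Computer Science and Information Engineering, Master's Program
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2013
Graduation academic year: 101
Language: English
Number of pages: 50
Keywords: data cleansing; data mining; data quality; disguised missing data; genetic algorithms; missing at random; unbiased sampling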
Disguised missing data is a special kind of missing data: the data field is not empty, but the value it holds does not reflect the facts. The presence of disguised missing data may severely bias analysis results, so detecting disguise values has become an important issue in data cleansing. Following the taxonomy proposed by Little and Rubin, disguised missing data can be classified into three categories: missing completely at random, missing at random, and missing not at random. Previous work on detecting disguised missing data has focused on the first type; the other two types have not been studied.
In this thesis, we present a variant of the problem of detecting the second type of disguised missing data, namely finding the data group most prone to a specific disguise value. We formalize this problem as an optimization problem and propose a method based on genetic algorithms to solve it. Experiments on two real datasets show that the proposed method can effectively discover the data group most prone to the occurrence of a given disguise value.
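The idea of searching for a data group with a genetic algorithm can be illustrated with a minimal sketch. It is not the method of the thesis: the toy records, the wildcard chromosome encoding, the `MIN_SUPPORT` threshold, and the proportion-based `fitness` function are all assumptions made for illustration. The thesis defines its own fitness function, which the table of contents suggests builds on the embedded unbiased sample heuristic.

```python
import random

# Toy records: (gender, age_band, blood_pressure). A blood pressure of "0" is
# treated as the disguise value, standing in for an unrecorded measurement.
RECORDS = [
    ("F", "young", "0"), ("F", "young", "0"), ("F", "young", "0"),
    ("F", "young", "120"), ("F", "old", "130"), ("F", "old", "0"),
    ("M", "young", "125"), ("M", "young", "118"), ("M", "old", "140"),
    ("M", "old", "135"), ("M", "old", "0"), ("F", "old", "128"),
]
TARGET = 2            # index of the attribute that may hold the disguise value
DISGUISE = "0"
GROUP_ATTRS = [0, 1]  # attributes used to describe a data group
DOMAINS = {a: sorted({r[a] for r in RECORDS}) for a in GROUP_ATTRS}
MIN_SUPPORT = 3       # hypothetical threshold that rules out tiny groups


def members(chrom):
    """Records matching every non-wildcard gene (None is a wildcard)."""
    return [r for r in RECORDS
            if all(g is None or r[a] == g for a, g in zip(GROUP_ATTRS, chrom))]


def fitness(chrom):
    """Share of the group's records carrying the disguise value (0 if too small)."""
    group = members(chrom)
    if len(group) < MIN_SUPPORT:
        return 0.0
    return sum(r[TARGET] == DISGUISE for r in group) / len(group)


def random_gene(attr, rng):
    """Pick a wildcard or one concrete value from the attribute's domain."""
    return rng.choice([None] + DOMAINS[attr])


def evolve(generations=30, pop_size=8, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # Seed the population with the all-wildcard group so the result is never
    # worse than "the whole table".
    pop = [tuple(None for _ in GROUP_ATTRS)]
    pop += [tuple(random_gene(a, rng) for a in GROUP_ATTRS)
            for _ in range(pop_size - 1)]
    for _ in range(generations):
        elite = max(pop, key=fitness)
        children = [elite]  # elitism: carry the best chromosome forward
        while len(children) < pop_size:
            # Binary tournament selection for each parent.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Uniform crossover followed by per-gene mutation.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            for i, a in enumerate(GROUP_ATTRS):
                if rng.random() < mutation_rate:
                    child[i] = random_gene(a, rng)
            children.append(tuple(child))
        pop = children
    return max(pop, key=fitness)


best = evolve()
print(best, fitness(best))
```

Each chromosome describes a group as a conjunction of attribute values, and the support threshold keeps the search from rewarding groups that are too small to be meaningful. Because the population is seeded with the all-wildcard group and the best chromosome is always carried forward, the returned group never scores below the table as a whole.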
Acknowledgements
Abstract (Chinese)
Abstract
Contents
Contents of Figures
Contents of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Thesis Organizations
Chapter 2 Background and Related Work
2.1 Missing Data
2.2 Disguised Missing Data
2.3 Genetic Algorithms
2.4 Related Work
Chapter 3 Embedded Unbiased Sample Based Detection of Disguise Value
3.1 Embedded Unbiased Sample Heuristic
3.2 CBSQS: Measurement of Unbiased Sample
3.3 The EUS Algorithm
Chapter 4 Problem Description
4.1 Preliminary
4.2 Formal Definition
Chapter 5 The Proposed GA-based Detection Method
5.1 General Framework
5.2 Chromosome Representation
5.3 Evolutionary Operations
5.4 Fitness Function
5.5 Candidate Pruning
Chapter 6 Experiments and Analysis
6.1 Experimental Results on Execution Time
6.2 Experimental Results on Solution Correctness
Chapter 7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
References
[1] R. Belen, “Detecting disguised missing data,” Master Thesis, The Middle East Technical University, February, 2009.
[2] R. Belen and T. T. Temizel, “A framework to detect disguised missing data,” in Knowledge Discovery Practices and Emerging Applications of Data Mining: Trends and New Domains, A. V. Senthil Kumar, Ed. USA: IGI Global, 2010, pp. 1-22.
[3] FDA Adverse Event Reporting System, Available: http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm083765.htm, [Jun. 19, 2013].
[4] Hsin Chu General Hospital, Department of Health, Executive Yuan, R.O.C., Available: https://dss.hch.gov.tw/other8.asp, [Jun. 19, 2013].
[5] J. Holland, Adaptation in Natural and Artificial Systems, Cambridge, MA: MIT Press, 1992.
[6] M. Hua and J. Pei, “Cleaning disguised missing data: a heuristic approach,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 950-958.
[7] M. Hua and J. Pei, “DiMaC: A system for cleaning disguised missing data,” in International Conference on Management of Data, Vancouver, BC, Canada, June 2008, pp. 9-12.
[8] R. Little and D. Rubin, Statistical Analysis with Missing Data, Wiley Publishers, New York, 1987.
[9] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs. Berlin: Springer, 1994.
[10] M. Mitchell, An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press, 1996.
[11] H. Muhlenbein, “How genetic algorithms really work: I. Mutation and hillclimbing,” in Parallel Problem Solving from Nature 2, Reinhard Manner, Bernard Manderick, Eds. Brussels, Belgium, September 1992, pp. 15–25.
[12] K. Natarajan, J. Li, and A. Koronios, “Detecting mis-entered values in large data sets,” in Proceedings of the 4th World Congress on Engineering Asset Management, Athens, Greece, 2009, pp. 805-812.
[13] R. K. Pearson, “The problem of disguised missing data,” in ACM SIGKDD Explorations Newsletter, Vol. 8, No. 1, pp. 83-92, June 2006.
[14] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Boston, MA: Pearson Education, Inc., 2006.
[15] UCI Machine Learning Repository: Pima Indians Diabetes Data Set, Available: http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes, [Jun. 19, 2013].