National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

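Graduate student: 馮文榆 (Wen-yu Feng)
Thesis title: 特定偽缺漏值好發生之資料群集之偵測
Thesis title (English): Detecting the Data Group most Prone to a Specific Disguise Value
Advisor: 林文揚 (Wen-yang Lin)
Degree: Master's
Institution: 國立高雄大學 (National University of Kaohsiung)
Department: 資訊工程學系碩士班 (Master's Program, Department of Computer Science and Information Engineering)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2013
Graduation academic year: 101 (ROC calendar)
Language: English
Keywords: data cleansing, data mining, data quality, disguised missing data, genetic algorithms, missing at random, unbiased sampling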


Abstract:
Disguised missing data is a special kind of missing data: the field is not empty, but the value it holds does not reflect the truth. Its presence can severely bias analysis results, so detecting disguise values has become an important issue in data cleansing. Following the taxonomy of missing data proposed by Little and Rubin, disguised missing data can be classified into three categories: missing completely at random, missing at random, and missing not at random. Previous work on detecting disguised missing data has focused on the first type; the other two types have not been investigated.
In this thesis, we present a variant of the problem of detecting the second type of disguised missing data (missing at random), namely finding the data group most prone to a specific disguise value. We formalize this problem as an optimization problem and propose a genetic-algorithm-based method to solve it. Experiments on two real datasets show that the proposed method can effectively discover the data group most prone to the occurrence of a given disguise value.
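To make the optimization view in the abstract concrete, here is a minimal sketch of a genetic-algorithm search for the data group in which a given disguise value is most over-represented. It is an illustration under stated assumptions, not the thesis's method: the chromosome encoding (one gene per attribute, `None` meaning "any value"), the fitness score (excess rate of the disguise value weighted by the square root of group size), the function names, and the toy `clinic`/`shift`/`zip` data are all hypothetical. The thesis's own fitness function builds on its embedded unbiased sample measure, which this record does not specify.

```python
import random
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]
Chromosome = Tuple[Optional[str], ...]  # one gene per attribute; None = "any value"


def matches(record: Record, attrs: List[str], chrom: Chromosome) -> bool:
    """True if the record satisfies every non-wildcard condition of the group."""
    return all(g is None or record[a] == g for a, g in zip(attrs, chrom))


def fitness(chrom: Chromosome, data: List[Record], attrs: List[str],
            target: str, value: str, min_support: int = 5) -> float:
    """Stand-in score (an assumption): how much more often `value` appears
    in the group than overall, weighted by sqrt(group size)."""
    group = [r for r in data if matches(r, attrs, chrom)]
    if len(group) < min_support:
        return 0.0
    overall = sum(r[target] == value for r in data) / len(data)
    in_group = sum(r[target] == value for r in group) / len(group)
    return max(0.0, in_group - overall) * len(group) ** 0.5


def find_prone_group(data: List[Record], attrs: List[str], target: str,
                     value: str, pop_size: int = 40, generations: int = 60,
                     mutation_rate: float = 0.2, seed: int = 0):
    rng = random.Random(seed)
    domains = [sorted({r[a] for r in data}) for a in attrs]
    cache: Dict[Chromosome, float] = {}

    def random_gene(i: int) -> Optional[str]:
        return rng.choice([None] + domains[i])

    def score(c: Chromosome) -> float:
        if c not in cache:
            cache[c] = fitness(c, data, attrs, target, value)
        return cache[c]

    population = [tuple(random_gene(i) for i in range(len(attrs)))
                  for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        next_pop = [best]  # elitism: the best group so far always survives
        while len(next_pop) < pop_size:
            # tournament selection of two parents
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            # uniform crossover, then per-gene mutation
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            for i in range(len(child)):
                if rng.random() < mutation_rate:
                    child[i] = random_gene(i)
            next_pop.append(tuple(child))
        population = next_pop
        best = max(population, key=score)
    conditions = {a: g for a, g in zip(attrs, best) if g is not None}
    return conditions, score(best)


# Toy data (hypothetical): the disguise value "00000" is planted mostly
# in records with clinic=B and shift=night.
data: List[Record] = []
for clinic in ["A", "B", "C"]:
    for shift in ["day", "night"]:
        planted = 40 if (clinic, shift) == ("B", "night") else 2
        for i in range(50):
            zip_code = "00000" if i < planted else "12345"
            data.append({"clinic": clinic, "shift": shift, "zip": zip_code})

best_group, best_score = find_prone_group(data, ["clinic", "shift"], "zip", "00000")
print(best_group, round(best_score, 3))
```

On this toy data the search space is tiny, so the search should recover the planted group. The square-root weighting is one simple way to keep very small groups with an extreme rate from dominating; other trade-offs between rate and support are equally plausible.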

Table of contents:
Acknowledgments
Abstract (Chinese)
Abstract
Contents
Contents of Figures
Contents of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Thesis Organizations
Chapter 2 Background and Related Work
2.1 Missing Data
2.2 Disguised Missing Data
2.3 Genetic Algorithms
2.4 Related Work
Chapter 3 Embedded Unbiased Sample Based Detection of Disguise Value
3.1 Embedded Unbiased Sample Heuristic
3.2 CBSQS: Measurement of Unbiased Sample
3.3 The EUS Algorithm
Chapter 4 Problem Description
4.1 Preliminary
4.2 Formal Definition
Chapter 5 The Proposed GA-based Detection Method
5.1 General Framework
5.2 Chromosome Representation
5.3 Evolutionary Operations
5.4 Fitness Function
5.5 Candidate Pruning
Chapter 6 Experiments and Analysis
6.1 Experimental Results on Execution Time
6.2 Experimental Results on Solution Correctness
Chapter 7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
References

References:
[1] R. Belen, “Detecting disguised missing data,” Master Thesis, The Middle East Technical University, February 2009.
[2] R. Belen and T. T. Temizel, “A framework to detect disguised missing data,” in Knowledge Discovery Practices and Emerging Applications of Data Mining: Trends and New Domains, A. V. Senthil Kumar, Ed. USA: IGI Global, 2010, pp. 1-22.
[3] FDA Adverse Event Reporting System, Available: http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm083765.htm, [Jun. 19, 2013].
[4] Hsin Chu General Hospital, Department of Health, Executive Yuan, R.O.C., Available: https://dss.hch.gov.tw/other8.asp, [Jun. 19, 2013].
[5] J. Holland, Adaptation in Natural and Artificial Systems. Cambridge, MA: MIT Press, 1992.
[6] M. Hua and J. Pei, “Cleaning disguised missing data: a heuristic approach,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 950-958.
[7] M. Hua and J. Pei, “DiMaC: A system for cleaning disguised missing data,” in International Conference on Management of Data, Vancouver, BC, Canada, June 2008, pp. 9-12.
[8] R. Little and D. Rubin, Statistical Analysis with Missing Data. New York: Wiley Publishers, 1987.
[9] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs. Berlin: Springer, 1994.
[10] M. Mitchell, An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press, 1996.
[11] H. Muhlenbein, “How genetic algorithms really work: I. Mutation and hillclimbing,” in Parallel Problem Solving from Nature 2, Reinhard Manner and Bernard Manderick, Eds. Brussels, Belgium, September 1992, pp. 15-25.
[12] K. Natarajan, J. Li, and A. Koronios, “Detecting mis-entered values in large data sets,” in Proceedings of the 4th World Congress on Engineering Asset Management, Athens, Greece, 2009, pp. 805-812.
[13] R. K. Pearson, “The problem of disguised missing data,” ACM SIGKDD Explorations Newsletter, Vol. 8, No. 1, pp. 83-92, June 2006.
[14] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Boston, MA: Pearson Education, Inc., 2006.
[15] UCI Machine Learning Repository: Pima Indians Diabetes Data Set, Available: http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes, [Jun. 19, 2013].
