Author: Chang, Wei Yuan
Thesis title: A Data-driven Framework on Correlating Air Pollution Indices and Cancer Statistics
Advisor: Arbee L.P. Chen
Keywords: air pollution; cancer statistics; health care; data analysis; data driven; data as a service
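According to the Global Health Risks report published by the World Health Organization, environmental issues are among the problems the world most urgently needs to solve. Air pollution in particular harms human health. In this study, we build an end-to-end analysis framework, from data collection to knowledge generation, to investigate the relationship between air pollution indices and cancer statistics. The framework consists of two parts: a data access flow and a data analysis flow. The data access flow improves the accessibility of raw (open) data and releases it through APIs. During data processing, we use time and location information to map the cancer statistics to the nearest air quality monitoring station, restricted to an effective range of influence. The data analysis flow follows a data-driven approach, using data exploration and data mining methods to discover relationships in the data. The data exploration methods use statistical, clustering, and serialization analysis techniques to make an initial discovery of associations in the data. Classifiers are then introduced to further analyze the relationship between air pollution indices and cancer statistics. The experimental results show that different air pollution indicators have a significant influence on specific cancers. Evaluated against papers from the medical field, our results are consistent with those of traditional statistical methods and can cover the findings of several studies at once. Overall, beyond air pollution and cancer data, the framework can also be applied to other datasets that have both spatial and temporal dimensions.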
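
The mapping step described above (assigning each cancer record to the nearest monitoring station, limited to an effective range) could be sketched as follows. This is a minimal illustration and not the thesis's implementation: the haversine distance, the 20 km cutoff, and all station names, coordinates, field names, and readings are assumptions made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations, max_km):
    """Return the name of the closest station within max_km, or None if none is in range."""
    best_name, best_dist = None, max_km
    for name, (s_lat, s_lon) in stations.items():
        d = haversine_km(lat, lon, s_lat, s_lon)
        if d <= best_dist:
            best_name, best_dist = name, d
    return best_name


# Hypothetical stations, cancer records, and yearly readings (illustration only).
stations = {"station_a": (25.05, 121.52), "station_b": (24.15, 120.67)}
records = [
    {"district": "d1", "year": 2010, "lat": 25.03, "lon": 121.56},
    {"district": "d2", "year": 2010, "lat": 22.00, "lon": 120.75},
]
readings = {("station_a", 2010): {"PM10": 55.0}}

mapped = []
for r in records:
    station = nearest_station(r["lat"], r["lon"], stations, max_km=20.0)
    # Join on location (nearest station) and time (year); None if no match exists.
    mapped.append({**r, "station": station, "air": readings.get((station, r["year"]))})
```

Records with no station inside the cutoff keep `station` set to `None`, so they can be excluded from later analysis rather than being paired with a distant station.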

According to the Global Health Risks report published by the WHO, environmental issues are among the most urgent problems facing the world, and air pollution in particular causes great damage to human health. In this work, we build an analysis framework for finding the relationships between air pollution indices and cancer statistics. The framework consists of two parts: a data access flow and an analytics flow. The data access flow processes raw (open) data so that it can be accessed through APIs. Using time and location information, we map the cancer statistics to the air pollution data from the nearest monitoring stations. The analytics flow extracts insights using data exploration and data mining methods. The exploration methods use statistics, clustering, and series mining techniques to interpret the data at hand. Classifiers are then applied to find the relationships between air quality and cancer, treating air pollution indices as features and cancer statistics as labels. The experiments show which air pollutants have a significant influence on specific cancers. The results are consistent with those obtained by traditional statistical methods and also cover the findings of several separate studies. In summary, the framework is flexible and can be applied to other spatiotemporal data.
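
The idea of treating pollution indices as features and cancer statistics as labels can be illustrated with a very small classifier. The abstract does not name the classifiers used, so the one-split decision stump below is a stand-in chosen for brevity, and the feature names, values, and labels are made-up toy data rather than results from the thesis.

```python
def fit_stump(rows, labels, feature_names):
    """Find the single (feature, threshold) split with the highest training accuracy.

    The stump predicts label 1 when the chosen feature exceeds the threshold.
    Returns a tuple (accuracy, feature_name, threshold).
    """
    best = None
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            preds = [1 if row[j] > thr else 0 for row in rows]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if best is None or acc > best[0]:
                best = (acc, name, thr)
    return best


# Toy data (hypothetical): one row per region, columns are pollutant averages.
feature_names = ["PM10", "O3"]
rows = [(30, 20), (35, 45), (60, 25), (65, 40), (70, 22), (28, 41)]
# Hypothetical label: 1 if the region's incidence of some cancer is above the median.
labels = [0, 0, 1, 1, 1, 0]

accuracy, feature, threshold = fit_stump(rows, labels, feature_names)
```

On this toy data the stump selects the feature that best separates the two label groups, which is the sense in which a fitted classifier can point to the pollutant most associated with a given cancer. A split that fits the training data is not evidence of a causal effect.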
Acknowledgement
Abstract
Abstract (in Chinese)
Contents
List of Figures
List of Tables
1. Introduction
2. Related Works
3. System Infrastructure and Analytic Flowchart
4. Data Access
4.1 Data-as-a-Service (DaaS)
4.2 Data Description
4.2.1 Air Quality Monitoring Data
4.2.2 Cancer Occurrence Statistical Data
4.3 Data Preprocessing
5. Data Exploration
5.1 Statistical Analysis
5.2 Row-wise Analysis
5.2.1 Clustering Method
5.2.2 Clustering Representation
5.3 Column-wise Analysis
5.3.1 Time Serialization
5.3.2 Correlation
5.3.3 Correlation Matrix
5.4 Results and Observations
5.4.1 Statistical Analysis
5.4.2 Row-wise Analysis
5.4.3 Column-wise Analysis
5.4.4 Observations
6. Data Mining
6.1 Classification
6.1.1 Classification
6.1.2 Interpretation
6.1.3 Application
6.2 Results and Evaluations
6.2.1 Results
6.2.2 Evaluations
7. Conclusion
8. References
