National Digital Library of Theses and Dissertations in Taiwan

Graduate student: 陳鈺婷 (Yu-Ting Chen)
Thesis title: Evaluations of gene expression normalization methods: a focus on Remove Unwanted Variation (RUV) series methods
Advisors: 洪弘 (Hung Hung), 郭柏秀 (Po-Hsiu Kuo)
Oral defense committee: 林菀俞 (Wan-Yu Lin), 蔡孟勳 (Mong-Hsun Tsai)
Oral defense date: 2016-07-19
Degree: Master's
Institution: National Taiwan University
Department: Master's Program in Statistics
Discipline: Mathematics and Statistics
Field: Statistics
Thesis type: Academic thesis
Year of publication: 2016
Graduation academic year: 104
Language: English
Number of pages: 69
Keywords: gene expression; normalization; RNA-Seq; differential expression test
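Chinese abstract (English translation):
Gene expression data often contain many sources of variation, such as batch effects in experiments or differing sequencing depth, and these variations affect which genes an analysis identifies as differentially expressed (DE). Normalization is therefore a necessary data pre-processing step before such analyses, used to correct technical variation in the experiment and to safeguard inference about DE genes.
In recent years many normalization methods have been applied to RNA sequencing (RNA-Seq) data, yet the effectiveness of these existing methods in correcting unknown latent variation still lacks systematic evaluation. In this study we carry out a comparative study to evaluate two types of normalization method: global-scaling methods and remove unwanted variation (RUV) methods. Earlier literature reports that global-scaling methods can correct for differing sequencing depth, and these methods, which include Median, Upper Quartile (UQ) and Trimmed Mean of M-values (TMM), are widely used in studies that detect DE genes from RNA-Seq data. RUV is a newly developed approach that uses control genes or samples to adjust for variation caused by experimental technique. We compare the global-scaling methods with the RUV series methods, including RUV2, RUV4 and RUVr, and with methods that combine the two types: RUV2+UQ, RUV4+UQ and RUVr+UQ. We consider seven variation settings involving batch effects, latent variation and differing sequencing depth. Within each setting we also vary the sample size, the number of variation factors and the number of control genes to assess how these parameters affect the correct identification of DE genes. In our simulations, the RUV series methods (except RUVr) combined with UQ correct variation most effectively, whether or not sequencing depth differs.
In this thesis we discuss the results under different data scenarios and, using publicly available gene expression data, recommend suitable normalization methods according to the characteristics of the RNA-Seq data. Our findings are intended to help researchers choose an appropriate normalization method before such analyses and so improve the validity of their results.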
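The global-scaling idea named in the abstract can be illustrated with a minimal Upper Quartile (UQ) sketch. This is not taken from the thesis: the function name `upper_quartile_normalize`, the genes-by-samples layout, the removal of all-zero genes before taking quantiles, and the centring of factors on their geometric mean are all assumptions made for the example.

```python
import numpy as np

def upper_quartile_normalize(counts):
    """Scale each sample (column) of a genes x samples count matrix
    by the 75th percentile of its counts (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    # Assumption: genes with zero counts in every sample are ignored
    # when computing the per-sample upper quartile.
    expressed = counts[counts.sum(axis=1) > 0]
    uq = np.percentile(expressed, 75, axis=0)
    if np.any(uq <= 0):
        raise ValueError("a sample has a zero upper quartile; cannot scale")
    # Assumption: centre the factors on their geometric mean so that the
    # overall scale of the data is roughly preserved.
    factors = uq / np.exp(np.mean(np.log(uq)))
    return counts / factors, factors
```

If one sample is simply sequenced twice as deeply as another, its scaling factor comes out twice as large and the normalized columns coincide, which is the sequencing-depth correction that global-scaling methods aim for.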


Gene expression data often contain unwanted variation, such as batch effects in experiments or varying sequencing depth among subjects, which hinders the identification of differentially expressed (DE) genes for the trait of interest. Normalization as a pre-processing step before association analysis is essential for correcting technical variation and increasing the power to identify DE genes.
Several normalization methods have recently been applied to RNA sequencing (RNA-Seq) data. However, the ability of existing methods to correct potential bias from unwanted variation has not been systematically evaluated. In this work, we conduct a comparison study to evaluate the performance of two types of normalization methods: global scaling and removing unwanted variation (RUV). Global-scaling methods, including the Median, Upper Quartile (UQ) and Trimmed Mean of M-values (TMM) methods, are claimed to correct for sequencing depth and are widely used in the literature for testing DE genes in RNA-Seq studies. RUV methods, on the other hand, are newly developed methods that adjust for nuisance technical effects by using control genes or samples. We compare the global-scaling methods (Median, UQ, TMM) with a series of RUV methods, namely RUV2, RUV4 and RUVr, as well as the combinations RUV2+UQ, RUV4+UQ and RUVr+UQ. We consider seven simulation settings of unwanted variation, covering various combinations of batch effects, latent variation and sequencing depth, each under different sample sizes, numbers of factors of unwanted variation and numbers of control genes. Our simulations indicate that the RUV series methods (except RUVr) combined with UQ are the most effective in correcting unwanted variation, even when sequencing depth does not vary.
We discuss the results in different scenarios and provide recommendations for the use of different normalization methods according to the characteristics of the RNA-Seq data. Our results could help researchers select a suitable normalization method when data must be normalized before association testing.
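The use of control genes to adjust for unwanted variation, as the abstract describes for the RUV methods, can be sketched in a simplified RUV2-style form. This is an illustration only and not the thesis's implementation: it works on log-scale expression with ordinary least squares rather than a count model, and the function name `ruv2_style_adjust`, its arguments, and the samples-by-genes layout are assumptions.

```python
import numpy as np

def ruv2_style_adjust(log_expr, design, control_idx, k):
    """Simplified RUV2-style adjustment (hypothetical helper).

    log_expr    : samples x genes matrix of log-scale expression
    design      : samples x covariates matrix (include an intercept column)
    control_idx : indices of genes assumed unaffected by the factor of interest
    k           : assumed number of factors of unwanted variation
    """
    Y = np.asarray(log_expr, dtype=float)
    X = np.asarray(design, dtype=float)
    # Step 1: estimate unwanted factors W from the control genes only,
    # using the leading left singular vectors of the centred control data.
    Yc = Y[:, control_idx]
    Yc = Yc - Yc.mean(axis=0)
    U, s, _ = np.linalg.svd(Yc, full_matrices=False)
    W = U[:, :k] * s[:k]
    # Step 2: regress every gene on the design and W jointly, then
    # subtract only the fitted contribution of W.
    XW = np.hstack([X, W])
    coef, *_ = np.linalg.lstsq(XW, Y, rcond=None)
    alpha = coef[X.shape[1]:]
    return Y - W @ alpha, W
```

The choice of `k` and of the control genes drives the result, which matches the abstract's point that the number of factors of unwanted variation and the number of control genes are settings worth varying in an evaluation.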


CONTENTS
Oral Defense Committee Certification
Acknowledgements
Chinese Abstract
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 RNA sequencing (RNA-Seq) Data
1.2 Normalization
1.3 Test for Differentially Expressed Genes
1.4 Aims
Chapter 2 Material and Methods
2.1 Normalization methods
2.1.1 Global-scaling normalization
2.1.2 Removing Unwanted Variation (RUV)
2.2 Real data
2.2.1 GSE26024 (n=21)
2.2.2 GSE57148 (n=189)
2.3 Simulation
2.4 Evaluations of Method Performance
Chapter 3 Results
3.1 Real data simulation
3.1.1 Batch effects and Latent variations (W1-W2)
3.1.2 Batch effects, Latent variations and Sequencing depth (W1-W2-Seq)
3.1.3 Batch effects and Sequencing depth (W1-Seq)
3.1.4 Latent variations and Sequencing depth (W2-Seq)
3.1.5 Only batch effects (W1)
3.1.6 Only sequencing depth (Seq)
3.1.7 Summary
3.2 Real data
3.2.1 GSE26024
3.2.2 GSE57148
3.3 Influence of noise of real data
Chapter 4 Discussion
REFERENCE

LIST OF FIGURES
Figure 1. The analysis process of RNA-Seq data.
Figure 2. The performance of each method under scenarios of unwanted variations including batch effects and latent variations (W1-W2).
Figure 3. The performance of each method under scenarios of unwanted variations including batch effects, latent variations and sequencing depth (W1-W2-Seq).
Figure 4. The performance of each method under scenarios of unwanted variations including batch effects and sequencing depth (W1-Seq).
Figure 5. The performance of each method under scenarios of unwanted variations including latent variations and sequencing depth (W2-Seq).
Figure 6. The performance of each method under scenarios of unwanted variations including only batch effects (W1).
Figure 7. The performance of each method under scenarios of unwanted variations including only sequencing depth (Seq).
Figure 8. The performance of each method under scenarios without any unwanted variations (True).
Figure 9. The mean gene rank found by each method in ascending order of SNR based on control genes found by RUVr. (a) W1-W2 (b) W1-W2-Seq (c) W2-Seq.
Figure 10. The mean gene rank found by each method in ascending order of SNR based on true control genes. (a) W1-W2 (b) W1-W2-Seq (c) W2-Seq.
Figure 11. The mean AUC of each method under the control genes found by RUVr. (a) W1-W2 (b) W1-W2-Seq (c) W2-Seq.
Figure 12. The mean AUC of each method under the true control genes. (a) W1-W2 (b) W1-W2-Seq (c) W2-Seq.
Figure S1. Comparisons of mean AUC between UQ and RUV2+UQ with different settings and different sample sizes in real data simulations.

LIST OF TABLES
Table 1. Results for DE test of GSE26024 data
Table 2. Models of 7 combinations of variations
Table 3. Results for each method under the data simulations with 4 types of sample sizes.
Table 4. Results for each method under the normal simulations with 4 types of sample sizes.
Table 5. The performances of each method when the number of factors of unwanted variation is overestimated/underestimated.
Table 6. The performance of each method in data GSE26024 with the number of factors of unwanted variation k equal to 3.
Table 7. The performance of each method in data GSE57148 with the number of factors of unwanted variation k equal to 5.
Table S1. The best method for the comparisons of overestimating/underestimating k in situations of W1-W2 and W1-W2-Seq.




