

Thesis title (English): Generalized Dirichlet Priors for Naïve Bayesian Classifiers with Multinomial Models in Classifying Gene Sequence Data
Advisor (English): Tzu-Tsung Wong
Keywords (English): Dirichlet distribution; gene sequence data classification; generalized Dirichlet distribution; naïve Bayesian classifier

Biologists are no longer limited to observing cultures in Petri dishes in the laboratory. With the technology developed for metagenomics, they can now obtain samples directly from the natural environment. Although this technology helps in studying the relationships among species and their habitats, samples obtained this way cannot be analyzed by traditional methods. This research proposes a new operational mechanism for naïve Bayesian classifiers to classify gene sequence data for biologists. Because the number of class values (species) generally exceeds one hundred and the number of features extracted from gene sequence data can exceed ten thousand, each feature carries relatively little information for classification. In this case, priors can play an important role in the operation of the naïve Bayesian classifier. Dirichlet and generalized Dirichlet distributions have been shown to be appropriate priors for improving the performance of the naïve Bayesian classifier, and this research adopts them to enhance prediction accuracy on gene sequence data. The experimental results on two gene sequence data sets demonstrate that priors do help in classifying gene sequence instances, and that a significant improvement can be achieved on a data set for which the original prediction accuracy is poor.
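To make the role of the prior concrete, the following is a minimal sketch of a multinomial naïve Bayesian classifier with a symmetric Dirichlet prior. It is an illustration under stated assumptions, not the procedure used in the thesis: the function names, the use of overlapping k-mers as features, the alphabet {A, C, G, T}, and the single shared prior parameter `alpha` are all hypothetical choices, and the generalized Dirichlet case and the parameter search are not covered.

```python
import math
from collections import Counter


def kmer_counts(seq, k=2):
    """Count overlapping k-mers in a sequence (assumed feature extraction)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train(samples, k=2):
    """Accumulate k-mer counts per class and the number of sequences per class.

    samples: iterable of (sequence, label) pairs.
    """
    counts, n_class = {}, Counter()
    for seq, label in samples:
        counts.setdefault(label, Counter()).update(kmer_counts(seq, k))
        n_class[label] += 1
    return counts, n_class


def predict(seq, counts, n_class, alpha=1.0, k=2):
    """Return the class with the highest log posterior score.

    Each k-mer probability is the posterior mean under a symmetric
    Dirichlet(alpha) prior: (count + alpha) / (total + alpha * V),
    where V = 4**k assumes the alphabet {A, C, G, T}. alpha must be > 0.
    """
    vocab_size = 4 ** k
    total_seqs = sum(n_class.values())
    query = kmer_counts(seq, k)
    best_label, best_score = None, -math.inf
    for label, class_counts in counts.items():
        total = sum(class_counts.values())
        score = math.log(n_class[label] / total_seqs)  # class prior
        for kmer, n in query.items():
            p = (class_counts[kmer] + alpha) / (total + alpha * vocab_size)
            score += n * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In this sketch, `counts, n_class = train(samples)` followed by `predict(seq, counts, n_class, alpha=0.5)` classifies a new sequence. When a class has few observed k-mers relative to the number of possible k-mers, the `alpha` terms dominate the estimate, which is one way to see why the choice of prior can matter when individual features carry little information.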
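Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Symbols
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Research Procedure
Chapter 2 Literature Review
2.1 Naïve Bayesian Classifier
2.1.1 Basic Operating Principles
2.1.2 Applications of the Naïve Bayesian Classifier
Naïve Bayesian Classifier Applied to Document Classification
Naïve Bayesian Classifier Applied to Gene Sequence Classification
2.2 Smoothing Constants
2.3 Dirichlet and Generalized Dirichlet Distributions
2.3.1 Formulas for the Dirichlet Distribution
2.3.2 Formulas for the Generalized Dirichlet Distribution
2.3.3 Relationship Between the Dirichlet and Generalized Dirichlet Distributions
Chapter 3 Research Methods
3.1 Gene Sequence Classification Procedure and Description
3.2 Preprocessing of Gene Sequence Data
3.3 Multinomial Model
3.4 Adjustment and Correction of Prior Distribution Parameters
3.5 Methods for Finding the Best Prior Distribution Parameters
3.5.1 Finding Dirichlet Distribution Parameters
3.5.2 Finding Generalized Dirichlet Distribution Parameters
3.6 Validation Method
Chapter 4 Empirical Study
4.1 Description of the Data Sets
4.2 Empirical Results for the Dirichlet Distribution
4.3 Empirical Results for the Generalized Dirichlet Distribution
4.4 Summary
Chapter 5 Conclusions and Suggestions
References
Appendix 1 Accuracy Table for the Dirichlet Distribution: Bacteria Data Set
Appendix 2 Accuracy Table for the Dirichlet Distribution: Fungi Data Set
Appendix 3 Accuracy Table for the Generalized Dirichlet Distribution: Bacteria Data Set
Appendix 4 Accuracy Table for the Generalized Dirichlet Distribution: Fungi Data Set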

