Thesis Title (Foreign Language): Dirichlet Priors for Markov Naïve Bayesian Classifiers with Multinomial Model for Gene Sequence Data
Advisor (Foreign Language): Tzu-Tsung Wong
Foreign-Language Keywords: Dirichlet distribution; gene sequence data; Markov model; naïve Bayesian classifier
With the development of metagenomics and sequencing, biologists no longer have to culture microbes in a laboratory, which is possible for less than one percent of the microbes living in an ecological environment. To explore the diversity of species, biologists instead use metagenomic technologies to extract samples directly from the environment. When gene sequence reads are classified, an N-mer sliding window is generally used to extract features, so two adjacent features have N-1 letters in common. This strongly violates the conditional independence assumption of the naïve Bayesian classifier. The Markov naïve Bayesian classifier relaxes the conditional independence assumption and should therefore be a more appropriate classifier for gene sequence data. In this study, we embed multinomial models and Dirichlet priors for the enumerator and the denominator in the Markov naïve Bayesian classifier to enhance its accuracy in classifying gene sequence reads. Two methods, enumerator-first and denominator-first, are tested on four gene sequence sets, and the experimental results show that the enumerator-first method generally achieves a higher prediction accuracy. Both methods perform better than the well-known RDP classifier. Because the Markov naïve Bayesian classifier has two priors for each class value, rather than the single prior of the naïve Bayesian classifier, the best noninformative Dirichlet priors do enhance its performance.
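The sliding-window overlap and the two-prior structure described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the four-letter DNA alphabet, and the symmetric Dirichlet parameters `alpha` and `beta` are all assumptions made for illustration, and the sketch does not reproduce the enumerator-first or denominator-first procedures. It treats each pair of adjacent N-mers as one (N+1)-mer, so the joint probability in the ratio's upper term is estimated from (N+1)-mer counts and the probability in the denominator from N-mer counts, each smoothed by its own prior.

```python
import math
from collections import Counter


def kmers(read, n):
    """Slide a window of width n over the read, one letter at a time."""
    return [read[i:i + n] for i in range(len(read) - n + 1)]


def train(reads, n):
    """Count N-mers and (N+1)-mers over the training reads of one class."""
    single = Counter()  # counts used for the denominator
    pair = Counter()    # counts of adjacent N-mer pairs, i.e. (N+1)-mers
    for r in reads:
        single.update(kmers(r, n))
        pair.update(kmers(r, n + 1))
    return single, pair


def log_likelihood(read, n, single, pair, alpha=1.0, beta=1.0):
    """Log-score of a read for one class under a first-order Markov chain
    over N-mers: P(x_1) * prod_i P(x_{i-1}, x_i) / P(x_{i-1}).

    alpha and beta are assumed symmetric Dirichlet parameters for the
    (N+1)-mer and N-mer multinomials. Assumes len(read) >= n and a
    four-letter alphabet.
    """
    k_single = 4 ** n
    k_pair = 4 ** (n + 1)
    n_single = sum(single.values())
    n_pair = sum(pair.values())

    def p_single(w):
        return (single[w] + beta) / (n_single + beta * k_single)

    def p_pair(w):
        return (pair[w] + alpha) / (n_pair + alpha * k_pair)

    score = math.log(p_single(kmers(read, n)[0]))
    for w in kmers(read, n + 1):
        # w[:n] is the earlier of the two adjacent N-mers contained in w.
        score += math.log(p_pair(w)) - math.log(p_single(w[:n]))
    return score


# Adjacent 3-mers share N-1 = 2 letters.
words = kmers("ACGTA", 3)
print(words)
print(words[0][1:] == words[1][:2])
```

A read would then be assigned to the class with the highest score, after adding the log of the class prior if classes are not equally likely. Because the two multinomials are smoothed separately, the ratio is not guaranteed to be a normalized conditional probability; the sketch only shows where the two priors per class enter the computation.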
