Author: Bo-Yu Chen
Thesis title: Model Adaptation Learning for Large Scale Gene Tagging Task
Advisor: Yuh-Jye Lee
Keywords: named entity recognition; human gene tagging; conditional random fields; periodic step-size adaptation; model adaptation; voting scheme
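We use conditional random fields (CRFs) as the algorithm for training our models, and we adopt a model adaptation method to address the scarcity of human gene data. In short, we take other data related to human genes, such as data covering all species, and apply model adaptation to extract human gene entity names. Through the model adaptation process, we extract more entity names of human genes and human gene products while keeping the proportion of non-human genes misclassified as human genes as low as possible. To improve performance further, we use a voting mechanism to combine the models trained by model adaptation with different proportions of all-species data; our experiments confirm that voting helps improve performance. The results also show that combining different models by voting works better than labeling the test data with a single model.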
Gene tagging is a named entity recognition (NER) task in biomedical text mining: identifying mentions of genes and gene products in scientific text. However, it is very difficult to train a good model from a small amount of human gene data when the goal is to tag a large number of human gene names and human gene product mentions. We propose a model adaptation method based on conditional random fields (CRFs) to address this scarcity of human gene data. We select other data related to human genes, such as gene data covering all species, and use model adaptation to extract information about human gene names from it. With model adaptation, we tag more human gene names and human gene product mentions while keeping the increase in false positives as small as possible. To improve performance further, we use a voting scheme to combine the labeled results of different models produced by the model adaptation method. The experimental results confirm that model adaptation improves performance, and they also show that the voting scheme performs better than a single model.
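The abstract does not say how the votes are counted, so the following is only a minimal sketch of one plausible reading: a token-level majority vote over the label sequences produced by several adapted models. The function name `majority_vote`, the B/I/O label set, and the rule that ties fall back to `O` are all assumptions made for illustration, not details taken from the thesis.

```python
from collections import Counter


def majority_vote(predictions, default="O"):
    """Combine per-token labels from several models by strict majority.

    predictions: list of label sequences, one per model, all the same length.
    default: label used when no label has more than half of the votes
             (an assumed tie-breaking rule, not from the thesis).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    combined = []
    # zip(*predictions) yields, for each token, the labels from every model.
    for token_labels in zip(*predictions):
        label, count = Counter(token_labels).most_common(1)[0]
        # Accept the top label only if more than half of the models agree.
        combined.append(label if count * 2 > len(predictions) else default)
    return combined


# Hypothetical outputs of three models on the same four-token sentence.
model_a = ["O", "B", "I", "O"]
model_b = ["O", "B", "O", "O"]
model_c = ["B", "B", "I", "O"]
voted = majority_vote([model_a, model_b, model_c])
```

Voting on each token independently can produce a sequence that is not well formed under a B/I/O scheme (for example an `I` with no preceding `B`), so a real system would likely need a repair step or would vote on whole mentions instead.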
1 Introduction
1.1 Background
1.2 Motivation
2 Learning Algorithms for Gene Tagging
2.1 Conditional Random Fields
2.2 Periodic Step-Size Adaptation
2.2.1 Stochastic Gradient Descent
2.2.2 Periodic Step-Size Adaptation
2.3 Model Adaptation Method
3 Data Source and Data Preprocessing
3.1 Data Source
3.2 Dataset Preprocessing
3.3 Extracting Features
3.4 Data Labeling and Performance Measures
3.4.1 Data Labeling
3.4.2 Performance Measures
4 Experimental Results
4.1 Baseline Performance
4.2 Model Adaptation Measurement
4.3 Voting Scheme
4.4 Summary
5 Conclusion
