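Author: Chen, Wen-Huei (陳玟卉)
Thesis title: Reducing Class Imbalance with Threshold Moving for Nanopore Genome Polishing
Thesis title (Chinese): 針對類別不平衡以門檻移動法對奈米孔定序序列進行錯誤校正
Advisor: Huang, Yao-Ting (黃耀廷)
Committee members: Chuang, Trees-Juen (莊樹諄); Tsai, Huai-Kuang (蔡懷寬); Chiang, Chen-Kuo (江振國)
Oral defense date: 2020-07-30
Degree: Master's
Institution: National Chung Cheng University (國立中正大學)
Department: Graduate Institute of Computer Science and Information Engineering (資訊工程研究所)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Publication year: 2020
Graduation academic year: 108
Language: English
Pages: 41
Keywords (translated from Chinese): homopolymer; systematic errors; Oxford Nanopore sequencing platform; class imbalance; support vector machine
Keywords (English): homopolymer; systematic errors; nanopore sequencing; class imbalance; support vector machine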
The major challenge in Nanopore sequencing is its inherent systematic errors, which usually occur within long stretches of identical nucleotides (called homopolymers) passing through the pore. Because the signals of homopolymers of different lengths are often indistinguishable, basecalling algorithms produce insertion or deletion errors. Existing error-correction algorithms fail to remove these systematic errors. This thesis develops a machine learning model for removing the systematic errors inherent in Nanopore sequencing. Because the training data are extremely class-imbalanced, we propose three methods to reduce the imbalance: redesigning the class labels, applying different sampling algorithms to the training data, and threshold moving to improve polishing accuracy. The experimental results show that all three methods reduce class imbalance and correct Nanopore systematic errors, improving the quality of the final genome.
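The idea behind threshold moving can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the choice of F1 as the selection metric, and the toy probabilities below are all assumptions made for illustration. Given a classifier's estimated probability that a position belongs to the minority class ("needs correction"), the sketch scans candidate decision thresholds on held-out data and keeps the one with the best F1, instead of using the default 0.5.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 of the positive (minority) class when predicting 1 iff prob >= threshold."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    if tp == 0:
        return 0.0  # no true positives: precision/recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probs, labels, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))


# Hypothetical validation data: 8 majority-class positions (label 0)
# and 2 minority-class positions (label 1). An imbalanced model tends to
# assign low probabilities even to true minority examples.
probs = [0.05, 0.10, 0.12, 0.20, 0.22, 0.25, 0.28, 0.31, 0.35, 0.45]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

default_f1 = f1_at_threshold(probs, labels, 0.5)
moved = best_threshold(probs, labels)
moved_f1 = f1_at_threshold(probs, labels, moved)
print(default_f1, moved, moved_f1)
```

On this toy data the default threshold of 0.5 labels every position as the majority class, while a lower threshold recovers both minority positions. In practice the threshold would be chosen on a validation set separate from the data used for the final evaluation.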
Abstract
List of Figures
List of Tables
1 Introduction
2 Literature Review
2.1 LIBSVM
2.2 Phred quality score
2.3 Mash
2.4 Alignment tools
2.5 Assembly and polishing tools
2.6 Homopolish
3 Method
3.1 Class partition
3.2 Different sampling of training data
3.3 Threshold moving
4 Results
4.1 Materials
4.2 Compare with four sampling method
4.3 Compare with other polishing software
4.4 The results of R10.3 version model
5 Conclusion
Bibliography
Appendix
Electronic full text (Internet release date: 2025-08-27)