National Digital Library of Theses and Dissertations in Taiwan

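Graduate student: LIN, YI-HSIN (林宜歆)
Thesis title: Association between Medical Subject Heading and Gene Names based on Text-Mining from PubMed
Thesis title (Chinese): 利用文字探勘建立醫學主題詞與基因名稱之關聯性
Advisor: SU, SUI-LUNG (蘇遂龍)
Oral defense committee: CHANG, CHEE-JEN (張啟仁); YEH, SHIH-JEN (葉釋仁); SU, SUI-LUNG (蘇遂龍); FANG, WEN-HUI (方文輝); LIN, CHIN (林嶔)
Oral defense date: 2017-05-12
Degree: Master's
Institution: National Defense Medical Center (國防醫學院)
Department: Graduate Institute of Public Health (公共衛生學研究所)
Discipline: Medicine and Health
Field: Public Health
Thesis type: Academic thesis
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: Chinese
Number of pages: 93
Keywords: text-mining; medical subject headings; gene names; often appearing in abstracts; MeSH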
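Research background:
The volume of published biomedical literature has grown steadily in recent years, and since 2014 more than 1 million articles have been published each year. Organizing this literature through conventional reading is increasingly difficult for researchers, so computer-assisted, automated processing of large document collections is needed to surface useful information. Well-known websites that automatically organize biological information, such as Coremine, STRING, and DisGeNET, show only the terms directly related to a query term and do not reveal indirect associations.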
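Research objectives:
This study examines, in the unstructured abstracts indexed in PubMed, how often medical subject headings (MeSH) and gene names were used in different years and how strongly the terms are associated with one another.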
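Materials and methods:
This study used a text-mining design. The study sample consisted of abstracts of articles published in PubMed, retrieved and downloaded on 8 July 2016 as Abstract (text) files, 26,295,751 records in total. The files were split into individual articles, and the abstract and publication year of each article were collected. Starting from the Medical Subject Headings compiled by the U.S. National Library of Medicine, synonyms for 27,883 terms were looked up in the MeSH Browser to build a MeSH dictionary. Starting from the official gene names assigned by the HUGO Gene Nomenclature Committee, aliases were looked up in NCBI Gene to build a dictionary of 39,903 human genes. Text matching was used to extract the 67,786 MeSH terms and gene names contained in the abstracts, the number of occurrences of each term in abstracts was counted by year, and word2vec was used to analyze the strength of association among MeSH terms and gene names.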
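The dictionary matching and per-year counting described above can be sketched roughly as follows. This is a minimal illustration and not the thesis's own code: the toy dictionary entries, the function names, the case-insensitive whole-word matching, and the choice to count each abstract at most once per term are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy dictionary: canonical term -> synonyms or aliases.
# The entries are illustrative only and are not copied from MeSH or NCBI Gene.
TERM_DICTIONARY = {
    "Osteoarthritis": ["osteoarthritis", "degenerative arthritis"],
    "Knee Joint": ["knee joint", "knee joints"],
    "LMX1B": ["LMX1B", "NPS1"],
}


def compile_patterns(dictionary):
    """Build one case-insensitive, whole-word regex per canonical term."""
    patterns = {}
    for canonical, synonyms in dictionary.items():
        # Longer synonyms first so the longest alternative is tried first.
        alternatives = sorted((re.escape(s) for s in synonyms), key=len, reverse=True)
        patterns[canonical] = re.compile(
            r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE
        )
    return patterns


def count_terms_by_year(records, patterns):
    """Count, per year, the abstracts that mention each canonical term.

    records: iterable of (year, abstract_text) pairs.
    Returns (counts, totals), where counts[year][term] is the number of
    abstracts mentioning the term and totals[year] is the number of abstracts.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for year, abstract in records:
        totals[year] += 1
        for canonical, pattern in patterns.items():
            if pattern.search(abstract):
                counts[year][canonical] += 1
    return counts, totals


def frequency(counts, totals, year, term):
    """Share of a year's abstracts that mention the term (0.0 if no abstracts)."""
    return counts[year][term] / totals[year] if totals[year] else 0.0
```

Dividing by the yearly number of abstracts, as `frequency` does, is one simple way to compare years with very different publication volumes; whether the thesis normalizes counts in this way is not stated in the abstract.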
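Results:
This study built an interactive website for looking up how often, and at what frequency, MeSH terms and gene names appear in abstracts in each year, together with the related terms that most often appear alongside them in abstracts ( https://yihsin.shinyapps.io/meshgeneterm_relation/ ). Many abstracts began to mention China from 2012, when it ranked 10th by count, and by 2016 it had moved into the top eight, which may reflect China's rise in academia. Health rose from 7th to 3rd place, which may indicate that more articles in recent years address health issues. Taking Osteoarthritis as an example of related terms, the terms that most often appear with it in abstracts are Osteotomy, Nails, Patella, Knee Joint, Monosomy, Physical Examination, and Paralysis. Osteotomy is in turn associated with Paralysis, Knee Joint, and Patella (correlation greater than 0.6).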
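The idea of direct and indirect association above a 0.6 threshold can be illustrated with a small sketch. Association between word2vec vectors is commonly measured with cosine similarity, but the abstract does not say which measure the thesis reports, so treat that as an assumption. The vectors below are made-up toy values with placeholder term names and do not reproduce any result from the thesis.

```python
import math

# Hypothetical toy embeddings with placeholder names. Real word2vec vectors
# would typically have a hundred or more dimensions.
EMBEDDINGS = {
    "term_a": [0.9, 0.4, 0.1],
    "term_b": [0.7, 0.7, 0.1],
    "term_c": [0.1, 0.9, 0.3],
    "term_d": [0.0, 0.1, 1.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def neighbours(term, embeddings, threshold=0.6):
    """Terms whose similarity to `term` exceeds the threshold, strongest first."""
    base = embeddings[term]
    scored = [
        (other, cosine_similarity(base, vector))
        for other, vector in embeddings.items()
        if other != term
    ]
    return sorted(
        [(other, score) for other, score in scored if score > threshold],
        key=lambda pair: pair[1],
        reverse=True,
    )


def indirect_neighbours(term, embeddings, threshold=0.6):
    """Terms reachable from `term` only through one of its direct neighbours.

    Returns a dict mapping each indirect term to the list of direct
    neighbours that link it back to `term`.
    """
    direct = {other for other, _ in neighbours(term, embeddings, threshold)}
    indirect = {}
    for middle in sorted(direct):
        for other, _ in neighbours(middle, embeddings, threshold):
            if other != term and other not in direct:
                indirect.setdefault(other, []).append(middle)
    return indirect
```

With these toy vectors, `term_a` is directly associated only with `term_b`, while `term_c` is reachable from `term_a` through `term_b`; that second hop is the kind of indirect association the website is described as exposing.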
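Conclusions:
The website built in this study shows how often, and at what frequency, each MeSH term and gene name was used in abstracts in different years, and the strength of association with the terms that most often appear alongside it, indicating that these terms are frequently mentioned and discussed together in research. It also shows indirect associations, so that researchers exploring a new field can quickly gain an overview and obtain suggested research directions, which may benefit future interdisciplinary research.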

Background:
In recent years, the volume of published biomedical literature has grown rapidly, exceeding 1 million papers per year from 2014 onward. It is increasingly difficult for researchers to sort through and organize this literature by reading alone. Well-known websites such as Coremine, STRING, and DisGeNET can show only the terms directly related to a query term and offer no suggestions of indirect relationships. There is therefore a strong case for addressing this gap.

Purpose:
This study investigated how often medical subject headings (MeSH) and gene names were used in each year, and the strength of association among these terms, in the unstructured abstracts indexed in PubMed.

Materials and Methods:
This study used a text-mining design. The study sample comprised abstracts of articles published in PubMed from 1809 to 2016. We split the files into individual articles and collected each abstract and its publication year. Starting from the MeSH vocabulary of the U.S. National Library of Medicine, we looked up synonyms for 27,883 terms in the MeSH Browser to build the MeSH dictionary. Starting from the official gene names assigned by the HUGO Gene Nomenclature Committee, we looked up aliases in NCBI Gene to build a dictionary of 39,903 human genes. The 67,786 MeSH terms and gene names were then extracted from the abstracts, and their occurrences in abstracts were counted by year. We used word2vec to analyze the strength of association among MeSH terms and gene names.

Results:
We built an interactive website that reports how often MeSH terms and gene names appear in abstracts in different years and which terms most often appear together with them ( https://yihsin.shinyapps.io/meshgeneterm_relation/ ). For example, the terms that most often appear in abstracts together with Osteoarthritis are Osteotomy, Nails, Patella, Knee Joint, Monosomy, Physical Examination, and Paralysis. Osteotomy is in turn associated with Paralysis, Knee Joint, and Patella (correlation greater than 0.6).

Conclusion:
The website developed in this study reports how often each MeSH term and gene name was used in different years and which terms are most strongly associated with it, indicating that these terms are often mentioned and discussed together in medical publications. It also reveals indirect associations among terms, so that researchers exploring a new field can quickly gain a general understanding.

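Table of Contents
List of Tables
List of Figures
Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
Section 1 Research Background
Section 2 Research Motivation and Significance
Section 3 Research Objectives
Chapter 2 Literature Review
Section 1 Big Data and Unstructured Data
Section 2 Text-Mining Techniques
Section 3 Relationships between Diseases and Genes
Section 4 The PubMed Biomedical Database
Section 5 MeSH (Medical Subject Headings)
Section 6 Gene Terms (Gene Names)
Chapter 3 Methods
Section 1 Study Design and Workflow
Section 2 Data Collection
Section 3 Statistical Methods
Section 4 Data Visualization
Chapter 4 Results
Section 1 Counts and Frequencies of MeSH Terms and Gene Names by Year
Section 2 Rankings of MeSH Terms and Gene Names by Count
Section 3 Strength of Association among MeSH Terms and Gene Names
Section 4 Strength of Indirect Association among MeSH Terms and Gene Names: Data Visualization
Chapter 5 Discussion
Section 1 NCBI Blocking of Access to PubMed
Section 2 Correction of Counts for MeSH Terms and Gene Names
Section 3 Summary of Results
Section 4 Comparison of Results with Other Websites
Section 5 Strengths of This Study
Section 6 Limitations of This Study
Chapter 6 Conclusion
Chapter 7 Future Prospects
Chapter 8 References
Chapter 9 Appendices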


