
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 楊皓評
Title: 基於微調預訓練大語言模型的資料去識別化效能評估:以澳洲電子健康紀錄為例
Title (English): Data De-identification Performance Evaluation based on Fine-tuned Pre-trained Large Language Models: A Case Study of Electronic Health Records in Australia
Advisor: 戴鴻傑
Advisor (English): Hong-Jie Dai
Committee Members: 李俊宏、朱學亭
Committee Members (English): Chung-Hong Lee, Hsueh-Ting Chu
Oral Defense Date: 2024-01-31
Degree: Master's
Institution: National Kaohsiung University of Science and Technology
Department: Department of Electrical Engineering
Discipline: Engineering
Academic Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2024
Graduation Academic Year: 112
Language: Chinese
Number of Pages: 58
Keywords (Chinese): 電子醫療記錄、受保護的健康資訊、去識別化、自然語言處理、大型語言模型、生成式預訓練轉換器、Pythia
Keywords (English): Electronic Medical Record, Protected Health Information, De-identification, Natural Language Processing, Large Language Model, Generative Pre-trained Transformer, Pythia
Usage statistics:
  • Cited by: 1
  • Views: 413
  • Rating:
  • Downloads: 110
  • Bookmarked: 1
Abstract (Chinese): The emergence of Electronic Medical Records (EMRs) has brought many benefits to the medical field, such as more convenient data analysis, improved patient medication safety, and lower costs for storing pathology reports, thereby raising the quality and efficiency of the care that hospitals and clinics provide. An EMR is a clinical record written by medical personnel, combined with the physician's post-visit analysis; it provides rich clinical information but also contains a large amount of patients' personal privacy information. Consequently, when EMRs are reused for secondary purposes, working directly with records that have not been de-identified can expose the personal data of both medical personnel and patients, so protecting their privacy is an important issue. In addition, the way temporal information is written in EMRs differs across medical personnel and across institutions or organizations, which affects the correctness of temporal analysis; normalizing temporal information is therefore also essential. This thesis studies EMRs from Australian hospitals: the protected health information (PHI) recorded in the EMRs is replaced with surrogates to reduce the risk of personal data leakage and to allow release in compliance with the Health Insurance Portability and Accountability Act (HIPAA). On the released corpus, Pythia-based language models of various sizes and the Generative Pre-trained Transformer 2 (GPT-2) are fine-tuned to recognize PHI in the EMRs and to normalize the corresponding temporal information. After fine-tuning and basic post-processing, the best Pythia model reaches 94.7% on PHI recognition and 84.9% on temporal normalization; taking Pythia 70M as an example, the Macro F1 measure improves by 52.6% and 69.2% on the PHI prediction and temporal normalization tasks, respectively, compared with the model before fine-tuning. The results show that, for the Pythia family, fine-tuning performance peaks at roughly 2.8B parameters, and larger models do not noticeably improve either task. The results also show that models trained with LoRA perform comparably to full fine-tuning.
Abstract (English): The emergence of Electronic Medical Records (EMRs) has brought many benefits to the medical field. For example, it has increased the convenience of data analysis, improved patient medication safety, and reduced the cost of storing pathology reports in hospitals and clinics, thereby enhancing the quality and efficiency of medical care for patients. An EMR is a clinical record written by medical personnel, combined with the physician's post-diagnosis analysis. Direct use of unprocessed EMRs can lead to the leakage of the personal information of medical personnel and patients, making the protection of their privacy an important issue. Additionally, the representation of time information recorded in EMRs varies among institutions and organizations, greatly affecting the analysis of temporal information; normalizing temporal information is therefore also important. This thesis compiles a corpus consisting of EMRs collected from Australian hospitals, annotated with protected health information (PHI) that is replaced by surrogates to reduce the risk of identity theft and to comply with the Health Insurance Portability and Accountability Act (HIPAA). Pythia- and GPT-2-based models are then fine-tuned on the compiled corpus to recognize PHI and normalize the temporal information mentioned in the EMRs. After fine-tuning and basic post-processing, the best Pythia model achieves 94.7% on PHI identification and 84.9% on temporal normalization. For instance, with the Pythia 70M model, the Macro F1 measure increases by 52.6% and 69.2% for the PHI prediction and temporal normalization tasks, respectively, compared with its performance before fine-tuning. According to the results of this study, the Pythia family reaches its best fine-tuning performance at approximately 2.8 billion parameters, and increasing the model size further does not significantly improve performance on either task. In addition, models trained via LoRA achieve performance similar to full fine-tuning.
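The abstracts describe PHI recognition and date normalization as a generation task solved by fine-tuning Pythia and GPT-2 checkpoints, in some runs with LoRA adapters. The following is a minimal sketch of such a setup, assuming the Hugging Face transformers, datasets, and peft libraries; the prompt/target format, hyperparameters, and training example shown here are hypothetical illustrations, not the thesis's actual configuration.

```python
# Minimal sketch: LoRA fine-tuning of a Pythia checkpoint for PHI recognition
# and date normalization framed as text generation. The prompt/target format,
# hyperparameters, and training example below are hypothetical illustrations.
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/pythia-70m"  # the thesis compares several Pythia sizes (70M and up)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach low-rank adapters (LoRA) instead of updating all model weights.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projection in Pythia/GPT-NeoX blocks
)
model = get_peft_model(model, lora_cfg)

# Hypothetical input/output pair: an EMR sentence mapped to its PHI mentions
# plus a normalized date.
examples = [
    {"text": "Input: Seen by Dr. Smith at City Hospital on 3rd Jan 2015.\n"
             "Output: DOCTOR=Smith; HOSPITAL=City Hospital; DATE=3rd Jan 2015 -> 2015-01-03\n"},
]

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pythia-deid-lora",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=2e-4,
        optim="adamw_torch",  # the thesis lists AdamW as its optimizer (Section 3.5.5)
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because LoRA trains only the low-rank adapter matrices, the number of trainable parameters stays small even for the larger Pythia checkpoints, which is consistent with the abstract's observation that LoRA-trained models perform comparably to full fine-tuning.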
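The temporal normalization task maps free-text date mentions to a canonical form (presumably an ISO 8601-style YYYY-MM-DD date), and results are reported as Macro F1, the unweighted mean of per-class F1 scores. Below is a small illustrative sketch with made-up date patterns and PHI labels, not the thesis's actual normalization rules or label set.

```python
# Illustrative only: mapping free-text dates to an ISO 8601-style form and
# computing Macro F1. The date patterns and PHI label set are hypothetical.
from datetime import datetime
from sklearn.metrics import f1_score

def normalize_date(text: str) -> str:
    """Try a few common EMR date spellings and return YYYY-MM-DD if one matches."""
    for fmt in ("%d/%m/%Y", "%d %b %Y", "%d %B %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return text  # leave unparseable expressions unchanged for later post-processing

print(normalize_date("31/01/2024"))  # -> 2024-01-31
print(normalize_date("3 Jan 2015"))  # -> 2015-01-03

# Macro F1 over PHI classes: the unweighted mean of per-class F1 scores,
# so rare classes count as much as frequent ones.
gold = ["DOCTOR", "DATE", "HOSPITAL", "DATE", "DOCTOR"]
pred = ["DOCTOR", "DATE", "DATE",     "DATE", "DOCTOR"]
print(f1_score(gold, pred, average="macro"))  # 0.6 for this toy example
```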
Table of Contents
Abstract (Chinese) ii
Abstract (English) iii
Acknowledgements v
List of Figures viii
List of Tables x
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Research Objectives 2
1.3 Thesis Organization 2
Chapter 2 Related Work 3
2.1 Development of Named Entity Recognition Techniques for Clinical Medical Data 3
2.3 Research on Temporal Information 6
Chapter 3 Methodology 7
3.1 Dataset 7
3.2 Annotation Guidelines 7
3.2.1 Annotation Process 8
3.2.2 Annotation Criteria 10
3.3 Surrogate Replacement of Annotated Content 16
3.4 Data Preprocessing 17
3.5 Fine-tuning Pre-trained Models and Related Tools 18
3.5.1 GPT-2 Pre-trained Model 21
3.5.2 Pythia Pre-trained Model 22
3.5.3 Comparison of GPT-2 and Pythia Models 22
3.5.4 LoRA Architecture 23
3.5.5 AdamW Optimizer 23
3.6 Data Post-processing 23
3.7 Experimental Parameter Design 24
Chapter 4 Experimental Results and Discussion 27
4.1 Statistics of the Annotated Corpus 27
4.2 Performance of Fine-tuned Pre-trained Models on Different Datasets 28
4.2.1 Question 1: Performance of Pre-trained Models on the Validation Set before Fine-tuning 28
4.2.2 Question 2: Performance of Fine-tuned Models on the Validation Set 29
4.2.3 Question 3: Performance of Fine-tuned Models on the Test Set 30
4.3 Fine-tuning Time of Pre-trained Models 34
4.4 Training Results of Each Model 36
Chapter 5 Conclusion 38
References 39
Appendix 43
[1]"ISO 8601", [Online] Available at: https://zh.wikipedia.org/zh-tw/ISO_8601, [Accessed: January. 2024]
[2]Khaled Shaalan, "Rule-based Approach in Arabic Natural Language Processing", International Journal on Information and Communication Technologies, Vol. 3, No. 3, June 2010, [Online] Available at: https://m.marefa.org/w/images/1/17/Rule_based_Arabic_NLP.pdf, [Accessed: January. 2024]
[3]Sudha Morwal, Nusrat Jahan, Deepti Chopra, "Named Entity Recognition using Hidden Markov Model (HMM)", Internation“l Journal on Natural Language Computing (IJNLC) Vol. 1, ”o.4, December 2012, [Online] Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3758852, [Accessed: January. 2024]
[4]J. Lafferty, A. McCallum, and F. C. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", 2001, [Online] Available at: https://repository.upenn.edu/entities/publication/c9aea099-b5c8-4fdd-901c-15b6f889e4a7, [Accessed: January. 2024]
[5]Silvestri, S., Esposito, A., Gargiulo, F., Sicuranza, M., Ciampi, M., & De Pietro, G., "A Big Data Architecture for the Extraction and Analysis of EHR Data", 2019 IEEE World Congress on Services (SERVICES), [Online] Available at: https://ieeexplore.ieee.org/document/8817262, [Accessed: January. 2024]
[6]L. Li, L. Jin, Z. Jiang, D. Song, and D. Huang, "Biomedical named entity recognition based on extended Recurrent Neural Networks", 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), [Online] Available at: https://ieeexplore.ieee.org/abstract/document/7359761, [Accessed: January. 2024]
[7]A. Graves, "Long short-term memory. Supervised sequence labelling with recurrent neural networks", Studies in Computational Intelligence (SCI, volume 385) 2012, [Online] Available at: https://link.springer.com/book/10.1007/978-3-642-24797-2, [Accessed: January. 2024]
[8]Zhiheng Huang, Wei Xu, Kai Yu, "Bidirectional LSTM-CRF Models for Sequence Tagging", arXiv:1508.019“1v1 [cs.CL] 9 Aug 2015, [Online] Available at: https://arxiv.org/pdf/1508.01991.pdf , [Accessed: January. 2024]
[9]Łukasz Kaiser, Samy Bengio, "Can Active Memory Replace Attention?", 30th Conference Neural Information Processing Systems (NIPS2016” ,Barcelona, Spain. [Online] Available at: https://proceedings.neurips.cc/paper_files/paper/2016/file/fb8feff253bb6c834deb61ec76baa893-Paper.pdf, [Accessed: January. 2024]
[10]Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaronvanden Oord, Alex Graves, Koray Kavukcuoglu, " Neural Machine Translation in Linear Time ", arXiv preprint arXiv:1610.10“99v2, 2017, [Online] Available at: https://arxiv.org/pdf/1610.10099.pdf , [Accessed: January. 2024]
[11]Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin, " Convolutional sequence to sequence learning", arXiv preprint arXiv:1705.03122v2, 2017, [Online] Available at: https://arxiv.org/pdf/1705.03122.pdf , [Accessed: January. 2024]
[12]John F. Kolen, Stefan C. Kremer, "Gradient Flow in Recurrent Nets: The Difficulty of Learning LongTerm Dependencies", Wiley-IEEE Press, 2001, [Online] Available at: https://ieeexplore.ieee.org/document/5264952, [Accessed: January. 2024]
[13]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.Gomez, Łukasz Kaiser, "Attention Is All You Need", arXiv preprint arXiv:1706.03762, [Online] Available at: https://arxiv.org/pdf/1706.03762.pdf, [Accessed: January. 2024]
[14]Jianpeng Cheng, Li Dong, Mirella Lapata, "Long short-term memory-networks for machine reading", arXiv preprint arXiv:601.06733,2016, [Online] Available at: https://arxiv.org/pdf/1601.06733.pdf, [Accessed: January. 2024]
[15]Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit, "A Decomposable Attention Model for Natural Language Inference", Empirical Methods in Natural Language Processing, 2016, [Online] Available at: https://arxiv.org/pdf/1606.01933.pdf, [Accessed: January. 2024]
[16]Romain Paulus, Caiming Xiong, Richard Socher, "A Deep Reinforced Model for Abstractive Summarization", arXiv preprint arXiv:1705.04304, 2017, [Online] Available at: https://arxiv.org/pdf/1705.04304.pdf, [Accessed: January. 2024]
[17]Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, Yoshua Bengio, "A Structured Self-attentive Sentence Embedding", arXiv preprint arXiv:1703.03130, 2017, [Online] Available at: https://arxiv.org/pdf/1703.03130.pdf, [Accessed: January. 2024]
[18]Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv preprint arXiv:1810.04805, 2018, [Online] Available at: https://arxiv.org/pdf/1810.04805.pdf, [Accessed: January. 2024]
[18] Andrea Setzer, and Robert Gaizauskas, "Annotating Events and Temporal Information in Newswire Texts", In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00) , Athens, Greece. European Language Resources Association (ELRA), May 2000, [Online] Available at: http://www.lrec-conf.org/proceedings/lrec2000/pdf/321.pdf, [Accessed: January. 2024]
[20]Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky, "SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations", In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 1–9, Atlanta, Georgia, USA. Association for Computational Linguistics, June 2013, [Online] Available at: https://aclanthology.org/S13-2001, [Accessed: January. 2024]
[21]Wentao Ding, Jianhao Chen, Jinmao Li, and Yuzhong Qu, "Automatic rule generation for time expression normalization", In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3135–3144, Punta Cana, Dominican Republic. Association for Computational Linguistics, November 2021, [Online] Available at: https://aclanthology.org/2021.findings-emnlp.269, [Accessed: January. 2024]
[22]Joseph L. Fleis“, "Fleiss' kappa", [Online] Available at: https://en.wikipedia.org/wiki/Fleiss%27_kappa, [Accessed: January. 2024]
[23]Anthony J Viera, Joanne M Garret“, " Understanding interobserver agreement: the kappa statistic", I Fam Med, May 2005, [Online] Available at: https://www1.cs.columbia.edu/~julia/courses/CS6998/Interrater_agreement.Kappa_statistic.pdf, [Accessed: January. 2024]
[24]Lutz Prechelt, "Early Stopping — But When?", Part of the Lecture Notes in Computer Science book series (LNTCS,volume 7700), 2012, [Online] Available at: https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5, [Accessed: January. 2024]
[25]Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, "Language Models are Unsupervised Multitask Learns", Computer Science, Linguistics, 2019, [Online] Available at: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf, [Accessed: January. 2024]
[26]Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyl’ O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal, "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling", arXiv preprint arXiv:2304.01373, 2023, [Online] Available at: https://arxiv.org/pdf/2304.01373.pdf, [Accessed: January. 2024]
[27]Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, "Language Models are Few-Shot Learners”, arXiv preprint arXiv:2005.14165, 2020, [Online] Available at: https://arxiv.org/pdf/2005.14165.pdf, [Accessed: January. 2024]
[28]Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Phang, Jason and Purohit, Shivanshu and Schoelkopf, Hailey and Stander, Dashiell and Songz, Tri and Tigges, Curt and Thérien, Benjamin and Wang, Phil and Weinbach, Samuel, "{GPT-NeoX-20B}: An Open-Source Autoregressive Language Model", arXiv preprint arXiv:2204.06745, 2022, [Online] Available at: https://arxiv.org/pdf/2204.06745.pdf, [Accessed: January. 2024]
[29]Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, "LoRA: Low-Rank Adaptation of Large Language Models", arXiv preprint arXiv:2106.09685, [Online] Available at: https://arxiv.org/pdf/2106.09685.pdf, [Accessed: January. 2024]
[30]Ilya Loshchilov, Frank Hutter, "Decoupled Weight Decay Regularization", arXiv preprint arXiv:1711.05101, [Online] Available at: https://arxiv.org/pdf/1711.05101.pdf, [Accessed: January. 2024]
[31]Liu, J., Shen, D., Zhang, Y., Dolan, W. B., Carin, L., & Chen, W. (2022, May), "What Makes Good In-Context Examples for GPT-3?", In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures (pp. 100-114), [Online] Available at: https://arxiv.org/pdf/2101.06804.pdf, [Accessed: January. 2024]
