
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 楊皓評
Title: 基於微調預訓練大語言模型的資料去識別化效能評估:以澳洲電子健康紀錄為例
Title (English): Data De-identification Performance Evaluation based on Fine-tuned Pre-trained Large Language Models: A Case Study of Electronic Health Records in Australia
Advisor: 戴鴻傑
Advisor (English): Hong-Jie Dai
Committee Members: 李俊宏、朱學亭
Committee Members (English): Chung-Hong Lee, Hsueh-Ting Chu
Oral Defense Date: 2024-01-31
Degree: Master's
Institution: National Kaohsiung University of Science and Technology
Department: Department of Electrical Engineering
Discipline: Engineering
Academic Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2024
Graduation Academic Year: 112
Language: Chinese
Number of Pages: 58
Keywords (Chinese): 電子醫療記錄、受保護的健康資訊、去識別化、自然語言處理、大型語言模型、生成式預訓練轉換器、Pythia
Keywords (English): Electronic Medical Record, Protected Health Information, De-identification, Natural Language Processing, Large Language Model, Generative Pre-trained Transformer, Pythia
Usage statistics:
  • Cited by: 1
  • Views: 413
  • Rating:
  • Downloads: 110
  • Bookmarked: 1
Abstract (Chinese): The emergence of Electronic Medical Records (EMRs) has brought many benefits to the medical field, such as more convenient data analysis, improved patient medication safety, and lower costs for storing pathology reports, thereby raising the quality and efficiency of the care that hospitals and clinics provide. An EMR is a clinical record written by medical personnel, combined with the physician's post-visit analysis; it provides rich clinical information but also contains a large amount of patients' personal privacy information. Consequently, when EMRs are reused for secondary purposes, working directly with records that have not been de-identified can expose the personal data of both medical personnel and patients, so protecting their privacy is an important issue. In addition, the way temporal information is written in EMRs differs across medical personnel and across institutions or organizations, which affects the correctness of temporal analysis; normalizing temporal information is therefore also essential. This thesis studies EMRs from Australian hospitals: the protected health information (PHI) recorded in the EMRs is replaced with surrogates to reduce the risk of personal data leakage and to allow release in compliance with the Health Insurance Portability and Accountability Act (HIPAA). On the released corpus, Pythia-based language models of various sizes and the Generative Pre-trained Transformer 2 (GPT-2) are fine-tuned to recognize PHI in the EMRs and to normalize the corresponding temporal information. After fine-tuning and basic post-processing, the best Pythia model reaches 94.7% on PHI recognition and 84.9% on temporal normalization; taking Pythia 70M as an example, the Macro F1 measure improves by 52.6% and 69.2% on the PHI prediction and temporal normalization tasks, respectively, compared with the model before fine-tuning. The results show that, for the Pythia family, fine-tuning performance peaks at roughly 2.8B parameters, and larger models do not noticeably improve either task. The results also show that models trained with LoRA perform comparably to full fine-tuning.
Abstract (English): The emergence of Electronic Medical Records (EMRs) has brought many benefits to the medical field. For example, it has increased the convenience of data analysis, improved patient medication safety, and reduced the cost of storing pathology reports in hospitals and clinics, thereby enhancing the quality and efficiency of medical care for patients. An EMR is a clinical record written by medical personnel, combined with the physician's post-diagnosis analysis. Direct use of unprocessed EMRs can lead to the leakage of the personal information of medical personnel and patients, making the protection of their privacy an important issue. Additionally, the representation of time information recorded in EMRs varies among institutions and organizations, greatly affecting the analysis of temporal information; normalizing temporal information is therefore also important. This thesis compiles a corpus consisting of EMRs collected from Australian hospitals, annotated with protected health information (PHI) that is replaced by surrogates to reduce the risk of identity theft and to comply with the Health Insurance Portability and Accountability Act (HIPAA). Pythia- and GPT-2-based models are then fine-tuned on the compiled corpus to recognize PHI and normalize the temporal information mentioned in the EMRs. After fine-tuning and basic post-processing, the best Pythia model achieves 94.7% on PHI identification and 84.9% on temporal normalization. For instance, with the Pythia 70M model, the Macro F1 measure increases by 52.6% and 69.2% for the PHI prediction and temporal normalization tasks, respectively, compared with its performance before fine-tuning. According to the results of this study, the Pythia family reaches its best fine-tuning performance at approximately 2.8 billion parameters, and increasing the model size further does not significantly improve performance on either task. In addition, models trained via LoRA achieve performance similar to full fine-tuning.
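The abstracts describe PHI recognition and date normalization as a generation task solved by fine-tuning Pythia and GPT-2 checkpoints, in some runs with LoRA adapters. The following is a minimal sketch of such a setup, assuming the Hugging Face transformers, datasets, and peft libraries; the prompt/target format, hyperparameters, and training example shown here are hypothetical illustrations, not the thesis's actual configuration.

```python
# Minimal sketch: LoRA fine-tuning of a Pythia checkpoint for PHI recognition
# and date normalization framed as text generation. The prompt/target format,
# hyperparameters, and training example below are hypothetical illustrations.
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/pythia-70m"  # the thesis compares several Pythia sizes (70M and up)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach low-rank adapters (LoRA) instead of updating all model weights.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projection in Pythia/GPT-NeoX blocks
)
model = get_peft_model(model, lora_cfg)

# Hypothetical input/output pair: an EMR sentence mapped to its PHI mentions
# plus a normalized date.
examples = [
    {"text": "Input: Seen by Dr. Smith at City Hospital on 3rd Jan 2015.\n"
             "Output: DOCTOR=Smith; HOSPITAL=City Hospital; DATE=3rd Jan 2015 -> 2015-01-03\n"},
]

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pythia-deid-lora",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=2e-4,
        optim="adamw_torch",  # the thesis lists AdamW as its optimizer (Section 3.5.5)
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because LoRA trains only the low-rank adapter matrices, the number of trainable parameters stays small even for the larger Pythia checkpoints, which is consistent with the abstract's observation that LoRA-trained models perform comparably to full fine-tuning.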
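The temporal normalization task maps free-text date mentions to a canonical form (presumably an ISO 8601-style YYYY-MM-DD date), and results are reported as Macro F1, the unweighted mean of per-class F1 scores. Below is a small illustrative sketch with made-up date patterns and PHI labels, not the thesis's actual normalization rules or label set.

```python
# Illustrative only: mapping free-text dates to an ISO 8601-style form and
# computing Macro F1. The date patterns and PHI label set are hypothetical.
from datetime import datetime
from sklearn.metrics import f1_score

def normalize_date(text: str) -> str:
    """Try a few common EMR date spellings and return YYYY-MM-DD if one matches."""
    for fmt in ("%d/%m/%Y", "%d %b %Y", "%d %B %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return text  # leave unparseable expressions unchanged for later post-processing

print(normalize_date("31/01/2024"))  # -> 2024-01-31
print(normalize_date("3 Jan 2015"))  # -> 2015-01-03

# Macro F1 over PHI classes: the unweighted mean of per-class F1 scores,
# so rare classes count as much as frequent ones.
gold = ["DOCTOR", "DATE", "HOSPITAL", "DATE", "DOCTOR"]
pred = ["DOCTOR", "DATE", "DATE",     "DATE", "DOCTOR"]
print(f1_score(gold, pred, average="macro"))  # 0.6 for this toy example
```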
Table of Contents
Abstract (Chinese) ii
Abstract (English) iii
Acknowledgements v
List of Figures viii
List of Tables x
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Research Objectives 2
1.3 Thesis Organization 2
Chapter 2 Related Work 3
2.1 Development of Named Entity Recognition Techniques for Clinical Medical Data 3
2.3 Research on Temporal Information 6
Chapter 3 Methodology 7
3.1 Dataset 7
3.2 Annotation Guidelines 7
3.2.1 Annotation Process 8
3.2.2 Annotation Criteria 10
3.3 Surrogate Replacement of Annotated Content 16
3.4 Data Preprocessing 17
3.5 Fine-tuning Pre-trained Models and Related Tools 18
3.5.1 GPT-2 Pre-trained Model 21
3.5.2 Pythia Pre-trained Model 22
3.5.3 Comparison of GPT-2 and Pythia Models 22
3.5.4 LoRA Architecture 23
3.5.5 AdamW Optimizer 23
3.6 Data Post-processing 23
3.7 Experimental Parameter Design 24
Chapter 4 Experimental Results and Discussion 27
4.1 Statistics of the Annotated Corpus 27
4.2 Performance of Fine-tuned Pre-trained Models on Different Datasets 28
4.2.1 Question 1: Performance of Pre-trained Models on the Validation Set before Fine-tuning 28
4.2.2 Question 2: Performance of Fine-tuned Models on the Validation Set 29
4.2.3 Question 3: Performance of Fine-tuned Models on the Test Set 30
4.3 Fine-tuning Time of Pre-trained Models 34
4.4 Training Results of Each Model 36
Chapter 5 Conclusion 38
References 39
Appendix 43
[1]"ISO 8601", [Online] Available at: https://zh.wikipedia.org/zh-tw/ISO_8601, [Accessed: January. 2024]
[2]Khaled Shaalan, "Rule-based Approach in Arabic Natural Language Processing", International Journal on Information and Communication Technologies, Vol. 3, No. 3, June 2010, [Online] Available at: https://m.marefa.org/w/images/1/17/Rule_based_Arabic_NLP.pdf, [Accessed: January. 2024]
[3]Sudha Morwal, Nusrat Jahan, Deepti Chopra, "Named Entity Recognition using Hidden Markov Model (HMM)", Internation“l Journal on Natural Language Computing (IJNLC) Vol. 1, ”o.4, December 2012, [Online] Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3758852, [Accessed: January. 2024]
[4]J. Lafferty, A. McCallum, and F. C. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", 2001, [Online] Available at: https://repository.upenn.edu/entities/publication/c9aea099-b5c8-4fdd-901c-15b6f889e4a7, [Accessed: January. 2024]
[5]Silvestri, S., Esposito, A., Gargiulo, F., Sicuranza, M., Ciampi, M., & De Pietro, G., "A Big Data Architecture for the Extraction and Analysis of EHR Data", 2019 IEEE World Congress on Services (SERVICES), [Online] Available at: https://ieeexplore.ieee.org/document/8817262, [Accessed: January. 2024]
[6]L. Li, L. Jin, Z. Jiang, D. Song, and D. Huang, "Biomedical named entity recognition based on extended Recurrent Neural Networks", 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), [Online] Available at: https://ieeexplore.ieee.org/abstract/document/7359761, [Accessed: January. 2024]
[7]A. Graves, "Long short-term memory. Supervised sequence labelling with recurrent neural networks", Studies in Computational Intelligence (SCI, volume 385) 2012, [Online] Available at: https://link.springer.com/book/10.1007/978-3-642-24797-2, [Accessed: January. 2024]
[8]Zhiheng Huang, Wei Xu, Kai Yu, "Bidirectional LSTM-CRF Models for Sequence Tagging", arXiv:1508.019“1v1 [cs.CL] 9 Aug 2015, [Online] Available at: https://arxiv.org/pdf/1508.01991.pdf , [Accessed: January. 2024]
[9]Łukasz Kaiser, Samy Bengio, "Can Active Memory Replace Attention?", 30th Conference Neural Information Processing Systems (NIPS2016” ,Barcelona, Spain. [Online] Available at: https://proceedings.neurips.cc/paper_files/paper/2016/file/fb8feff253bb6c834deb61ec76baa893-Paper.pdf, [Accessed: January. 2024]
[10]Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaronvanden Oord, Alex Graves, Koray Kavukcuoglu, " Neural Machine Translation in Linear Time ", arXiv preprint arXiv:1610.10“99v2, 2017, [Online] Available at: https://arxiv.org/pdf/1610.10099.pdf , [Accessed: January. 2024]
[11]Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin, " Convolutional sequence to sequence learning", arXiv preprint arXiv:1705.03122v2, 2017, [Online] Available at: https://arxiv.org/pdf/1705.03122.pdf , [Accessed: January. 2024]
[12]John F. Kolen, Stefan C. Kremer, "Gradient Flow in Recurrent Nets: The Difficulty of Learning LongTerm Dependencies", Wiley-IEEE Press, 2001, [Online] Available at: https://ieeexplore.ieee.org/document/5264952, [Accessed: January. 2024]
[13]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.Gomez, Łukasz Kaiser, "Attention Is All You Need", arXiv preprint arXiv:1706.03762, [Online] Available at: https://arxiv.org/pdf/1706.03762.pdf, [Accessed: January. 2024]
[14]Jianpeng Cheng, Li Dong, Mirella Lapata, "Long short-term memory-networks for machine reading", arXiv preprint arXiv:601.06733,2016, [Online] Available at: https://arxiv.org/pdf/1601.06733.pdf, [Accessed: January. 2024]
[15]Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit, "A Decomposable Attention Model for Natural Language Inference", Empirical Methods in Natural Language Processing, 2016, [Online] Available at: https://arxiv.org/pdf/1606.01933.pdf, [Accessed: January. 2024]
[16]Romain Paulus, Caiming Xiong, Richard Socher, "A Deep Reinforced Model for Abstractive Summarization", arXiv preprint arXiv:1705.04304, 2017, [Online] Available at: https://arxiv.org/pdf/1705.04304.pdf, [Accessed: January. 2024]
[17]Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, Yoshua Bengio, "A Structured Self-attentive Sentence Embedding", arXiv preprint arXiv:1703.03130, 2017, [Online] Available at: https://arxiv.org/pdf/1703.03130.pdf, [Accessed: January. 2024]
[18]Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv preprint arXiv:1810.04805, 2018, [Online] Available at: https://arxiv.org/pdf/1810.04805.pdf, [Accessed: January. 2024]
[18] Andrea Setzer, and Robert Gaizauskas, "Annotating Events and Temporal Information in Newswire Texts", In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00) , Athens, Greece. European Language Resources Association (ELRA), May 2000, [Online] Available at: http://www.lrec-conf.org/proceedings/lrec2000/pdf/321.pdf, [Accessed: January. 2024]
[20]Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky, "SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations", In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 1–9, Atlanta, Georgia, USA. Association for Computational Linguistics, June 2013, [Online] Available at: https://aclanthology.org/S13-2001, [Accessed: January. 2024]
[21]Wentao Ding, Jianhao Chen, Jinmao Li, and Yuzhong Qu, "Automatic rule generation for time expression normalization", In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3135–3144, Punta Cana, Dominican Republic. Association for Computational Linguistics, November 2021, [Online] Available at: https://aclanthology.org/2021.findings-emnlp.269, [Accessed: January. 2024]
[22]Joseph L. Fleis“, "Fleiss' kappa", [Online] Available at: https://en.wikipedia.org/wiki/Fleiss%27_kappa, [Accessed: January. 2024]
[23]Anthony J Viera, Joanne M Garret“, " Understanding interobserver agreement: the kappa statistic", I Fam Med, May 2005, [Online] Available at: https://www1.cs.columbia.edu/~julia/courses/CS6998/Interrater_agreement.Kappa_statistic.pdf, [Accessed: January. 2024]
[24]Lutz Prechelt, "Early Stopping — But When?", Part of the Lecture Notes in Computer Science book series (LNTCS,volume 7700), 2012, [Online] Available at: https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5, [Accessed: January. 2024]
[25]Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, "Language Models are Unsupervised Multitask Learns", Computer Science, Linguistics, 2019, [Online] Available at: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf, [Accessed: January. 2024]
[26]Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyl’ O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal, "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling", arXiv preprint arXiv:2304.01373, 2023, [Online] Available at: https://arxiv.org/pdf/2304.01373.pdf, [Accessed: January. 2024]
[27]Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, "Language Models are Few-Shot Learners”, arXiv preprint arXiv:2005.14165, 2020, [Online] Available at: https://arxiv.org/pdf/2005.14165.pdf, [Accessed: January. 2024]
[28]Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Phang, Jason and Purohit, Shivanshu and Schoelkopf, Hailey and Stander, Dashiell and Songz, Tri and Tigges, Curt and Thérien, Benjamin and Wang, Phil and Weinbach, Samuel, "{GPT-NeoX-20B}: An Open-Source Autoregressive Language Model", arXiv preprint arXiv:2204.06745, 2022, [Online] Available at: https://arxiv.org/pdf/2204.06745.pdf, [Accessed: January. 2024]
[29]Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, "LoRA: Low-Rank Adaptation of Large Language Models", arXiv preprint arXiv:2106.09685, [Online] Available at: https://arxiv.org/pdf/2106.09685.pdf, [Accessed: January. 2024]
[30]Ilya Loshchilov, Frank Hutter, "Decoupled Weight Decay Regularization", arXiv preprint arXiv:1711.05101, [Online] Available at: https://arxiv.org/pdf/1711.05101.pdf, [Accessed: January. 2024]
[31]Liu, J., Shen, D., Zhang, Y., Dolan, W. B., Carin, L., & Chen, W. (2022, May), "What Makes Good In-Context Examples for GPT-3?", In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures (pp. 100-114), [Online] Available at: https://arxiv.org/pdf/2101.06804.pdf, [Accessed: January. 2024]
