臺灣博碩士論文加值系統
Author: 李佑謙 (Lee, You-Qian)
Title: 於語碼混合的去識別資料上之分析與改進:以臺灣電子健康紀錄為例
Title (English): Analysis and Improvement of Code-Mixed De-identified Data: A Case Study of Electronic Health Records in Taiwan
Advisor: 戴鴻傑 (Dai, Hong-Jie)
Committee members: 李俊宏 (Lee, Chung-Hong); 蘇家玉 (Su, Chia-Yu); 楊弘章 (Yang, Horng-Chang)
Oral defense date: 2022-07-19
Degree: Master's
Institution: 國立高雄科技大學 (National Kaohsiung University of Science and Technology)
Department: Electrical Engineering
Discipline: Engineering
Field: Electrical and Information Engineering
Document type: Academic thesis
Publication year: 2022
Graduation academic year: 110
Language: Chinese
Pages: 47
Keywords (Chinese): 去識別化; 語碼混合; 電子健康病例
Keywords (English): De-identification; Code-mixing; Electronic Health Record
Record statistics:
  • Cited: 0
  • Viewed: 132
  • Downloaded: 0
  • Bookmarked: 1
The maturity of electronic medical record technology in Taiwan has made the National Health Insurance Research Database (NHIRD) an important source of clinical big data. A major limitation, however, is that NHIRD lacks details that matter for health science research. In contrast, the unstructured clinical reports written by medical staff record the details of each patient's treatment and are the most thorough source for understanding the course of diagnosis and care. Before such records can be used for research, though, any Protected Health Information (PHI) mentioned in the unstructured text must be removed. Unstructured clinical reports in Taiwan are usually written in a mixture of Chinese and English, which poses challenges for de-identification techniques. To shorten this process and accelerate the adoption of Natural Language Processing (NLP) technology and related applications, this study constructs and refines de-identification annotations for admission and discharge summaries. We focus on analyzing the validity of multilingual models on such Chinese-English code-switched (CS) text, and use the Code-Mixing Index (CMI) to quantify the impact of CS on a model's PHI recognition. The evaluation confirms that multilingually pre-trained models recognize PHI more effectively and greatly reduce the impact of CS. We further improve de-identification performance by combining a word-masked language model with multi-task learning and transferring it to the original Named Entity Recognition (NER) task, obtaining the highest F-score. We also analyze other sources of error (ambiguity and out-of-dictionary words), quantify the improvement brought by transfer learning, and measure the effect of adaptive masking on the model. The results show that our method effectively improves recognition of out-of-dictionary words, especially on examples where the model produced false negatives with a span of 0, raising Macro-F by roughly 2% over the multilingual pre-trained model that had already mitigated the CS problem.
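The abstract quantifies the degree of code-switching with the Code-Mixing Index (CMI) of Gambäck and Das (2014): for a sentence of N tokens, of which U are language-independent (digits, punctuation) and max(w_i) belong to the dominant language, CMI = 100 × (1 − max(w_i)/(N − U)), or 0 when no language-tagged tokens remain. As a minimal sketch (the thesis's own implementation is described in its Section 3.3; the `cmi` helper and the token representation here are illustrative assumptions):

```python
def cmi(tokens):
    """Sentence-level Code-Mixing Index (Gambäck & Das, 2014).

    `tokens` is a list of (word, lang) pairs; lang is a language tag such
    as "zh" or "en", or None for language-independent tokens (numbers,
    punctuation). Returns a value in [0, 100]: 0 for monolingual text,
    larger values for more heavily mixed text.
    """
    n = len(tokens)
    u = 0                # language-independent tokens
    lang_counts = {}     # tokens per language
    for _, lang in tokens:
        if lang is None:
            u += 1
        else:
            lang_counts[lang] = lang_counts.get(lang, 0) + 1
    if not lang_counts:  # n == u: nothing language-tagged
        return 0.0
    return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))

# A code-mixed clinical-style sentence: 3 zh tokens, 2 en tokens, 1 number.
sent = [("病人", "zh"), ("主訴", "zh"), ("chest", "en"),
        ("pain", "en"), ("三", "zh"), ("2", None)]
print(round(cmi(sent), 1))  # 100 * (1 - 3/5) = 40.0
```

A monolingual sentence scores 0, and a perfectly balanced two-language sentence approaches 50, which is why the thesis can bucket sentences by CMI range when analyzing where PHI recognition fails.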
Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Tables
List of Figures
1. Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Research Contributions
2. Literature Review
2.1 Patient Privacy Fields in the Secondary Use of Unstructured Clinical Data
2.1.1 Public De-identification Datasets
2.1.2 Public Code-Mixed Datasets
2.2 Development of Named Entity Recognition Techniques for Clinical Data
2.2.1 Recent Developments in Named Entity Recognition
2.2.2 Named Entity Recognition Methods for Code-Mixed Text
2.3 Current Status and Challenges of Code-Mixing
3. Methods
3.1 Construction of the Code-Mixed De-identification Dataset
3.1.1 Data Source and Ethics
3.1.2 Annotation Guidelines and Workflow
3.2 Code-Mixed Data Preprocessing Pipeline
3.2.1 Preprocessing of Unstructured Electronic Health Records
3.2.2 BILOU Format
3.3 Code-Mixing Index Analysis
3.3.1 Definition of the Code-Mixing Index Formula
3.3.2 Implementation Details of the Mixing-Degree Computation
3.4 Extended Word-Mask Models for Improved Recognition
3.4.1 EHR Word-Mask Pre-trained BERT Model (OwnBERT)
3.4.2 BML Word-Mask Model for PHI (PHI_MLM)
3.4.3 BML Adaptive Word-Mask Model for PHI (Adaptive+PHI_MLM)
3.5 De-identification Methods for Code-Mixed Text
3.5.1 Dictionary-Based Method
3.5.2 Conditional Random Fields
3.5.3 Long Short-Term Memory + Conditional Random Fields
3.5.4 BERT-Based Models
3.5.5 PHI Recognition After Dataset Translation (TransBML)
3.6 Experimental Settings
3.6.1 Machine Learning Parameter Settings
3.6.2 Deep Learning Parameter Settings
3.6.3 MLM Pre-training Parameter Settings
3.7 Evaluation Methodology
4. Results and Discussion
4.1 Distribution Statistics of the Code-Mixed De-identification Corpus
4.2 De-identification Performance on Code-Mixed Text
4.3 Effect of CS on Monolingual and Multilingual Models
4.3.1 CS Scores of BERT-Based Models
4.3.2 Proportion of Erroneous Sentences by CMI Range
4.4 Other Causes and Error Analysis
4.4.1 PHI Ambiguity
4.4.2 PHI Recognition Models Based on Training-Set Features
4.5 Analysis of the BML Adaptive Word-Mask Model for PHI
4.5.1 Getting Trapped in Local Optima
4.5.2 Selection of Adaptive Parameters
4.5.3 Directions for Improving the Adaptive Model
5. Conclusion
6. Glossary of Terms
References


Electronic full text (internet release date: 2027-08-30)