National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


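Graduate student: 吳冠誼 (Wu, Guan-Yi)
Thesis title: 利用機器學習辨識專有名詞 – 以提升中文斷字斷詞的績效
Thesis title (English): Proper Noun Identification Using Machine Learning – Elevate the Performance of Chinese Word Segmentation
Advisor: 陳宗天 (Chen, Tsung-Teng)
Oral examination committee: 陳宗天 (Chen, Tsung-Teng), 李瑞元 (Lee, Maria R.), 王永心 (Wang, Yung-Hsin), 蔡瑞煌 (Tsaih, Rua-Huan)
Oral defense date: 2018-07-30
Degree: Master's
Institution: 國立臺北大學 (National Taipei University)
Department: 資訊管理研究所 (Graduate Institute of Information Management)
Discipline: Computer Science
Sub-discipline: General Computer Science
Thesis type: Academic thesis
Publication year: 2018
Graduation academic year: 106 (ROC calendar)
Language: Chinese
Number of pages:
Keywords (Chinese): 專有名詞、中文斷字斷詞、深度學習、LSTM
Keywords (English): Proper Nouns, Chinese Word Segmentation, Deep Learning, LSTM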


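In English text, words are separated by spaces or special symbols, and proper nouns begin with a capital letter. Words in Chinese text have no such explicit boundaries, so a custom lexicon is commonly used to handle word segmentation and proper noun identification.
Because proper nouns and newly coined terms keep appearing, maintaining a lexicon by hand is inefficient, so this study applies machine learning to the recognition of Chinese words. Common supervised learning approaches, however, require a manually annotated corpus or a dataset with gold-standard answers, and preparing such data takes too much time. This study therefore uses Long Short-Term Memory (LSTM), a deep learning model, as the training model, with text whose punctuation has been replaced by spaces as the training data, to find proper nouns (for example, person names) in an article. Since the predictions have no gold-standard answers, the study adds a threshold to select from the probability transition matrix of each prediction; applies Bayes' theorem and multiple models to extract multi-character words; screens start characters, predicting only from characters more likely to begin a word; and intersects the predictions obtained after training on forward and reversed text, to filter out meaningless words.
Using these methods, the study finds correct and meaningful words, computes Precision and Recall for proper nouns as performance measures, and improves the performance of the existing segmenter Jieba on proper nouns and unknown words.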
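
The abstract names several filtering steps without giving their details. The following is a minimal Python sketch of three of them (punctuation replacement, threshold selection, and the forward/backward intersection), not the thesis's actual code. The function names, the 0.5 threshold, the sample scores, and the assumption that the reversed-text model emits its candidates in reversed character order are all hypothetical choices made for illustration.

```python
import re
from typing import Dict, Set

# Assumption: any run of non-word characters (punctuation, whitespace)
# is collapsed into a single space.
_NON_WORD = re.compile(r"\W+")


def replace_punctuation(text: str) -> str:
    """Replace punctuation with spaces, as the abstract describes for training text."""
    return _NON_WORD.sub(" ", text).strip()


def keep_above_threshold(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Keep only candidate words whose predicted probability reaches the threshold."""
    return {word for word, p in scores.items() if p >= threshold}


def intersect_directions(forward: Set[str], backward: Set[str]) -> Set[str]:
    """Keep candidates found by both the forward and the reversed-text model.

    Assumption: backward candidates are in reversed character order,
    so each one is reversed again before comparison.
    """
    restored = {word[::-1] for word in backward}
    return forward & restored


# Made-up scores standing in for model output.
forward_scores = {"王小明": 0.91, "小明你": 0.32, "台北大學": 0.84}
backward_scores = {"明小王": 0.88, "學大北台": 0.41}

forward = keep_above_threshold(forward_scores, 0.5)
backward = keep_above_threshold(backward_scores, 0.5)
candidates = intersect_directions(forward, backward)
print(replace_punctuation("王小明，你好！"))
print(candidates)
```

In this toy run only the candidate that passes the threshold in both directions survives, which is the intended effect of taking the intersection: a string that looks like a word from one reading direction only is discarded.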

In English text, words are separated by spaces or special symbols, and proper nouns begin with uppercase letters. Words in Chinese text, however, have no such explicit boundaries, so a custom lexicon is often needed to handle word segmentation and proper noun identification.
Because proper nouns and unknown words keep appearing, maintaining a custom lexicon is relatively costly, so this study applies machine learning to Chinese word recognition. We use an LSTM as the training model, train it on articles whose punctuation has been replaced with spaces, and use it to find proper nouns in the articles.
Because the predictions have no standard answers, the study adds several filters: a threshold applied to the probability matrix of each prediction; Bayes' theorem combined with multiple models to find long words; start-character selection, which predicts only from characters more likely to begin a word; and the intersection of the forward and backward prediction results, which filters out meaningless words.
The study also calculates Precision and Recall for proper nouns as performance indicators, and improves the word segmentation performance of Jieba on proper nouns and unknown words.
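
The abstract does not say how Precision and Recall were computed. A minimal sketch under the standard set-based definitions is given below, assuming a hand-checked list of proper nouns is available for the evaluation text. The function name and the sample words are hypothetical and are not results from the thesis.

```python
from typing import Iterable, Tuple


def precision_recall(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted proper nouns against a gold list.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold proper nouns
    """
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Made-up example: 3 predictions, 4 gold proper nouns, 2 in common.
predicted_words = ["王小明", "台北大學", "你好"]
gold_words = ["王小明", "台北大學", "陳宗天", "蔡瑞煌"]
precision, recall = precision_recall(predicted_words, gold_words)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here two of the three predictions are correct (precision 2/3) and two of the four gold proper nouns are found (recall 1/2).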

Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables

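Chapter 1 Introduction
Section 1 Research Background and Motivation
Section 2 Research Objectives
Section 3 Thesis Structure
Chapter 2 Literature Review
Section 1 Text Mining
Section 2 Machine Learning
Section 3 Keras and TensorFlow
Section 4 Deep Learning
Section 5 Chinese Word Segmentation Methods
Section 6 Application of the Hidden Markov Model to Word Segmentation
Section 7 Named Entity Recognition (NER)
Chapter 3 Research Method
Section 1 Research Framework
Section 2 Building the Neural Network Model
Section 3 Filtering the Prediction Results
Section 4 System Evaluation and Tuning
Section 5 Comparison with Existing Systems
Section 6 Tools Used
Chapter 4 Implementation and Results
Section 1 Setting Up the Machine Learning Platform
Section 2 Data Collection and Preparation
Section 3 Modeling and Implementation Results
Section 4 Model Limitations and Comparison
Chapter 5 Conclusions and Future Recommendations
Section 1 Research Contributions
Section 2 System Limitations
Section 3 Conclusions and Recommendations
References
Appendix 1 Part-of-Speech Table
Appendix 2 Code Flow
Curriculum Vitae
Copyright Statement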

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2), 157-166.
Chen, H.-H., & Lee, J.-C. (1996). Identification and classification of proper nouns in Chinese texts. Paper presented at the Proceedings of the 16th conference on Computational linguistics-Volume 1.
Chen, X., Qiu, X., Zhu, C., Liu, P., & Huang, X. (2015). Long Short-Term Memory Neural Networks for Chinese Word Segmentation. Paper presented at the EMNLP.
Chieu, H. L., & Ng, H. T. (2002). Named entity recognition: a maximum entropy approach using global information. Paper presented at the Proceedings of the 19th international conference on Computational linguistics-Volume 1.
Chiu, J. P., & Nichols, E. (2015). Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.
Goh, C.-L., Asahara, M., & Matsumoto, Y. (2005). Chinese Word Segmentation by Classification of Characters. International Journal of Computational Linguistics & Chinese Language Processing, 10(3), 381-396.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
Junyi, S. (2013, 2016). jieba. Retrieved from https://github.com/fxsjy/jieba
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
Lin, Q.-X., Chang, C.-H., & Chen, C.-L. (2010). 結合長詞優先與序列標記之中文斷詞研究 (A Simple and Effective Closed Test for Chinese Word Segmentation Based on Sequence Labeling) [in Chinese]. International Journal of Computational Linguistics & Chinese Language Processing, 15(3-4).
Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354.
Peng, F., Feng, F., & McCallum, A. (2004). Chinese segmentation and new word detection using conditional random fields. Paper presented at the Proceedings of the 20th international conference on Computational Linguistics.
Peng, N., & Dredze, M. (2016). Improving named entity recognition for chinese social media with word segmentation representation learning. arXiv preprint arXiv:1603.00786.
Raschka, S. (2015). Python machine learning: Packt Publishing Ltd.
Rehurek, R. (2009, 2018/02/03). Topic Modelling For Humans. Retrieved from https://radimrehurek.com/gensim
Sullivan, D. (2001). Document warehousing and text mining: techniques for improving business operations, marketing, and sales: John Wiley & Sons, Inc.
Sun, J. (2012). 'Jieba' Chinese word segmentation tool.
Teahan, W. J., Wen, Y., McNab, R., & Witten, I. H. (2000). A compression-based algorithm for Chinese word segmentation. Computational Linguistics, 26(3), 375-393.
Xu, J., & Sun, X. (2016). Dependency-based gated recursive neural network for chinese word segmentation. Paper presented at the Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
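林千翔. (2004). 基於特製隱藏式馬可夫模型之中文斷詞研究 (Chinese Word Segmentation using Specialized HMM).
林大貴. (2017). TensorFlow+Keras深度學習人工智慧實務應用 (TensorFlow+Keras deep learning and artificial intelligence in practice) [in Chinese]. 博碩文化.
陳稼興, 謝佳倫, & 許芳誠. (2000). 以遺傳演算法為基礎的中文斷詞研究 (A study of Chinese word segmentation based on genetic algorithms) [in Chinese]. 資訊管理研究, 2(2), 27-44.
陳譽晏. (2015). 運用 R Shiny 建立文字探勘平台之語意分析及輿情分析 (Semantic analysis and public-opinion analysis on a text-mining platform built with R Shiny) [in Chinese]. Journal of Data Analysis, 10(6), 51-78.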

