臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

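Graduate student: 周君蕙 (Chun-Hui Chou)
Thesis title: 混合後綴陣列與條件隨機域模型之中文部落格網頁菜餚名稱擷取
Thesis title (English): Extracting Dish Names from Chinese Blog Reviews Using Suffix Arrays and Conditional Random Fields Model
Advisor: 蔡宗翰
Degree: Master's
Institution: 元智大學 (Yuan Ze University)
Department: 資訊工程學系 (Computer Science and Engineering)
Field: Engineering
Discipline: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Pages: 35
Keywords (Chinese): 條件隨機域, 後綴陣列, 專有名詞辨識
Keywords (English): Conditional Random Fields, Suffix Arrays, Named Entity Recognition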
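Online blog food reviews are a very useful source of information for building restaurant directories, and extracting dish names is an important task in mining such blogs. In this thesis, we propose a new method for extracting Chinese dish names.
In the first phase, the system uses a suffix-array-based method to identify candidates. In the second phase, it validates the candidates with conditional random fields, which can simultaneously model the correlation between broader (superstring) and narrower (substring) candidates.
In addition to plain-text affix features, our model also considers features specific to web pages: quotation marks, display style (font/color), and image proximity. We expect that adding these extra features will substantially improve on the performance obtained with affix features alone.
Moreover, a single blog review is not enough to find some dish names. The system therefore takes multiple reviews of the same restaurant, aggregated, as its input. This not only provides more of the web-specific features, but also lets the system use frequency to recommend the more frequently mentioned dish names to users.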


Online blog reviews are a useful source of information for creating restaurant directories. An important step in mining restaurant review blogs is extracting dish names. In this paper, we propose a novel two-phase method for extracting dish names from Chinese-language blog reviews.
In the first phase, we identify candidates using an unsupervised suffix array-based approach. In the second, we validate the extracted candidates using a CRF-based approach that can simultaneously model the correlation between sets of super/substring candidates.
In addition to affix features, our model also considers quotation marks, style (font/color/hyperlink), and image proximity. Our experimental results show that adding these extra features significantly improves on the affix-only baseline, by as much as 8.05%.
Unlike traditional approaches that extract words from single documents, we use all review documents corresponding to a restaurant because we believe that using multiple documents will increase the frequency and style information for each dish name.
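
The first phase described above can be illustrated with a minimal sketch, offered only as an assumption-laden example and not as the thesis's actual implementation. It concatenates all reviews for one restaurant, builds a naive suffix array, and counts substrings that repeat across the pooled text. The function names, the separator character, the length and frequency thresholds, and the sample sentences are all hypothetical; this record does not describe the candidate filtering or the CRF validation in enough detail to reproduce them.

```python
from collections import Counter

SEP = "\x00"  # assumed separator; must never occur inside a review


def build_suffix_array(text):
    """Return the start offsets of all suffixes of text in sorted order.

    Naive construction by sorting suffix strings; adequate for a sketch,
    too slow for large corpora.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def repeated_substrings(docs, min_len=2, max_len=6, min_count=2):
    """Count substrings of length min_len..max_len occurring >= min_count times.

    Suffixes that share the same k-character prefix are adjacent in the
    suffix array, so each repeated substring shows up as one contiguous run.
    """
    text = SEP.join(docs)
    sa = build_suffix_array(text)
    counts = Counter()
    for k in range(min_len, max_len + 1):
        run_key, run_len = None, 0
        for start in sa + [None]:  # trailing None flushes the last run
            key = None
            if start is not None:
                key = text[start:start + k]
                # Ignore suffixes that are too short or span two documents.
                if len(key) < k or SEP in key:
                    key = None
            if key is not None and key == run_key:
                run_len += 1
                continue
            if run_key is not None and run_len >= min_count:
                counts[run_key] = run_len
            run_key, run_len = key, (1 if key is not None else 0)
    return counts


if __name__ == "__main__":
    # Hypothetical reviews of a single restaurant.
    reviews = ["今天點了紅燒牛肉麵很好吃", "這家的紅燒牛肉麵湯頭濃", "推薦紅燒牛肉麵"]
    for candidate, freq in repeated_substrings(reviews).most_common(5):
        print(candidate, freq)
```

In this toy input, both 紅燒牛肉麵 and its substring 牛肉麵 occur three times, so frequency alone cannot decide which one is the dish name. That ambiguity is the kind of superstring/substring relationship the abstract says the second phase models.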


Title Page
Thesis Oral Examination Committee Approval
Authorization Form
Chinese Abstract
English Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Background
1.5 Thesis Organization
Chapter 2 Related Work
2.1 Rule-based
2.2 Statistics-based
2.2.1 Unsupervised
2.2.2 Supervised
2.3 Related Work on Chinese Named Entity Extraction and Integration
Chapter 3 System Workflow and Architecture
3.1 System Overview
3.2 System Architecture
3.3 System Interface
Chapter 4 Candidate Identification
4.1 Identifying Dish Names Using Suffix Arrays
4.2 Candidate Filtering
Chapter 5 Candidate Validation
5.1 Formulation
5.2 Conditional Random Fields
5.3 Features
5.3.1 Affix Features
5.3.2 Style Features
5.3.3 Quotation Features
5.3.4 Image Proximity Features
Chapter 6 Experiments and Evaluation
6.1 Dataset
6.2 Experimental Design and Evaluation Method
6.3 Experimental Results
6.3.1 Experiment 1
6.3.2 Experiment 2
6.3.3 Experiment 3
Chapter 7 Analysis and Discussion of Experimental Results
7.1 Contribution of the Additional Features
7.2 Contribution of Conditional Random Fields
7.3 Conditional Random Fields Error Cases
Chapter 8 Conclusion and Future Work
8.1 Conclusion
8.2 Future Work
References



