National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)
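Author: 李奕儒
Author (English): Yi-Ju Li
Thesis title: 多國語言文件探勘技術應用於評估專利文件的相關性之研究
Thesis title (English): Research on Applying Multilingual Text-Mining Approaches to Computing Relatedness Evaluation of Patent Documents
Advisor: 李俊宏
Advisor (English): Chung-Hong Lee
Degree: Master's
Institution: National Kaohsiung University of Applied Sciences (國立高雄應用科技大學)
Department: Department of Electrical Engineering (電機工程系)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2009
Graduation academic year: 97 (ROC calendar)
Language: Chinese
Number of pages: 69
Keywords (Chinese): 多國語言文件探勘; 多國語言專利探勘; 未知詞擷取; 文件群聚
Keywords (English): Multilingual text mining; Multilingual patent mining; Unknown word extraction; Text clustering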
The number of patents has become an important indicator of a country's competitiveness, and the protection of patents and copyrights has become a critical issue in commercial operations. With globalization, cross-national patent search tools have become useful for enterprises assessing product and market development directions, and multilingual patent mining has accordingly become an important research topic. The main goal of multilingual patent mining is to discover implicit relationships among patent documents written in different languages, thereby shortening product development and technical research schedules and helping to avoid patent infringement.
However, patent mining often runs into the problem that proper nouns and technical terms cannot be recognized. This research therefore combines text mining techniques with an unknown word extraction method, developed to automatically extract candidate terms, in order to address the features missed because of insufficient specialized dictionary vocabulary and to improve the accuracy of patent text mining. The experiments first compare several clustering methods; the most suitable clustering algorithm and the best-performing dimensionality are selected by analyzing their best clustering and best matching results. Unknown word extraction is then applied to reduce the loss of patent features. The experiments focus on the essential parts of patent documents, namely the abstract and the claims. The results show that combining multilingual text mining with unknown word extraction improves the performance of multilingual patent mining by more than four percentage points and is significantly better than traditional approaches. It also alleviates the term recognition problems caused by insufficient lexicon coverage and ranks the resulting multilingual patent documents by content similarity, so that highly similar patents are listed first.
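The abstract describes scoring the relatedness of patents across languages and ranking them by similarity, and the table of contents names latent semantic indexing (LSI) with folding-in as the underlying technique. The following is only a minimal sketch of that general idea and not the thesis implementation. It assumes a merged Chinese-English training corpus, raw term counts instead of a tuned weighting scheme, and NumPy's SVD. The toy documents, the labels `EN-battery` and `EN-antenna`, and all function names are hypothetical.

```python
import numpy as np

def term_doc_matrix(docs, index):
    """Build a term-by-document count matrix (rows: terms, columns: documents)."""
    A = np.zeros((len(index), len(docs)))
    for j, tokens in enumerate(docs):
        for t in tokens:
            if t in index:          # out-of-vocabulary tokens are ignored
                A[index[t], j] += 1.0
    return A

def fit_lsi(A, k):
    """Truncated SVD: keep the k largest singular values and left vectors of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k]

def fold_in(tokens, index, U_k, s_k):
    """Project a new document into the k-dimensional space: d^T U_k S_k^{-1}."""
    d = term_doc_matrix([tokens], index)[:, 0]
    return (d @ U_k) / s_k

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical, pre-segmented training documents. Each one merges a Chinese
# text with its English counterpart so that both vocabularies share one space.
training = [
    ["電池", "電極", "battery", "electrode"],
    ["電池", "充電", "battery", "charging"],
    ["天線", "訊號", "antenna", "signal"],
    ["天線", "頻率", "antenna", "frequency"],
]
vocab = sorted({t for doc in training for t in doc})
index = {t: i for i, t in enumerate(vocab)}
U_k, s_k = fit_lsi(term_doc_matrix(training, index), k=2)

# A Chinese query patent and two English candidates (all hypothetical).
query = ["電池", "電極"]
candidates = {
    "EN-battery": ["battery", "charging"],
    "EN-antenna": ["antenna", "signal"],
}
q_vec = fold_in(query, index, U_k, s_k)
ranked = sorted(
    ((name, cosine(q_vec, fold_in(tokens, index, U_k, s_k)))
     for name, tokens in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # the battery patent is ranked ahead of the antenna patent
```

The query and the candidates share no surface tokens, so the ranking comes entirely from co-occurrence in the merged training documents. A term that is missing from `index` contributes nothing to the projection, which is the feature loss that unknown word extraction is meant to reduce.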
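Table of Contents
Chapter 1 Introduction
1.1 Research Background
1.2 Problem Domain
1.3 Research Motivation
1.4 Research Objectives
1.5 Research Scope and Limitations
1.6 Thesis Organization
Chapter 2 Literature Review
2.1 Definition of Patents
2.1.1 Patent System
2.1.2 Patent Specification
2.1.3 Patent Classification
2.1.4 Patent Family
2.2 Review of Patent Research
2.2.1 Multilingual Patent Retrieval
2.2.2 Automatic Summarization
2.2.3 Trend Analysis
2.2.4 Automatic Clustering
2.2.5 Automatic Classification
2.2.6 Patent Visualization
Chapter 3 Multilingual Text Mining
3.1 Document Preprocessing
3.1.1 Word Segmentation
3.1.1.1 Segmentation Ambiguity
3.1.1.2 Unknown Words
3.1.2 Feature Selection
3.1.3 Document Representation
3.2 Perspectives on Multilingual Text Mining
3.2.1 Translation-Based Approaches
3.2.1.1 Dictionary-Based
3.2.1.2 Corpus-Based
3.2.1.3 Hybrid Methods
3.2.1.4 Web-Based Translation Extraction
3.2.2 Multilingual-Space-Based Approaches
3.2.2.1 Semantic Models
3.2.2.2 External Category Models
3.3 Text Mining Methods
3.3.1 Supervised Document Classification
3.3.2 Unsupervised Document Clustering
3.3.3 Multilingual Document Clustering Applied to Patent Documents
Chapter 4 Application of Machine Learning Techniques to Multilingual Patent Document Mining
4.1 Latent Semantic Indexing
4.1.1 Singular Value Decomposition
4.1.2 Query Transformation
4.1.3 Folding-in
4.2 Machine Learning Techniques
4.2.1 Hierarchical Clustering Algorithms
4.2.1.1 Agglomerative Hierarchical Clustering
4.2.1.2 Divisive Hierarchical Clustering
4.2.2 Partitional Clustering Algorithms
4.2.2.1 k-means Clustering
4.2.2.2 Fuzzy c-means Clustering
4.2.2.3 Self-Organizing Map Clustering
Chapter 5 Research Method and Experimental Framework
5.1 Relatedness Evaluation of Multilingual Patent Documents
5.1.1 Document Collection
5.1.2 Constructing the Multilingual Semantic Space
5.1.2.1 Document Extraction
5.1.2.2 Document Preprocessing
5.1.2.3 LSI Module
5.1.3 Patent Document Mapping and Relatedness Measurement
5.2 Unknown Word Extraction and Identification
5.2.1 Term Extraction
5.2.2 Term Identification
Chapter 6 Experimental Results and Analysis
6.1 Relatedness Measurement of Multilingual Patent Documents
6.1.1 Corpus Selection
6.1.2 Relatedness Measurement of Multilingual Patent Documents and Analysis of Results
6.2 Unknown Word Extraction
Chapter 7 Conclusion
7.1 Experimental Results and Discussion
7.2 Results and Contributions of This Research
7.3 Future Research Directions and Priorities
References
Appendix 1
Appendix 2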

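List of Figures
Figure 1.1 Illustration of related-patent expansion
Figure 1.2 System concept diagram and example results
Figure 3.1 Illustration of the term-document matrix
Figure 3.2 Illustration of clustering properties
Figure 4.1 Mathematical representation of the matrix Ak
Figure 4.2 Mathematical representation of folding in terms
Figure 4.3 Mathematical representation of folding in documents
Figure 4.4 Dendrogram of agglomerative hierarchical clustering
Figure 4.5 Fuzzy clustering
Figure 4.6 Self-organizing map neural network model
Figure 5.1 System flowchart
Figure 5.2 System architecture for constructing the multilingual semantic space
Figure 5.3 Flowchart of unknown word extraction
Figure 5.4 Flowchart of online unknown word verification
Figure 5.5 Term translation pairs
Figure 6.1 Effect of dimensionality on accuracy, with Abstract as the sample condition
Figure 6.2 Effect of dimensionality on accuracy, with Claim as the sample condition
Figure 6.3 Effect of dimensionality on accuracy, with Abstract + Claim as the sample condition
Figure 6.4 Matching accuracy of hierarchical clustering under different sample conditions
Figure 6.5 Matching accuracy of k-means under different sample conditions
Figure 6.6 Matching accuracy of SOM, with Abstract as the sample condition
Figure 6.7 Matching accuracy of SOM, with Claim as the sample condition
Figure 6.8 Matching accuracy of SOM, with Abstract + Claim as the sample condition
Figure 6.9 Matching accuracy of fuzzy c-means, with Abstract as the sample condition
Figure 6.10 Matching accuracy of fuzzy c-means, with Claim as the sample condition
Figure 6.11 Matching accuracy of fuzzy c-means, with Abstract + Claim as the sample condition
Figure 6.12 Clustering accuracy of SOM, with Abstract as the sample condition
Figure 6.13 Clustering accuracy of SOM, with Claim as the sample condition
Figure 6.14 Clustering accuracy of SOM, with Abstract + Claim as the sample condition
Figure 6.15 Matching accuracy of SOM, with Abstract as the sample condition
Figure 6.16 Matching accuracy of SOM, with Claim as the sample condition
Figure 6.17 Matching accuracy of SOM, with Abstract + Claim as the sample condition
Figure 1 Example of a Chinese patent
Figure 2 Example of an English patent
Figure 3 System prototype
Figure 4 Evaluation result example 1
Figure 5 Evaluation result example 2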

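List of Tables
Table 5.1 Training data categories and number of documents
Table 5.2 Patent document categories and number of documents
Table 5.3 Dictionary categories
Table 5.4 Partial list of Chinese stop words
Table 6.1 Clustering accuracy of different weightings combined with clustering techniques
Table 6.2 Clustering accuracy of different weightings combined with clustering techniques
Table 6.3 Effect of vector weighting on SOM clustering accuracy
Table 6.4 Effect of vector weighting on k-means clustering accuracy
Table 6.5 Effect of vector weighting on fuzzy c-means clustering accuracy
Table 6.6 Effect of vector weighting on hierarchical clustering accuracy
Table 6.7 Number of extracted terms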