臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)
Author: 曾冠樺
Author (English): Kuan-Hua Tseng
Title: 基於詞語權重的機器學習應用於自動新聞分類
Title (English): Machine Learning in Automated News Categorization Based on Term Weighting
Advisor: 林俊宏
Advisor (English): Chun-Hung Richard Lin
Degree: Doctoral
Institution: National Sun Yat-sen University
Department: Department of Computer Science and Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Document Type: Academic dissertation
Publication Year: 2021
Graduation Academic Year: 109 (2020–2021)
Language: Chinese
Pages: 115
Keywords: Term weighting; Supervised term weighting; Document representation method; Text classification; Feature selection
Metrics:
  • Cited by: 0
  • Views: 119
  • Downloads: 4
  • Bookmarked: 1
Abstract (translated from Chinese):
Analysis techniques for unstructured data, most commonly text files, are important to many applications. For example, hospital pathology reports can be analyzed to assess a patient's condition, and news of various kinds can serve as a basis for decision makers. This dissertation studies the automatic classification of news text, so that news items can be sorted into categories for subsequent use and detailed content analysis. In particular, news-driven stock-market prediction and trend analysis first require the news to be classified automatically. Beyond the machine learning algorithm itself, automatic text classification involves five related techniques under active research: data preprocessing, feature extraction, feature selection, term weighting, and classification methods. In this dissertation, we focus specifically on the effect of term weighting on automatic text classification.
Term weighting is a crucial part of text classification. A well-computed weight should directly reflect a term's importance to the overall classification task. Each document maps to a set of terms, each term is a "feature" of that document, and the term weight serves as the feature value on which the machine learning classifier bases its decision. Because the term-weighting schemes in common use cannot be applied uniformly to all text datasets with good classification accuracy, we propose a supervised feature-value extraction method to address this problem. We also apply several common term-weighting schemes to well-known text datasets and run classification experiments. The results show that a highly weighted term is not necessarily important for classification, because supervised weighting algorithms may lose information during the transformation, leading to inaccuracy.
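The supervised-weighting idea described above can be sketched with TF-ICF (term frequency × inverse category frequency), one of the schemes the dissertation surveys (Section 2.4.3): a term confined to few categories gets a high weight, while a term spread across all categories gets a weight near zero. This is a minimal stdlib-only sketch; the corpus, function names, and category labels are illustrative assumptions, not the dissertation's actual implementation.

```python
import math
from collections import Counter

def icf_table(labeled_docs):
    """labeled_docs: list of (category, term_list) pairs.
    ICF(t) = log(|C| / |{c : t occurs in some document of c}|)."""
    categories = {c for c, _ in labeled_docs}
    term_cats = {}  # term -> set of categories it appears in
    for cat, terms in labeled_docs:
        for t in set(terms):
            term_cats.setdefault(t, set()).add(cat)
    n = len(categories)
    return {t: math.log(n / len(cs)) for t, cs in term_cats.items()}

def tficf_vector(terms, icf):
    """Weight one document: TF(t) * ICF(t) for each term t it contains."""
    tf = Counter(terms)
    return {t: tf[t] * icf.get(t, 0.0) for t in tf}
```

Note that a term occurring in every category receives ICF = log(1) = 0, i.e. it is judged useless for discriminating between categories even if it is frequent — one concrete way a supervised transform can discard information the raw frequencies carried.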
The analysis techniques for unstructured data, e.g., text files, are important to many applications. For example, hospital pathology reports can help analyze a patient's condition, and various kinds of news can inform decision makers. The purpose of this dissertation is to study news categorization so that news can be sorted into categories automatically for subsequent use or deeper content analysis. For stock-market forecasting and trend analysis in particular, the first step is to categorize the news automatically. Besides the machine learning algorithm itself, automatic text classification involves five related technologies studied by many researchers: preprocessing, feature extraction, feature selection, term weighting, and the classification algorithm. In this dissertation, we explore the impact of term weighting on news categorization.
Term weighting is an essential part of text classification. The calculated weight should directly reflect the importance of a term in the entire text. Each document maps to a group of terms, and each term can be represented as a "feature" of the document; the term weight then serves as the document's feature value. Ideally, a classifier trained on features produced by a term-weighting algorithm should obtain good classification results. However, because the term-weighting algorithms commonly used for text classification cannot satisfactorily handle an arbitrarily chosen dataset, we propose a new term-weighting method to address this problem. We also applied standard term-weighting methods to several well-known datasets for comparison. The results show that, contrary to the intuition that the weight reflects how important a term is, terms were not as important as their high weights would suggest. This is due to information loss during the conversion performed by the weighting process.
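The pipeline the abstract describes — weight each term, treat the weights as a document's feature vector, then classify — can be sketched end to end with the unsupervised TF-IDF baseline (Section 2.4.2) and a simple cosine nearest-neighbour classifier standing in for the SVM and logistic-regression classifiers the dissertation actually uses. Everything here (corpus, labels, function names) is an illustrative assumption, stdlib-only:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse vector (dict) per
    document, with weight TF(t) * log(N / DF(t)), plus the IDF table."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency counts each doc once
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs], idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(tokens, labeled_docs):
    """Label a query document with the category of its most similar
    training document in TF-IDF space."""
    texts = [d for _, d in labeled_docs]
    vecs, idf = tfidf_vectors(texts)
    query = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}
    best = max(range(len(vecs)), key=lambda i: cosine(query, vecs[i]))
    return labeled_docs[best][0]
```

Swapping `tfidf_vectors` for a supervised scheme changes only the feature values, not the pipeline — which is exactly why the choice of weighting scheme dominates classification quality in the experiments.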
Thesis Approval Form i
Acknowledgements ii
Chinese Abstract iii
Abstract iv
Contents v
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Related Work 2
1.3 Thesis Organization 6
Chapter 2 Research Method 7
2.1 Framework 7
2.2 Pre-Processing 8
2.3 Vectorize 8
2.4 Traditional Term Weighting Method 9
2.4.1 One-Hot Encoding 9
2.4.2 Term Frequency - Inverse Document Frequency (TFIDF) 9
2.4.3 Term Frequency - Inverse Category Frequency (TFICF) 10
2.5 Supervised Term Weighting Method 11
2.6 Our Proposed Term-Weighting Method 14
2.7 Test Document Representation Method 15
2.8 Classification Methods 19
2.8.1 Binary Approach 19
2.8.2 Support Vector Machines 19
2.8.3 Logistic Regression 20
2.9 Feature Reduction 21
2.9.1 Feature Selection 21
2.9.2 Feature Extraction 22
2.10 Evaluation 23
2.11 Recursive Feature Elimination 26
Chapter 3 Experiment 27
3.1 Datasets 27
3.2 Tools 32
Chapter 4 Results and Discussion 33
4.1 Results 33
4.1.1 Dataset Re0 34
4.1.2 Dataset Re1 43
4.1.3 Dataset Re52 52
4.1.4 Dataset K1a 61
4.1.5 Dataset K1b 70
4.1.6 Dataset Reuters-21578 79
4.2 Discussion 85
4.2.1 The Overlap Rating 85
4.2.2 Information Loss 95
Chapter 5 Conclusion 96
5.1 Conclusion 96
5.2 Future Work 96
References 98
Appendix 1: Oral Defense Committee Comments and Thesis Revisions 102
Committee Comments 102
Responses 102
Appendix 2: Declaration of No Violation of Academic Ethics 104