Graduate Student (English name): YANG, YUN-RONG
Thesis Title (English): Feature Extraction and Hot Topic Mining in US Securities Fraud: An Application of LDA Topic Modeling
Advisor (English name): CHU, SHAN-YING
Keywords (English): securities fraud; text mining; Latent Dirichlet Allocation; topic modeling; feature extraction
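This study uses Latent Dirichlet Allocation (LDA), a text mining method, to perform topic modeling on the Litigation Releases published by the U.S. Securities and Exchange Commission (SEC) from 1995 to 2021. Using an automated text classification approach, it classifies topics from securities fraud cases in order to extract the key features of securities fraud in the United States and to uncover the hot topics in securities litigation over the years. The results indicate that investment-based schemes are currently the most prominent subject of prosecution, followed by insider trading and then unregistered securities offerings.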
Disclosure of market information supports the stable development of the capital market. Although existing regulations require listed companies to disclose information, violations of the disclosure rules continue to emerge, which indicates that the information available to the market remains insufficient. To protect ordinary investors from losses caused by securities fraud, this paper aims to supplement existing market information and to give a more comprehensive picture to investors who are at an informational disadvantage.

This study applies Latent Dirichlet Allocation (LDA), a text mining technique, to conduct topic modeling on 26 years of Litigation Releases published by the U.S. Securities and Exchange Commission (SEC) from 1995 to 2021. Using an automated text classification method, it derives topics from securities fraud cases in order to extract the key features of securities fraud in the United States and to mine the hot topics in securities fraud litigation over the years. According to the findings, investment-based schemes are currently the leading subject of prosecution, followed by insider trading and then unregistered securities offerings.
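
The pipeline summarized above (preprocess text, build a document-term matrix, fit LDA, read off the top terms per topic) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the six toy documents, the tiny stopword set, the function names `tokenize`, `build_dtm`, and `lda_gibbs`, and the hyperparameter values are all hypothetical, and the collapsed Gibbs sampler is a generic textbook way of fitting LDA, not necessarily the estimation method or software used in the thesis.

```python
import random
from collections import Counter

# Toy stand-ins for litigation release texts (hypothetical, not SEC data).
DOCS = [
    "insider trading tip before merger announcement trading profit",
    "ponzi scheme investors promised returns ponzi investors funds",
    "unregistered offering securities sold without registration offering",
    "insider tip trading stock before announcement insider",
    "investors funds misappropriated ponzi scheme returns",
    "unregistered securities offering registration exemption securities",
]
STOPWORDS = {"before", "without", "the", "of", "and"}  # illustrative list only


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def build_dtm(token_lists):
    """Return (vocab, dtm) where dtm[d][v] counts term v in document d."""
    vocab = sorted({w for toks in token_lists for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    dtm = []
    for toks in token_lists:
        row = [0] * len(vocab)
        for w, c in Counter(toks).items():
            row[index[w]] = c
        dtm.append(row)
    return vocab, dtm


def lda_gibbs(dtm, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (theta, phi)."""
    rng = random.Random(seed)
    v = len(dtm[0])
    # Expand counts back into token lists (word order is irrelevant to LDA).
    docs = [[w for w, c in enumerate(row) for _ in range(c)] for row in dtm]
    n_dk = [[0] * k for _ in docs]      # topic counts per document
    n_kw = [[0] * v for _ in range(k)]  # term counts per topic
    n_k = [0] * k                       # total tokens per topic
    z = []                              # topic assignment of every token
    for d, doc in enumerate(docs):
        z.append([rng.randrange(k) for _ in doc])
        for w, t in zip(doc, z[d]):
            n_dk[d][t] += 1
            n_kw[t][w] += 1
            n_k[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][t] -= 1
                n_kw[t][w] -= 1
                n_k[t] -= 1
                # Resample its topic from the full conditional distribution.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta) / (n_k[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                n_dk[d][t] += 1
                n_kw[t][w] += 1
                n_k[t] += 1
    # theta: per-document topic mixture; phi: per-topic term distribution.
    theta = [[(n_dk[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kw[j][w] + beta) / (n_k[j] + v * beta) for w in range(v)]
           for j in range(k)]
    return theta, phi


tokens = [tokenize(doc) for doc in DOCS]
vocab, dtm = build_dtm(tokens)
theta, phi = lda_gibbs(dtm, k=3)
for j, dist in enumerate(phi):
    top = sorted(range(len(vocab)), key=lambda w: -dist[w])[:4]
    print(f"topic {j}:", [vocab[w] for w in top])
```

In this sketch, labeling a topic amounts to inspecting its highest-probability terms in `phi`, while `theta` gives each document's topic mixture, which is the quantity one would aggregate by year to look for hot topics. On a real corpus, the number of topics `k` would be chosen with a diagnostic such as held-out perplexity rather than fixed by hand.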

Abstract (Chinese)
Abstract
Acknowledgement
Contents
List of Figures
List of Tables
Chapter I Introduction
Chapter II Literature Review
2.1 Overview of Market Manipulation
2.1.1 Potential Problems and Causes of Market Manipulation
2.1.2 Difficulties in Identifying Market Manipulation
2.1.3 The Cost of Market Manipulation
2.1.4 Market Manipulation under Asymmetric Information
2.2 The Overview of Text Mining
2.2.1 Value of Text Mining
2.2.2 Framework for Text Mining
2.2.3 Text Mining Applications
2.3 Topic Modeling
2.3.1 Evolution of Topic Modeling
2.3.2 Topic Modeling Applications
Chapter III LDA Topic Modeling
Chapter IV Data Collection and Data Processing
4.1 Data Collection
4.1.1 Data Resource
4.1.2 Data Selection
4.1.3 Web Crawling
4.1.4 Construction of Text Database
4.2 Data Processing
4.2.1 Text Preprocessing
4.2.2 Document-Term Matrix (DTM) Creation
4.2.3 Setting up LDA Parameters
Chapter V Results and Analysis
5.1 Topic Labeling
5.2 Topic Labels Validation
5.3 Feature Extraction of Securities Fraud
5.4 Hot Topic Exploration
Chapter VI Concluding Remarks and Suggestions
References

List of Figures
Figure 1. The Framework of Text Mining
Figure 2. Mechanism of LDA Topic Modeling
Figure 3. Process of Web Crawling
Figure 4. Annual Number of SEC Litigation Releases
Figure 5. Document-Term Matrix Composition
Figure 6. Document-Term Matrix Composition after Removing 1% of Sparse Terms
Figure 7. Perplexity Results in a 5-Fold Cross-Validation Test
Figure 8. Test Result of ldatuning
Figure 9. Topic Labels
Figure 10. Topic Similarity Dendrogram
Figure 11. The Distribution of Papers for the Top 10 Topics

List of Tables
Table 1. Description of LDA Parameters
Table 2. SEC Enforcement Actions
Table 3. NLTK English Stopword List
Table 4. Custom Stopword List
Table 5. Document-Term Matrix
Table 6. Perplexity Results in a 5-Fold Cross-Validation Test
Table 7. LDA Parameters in Our Research
Table 8. Suspected Securities Fraud or Wrongdoing Reported by the SEC
Table 9. Features of Each Securities Fraud
