National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

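Graduate student: 吳庭誼
Graduate student (English name): Ting-Yi Wu
Thesis title: 網頁時間表格領域分類之研究
Thesis title (English): A Study on the Domain Classification of Web Time Tables
Advisor: 周清江
Oral defense committee: 陸承志, 戴敏育, 周清江
Oral defense date: 2019-01-11
Degree: Master's
Institution: Tamkang University (淡江大學)
Department: Master's Program, Department of Information Management
Discipline: Computer Science
Field: General Computer Science
Thesis type: Academic thesis
Publication year: 2019
Graduation academic year: 107 (ROC calendar)
Language: Chinese
Number of pages: 40
Keywords (Chinese): 表格結構, 領域分類, 領域關鍵字, 資料探勘
Keywords (English): Table Structure, Domain Classification, Domain Keyword, Data Mining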
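In recent years, more and more web pages have used tables to present small but meaningful amounts of data, because tabular presentation lets users clearly understand the relationships among the data a table contains. Examples include “transportation timetables”, “guesthouse price lists”, and “clinic schedules”. Among web data tables, many are time-related and strongly affect how users arrange their daily routines; this study calls them web time tables. At present, the web time tables relevant to each application domain are scattered across different web pages, which makes it very inconvenient for users to search or aggregate the data of a given domain. This study therefore investigates how to classify web time tables into domains correctly, so as to greatly improve the integration and use of web time table content in each application domain. We propose a method that matches each table's set of table headers against a domain keyword database, using two decision criteria, the occurrence count of header terms and the TFIDF value of header terms, to determine which domain a table belongs to. Based on these concepts, the study builds a system in the C# programming language and compares the classification performance of the two criteria. Evaluation with the F-Measure shows that both proposed methods help with the domain classification of web time tables.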
More and more web pages use tables to present small amounts of data and their relationships clearly and concisely, mainly because tables make the contents easier to understand. Examples include “traffic timetables”, “hotel and hostel price tables”, and “clinic schedule tables”. Many web tables are time-related and strongly influence internet users’ daily lives; we call them “web time tables”. Currently, the web time tables of each application domain are scattered across many websites, so searching, collecting, and integrating these data is time-consuming and inconvenient. If web time tables could be classified into their domains precisely, their integration and application would improve considerably. We address the following research question: how can a domain classification system for web time tables be designed and developed? We propose first collecting the set of header strings of a web time table, then determining its domain by matching those strings against domain-specific keywords collected through training. For the classification step, we propose two methods: one based on the number of matching keywords, the other based on the TFIDF values of the matching keywords. We implement these concepts and compare the performance of the two methods. Evaluation with the F-Measure indicates that both proposed methods can classify web time tables effectively.
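The two matching strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's C# implementation: the training headers, the domain names, the function names, and the exact TFIDF formulation (term frequency within a domain times log of the inverse domain frequency) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical training data (not from the thesis): for each domain,
# the header sets of a few already-labelled web time tables.
TRAINING = {
    "clinic": [{"doctor", "morning", "afternoon", "time"},
               {"doctor", "department", "morning"}],
    "transport": [{"train", "departure", "arrival", "time"},
                  {"station", "departure", "arrival"}],
    "lodging": [{"room", "weekday", "holiday", "price"},
                {"room", "price", "season"}],
}


def build_keyword_weights(training):
    """Return {domain: {keyword: tfidf}}, treating each domain as one document.

    Assumed formulation: tf = keyword count / total header count in the domain,
    idf = log(number of domains / number of domains containing the keyword).
    """
    counts = {domain: Counter(h for table in tables for h in table)
              for domain, tables in training.items()}
    n_domains = len(counts)
    domain_freq = Counter(k for c in counts.values() for k in c)
    weights = {}
    for domain, c in counts.items():
        total = sum(c.values())
        weights[domain] = {k: (n / total) * math.log(n_domains / domain_freq[k])
                           for k, n in c.items()}
    return weights


def classify_by_count(headers, weights):
    """Method 1 (sketch): pick the domain with the most distinct matching headers."""
    scores = {domain: sum(1 for h in set(headers) if h in w)
              for domain, w in weights.items()}
    # On a tie, max() keeps the first domain in dictionary order.
    return max(scores, key=scores.get)


def classify_by_tfidf(headers, weights):
    """Method 2 (sketch): pick the domain with the largest summed TFIDF weight."""
    scores = {domain: sum(w.get(h, 0.0) for h in set(headers))
              for domain, w in weights.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    weights = build_keyword_weights(TRAINING)
    for headers in (["doctor", "morning", "time"], ["time", "price"]):
        print(headers, classify_by_count(headers, weights),
              classify_by_tfidf(headers, weights))
```

In this toy setup a header shared by several domains (here `time`) gets a low TFIDF weight, so the second method can separate tables that the plain match count leaves tied.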
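Table of Contents
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Table Applications
2.2 Table Structure Recognition
2.2.1 Applications of Table Structure Recognition
2.3 Domain Classification
Chapter 3 System Architecture for Domain Classification of Web Time Tables
3.1 Workflow
3.2 Table Collection
3.3 Structure Recognition of Web Data Tables
3.4 Building the Candidate Domain Lexicon
3.5 Matching Table Headers
3.5.1 Method 1: Based on Distinct Table Headers
3.5.2 Method 2: Classification by TFIDF Values of Table Headers
Chapter 4 Experiments and Comparison
4.1 Table Crawling
4.2 Dataset
4.3 System Evaluation Method
4.4 Experimental Results and Analysis
4.4.1 Training Phase
4.4.2 Experimental Results
4.4.3 Analysis of Classification Errors
4.4.4 Comparison with Other Studies
4.4.5 Discussion
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References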

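List of Tables
Table 3-1: Clinic schedule example 1, table headers after normalization
Table 3-2: Clinic schedule example 1, final table headers and their occurrence counts
Table 4-1: Training results of the Group 3 1% domain lexicon
Table 4-2: TFIDF results of the Group 3 1% domain lexicon
Table 4-3: Confusion matrix of domain classification results
Table 4-4: Performance of domain classification results
Table 4-5: Confusion matrix of domain classification results
Table 4-6: Performance of domain classification results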

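List of Figures
Figure 1-1: Functional structure of a table, Zanibbi [16]
Figure 3-1: Workflow of the web time table classification system
Figure 3-2: Clinic schedule example 1
Figure 3-3: Transportation timetable example 1
Figure 3-4: Guesthouse price list example 1
Figure 4-1: Group 1 training results
Figure 4-2: Group 2 training results
Figure 4-3: Group 3 training results
Figure 4-4: Group 4 training results
Figure 4-5: Group 5 training results
Figure 4-6: Misclassified clinic schedule example 1
Figure 4-7: Misclassified clinic schedule example 2
Figure 4-8: Misclassified transportation timetable example 1
Figure 4-9: Example of another type of table
Figure 4-10: Example of another type of table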
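References
[1]陳雅伶. (2011). 一個自動化網頁資料表格結構辨識系統 [An automated web data table structure recognition system]. Master's thesis, Department of Information Management, Tamkang University.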
[2]Balakrishnan, S., Halevy, A. Y., Harb, B., Lee, H., Madhavan, J., Rostamizadeh, A., Shen, W., Wilder, K., Wu, F., & Yu, C. (2015). Applying WebTables in Practice. Proceedings of the 7th Biennial Conference on Innovative Data Systems Research, paper 3.
[3]Bhagavatula, C. S., Noraset, T., & Downey, D. (2013). Methods for exploring and mining tables on Wikipedia. Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (pp. 18-26).
[4]Buttinger, C., Feilmayr, C., Guttenbrunner, M., Parzer, S., & Pröll, B. (2010). Extracting Room Prices from Web Tables—an Ontology-Aware Approach. Information and Communication Technologies in Tourism 2010, 223-234.
[5]Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., & Zhang, Y. (2008). Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1), 538-549.
[6]Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., & Zhang, Y. (2018). Ten Years of WebTables. Proceedings of the VLDB Endowment, 11(12), 2140-2149.
[7]Crestan, E., & Pantel, P. (2011, February). Web-scale table census and classification. Proceedings of the fourth ACM international conference on Web search and data mining (pp. 545-554). ACM.
[8]Gonzalez, H., Halevy, A., Jensen, C. S., Langen, A., Madhavan, J., Shapley, R., & Shen, W. (2010, June). Google fusion tables: data management, integration and collaboration in the cloud. Proceedings of the 1st ACM symposium on Cloud computing (pp. 175-180). ACM.
[9]Hassanzadeh, O., Ward, M. J., Rodriguez-Muro, M., & Srinivas, K. (2015). Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases–An Empirical Study. Proceedings of the 10th International Workshop on Ontology Matching, pp. 25-34.
[10]Nishida, K., Sadamitsu, K., Higashinaka, R., & Matsuo, Y. (2017). Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture. Proceedings of the 31st Conference on Artificial Intelligence (AAAI 2017), 168-174.
[11]Peng, X., & Choi, B. (2005, February). Document Classifications based on Word Semantic Hierarchies. Artificial Intelligence and Applications (Vol. 5, pp. 362-367).
[12]Venetis, P., Halevy, A., Madhavan, J., Paşca, M., Shen, W., Wu, F., Miao, G., & Wu, C. (2011). Recovering semantics of tables on the web. Proceedings of the VLDB Endowment, 4(9), 528-538.
[13]Wang, B. B., Mckay, R. I., Abbass, H. A., & Barlow, M. (2003, February). A comparative study for domain ontology guided feature extraction. Proceedings of the 26th Australasian computer science conference, Volume 16 (pp. 69-78). Australian Computer Society, Inc.
[14]Wu, S. H., Tsai, T. H., & Hsu, W. L. (2003, July). Text categorization using automatically acquired domain ontology. Proceedings of the sixth international workshop on Information retrieval with Asian languages, Volume 11 (pp. 138-145). Association for Computational Linguistics.
[15]Yin, X., Tan, W., & Liu, C. (2011, March). Facto: a fact lookup engine based on web tables. Proceedings of the 20th international conference on World Wide Web (pp. 507-516). ACM.
[16]Zanibbi, R., Blostein, D., & Cordy, J. R. (2004). A survey of table recognition. International Journal on Document Analysis and Recognition, 7(1), 1-16.