National Digital Library of Theses and Dissertations in Taiwan

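Author: 周世恩 (Shih-En Chou)
Thesis title: An efficient crawling algorithm for large-scale real-time social stream data collection based on popularity prediction
Thesis title (Chinese): 基於預測熱門度之大規模即時社群爬蟲演算法分析與設計
Advisor: 黃乾綱 (Chien-Kang Huang)
Oral defense committee: 張瑞益 (Ray-I Chang), 鄭卜壬 (Pu-Jen Cheng), 陳信希 (Hsin-Hsi Chen)
Oral defense date: 2015-06-29
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Engineering Science and Ocean Engineering
Discipline: Engineering
Subdiscipline: General Engineering
Thesis type: Academic thesis
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar, 2014-2015)
Language: English
Number of pages: 52
Keywords: Social Network (社群網路), Crawler Design (網路爬蟲設計), Information Retrieval (資訊檢索), Behavior Analysis (行為分析)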
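Abstract (Chinese version): In recent years social networks have changed the way we communicate and have accumulated massive amounts of data on human activity, attracting many emerging research topics that draw on social network behavior analysis. Analyzing these problems usually requires a large volume of data, and recent work has moved toward time-domain analysis: a snapshot of a given research target must be taken at regular intervals, and popular messages in particular need denser snapshots to reveal how user behavior changes over time. Because these social networks have complex structures and limit the volume and frequency of data that crawlers may access, collection is not easy for the data acquisition departments of most institutions, and the efficiency of data retrieval cannot be optimized effectively. Obtaining timely and sufficient data requires accessing the social network at high frequency, which not only wastes network resources but also increases the load on the social network. In addition, current social network privacy policies do not allow different organizations to share data; Facebook even protects user data with encrypted IDs. These restrictions make it difficult for a single research institution to share data with others, so it cannot use existing crawl scheduling algorithms to divide the data collection work with other institutions. In this thesis, we propose a new crawl ordering algorithm that takes users' past behavior into account: the more data is collected, the better it can predict whether a collection target is popular and will publish more posts. The designed algorithm addresses resource allocation for large-scale vertical crawling and the problem of dynamic web pages that general-purpose crawlers cannot handle. In this study, we evaluate crawling performance by the popularity of the messages collected per unit of resource. The experimental results show that, when collecting the 99.5% most popular messages on a social network, our algorithm saves up to 40% of the crawler's network calls.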

Abstract (English version): Social media has greatly changed the way we communicate, and a huge amount of social behavior data is recorded and accumulated as a result. This data is now widely applied to emerging research problems that draw on social behavior analysis. More recently, time-domain analysis has become especially popular for investigating behavior change: researchers take snapshots of a particular subject in the network at regular intervals, and hot messages (posts) need more frequent snapshots so that changes in user behavior over time can be captured precisely. Scraping social networking sites such as Twitter and Facebook is not easy for the data acquisition departments of most institutions, because these sites often have complex structures and restrict the amount and frequency of data they expose to ordinary crawlers. To obtain more snapshots, groups often consume more computation power and network resources, and even increase the load on OSN (Online Social Network) sites. In addition, current privacy control policies do not allow different groups to share data with one another. These constraints make it challenging for an individual research group to collect sufficient data, whether by using existing crawl scheduling algorithms or by collaborating with partners. In this thesis, we propose a "Novel Crawling Ordering Algorithm" that lets our crawlers focus on popular content by collecting and analyzing user behavior. The designed crawler also addresses the problems of large-scale vertical crawling and dynamic web pages. The performance of the crawling ordering algorithm is evaluated with purpose-built metrics. The experimental results show that the algorithm can save up to 40% of requests while crawling the top 99.5% of the popular social stream.
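The idea of spending a limited request budget on the sources most likely to be popular can be illustrated with a minimal sketch. This is not the thesis's algorithm: the thesis trains a predictor on user behavior, whereas the sketch below uses mean engagement per collected post as a simple stand-in score. The `Source` class, its fields, the function names, and the sample numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    """A crawl target (for example a public page) and its observed history.

    All fields are hypothetical; they are not taken from the thesis.
    """
    name: str
    posts_seen: int        # posts collected from this source so far
    engagement_seen: int   # total likes + comments + shares observed so far


def average_engagement(src: Source) -> float:
    """Stand-in popularity score: mean engagement per collected post."""
    if src.posts_seen == 0:
        return 0.0
    return src.engagement_seen / src.posts_seen


def plan_crawl(sources: List[Source], budget: int) -> List[str]:
    """Return the names of the `budget` highest-scoring sources, best first."""
    ranked = sorted(sources, key=average_engagement, reverse=True)
    return [s.name for s in ranked[:budget]]


def engagement_coverage(sources: List[Source], chosen: List[str]) -> float:
    """Share of all observed engagement that the chosen sources account for."""
    total = sum(s.engagement_seen for s in sources)
    if total == 0:
        return 0.0
    chosen_set = set(chosen)
    picked = sum(s.engagement_seen for s in sources if s.name in chosen_set)
    return picked / total


# Made-up sample data: five sources, budget of two requests per round.
sources = [
    Source("page_a", posts_seen=10, engagement_seen=5000),
    Source("page_b", posts_seen=40, engagement_seen=400),
    Source("page_c", posts_seen=5, engagement_seen=50),
    Source("page_d", posts_seen=20, engagement_seen=4000),
    Source("page_e", posts_seen=0, engagement_seen=0),
]

plan = plan_crawl(sources, budget=2)
print(plan)  # ['page_a', 'page_d']
print(round(engagement_coverage(sources, plan), 3))  # 0.952
```

With these made-up numbers, two of five requests cover about 95% of the observed engagement, which is the kind of trade-off between requests saved and popular content captured that the abstract reports. A learned predictor would replace `average_engagement` while the ranking and budgeting steps stay the same.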

Oral Examination Committee Certification
Acknowledgements
Chinese Abstract
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Background
1.2 Problem
1.3 Solution
1.4 Scope and organization of this thesis
Chapter 2 Related Works
2.1 Web crawling
2.2 Demand of user behavior analysis
2.3 Data source selection and privacy issue
2.4 Challenge of crawling and the corresponding solution
2.5 Temporal Analysis
2.6 Popularity Prediction
Chapter 3 System Overview
3.1 Crawling Architecture
3.2 Scheduler Implementation
3.3 Fetch Posts
3.4 Problem Definition
3.5 Selected Features
3.6 Ranking strategy
3.6.1 Random crawling
3.6.2 Ranked by overall engagement
3.6.3 Ranked by average engagement
3.6.4 Ranked by predicted-engagement
3.7 Learning Algorithm
3.8 Recrawl Strategies and the Freshness Metric
3.8.1 Model publication frequency
3.8.2 Freshness and Delay
Chapter 4 Experiment
4.1 Data Analysis
4.2 Feature Importance
4.3 Measurements
4.4 Crawling performance
4.5 Comparison of static crawling and dynamic crawling
Chapter 5 Conclusion
REFERENCE



