跳到主要內容

臺灣博碩士論文加值系統

(44.222.218.145) 您好!臺灣時間:2024/03/03 23:40
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

我願授權國圖
: 
twitterline
研究生:陳弘軒
研究生(外文):Hung-Hsuan Chen
論文名稱:尋找資料串流之代表元素
論文名稱(外文):Finding frequent items in data stream environments
指導教授:王家祥
指導教授(外文):Jia-Shung Wang
學位類別:碩士
校院名稱:國立清華大學
系所名稱:資訊工程學系
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2006
畢業學年度:94
語文別:英文
論文頁數:45
中文關鍵詞:資料串流熱門元素即時收視率調查阻斷服務攻擊偵測
外文關鍵詞:data streamfrequent itemon-line TV ratingDoS attack detection
相關次數:
  • 被引用被引用:0
  • 點閱點閱:193
  • 評分評分:
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
  隨著電子化時代的來臨,資訊量呈現爆炸性的增加,資料串流 (data stream) 所扮演的角色愈來愈重要,許多應用程式的資訊都以資料串流的型式傳輸與處理,諸如:無線感知網路 (wireless sensor networks) 的監控、即時熱門電視節目調查、阻斷服務攻擊 (deny of service attack) 的偵測等。資料串流的特性是:(1) 總
資料量龐大,遠超過傳統資料庫或記憶裝置 (如:硬碟) 所能紀錄的量;(2) 資料更新頻率高;(3) 使用者對於資料的要求不再是單次的快照檢索 (one-time query),而是長期的連續檢索 (continuous query)。這些與傳統資料處理的相異點,使資料串流中,資訊統計量的計算變得很困難。
  這篇論文中,我們將提出一種在資料串流中尋找熱門元素的方式,其中,熱門元素指的是在資料串流中,出現頻率超過特定比例的元素。為了能符合資料串流的特性並提高實用性,我們的演算法將達到以下的目標:(1) 給定有限範圍內的任何一段時間,我們的系統將能有效率的回傳這段時間內的熱門元素;(2) 可採用分散式處理。我們的演算法必須能在局部刪整資料,且這個刪整資料的動作不可影響最後熱門元素統計的正確性;(3) 有效率的儲存方式。由於資料串流的資料量龐大,遠超過儲存裝置所能記載的數量,我們必須設計出一個有效率的資料結構來儲存熱門元素;(4) 低計算時間複雜度。資料串流的更新頻率頻繁,因此,此演算法必須能快速的計算出熱門元素的統計資訊。
  我們已經在數學上證明:當熱門元素存在時,此演算法保證正確。實驗也顯示:即使在熱門元素較稀少的環境中,我們的演算法回傳的結果也有很高的比例是出現次數較高的元素。
  In many real world applications, such as monitoring in sensor networks, online TV ratings, and DoS (deny of service) attack detection, information delivering is by the
form of data streams. Differs from the traditional database systems, data streams have the following characteristics: (1) total amount of data is enormous, much more than traditional database systems or memory device (e.g. hard disk) can afford; (2) highly updating rate; (3) rather than one-time query, users tend to issue sort of continuous query instead. These differences make it difficult to calculate the statistics of data streams.
  In this thesis, a method to disclose frequent items in masses of data streams is proposed. Frequent items refer to those whose occurrence rates exceed a given frequency threshold, which can be specified by inquirers. In order to meet the characteristics of data stream processing and without losing the practicability, the proposed approach has the following objectives: (1) any query interval: Given a time period within the system-affordable range, the frequent items of this period will be responded efficiently; (2) distributed processing: the approach can separately prune away many occurrences from bottom to top without losing any fidelity; (3) storage saving: due to the enormous amount of occurrences, we have to design a mechanism to conserve the potential frequent items using only limited local counters; (4) computation saving: because of the highly updating rate, the proposed approach should be able to respond the statistical results of frequent items without delay.
  We proposed a distributed algorithm which fulfills the above requirements in this thesis as well as proved its correctness when frequent items do exist. To demonstrate
the practicability in locating the proper frequent items we conduct some simulations, the experimental results show that most of the items generated are of eminent occurrence rates even though the frequent items are eccentric.
Chapter 1 Introduction ...................................1
Chapter 2 Related Works ..................................3
2.1 The Concept of Data Stream Processing.................3
2.2 Reveal Frequent Items in Continuous Query Processing .5
Chapter 3 The Proposed Approach and Corresponding Algorithms................................................7
3.1 The Basic Majority Algorithm.................................................7
3.2 Distributed Majority Algorithm................................................11
3.3 Majority Algorithm for Data Stream Input ............17
3.3.1 Majority Counting with Data Stream Input ..........17
3.3.2 Majority Counting with Limited Storage Space ......20
3.4 Finding Frequent Items ..............................26
3.5 Some Improvements ...................................35
3.5.1 Raise Success Ratio ...............................35
3.5.2 Reduce the Impact of the Order of Input Elements ..36
Chapter 4 Experimental Results...........................37
4.1 Normal Distribution..................................37
4.2 Data Generated by Zipf Distribution ............................................39
Chapter 5 Conclusions and Future Works...................42
References...............................................44
[1] The STREAM Project homepage: http://www-db.stanford.edu/stream/
[2] The Aurora Project homepage: http://www.cs.brown.edu/research/aurora/
[3] The Niagara Project homepage: http://www.cs.wisc.edu/niagara/
[4] The Telegraph Project homepage: http://telegraph.cs.berkeley.edu/
[5] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and Issues in Data Stream Systems,” in Proceedings of the ACM Symposium on Principles of
Database Systems (PODS), 2002.
[6] M. Datar, A. Gionis, P. Indyk, and R. Motwani, Maintaining Stream Statistics over Sliding Windows,” in Proceedings of the 13th ACM SIAM Symposium on
Discrete Algorithms, 2002.
[7] Robert S. Boyer and J Strother Moore, “MJRTY - A Fast Majority Vote Algorithm,” in Technical Report ICSCA-CMP-32, 1982.
[8] G. S. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams,” in Proceedings of Very Large Data Bases (VLDB), 2002.
[9] M. Charikar, K. Chen, and M. Farach-Colton, “Finding Frequent Items in Data Streams,” in Proceedings of International Colloquium on Automata, Language, and Programming (ICALP), 2002
[10] J. Misra and D. Gries, “Finding Repeated Elements,”in Science of Computer Programming, 1982.
[11] E. Demaine, A. Lopez-Ortiz, and J. I. Munro: Frequency Estimation of Internet Packet Streams with Limited Space,” in Proceedings of the 10th Annual European Symposium on Algorithms, 2002.
[12] G. Cormode and S. Muthukrishman, “What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically,” in ACM Principles of Database Systems (PODS), 2003.
[13] A. Arasu and G. Manku, “Approximate Counts and Quantiles over Sliding Windows,” in ACM Principles of Database Systems (PODS), 2004.
[14] Nan Jiang and Le Gruenwald, “Research Issues in Data Stream Association Rule Mining,” in ACM Special Interest Group on Management of Data (SIGMOD), 2006.
[15] S. K. Park and K. W. Miller, “Random Number Generators: Good Ones Are Hard To Find,” in Communications of ACM, 1988.
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top