跳到主要內容

臺灣博碩士論文加值系統

(34.204.169.230) 您好!臺灣時間:2024/03/03 01:36
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

: 
twitterline
研究生:陳鴻元
研究生(外文):Hung-Yuan Chen
論文名稱:使用線段估計改進長條圖近似
論文名稱(外文):Improving Histogram Approximation with Line Representation
指導教授:陳良弼陳良弼引用關係
指導教授(外文):Arbee L. P. Chen
學位類別:碩士
校院名稱:國立清華大學
系所名稱:資訊工程學系
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2006
畢業學年度:94
語文別:英文
論文頁數:42
中文關鍵詞:長條圖線段近似
外文關鍵詞:HistogramLineApproximate
相關次數:
  • 被引用被引用:0
  • 點閱點閱:247
  • 評分評分:
  • 下載下載:18
  • 收藏至我的研究室書目清單書目收藏:0
長條圖是靠著的一些小量的函式來儲存,而且通常使用來估計資料的分佈情形。在資料庫中,記錄每一個單一屬性的長條圖可以幫助我們來評佔資料庫運算所需的代價。而為了執行查詢的最佳化,通常需要估量這種運算的代價,來決定一個有效率的查詢計畫。另外,長條圖也廣泛的使用在近似查詢系統以及資料探詢上面。這種在資料庫中事先儲存長條條圖的技術,需要用到大量的記憶體空間。所以,如何能夠把長條圖壓縮在一個固定的空間中,而使的壓縮造成的誤差為最小,已經被科學家們視為重要的問題,並且研究了多年。

一般最常用在壓縮長條圖的演算法,是將長條圖切成好幾個格子,然後把每個格子裡的頻率視為相同的數目。於是,這個問題就轉變成,我們要如何去決定最佳格子的邊界而使得壓縮的誤差是最小。對於這個問題,之前的研究已經提供了些不錯的解答。然而,在現實生活中,我們所知道的許多資料分布是非常偏斜的。所以,之前的方法對於這種資料分佈並不是有太好的估計效果,因為它們沒有考慮到資料分布的傾向。在這篇論文中,我們提出一個用線段來估計每一個格子的演算法。在資料分布是非常偏斜或是在現實生活的資料中,它可以更為準確的來壓縮長條圖。我們做了一連串的實驗,並且從實驗的結果中顯示,我們的方法在偏斜的資料分布下是更為精準的。
Histograms are popularly used to approximate data distribution by a small number of step functions. Maintaining histograms for every single attribute in databases helps us estimate the cost of database operations. The query optimizers usually require such estimation of cost to decide an efficient access query plan. Histograms are also widely used in approximate query answering systems and data mining. The techniques that store precomputed histograms in the database require some overhead of memory consumption. Therefore, the problem of compressing the histogram in a fixed amount of space with the least error is considered as an important issue and has been investigated by researchers for many years.

The most common algorithm to compress the histogram is to divide the histogram into buckets and estimate every bucket by uniform representation. The problem becomes how to choose the bucket boundaries to minimize the estimation error for a given number of buckets. The pervious approach has provided a desirable solution to this problem of bucket boundaries selection. However, many data distributions in real-life are well known to be extremely skewed. The pervious algorithms do not perform well for the real data because they do not consider the tendency of data distribution. In this paper, we propose an algorithm that utilizes a line segment to estimate each bucket in replace of uniform representation. The algorithm can construct the histogram more precisely when the data distribution is skewed and is more suitable for the real-world data. We performed a series of experiments, and the results show that our method has better accuracy when the data distribution is skewed.
Abstract i
Acknowledgements ii
Contents iii
Chapter 1: Introduction 1
Chapter 2: Problem Statements 7
Chapter 3: Optimal Histograms 11
3.1 Preliminary 11
3.2 Optimal Histogram Algorithm 12
3.3 Optimal Line Histogram 16
3.3.1 Optimal Line 16
3.3.2 Optimal Line Histogram Algorithm 21
Chapter 4: Approximate Histograms 23
4.1 Approximate Histogram Construction Algorithm 23
Chapter 5: Experimental Results 27
5.1 Experimental Setting 27
5.2 Synthetic Data Sets 28
5.3 Real Data Sets 37
Chapter 6: Conclusion 40
Reference: 41
[1] Acton, F. S. Analysis of Straight-Line Data. New York: Dover, 1966.
[2] S. Acharya, P. Hibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. Proceedings of ACM SIGMOD, pages 275-286, June 1999.
[3] S. Acharya, P. Hibbons, V. Poosala, and S. Ramaswamy. The Aqua Approximate Query Answering System. Proceedings of ACM SIGMOD, pages 574-578, June 1999
[4] M. N. Garofalakis and P. B. Gibbons. Wavelet Synopses with Error Guarantees. Proceedings of ACM SIGMOD, 2002.
[5] M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. Proceedings of ACM SIGMOD, Santa Barbara, May 2001
[6] S. Guha and N. Koudas. Approximating a Data Stream for Querying and Estimation: Algorithms and performance evaluation. Proceeding of ICDE, 2002
[7] S. Guha, N. Koudas, and K. Shim. Data-Streams and Histograms. In Proc. of STOC, 2001
[8] Y. Ioannidis. and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. Proceedings of ACM SIGMOD, pages 233-244, May 1995
[9] H. V. Jagadish, Nick Koudas, and S. Muthukrishnan. Mining Deviants in a Time Series Database. Proceedings of VLDB, 1999
[10] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal Histograms with Quality Guarantees. Proceedings of VLDB, pages 275-286, August 1998.
[11] H. V. Jagadish, Nick Koudas, and K. C. Sevcik. Choosing Bucket Boundaries for a Histogram. University of Toronto Technical Report, TR-375, May 1998
[12] J. Lee, D. Kim, and C. Chung. Multi-dimensional Selectivity Estimation Using Compressed Histogram Information. Proceedings of ACM SIGMOD, pages 205-214, June 1999.
[13] S. Muthukrishnan, R. Shah, and J. S. Vitter. Mining Deviants in Time Series Data Streams. Proceedings of SSDBM, Santorini Island, Greece, June 2004
[14] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-Based Histograms for Selectivity Estimation. Proceedings of ACM SIGMOD, 1998.
[15] Otto B. Linear Algebra with Applications, 2nd ed. Prentice Hall, 2001
[16] V. Poosala and Y. Ioannidis. Selectivity Estimation Without the Attribute Value Independence Assumption. Proceedings of VLDB, Athens Greece, pages 486-495, August 1997
[17] V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improve Histograms for Selectivity Estimation of Range Predicates. Proceedings of ACM SIGMOD, pages 294-305, June 1996
[18] G. P. Shapiro and C. Connell. Accurate Estimation of the Number of Tuples Satisfying a Condition. Proceedings of ACM SIGMOD, pages 256-276, 1984.
[19] http://www.ics.uci.edu/~mlearn/MLRepository.html
[20] http://kdd.ics.uci.edu/
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top