跳到主要內容

臺灣博碩士論文加值系統

(44.220.251.236) 您好!臺灣時間:2024/10/08 11:31
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

: 
twitterline
研究生:李維陞
研究生(外文):Lee, Wei-Shen
論文名稱:在多核心圖形處理器架構下實現一有效率的排序演算法
論文名稱(外文):An Efficient Sorting Algorithm based on Many-core Graphics Processing Units
指導教授:唐傳義
指導教授(外文):Tang, Chuan-Yi
學位類別:碩士
校院名稱:國立清華大學
系所名稱:資訊工程學系
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2010
畢業學年度:98
語文別:英文
論文頁數:46
中文關鍵詞:平行排序圖形處理器雙調排序謝爾排序平行處理統一計算架構
外文關鍵詞:Parallel shellsortGPUBitonic merge sortSortingCUDA
相關次數:
  • 被引用被引用:0
  • 點閱點閱:435
  • 評分評分:
  • 下載下載:45
  • 收藏至我的研究室書目清單書目收藏:0
Sorting is one of most classic algorithmic problem in computer science and its fundamental importance has led to the design and implementation of sorting algorithms on recently evolved many-core Graphic Processing Units (GPUs). CUDPP Radix sort has been the most ecient non-comparison sort for a period of time, and recently published GPU Sample sort is considerably the best comparison-based sort. Though the sorting implementations are quite ecient, they either need extra space for data rearrangement or atomic operation for acceleration. Applications runningon GPU usually deal with huge amount of data thus memory utilization is an important consideration. Also, sorting algorithms on GPUs that do not support atomic operations can result in performance degradation or even fail to work. In this paper, we propose an implementation of a shellsort algorithm for NVIDIA's GPUs. It is an in-place sort and requires no atomic operation support. Our experimental result shows that, on average, its performance is nearly twice faster than GPU quicksort and nearly three times faster than Thurst mergesort. Its performance is almost
the same as GPU sample for sorting up to 32M data elements.
Our implementation is robust to various input distributions and should also be well suited for other many-core architectures.
在計算機科學的研究領域,排序法是一直不斷被拿出來討論的問題之一;由於排序法的基礎性和重要性,使得它現在也廣泛被移植到新興的多核心圖形處理器(GPUs)上。在CUDPP函式庫裡的 GPU radix sort 是一種非比較式的(non-comparison-based)排序法,它是目前在GPU上最快的排序法;而 GPU sample sort 是GPU上最快的比較式(comparison-based)排序。這兩個排序法,分別是兩種不同排序分類裡的領先技術(state of the art);不過它們都需要使用到額外的空間來重組資料,或是使用不可分割運算(atomic operation)來加速。GPU通常被用來處理非常大量的資料,因此,記憶空間的運用顯得特別重要;另外,排序法在沒有不可分割運算(atomic operation)支援的GPU上執行,可能導致效能下降,或甚至無法正常執行。在這個論文裡,我們提出一個在 NVIDIA GPU 上實現shell排序的方法。這個方法不用額外的記憶空間,也不需要不可分割運算的支援。我們的實驗結果也顯示,GPU shellsort 平均上有 GPU quicksort 的兩倍快,與將近快過三倍 Thrust mergesort 。在小於三千兩百萬筆資料時,效能大概和目前比較式排序的中最快的 GPU sample sort 相當。我們所提出在GPU上的 shell 排序法 ,在各種資料分佈上的表現也都非常穩健,沒有一種標準測試用的資料分佈會使這個方法的效能變的非常差。基本上這個方法也適用其它的多核心平台。
1 Introduction 2
1.1 Current Trend of Processor Design . . . . . . . . . . . . . . . 2
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Algorithm Preview . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Results and Contributions . . . . . . . . . . . . . . . . . . . . 4
1.5 Chapter Organization . . . . . . . . . . . . . . . . . . . . . . . 5
2 Related Work 6
2.1 Classication of Parallel Sort . . . . . . . . . . . . . . . . . . . 6
2.2 Previous Works . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Preliminary 10
3.1 Shellsort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 The NVIDIA GPU Architecture . . . . . . . . . . . . . . . . . 13
3.3 CUDA Programming Model . . . . . . . . . . . . . . . . . . . 15
3.4 Performance Guidelines . . . . . . . . . . . . . . . . . . . . . . 17
4 The Algorithm 19
4.1 Overview of Hybrid Sorting Architecture . . . . . . . . . . . . 19
4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Phase 1 { Shellsort . . . . . . . . . . . . . . . . . . . . 22
4.2.2 Phase 2 { Bitonic Merge Sort . . . . . . . . . . . . . . 25
4.2.3 Phase 3 { Odd-Even Bitonic Merge . . . . . . . . . . . 25
4.3 Other Variant of Implementation . . . . . . . . . . . . . . . . 27
5 Algorithmic Complexity 29
5.1 Space Complexity . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . 30
6 Experimental Results 32
6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 32
6.1.1 Algorithms Used . . . . . . . . . . . . . . . . . . . . . 34
6.1.2 Input Distributions . . . . . . . . . . . . . . . . . . . . 34
6.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 35
6.2.1 Space Usage . . . . . . . . . . . . . . . . . . . . . . . . 35
6.2.2 Execution Time . . . . . . . . . . . . . . . . . . . . . . 36
7 Conclusion and Future Work 42
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. MA: The MIT Press, 2001.
[2] S. G. Akl, Parallel Sorting Algorithms. Orlando: Academic Press, 1985.
[3] M. A. Weiss, Data structures and algorithm analysis in C, 2nd ed. CA:Addison-Wesley, 1997.
[4] C. Breshears, The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications, Cambridge: O'Reilly Media, 2009.
[5] D. B. Kirk and W.-M. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, MA: Morgan Kaufmann Publishers, 2010.
[6] D. E. Knuth, The Art of Computer Programming, Vol 3, Addison-Wesley, 1973.
[7] T. R. Halfhill, \Parallel Processing with CUDA," Microprocessor Report, Jan. 2008.
[8] W. A. Martin, \Sorting," ACM Comput. Surv., vol. 3, no. 4, pp. 147-174, Dec. 1971.
[9] D. L. Shell, \A high-speed sorting procedure," Commun. ACM, vol. 2, no. 7, pp. 30-32, July 1959.
[10] H. W. Lang. (2010). Shellsort [Online]. Available: http://www.inf.fh-
ensburg.de/lang/algorithmen/sortieren/shell/shellen.htm.
[11] V. R. Pratt, \Shellsort and Sorting Networks," Ph.D. dissertation. Stanford University, Standford, CA, USA, 1972.
[12] M. Ciura, \Best Increments for the Average Case of Shellsort," 13th International Symposium on Fundamentals of Computation Theory, vol. 2138, pp. 106-117, August 2001.
[13] R. Sedgewick, \Analysis of Shellsort and Related Algorithms," In Proceedings of the Fourth Annual European Symposium on Algorithms, vol. 1136, pp. 1-11, 1996.
[14] Wikipedia. (2010). WIKI: Shell sort [online]. Available: http://en.wikipedia.org/wiki/Shellsort
[15] NVIDIA CUDA Programming Guide, NVIDIA Corporation, 2009, version 2.3.
[16] NVIDIA CUDA C Programming Best Practices Guide, NVIDIA corporation, 2009, version 2.3.
[17] M. Harris, S. Sengupta, J. D. Owens, Parallel Prex Sum (Scan) with CUDA. In GPU Gems 3, Nguyen H., (Ed.), Addison Wesley, Aug. 2007, chapter 31.
[18] R. Baraglia, G. Capannini, F. M. Nardini, and F. Silvestri, Sorting using Bitonic Network with CUDA. In th 7th Workshop on Large Scale Distributed Systems for Information Retrieval, Boston, USA, July, 2009.
[19] N. Satish, M. Harris, and M. Garland, "Designing Ecient Sorting Algorithms for Manycore GPUs," IPDPS, 2009, pp. 1-10.
[20] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. "Efficient Implementationof Sorting on Multi-core SIMD CPU Architecture," PVLDB, vol. 1, no. 2, pp. 1313-1324, August 2008.
[21] N. Leischner, V. Osipov, and P. Sanders, "GPU Sample Sort," IPDPS, 2010, pp. 1-10.
[22] D. R. Helman, D. A. Bader, and J. JaJa, "A randomized parallel sorting algorithm with an experimental study," J. of Parallel and Distributed Computing, vol. 52, no. 1, pp. 1-23, 1998.
[23] S. Sengupta, M. Harris, and M. Garland, \Ecient Parallel Scan Algorithms for GPUs," NVIDIA Technical Report, 2008.
[24] S. Sengupta, M. Harris, Y. Zhang, and J.D. Owens, "Scan primitives for GPU computing." in Graphics Hardware 2007, August 2007, pp. 97-106.
[25] D. Cederman and P. Tsigas, "A practical quicksort algorithm for graphics processors," in Proc. 16th Annual European Symposium on Algorithms, Sep. 2008, pp. 246-258.
[26] S. Chen, J. Qin, Y. Xie, J. Zhao, and P.-A. Heng, \A fast and exible sorting algorithm with CUDA," Proceedings of the 9th International Conference on Algorithms and Architectures for Parallel Processing Springer-Verlag, 2009, pp. 281-290.
[27] E. Sintorn and U. Assarsson, \Fast parallel GPU-sorting using a hybrid algorithm," J. Parallel Distrib. Comput., vol. 68, no. 10, pp. 1381-1388, 2008.
[28] F. Dehne and H. Zaboli. \Deterministic Sample Sort For GPUs," arXiv:1002.4464, 2010.
[29] J. Vitter, "External memory algorithms and data structures: Dealing with massive data," ACM Computing Surveys, pages 209-271, 2001.
[30] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, "An experimental analysis of parallel sorting algorithms," Theory of Computing Systems, Vol. 31, No. 2, March/April
1998, pp. 135-167.
[31] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable Parallel Programming with CUDA," Queue, Vol. 6, No. 2, pp. 40-53, Mar/Apr 2008.
[32] V. Volkov, and Demmel, J. W, "Benchmarking GPUs to tune dense linear algebra, "ACM/IEEE Conference on Supercomputing, Austin, 2008. pp. 1-11.
[33] M. Matsumoto and T. Nishimura, "Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number genera-tor," ACM Transactions on Modeling and Computer Simulation, vol. 8, No. 1, pp. 330, 1998.
[34] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu, "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008, pp. 73-82.
[35] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha, "GPUTeraSort: high performance graphics coprocessor sorting for large database management," In Proceedings of the 2006 ACM SIGMOD international Conference on Management of Data, June 2006.

連結至畢業學校之論文網頁點我開啟連結
註: 此連結為研究生畢業學校所提供,不一定有電子全文可供下載,若連結有誤,請點選上方之〝勘誤回報〞功能,我們會盡快修正,謝謝!
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top
1. 呂錘寬(1982)。南管戲與南管音樂的關係。民俗曲藝,22,33-43。
2. 呂錘寬(1981)。臺灣的南管音樂與南管活動。民俗曲藝,5,38-51。
3. 呂錘寬(1981)。近年來臺灣南管戲活動。民俗曲藝,145,67-73。
4. 吳榮桂(1986)。直笛教材教法研究-談本省現有直笛教材分析。音樂教育,3,17-28。
5. 林小玉(2001)。由音樂藝術之本質探討多元評量於音樂教學之意涵與實踐。音樂藝術學刊,1,61-88。
6. 邱坤良(1982)。談南管藝術的維護。民俗曲藝,16,30-33。
7. 邱家麟(1985)。國民小學直笛教學之探討。國教輔導,25(1),8-10。
8. 陳友新(1988)。兒童(節奏)樂隊的組訓活動之檢討與改進。國教園地,26,15-20。
9. 陳貞銣(1994)。臺南市永福國小兒童合唱團之組訓。傳習,12,169-180。
10. 陳學謙(1990)。兒童合唱團之組訓。國民教育,30(9),18-38。
11. 曾漢榮(1991)。彰化地區國民中學社團活動實施現況調查研究。輔導學報,14,1-33。
12. 蕭穗珍(1980)。泉州之音-南管漫談。高雄文獻,5(6),303-309
13. 謝苑玫、蔡明芳(2007)。由音樂看世界-淺談世界觀音樂教學。美育,156,38-47。
14. 蘇志祥(1982)。南管的沿革與危機。民俗曲藝,13(1),1-6。
15. 衣若蘭,〈近十年兩岸明代婦女史研究(1986-1996)〉,《國立台灣師範大學歷史學報》,第25期,1997年6月。