(34.237.124.210) 您好!臺灣時間:2021/03/02 07:37
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果

詳目顯示:::

我願授權國圖
: 
twitterline
研究生:黃湧鈞
研究生(外文):Huang, Yong-Jyun
論文名稱:應用於GPU之蝶形演算法資料排列
論文名稱(外文):Index Conversion on GPU for Butterfly Based Algorithm
指導教授:許雅三
指導教授(外文):Hsu, Yar-Sun
口試委員:鐘太郎邱瀞德
口試委員(外文):Jong, Tai-LangChiu, Ching-Te
口試日期:2017-09-07
學位類別:碩士
校院名稱:國立清華大學
系所名稱:電機工程學系所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2017
畢業學年度:106
語文別:英文
論文頁數:53
中文關鍵詞:資料排列
外文關鍵詞:GPUdata layoutbutterflyindex permutation
相關次數:
  • 被引用被引用:0
  • 點閱點閱:85
  • 評分評分:系統版面圖檔系統版面圖檔系統版面圖檔系統版面圖檔系統版面圖檔
  • 下載下載:23
  • 收藏至我的研究室書目清單書目收藏:0
現今訊號處理演算法在許多領域扮演重要的角色,例如: 無線通訊、影像處理、頻譜分析…等。隨著運算需求增大,許多研究針對各種訊號處理演算法提出快速演算法 (fast algorithm),而GPU的應用更促使許多平行化的快速演算法的研究蓬勃發展。許多快速演算法都有資料排列的潛在問題,例如:傅立葉轉換、沃爾什-阿達瑪轉換、哈特利轉換、哈爾轉換…等,而許多研究都是在轉換前後額外處理。值得一提的是[2]實作出令人驚艷的函式庫,[2]利用定義的運算子以及排程將資料排列的工作並入轉換的演算法部分。
[1]利用硬體的方式解決異質系統中資料排列的問題,而本篇論文也應用[1]的概念解決上述快速演算法的潛在問題。在位址轉換過程中我們使用[1]的ping-pong傳輸模式以盡可能縮短位址轉換造成的延遲。資料排列方面我們實做出三種模式: bit-reversal order、natural Hadamard order、快速哈爾轉換的資料次序,並模擬使用硬體方式解決資料排列問題後對於六種演算法在GPU上的加速。
在我們的實驗中,訊號處理演算法得到的加速和資料排列的複雜度成正比,但是如果演算法轉換部分運算量較大,得到的加速則會下降。在挑選的演算法裡面最差的加速是快速傅立葉轉換,落在7.12%~11.2%;最佳的加速則是快速沃爾什-阿達瑪轉換,落在15.28%~40.78%。
Signal processing algorithm plays an important role in many fields nowadays, such as wireless communication, image processing, spectrum analysis and so on. As computational demands increase, many researches presented fast algorithm about signal processing algorithms. Besides, evolving GPU technology further boosts researchers parallelize existing fast algorithm. However, fast algorithms such as Fast Fourier Transform (FFT), Fast Walsh-Hadamard Transform (FWHT), Fast Hartley Transform, Fast Haar Transform, suffered from index permutation issue, which are often solved by programmer before or after transformation. However, researchers proposed impressive library solution in [2]. Work in [2] utilizes defined operators and scheduling to achieve index permutation and merges it into transformation algorithm.
[1] adopts hardware method to tackle with data layout issue between heterogeneous processors, and we solve index permutation issue with concept in [1]. We adopt ping-pong transpose mechanism to hide the latency due to index conversion. We implement three types of index permutation, such as bit-reversal order, natural Hadamard order, and index order in Fast Haar Transform. Finally, we show the speedup of six chosen algorithms on GPU after eliminating index permutation problem with our index converter.
In our experiment, the speedup of signal processing algorithm is proportional to the complexity of index distribution, and in inverse proportion with the execution time of fast transform. In selected benchmarks, we obtain worst case in FFT, its speedup ranges from 7.12% to 11.2%, and best case in FWHT, ranging from 15.28% to 40.78%.
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Goals and Contributions 2
1.3 Thesis Organization 3
Chapter 2 Related Work and Background 4
2.1 GPGPU 4
2.2 Smart Controller 6
2.2.1 Background 6
2.2.2 System Overview 7
Chapter 3 Signal Processing Algorithm 9
3.1 FFT[7] 9
3.2 FHT(Fast Hartley Transform)[13][14] 10
3.3 SCHT (Sequency-Ordered Complex Hadamard Transform)[15] 12
3.4 CS-SCHT (Conjugate Symmetric Sequency-Ordered Complex Hadamard Transform) [16][20] 13
3.5 WHT (Walsh-Hadamard Transform) [17][18] 14
3.6 Fast Haar Transform[5][6] 15
Chapter 4 Design and Implementation 17
4.1 Index converter 17
4.1.1 Ping-Pong Transpose Unit 17
4.1.2 Index Redistribution Generator 18
4.2 Transformation Implementation and Index Redistribution Kernel 19
Bit-reverse kernel 19
FFT kernel 20
Hartley transform kernel 22
SCHT and CSSCHT kernel 25
Fast Walst-Hadamard Transform and index redistribution kernel 29
Fast Haar transform and index redistribution kernel 33
Chapter 5 Evaluation 38
5.1 Experimental Environment 38
5.2 Experimental Result Analysis 40
5.2.1 Synthesis Result 40
5.2.2 Overall Performance Analysis 40
Chapter 6 Conclusions and Future Work 48
6.1 Conclusions 48
6.2 Future Work 49
Bibliography 50
[1] Chia-Chen Hsu, Cheng-Yen Lin, Shin Kai Chen, Chih-Wei Liu, and Jenq-Kuen
Lee, “Optimized Memory Access Support for Data Layout Conversion on Heterogeneous Multi-core Systems,” in Embedded Systems for Real-time Multimedia (ESTIMedia), 2014 IEEE 12th Symposium.
[2] Jacobo Lobeiras, Margarita Amor, and Ramón Doallo, “Designing Efficient Index-Digit Algorithms for CUDA GPU Architectures,” in IEEE Transactions on Parallel and Distributed Systems, 2016.
[3] I-Jui Sung, Geng Daniel Liu, and Wen-Mei W. Hwu, “DL: A data layout transformation system for heterogeneous computing,” in Innovative Parallel Computing (InPar), 2012.
[4] Shuai Che, Jeremy W. Sheaffer, and Kevin Skadron, “Dymaxion: Optimizing memory access patterns for heterogeneous systems,” in SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Peter R. Roeser, and M. E. Jernigan, “Fast Haar Transform Algorithms,” in IEEE Transactions on Computers, 1982
[6] Yanwei Pang, Xuelong Li, Yuan Yuan, Dacheng Tao, and Jing Pan, “Fast Haar Transform Based Feature Extraction for Face Representation and Recognition,” in IEEE Transactions on Information Forensics and Security, 2009.
[7] Sreehari Ambuluri, “Implementations of the FFT algorithm on GPU,” 2012
[8] Eladio Gutierrez, Sergio Romero, Maria A. Trenas, and Emilio L. Zapata, “Memory Locality Exploitation Strategies for FFT on the CUDA Architecture,” International Conference on High Performance Computing for Computational Science, 2008.
[9] Wen-Mei W. Hwu and David B. Kirk, “Programming Massively Parallel Processors: A Hands-on Approach,” Burlington, Morgan Kaufmann, 2010.
[10] Muthu Manikandan Baskaran et al., “A compiler framework for optimization of affine loop nests for gpgpus,” Proceedings of the 22nd annual international conference on Supercomputing, 2008
[11] Jee W. Choi et al., “Model-driven autotuning of sparse matrix-vector multiply on GPUs,” Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010.
[12] Alexander Monakov et al., “Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures,” International Conference on High-Performance Embedded Architectures and Compilers, 2010.
[13] H.-J. Meckelburg, and D. Lipka, “Fast Hartley transform algorithm,” Electronics Letters 11th, 1985.
[14] Hsieh S. Hou, “The Fast Hartley Transform Algorithm,” IEEE Transactions on Computers, 1987.
[15] N. Swetha, Panyam Narahari Sastry, Y. Rajasree Rao, and Samrat L. Sabat, “Fast Sequency-Ordered Complex Hadamard Transform-Based Parzen Window Entropy Detection for Spectrum Sensing in Cognitive Radio Networks,” IEEE Communications Letters, 2016.
[16] Aye Aung,Boon Poh Ng, and Susanto Rahardja, “Conjugate Symmetric Sequency-Ordered Complex Hadamard Transform,” IEEE Transactions on Signal Processing, 2009.
[17] M. Kunt, “On Computation of the Hadamard Transform and the R Transform in Ordered Form,” IEEE Transactions on Computers, 1975.
[18] Geadah, and Corinthios, “Natural, Dyadic, and Sequency Order Algorithms and Processors for the Walsh-Hadamard Transform,” IEEE Transactions on Computers, 1977.
[19] Soo-Chang Pei, Chia-Chang Wen, and Jian-Jiun Ding, “Conjugate Symmetric Discrete Orthogonal Transform,” IEEE Transactions on Circuits and Systems, 2014.
[20] Saad Bouguezel, M. Omair Ahmad, and M.N.S. Swamy, “An efficient algorithm for the conjugate symmetric sequency-ordered complex Hadamard transform,” Circuits and Systems (ISCAS), 2011.
[21] Jiasong Wu, Lu Wang, Guanyu Yang, Lotfi Senhadji, Limin Luo, and Huazhong Shu, “Sliding Conjugate Symmetric Sequency-Ordered Complex Hadamard Transform: Fast Algorithm and Applications,” IEEE Transactions on Circuits and Systems I: Regular Papers, 2012.
[22] Eduardo Colmenares, and P. Andersen, “Parallel Computing and its Potential Benefits: A Complex Radix-2 FFT Case Study,” Colmenares Eduardo,and Andersen Per. Parallel computing and its potential benefits: A complex radix-2 FFT case study. Proceedings of the 8th International Multi-Conference on Society, Cybernetics and Informatics, IMSCI 2014.
[23] Yuh-Ming Huang, and Ja-Ling Wu, “A refined fast 2-D discrete cosine transform algorithm,” IEEE Transactions on Signal Processing, 1999.
連結至畢業學校之論文網頁點我開啟連結
註: 此連結為研究生畢業學校所提供,不一定有電子全文可供下載,若連結有誤,請點選上方之〝勘誤回報〞功能,我們會盡快修正,謝謝!
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top
無相關論文
 
無相關期刊
 
無相關點閱論文
 
系統版面圖檔 系統版面圖檔