National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed Record

Author: 許博凱
Author (English): HSU, BO-KAI
Title: 增強式學習系統加速器之晶片設計
Title (English): Chip Design of Accelerator for Reinforcement Learning System
Advisors: 朱元三、陳昱仁
Advisors (English): Chu, Yuan-Sun; Chen, Yu-Je
Committee Members: 朱元三、陳昱仁、劉宗憲、黃永廣
Committee Members (English): Chu, Yuan-Sun; Chen, Yu-Je; Liu, Tsung-Hsien; Wong, Wing-Kwong
Oral Defense Date: 2020-07-28
Degree: Master's
Institution: National Chung Cheng University (國立中正大學)
Department: Institute of Electrical Engineering (電機工程研究所)
Discipline: Engineering
Field: Electrical and Computer Engineering
Document Type: Academic thesis
Year of Publication: 2020
Academic Year of Graduation: 108 (2019/20)
Language: Chinese
Pages: 51
Keywords (Chinese): 機器學習、增強式學習、Q-Learning、Dyna-Q、加速器、ASIC
Keywords (English): Machine Learning, Reinforcement Learning, Q-Learning, Dyna-Q, Accelerator, ASIC
Reinforcement learning is a subfield of machine learning that studies how an agent should act in an environment so as to maximize cumulative reward. The agent is trained in an unknown environment and learns from every decision it makes: each action is assigned a reward, and by accumulating these rewards the agent can discover the optimal policy for the environment. This trial-and-error process, however, makes learning times very long, so software implementations require a large amount of computation time.

This thesis proposes an accelerator chip, based on Q-learning, that implements the training computation in hardware to reduce the time spent in training. Q-values are stored in and accessed from single-port SRAM, and the Dyna-Q concept is incorporated to raise overall learning efficiency: Q-value updates are parallelized, so that training time is reduced both while the agent is waiting on the environment and during the update process itself. The environment is not hard-wired into the accelerator, so the chip can learn in unknown environments; it supports 512 states and 4 actions. The accelerator was implemented in the TN40G (40 nm) process, reaches an operating frequency of 550 MHz, and has been verified in several different environments to achieve the intended Q-learning behavior.
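For context, the update rule at the core of such a design is the standard tabular Q-learning rule (the textbook Watkins form, with learning rate α and discount factor γ; this is the generic rule, not a formula quoted from the thesis):

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] $$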
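To make the Dyna-Q idea in the abstract concrete, below is a minimal software sketch of tabular Dyna-Q. It is a reference model only, not the thesis's hardware design; the 512-state, 4-action sizing comes from the abstract, while the environment interface, hyperparameters, and planning count are illustrative assumptions.

import numpy as np

# Minimal tabular Dyna-Q sketch in software -- a reference model, not the
# thesis's RTL. Sizing (512 states, 4 actions) follows the abstract; the
# environment interface, hyperparameters, and planning count are assumptions.
N_STATES, N_ACTIONS = 512, 4
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration
PLANNING_STEPS = 10                     # model-based replays per real step

Q = np.zeros((N_STATES, N_ACTIONS))     # the table the chip would keep in SRAM
model = {}                              # (s, a) -> (r, s'): learned transition memory
rng = np.random.default_rng(0)

def q_update(s, a, r, s_next):
    # Standard Q-learning update: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
    Q[s, a] += ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a])

def dyna_q_step(env_step, s):
    # env_step is any callable (s, a) -> (reward, next_state)
    a = int(rng.integers(N_ACTIONS)) if rng.random() < EPSILON else int(Q[s].argmax())
    r, s_next = env_step(s, a)
    q_update(s, a, r, s_next)           # direct update from real experience
    model[(s, a)] = (r, s_next)         # remember the observed transition
    keys = list(model)                  # planning: replay stored transitions
    for _ in range(PLANNING_STEPS):
        ps, pa = keys[rng.integers(len(keys))]
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next)
    return s_next

These direct and planning updates are the kind of work the thesis moves into parallel hardware: the planning replays can proceed while the agent waits on the environment, which is how the Dyna-Q concept shortens wall-clock training time.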

Acknowledgements
Abstract (in Chinese)
Abstract (in English)
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
1.1  Background
1.2  Motivation
1.3  Thesis Organization
Chapter 2  Background
2.1  Markov Decision Processes (MDPs)
2.1.1  Markov Process
2.1.2  Markov Reward Processes
2.1.3  State Value Function
2.1.4  Markov Decision Processes
2.2  Bellman Equation
2.2.1  Bellman Equation for Markov Reward Processes
2.2.2  Policy
2.2.3  Action Value Function
2.2.4  Bellman Equation for Markov Decision Processes
2.3  Reinforcement Learning
2.3.1  ε-Greedy Policy
2.3.2  Q-Learning
2.3.3  SARSA (State-Action-Reward-State-Action)
2.3.4  Q-Learning versus SARSA
2.3.5  Dyna-Q
Chapter 3  Environment Architecture
3.1  Overall Environment and System Architecture
3.1.1  System Architecture
3.1.2  Simulated Environment
3.1.3  Software Simulation
3.2  Computation Design and Flow
3.2.1  Computation Flow
Chapter 4  Accelerator Architecture for Reinforcement Learning
4.1  Hardware Architecture
4.2  Hardware Implementation
4.2.1  Q-value Access Architecture
4.2.2  Maximum Q-value Selection Architecture
4.2.3  Q-update Architecture
4.2.4  Environment Feedback Signals
4.2.5  Dyna-Q Architecture
4.2.6  Chip Layout
Chapter 5  Experimental Results and Comparison
5.1  Environment Simulation Results
5.1.1  Q-value Training Behavior
5.1.2  Environment Convergence Behavior
5.2  System and Architecture Comparison
5.2.1  Q-Learning versus Dyna-Q
5.2.2  Hardware-Optimized Computation
5.2.3  Speedup
5.3  Comparison with Related Work
Chapter 6  Conclusion and Future Work
6.1  Conclusion
6.2  Future Work
References


