
National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed Record

Graduate Student: 紀律呈
Graduate Student (English): Lu-cheng Chi
Thesis Title: 基於稀疏報酬改良深度加強式學習
Thesis Title (English): An Improved Deep Reinforcement Learning with Sparse Rewards
Advisor: 黃國勝
Advisor (English): Kao-Shing Hwang
Degree: Master
Institution: 國立中山大學 (National Sun Yat-sen University)
Department: 電機工程學系研究所 (Department of Electrical Engineering)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Year of Publication: 2018
Graduation Academic Year: 107
Language: Chinese
Number of Pages: 43
Keywords (Chinese): Asynchronous Advantage Actor-Critic演算法, 稀疏報酬, 監督式學習, 加強式學習, Actor-Critic演算法
Keywords (English): Sparse Rewards, Supervised Learning, Asynchronous Advantage Actor-Critic algorithm, Actor-Critic algorithm, Reinforcement Learning
Statistics:
  • Cited by: 0
  • Views: 170
  • Downloads: 0
  • Bookmarked: 0
Abstract:
In reinforcement learning, how an agent explores in an environment with sparse rewards is a long-standing problem. The improved deep reinforcement learning method described in this thesis encourages the agent to explore unvisited environment states in sparse-reward environments.
In deep reinforcement learning, the agent uses image observations of the environment directly as the input to a neural network. However, some neglected observations of the environment, such as depth, may provide valuable information.
The improved method is based on the Actor-Critic algorithm and uses a convolutional neural network as a hetero-encoder between the image input and the other environmental observations. In sparse-reward environments, these neglected observations serve as the target outputs of supervised learning, providing the agent with denser training signals to bootstrap reinforcement learning. In addition, the supervised learning loss is fed back to the agent as a reward for its exploratory behavior, called the label reward, which encourages the agent to explore unvisited environment states. Finally, the Asynchronous Advantage Actor-Critic algorithm is used to construct multiple neural networks so that multiple agents jointly learn a single policy.
The improved method is compared with other deep reinforcement learning methods in sparse-reward environments and achieves better performance.
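To make the mechanism described in the abstract concrete, the following is a minimal sketch, not the thesis's actual code: a PyTorch-style actor-critic network with an auxiliary depth-prediction head whose supervised loss is also returned as a "label reward" bonus on top of the sparse environment reward. The layer sizes, the MSE depth loss, the scaling factor beta, and all class and function names are illustrative assumptions; the record does not specify the implementation details.

# Minimal sketch (assumptions throughout): an actor-critic network with an
# auxiliary head that predicts a coarse depth map from image features, and a
# helper that turns the supervised loss into a "label reward" bonus.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACWithDepthHead(nn.Module):
    def __init__(self, num_actions, depth_dim=64):
        super().__init__()
        # Convolutional encoder shared by the policy, value, and depth heads.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 32 * 9 * 9                      # for 84x84 RGB observations
        self.fc = nn.Linear(feat_dim, 256)
        self.policy = nn.Linear(256, num_actions)  # actor: action logits
        self.value = nn.Linear(256, 1)             # critic: state value
        # Auxiliary ("hetero-encoder") head: image features -> coarse depth.
        self.depth = nn.Linear(256, depth_dim)

    def forward(self, obs):
        h = F.relu(self.fc(self.encoder(obs)))
        return self.policy(h), self.value(h), self.depth(h)

def label_reward(pred_depth, true_depth, env_reward, beta=0.1):
    # The supervised (auxiliary) loss doubles as an intrinsic bonus: states
    # whose depth is predicted poorly (i.e., unfamiliar states) yield a larger
    # bonus, which encourages exploration when env_reward is sparse.
    sup_loss = F.mse_loss(pred_depth, true_depth)
    return env_reward + beta * sup_loss.detach(), sup_loss

# Usage with random tensors standing in for one batch of observations.
net = ACWithDepthHead(num_actions=4)
obs = torch.rand(8, 3, 84, 84)                     # batch of RGB frames
true_depth = torch.rand(8, 64)                     # coarse depth targets
logits, value, pred_depth = net(obs)
shaped_r, sup_loss = label_reward(pred_depth, true_depth,
                                  env_reward=torch.zeros(8))

In an A3C-style setup, several worker processes would each run such a network against their own copy of the environment and asynchronously apply the policy, value, and supervised-loss gradients to shared parameters; that machinery is omitted from the sketch.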
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iv
List of Figures vi
List of Tables viii
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Thesis Organization 1
Chapter 2 Background 2
2.1 Reinforcement Learning 2
2.2 Neural Networks 3
2.2.1 Neurons 4
2.3 Actor-Critic Algorithm 5
2.4 Asynchronous Advantage Actor-Critic (A3C) Algorithm 6
Chapter 3 Methodology 9
3.1 Nav A3C Algorithm 9
3.1.1 Long Short-Term Memory 9
3.1.2 Agent Observations 10
3.1.3 Neural Network 10
3.2 Supervised Learning in Deep Reinforcement Learning 12
3.2.1 Neural Network and Loss Function 13
3.3 Label Reward 14
3.4 Training 15
3.5 Hardware and Software Configuration 17
Chapter 4 Simulation Experiments 18
4.1 First-Person 3D Maze 18
4.1.1 Experimental Environment 18
4.1.2 Depth Prediction 19
4.1.3 Simulation Results 22
4.2 Wafer Inspection 24
4.2.1 Experimental Environment 24
4.2.2 Inspection Count Prediction 26
4.2.3 Simulation Results 28
Chapter 5 Conclusions and Future Work 30
5.1 Conclusions 30
5.2 Future Work 30
References 31
[1] R. Sutton and A. Barto, Introduction to Reinforcement Learning, MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level Control through Deep Reinforcement Learning”, Nature, 518(7540):529–533, 2015.
[3] R. J. Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”, Machine Learning, 8(3-4):229–256, 1992.
[4] A. Zell, Simulation Neuronaler Netze, Chapter 5.2, 1st edition, 1994.
[5] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning”, ArXiv preprint arXiv:1602.01783, 2016.
[6] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude”, COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
[7] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al., “Learning to Navigate in Complex Environments”, ICLR, 2017.
[8] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to Construct Deep Recurrent Neural Networks”, ArXiv preprint arXiv:1312.6026, 2013.
[9] B. Bakker, “Reinforcement Learning with Long Short-Term Memory”, NIPS, 1475–1482, 2001.
[10] P. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A Tutorial on the Cross-Entropy Method”, Annals of Operations Research, 134(1):19–67, 2005.
[11] C. Beattie, et al., “DeepMind Lab”, ArXiv preprint arXiv:1612.03801, 2016.
[12] D. Eigen, C. Puhrsch, and R. Fergus, “Depth Map Prediction from a Single Image using a Multi-scale Deep Network”, Proc. of Neural Information Processing Systems, NIPS, 2366–2374, 2014.