Author: Chen, Yi-Ching (陳怡靜)
Title: Solving Rubik's Cube by Policy Gradient Based Reinforcement Learning (Chinese title: 應用強化獎勵機制學習解魔術方塊)
Advisor: Lin, Youn-Long (林永隆)
Committee Members: Chen, Hwann-Tzong (陳煥宗); Huang, Juinn-Dar (黃俊達)
Oral Defense Date: 2018-08-31
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系所)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Document Type: Academic thesis
Publication Year: 2018
Graduation Academic Year: 107 (ROC calendar)
Language: English
Pages: 30
Keywords (Chinese): 強化學習、魔術方塊、策略梯度
Keywords (English): Reinforcement Learning, Rubik's Cube, Policy Gradient
Usage statistics:
  • Cited by: 1
  • Views: 222
  • Downloads: 43
  • Bookmarks: 0
Abstract (Chinese): A reinforcement learning system provides a mechanism for an agent to interact with its environment, and the policy gradient method aims to take good actions as often as possible. We propose applying a linear policy gradient method together with an intensified reward scheme in a reinforcement learning system so that good actions receive higher probabilities. Experimental results show that, with a neural network model, this method can solve some Rubik's Cube instances, but it still cannot solve every instance.
Abstract (English): Reinforcement learning provides a mechanism for training an agent to interact with its environment, and policy gradient methods make good actions more probable. We propose a linear policy gradient method within a deep neural network-based reinforcement learning framework. The proposed method employs an intensifying reward function to increase the probabilities of correct actions when solving the Rubik's Cube. Experiments show that the proposed network learned to solve some Rubik's Cube states; for more difficult initial states, however, it cannot always suggest the correct move.
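The abstract describes the approach only in words. As a rough illustration of the general idea, the sketch below shows a generic REINFORCE-style update with a softmax linear policy and an amplified ("intensified") reward for reaching the goal state. Everything in it is an assumption made for illustration: the ToyScrambleEnv is a trivial stand-in rather than a Rubik's Cube simulator, and the reward_boost factor and hyperparameters are arbitrary; none of it is taken from the thesis implementation.

```python
import numpy as np

class ToyScrambleEnv:
    """Hypothetical stand-in for a cube simulator: undo a random scramble on a ring of states."""
    def __init__(self, n_states=6, scramble_depth=3):
        self.n_states = n_states
        self.scramble_depth = scramble_depth

    def reset(self):
        self.s = np.random.randint(1, self.scramble_depth + 1)
        return self.s

    def step(self, a):
        # Two actions: 0 moves -1, 1 moves +1 (mod n_states); state 0 is "solved".
        self.s = (self.s + (1 if a == 1 else -1)) % self.n_states
        solved = self.s == 0
        return self.s, (1.0 if solved else -0.01), solved

def one_hot(s, n):
    x = np.zeros(n)
    x[s] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_states, n_actions = 6, 2
W = np.zeros((n_actions, n_states))           # parameters of the linear (softmax) policy
env = ToyScrambleEnv(n_states)
alpha, gamma, reward_boost = 0.1, 0.95, 5.0   # reward_boost amplifies the success reward

for episode in range(2000):
    s = env.reset()
    trajectory = []
    for t in range(20):
        x = one_hot(s, n_states)
        p = softmax(W @ x)                     # action probabilities
        a = np.random.choice(n_actions, p=p)
        s, r, done = env.step(a)
        trajectory.append((x, a, p, r))
        if done:
            break
    # Discounted returns; the positive (solved) reward is intensified.
    G, returns = 0.0, []
    for (_, _, _, r) in reversed(trajectory):
        G = (reward_boost * r if r > 0 else r) + gamma * G
        returns.append(G)
    returns.reverse()
    # REINFORCE update: W += alpha * G * grad log pi(a | s) for a softmax-linear policy.
    for (x, a, p, _), G in zip(trajectory, returns):
        grad = -np.outer(p, x)
        grad[a] += x
        W += alpha * G * grad
```

The point of the sketch is only to show where an intensified reward enters a policy-gradient update: it scales the return G, which in turn scales how strongly the log-probability of the taken action is increased.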
Table of Contents:
Abstract
Contents
List of Tables
List of Figures
1 Motivation
2 Related Work
3 Reinforcement Learning
3.1 Basic Concept
3.2 Policy Gradient
4 Proposed Methodology and Implementation
4.1 Data Representations
4.2 Basic Concept
4.3 Methodology
5 Experiment Results
6 Conclusion and Future Work
References