National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: 羅祐任
Author (English): Luo, You-Ren
Title: 深度強化學習之擬人化研究
Title (English): A Study of Human-Like Deep Reinforcement Learning Agents
Advisor: 吳毅成
Advisor (English): Wu, I-Chen
Committee members: 江振瑞、林宏軒、彭文孝、吳毅成
Committee members (English): Jiang, Jehn-Ruey; Lin, Hong-Xuan; Peng, Wen-Hsiao; Wu, I-Chen
Oral defense date: 2019-11-08
Degree: Master's
Institution: National Chiao Tung University
Department: Institute of Data Science and Engineering
Discipline: Computing
Field: Software Development
Thesis type: Academic thesis
Year of publication: 2019
Academic year of graduation: 108
Language: Chinese
Number of pages: 38
Keywords (Chinese): 深度學習、深度強化學習、擬人化、電腦遊戲、強化學習
Keywords (English): video game; human-like; reinforcement learning; deep learning; shaped reward
This thesis proposes two methods for measuring whether a game AI in a 3D first-person-view game behaves like a human, and one method by which deep reinforcement learning can successfully learn to behave more like a human while maintaining good performance. We observe that in 3D first-person-view games, agents trained with deep reinforcement learning easily develop un-human-like behaviours such as shaking the camera, which tends to make viewers feel dizzy while watching. We also observe that the agent spins in place in our experiments, which is likewise not a behaviour a human would adopt in that environment. Using the training environment and framework provided by Unity ML-Agents, we detect these two un-human-like behaviours from environment information and measure their severity. By dynamically adjusting the reward function, the agent retains good exploration ability early in training and learns more efficiently to eliminate un-human-like policies later in training, so that it ends up both human-like and high-performing, achieving better results in two different experiments.
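The two un-human-like behaviours are detected from environment information rather than from pixels. As a rough Python sketch of that idea only (the exact definitions are given in Section 4.1 of the thesis; the function name, the yaw-based inputs, and the tolerance parameter below are illustrative assumptions):

    import numpy as np

    def human_likeness_metrics(yaw_deltas, spin_tolerance_deg=360.0):
        """Per-episode scores for two un-human-like behaviours, computed from the
        signed per-step changes of the agent's camera yaw angle (in degrees)."""
        d = np.asarray(yaw_deltas, dtype=np.float64)
        if d.size < 2:
            return 0.0, 0.0
        # Camera shaking: fraction of consecutive steps whose yaw changes point
        # in opposite directions, i.e. rapid left/right oscillation of the view.
        sign_flips = np.sum(np.sign(d[1:]) * np.sign(d[:-1]) < 0)
        shake_score = sign_flips / (d.size - 1)
        # Spinning in place: net rotation accumulated in one direction, counted
        # in full turns beyond a tolerated amount of turning.
        net_rotation = abs(d.sum())
        spin_score = max(net_rotation - spin_tolerance_deg, 0.0) / 360.0
        return shake_score, spin_score

Higher scores mean less human-like behaviour; such scores can then feed the penalty term of a shaped reward.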
In this thesis, we propose two metrics to measure how human-like an agent is, and one method that keeps the agent performing well while making it behave like a human.
In 3D first-person games, we found that agents trained with deep reinforcement learning often exhibit un-human-like behaviour such as shaking the camera; camera shaking makes human players uncomfortable, so a human would never choose such a policy. In our experiments, we also found that the agent spins in place, which is likewise not how a human behaves. In this thesis, we propose a dynamic method to address this problem: it keeps the penalty small during the early period of training so that exploration is not hindered, and increases the penalty once the agent has explored enough. As a result, we can train a reinforcement learning agent end to end that achieves high performance while also behaving like a human.
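As a minimal sketch of what such a dynamic penalty can look like, assuming a simple linear ramp over training progress (the thesis defines its own schedule in terms of the parameters w, c and h in Section 4.2; the names and the warm-up fraction below are illustrative):

    def dynamic_penalty_weight(step, total_steps, w_max=1.0, warmup_fraction=0.5):
        # Keep the penalty off early in training so exploration is not hindered,
        # then ramp it up linearly once the agent has had time to explore.
        progress = step / float(total_steps)
        if progress < warmup_fraction:
            return 0.0
        ramp = (progress - warmup_fraction) / (1.0 - warmup_fraction)
        return w_max * min(ramp, 1.0)

    def shaped_reward(task_reward, shake_score, spin_score, step, total_steps):
        # Task reward minus a dynamically weighted penalty for un-human-like behaviour.
        w = dynamic_penalty_weight(step, total_steps)
        return task_reward - w * (shake_score + spin_score)

With a fixed penalty the same weight acts from the very first update, which is what hinders exploration; ramping the weight defers the penalty until the agent has already found a reasonable task policy.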
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iii
List of Figures v
List of Tables vi
I. Introduction and Motivation 1
II. Background 4
2.1 Reinforcement Learning 4
2.2 Markov Decision Processes 5
2.3 Actor-Critic Neural Network Architecture 6
2.4 Proximal Policy Optimization 8
2.5 Long Short-Term Memory Recurrent Neural Network Architecture 9
III. Literature Review 11
3.1 Imitation Learning 11
3.2 Biological Constraints 13
3.2.1 Perception Errors 13
3.2.2 Perception and Action Delays 14
3.2.3 Physical Fatigue 14
3.2.4 Balancing Repetition and Novelty 14
3.3 Adjusting the Reward Function 15
IV. Methodology 16
4.1 Human-Likeness Metrics 16
4.2 Dynamic Reward-Function Adjustment Method 18
V. Experimental Results 21
5.1 Experimental Scenarios and Environment Setup 21
5.2 Comparison of Fixed and Dynamic Penalty Methods 22
5.3 Effect of Different Weight Values on Training 25
5.3.1 Effect of Different w Values on Training 25
5.3.2 Effect of Different c Values on Training 27
5.3.3 Effect of Different h Values on Training 29
5.4 Comparison of Dynamic and Fixed Penalty Methods with Multiple Costs 31
5.5 Comparison of Dynamic Penalty and Fixed Bonus Methods 31
VI. Conclusion 34
6.1 Discussion 34
6.2 Future Work 35
References 36
[1] J. X. Chen, “The evolution of computing: AlphaGo,” Computing in Science & Engineering, vol. 18, no. 4, p. 4, 2016.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[3] O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. M. Czarnecki, A. Dudzik, A. Huang, P. Georgiev, R. Powell et al., “AlphaStar: Mastering the real-time strategy game StarCraft II,” DeepMind Blog, 2019.
[4] OpenAI, “OpenAI Five,” https://blog.openai.com/openai-five/, 2018.
[5] G. Lample and D. S. Chaplot, “Playing FPS games with deep reinforcement learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[6] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman et al., “Human-level performance in 3D multiplayer games with population-based reinforcement learning,” Science, vol. 364, no. 6443, pp. 859–865, 2019.
[7] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in Advances in Neural Information Processing Systems, 2017, pp. 5048–5058.
[8] R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” 2011.
[9] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[10] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neural Information Processing Systems, 2000, pp. 1008–1014.
[11] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[12] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
[13] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[14] D. A. Pomerleau, “Efficient training of artificial neural networks for autonomous navigation,” Neural Computation, vol. 3, no. 1, pp. 88–97, 1991.
[15] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances in Neural Information Processing Systems, 2016, pp. 4565–4573.
[16] N. Fujii, Y. Sato, H. Wakama, K. Kazai, and H. Katayose, “Evaluating human-like behaviors of video-game agents autonomously acquired with biological constraints,” in International Conference on Advances in Computer Entertainment Technology. Springer, 2013, pp. 61–76.
[17] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan et al., “Population based training of neural networks,” arXiv preprint arXiv:1711.09846, 2017.
[18] T. Fuchida, K. T. Aung, and A. Sakuragi, “A study of Q-learning considering negative rewards,” Artificial Life and Robotics, vol. 15, no. 3, pp. 351–354, 2010.
[19] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in AI safety,” arXiv preprint arXiv:1606.06565, 2016.
[20] A. Juliani, V.-P. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, and D. Lange, “Unity: A general platform for intelligent agents,” arXiv preprint arXiv:1809.02627, 2018.
[21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.