Graduate Student: 謝明伯
Graduate Student (English): Ming-Po Hsieh
Thesis Title: 時間差預測於智慧型多人遊戲之應用
Thesis Title (English): A Temporal-Difference Prediction Approach for Intelligent Multiplayer Games
Advisor: 虞台文
Advisor (English): Tai-Wen Yue
Degree: Master's
Institution: Tatung University
Department: Department of Computer Science and Engineering
Discipline: Engineering
Field of Study: Electrical and Computer Engineering
Document Type: Academic thesis
Year of Publication: 2008
Graduation Academic Year: 96 (2007-2008)
Language: English
Number of Pages: 63
Keywords (Chinese): 時間差網路, 強化式學習
Keywords (English): Temporal-Difference Networks, Reinforcement Learning
Abstract (translated from Chinese):
Reinforcement learning is a learning paradigm in which an agent, without expert guidance, learns in a dynamic environment by repeated trial and error, gaining experience from its mistakes. By continually exploiting what it already knows and exploring what it has not yet tried, the agent keeps updating a so-called value function and uses it as the basis for revising its behavior policy. The value function records the value of each state in the environment; as the agent gains experience, it continually updates the values of the states it has visited and uses them as a reference for later action decisions. Reinforcement learning today, however, faces two challenges. First, the value function cannot be stored as a table, because the problems we wish to solve often have an enormous number of states and the required memory is prohibitive. Second, traditional reinforcement learning algorithms are suited to environments with perfect information, which most real-world environments are not.
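To give a sense of the scale behind the first challenge, the following back-of-envelope sketch (not taken from the thesis; the numbers are purely illustrative) counts only the possible distributions of the 39 cards a Hearts player cannot see among the three opponents, before any play history is even considered:

    # Rough illustration of why a tabular value function is infeasible for Hearts:
    # the unseen-card distributions alone vastly exceed any storable table.
    from math import comb

    # Split the 39 hidden cards among three opponents (13 cards each):
    # choose 13 for one opponent, then 13 of the remaining 26.
    hidden_deals = comb(39, 13) * comb(26, 13)
    print(f"{hidden_deals:,} possible hidden-card distributions")   # about 8.45e16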
This thesis proposes a novel learning method that can be applied in imperfect-information environments. In this method, two neural networks are cascaded to serve as the learning engine. One of the networks is designed on the concept of the temporal-difference network; its task is to predict, for each candidate action, the probabilities of the events that will occur over a stretch of future time. The predictions, together with other known information, are fed into the second network for value estimation, which serves as the basis for choosing the policy. The trained agent can assess the attacks its opponents are likely to launch and strike back hard.
We use the card game Hearts (傷心小棧, also known as 西洋拱猪) as our test platform. It is a typical example of an imperfect-information game, on which traditional reinforcement learning methods perform poorly. Playing 100 games against MSHEARTS, the game program built into Microsoft Windows, our trained agent won the final victory.
Abstract (English):
Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. Rather than being given expert instruction, the agent tries actions on its own and continually experiences the information returned by the environment in order to find an effective way to accomplish its goal. By continually exploiting what it already knows and exploring what it has not yet experienced, the agent progressively revises its policy based on a value function that it builds incrementally. The value function is a mechanism by which the agent records the merit of each state in the environment, so that it can look up these values to decide which action is best in the current situation. Reinforcement learning faces two challenges. First, the value function cannot be stored in a table when the environment has an enormous number of states, because memory is insufficient. Second, traditional solution methods for reinforcement learning assume an environment with perfect information, but many real-world problems do not satisfy this assumption.
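As a concrete illustration of the value-function bookkeeping described above, here is a minimal tabular TD(0) update in Python. It is a generic textbook-style sketch, not code from the thesis; the five-state chain environment is a hypothetical stand-in.

    # Minimal tabular TD(0) sketch: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].
    # The five-state chain below is a toy stand-in, not the thesis's environment.
    from collections import defaultdict

    alpha, gamma = 0.1, 0.95           # learning rate and discount factor
    V = defaultdict(float)             # tabular value function: state -> estimate

    def td0_update(state, reward, next_state):
        td_error = reward + gamma * V[next_state] - V[state]
        V[state] += alpha * td_error
        return td_error

    for episode in range(1000):        # walk a 5-state chain; reward 1 at the end
        s = 0
        while s < 4:
            s_next, r = s + 1, (1.0 if s + 1 == 4 else 0.0)
            td0_update(s, r, s_next)
            s = s_next

    print([round(V[s], 3) for s in range(5)])   # estimates grow toward the goal

The thesis's point is that exactly this kind of table becomes unusable when the state space is as large as that of a multiplayer card game, which motivates the neural-network approximation described next.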
In this thesis, we propose a novel learning method for imperfect-information environments. In this approach, two cascaded neural networks serve as our learning engine. One network is designed based on the concept of the temporal-difference network; its task is to predict, in a probabilistic sense, the successive events that would follow a candidate action. The predictions, together with other known information, are then passed to the second network for value estimation. The best action for the current state is determined by choosing the one with the most beneficial estimated value. The trained agent is able to anticipate its opponents' attacks and mount a strong counterattack.
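The cascade can be pictured roughly as follows. This is only a forward-pass sketch under assumed layer sizes, feature encodings, and random weights; the actual Hearts engine in the thesis is trained and uses its own representation.

    # Sketch of the cascaded arrangement: a TD-network-style prediction network
    # maps (observation, candidate action) to event probabilities, and a value
    # network scores those predictions together with the known information.
    # All sizes, encodings, and weights here are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    N_OBS, N_ACT, N_PRED, HIDDEN = 52, 52, 16, 32    # assumed dimensions

    W1 = rng.normal(scale=0.1, size=(N_OBS + N_ACT, HIDDEN))   # prediction net
    W2 = rng.normal(scale=0.1, size=(HIDDEN, N_PRED))
    U1 = rng.normal(scale=0.1, size=(N_PRED + N_OBS, HIDDEN))  # value net
    U2 = rng.normal(scale=0.1, size=HIDDEN)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def predict_events(obs, action_onehot):
        """Prediction network: probabilities of future events for one action."""
        h = np.tanh(np.concatenate([obs, action_onehot]) @ W1)
        return sigmoid(h @ W2)

    def estimate_value(predictions, obs):
        """Value network: score the predictions plus the known information."""
        h = np.tanh(np.concatenate([predictions, obs]) @ U1)
        return float(h @ U2)

    def choose_action(obs, legal_actions):
        """Pick the legal action whose predicted consequences look best."""
        best, best_value = None, -np.inf
        for a in legal_actions:
            onehot = np.zeros(N_ACT)
            onehot[a] = 1.0
            value = estimate_value(predict_events(obs, onehot), obs)
            if value > best_value:
                best, best_value = a, value
        return best

    obs = rng.random(N_OBS)                          # stand-in observation
    print(choose_action(obs, legal_actions=[3, 17, 40]))

Training, which this sketch omits, would fit the prediction network with temporal-difference targets and the value network with the returns obtained from play.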
The card game "Hearts" is our test target. It is a typical example of imperfect
information games, and it is so di±cult that the traditional reinforcement learning methods can't learn well. Playing 100 games with MSHEARTS in Microsoft Windows, our well-trained agent won the championship.
1 Introduction 1
1.1 Motivation 1
1.2 Goal 2
1.3 Reinforcement Learning 3
1.4 Other Related Approaches - Game Trees 5
1.4.1 Two-Player Minimax 6
1.4.2 Paranoid Algorithm 7
1.4.3 Maxn 8
1.4.4 Conclusions 9
1.5 Hearts 9
1.6 Thesis Organization 11
2 Modelling the Decision Processes 12
2.1 Introduction 12
2.2 Markov Decision Processes 12
2.3 Value Functions 13
2.4 Partially Observable Markov Decision Processes 15
2.5 Conclusions and Discussions 18
3 Solution Methods of Reinforcement Learning Problems 20
3.1 Introduction 20
3.2 Monte Carlo Methods 20
3.3 Temporal-Difference Learning 22
3.4 Eligibility Traces 26
3.4.1 The Forward View 27
3.4.2 The Backward View 28
3.5 Conclusions and Discussions 30
4 PSRs and TD Networks 32
4.1 Introduction 32
4.2 Predictive State Representations 32
4.3 Temporal-Difference Networks 34
4.4 Conclusions and Discussions 37
5 The Architecture of Hearts Engine 39
5.1 Introduction 39
5.2 Architecture 39
5.3 Conclusions and Discussions 48
6 Experimental Results 49
6.1 Experiment 1 49
6.2 Experiment 2 50
7 Concluding Remarks 52
Appendices 53
A Equivalence of Forward and Backward Views of Eligibility Traces 54
B Constructing a PSR from a POMDP 58
Bibliography 61