臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Author: 翁星宇
Author (English): Hsing-Yu Wong
Title (Chinese): 獎勵化約法(R3D):一個新的強化學習複雜任務高效獎勵設計方法
Title (English): Reductionist Reinforcement Reward Design (R3D): A Novel Method for Efficient Reward Function Design in Complex Reinforcement Learning Tasks
Advisor: 陳慶瀚
Advisor (English): Ching-Han Chen
Degree: Master's
Institution: National Central University (國立中央大學)
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Publication year: 2024
Graduation academic year: 112 (2023–2024)
Language: Chinese
Number of pages: 75
Keywords (Chinese): 強化學習、階層式強化學習、大型語言模型、獎勵函數
Keywords (English): reinforcement learning, hierarchical reinforcement learning, LLM, reward function
The emergence of Large Language Models (LLMs) has brought many groundbreaking advances to robotic control. However, current systems that use LLMs to automatically design reward functions for reinforcement learning tasks hit a bottleneck on complex tasks and struggle to produce effective rewards, a difficulty that challenges human experts as well. We therefore propose Reductionist Reinforcement Reward Design (R3D), a reward design method aimed at complex reinforcement learning tasks, and combine it with the generative capabilities of LLMs to build the LLM-based Reward Co-design System (LLMRCS). Following a hierarchical reinforcement learning approach, R3D decomposes a complex task into multiple sub-tasks, designs a sub-task reward function for each, and combines them into an effective overall reward function. This both simplifies the reward design process and safeguards the effectiveness of rewards for complex tasks. LLMRCS builds on the earlier Eureka system: it integrates R3D into the reward generation pipeline and can incorporate expert feedback during optimization. Experimental results show that R3D significantly improves learning efficiency on complex tasks. In the FrankaCubeStack task, a reward designed with R3D reached an 80% success rate 50.9 times more efficiently than one designed with the traditional approach. LLMRCS with R3D can generate reward functions close to human-designed ones without any human intervention, and on some tasks it even surpasses human experts. Our experiments further show that R3D not only eases reward design for complex tasks and improves learning efficiency but also offers good explainability: by analyzing training results we can iteratively refine the reward function, turning R3D into a reward optimization method. Finally, we extend R3D's design principles by modifying the final reward so that the agent's behavior blends multiple strategies, demonstrating the method's broad applicability and future potential.
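To make the reward-composition idea concrete, the following is a minimal Python sketch of what R3D-style reward decomposition could look like for a task such as FrankaCubeStack: the complex task is split into sub-tasks, each sub-task gets its own reward term, and the terms are combined into one scalar reward. The sub-task set (reach, grasp, lift, stack), the gating rule, the weights, and the observation field names are illustrative assumptions made for this sketch, not the formulation actually used in the thesis.

import numpy as np

def subtask_rewards(obs):
    # One reward term per hypothetical sub-task of a cube-stacking task.
    d_reach = np.linalg.norm(obs["gripper_pos"] - obs["cube_a_pos"])
    d_stack = np.linalg.norm(obs["cube_a_pos"] - obs["cube_b_top_pos"])
    return {
        "reach": 1.0 - np.tanh(5.0 * d_reach),           # move the gripper toward cube A
        "grasp": 1.0 if obs["cube_a_grasped"] else 0.0,  # hold cube A
        "lift": float(np.clip(obs["cube_a_height"] / 0.2, 0.0, 1.0)),  # raise cube A
        "stack": 1.0 - np.tanh(5.0 * d_stack),           # bring cube A over cube B
    }

def composed_reward(obs, weights=(1.0, 1.0, 1.0, 2.0)):
    # Combine the sub-task terms into one scalar reward. Later terms only
    # contribute once the grasp sub-task is satisfied; this simple gating is
    # an illustrative choice, not the combination rule from the thesis.
    r = subtask_rewards(obs)
    w_reach, w_grasp, w_lift, w_stack = weights
    reward = w_reach * r["reach"]
    if r["grasp"] > 0.5:
        reward += w_grasp * r["grasp"] + w_lift * r["lift"] + w_stack * r["stack"]
    return reward

# Example call with hypothetical observation fields (not the Isaac Gym API):
obs = {
    "gripper_pos": np.array([0.40, 0.00, 0.30]),
    "cube_a_pos": np.array([0.45, 0.05, 0.05]),
    "cube_b_top_pos": np.array([0.60, -0.10, 0.10]),
    "cube_a_grasped": False,
    "cube_a_height": 0.05,
}
print(composed_reward(obs))

Under the same structure, re-weighting or otherwise editing the final combination of sub-task terms is one plausible way to obtain the strategy-blending behavior mentioned at the end of the abstract.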
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Research Background
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2: Literature Review
2.1 Reinforcement Learning
2.1.1 Markov Decision Processes
2.1.2 Value-based and Action-based Reinforcement Learning Methods
2.1.3 Actor-critic Methods
2.1.4 Hierarchical Reinforcement Learning
2.2 LLMs and Their Applications
2.2.1 LLMs and Task Planning
2.2.2 LLMs and Optimization Tools
2.2.3 LLM-based Reward Generation Tools for Reinforcement Learning
Chapter 3: Reward Design Method
3.1 Complex Tasks and Reward Design
3.2 Reductionist Reinforcement Reward Design (R3D)
Chapter 4: LLM-based Reward Co-design System
4.1 System Architecture
4.2 Implementation of R3D
4.3 Optimization Process Incorporating Expert Knowledge
Chapter 5: Experiments
5.1 Direct Application of R3D
5.1.1 Experimental Environment
5.1.2 Experimental Results
5.1.3 Properties of R3D
5.2 Reward Generation with LLMRCS
5.2.1 Experimental Environment
5.2.2 Experimental Results
5.3 Extended Applications of R3D
Chapter 6: Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
Appendix
Appendix 1: LLMRCS Prompts Designed for R3D
R3D_code_output_tip
R3D_code_feedback
R3D_policy_feedback
[1] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, "ProgPrompt: Generating situated robot task plans using large language models," in 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11523-11530, 2023.
[2] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, "Voyager: An open-ended embodied agent with large language models," arXiv preprint arXiv:2305.16291, 2023.
[3] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-T. L. Chiang, T. Erez, L. Hasenclever, and J. Humplik, "Language to rewards for robotic skill synthesis," arXiv preprint arXiv:2306.08647, 2023.
[4] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, "Eureka: Human-level reward design via coding large language models," arXiv preprint arXiv:2310.12931, 2023.
[5] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen, "Large language models as optimizers," arXiv preprint arXiv:2309.03409, 2023.
[6] S. Booth, W. B. Knox, J. Shah, S. Niekum, P. Stone, and A. Allievi, "The perils of trial-and-error reward design: misdesign through overfitting and invalid task specifications," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 5, pp. 5920-5929, 2023.
[7] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, and A. Handa, "Isaac Gym: High performance GPU-based physics simulation for robot learning," arXiv preprint arXiv:2108.10470, 2021.
[8] S. Pateria, B. Subagdja, A.-h. Tan, and C. Quek, "Hierarchical reinforcement learning: A comprehensive survey," ACM Computing Surveys (CSUR), vol. 54, no. 5, pp. 1-35, 2021.
[9] M. L. Puterman, "Markov decision processes," Handbooks in operations research and management science, vol. 2, pp. 331-434, 1990.
[10] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double q-learning," in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016.
[11] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in neural information processing systems, vol. 12, 1999.
[12] S. Gronauer and K. Diepold, "Multi-agent deep reinforcement learning: a survey," Artificial Intelligence Review, pp. 1-49, 2022.
[13] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, and P. Abbeel, "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[14] G. Kwon, B. Kim, and N. K. Kwon, "Reinforcement Learning with Task Decomposition and Task-Specific Reward System for Automation of High-Level Tasks," Biomimetics, vol. 9, no. 4, p. 196, 2024.
[15] R. T. Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, "Reward machines: Exploiting reward function structure in reinforcement learning," Journal of Artificial Intelligence Research, vol. 73, pp. 173-208, 2022.
[16] Z. Juozapaitis, A. Koul, A. Fern, M. Erwig, and F. Doshi-Velez, "Explainable reinforcement learning via reward decomposition," in IJCAI/ECAI Workshop on explainable artificial intelligence, 2019.
[17] Y. Septon, T. Huber, E. André, and O. Amir, "Integrating policy summaries with reward decomposition for explaining reinforcement learning agents," in International Conference on Practical Applications of Agents and Multi-Agent Systems, pp. 320-332, 2023.
[18] C.-H. Chen, M.-Y. Lin, and X.-C. Guo, "High-level modeling and synthesis of smart sensor networks for Industrial Internet of Things," Computers & Electrical Engineering, vol. 61, pp. 48-66, 2017.
[19] S. Gronauer and K. Diepold, "Multi-agent deep reinforcement learning: a survey," Artificial Intelligence Review, vol. 55, no. 2, pp. 895-943, 2022.
Electronic full text (publicly available online from 2029-07-22)