Game On: Towards Language Models as RL Experimenters
Abstract
The paper proposes an agent architecture that automates parts of the common reinforcement learning (RL) experiment workflow, enabling automated mastery of control domains for embodied agents. It leverages a large vision-language model (VLM) to perform capabilities normally required of a human experimenter, including:
- Monitoring and analyzing experiment progress
- Proposing new tasks based on past successes and failures
- Decomposing tasks into a sequence of subtasks (skills)
- Retrieving the skill to execute
This allows the system to build automated curricula for learning. The authors provide a first prototype of this system and examine the feasibility of current models and techniques for the desired level of automation.
Q&A
[01] Task Proposition, Decomposition, and Skill Retrieval
1. What are the key components of the proposed agent architecture? The key components are:
- The curriculum module, which performs high-level reasoning to guide the learning process via an auto-curriculum
- The embodiment module, which maintains a skill library and executes skills assigned by the curriculum
- The analysis module, which monitors the training progress of skills and reports learning status
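The interaction between these three modules can be sketched as a minimal control loop. This is a hypothetical structure assembled from the description above (the paper does not publish code): `SkillLibrary`, `experimenter_step`, and the injected callables are illustrative names, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Skills available to the curriculum module; grows as the analysis
    module declares skills converged. Structure is illustrative."""
    skills: dict = field(default_factory=dict)

    def add(self, name, policy):
        self.skills[name] = policy

    def names(self):
        return sorted(self.skills)


def experimenter_step(library, curriculum, embodiment, analysis):
    """One iteration of the loop: the curriculum proposes a skill sequence
    from the current library, the embodiment executes it, and the analysis
    decides which newly learned skills join the library."""
    sequence = curriculum(library.names())
    outcome = embodiment(sequence)
    for name, policy in analysis(outcome):
        library.add(name, policy)
    return outcome
```

In the paper the curriculum and analysis roles are played by prompted VLM calls; here they are plain callables so the data flow between modules is visible in isolation.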
2. How does the curriculum module generate an auto-curriculum? The curriculum module prompts the Gemini VLM to perform the following:
- Task proposition: Propose new high-level tasks for the agent to accomplish
- Task decomposition: Break down high-level tasks into sequences of sub-tasks/skills
- Skill retrieval: Match the decomposed sub-tasks to the skills available in the agent's skill library
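A rough sketch of how these three calls might be assembled, under the assumption that each is a text prompt to the VLM (the paper's actual prompts are not reproduced here, and exact-string matching below is only a placeholder for VLM-based retrieval):

```python
def build_curriculum_prompt(skill_names, history):
    """Assemble a task-proposition/decomposition prompt for the VLM.
    The wording is a hypothetical stand-in for the paper's prompts."""
    return (
        "Available skills: " + ", ".join(skill_names) + "\n"
        "Recent outcomes: " + "; ".join(history) + "\n"
        "Propose one new task the agent should attempt, decompose it into "
        "sub-tasks, and map each sub-task to an available skill."
    )


def retrieve_skills(subtasks, skill_names):
    """Match decomposed sub-tasks to skills in the library. Exact-string
    matching here stands in for the VLM-based retrieval step."""
    available = set(skill_names)
    return [s for s in subtasks if s in available]
```

Keeping retrieval as a separate step matters: the VLM may decompose a task into sub-tasks the agent cannot yet execute, and those must be filtered against the current library before execution.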
3. How does the embodiment module interact with the curriculum module? The embodiment module:
- Executes the skill sequences proposed by the curriculum module
- Judges the success of the executed skill sequences
- Collects the executed episodes into a dataset
- Triggers policy learning iterations to fine-tune the agent's low-level control policy
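The four responsibilities above can be sketched as a single episode-collection function. All callables are injected stubs labeled as assumptions; in the system they correspond to the low-level policy, a success judge, and the fine-tuning procedure, none of which are specified as code in the paper.

```python
def run_embodiment(sequence, execute, judge, dataset, finetune, batch_size=2):
    """Execute a skill sequence, judge its success, collect the episode
    into the dataset, and trigger a fine-tuning iteration once a batch of
    episodes has accumulated. Hypothetical sketch of the embodiment loop."""
    episode = [execute(skill) for skill in sequence]
    success = judge(episode)
    dataset.append({"episode": episode, "success": success})
    if len(dataset) % batch_size == 0:
        finetune(dataset)
    return success
```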
4. What is the role of the analysis module? The analysis module examines the learning progress of skills by few-shot prompting the Gemini VLM. It judges which skills have converged and adds them to the agent's skill library, allowing the curriculum module to use them for more complex task decomposition.
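As a concrete illustration of the judgment the analysis module makes, here is a purely numeric stand-in: a moving-average plateau test on episode returns. This is not the paper's method (which few-shot prompts the VLM, e.g. on learning-curve plots); it only shows the kind of signal being assessed.

```python
def has_converged(returns, window=5, tol=0.01):
    """Numeric proxy for the VLM convergence judgment: declare a skill
    converged when the moving average of its returns plateaus, i.e. the
    last two windows differ by at most `tol`. Thresholds are illustrative."""
    if len(returns) < 2 * window:
        return False  # not enough data to compare two windows
    recent = sum(returns[-window:]) / window
    earlier = sum(returns[-2 * window:-window]) / window
    return abs(recent - earlier) <= tol
```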
[02] Experimental Results
1. How did the authors evaluate the proposed system? The authors evaluated the system on a simulated robotic block stacking task. They:
- Trained a base policy with a pre-existing dataset
- Used the curriculum module to collect new diverse data
- Fine-tuned the base policy using the new data
- Analyzed the VLM's ability to judge the convergence of skill learning
2. What were the key findings from the experiments?
- The data collected by the curriculum module was more diverse than the pre-existing dataset
- Using the diverse data improved the performance of the fine-tuned policy, especially on more complex skills
- The VLM was able to reasonably judge the convergence of skill learning, though with some errors
3. How did the authors demonstrate the curriculum module's ability to handle progressive skill addition? The authors examined the proposals and decompositions generated by the curriculum module at different stages of the pre-training process. They showed that as more skills became available, the complexity of the proposed tasks and decompositions increased accordingly.