TL;DR: AURA uses RAG retrieval and a YAML schema with static verification to iteratively generate rewards, domain randomizations, and training configs for curriculum RL.
Designing reinforcement learning curricula for agile robots traditionally requires extensive manual tuning of reward functions, environment randomizations, and training configurations. We introduce AURA (Autonomous Upskilling with Retrieval-Augmented Agents), a schema-validated curriculum reinforcement learning (RL) framework that leverages Large Language Models (LLMs) as autonomous designers of multi-stage curricula. AURA transforms user prompts into YAML workflows that encode full reward functions, domain randomization strategies, and training configurations. All files are statically validated before any GPU time is used, ensuring efficient and reliable execution. A retrieval-augmented feedback loop allows specialized LLM agents to design, execute, and refine curriculum stages based on prior training results stored in a vector database, enabling continual improvement over time. Quantitative experiments show that AURA consistently outperforms LLM-guided baselines in generation success rate, humanoid locomotion, and manipulation tasks. Ablation studies highlight the importance of schema validation and retrieval for curriculum quality. AURA successfully trains end-to-end policies directly from user prompts and deploys them zero-shot on a custom humanoid robot in multiple environments - capabilities that did not exist previously with manually designed controllers. By abstracting the complexity of curriculum design, AURA enables scalable and adaptive policy learning pipelines that would be complex to construct by hand.
Hardware deployment rollouts showing zero-shot transfer from simulation to real robot.
AURA enables prompt-to-policy deployment through specialized LLM agents. A High-Level Planner queries past experiences from a vector database to design a multi-stage workflow, which Stage-Level LLMs expand into schema-validated YAML files encoding rewards, randomizations, and training configurations. After GPU-accelerated training using MuJoCo-MJX, user feedback on deployment rollouts is attached to the curriculum and embedded into the VDB, enabling iterative improvement across tasks and embodiments.
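For intuition, the retrieval step can be sketched in a few lines. This is a simplified illustration rather than AURA's implementation: the hashing-based `embed` function and the record layout (run artifacts plus evaluation text) are stand-ins for the actual embedding model and VDB schema.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy hashing-based bag-of-words embedding; a real system would use a
    # learned text-embedding model here.
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve_similar_runs(query: str, vdb: list, k: int = 3) -> list:
    # Each VDB record is assumed to hold the run's YAML artifacts plus its
    # attached evaluation text under the key "evaluation".
    q = embed(query)
    sims = [float(q @ embed(record["evaluation"])) for record in vdb]  # cosine similarity
    top = np.argsort(sims)[::-1][:k]
    return [vdb[i] for i in top]
```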
Survival and linear-velocity tracking scores across iterations, used to evaluate locomotion policy quality on a custom humanoid. The plots show AURA's policy-quality improvements over five iterations compared to MuJoCo Playground's expert-designed rewards. AURA Blind generates rewards from scratch (the VDB is initialized empty), while AURA Tune modifies and improves an existing reward designed for another embodiment (the VDB is initialized with MuJoCo Playground's expert-designed Berkeley Humanoid rewards, domain randomizations, and training configuration).
Policy evaluation across metrics. Episode survival length and linear-velocity tracking are used to evaluate a velocity-command-following task on the Berkeley Humanoid. The cube-pushing success rate is evaluated on the UR5e environment. *CurricuLLM's Berkeley Humanoid and Fetch-Push results are as reported in their paper. **MuJoCo Playground's cube-pushing success is reported using Franka Emika Panda rewards on the UR5e embodiment, which is not expected to succeed. AURA adapts the Franka expert reward and training configuration into an effective curriculum for the UR5e.
Training-launch success rate for AURA and its ablated variants. All evaluations are conducted with GPT-4.1, as the original models used in the baselines were deprecated at the time of assessment. *CurricuLLM is evaluated on generating rewards for Berkeley Humanoid locomotion. **Eureka's 12% is its training-launch success rate on the ANYmal task, which is closest in complexity to humanoid robot tasks. Eureka's training-launch success rate across all embodiments available in their examples is 49%, with simpler tasks generating more successfully.
The learning curve above shows the training convergence of each framework's policies.
We train the control policy with PPO, which constrains the policy update to prevent destructive parameter jumps. Let \[ r_t(\boldsymbol{\theta}) \;=\; \frac{\pi_{\boldsymbol{\theta}}\!\bigl(a_t \mid s_t\bigr)} {\pi_{\boldsymbol{\theta}_{\text{old}}}\!\bigl(a_t \mid s_t\bigr)} \] denote the probability ratio between the new and old policies, and let \(A_t\) be the generalized advantage estimate at timestep \(t\). The clipped surrogate objective is \[ L^{\text{CLIP}}(\boldsymbol{\theta}) \;=\; \mathbb{E}_t\!\left[ \min\!\Bigl( r_t(\boldsymbol{\theta})\,A_t,\; \text{clip}\!\bigl(r_t(\boldsymbol{\theta}),\,1-\epsilon,\,1+\epsilon\bigr)\,A_t \Bigr) \right], \] where \(\epsilon\) is the clipping parameter (we use \(\epsilon=0.2\) unless otherwise noted). During training, we maximize \(L^{\text{CLIP}}\) with Adam, using entropy regularization for exploration and a value–function loss with coefficient \(c_v\).
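For reference, the clipped surrogate above can be written directly in JAX. The snippet below is a didactic sketch of the loss term only (policy/value networks, GAE computation, and the entropy and value-loss terms are omitted), not the training code used in the experiments.

```python
import jax.numpy as jp

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    # r_t(theta): probability ratio between the new and old policies.
    ratio = jp.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = jp.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Negate so that minimizing this loss with Adam maximizes L^CLIP.
    return -jp.mean(jp.minimum(unclipped, clipped))
```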
During one evaluation run, we launch \(N=1024\) parallel environments and roll each for a horizon of \(T_{\max}=3000\) simulation steps (or until failure). All three metrics are normalized to \([0,1]\) for direct comparability.
Let \(T_i\in[1,T_{\max}]\) denote the number of steps survived by environment \(i\). The Survival Score is the mean fractional episode length \[ \mathcal{S}_{\text{surv}} \;=\; \frac{1}{N}\sum_{i=1}^N\frac{T_i}{T_{\max}} . \]
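In code, this is simply a normalized mean over episode lengths (a sketch under the definitions above):

```python
import jax.numpy as jp

def survival_score(episode_lengths, t_max=3000):
    # episode_lengths: shape (N,) array of steps survived per environment, T_i in [1, T_max].
    return jp.mean(episode_lengths / t_max)
```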
For step \(t\) in environment \(i\), let \(\mathbf{v}^{\mathrm{cmd}}_{t,i}\in\mathbb{R}^2\) be the commanded planar velocity and \(\mathbf{v}^{\mathrm{loc}}_{t,i}\in\mathbb{R}^2\) the robot's actual planar COM velocity expressed in the local frame. Define the squared tracking error \[ e_{t,i} \;=\; \bigl\lVert \mathbf{v}^{\mathrm{cmd}}_{t,i}-\mathbf{v}^{\mathrm{loc}}_{t,i} \bigr\rVert_2^{\,2}. \]
The YAML `exponential_decay` entry with \(\sigma=0.1\) corresponds to the per-step reward \[ r^{\text{lin}}_{t,i} \;=\; \exp\!\bigl(-e_{t,i}/(2\sigma^{2})\bigr), \qquad \sigma=0.1 . \]
Aggregating over time and averaging across the batch yields the normalized Linear-Velocity Tracking Score \[ \mathcal{S}_{\text{lin}} \;=\; \frac{1}{N}\sum_{i=1}^{N}\frac{1}{T_{\max}} \sum_{t=1}^{T_i} r^{\text{lin}}_{t,i}. \]
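The same computation in JAX, assuming for this sketch that commanded and actual planar velocities are logged as dense \((N, T_{\max}, 2)\) arrays with a per-step alive mask that is 1 for \(t \le T_i\) and 0 afterwards:

```python
import jax.numpy as jp

def linear_velocity_tracking_score(v_cmd, v_loc, alive_mask, t_max=3000, sigma=0.1):
    # Squared planar tracking error e_{t,i}, shape (N, T_max).
    err_sq = jp.sum((v_cmd - v_loc) ** 2, axis=-1)
    # Per-step exponential-decay reward r^lin_{t,i}.
    r_lin = jp.exp(-err_sq / (2.0 * sigma ** 2))
    # Zero out steps after failure, sum over time, normalize by T_max, average over envs.
    return jp.mean(jp.sum(r_lin * alive_mask, axis=-1) / t_max)
```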
The triplet \(\bigl(\mathcal{S}_{\text{surv}},\mathcal{S}_{\text{lin}},\mathcal{S}_{\text{air}}\bigr)\) captures robustness (survival), command‑following fidelity, and gait coordination, respectively, and forms the basis of all quantitative comparisons in the main paper.
These five user prompts were used as task inputs to AURA for evaluating its curriculum training-launch success rate across the custom humanoid and Berkeley Humanoid embodiments. For baseline comparisons, we used the prompts provided in each baseline's open-source repository to generate their respective training files.
I want to use a staged approach to train a humanoid robot for advanced locomotion with external perturbations. I want to deploy this policy onto hardware and walk outside, and I want the steps to be even (between the right and left legs) and smooth. I want the walking to be very robust to both uneven terrain and perturbations. I want the training to have equal capabilities for forward and backwards walking, with an absolute max speed being 0.5m/s and max yaw being 0.5rad/s. You must generate AT LEAST 2 stages for this task.
I want to use a multi-stage curriculum to train a humanoid robot to walk over irregular terrain, such as small rocks and low barriers. The robot must learn to lift its feet high enough to avoid tripping and maintain a steady gait while stepping over obstacles of varying height (up to 0.02m). The policy should be deployable outdoors and remain balanced when landing on slightly angled or unstable surfaces. I want both forward and backward walking to be supported, with even step timing and foot clearance. You must generate AT LEAST 2 stages for this task.
I want to use a staged approach to train a humanoid to jump onto and off of elevated platforms. The policy should support both single jumps (from ground to platform) and double-jumps (from one platform to another). The target jump height is 0.05m with target air time of 0.5s. I want landing posture and knee angle to remain stable, and I want the robot to absorb impacts smoothly. The final policy should transfer to hardware and be tested over rigid and slightly deformable platforms. You must generate AT LEAST 2 stages for this task.
I want to train a humanoid robot to perform continuous forward-backward hopping in a rhythmic, energy-efficient manner. The hops should alternate directions every few steps and maintain even left-right force distribution. I want the robot to be robust to mild perturbations during flight and landing. Deployment should be feasible on a physical robot outdoors, with stability maintained on moderately uneven terrain. Hop height should range between 0.05-0.1 meters with a frequency of ~1.5 Hz. You must generate AT LEAST 2 stages for this task.
I want to use a staged curriculum to train a humanoid robot to perform lateral (sideways) walking in both directions. The walking should be smooth and balanced, with equal step distances between the left and right legs. The policy should be robust to minor terrain irregularities and moderate lateral perturbations. Maximum lateral velocity should be capped at +-0.3 m/s, and yaw rotation should be minimized during side-stepping. I want the final policy to be deployable on hardware and capable of sustained lateral walking in outdoor environments. You must generate AT LEAST 2 stages for this task.
In the following section, we provide a more detailed description of the environments used to evaluate the CurricuLLM and Eureka training-launch success rates. All environments were evaluated over 100 training launches.
CurricuLLM's training-launch success rate was evaluated on their Berkeley Humanoid environment, with all environment scripts and tasks taken directly from their open-source repository.
Eureka's training-launch success rate was evaluated on their only legged robot environment, ANYmal, across 100 training-launch attempts, successfully launching training 12 times. We also tested training-launch success across all embodiments provided in their open-source repository, which include the Gymnasium environments Shadow Hand, Franka Emika Panda, Ant, Humanoid, and Cartpole, as well as other robot environments such as ANYmal, Allegro, Ball Balance, and Quadcopter.
The prompts below define the instruction templates used by AURA's LLM agents throughout the curriculum generation and training process. These templates structure how input data, such as user task prompts, stage descriptions, and past training artifacts, are transformed into schema-compliant outputs like workflows, configuration files, and reward functions. Placeholders in the form of <INSERT_...> denote dynamic inputs that are automatically populated by the AURA framework at runtime with the appropriate content, ensuring that each prompt remains generalizable while retaining full contextual relevance for the target task.
You are an expert in reinforcement learning, CI/CD pipelines, GitHub Actions workflows, and curriculum design. Your task is to generate a complete curriculum—with clearly defined steps—for a multi-stage (staged curriculum learning) training process for a humanoid robot in simulation.
**MOST IMPORTANT DIRECTIONS:**
**<INSERT_TASK_PROMPT_HERE>**
Determine the number of stages based on the complexity of the task described.
For simpler tasks, 1-2 stages are enough; for more complex tasks, 3-4 or more stages are possible.
You are to generate as follows:
1. A GitHub Actions workflow file that orchestrates the entire staged training process.
2. For each training stage that appears in the workflow, a separate text file containing in-depth, rich details that describe the objectives, expected changes to reward files, configuration file modifications, and overall curricular rationale.
-------------------------------------------------
**Important Guidelines:**
- IMPORTANT! The example baseline content parameters come from a vector database entry for a past training workflow that is similar to the current task. Update them to better fit the user task.
- IMPORTANT! Here is the EXPERT evaluation of the example training workflow results: <INSERT_EVALUATION_HERE>.
- Use these evaluations by the EXPERT to update the parameters.
- **Workflow File Generation:**
Generate a workflow file named `generated_workflow.yaml` that orchestrates the staged training process. This file must:
- Begin with the baseline structure provided via the placeholder `<INSERT_WORKFLOW_YAML_HERE>`, and then be modified to support multiple training stages with explicit jobs for training and feedback (if feedback is needed).
- **CRITICAL JOB NAMING CONVENTION:** All training job names MUST end with `_training` (not `_training_stage1` or similar); this is for parsing later.
For Multi-Stage, you should call it *_stage1_training, *_stage2_training, etc.
This naming convention is required for the downstream processing system to correctly identify training jobs.
- Use the example reward file (provided via `<INSERT_REWARD_YAML_HERE>`) and example config file (provided via `<INSERT_CONFIG_YAML_HERE>`) as context. These files define all the reward function keys, their evaluations, scales, default values, and all configuration parameters.
- Use the example randomize file (provided via `<INSERT_RANDOMIZE_YAML_HERE>`) as context to understand the domain randomization parameters for the scene.
- Remember the workflow should use the same generated config, reward, and randomize that it defines in the stage description for training.
- **Detailed Stage Description Files:**
In addition to the workflow file, for every stage that is generated in the workflow, you must output a corresponding text file (named `generated_stage{X}_details.txt`, where `{X}` is the stage number) that contains very rich, in-depth details. Each of these text files must include:
- A clear description of the training objective for that stage.
- Precise descriptions of modifications to the config environment disturbances (e.g. obs noise, kicks, imu disturbances, etc.)
- Explicit details on the expected modifications to the reward file (e.g., which reward terms should be scaled up, which should be disabled by setting to 0.0, and any new reward terms added to support the stage's goals).
- A precise explanation of the configuration changes (e.g., adjustments in learning rate, entropy, discounting, batch sizes, and observation disturbances) along with the rationale behind these changes.
- A discussion on the terrain/environment context and how it affects the training process.
- Which gait to use (only "walk" and "jump" gaits are available for now)
- How this stage fits into the overall curriculum progression
- Specify whether the stage is starting fresh (resume_from_checkpoint=false) or continuing from a previously trained checkpoint (resume_from_checkpoint=true)
-------------------------------------------------
**Additional Context:**
- The robot and training environment context is provided via <INSERT_ROBOT_DESCRIPTION_HERE>, and is used to understand the general features and capabilities of the robot and the environment it is trained in.
- The generated workflow file must include explicit, detailed inline comments that clearly describe what is expected for each stage's reward and configuration modifications.
- **IMPORTANT! The generated config, reward, and randomization YAMLs by the downstream generators are in the format generated_config_stage{X}.yaml, generated_reward_stage{X}.yaml, generated_randomize_stage{X}.yaml, so make the generated workflow reflect that.**
- Do not output any text or explanation outside of the YAML blocks.
- IMPORTANT! The number of stages in the workflow and the stage descriptions generated should match exactly
-------------------------------------------------
**Remember:** The inline comments in each file should be very explicit about what each stage is intended to achieve, including detailed explanations of parameter modifications, terrain conditions, and overall training objectives. Do not include any extra text outside these YAML blocks.
--------------------------------------------------
**Output Format:**
**THIS IS THE MOST IMPORTANT PART! YOU MUST FOLLOW THESE DIRECTIONS EXACTLY! Return your output as separate YAML blocks (and nothing else) in the following format:**
file_name: "generated_workflow.yaml"
file_path: "../workflows/generated_workflow.yaml"
content: |
# [The updated workflow.yaml content here with inline comments]
file_name: "generated_stage1_details.txt"
file_path: "../prompts/tmp/generated_stage1_details.txt"
content: |
# [The in-depth description for Stage 1 with explicit detail on training objectives, reward configuration adjustments, and other parameters]
The number of stage-description files generated depends on how many training stages are defined in the generated workflow.
-------------------------------------------------
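For illustration, output in this file_name / file_path / content format could be split into files with a parser along the following lines. This is a simplified sketch, not AURA's post-processing code, and it assumes each content block is indented by two spaces under `content: |`.

```python
import re

def split_generated_files(llm_output: str, indent: str = "  ") -> dict:
    # Maps each declared file_path to the de-indented content that follows it.
    files, path, buf = {}, None, []
    for line in llm_output.splitlines():
        match = re.match(r'file_path:\s*"(.+)"', line.strip())
        if match:
            if path is not None:
                files[path] = "\n".join(buf)
            path, buf = match.group(1), []
        elif path is not None and line.startswith(indent):
            buf.append(line[len(indent):])
    if path is not None:
        files[path] = "\n".join(buf)
    return files
```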
These prompts are used within the RAG Block to query, retrieve, and select suitable past workflows and training files to be used as context for the current task.
You are an expert in reinforcement learning for robotics and curriculum design. You are given a comprehensive task description for training a robotic system using staged curriculum learning.
Your job is to generate a **concise, precise vector database query** that captures the essential characteristics needed to retrieve the most relevant training configurations, reward functions, and workflows from past successful runs.
## Context Information:
**Robot Platform:** <INSERT_ROBOT_DESCRIPTION_HERE>
**Task Description:** <INSERT_TASK_PROMPT_HERE>
**Available Reward Variables:** <INSERT_REWARD_VARS_HERE>
## Query Generation Guidelines:
Focus your query on the most discriminative aspects that would help retrieve similar training scenarios:
1. **Primary Task Type**: What is the core manipulation/locomotion/control objective?
2. **Robot Configuration**: Key physical constraints or capabilities that matter for this task
3. **Training Complexity**: Curriculum stages, difficulty progression, multi-stage vs single-stage
4. **Performance Requirements**: Speed, precision, robustness, specific behavioral constraints
5. **Environmental Factors**: Terrain, obstacles, perturbations, randomization needs
6. **Training Scale**: Total timesteps, evaluation frequency, computational requirements
## Output Requirements:
- Generate a **single natural language query** (2-4 sentences maximum)
- Include **key discriminative terms** that distinguish this task from others
- Avoid code, file names, or technical implementation specifics
- Prioritize **behavioral objectives** and **training characteristics**
Query:
You are an expert in reinforcement learning and robot curriculum design.
You are provided with:
- A high-level training task.
- A collection of YAML files from various past runs (each associated with a run ID). These include multiple stages of `reward`, `config`, and `randomize` YAMLs, as well as one or more `workflow` YAMLs per run.
Your job is to **select only the files needed** to best support the new multi-stage workflow generation. You should pick:
- Exactly one `workflow.yaml` file.
- One `reward.yaml`, `config.yaml`, and `randomize.yaml` file for **each** stage of the task (e.g., stage 1–3).
Output a JSON object mapping descriptive keys (e.g., `workflow`, `reward_stage1`, etc.) to the exact filenames you want to keep (from the file list provided). All other files will be deleted.
------------------------
## TASK DESCRIPTION
**<INSERT_TASK_PROMPT_HERE>**
------------------------
## HERE ARE THE EVALUATIONS FOR THE CANDIDATE RUNS ##
<INSERT_EVALUATIONS_HERE>
**Pick the runs whose evaluations look the most promising for the task.**
------------------------
## AVAILABLE FILES (filename → truncated preview)
<INSERT_EXAMPLES_HERE>
------------------------
Your output must be in the following format:
```json
{
"workflow": "run123_workflow.yaml",
"reward_stage1": "run123_reward_stage1.yaml",
"config_stage1": "run123_config_stage1.yaml",
"randomize_stage1": "run123_randomize_stage1.yaml",
}
```
**If the run you are choosing has more stages, also put them into the output json block.**
Do NOT include any other text outside the fenced JSON block.
You are an expert in reinforcement learning, CI/CD pipelines, GitHub Actions workflows, and curriculum design. Your task is to generate the configuration files for a specific stage of a multi-stage (staged curriculum learning) training process for a humanoid robot in simulation. For this prompt, you are generating files for Stage {X} (for example, Stage 1, Stage 2, etc.). Use the inline comments from the workflow file (provided in <INSERT_WORKFLOW_YAML_HERE>) for Stage {X} to guide your modifications.
<INSERT_TASK_PROMPT_HERE>
-------------------------------------------------
**THESE ARE THE MOST IMPORTANT DIRECTIONS TO FOLLOW! Stage {X} Description:**
<INSERT_STAGE_DESCRIPTION_HERE>
-------------------------------------------------
**All reward, config, and randomize example files are selected from a vector database. Each example has been selected by a higher-level LLM because its parameters fit the task.**
**If you believe the provided files from the database are sufficiently good for the training of the given stage, you may choose to use the same parameters**
For the reward file:
- Use the content from <INSERT_REWARD_YAML_HERE> as a starting point.
- **Preserve all reward function keys** from the baseline. You must include every reward term from the original file in this stage's reward file.
- You may adjust the scalar weight (value) of each reward term to suit the stage. For keys that are not relevant at this stage, it is acceptable to set their values to 0.0 (effectively disabling them) while preserving the structure.
- If additional reward terms are necessary to support the stage goal, you may add new ones. New terms must use only the allowed function types listed below.
- **Allowed Functions (only use these functions when building reward expressions):**
[Full list of reward function types available - see paper appendix for complete details]
- The top-level key in the reward file must be `reward:`.
- Add inline comments next to any changes, new reward terms, or adjustments explaining how they support the stage goal.
- Remember that these reward functions are in jax, so use jp functions if needed.
- REMEMBER! Velocity tracking doubles as a survival reward, so rewarding accurate tracking of a zero-velocity command is also essential.
**Context of variables provided for reward functions based on the environment:**
<INSERT_REWARD_VARS_HERE>
- **Only the variables defined here can be used in the reward function calculations!**
**Here is an in-depth example and explanation for creating reward functions:**
<INSERT_REWARD_EXAMPLE_HERE>
**These examples show how to build our YAML-based reward functions and their Python equivalents. You must follow this style closely when generating new rewards.**
Notes:
- Input values that set a variable in a reward function (e.g. lift_thresh: "0.2") must be strings. In the lift_thresh example, even though the value is conceptually the float 0.2, it must still be written as a string.
- MAKE SURE to understand the variables, their types, and their shapes when using them in reward functions.
- IN PARTICULAR, double-check that vector shapes match so there are no errors!
- **IMPORTANT! REMEMBER! THIS IS JAX! You must use JAX functions, conditionals, arrays, etc. for everything you generate in the reward function.**
- Expressions like `vector: "rot_up - [0, 0, 1]"` are not valid; you must use `"rot_up - jp.array([0.0, 0.0, 1.0])"`.
- USE VECTOR OPERATIONS! In JAX, you cannot use Python conditionals such as `and`/`or`; use JAX operations such as `jp.where`, or bitwise operators such as `&` and `|`.
- For reward functions that would have a 0 scaling, don't even add them to the reward.yaml.
- **You do not need to calculate a total reward! Just define individual reward functions, aggregating the rewards is handled elsewhere.**
-------------------------------------------------
For the configuration file:
- **Make sure to follow the exact structure of the config file, including the top-level keys. One of the top level classes is `environment:`, don't forget it!**
- Use the current baseline content from <INSERT_CONFIG_YAML_HERE> as a reference for the structure and expected parameters.
- **Keep the structure of the config file.**
- **Do not add or remove any parameters**; only adjust the values to support the staged curriculum learning process.
- Adjust trainer parameters (e.g., learning rate, batch size, etc.) to support training stability as the curriculum advances.
- Keep `num_envs` at 8192.
- **resume_from_checkpoint:** This should be set to false if a new staged training is being done. If continuing from a previous checkpoint, this flag should be set to true.
- refer to the stage description for which option to choose.
- Ensure that `batch_size` and `num_envs` remain powers of 2.
- The network parameters must remain consistent across all stages.
- Choose `scene_file` from the options (flat_scene, height_scene) according to the stage difficulty.
- Add inline comments next to any parameter changes explaining your reasoning.
- Any history size parameters need to persist for all stages. Don't change them for now.
- The `randomize_config_path` should be the corresponding generated randomize yaml file path for each stage.
- The training timesteps for each stage should not be less than 100_000_000.
-------------------------------------------------
**Baseline Randomization File for Context:**
<INSERT_RANDOMIZE_YAML_HERE>
-------------------------------------------------
**Instructions for the randomize.yaml file:**
- **Preserve the overall file structure and all keys from the baseline randomization file.**
- The randomize file contains parameters for domain randomization, including:
- **geom_friction:** Randomizes friction properties for all geometry objects.
- **actuator_gainprm:** Randomizes actuator gain parameters.
- **actuator_biasprm:** Randomizes actuator bias parameters.
- **body_ipos:** Randomizes body initial positions (center of mass positions).
- **geom_pos:** Randomizes the positions of specific geometries.
- **body_mass:** Randomizes the body mass (scaling factors or additional offsets).
- **hfield_data:** Randomizes the heightfield (terrain) data values.
- Read the inline comments for each parameter in the randomize.yaml example for details on what the parameters do.
- You may adjust the numeric parameter values (e.g., the min and max values in the uniform distributions) to better support the training stage's objectives while keeping the structure intact.
- The randomizations are predefined, don't add your own new terms or remove existing terms!
- For parameters that are modified, add inline comments next to the changed values explaining how the adjustments support the stage goals (e.g., improved robustness to disturbances, adapting friction to better simulate challenging terrain, etc.).
- The top-level key of the file must remain "randomization:".
- **The randomization file path must match the one specified in the config yaml.**
-------------------------------------------------
**Output Format:**
**THIS IS THE MOST IMPORTANT PART! YOU MUST FOLLOW THESE DIRECTIONS EXACTLY! Return your output as separate YAML blocks (and nothing else) in the following format:**
file_name: "generated_reward_stage{X}.yaml"
file_path: "../rewards/generated_reward_stage{X}.yaml"
content: |
# [The complete reward file for Stage {X} with inline comments]
file_name: "generated_config_stage{X}.yaml"
file_path: "../configs/generated_config_stage{X}.yaml"
content: |
# [The complete configuration file for Stage {X} with inline comments]
file_name: "generated_randomize_stage{X}.yaml"
file_path: "../randomize/generated_randomize_stage{X}.yaml"
content: |
# [The complete randomization file for Stage {X} with inline comments]
Do not output any text or explanation outside these YAML blocks.
-------------------------------------------------
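To make the JAX constraints in the reward instructions above concrete, here is an illustrative snippet (not part of the prompt template): branching is expressed with `jp.where` and vector constants with `jp.array`. The variable names are placeholders, not guaranteed entries of <INSERT_REWARD_VARS_HERE>.

```python
import jax.numpy as jp

def foot_clearance_reward(foot_height, lift_thresh="0.2"):
    # YAML passes scalar settings as strings (see the notes above), so cast first.
    thresh = float(lift_thresh)
    # No Python `if`/`and`/`or` on traced arrays: use jp.where (or &, | for bitwise logic).
    return jp.where(foot_height > thresh, 1.0, 0.0)

def upright_reward(rot_up):
    # Vector literals must be jp.array, e.g. "rot_up - jp.array([0.0, 0.0, 1.0])".
    err = rot_up - jp.array([0.0, 0.0, 1.0])
    return jp.exp(-jp.sum(err ** 2))
```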
Feedback integration for curriculum learning: This is the Stage {current_stage} training feedback process based on the workflow. Using the previous training step's reward metrics and configurations, update the reward and configuration files for improved performance. The metrics either 1) describe the training process (e.g. entropy loss, policy loss), or 2) give the value each reward term contributes to the overall reward; each eval/* metric name matches the corresponding reward term in the reward file's functions exactly. Look at specific metrics, such as loss and penalty values, to guide the updates (e.g. if the loss is too volatile, reduce the learning rate).
The above feedback prompt, together with the next stage's description and all the stage's metrics log file, is prepended to the Stage-Level LLM prompt for use by the Feedback LLM.
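As a rough illustration of what this feedback step consumes, the metrics log could be distilled into hints like the following before being prepended to the Stage-Level prompt. The metric names and thresholds here are assumptions for the sketch, not AURA's actual heuristics.

```python
import numpy as np

def summarize_metrics(metrics_log, reward_terms):
    # metrics_log: dict mapping metric names (e.g. "losses/policy_loss",
    # "eval/velocity_tracking") to lists of values over training.
    hints = []
    policy_loss = np.asarray(metrics_log.get("losses/policy_loss", []), dtype=float)
    if policy_loss.size > 1 and policy_loss.std() > 0.5 * abs(policy_loss.mean()):
        hints.append("policy loss is volatile: consider lowering the learning rate")
    for term in reward_terms:
        values = metrics_log.get(f"eval/{term}")
        if values and abs(float(np.mean(values))) < 1e-4:
            hints.append(f"reward term '{term}' contributes ~0: rescale or rework it")
    return hints
```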
@article{aura2025,
title={AURA: Autonomous Upskilling with Retrieval-Augmented Agents},
author={Anonymous Authors},
journal={Under Review},
year={2025}
}