reward-penalty-engineering

Name: reward-penalty-engineering
Author: mzqef

Methodology for exploring, testing, and archiving reward/penalty functions for VBot quadruped navigation. A process-oriented guide for systematic reward discovery.

Install

mkdir -p .claude/skills/reward-penalty-engineering && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/15210" && unzip -o skill.zip -d .claude/skills/reward-penalty-engineering && rm skill.zip

Installs to .claude/skills/reward-penalty-engineering

About this skill

Purpose

This skill teaches the methodology of reward/penalty exploration — how to discover, test, evaluate, and archive reward signals. It is a process guide, not a recipe book.

How to identify what reward/penalty to try next
How to formulate a hypothesis and test it
How to evaluate whether a reward change helped
How to archive findings in the reward library for reuse

IMPORTANT:

The reward function lives in starter_kit/navigation*/vbot/vbot_*_np.py → _compute_reward().

Reward weights are in starter_kit/navigation*/vbot/cfg.py → RewardConfig.scales dict.

Current reward component details, default values, and search ranges are documented in starter_kit_docs/{task-name}/Task_Reference.md.

Anti-laziness mechanisms (conditional alive_bonus, time_decay, successful truncation) are active. Do NOT remove them.

Do NOT re-implement the reward function from scratch. Always check the existing implementation and documentation before making changes. Modify weights or add new terms incrementally, each with a clear hypothesis, a test, an evaluation, and an archive entry.

This skill does NOT contain reward component examples or scale tables. Those live in their respective locations:

What	Where
Component reference & scale ranges	`starter_kit_schedule/templates/reward_config_template.yaml`
Archived reward/penalty instances	`starter_kit_schedule/reward_library/`
Terrain strategies & reward code	`quadruped-competition-tutor` skill
Stage-specific reward overrides	`curriculum-learning` skill
Reward weight search spaces	`hyperparameter-optimization` skill
Visual reward debugging	`subagent-copilot-cli` skill

When to Use This Skill

Situation	Use This
"I need a new reward idea"	✅ Follow the Discovery Process
"This reward isn't working, what now?"	✅ Follow Diagnostic Methodology
"I want to compare two reward designs"	✅ Follow Experiment Protocol
"I found a good reward, where to save it?"	✅ Follow Archiving Process
"What are the reward scale ranges?"	❌ Read `reward_config_template.yaml`
"What reward code exists for stairs?"	❌ Read `quadruped-competition-tutor`
"How do I tune reward weights automatically?"	❌ Read `hyperparameter-optimization`

The Exploration Cycle

Reward engineering is iterative. Every change follows this cycle:

    ┌───────────────┐
    │   DIAGNOSE    │ ← What behavior is wrong?
    └───────┬───────┘
            ▼
    ┌───────────────┐
    │  HYPOTHESIZE  │ ← What reward signal could fix it?
    └───────┬───────┘
            ▼
    ┌───────────────┐
    │   IMPLEMENT   │ ← Minimal change, one variable at a time
    └───────┬───────┘
            ▼
    ┌───────────────┐
    │     TEST      │ ← Short run (1-2M steps), multiple seeds
    └───────┬───────┘
            ▼
    ┌───────────────┐
    │   EVALUATE    │ ← Did the hypothesis hold?
    └───────┬───────┘
            ▼
    ┌───────────────┐
    │    ARCHIVE    │ ← Record result in reward library
    └───────┬───────┘
            │
            ▼
       Next cycle

Rule: Never change more than one reward dimension per cycle. If you change the termination penalty and also add a new gait reward in the same cycle, you cannot tell which change caused the outcome.
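A minimal sketch of how this rule could be checked mechanically before a run, assuming the baseline and candidate weights are plain dicts shaped like `RewardConfig.scales`. The helper name `changed_keys` and the example keys and values are hypothetical, not taken from `cfg.py`.

```python
def changed_keys(baseline: dict, candidate: dict) -> list:
    """Return the sorted reward keys whose weight differs between two scale dicts."""
    all_keys = set(baseline) | set(candidate)
    # A key present in only one dict counts as changed (added or removed term).
    return sorted(k for k in all_keys if baseline.get(k) != candidate.get(k))


# Hypothetical weights, for illustration only.
baseline = {"termination": -100.0, "alive_bonus": 0.5}
one_change = {"termination": -200.0, "alive_bonus": 0.5}
two_changes = {"termination": -200.0, "alive_bonus": 0.5, "gait": 0.1}

for name, candidate in [("one_change", one_change), ("two_changes", two_changes)]:
    diff = changed_keys(baseline, candidate)
    verdict = "ok" if len(diff) <= 1 else "split into separate cycles"
    print(name, diff, verdict)
```

A candidate that touches two keys is flagged so it can be split into two consecutive cycles.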

Phase 1: Diagnose

Behavioral Signals

Before touching rewards, identify what behavior is wrong. State it as a concrete observable, not as "the reward is too low":

Observable	Likely Reward Gap
Robot doesn't move	Missing or weak positive incentive
Robot moves but falls	Missing or weak stability penalty
Robot oscillates near goal	Reward gradient too steep near target
Robot takes bizarre paths	Reward hacking — high reward from unintended behavior
Robot crouches/crawls	Missing height maintenance signal
Robot ignores obstacles	Missing proximity/collision signal
Robot is fast but jerky	Missing smoothness penalty
Robot is stable but slow	Positive incentive too weak relative to penalties
Reward curve plateaus	Reward provides no gradient in current state region
Robot stands still near target	alive_bonus accumulation > goal reward — see Lazy Robot Case Study below
Distance increases during training	Reward hacking via per-step bonus. Check alive_bonus × avg_ep_len vs arrival_bonus
Episode length near max, reached% drops	Robot exploiting per-step rewards instead of completing task
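The last three rows of the table come down to one comparison: the per-step bonus accumulated over an episode against the one-time arrival bonus. The sketch below works through that arithmetic. The function name and all numbers are hypothetical; the real values would come from `RewardConfig.scales` and the training logs.

```python
def per_step_vs_arrival(alive_bonus: float, avg_ep_len: float, arrival_bonus: float) -> float:
    """Ratio of reward accumulated from the per-step bonus to the one-time arrival bonus."""
    return (alive_bonus * avg_ep_len) / arrival_bonus


# Hypothetical values for illustration only.
ratio = per_step_vs_arrival(alive_bonus=0.5, avg_ep_len=1000, arrival_bonus=100.0)
print(f"per-step total / arrival bonus = {ratio:.1f}")
if ratio > 1.0:
    print("Staying alive pays more than arriving: suspect a lazy-robot exploit.")
```

A ratio well above 1 suggests the agent can earn more by surviving than by finishing the task.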

Diagnostic Commands

# 1. Watch the policy — ALWAYS start here before looking at numbers
uv run scripts/play.py --env <env-name>

# 2. Train with rendering to see behavior in real time
uv run scripts/train.py --env <env-name> --render

# 3. TensorBoard for reward curves
uv run tensorboard --logdir runs/<env-name>

Visual Diagnosis

Use subagent-copilot-cli to analyze simulation frames and training curves:

# Describe what you see, ask what reward signal is missing
copilot --model gpt-4.1 --allow-all -p "Watch this simulation frame. The robot is <describe behavior>. What reward signal might cause this?" -s

Key insight: A reward signal is "missing" if the agent has no gradient pointing toward the desired behavior in its current state. The fix may be a new reward, a penalty, or reshaping an existing one.

Phase 2: Hypothesize

Formulating a Good Hypothesis

A testable reward hypothesis has three parts:

Behavior target: What the robot should do differently
Signal mechanism: What mathematical signal encodes that behavior
Expected side effect: What might go wrong

Template:

"If I add/modify <signal> with weight <w>, the robot should <desired behavior>, but might also <risk>."
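One way to keep hypotheses uniform across experiments is to store the template fields as structured data and render the sentence from them. This is a sketch only: the class `RewardHypothesis` and the example values are hypothetical and are not part of the starter kit.

```python
from dataclasses import dataclass


@dataclass
class RewardHypothesis:
    signal: str            # what is added or modified
    weight: float          # proposed scale for the signal
    desired_behavior: str  # behavior target
    risk: str              # expected side effect

    def statement(self) -> str:
        """Render the hypothesis using the template sentence."""
        return (
            f"If I add/modify {self.signal} with weight {self.weight}, "
            f"the robot should {self.desired_behavior}, but might also {self.risk}."
        )


# Hypothetical example values.
h = RewardHypothesis(
    signal="a vertical-velocity penalty",
    weight=-0.5,
    desired_behavior="stop bouncing",
    risk="walk with a stiff, low-clearance gait",
)
print(h.statement())
```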

Discovery Strategies

When you don't know what to try, use these strategies to generate candidates:

Strategy 1: Inversion

Take the undesired behavior and directly penalize it.

Robot bouncing → penalize vertical velocity
Robot spinning → penalize angular velocity
Robot retreating → penalize backward displacement
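A rough sketch of the three inversions as penalty magnitudes for a single robot. The function name, argument names, and the choice of squared velocities are assumptions for illustration; the actual terms in `_compute_reward()` may be defined differently.

```python
def inversion_penalties(vel_z: float, yaw_rate: float, forward_delta: float) -> dict:
    """Non-negative penalty magnitudes; multiply each by a negative weight."""
    return {
        # Bouncing: squared vertical linear velocity.
        "vertical_velocity": vel_z ** 2,
        # Spinning: squared yaw rate.
        "angular_velocity": yaw_rate ** 2,
        # Retreating: only negative forward displacement is penalized.
        "backward_displacement": max(-forward_delta, 0.0),
    }


# Hypothetical state: bouncing slightly, not spinning, moving backward 1 cm.
penalties = inversion_penalties(vel_z=0.3, yaw_rate=0.0, forward_delta=-0.01)
for name, value in penalties.items():
    print(f"{name}: {value:.4f}")
```

Clamping the displacement term at zero keeps forward motion from being treated as a negative penalty, which would otherwise act as a second progress reward.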

Strategy 2: Shaping the Gradient

If the robot is stuck, the reward surface is flat in its current region. Add a signal that creates local gradient:

Robot stuck far from goal → Add distance-based shaping (sigmoid, exponential)
Robot stuck near goal → Add fine-grained proximity bonus
Robot stuck on terrain edge → Add progress checkpoints
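To illustrate what "flat" means here, the sketch below estimates the slope of two candidate shaping functions at several distances from the goal. Both functions and their parameters (`sigma`, `d_max`) are hypothetical examples, not the shaping used in the starter kit.

```python
import math


def exp_proximity(d: float, sigma: float = 0.5) -> float:
    """Exponential proximity reward: sharp near the goal, flat far away."""
    return math.exp(-d / sigma)


def linear_progress(d: float, d_max: float = 10.0) -> float:
    """Linear distance shaping: the same slope everywhere inside d_max."""
    return 1.0 - min(d, d_max) / d_max


def slope(reward_fn, d: float, eps: float = 1e-4) -> float:
    """Central finite-difference estimate of d(reward)/d(distance)."""
    return (reward_fn(d + eps) - reward_fn(d - eps)) / (2 * eps)


for d in (0.2, 2.0, 8.0):
    print(f"d={d:>4}: exp slope={slope(exp_proximity, d):+.2e}, "
          f"linear slope={slope(linear_progress, d):+.2e}")
```

With these parameters the exponential term gives almost no gradient at 8 m but a strong one at 0.2 m, while the linear term gives a constant slope. That is why a far-range signal and a near-range signal are often treated as separate components.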

Strategy 3: Proxy Decomposition

Break the competition score into component sub-goals and create a signal for each:

Final score = traversal + bonus zones + time bonus
→ Create separate signals for: forward progress, zone proximity, speed
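A minimal sketch of the decomposition as a weighted sum that also returns the per-term breakdown, so each sub-goal's contribution stays visible. The signal names mirror the three sub-goals above; the function name, signal values, and weights are hypothetical.

```python
def decomposed_reward(signals: dict, weights: dict) -> tuple:
    """Weighted sum of per-sub-goal signals, plus the per-term breakdown."""
    breakdown = {name: weights[name] * value for name, value in signals.items()}
    return sum(breakdown.values()), breakdown


# Hypothetical per-step signals and weights, one per score component.
signals = {"forward_progress": 0.04, "zone_proximity": 0.6, "speed": 0.8}
weights = {"forward_progress": 10.0, "zone_proximity": 0.5, "speed": 0.25}

total, breakdown = decomposed_reward(signals, weights)
for name, value in breakdown.items():
    print(f"{name}: {value:.2f}")
print(f"total: {total:.2f}")
```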

Strategy 4: Biomimetic Analogy

What would a real quadruped "want" in this situation?

Stairs → lift knees higher
Uneven ground → keep center of mass low
Obstacles → slow down, increase awareness

Strategy 5: Ablation Discovery

Temporarily remove one existing reward and see what degrades:

# Remove component to see its effect
uv run scripts/train.py --env <env-name> --seed 42 --cfg-override "reward_config.scales.<component>=0.0"

If removing a component doesn't change behavior, it was irrelevant. If behavior collapses, it was critical.
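The interpretation rule above can be sketched as a small classifier over a success metric (for example, the fraction of episodes that reach the target) measured with and without the component. The thresholds and the two intermediate labels are assumptions added for illustration; the text itself only distinguishes "irrelevant" from "critical".

```python
def classify_ablation(baseline_metric: float, ablated_metric: float,
                      irrelevant_tol: float = 0.05, critical_drop: float = 0.5) -> str:
    """Label a component by the relative drop in a metric when its weight is zeroed."""
    if baseline_metric <= 0:
        raise ValueError("baseline metric must be positive")
    drop = (baseline_metric - ablated_metric) / baseline_metric
    if abs(drop) <= irrelevant_tol:
        return "irrelevant"
    if drop >= critical_drop:
        return "critical"
    # Intermediate cases (not named in the text): partial help, or removal helped.
    return "contributing" if drop > 0 else "harmful"


# Hypothetical success rates with and without the component.
print(classify_ablation(0.80, 0.79))
print(classify_ablation(0.80, 0.20))
```

Because single runs are noisy, the metrics fed into this comparison would ideally be averages over several seeds.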

Strategy 6: Competition-Score Alignment

Compare training reward to competition scoring rules. Gaps indicate missing signals:

Competition awards points for stopping in smiley zones
→ but training reward only rewards forward velocity
→ mismatch: need a "stop in zone" signal
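One possible shape for such a "stop in zone" signal, as a sketch: reward low speed, but only while the robot is inside the zone. The function name, the circular zone model, and `speed_scale` are assumptions; the actual zone geometry and scoring rules are described in the quadruped-competition-tutor skill.

```python
import math


def stop_in_zone_reward(pos, zone_center, zone_radius: float, speed: float,
                        speed_scale: float = 0.5) -> float:
    """Reward low speed, but only while the robot is inside the zone."""
    dist = math.hypot(pos[0] - zone_center[0], pos[1] - zone_center[1])
    if dist > zone_radius:
        return 0.0  # outside the zone: no signal, so forward progress is not discouraged
    return math.exp(-speed / speed_scale)  # 1.0 when stopped, decays as speed grows


# Hypothetical zone of radius 0.5 m centered at (2, 0).
print(stop_in_zone_reward((2.1, 0.0), (2.0, 0.0), 0.5, speed=0.0))  # inside, stopped
print(stop_in_zone_reward((2.1, 0.0), (2.0, 0.0), 0.5, speed=1.0))  # inside, moving
print(stop_in_zone_reward((5.0, 0.0), (2.0, 0.0), 0.5, speed=0.0))  # outside
```

Gating on zone membership matters: an ungated low-speed reward would pay the robot for standing still anywhere, which is the lazy behavior the anti-laziness mechanisms exist to prevent.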

Refer to quadruped-competition-tutor skill for competition scoring rules.

Strategy 7: Browse the Library

Check previously tried components in the reward library before inventing new ones:

# Browse archived reward components
Get-ChildItem starter_kit_schedule/reward_library/components/ | Select-Object Name
# Read a specific component's notes
Get-Content starter_kit_schedule/reward_library/components/<name>.yaml
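The commands above are PowerShell. For a shell-independent alternative, a short Python sketch can list the same directory. It assumes only that components are stored as `.yaml` files in that folder, as the commands imply; the helper name `list_components` is hypothetical.

```python
from pathlib import Path


def list_components(library_dir) -> list:
    """Return archived component names (YAML file stems), sorted alphabetically."""
    root = Path(library_dir)
    if not root.is_dir():
        return []  # library not created yet: nothing archived
    return sorted(p.stem for p in root.glob("*.yaml"))


for name in list_components("starter_kit_schedule/reward_library/components"):
    print(name)
```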

Phase 3: Implement

Principles

One variable at a time — Change a single reward component per experiment
Minimal change — Prefer adjusting a weight before adding new code
Use existing infrastructure — Check reward_config_template.yaml for components that can be enabled/disabled before writing new code

Where to Make Changes

Change Type	Location
Adjust existing weight	`starter_kit/{task}/vbot/cfg.py` → `RewardConfig.scales` dict
Add new reward term	`starter_kit/{task}/vbot/vbot_*_np.py` → `_compute_reward()`
Configure component	`starter_kit_schedule/templates/reward_config_template.yaml`

Change Magnitude Guidelines

When adjusting weights, use multiplicative steps, not additive ones:

Small adjustment: ×0.5 or ×2 (halve or double)
Medium adjustment: ×0.1 or ×10
Large adjustment: ×0.01 or ×100
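The step sizes above can be turned into candidate values with a small helper. This is a sketch: the function name and the example weights are hypothetical.

```python
def weight_candidates(current: float, size: str = "small") -> list:
    """Down- and up-scaled candidates for a weight, using the step sizes listed above."""
    factor = {"small": 2.0, "medium": 10.0, "large": 100.0}[size]
    return [current / factor, current * factor]


# Hypothetical current weights.
print(weight_candidates(-0.5))           # small step on a penalty
print(weight_candidates(0.2, "medium"))  # medium step on a positive term
```

Multiplicative steps keep the sign of the weight and scale sensibly whether the current value is 0.001 or 100, which a fixed additive step does not.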

For new components, start with a weight that produces reward magnitude comparable to existing dominant terms (check reward_breakdown logs).
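As a worked example of that guideline: if the dominant existing term contributes about 0.8 per step and the new raw signal averages about 4.0, a starting weight near 0.2 in magnitude puts the two on the same scale. The helper below encodes this; its name and the numbers are hypothetical, and the real magnitudes would be read from the reward_breakdown logs.

```python
def initial_weight(dominant_term_magnitude: float, raw_signal_magnitude: float,
                   sign: float = 1.0) -> float:
    """Weight that makes |weight * raw signal| match the dominant term's magnitude."""
    if raw_signal_magnitude <= 0:
        raise ValueError("raw signal magnitude must be positive")
    return sign * dominant_term_magnitude / raw_signal_magnitude


# Hypothetical magnitudes; sign=-1.0 because the new term is a penalty.
w = initial_weight(dominant_term_magnitude=0.8, raw_signal_magnitude=4.0, sign=-1.0)
print(f"starting weight: {w:.2f}")
```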

Phase 4: Test

🔴 AutoML-First Testing (MANDATORY)

NEVER iterate manually with train.py: changing one reward weight, running, reading TensorBoard, killing the run, and repeating. This manual one-at-a-time search is slow, error-prone, and wasteful. ALWAYS use automl.py for batch reward hypothesis testing.

The correct workflow:

Add your reward hypothesis as a search range in REWARD_SEARCH_SPACE (in automl.py)
Run automl.py with --hp-trials set to 8 or more to test multiple configurations in one batch
Read the structured comparison in starter_kit_log/automl_*/report.md
Archive results in the reward library
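To make the idea of a search range concrete, the sketch below draws a batch of trial configurations from entries shaped like `{"type": "uniform", "low": ..., "high": ...}`. It is an illustration only and is not how automl.py samples internally; the function name `sample_config` is hypothetical, and only the `uniform` type is handled.

```python
import random


def sample_config(search_space: dict, rng: random.Random) -> dict:
    """Draw one value per entry; only the "uniform" type is handled in this sketch."""
    config = {}
    for name, spec in search_space.items():
        if spec["type"] != "uniform":
            raise ValueError(f"unsupported type for {name}: {spec['type']}")
        config[name] = rng.uniform(spec["low"], spec["high"])
    return config


space = {"near_target_speed": {"type": "uniform", "low": -2.0, "high": -0.1}}
rng = random.Random(0)  # fixed seed so the batch is reproducible
trials = [sample_config(space, rng) for _ in range(8)]
for i, trial in enumerate(trials):
    print(i, round(trial["near_target_speed"], 3))
```

Sampling the whole batch up front is what separates this from manual iteration: all eight configurations are fixed before any result is seen, so the comparison is not biased by earlier runs.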

Example: Testing near_target_speed activation radius

# In automl.py REWARD_SEARCH_SPACE:
"near_target_speed": {"type": "uniform", "low": -2.0, "high": -0.1},
"near_target_activatio

---

*Content truncated.*

Source & maintenance

Updated

4mo ago

License

Apache-2.0
