Policy Gradient Simulator

Parameters

Learning rate α

How much the policy parameter moves per update

Policy std. dev. σ

Exploration width. A larger σ explores more but adds gradient noise

Training episodes

Batch size

Trials per update. More trials make the gradient more stable

Baseline

Whether to subtract a reference value to cut gradient variance

Results

Final policy mean μ

True optimal action a*

Final average reward

Convergence episode

Reward improvement

Baseline variance reduction

Reward landscape and policy — learning animation

The blue curve is the reward landscape R(a) = −(a−a*)², and the red line marks the optimal action a*. The yellow bell curve is the Gaussian policy, which slides toward the optimum as learning proceeds.

Policy mean μ over training

Average reward over training

Theory & Key Formulas

$$\nabla_\theta J=\mathbb{E}\big[\nabla_\theta\log\pi_\theta(a)\,(R-b)\big]$$

The policy gradient theorem. The gradient of the expected reward J is the expectation of the log-probability gradient (score function) of action a weighted by the reward R minus a baseline b.

$$\nabla_\mu\log\pi=\frac{a-\mu}{\sigma^{2}}$$

The score function of a Gaussian policy with mean μ and standard deviation σ. Taking an above-mean action that earns a good reward pushes μ upward.
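For readers who want the intermediate step, the score follows directly from the log-density of a Gaussian. This is a short derivation sketch that treats σ as fixed and differentiates with respect to μ only:

$$\log\pi_\mu(a)=-\frac{(a-\mu)^{2}}{2\sigma^{2}}-\log\!\big(\sigma\sqrt{2\pi}\big)\quad\Longrightarrow\quad\nabla_\mu\log\pi_\mu(a)=\frac{a-\mu}{\sigma^{2}}$$

The second term does not depend on μ, so only the quadratic term contributes. The score is positive for actions above the mean and negative for actions below it, and it grows as σ shrinks.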

$$\mu \leftarrow \mu + \alpha\,\widehat{\nabla_\mu J},\qquad b=\overline{R}$$

The gradient-ascent update with learning rate α. Subtracting the baseline b (batch-average reward R̄) leaves the expected gradient unchanged and only lowers the variance of the estimate.
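The three formulas above can be put together in a few lines. The following is a minimal sketch, not the simulator's actual code: the function names `reward` and `train`, the default values, and the starting point μ = 0 are all assumptions made for illustration.

```python
import random


def reward(a, a_star):
    """Reward landscape from this page: R(a) = -(a - a*)^2."""
    return -(a - a_star) ** 2


def train(mu=0.0, a_star=2.0, sigma=0.5, alpha=0.05,
          episodes=300, batch_size=16, use_baseline=True, seed=0):
    """Run REINFORCE on a 1-D Gaussian policy; return the history of mu.

    All defaults are hypothetical values chosen for this sketch.
    """
    rng = random.Random(seed)  # fixed seed makes the run reproducible
    history = [mu]
    for _ in range(episodes):
        # Sample a batch of actions from the current policy N(mu, sigma^2).
        actions = [rng.gauss(mu, sigma) for _ in range(batch_size)]
        rewards = [reward(a, a_star) for a in actions]
        # Baseline b: batch-average reward, or 0 when the baseline is off.
        b = sum(rewards) / batch_size if use_baseline else 0.0
        # Monte Carlo estimate of E[(a - mu) / sigma^2 * (R - b)].
        grad = sum((a - mu) / sigma ** 2 * (r - b)
                   for a, r in zip(actions, rewards)) / batch_size
        mu += alpha * grad  # gradient-ascent step
        history.append(mu)
    return history


if __name__ == "__main__":
    mus = train()
    print(f"final mu = {mus[-1]:.3f}")
```

With these assumed settings μ should settle near a* = 2 and keep jittering around it, because each update is still driven by a noisy batch estimate.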

What is the Policy Gradient Simulator?

🙋

I've heard of "Q-learning" in reinforcement learning — is "policy gradient" something completely separate?

🎓

Roughly speaking, the goal is the same — learn the actions that maximise reward — but the route is reversed. Q-learning first estimates a "value (score) of each action" and then picks the highest-scoring one, deciding the policy indirectly through the value. Policy gradient instead tweaks the policy itself directly. In this tool the policy is a function — a Gaussian distribution with mean μ — and we nudge μ little by little in the direction that increases reward.

🙋

Tweaking the policy directly — concretely, how do you know "the direction that increases reward"?

🎓

That is the REINFORCE estimator, with gradient ∇J = E[∇log π(a)·(R−b)]. The term ∇log π(a) is "the direction that makes that action more likely", and (R−b) is "how much better that action was than a reference". An action that performs above the baseline gets pushed that way and becomes more likely; a worse one is pushed back. Try many actions and average, and the policy naturally drifts toward the good ones.

🙋

There's a "baseline" parameter on the left. Is that the b in (R−b)? It feels like it shouldn't matter whether you subtract it or not...

🎓

That's the interesting part. Subtracting a baseline b does not change the expected gradient at all, not by a hair, because the expectation of ∇log π is zero. But any actual estimate from a finite batch is full of noise. If the rewards are all big positives like "+95" and "+98", every action looks good and the gradient is jumpy. Subtract the mean of about +96 and only the "better or worse than average" difference remains, so the spread of the estimate drops sharply. Flip to "No baseline" in the card above and you can watch the reported gradient variance go up.

🙋

I see. So do the learning rate and σ (the policy std. dev.) make learning faster the bigger you set them?

🎓

Sadly it's not that simple. Push the learning rate α too high and the updates run wild: μ overshoots the optimal action and oscillates or diverges. σ is the exploration width: a large σ explores widely but adds gradient noise, while a small σ only looks at a narrow region. Set α to about 0.4 and the reward curve in the chart below goes jagged. In practice this balancing act is the crux, and newer methods like PPO are refinements of policy gradient designed not to "step too far in one update".

🙋

PPO — I've heard of it! It's used to train ChatGPT, right? So that's a member of the policy gradient family too.

🎓

Exactly. PPO, which sits at the heart of RLHF (reinforcement learning from human feedback), and actor-critic methods share the same root idea as REINFORCE: "push the policy directly with a gradient". This tool uses the simplest possible setting, a one-dimensional action, but the two things that happen here (pushing with the score function and cutting variance with a baseline) carry over to tuning large language models. It's a great subject for getting the basics into your bones.

Frequently Asked Questions

Value-based methods such as Q-learning first estimate the value (expected return) of each state-action and then choose the highest-valued action, so the policy is decided indirectly through the value. Policy gradient methods instead hold the policy itself as an explicit parameterised function — here a Gaussian distribution over a continuous action — and update its parameters directly by gradient ascent to increase the expected reward. They handle continuous actions and stochastic policies naturally, change the policy smoothly, and form the basis of actor-critic methods and PPO.

The REINFORCE gradient estimate is ∇J = E[∇log π(a)·(R−b)]. The term ∇log π(a) is the score function — the direction that makes that action more likely — and (R−b) is a weight measuring how much better that action did than a reference. Actions that performed above the baseline are pushed along the score direction and become more likely; worse actions are pushed the other way. For a Gaussian policy, ∇_μ log π = (a−μ)/σ², so taking an action above the mean and earning a good reward raises μ — exactly the intuitive update.
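A single-sample worked example makes the update concrete. The numbers below are hypothetical and chosen only to keep the arithmetic simple: μ = 1.0, σ = 0.5, sampled action a = 1.4, advantage R − b = +0.5, learning rate α = 0.1.

$$\nabla_\mu\log\pi=\frac{a-\mu}{\sigma^{2}}=\frac{1.4-1.0}{0.25}=1.6,\qquad \widehat{\nabla_\mu J}=1.6\times 0.5=0.8,\qquad \mu\leftarrow 1.0+0.1\times 0.8=1.08$$

The action was above the mean and did better than the baseline, so μ moves up. Had the same action scored R − b = −0.5, the update would have been μ ← 0.92 instead.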

The baseline b reduces the variance of the gradient estimate. Subtracting a reference value b (typically the batch-average reward) from the reward R does not change the expected gradient, so it introduces no bias, because the expectation of ∇log π is zero. But if rewards are all large positive numbers, every action looks good and the gradient is noisy. Subtracting the mean leaves only the difference "better or worse than average", which dramatically cuts the spread of the estimate and stabilises learning. This tool reports the ratio of gradient standard deviations with and without the baseline.
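The variance effect is easy to check numerically. The sketch below holds μ fixed, draws many independent batches, and compares the spread of the batch gradient with and without a batch-mean baseline. The function name `gradient_samples`, the constant reward offset of 100 (added so that every reward is a large positive number, as in the explanation above), and all defaults are assumptions for illustration, not the tool's implementation.

```python
import random
import statistics


def gradient_samples(use_baseline, mu=0.0, a_star=1.0, sigma=0.5,
                     offset=100.0, batch_size=16, n_batches=2000, seed=0):
    """Return n_batches independent batch gradient estimates at a fixed mu."""
    rng = random.Random(seed)
    grads = []
    for _ in range(n_batches):
        actions = [rng.gauss(mu, sigma) for _ in range(batch_size)]
        # Hypothetical offset: shifts every reward to a large positive value.
        rewards = [offset - (a - a_star) ** 2 for a in actions]
        b = sum(rewards) / batch_size if use_baseline else 0.0
        grads.append(sum((a - mu) / sigma ** 2 * (r - b)
                         for a, r in zip(actions, rewards)) / batch_size)
    return grads


with_b = gradient_samples(use_baseline=True)
without_b = gradient_samples(use_baseline=False)
print("std with baseline:   ", statistics.stdev(with_b))
print("std without baseline:", statistics.stdev(without_b))
print("ratio:               ", statistics.stdev(without_b) / statistics.stdev(with_b))
```

Both runs use the same seed and therefore the same sampled actions, so the only difference between them is the baseline. The larger the reward offset, the larger the gap, since without a baseline the offset multiplies the zero-mean score term directly.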

If the learning rate α is too large the updates become unstable and the policy diverges or oscillates; too small and convergence is slow. The policy standard deviation σ is the exploration width — larger σ explores more widely but adds gradient noise, smaller σ stays local. Increasing the batch size makes each gradient estimate more stable at the cost of more trials per update. In practice, start with a small learning rate, tune σ and batch size so the reward curve rises smoothly, and keep the baseline on by default.
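The effect of batch size on gradient stability can be sketched the same way: fix μ, estimate the gradient many times at each batch size, and look at the spread. The helper name `grad_std` and every default below are hypothetical, and the batch-mean baseline is assumed to be on.

```python
import random
import statistics


def grad_std(batch_size, mu=0.0, a_star=1.0, sigma=0.5,
             n_batches=1000, seed=0):
    """Standard deviation of the batch gradient estimate at a fixed mu."""
    rng = random.Random(seed)
    grads = []
    for _ in range(n_batches):
        actions = [rng.gauss(mu, sigma) for _ in range(batch_size)]
        rewards = [-(a - a_star) ** 2 for a in actions]
        b = sum(rewards) / batch_size  # batch-average baseline
        grads.append(sum((a - mu) / sigma ** 2 * (r - b)
                         for a, r in zip(actions, rewards)) / batch_size)
    return statistics.stdev(grads)


for n in (4, 16, 64):
    print(f"batch size {n:>2}: gradient std = {grad_std(n):.3f}")
```

The spread should shrink roughly like one over the square root of the batch size, which is why quadrupling the number of trials per update buys only about a halving of the noise.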

Real-World Applications

RLHF for large language models: One of the most prominent uses of policy gradients is "reinforcement learning from human feedback (RLHF)" for conversational AI such as ChatGPT. The language model itself is treated as the "policy", the text it generates as the "action", and the score from a reward model trained on human preferences as the "reward"; the policy is then updated with PPO, a refined policy gradient method. Replace the μ in this tool with the model's vast parameter set and the core mechanism is the same.

Robot control and continuous-action tasks: For control problems where actions are continuous — joint torques, wheel speeds — policy gradient methods fit more naturally than value-based ones. Leg motion for a walking robot, attitude control of a drone, trajectory generation for a robot arm: these typically express the continuous action with a Gaussian policy and optimise it through many simulated trials.

Game AI and decision-making: Policy gradients are also used in AI for Go, Shogi and video games, and in sequential decision problems such as inventory management, ad serving and recommender systems. Especially when the action choice should be stochastic (a strategy your opponent cannot read), policy gradients — which can learn a stochastic policy directly — are well suited. Actor-critic is a widely used configuration in this area.

Foundational understanding in learning and education: A one-dimensional continuous bandit like this tool is an effective teaching aid for confirming, both through the maths and the behaviour, "does policy gradient really converge to the optimal action?" and "why does the baseline help?" Because it runs deterministically with a fixed seed, you can compare exactly what changes when you adjust a single parameter.

Common Misconceptions and Pitfalls

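A common misconception is that "subtracting a baseline changes the result (introduces bias)". As long as the baseline b does not depend on the action a, E[∇log π·b] = b·E[∇log π] = 0, so the expected gradient is exactly the same with or without b. Only the variance changes. That is precisely why you can safely subtract a smart baseline like the mean reward and stabilise learning. Conversely, if you let b depend on the action it generally introduces bias, so the design rule is: it may depend on the state but never on the action. Strictly speaking, a batch-average baseline that includes each sample's own reward depends weakly on that sample's action; with N independent samples this scales the expected gradient by (N−1)/N without changing its direction, and a leave-one-out average removes the effect.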
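The zero-expectation identity follows from the fact that a probability density integrates to 1. A short derivation, assuming b is constant with respect to a and that differentiation and integration can be exchanged:

$$\mathbb{E}_{a\sim\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a)\,b\big]=b\int\pi_\theta(a)\,\frac{\nabla_\theta\pi_\theta(a)}{\pi_\theta(a)}\,da=b\,\nabla_\theta\!\int\pi_\theta(a)\,da=b\,\nabla_\theta 1=0$$

The step that pulls b out of the integral is exactly where action-independence is needed: if b varied with a, it would stay inside the integral and the result would no longer vanish in general.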

Next, the belief that "policy gradient always converges to the global optimum". Policy gradient is a local, gradient-based search, so at best it converges to a local optimum; if the reward landscape has multiple peaks, the initial policy or exploration width σ may trap it on a lower one. The reward landscape in this tool has a single peak, so with a stable learning rate μ heads toward a*, but real problems require ensuring enough exploration, training from several initial values, or adding an entropy term to encourage exploration.

Finally, the overconfidence that "a higher learning rate means faster learning". The REINFORCE gradient estimate is inherently noisy, and raising the learning rate steps boldly along that noise too, so the policy overshoots the optimal action and oscillates or diverges. Set α near its maximum in this tool and you can see the reward curve become jagged. In practice, take a small learning rate, increase the batch size to stabilise the gradient, and combine it with a mechanism that "limits the step per update" as PPO does. Keep in mind that speed and stability are a trade-off.
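To see why the step size matters even before noise enters, the noise-free (expected) update can be worked out for the reward R(a) = −(a−a*)² used on this page. This is a sketch under the assumption that the reward is used unscaled and that the exact expected gradient replaces the sampled one; the tool's internal scaling may differ.

$$J(\mu)=\mathbb{E}_{a\sim\mathcal{N}(\mu,\sigma^{2})}\big[-(a-a^{*})^{2}\big]=-(\mu-a^{*})^{2}-\sigma^{2},\qquad\nabla_\mu J=-2(\mu-a^{*})$$

$$\mu_{t+1}-a^{*}=(1-2\alpha)\,(\mu_t-a^{*})$$

Under these assumptions the error shrinks monotonically for 0 < α < 0.5, flips sign on every update for 0.5 < α < 1, and grows without bound for α > 1. Sampling noise is multiplied by α as well, so a real run turns jagged well before the noise-free stability limit is reached.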

