Experience REINFORCE, a classic reinforcement-learning algorithm. On a continuous-action task, watch the mean of a Gaussian policy slide toward the optimal action and see how much the baseline tames the gradient noise — all computed deterministically with a fixed seed.
Parameters
Learning rate α
How much the policy parameter moves per update
Policy std. dev. σ
Exploration width. Larger explores more but adds gradient noise
Training episodes
Batch size
Trials per update. More trials make the gradient more stable
Baseline
Whether to subtract a reference value to cut gradient variance
Results
Final policy mean μ
True optimal action a*
Final average reward
Convergence episode
Reward improvement
Baseline variance reduction
Reward landscape and policy — learning animation
The blue curve is the reward landscape R(a) = −(a−a*)², the red line marks the optimal action a*. The yellow bell curve is the Gaussian policy, which slides toward the optimum as learning proceeds.
$$\nabla_\mu J=\mathbb{E}\left[\nabla_\mu\log\pi(a)\,(R-b)\right]$$
The policy gradient theorem: the gradient of the expected reward J is the expectation of the score function (the gradient of the log-probability of action a) weighted by the reward R minus a baseline b.
$$\nabla_\mu\log\pi=\frac{a-\mu}{\sigma^{2}}$$
The score function of a Gaussian policy with mean μ and standard deviation σ. Taking an above-mean action that earns a good reward pushes μ upward.
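One way to check the score formula is to compare it with a numerical derivative of the Gaussian log-density. The sketch below is illustrative only: the names log_prob, score and numeric_score are made up for this example and are not taken from the tool.

```python
import math


def log_prob(a, mu, sigma):
    """Log-density of the Gaussian policy N(mu, sigma^2) at action a."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (a - mu) ** 2 / (2 * sigma ** 2)


def score(a, mu, sigma):
    """Analytic score function: d/dmu log pi(a) = (a - mu) / sigma^2."""
    return (a - mu) / sigma ** 2


def numeric_score(a, mu, sigma, h=1e-5):
    """Central finite difference of log_prob with respect to mu."""
    return (log_prob(a, mu + h, sigma) - log_prob(a, mu - h, sigma)) / (2 * h)


a, mu, sigma = 2.0, 0.5, 1.5
# The action lies above the mean, so both values are positive.
print(score(a, mu, sigma), numeric_score(a, mu, sigma))
```

An action above the mean gives a positive score, so a positive weight (R−b) moves μ upward and a negative weight moves it downward.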
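$$\mu\leftarrow\mu+\alpha\cdot\frac{1}{N}\sum_{i=1}^{N}\frac{a_i-\mu}{\sigma^{2}}\,(R_i-b)$$
The gradient-ascent update with learning rate α, averaged over a batch of N trials. Subtracting the baseline b (batch-average reward R̄) leaves the expected gradient unchanged and only lowers the variance of the estimate.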
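Putting the score function and the update together gives a complete REINFORCE loop. The following is a minimal sketch under stated assumptions, not the tool's actual source: it uses the reward R(a) = −(a−a*)², treats one "episode" as one batch update, and the function name reinforce, its default values and its seed handling are all hypothetical.

```python
import random


def reinforce(a_star=3.5, mu0=0.0, sigma=1.5, alpha=0.02,
              updates=500, batch_size=32, use_baseline=True, seed=0):
    """Run REINFORCE on a 1-D Gaussian policy and return (final mu, mu history)."""
    rng = random.Random(seed)  # fixed seed makes every run reproducible
    mu = mu0
    history = []
    for _ in range(updates):
        # Sample a batch of actions from the current policy N(mu, sigma^2).
        actions = [rng.gauss(mu, sigma) for _ in range(batch_size)]
        rewards = [-(a - a_star) ** 2 for a in actions]
        # Baseline: batch-average reward, or zero when the baseline is off.
        b = sum(rewards) / batch_size if use_baseline else 0.0
        # Batch average of score * (reward - baseline).
        grad = sum((a - mu) / sigma ** 2 * (r - b)
                   for a, r in zip(actions, rewards)) / batch_size
        mu += alpha * grad  # gradient ascent on the expected reward
        history.append(mu)
    return mu, history


final_mu, _ = reinforce()
print(round(final_mu, 3))
```

Calling reinforce(use_baseline=False) with the same seed draws from the same random stream, which makes it a simple way to compare the two settings.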
What is the Policy Gradient Simulator?
🙋
I've heard of "Q-learning" in reinforcement learning — is "policy gradient" something completely separate?
🎓
Roughly speaking, the goal is the same — learn the actions that maximise reward — but the route is reversed. Q-learning first estimates a "value (score) of each action" and then picks the highest-scoring one, deciding the policy indirectly through the value. Policy gradient instead tweaks the policy itself directly. In this tool the policy is a function — a Gaussian distribution with mean μ — and we nudge μ little by little in the direction that increases reward.
🙋
Tweaking the policy directly — concretely, how do you know "the direction that increases reward"?
🎓
That is the REINFORCE estimator, with gradient ∇J = E[∇log π(a)·(R−b)]. The term ∇log π(a) is "the direction that makes that action more likely", and (R−b) is "how much better that action was than a reference". An action that performs above the baseline gets pushed that way and becomes more likely; a worse one is pushed back. Try many actions and average, and the policy naturally drifts toward the good ones.
🙋
There's a "baseline" parameter on the left. Is that the b in (R−b)? It feels like it shouldn't matter whether you subtract it or not...
🎓
That's the interesting part. Subtracting a baseline b does not change the expected gradient at all — not by a hair — because the expectation of ∇log π is zero. But the actual estimate is full of noise. If the rewards are all big positives like "+95", "+98", every action looks good and the gradient is jumpy. Subtract the mean of about +96 and only the "better or worse than average" difference remains, and the spread of the estimate drops sharply. Flip to "No baseline" in the card above and you'll see the gradient variance rise in numbers.
🙋
I see. So do the learning rate and σ (the policy std. dev.) make learning faster the bigger you set them?
🎓
Sadly it's not that simple. Push the learning rate α too high and the updates run wild — μ overshoots the optimal action and oscillates or diverges. σ is the exploration width: large explores widely but adds gradient noise, small only looks at a narrow region. Set α to about 0.4 in the chart below and the reward curve goes jagged. In practice this balancing act is the crux, and newer methods like PPO are an evolved policy gradient designed not to "step too far in one update".
🙋
PPO — I've heard of it! It's used to train ChatGPT, right? So that's a member of the policy gradient family too.
🎓
Exactly. PPO, which sits at the heart of RLHF (reinforcement learning from human feedback), and actor-critic methods share the same root idea as REINFORCE: "push the policy directly with a gradient". This tool uses the simplest one-dimensional action, but what happens here (pushing with the score function, cutting variance with the baseline) works the same way when tuning large language models. It's a great subject for getting the basics into your bones.
Frequently Asked Questions
Value-based methods such as Q-learning first estimate the value (expected return) of each state-action and then choose the highest-valued action, so the policy is decided indirectly through the value. Policy gradient methods instead hold the policy itself as an explicit parameterised function — here a Gaussian distribution over a continuous action — and update its parameters directly by gradient ascent to increase the expected reward. They handle continuous actions and stochastic policies naturally, change the policy smoothly, and form the basis of actor-critic methods and PPO.
The REINFORCE gradient estimate is ∇J = E[∇log π(a)·(R−b)]. The term ∇log π(a) is the score function — the direction that makes that action more likely — and (R−b) is a weight measuring how much better that action did than a reference. Actions that performed above the baseline are pushed along the score direction and become more likely; worse actions are pushed the other way. For a Gaussian policy, ∇_μ log π = (a−μ)/σ², so taking an action above the mean and earning a good reward raises μ — exactly the intuitive update.
The baseline b reduces the variance of the gradient estimate. Subtracting a reference value b (typically the batch-average reward) from the reward R does not change the expected gradient, so it adds no bias, because the expectation of ∇log π is zero. But if the rewards all share a large common offset (for example, all large positive numbers), every action looks good and the gradient is noisy. Subtracting the mean leaves only the difference "better or worse than average", which sharply cuts the spread of the estimate and stabilises learning. This tool reports the ratio of gradient standard deviations with and without the baseline.
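The variance claim can be checked numerically. The sketch below is an illustration under assumptions, not the tool's code: it uses the reward R(a) = −(a−a*)², and as a stand-in for the batch average it uses the policy's exact expected reward −((μ−a*)² + σ²) as a fixed baseline. The name gradient_samples is hypothetical.

```python
import random
import statistics


def gradient_samples(mu, sigma, a_star, b, n, seed=0):
    """Per-trial gradient terms score * (R - b) for n sampled actions."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        a = rng.gauss(mu, sigma)
        r = -(a - a_star) ** 2
        out.append((a - mu) / sigma ** 2 * (r - b))
    return out


mu, sigma, a_star = 0.0, 1.5, 3.5
# Assumed baseline: the expected reward of the current policy.
expected_reward = -((mu - a_star) ** 2 + sigma ** 2)

no_b = gradient_samples(mu, sigma, a_star, 0.0, 20000)
with_b = gradient_samples(mu, sigma, a_star, expected_reward, 20000)

# Both means estimate the same gradient, -2 * (mu - a_star).
print(statistics.mean(no_b), statistics.mean(with_b))
# Ratio of standard deviations, with baseline over without.
print(statistics.stdev(with_b) / statistics.stdev(no_b))
```

Both runs share a seed, so they see identical actions and differ only in the baseline. The two means should agree up to sampling noise, while the standard-deviation ratio should come out below 1.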
If the learning rate α is too large the updates become unstable and the policy diverges or oscillates; too small and convergence is slow. The policy standard deviation σ is the exploration width — larger σ explores more widely but adds gradient noise, smaller σ stays local. Increasing the batch size makes each gradient estimate more stable at the cost of more trials per update. In practice, start with a small learning rate, tune σ and batch size so the reward curve rises smoothly, and keep the baseline on by default.
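The effect of batch size on gradient stability can be sketched by repeating the batch gradient many times and measuring its spread. This is an illustrative sketch assuming the reward R(a) = −(a−a*)² and a batch-average baseline; the names batch_gradient and gradient_spread, and the fixed values μ = 0, σ = 1.5, a* = 3.5, are chosen for the example only.

```python
import random
import statistics


def batch_gradient(mu, sigma, a_star, batch_size, rng):
    """One REINFORCE gradient estimate from a single batch, with batch-mean baseline."""
    actions = [rng.gauss(mu, sigma) for _ in range(batch_size)]
    rewards = [-(a - a_star) ** 2 for a in actions]
    b = sum(rewards) / batch_size
    return sum((a - mu) / sigma ** 2 * (r - b)
               for a, r in zip(actions, rewards)) / batch_size


def gradient_spread(batch_size, repeats=2000, seed=0):
    """Standard deviation of the batch gradient over many independent batches."""
    rng = random.Random(seed)
    grads = [batch_gradient(0.0, 1.5, 3.5, batch_size, rng) for _ in range(repeats)]
    return statistics.stdev(grads)


for n in (4, 16, 64):
    print(n, round(gradient_spread(n), 3))
```

The spread should shrink roughly in proportion to 1/√N, which is why each fourfold increase in batch size only about halves the noise.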
Real-World Applications
RLHF for large language models: The most prominent use of policy gradients is "reinforcement learning from human feedback (RLHF)" for conversational AI such as ChatGPT. The language model itself is the "policy", the text it generates is the "action", and the score from a reward model trained on human preferences is the "reward"; the policy is updated with PPO (an evolved policy gradient). Replace the μ in this tool with the model's vast parameter set and the essential mechanism is the same.
Robot control and continuous-action tasks: For control problems where actions are continuous — joint torques, wheel speeds — policy gradient methods fit more naturally than value-based ones. Leg motion for a walking robot, attitude control of a drone, trajectory generation for a robot arm: these typically express the continuous action with a Gaussian policy and optimise it through many simulated trials.
Game AI and decision-making: Policy gradients are also used in AI for Go, Shogi and video games, and in sequential decision problems such as inventory management, ad serving and recommender systems. Especially when the action choice should be stochastic (a strategy your opponent cannot read), policy gradients — which can learn a stochastic policy directly — are well suited. Actor-critic is a widely used configuration in this area.
Foundational understanding in learning and education: A one-dimensional continuous bandit like this tool is an effective teaching aid for confirming, both through the maths and the behaviour, "does policy gradient really converge to the optimal action?" and "why does the baseline help?" Because it runs deterministically with a fixed seed, you can compare exactly what changes when you adjust a single parameter.
Common Misconceptions and Pitfalls
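A common misconception is that "subtracting a baseline changes the result (introduces bias)". As long as the baseline b does not depend on the action a, E[∇log π·b] = b·E[∇log π] = 0, so the expected gradient is exactly the same with or without b. Only the variance changes. That is precisely why you can safely subtract a smart baseline like the mean reward and stabilise learning. Conversely, if you let b depend on the action it can introduce bias, so the design rule is: it may depend on the state but never on the action. Strictly speaking, a batch-average reward that includes the trial's own reward depends weakly on that trial's action; for independent trials this scales the expected gradient by (N−1)/N for batch size N, which leaves its direction unchanged and is negligible at typical batch sizes.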
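The reason the score function has zero expectation can be written out in one line. This is the standard argument, shown here for a baseline b that does not depend on the action a, and assuming the derivative and the integral can be interchanged:

$$\mathbb{E}_{a\sim\pi}\left[b\,\nabla_\mu\log\pi(a)\right]=b\int\pi(a)\,\frac{\nabla_\mu\pi(a)}{\pi(a)}\,da=b\,\nabla_\mu\int\pi(a)\,da=b\,\nabla_\mu 1=0$$

For the Gaussian policy this reduces to the familiar fact that the action's mean deviation from μ is zero:

$$\mathbb{E}\left[\frac{a-\mu}{\sigma^{2}}\right]=\frac{\mathbb{E}[a]-\mu}{\sigma^{2}}=0$$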
Next, the belief that "policy gradient always converges to the global optimum". At best, policy gradient converges to a local optimum; if the reward landscape has multiple peaks, the initial policy or the exploration width σ may trap it on a lower one. The reward landscape in this tool has a single peak, so μ always heads towards a*, but real problems require securing enough exploration, restarting from several initial values, or adding an entropy term to encourage exploration.
Finally, the overconfidence that "a higher learning rate means faster learning". The REINFORCE gradient estimate is inherently noisy, and raising the learning rate steps boldly along that noise too, so the policy overshoots the optimal action and oscillates or diverges. Set α near its maximum in this tool and you can see the reward curve become jagged. In practice, take a small learning rate, increase the batch size to stabilise the gradient, and combine it with a mechanism that "limits the step per update" as PPO does. Keep in mind that speed and stability are a trade-off.
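The overshoot can be seen even without noise. For the reward R(a) = −(a−a*)², the expected reward is J(μ) = −((μ−a*)² + σ²), so the exact gradient is −2(μ−a*) and each noise-free update multiplies the error μ−a* by (1−2α). The sketch below iterates that expected update; the function name expected_trajectory and the α values are illustrative assumptions, not settings taken from the tool.

```python
def expected_trajectory(alpha, mu0=0.0, a_star=3.5, steps=5):
    """Noise-free gradient ascent on J(mu) = -((mu - a_star)^2 + sigma^2)."""
    mu, path = mu0, [mu0]
    for _ in range(steps):
        mu += alpha * (-2.0 * (mu - a_star))  # exact gradient, no sampling noise
        path.append(mu)
    return path


for alpha in (0.02, 0.4, 0.6, 1.2):
    print(alpha, [round(m, 3) for m in expected_trajectory(alpha)])
```

In this idealised setting μ approaches a* monotonically for α below 0.5, overshoots and oscillates with shrinking amplitude for α between 0.5 and 1, and diverges for α above 1. Real runs add sampling noise on top, so the reward curve can turn jagged well before the noise-free limit.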
How to Use
Set the learning rate α (lrNum: 0.001–0.1) to control how far the policy mean μ moves per update; higher values speed up convergence but risk overshooting the true optimal action a*.
Set the policy standard deviation σ (sdNum: 0.1–2.0) to balance exploration against noise; smaller values explore only a narrow region around μ, while larger values explore more widely but add gradient noise.
Set the number of training episodes (epNum: 100–5000) and the batch size (bsNum: 4–128), then watch the reward improvement and the convergence episode in the results.
Worked Example
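Continuous-action task with true optimal action a* = 3.5. Initialise the Gaussian policy with μ = 0.0 and σ = 1.5, and use learning rate α = 0.02 and batch size 32. After 500 episodes the policy mean converges to μ ≈ 3.51, with convergence reported at episode 387. Because R(a) = −(a−a*)² is never positive, the average reward cannot be positive either: the expected reward under the policy is −((μ−a*)² + σ²), which is −14.5 at the start and approaches −σ² = −2.25 as μ reaches a*. The baseline (batch-average reward) lowers the variance of the gradient estimate from 12.4 to 3.8, so fewer trials are needed for the same accuracy.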
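As a cross-check on the reward figures for this setup, the expected reward of a Gaussian policy under R(a) = −(a−a*)² has the closed form −((μ−a*)² + σ²), which follows from E[(a−a*)²] = (μ−a*)² + σ². The helper name expected_reward below is hypothetical.

```python
def expected_reward(mu, sigma, a_star):
    """Expected value of R(a) = -(a - a_star)^2 for a ~ N(mu, sigma^2)."""
    return -((mu - a_star) ** 2 + sigma ** 2)


print(expected_reward(0.0, 1.5, 3.5))  # -14.5 at the initial policy
print(expected_reward(3.5, 1.5, 3.5))  # -2.25 once mu sits exactly on a*
```

With σ held fixed, the average reward therefore levels off near −σ² rather than at 0, even when μ has converged.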
Practical Notes
Lower learning rates (0.001–0.005) suit high-variance continuous tasks: they reduce reward oscillation in episodes 1–100 but extend convergence by 40–60%.
A batch size of 64 or more stabilises gradient estimates in REINFORCE; a batch size of 8 produces noisy updates unless paired with a learning rate of about 0.005 and an episode count of at least 2000.
Decaying the standard deviation (by a factor of 0.99 per episode) after episode 800 exploits the learned policy; without decay, exploration noise keeps the final average reward 15–25% below what it could be.