Q-Learning Simulator — Reinforcement Learning Gridworld

Run Q-learning, the foundation of reinforcement learning, in a maze-like gridworld. Adjust the learning rate, discount factor and exploration rate, and watch in real time how the agent learns the optimal path to the goal from rewards alone, and how the value function converges.

Parameters
Gridworld
The type of environment to learn
Learning rate α
How strongly each new experience updates the Q values
Discount factor γ
How much future rewards matter
Exploration rate ε
Probability of a random action under ε-greedy
Training episodes
One episode runs from the start cell until the agent reaches the goal or falls into a pit
Results
Training episodes
Optimal path steps
Start-cell value V(start)
Avg. reward (last 50)
Goal-reach rate (last 50)
Learning state
Gridworld — value heat map and greedy policy

Each cell's colour is the learned value V(s) (dark blue → bright cyan = higher value). Arrows show the greedy policy; green = goal, red = pit, grey = wall. The yellow dot is the agent walking the greedy path to the goal.

Learning curve — total reward per episode
Convergence of the start-cell value V(start)
Theory & Key Formulas

$$Q(s,a)\leftarrow Q(s,a)+\alpha\Big[\,r+\gamma\max_{a'}Q(s',a')-Q(s,a)\,\Big]$$

The Q-learning update rule. s: current state, a: action taken, r: reward received, s': next state. The bracketed term $r+\gamma\max_{a'}Q(s',a')-Q(s,a)$ is the temporal-difference (TD) error, the gap between the current estimate and "observed reward + best estimate of the next state".
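
As a minimal sketch of the update rule above, the function below applies one step to a small table stored as a list of lists. The function name `q_update`, the table layout and the `done` flag (which drops the bootstrap term on terminal transitions) are illustrative assumptions, not details of this simulator.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Assumption: a terminal transition has no future value to bootstrap from.
    best_next = 0.0 if done else max(Q[s_next])
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Hypothetical table with 3 states and 2 actions; one entry is already known.
Q = [[0.0, 0.0] for _ in range(3)]
Q[1][0] = 1.0

td_error = q_update(Q, s=0, a=1, r=-0.04, s_next=1, alpha=0.5, gamma=0.9)
print(round(td_error, 2), round(Q[0][1], 2))  # 0.86 0.43
```

With α=0.5 the entry moves halfway toward the target, which is why the TD error of 0.86 raises Q(0,1) from 0 to 0.43.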

$$\pi(s)=\arg\max_{a}Q(s,a), \qquad V(s)=\max_{a}Q(s,a)$$

The greedy policy π and state value V after learning. The optimal policy keeps choosing the highest-Q action in each state, and that maximum Q value is the value of the state.
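
The two formulas above can be read straight off a Q table. The sketch below assumes the table is a list of rows, one per state, and that the four actions are ordered up, right, down, left; both the ordering and the sample numbers are made up for illustration.

```python
def greedy_policy_and_value(Q):
    """Return (policy, V) with policy[s] = argmax_a Q[s][a] and V[s] = max_a Q[s][a]."""
    # Ties are broken in favour of the lowest action index.
    policy = [max(range(len(row)), key=row.__getitem__) for row in Q]
    values = [max(row) for row in Q]
    return policy, values


ACTIONS = ["up", "right", "down", "left"]  # assumed action ordering
Q = [
    [0.10, 0.43, -0.20, 0.00],
    [0.05, 0.62, 0.62, -1.00],
]
policy, V = greedy_policy_and_value(Q)
print([ACTIONS[a] for a in policy], V)
```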

$$a_t=\begin{cases}\text{random action} & \text{with probability }\varepsilon\\[2pt]\arg\max_{a}Q(s_t,a) & \text{with probability }1-\varepsilon\end{cases}$$

Action selection under ε-greedy. The agent explores with probability ε and exploits with probability 1−ε; ε sets the balance between the two.
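
A short sketch of ε-greedy selection follows. The function name and the `rng` parameter are assumptions made so the example is reproducible. Note that the random branch can also land on the greedy action, so with four actions the greedy action is chosen with probability 1 − ε + ε/4.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the highest-Q action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


rng = random.Random(0)             # fixed seed, for a repeatable demonstration
q_row = [0.1, 0.7, 0.3, -0.2]      # hypothetical Q values for one state
picks = [epsilon_greedy(q_row, epsilon=0.2, rng=rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(greedy_share)                # close to 1 - 0.2 + 0.2 / 4 = 0.85
```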

What is Q-Learning?

🙋
I keep hearing about "reinforcement learning" with game AI and robots — but what is the agent actually learning?
🎓
In short, it learns "how to act to gain the most" purely from trial and error. In this gridworld, the agent (the yellow dot) starts moving from the start cell: reaching the goal gives a reward of +1, falling into a pit gives −1, and every move costs a small −0.04 "living penalty". At first it knows nothing about the map and has been taught no rules. It just moves, collects rewards, and uses them as clues to get smarter.
🙋
If it doesn't know the map, how does it ever find the right route?
🎓
That is where Q-learning comes in. It keeps a number Q(s,a) — "how much do I expect to gain if I take this action in this state?" — in a table for every state-action pair. Each step it nudges Q with the update rule in the theory card above. The key is the max term: it mixes "the value of the best action available in the next cell" into the current estimate, so the +1 at the goal seeps gradually back to the cells before it. In the end the cells nearest the goal get the highest value, and you reach the goal simply by walking toward higher value.
🙋
I see! What is the "discount factor γ" slider on the left? It seems to default to 0.9.
🎓
γ is the dial for "how much I care about future rewards". The value one step ahead is carried back multiplied by γ, so with γ=0.9 the goal's +1 fades to about 0.9⁷ ≈ 0.48 seven steps away. A small γ makes distant rewards almost invisible, so the agent only thinks about the immediate moment. Closer to 1, it plans toward a far goal but the value takes longer to propagate. Look at the V(start) convergence chart and you can see the level at which the value settles change as you vary γ.
🙋
When I set the "exploration rate ε" to 0, the path seems a bit worse. I'd think NOT moving randomly would be more efficient...
🎓
That is the most famous dilemma in reinforcement learning: the exploration-exploitation trade-off. With ε=0 the agent only ever picks the best action it currently knows. So it keeps reinforcing whatever route it happened to take early on, never tries a genuinely better shortcut, and settles into a so-so policy. But with ε=1, fully random, it cannot use any of the knowledge it gained. That is why in practice people use ε-decay: start ε high to explore well, then shrink it gradually as learning progresses.
🙋
Outside of mazes, where is Q-learning actually used?
🎓
The idea shows up everywhere. Elevator dispatching, traffic-signal control, inventory reorder timing, game-playing AI — all classic examples. For problems with an enormous number of states (Go, a video-game screen) you cannot store Q in a table, so you approximate it with a neural network. That is DQN (Deep Q-Network), which made headlines in 2015, and it grew directly out of this tabular Q-learning. So this little grid example is the most important foundation for understanding the cutting-edge methods.
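
The whole loop described in the dialogue above fits in a few dozen lines. The sketch below is not this simulator's code: the 4×4 layout, the wall and pit positions, the deterministic moves, the step cap and the choice to pay +1 or −1 instead of the living penalty on the final move are all assumptions made for illustration.

```python
import random

ROWS, COLS = 4, 4
START, GOAL, PIT = (0, 0), (3, 3), (1, 3)    # assumed layout
WALLS = {(1, 1)}
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # up, right, down, left


def step(state, action):
    """Deterministic move; hitting a wall or the grid edge leaves the agent in place."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) in WALLS:
        r, c = state
    if (r, c) == GOAL:
        return (r, c), 1.0, True
    if (r, c) == PIT:
        return (r, c), -1.0, True
    return (r, c), -0.04, False              # living penalty


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=100, seed=0):
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        s = START
        for _ in range(max_steps):
            # epsilon-greedy behaviour policy
            if rng.random() < epsilon:
                a = rng.randrange(4)
            else:
                a = max(range(4), key=Q[s].__getitem__)
            s2, reward, done = step(s, a)
            # Q-learning target: bootstrap from the best next action unless terminal
            target = reward if done else reward + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    return Q


def greedy_path(Q, max_steps=50):
    """Follow the greedy policy from START until a terminal cell or the step cap."""
    s, path = START, [START]
    for _ in range(max_steps):
        s, _, done = step(s, max(range(4), key=Q[s].__getitem__))
        path.append(s)
        if done:
            break
    return path


Q = train()
path = greedy_path(Q)
print(len(path) - 1, "moves, ends at", path[-1])
```

Because every entry starts at 0 and each ordinary move costs −0.04, untried actions initially look better than tried ones, which already pushes the agent to explore even before ε comes into play.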

Frequently Asked Questions

Q-learning is a classic reinforcement-learning algorithm in which an agent learns an action-value function Q(s,a), a measure of how good each action is in each state, purely from the rewards it collects, with no model of the environment. It repeats the update Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',a') − Q(s,a)] at every step. In the tabular setting the values are theoretically guaranteed to converge to the optimal action values, provided every state-action pair keeps being visited and the learning rate α is decreased appropriately over time. In the tabular version a Q value is stored in a table for each state-action pair.
The learning rate α sets how strongly each new experience updates the Q values: larger means faster learning but more oscillation, smaller means slower but more stable. The discount factor γ sets how much future rewards matter; the closer γ is to 1, the further ahead toward the goal reward the agent learns to plan. The exploration rate ε is the probability of taking a random action under the ε-greedy rule, balancing exploration (trying new paths) against exploitation (using the best action known so far).
With ε=0 the agent follows a purely greedy policy, always picking the action with the highest current Q value. When all Q values start equal, it keeps reinforcing whatever path it stumbled onto early and may lock into a sub-optimal policy without ever trying a better route — the classic under-exploration problem. At the other extreme, ε=1 makes the agent act completely at random, so it cannot use what it has learned and the goal-reach rate drops sharply. In practice an ε-decay schedule, starting high and shrinking over time, is common.
Q-learning and SARSA are both value-based temporal-difference methods, but they differ in how they take the value of the next state. Q-learning uses max Q(s',a'), the value of the best possible action in the next state, so it learns the optimal policy independently of how the agent actually behaves; this makes it off-policy. SARSA uses Q(s',a') for the action a' actually selected, making it on-policy and yielding a more cautious policy that accounts for the risk of exploration. The two behave noticeably differently along risky routes such as cliff edges.
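
The difference between the two targets is easiest to see side by side. The sketch below uses a made-up next state sitting beside a pit, where action 0 is safe and action 1 falls in; the function names and numbers are illustrative assumptions.

```python
def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target: assume the best action will be taken in s_next."""
    return r + gamma * max(Q[s_next])


def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: use the action the agent actually selected in s_next."""
    return r + gamma * Q[s_next][a_next]


Q = {"edge": [0.8, -1.0]}   # hypothetical cell next to a pit
r, gamma = -0.04, 0.9

off_policy = q_learning_target(Q, r, "edge", gamma)
on_policy = sarsa_target(Q, r, "edge", a_next=1, gamma=gamma)  # exploration chose the pit
print(round(off_policy, 2), round(on_policy, 2))  # 0.68 -0.94
```

When exploration picks the risky action, SARSA's target drops sharply while Q-learning's is unaffected, which is the source of SARSA's more cautious behaviour near hazards.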

Real-World Applications

Robot control and path planning: Learning a route for a mobile robot to reach a destination while avoiding obstacles is a direct extension of this gridworld. On a real robot the state becomes continuous (position, velocity, angle), so Q cannot be kept in a plain table and is combined with function approximation, but the skeleton of "learn a policy from rewards" is unchanged. It is applied to autonomous warehouse transport vehicles and cleaning-robot behaviour optimization.

Game AI and DQN: Starting from tabular Q-learning, approximating the Q function with a deep neural network gives DQN (Deep Q-Network). In 2015 it drew attention by playing Atari video games at a human level from screen pixels alone. Even when the state space is huge, the core of the Q-learning update (correcting values via the TD error) carries over, with additions such as experience replay and a target network to keep training stable, and this simulator is its minimal model.

Control and scheduling: Elevator-group dispatching, traffic-signal timing, data-center cooling optimization, inventory reorder quantities — problems of "choosing the best action sequentially given the current state" are a strength of reinforcement learning. Even for complex systems where an explicit equation model is hard to build, a policy can be acquired by trial and error on a simulator.

Education and algorithm validation: The gridworld is a standard reinforcement-learning benchmark. When proposing a new algorithm, researchers first check on a small grid whether the learning curve rises and whether the value function converges correctly. Visualizing the learning curve and V(start) convergence, as this tool does, is the first step in building intuition for the hyperparameters α, γ and ε.

Common Misconceptions and Pitfalls

A common misconception is that a larger learning rate α always makes the agent smarter faster. α does speed up learning, but pushing it toward 1 makes each update essentially replace the Q value wholesale with the new experience, so noisy rewards and the random variation from ε make the values oscillate wildly. When the learning curve stays jagged and never settles, lowering α often calms it down. In theory, obtaining a convergence guarantee requires gradually shrinking α as learning proceeds. Understand that "speed" and "stability" are a trade-off.
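
The speed-versus-stability trade-off can be illustrated on a single estimate that is repeatedly nudged toward noisy targets. This is a toy sketch, not the simulator: the target mean of 0.5 and the Gaussian noise with standard deviation 0.2 are assumptions chosen only to make the effect visible.

```python
import random
import statistics


def track_estimate(alpha, n_updates=2000, seed=0):
    """Nudge one estimate toward noisy targets; return the spread of its later values."""
    rng = random.Random(seed)
    q, history = 0.0, []
    for _ in range(n_updates):
        target = 0.5 + rng.gauss(0.0, 0.2)   # hypothetical noisy TD target
        q += alpha * (target - q)
        history.append(q)
    # Ignore the first half so the initial climb from 0 does not count as noise.
    return statistics.pstdev(history[n_updates // 2:])


spread_small = track_estimate(alpha=0.1)
spread_large = track_estimate(alpha=0.9)
print(spread_small, spread_large)
```

Both runs see the same noise sequence, yet the α=0.9 estimate keeps most of each target's noise while the α=0.1 estimate averages it out.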

Next, the assumption that setting the discount factor γ to 1 is best because it looks furthest ahead. A larger γ does carry value back from a distant goal, but for a task that never terminates (the agent reaches neither goal nor pit), γ=1 lets the values grow without bound and the computation breaks down. Even with a design like this tool's, which caps the number of steps per episode and ends the episode at the goal, setting γ to exactly 1 can make learning slow to settle and demand a very large number of episodes to converge. That is why most implementations use γ between 0.9 and 0.99.
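
To make the divergence concrete, consider (as an illustrative assumption) an agent that wanders forever and only ever collects the −0.04 living penalty. Its discounted return is a geometric series:

$$\sum_{t=0}^{\infty}\gamma^{t}\,(-0.04)=-\frac{0.04}{1-\gamma}\qquad(0\le\gamma<1)$$

This gives −0.4 for γ=0.9 and −4 for γ=0.99. At γ=1 the closed form no longer applies: the sum after T steps is −0.04·T, which decreases without bound as T grows.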

Finally, be careful with the idea that once learning is done, exploration ε is no longer needed. If the environment is fixed, a greedy policy with ε=0 is fine after learning. But real environments change (a route gets blocked, the reward structure shifts). Drop ε to exactly 0 and the agent cannot notice the change, clinging to an outdated optimal policy. Also, for sparse-reward problems (almost 0 everywhere except the goal), too little exploration can mean the agent never reaches the goal even once, so its Q values never grow at all. Before zeroing ε, always confirm that it is truly safe to stop exploring.

How to Use

  1. Set learning rate (alpha) between 0.1–0.9: higher values (0.7–0.9) converge faster in 5×5 grids; lower values (0.1–0.3) stabilize Q-values in stochastic environments.
  2. Configure discount factor (gamma) from 0.7–0.99: use 0.95 for balanced long-term planning; increase to 0.99 when distant goals matter (15+ step paths).
  3. Adjust exploration rate (epsilon) decay: start at 0.5–1.0 and decay toward 0.01 over the training episodes, so the agent explores widely early on and mostly exploits once the values have settled.
  4. Run 500–5000 training episodes depending on grid complexity; observe optimal path steps and V(start) stabilization in the learning state output.
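
One way to realise the ε decay in step 3 is an exponential schedule, sketched below. The exponential shape and the function name `epsilon_schedule` are assumptions; a linear ramp or a stepwise schedule would serve the same purpose.

```python
def epsilon_schedule(episode, total_episodes, eps_start=1.0, eps_end=0.01):
    """Exponential decay from eps_start at episode 0 to eps_end at the last episode."""
    if total_episodes <= 1:
        return eps_end
    frac = episode / (total_episodes - 1)
    return eps_start * (eps_end / eps_start) ** frac


eps = [epsilon_schedule(e, 2000) for e in range(2000)]
print(eps[0], eps[999], eps[-1])
```

Roughly halfway through training ε is near the geometric mean of the two endpoints (about 0.1 here), so most of the random exploration happens in the early episodes.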

Worked Example

Configure a 7×7 gridworld maze with walls. Set alpha=0.8, gamma=0.95, and epsilon=1.0 decaying to 0.05 over 2000 episodes. After 2000 training episodes, the agent discovers a 12-step optimal path from start cell (0,0) to goal (6,6). On the reward scale described above (+1 at the goal, −0.04 per move), V(start) for a 12-step path converges to roughly 0.2, the expected cumulative discounted reward; it can never exceed the goal reward of 1. The average reward over the final 50 episodes likewise stays below the best-case episode total of about 0.5, and the goal-reach rate stabilizes at 94%, confirming that the Q-table has learned a viable policy.
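
As a rough check on the scale of these numbers, the sketch below computes what a 12-step path is worth under the reward scheme from the dialogue (+1 at the goal, −0.04 per move) with gamma=0.95. It assumes the 12th move pays only the +1 and not the living penalty as well; the other convention changes the results by a few hundredths.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Assumed convention: 11 ordinary moves at -0.04, then the 12th move enters the goal.
rewards = [-0.04] * 11 + [1.0]

v_start = discounted_return(rewards, gamma=0.95)   # discounted value of the start cell
total_reward = sum(rewards)                        # undiscounted total for the episode
print(round(v_start, 2), round(total_reward, 2))   # 0.22 0.56
```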

Practical Notes

  1. Use alpha=0.5, gamma=0.9 as defaults for 5×5 grids; increase alpha to 0.85 only if reward signals are sparse (goal at far corner).
  2. If optimal path steps remain high (>20) after 3000 episodes, slow the epsilon decay so the agent keeps exploring for longer; it may be stuck on a sub-optimal route.
  3. Monitor goal-reach rate: below 70% indicates insufficient exploration; raise initial epsilon or extend training episodes by 1000–2000.
  4. Compare V(start) across runs: a run-to-run spread that is large relative to the reward scale (values here stay within roughly ±1) suggests alpha is too high; reduce it to 0.5 and re-train.
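
The "last 50" figures used in the notes above are rolling-window averages. The sketch below shows one way to maintain them; the class name `RecentStats` and the episode outcomes fed into it are made up for illustration and are not taken from this simulator.

```python
from collections import deque


class RecentStats:
    """Rolling average reward and goal-reach rate over the most recent episodes."""

    def __init__(self, window=50):
        self.rewards = deque(maxlen=window)   # old entries drop off automatically
        self.goals = deque(maxlen=window)

    def record(self, total_reward, reached_goal):
        self.rewards.append(total_reward)
        self.goals.append(1 if reached_goal else 0)

    def avg_reward(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

    def goal_rate(self):
        return sum(self.goals) / len(self.goals) if self.goals else 0.0


stats = RecentStats(window=50)
for episode in range(200):
    reached = episode % 4 != 0            # made-up outcomes: 3 of every 4 episodes succeed
    stats.record(0.5 if reached else -1.2, reached)

print(round(stats.goal_rate(), 2), round(stats.avg_reward(), 3))  # 0.76 0.092
```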