Each cell's colour shows the learned value V(s), on a scale from dark blue (low) to bright cyan (high). Arrows show the greedy policy; green = goal, red = pit, grey = wall. The yellow dot is the agent walking the greedy path to the goal.
$$Q(s,a)\leftarrow Q(s,a)+\alpha\Big[\,r+\gamma\max_{a'}Q(s',a')-Q(s,a)\,\Big]$$
The Q-learning update rule. Here s is the current state, a the action taken, r the reward received, and s' the next state; α is the learning rate and γ the discount factor. The bracketed term $r+\gamma\max_{a'}Q(s',a')-Q(s,a)$ is the temporal-difference (TD) error: the gap between the target "observed reward + discounted best estimate of the next state" and the current estimate Q(s,a).
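As a rough illustration, the update rule above can be sketched for a small tabular Q stored as a list of rows, one row per state. The function name `q_update`, the default values of `alpha` and `gamma`, and the `terminal` flag (which drops the bootstrap term when the episode ends at s') are assumptions made for this sketch, not details taken from the demo.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error.

    Q is a list of rows: Q[state][action]. All names here are hypothetical.
    """
    # Best estimate of the next state's value. Assumption: treat it as zero
    # when the episode ends at s_next (for example at the goal or the pit).
    best_next = 0.0 if terminal else max(Q[s_next])
    # TD error: target (r + gamma * best_next) minus the current estimate.
    td_error = r + gamma * best_next - Q[s][a]
    # Move the current estimate a fraction alpha towards the target.
    Q[s][a] += alpha * td_error
    return td_error


# Toy table with 2 states and 2 actions (made-up numbers).
Q = [[0.0, 0.0], [1.0, 2.0]]
td = q_update(Q, s=0, a=1, r=-1.0, s_next=1, alpha=0.5, gamma=0.9)
print(td, Q[0][1])
```

With these toy numbers the target is −1 + 0.9 × 2 = 0.8, so the TD error is about 0.8 and Q(0,1) moves halfway there, to about 0.4.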
$$\pi(s)=\arg\max_{a}Q(s,a), \qquad V(s)=\max_{a}Q(s,a)$$
The greedy policy π and state value V after learning. The greedy policy picks the highest-Q action in each state, and that maximum Q value is the value of the state. Once Q has converged to the optimal action values, this greedy policy is the optimal policy.
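A minimal sketch of how π and V might be read off a learned table, again assuming Q is a list of rows indexed as `Q[state][action]`. The helper name `greedy_policy_and_values` is hypothetical, and ties are broken in favour of the lowest action index, which is just one possible convention.

```python
def greedy_policy_and_values(Q):
    """Return (policy, values) derived from a tabular Q.

    policy[s] is the index of the highest-Q action in state s,
    values[s] is that maximum Q value.
    """
    # argmax over actions; max() returns the first maximiser, so ties
    # go to the lowest action index.
    policy = [max(range(len(row)), key=row.__getitem__) for row in Q]
    # V(s) = max_a Q(s, a)
    values = [max(row) for row in Q]
    return policy, values


# Toy table with 2 states and 3 actions (made-up numbers).
Q = [[0.1, 0.5, 0.2],
     [1.0, -1.0, 0.0]]
policy, values = greedy_policy_and_values(Q)
print(policy, values)
```

In a grid visualisation like the one described here, `policy` would drive the arrows and `values` the cell colours.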
$$a_t=\begin{cases}\text{random action} & \text{with probability }\varepsilon\\[2pt]\arg\max_{a}Q(s_t,a) & \text{with probability }1-\varepsilon\end{cases}$$
Action selection under ε-greedy. The agent explores with probability ε and exploits with probability 1−ε; ε sets the balance between the two.
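The selection rule above could be written roughly as follows. The function name `epsilon_greedy` and the optional `rng` parameter are assumptions for this sketch; it uses only Python's standard `random` module.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one state's Q values using epsilon-greedy.

    q_row is the list of Q(s, a) for the current state; rng is any object
    with random() and randrange(), such as a seeded random.Random.
    """
    # Explore: with probability epsilon, pick uniformly among all actions.
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # Exploit: otherwise pick the highest-Q action (first one on ties).
    return max(range(len(q_row)), key=q_row.__getitem__)


rng = random.Random(0)  # seeded so repeated runs behave the same
q_row = [0.2, 0.9, 0.1, 0.4]
always_greedy = [epsilon_greedy(q_row, 0.0, rng) for _ in range(20)]
always_random = [epsilon_greedy(q_row, 1.0, rng) for _ in range(20)]
```

Because the random branch draws from all actions, it can also land on the greedy one, so the greedy action is actually chosen with probability 1−ε+ε/|A|, where |A| is the number of actions.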