The reward hypothesis

2 min read · May 31


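In reinforcement learning, an agent interacts with its environment in a loop: it observes a state, takes an action, receives a reward, and moves to the next state. The resulting sequence of (state, action, reward, next state) tuples is called a “trajectory,” or an “episode” when it ends in a terminal state. This is distinct from “experience replay,” a technique that stores such tuples in a buffer and samples from them during training.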
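The interaction loop can be sketched in a few lines of Python. This is only an illustration: the five-state corridor, the `step` function, and the `Transition` type are hypothetical names invented for the example, not part of any library.

```python
from typing import Callable, List, NamedTuple, Tuple


class Transition(NamedTuple):
    state: int
    action: int
    reward: float
    next_state: int


def step(state: int, action: int) -> Tuple[int, float, bool]:
    """Hypothetical corridor with states 0..4; action 0 = left, 1 = right.

    Reaching state 4 yields reward 1.0 and ends the episode.
    """
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    done = next_state == 4
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def collect_trajectory(policy: Callable[[int], int],
                       max_steps: int = 50) -> List[Transition]:
    """Run one episode from state 0 and record every transition."""
    state = 0
    trajectory: List[Transition] = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = step(state, action)
        trajectory.append(Transition(state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


# A trivial policy that always moves right.
trajectory = collect_trajectory(lambda s: 1)
for t in trajectory:
    print(t)
```

Each printed line is one (state, action, reward, next state) tuple; the whole list is the trajectory for that episode.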

The reward hypothesis states that any goal an agent has can be described as maximizing its expected cumulative reward. The cumulative reward collected along a trajectory is called the “return,” and its expected value under a particular policy is the “expected return”: the total reward the agent expects to receive over time by following that policy.

The return is typically defined as the sum of discounted rewards over a sequence of steps. The discount factor (gamma, denoted γ) is a number between 0 and 1 that determines how much future rewards count relative to immediate ones: a reward received k steps in the future is weighted by γ^k. With γ below 1, the agent values immediate rewards more highly, which reflects both a preference for sooner rewards and the uncertainty of the future. It also keeps the sum finite in tasks that never end, as long as the rewards are bounded.
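To make the discounted sum concrete, here is a minimal sketch of the calculation G = r_1 + γ·r_2 + γ²·r_3 + … for a finite list of rewards. The function name `discounted_return` is an assumption chosen for this example.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of rewards where the k-th reward (k = 0, 1, ...) is weighted by gamma**k.

    Iterating backwards uses the recursion G_t = r_t + gamma * G_{t+1},
    which avoids computing powers of gamma explicitly.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With γ = 1 the same function returns the plain undiscounted sum, and with γ = 0 it returns only the first reward.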

The agent’s objective is to find a policy that maximizes the expected return. This involves exploring and learning from interactions with the environment, estimating how valuable different actions are in different states, and updating its policy accordingly. Reinforcement learning algorithms approach this in different ways: Q-learning learns a value for each state-action pair and acts on the highest one, while policy gradient methods adjust the policy’s parameters directly in the direction that increases the expected return.
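As a rough illustration of one such algorithm, the sketch below runs tabular Q-learning with an epsilon-greedy behavior policy on a hypothetical five-state corridor. The environment, the hyperparameter values, and all function names are assumptions made for this example; real problems usually need far more care with exploration and step sizes.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: states 0..4, goal at 4
ACTIONS = (0, 1)        # 0 = left, 1 = right


def step(state, action):
    """Move one cell; reaching the goal gives reward 1.0 and ends the episode."""
    next_state = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def greedy(q, state, rng):
    """Pick an action with the highest Q-value, breaking ties at random."""
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy(q, state, rng)
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value
            # (no bootstrapping from a terminal state).
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q = q_learning()
# Greedy action in each non-terminal state after training.
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
print(policy)
```

The update line is the core of the method: each estimate is nudged toward the observed reward plus the discounted value of the best action in the next state, so value gradually propagates backward from the goal.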

By iteratively improving its policy through trial and error, the agent strives to make better decisions over time, maximizing its cumulative reward and achieving its goals in the given environment.