Note:#

You can view my previous notes: CS188: Lecture 4 - Constraint Satisfaction Problems (CSPs)

Also note that these notes are based on the Spring 2025 version of the course and on my own understanding of the material, so they MAY NOT be 100% accurate or complete. THIS IS NOT A SUBSTITUTE FOR THE COURSE MATERIAL. I only take notes on the parts of a lecture that I find interesting or confusing, NOT on every single detail.

CS188: Lecture 8 - Markov Decision Processes (MDPs)#

Markov Decision Processes (MDPs)#

A Markov Decision Process (MDP) represents sequential decision-making in environments where actions produce stochastic (random) outcomes, and an agent’s goal is to maximize its cumulative reward over time. In an MDP, the agent faces uncertainty: it cannot always predict the result of its actions, but it must still try to act optimally.

The key components of an MDP are (a minimal code sketch follows this list):

  • States $S$: The possible situations the agent can find itself in.
  • Actions $A$: The set of possible moves or decisions the agent can make in each state.
  • Transition Function $T(s, a, s')$: The probability that taking action $a$ in state $s$ leads to state $s'$.
  • Reward Function $R(s, a, s')$: The reward received after transitioning from $s$ to $s'$ via action $a$.
  • Discount Factor $\gamma$: How much the agent values future rewards compared to immediate rewards.
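
As a concrete illustration, here is a tiny two-state MDP written out as plain Python data. This is only a hypothetical toy: the state/action names and all numbers are made up, not taken from the lecture.

```python
# A minimal sketch of the five MDP ingredients, using a made-up
# two-state, two-action toy problem (all values are illustrative only).

STATES = ["cool", "warm"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9  # discount factor

# T[(s, a)] lists (s', probability) pairs: the stochastic outcomes of taking a in s.
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("warm", 1.0)],
}

# R[(s, a, s')] is the reward collected on the transition s --a--> s'.
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "warm"): -10.0,
}
```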

From an MDP we derive the value function $V(s)$ for each state $s$, which represents the expected cumulative reward starting from state $s$ and following the optimal policy thereafter, and the action-value function $Q(s, a)$ for each state-action pair $(s, a)$, which represents the expected cumulative reward starting from state $s$, taking action $a$, and then following the optimal policy thereafter.

You might think, “why not just use the value function $V(s)$?” The reason is that actions are easier to select from $Q$-values than from values! You will see this later in the lecture.

The goal is to find an optimal policy $\pi^*$, which is a mapping from states to actions ($\pi(s) = a$), maximizing the expected cumulative (usually discounted) reward from any state. In this sense, an MDP defines both the “game rules” and what it means to “play well” in that environment.
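
A minimal sketch of that difference, assuming $T$, $R$, and $\gamma$ are stored as in the toy example above (all names here are hypothetical placeholders): extracting a greedy policy from $Q$-values is a plain argmax, while extracting one from $V$ alone needs an extra one-step lookahead through the model.

```python
# Hypothetical helpers: Q maps (s, a) -> value, V maps s -> value,
# T maps (s, a) -> list of (s', prob), R maps (s, a, s') -> reward.

def policy_from_q(Q, states, actions):
    """pi(s) = argmax_a Q(s, a): no model (T, R) needed at all."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

def policy_from_v(V, states, actions, T, R, gamma):
    """pi(s) = argmax_a sum_s' T(s, a, s') * [R(s, a, s') + gamma * V(s')]."""
    def one_step_value(s, a):
        return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
    return {s: max(actions, key=lambda a: one_step_value(s, a)) for s in states}
```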

Stationary Preferences#

The assumption of stationary preferences means that your relative preference between two future sequences of rewards doesn’t change just because you receive the same immediate reward before both. This property imposes a recursive structure on the utility function for reward sequences.

Formally, the utility $U$ of a sequence $[r_0, r_1, r_2, \ldots]$ must satisfy:

$$U([r_0, r_1, r_2, \ldots]) = f(r_0, U([r_1, r_2, \ldots]))$$

where $f$ is some consistent function. If we assume $f$ is linear, this recursion unrolls to only two possible forms for the utility function (after appropriate normalization):

  • Additive Utility: $U = r_0 + r_1 + r_2 + \cdots$ (corresponds to $\gamma = 1$)
  • Discounted Utility: $U = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$ (where $0 \leq \gamma < 1$)

The discounted utility is the standard in MDPs, as it ensures convergence for infinite horizons and reflects the diminishing importance of rewards further in the future.
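
A quick numeric check of the two forms on a made-up reward sequence (both the rewards and the $\gamma$ value here are arbitrary, chosen only for illustration):

```python
rewards = [1.0, 2.0, 3.0]
gamma = 0.5

additive = sum(rewards)                                       # the gamma = 1 case
discounted = sum(gamma**t * r for t, r in enumerate(rewards)) # r_0 + gamma*r_1 + gamma^2*r_2

print(additive)    # 6.0
print(discounted)  # 1.0 + 0.5*2.0 + 0.25*3.0 = 2.75
```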

Why MDPs?#

MDPs are particularly suitable when:

  • The environment is stochastic: the same action in the same state can yield different results.
  • Rewards may be delayed: the value of an action now may be realized only after multiple future steps.

Unlike simple search algorithms (e.g., greedy or expectimax), MDPs explicitly model both uncertainty (via $T$) and the accumulation of rewards over time (via $R$ and $\gamma$). While solving an MDP requires knowledge of $T$ and $R$, reinforcement learning (RL) methods learn optimal policies directly from experience, using the MDP framework as a theoretical foundation.

MDPs vs Expectimax#

Both MDPs and expectimax handle uncertainty and aim for maximum expected utility. Expectimax, however, is typically used to compute the expected value of actions from a specific starting point, often with a tree structure and a finite horizon. MDPs, in contrast, compute a policy—the best action for every possible state—naturally handling cycles and infinite (discounted) horizons.

In short: expectimax is a limited lookahead from the current state; solving an MDP finds a full strategy for all states.
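
A rough sketch of that contrast, reusing the hypothetical $T$/$R$ dictionaries from the earlier toy example: expectimax returns one number for one starting state under a fixed depth, whereas a solved MDP assigns an action to every state.

```python
def expectimax_value(s, depth, actions, T, R, gamma):
    """Depth-limited lookahead from a single state s, with chance nodes drawn from T."""
    if depth == 0:
        return 0.0
    return max(
        sum(p * (R[(s, a, s2)] + gamma * expectimax_value(s2, depth - 1, actions, T, R, gamma))
            for s2, p in T[(s, a)])
        for a in actions
    )

# By contrast, an MDP solution is a full policy, e.g. {"cool": "fast", "warm": "slow"}:
# one chosen action for *every* state, valid no matter where the agent currently is.
```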

MDPs and Multi-Agent Games#

Standard MDPs are designed for a single agent interacting with a stochastic environment. They do not directly accommodate multiple strategic agents whose actions affect each other’s outcomes. Multi-agent situations typically require other formalisms, such as stochastic games or Markov games.

MDPs vs Greedy Algorithms#

Greedy algorithms make decisions based solely on immediate rewards, without considering long-term consequences. MDPs, by calculating the expected sum of (possibly discounted) future rewards, are inherently long-sighted. By optimizing the value function $V^*(s)$ or the action-value function $Q^*(s, a)$, MDPs look ahead through the space of future possibilities, not just the next step.
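
Written out, that lookahead is the standard one-step (Bellman-style) relationship between the optimal value and Q-value functions defined earlier:

$$Q^*(s, a) = \sum_{s'} T(s, a, s') \big[ R(s, a, s') + \gamma V^*(s') \big], \qquad V^*(s) = \max_a Q^*(s, a)$$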


ONE SENTENCE SUMMARY:
Markov Decision Processes are mathematical models for sequential decision-making under uncertainty, aiming to find policies that maximize expected (possibly discounted) cumulative reward, and forming the theoretical foundation for reinforcement learning.

Author: TheUnknownThing
Published on April 19, 2025