
Markov Decision Process (MDP)

October 6, 2025
12 min read

Stochastic Process

A stochastic process is a collection of random variables representing the evolution of some system of random values over time.

A sequence of random variables

$$X = \{X(t),\ t \in T\}$$

is called a stochastic process, where $T$ is a countable or uncountable index set, often representing time, and $X(t)$ is the state of the system at time $t$; a realization of $X$ over $t$ is called a sample path.

Example: Markov Chain (Markov Decision Process)

Note
  • A random variable considers only the random value.
  • A stochastic process considers both time and the random value.

For details, refer to the Probability and Random Processes course notes.

Discrete-Time Markov Chain (DTMC)

A discrete-time Markov chain is a stochastic process that satisfies the Markov property and evolves in discrete time steps.

A stochastic process in discrete time

$$X_n := \text{state of the system at time } n, \qquad \{X_n,\ n = 0, 1, 2, \dots\}$$

Writing $X_n = i$ means the system is in state $i$ at time $n$.

Example: Consecutive coin tossing

  • $X_n$ : number of heads in $n$ tosses
  • A sample outcome is a sequence of tosses such as $(T, H, H, T, \dots)$; the corresponding states of $X_n$ are the head counts $0, 1, 2, \dots$ (see the short simulation sketch below).
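
A minimal simulation of this coin-tossing process (an illustrative sketch of my own, not from the original notes): each toss is a fair Bernoulli draw and $X_n$ is the running count of heads.

import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is reproducible
tosses = rng.integers(0, 2, size=10)     # 1 = head, 0 = tail
X = np.cumsum(tosses)                    # X_n = number of heads in the first n tosses
print("tosses:", tosses)
print("X_n   :", X)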

Basic Properties of DTMC

  1. A Discrete Index Set

    1. This typically is the “time” index.
    2. $N = \{0, 1, 2, \dots\},\ n \in N$
  2. A Countable State Space

    1. Denoted by $S$. Contains the possible states that $X_n$ can take.
    2. Can be finite or countably infinite.
  3. The Markov Property

    1. The Markov property states that the future state depends only on the current state and not on the sequence of events that preceded it.
    2. $X_{n+1}$ depends only on the current state $X_n$, and not on the past states $X_{n-1}, X_{n-2}, \dots$.
    3. In other words, given the present state, the future is independent of the past.
    $$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \dots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i) = P_{ij}$$

    This is the probability that the system transitions into state $j$ at time $n+1$ given that it is in state $i$ at time $n$. A small sampling sketch follows this list.
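
As a quick illustration (my own sketch, not part of the original notes), a DTMC can be simulated directly from its transition matrix: row $P_{i\cdot}$ is the distribution of the next state given the current state $i$. The matrix below is the weather-chain matrix introduced in the example further down.

import numpy as np

P = np.array([[0.4, 0.6],
              [0.7, 0.3]])    # one-step transition probabilities P_ij

rng = np.random.default_rng(1)
state = 0                     # start in state 0
path = [state]
for _ in range(10):
    # the next state depends only on the current state (Markov property)
    state = rng.choice(len(P), p=P[state])
    path.append(state)
print("sample path:", path)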

Chapman-Kolmogorov Equations

The transition probability matrix $P$ is a square matrix whose element at row $i$ and column $j$ is the probability of transitioning from state $i$ to state $j$ in one time step. It has entries

$$P_{ij} = P(X_{n+1} = j \mid X_n = i)$$

Now, the $m$-step transition probability is defined as

$$P_{ij}^{(m)} = P(X_{n+m} = j \mid X_n = i)$$

The Chapman-Kolmogorov equations relate multi-step transition probabilities to one another. They state that the probability of transitioning from state $i$ to state $j$ in $m+n$ steps can be computed by summing over all possible intermediate states $k$ reached after $m$ steps:

$$P_{ij}^{(m+n)} = \sum_{k \in S} P_{ik}^{(m)} P_{kj}^{(n)}$$

for all $m, n \geq 0$ and all states $i, j \in S$.

In matrix form, the Chapman-Kolmogorov equations are simply matrix multiplication of the transition probability matrix $P$ (see the short numerical check below):

$$P \cdot P = P^2, \qquad P^{m+n} = P^m \cdot P^n$$
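
A quick numerical check of this identity (an illustrative sketch of my own, using the weather-chain matrix defined in the next example): the $(i, j)$ entry of $P^{m+n}$ equals $\sum_{k} P^{(m)}_{ik} P^{(n)}_{kj}$.

import numpy as np

P = np.array([[0.4, 0.6],
              [0.7, 0.3]])

m, n = 2, 3
P_m = np.linalg.matrix_power(P, m)
P_n = np.linalg.matrix_power(P, n)
P_mn = np.linalg.matrix_power(P, m + n)

# Chapman-Kolmogorov: entry (i, j) of P^(m+n) is a sum over intermediate states k
i, j = 0, 1
lhs = P_mn[i, j]
rhs = sum(P_m[i, k] * P_n[k, j] for k in range(P.shape[0]))
print(lhs, rhs)    # the two values agree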

Example: Weather Chains

Example (Weather Chain): Suppose that if it is rainy or snowy in Suwon today, the probability that it will rain or snow tomorrow is 40%. If, on the other hand, it is sunny in Suwon today, the probability that it will rain or snow tomorrow is 70%. Let

  • State 0 := Rainy/Snowy
  • State 1 := Sunny

Draw the state transition diagram of Suwon’s weather DTMC.

$$\begin{aligned} P(X_{n+1} = 0 \mid X_n = 0) &= 0.4 \\ P(X_{n+1} = 1 \mid X_n = 0) &= 0.6 \\ P(X_{n+1} = 0 \mid X_n = 1) &= 0.7 \\ P(X_{n+1} = 1 \mid X_n = 1) &= 0.3 \end{aligned}$$

This gives us the transition probability matrix

$$P = \begin{bmatrix} 0.4 & 0.6 \\ 0.7 & 0.3 \end{bmatrix}$$

We can compute the two-step transition probability matrix by squaring the one-step transition probability matrix:

$$P^2 = \begin{bmatrix} 0.4 & 0.6 \\ 0.7 & 0.3 \end{bmatrix} \cdot \begin{bmatrix} 0.4 & 0.6 \\ 0.7 & 0.3 \end{bmatrix} = \begin{bmatrix} 0.58 & 0.42 \\ 0.49 & 0.51 \end{bmatrix}$$

What if we wanted to look at days 4, 8, and 16?

We can compute this by continuing to square the matrix:

$$P^4 = P^2 \cdot P^2, \qquad P^8 = P^4 \cdot P^4, \qquad P^{16} = P^8 \cdot P^8$$

Code Example

import numpy as np

# One-step transition probability matrix of the Suwon weather chain
P = np.array([[0.4, 0.6],
              [0.7, 0.3]])

# Repeated squaring gives the 2-, 4-, 8-, and 16-step transition matrices
P2 = np.matmul(P, P)
P4 = np.matmul(P2, P2)
P8 = np.matmul(P4, P4)
P16 = np.matmul(P8, P8)
print(P2, P4, P8, P16, sep="\n\n")

Output

$$P^2 = \begin{bmatrix} 0.58 & 0.42 \\ 0.49 & 0.51 \end{bmatrix}, \quad P^4 = \begin{bmatrix} 0.5422 & 0.4578 \\ 0.5341 & 0.4659 \end{bmatrix}, \quad P^8 = \begin{bmatrix} 0.5385 & 0.4615 \\ 0.5384 & 0.4616 \end{bmatrix}, \quad P^{16} = \begin{bmatrix} 0.5385 & 0.4615 \\ 0.5385 & 0.4615 \end{bmatrix}$$

As we continue to square the matrix, the rows of the resulting matrix converge to the same values. This indicates that the Markov chain is approaching a steady state, where the probability of being in each state stabilizes over time regardless of the starting state. The limiting row is called the steady-state distribution.

Mathematically,

$$\lim_{n \to \infty} P^{(n)}_{ij} = \lim_{n \to \infty} P(X_{n} = j \mid X_0 = i) = \pi_j$$

This limit exists and is the same for every initial state $i \in S$, for a fixed state $j \in S$ (see the steady-state computation sketch below).
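
As a sketch of my own (not from the original notes), the steady-state distribution of the weather chain can also be obtained directly by solving $\pi P = \pi$ together with $\sum_j \pi_j = 1$, instead of repeatedly squaring $P$:

import numpy as np

P = np.array([[0.4, 0.6],
              [0.7, 0.3]])
n = P.shape[0]

# pi P = pi  <=>  (P^T - I) pi = 0; stack the normalization sum(pi) = 1 on top
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)    # approximately [0.5385, 0.4615], i.e. 7/13 and 6/13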

Example: Smart Phone Brand Preference

There are three categories of smartphone brands: iPhone, Android, or Windows Mobile. A customer who buys their $n$th smartphone in one of these brands prefers the same brand for their next purchase with probability 0.8, 0.6, or 0.4, respectively. When they change brands, they do so uniformly at random between the remaining two brands. Find the long-run proportion of time a customer will have an iPhone, an Android, or a Windows Mobile phone.

Let 0 := iPhone, 1 := Android, 2 := Windows Mobile.

$$P = \begin{bmatrix} 0.8 & 0.1 & 0.1 \\ 0.2 & 0.6 & 0.2 \\ 0.3 & 0.3 & 0.4 \end{bmatrix}$$

Code Example

import numpy as np

# One-step transition probability matrix (0 = iPhone, 1 = Android, 2 = Windows Mobile)
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Repeated squaring to observe convergence of the rows
P2 = np.matmul(P, P)
P4 = np.matmul(P2, P2)
P8 = np.matmul(P4, P4)
P16 = np.matmul(P8, P8)
P32 = np.matmul(P16, P16)
print(P2, P4, P8, P16, P32, sep="\n\n")

Output

$$P^2 = \begin{bmatrix} 0.69 & 0.17 & 0.14 \\ 0.34 & 0.44 & 0.22 \\ 0.42 & 0.33 & 0.25 \end{bmatrix}, \quad P^4 = \begin{bmatrix} 0.593 & 0.238 & 0.169 \\ 0.477 & 0.324 & 0.199 \\ 0.507 & 0.299 & 0.194 \end{bmatrix}, \quad P^8 = \begin{bmatrix} 0.551 & 0.269 & 0.180 \\ 0.538 & 0.278 & 0.184 \\ 0.541 & 0.276 & 0.183 \end{bmatrix}, \quad P^{16} = \begin{bmatrix} 0.5455 & 0.2727 & 0.1818 \\ 0.5455 & 0.2727 & 0.1818 \\ 0.5455 & 0.2727 & 0.1818 \end{bmatrix}, \quad P^{32} = \begin{bmatrix} 0.5455 & 0.2727 & 0.1818 \\ 0.5455 & 0.2727 & 0.1818 \\ 0.5455 & 0.2727 & 0.1818 \end{bmatrix}$$

Now we need to solve for the steady-state distribution $\pi$ from the balance equations $\pi = \pi P$ together with the normalization $\sum_i \pi_i = 1$:

$$\begin{aligned} \pi_1 &= 0.8\pi_1 + 0.2\pi_2 + 0.3\pi_3 \\ \pi_2 &= 0.1\pi_1 + 0.6\pi_2 + 0.3\pi_3 \\ \pi_1 + \pi_2 + \pi_3 &= 1 \end{aligned}$$

from sympy import symbols, Eq, solve

# Unknown steady-state probabilities (pi1 = iPhone, pi2 = Android, pi3 = Windows Mobile)
pi1, pi2, pi3 = symbols('pi1 pi2 pi3')

# Two balance equations plus the normalization condition
system = [
    Eq(pi1, 0.8 * pi1 + 0.2 * pi2 + 0.3 * pi3),
    Eq(pi2, 0.1 * pi1 + 0.6 * pi2 + 0.3 * pi3),
    Eq(pi1 + pi2 + pi3, 1)
]
solution = solve(system, (pi1, pi2, pi3))
print(solution)

Output

{pi1: 0.545454545454545, pi2: 0.272727272727273, pi3: 0.181818181818182}

Google PageRank Algorithm

The Google PageRank algorithm measures the importance of web pages based on their link structure. Instead of counting all incoming links equally, each linking page contributes its own rank divided by its number of outbound links:

$$PR(A) = \frac{1-d}{N} + d \left( \frac{PR(T_1)}{C(T_1)} + \frac{PR(T_2)}{C(T_2)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$

where:

  • $PR(A)$ is the PageRank of page A.
  • $d$ is a damping factor, usually set to 0.85.
  • $N$ is the total number of pages.
  • $T_1, T_2, \dots, T_n$ are the pages that link to page A.
  • $PR(T_i)$ is the PageRank of page $T_i$.
  • $C(T_i)$ is the number of outbound links on page $T_i$.

Consider a simple web with four pages: A, B, C, and D. The link structure is as follows:

Page | Links To
A    | B, C
B    | C
C    | A
D    | A, C

Question 1

Implement the PageRank formula using the damping factor d = 0.85. Print the final PageRank values for all pages and identify the most important page.

import numpy as np

# --- Alternative: write the PageRank equations as a linear system and solve directly.
# The balance equation for D is replaced by the normalization sum(PR) = 1.
A = np.array(
    [[1, 0, -0.85, -0.85*0.5],          # PR(A) = 0.0375 + 0.85*(PR(C)/1 + PR(D)/2)
     [-0.85*0.5, 1, 0, 0],              # PR(B) = 0.0375 + 0.85*(PR(A)/2)
     [-0.85*0.5, -0.85, 1, -0.85*0.5],  # PR(C) = 0.0375 + 0.85*(PR(A)/2 + PR(B)/1 + PR(D)/2)
     [1, 1, 1, 1]]                      # PR(A) + PR(B) + PR(C) + PR(D) = 1
)
b = np.array([(1-0.85)/4, (1-0.85)/4, (1-0.85)/4, 1])
X = np.linalg.lstsq(A, b, rcond=None)[0]   # same values as the iteration below

# --- Iterative computation of the PageRank formula
d = 0.85    # damping factor
N = 4       # number of pages

# Initialize PageRank values equally
PR = {'A': 1/N, 'B': 1/N, 'C': 1/N, 'D': 1/N}

# Define the link structure (page -> pages it links to)
links = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
    'D': ['A', 'C']
}

# Compute incoming links for each page
in_links = {page: [] for page in links}
for src, dests in links.items():
    for dest in dests:
        in_links[dest].append(src)

# Iterate until convergence (or a fixed number of iterations)
for _ in range(100):
    new_PR = {}
    for page in links:
        new_PR[page] = (1 - d) / N
        for in_page in in_links[page]:
            new_PR[page] += d * PR[in_page] / len(links[in_page])
    PR = new_PR

# Print results
for page, value in PR.items():
    print(f"{page}: {value:.4f}")

# Find the most important page
most_important = max(PR, key=PR.get)
print("\nMost important page:", most_important)

Output

A: 0.3797
B: 0.1989
C: 0.3839
D: 0.0375
Most important page: C

Question 2

Implement PageRank as a discrete-time Markov chain and compute the state distribution after 100 moves. The damping factor is 0.85.

import numpy as np

# Adjacency matrix with columns as source pages: A[i, j] = 1 if page j links to page i
# (page order: A, B, C, D)
A = np.array([[0, 0, 1, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 0, 0]], dtype=float)
N = A.shape[0]

# Normalize each column so that P is column-stochastic
# (columns summing to zero are left untouched to avoid division by zero)
col_sums = A.sum(axis=0)
col_sums[col_sums == 0] = 1
P = A / col_sums

# Google matrix: follow a link with probability d, jump uniformly with probability 1 - d
d = 0.85
G = d * P + (1 - d) / N * np.ones((N, N))

# Start from the uniform distribution and run the chain for 100 moves
r = np.ones(N) / N
for _ in range(100):
    r = G @ r          # one step of the DTMC
r = r / r.sum()
print("PageRank:", np.round(r, 4))
print("Sum =", r.sum())

Output

PageRank: [0.3797 0.1989 0.3839 0.0375]
Sum = 1.0000000000000002

Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It provides a formalism for sequential decision-making problems, where an agent interacts with an environment over a series of time steps.

Definitions

  • $X_n$ : state of the system at time $n$
  • $A_n$ : action taken at time $n$
  • $\{(X_n, A_n),\ n = 0, 1, 2, \dots\}$ : Markov decision process
  • $R_{ij}^a$ : reward received when transitioning from state $i$ to state $j$ under action $a$
  • State transition probability when action $a$ is taken in state $i$ : $P_{ij}^a = P(X_{n+1} = j \mid X_n = i, A_n = a)$

Finite Stage Model


What is the optimal decision (action) at each state to maximize the expected total reward over a given time horizon?

  • $f_n(i)$ : optimal expected total reward when starting in state $i$ at time $n$
  • According to the principle of optimality, the optimal expected total reward can be expressed recursively as: $f_n(i) = \max_{a \in A} \left( \sum_{j \in S} P_{ij}^a \left( R_{ij}^a + f_{n+1}(j) \right) \right)$
  • Solve backwards, in the order $f_{N-1}(i), f_{N-2}(i), \dots, f_0(i)$, starting from the given terminal values $f_N(j)$ (e.g. $f_N(j) = 0$ when there is no terminal reward).

Example

There are three possible ad states: 0 (good), 1 (average), and 2 (bad). The advertiser can choose between two actions: a = 0 (high bid) and a = 1 (low bid). The state transition probabilities and rewards for each action are as follows:

$$P^0 = \begin{bmatrix} 0.3 & 0.5 & 0.2 \\ 0 & 0.6 & 0.4 \\ 0 & 0 & 1 \end{bmatrix}, \quad P^1 = \begin{bmatrix} 0.4 & 0.5 & 0.1 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.4 & 0.5 \end{bmatrix}, \quad R^0 = \begin{bmatrix} 7 & 5 & 2 \\ 6 & 4 & 1 \\ 3 & 1 & -1 \end{bmatrix}, \quad R^1 = \begin{bmatrix} 6 & 4 & -1 \\ 5 & 3 & -2 \\ 4 & 3 & -3 \end{bmatrix}$$

import numpy as np

# Transition probability matrices for action 0 (high bid) and action 1 (low bid)
P0 = np.array([[0.3, 0.5, 0.2],
               [0, 0.6, 0.4],
               [0, 0, 1]])
P1 = np.array([[0.4, 0.5, 0.1],
               [0.2, 0.6, 0.2],
               [0.1, 0.4, 0.5]])

# Reward matrices for each action
R0 = np.array([[7, 5, 2],
               [6, 4, 1],
               [3, 1, -1]])
R1 = np.array([[6, 4, -1],
               [5, 3, -2],
               [4, 3, -3]])

num_states = 3  # number of states
num_stages = 3  # number of stages (time horizon)

# Value function f[n][i]; f[num_stages + 1] stays zero (terminal condition)
f = {n: np.zeros(num_states) for n in range(1, num_stages + 2)}

# Backward induction
for n in range(num_stages, 0, -1):
    for i in range(num_states):
        action_0 = sum(P0[i, j] * (R0[i, j] + f[n+1][j]) for j in range(num_states))
        action_1 = sum(P1[i, j] * (R1[i, j] + f[n+1][j]) for j in range(num_states))
        f[n][i] = max(action_0, action_1)

# Recover the optimal action at each stage and state
policy = {}
for n in range(1, num_stages + 1):
    policy[n] = []
    for i in range(num_states):
        action_0 = sum(P0[i, j] * (R0[i, j] + f[n+1][j]) for j in range(num_states))
        action_1 = sum(P1[i, j] * (R1[i, j] + f[n+1][j]) for j in range(num_states))
        policy[n].append(0 if action_0 >= action_1 else 1)

print("Optimal Policy:")
for n in range(1, num_stages + 1):
    print(f"Stage {n}: {['Action 0' if a == 0 else 'Action 1' for a in policy[n]]}")

Output

Optimal Policy:
Stage 1: ['Action 0', 'Action 1', 'Action 1']
Stage 2: ['Action 0', 'Action 1', 'Action 1']
Stage 3: ['Action 0', 'Action 0', 'Action 1']

Infinite Stage


  • In a finite stage model, the optimal current action may depend on the number of remaining periods.
  • However, such a time-dependent policy is not practical in an infinite stage model, since the number of remaining periods is never finite.
  • Instead, the optimal policy should depend only on the state, independently of the time point; such a policy is called a stationary policy.
  • The following methods are available for obtaining a stationary policy (a small policy-iteration sketch follows this list):
    • Enumeration: Enumerate all possible policies and evaluate their performance.
    • Policy Iteration: Start with an arbitrary policy and iteratively improve it.
    • Linear Programming: Formulate the problem as a linear program and solve it.
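
As a rough illustration (my own sketch, not from the original notes), here is policy iteration applied to the ad-bidding example above. An infinite-horizon formulation needs a discount factor; the value 0.9 used here is an assumption, not something specified in the notes. Expected one-step rewards are $r_i^a = \sum_j P^a_{ij} R^a_{ij}$.

import numpy as np

# Transition and reward matrices from the finite-stage example above
P = [np.array([[0.3, 0.5, 0.2], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]]),   # action 0
     np.array([[0.4, 0.5, 0.1], [0.2, 0.6, 0.2], [0.1, 0.4, 0.5]])]   # action 1
R = [np.array([[7, 5, 2], [6, 4, 1], [3, 1, -1]]),
     np.array([[6, 4, -1], [5, 3, -2], [4, 3, -3]])]

gamma = 0.9                          # assumed discount factor (not given in the notes)
n_states, n_actions = 3, 2
r = np.array([[(P[a][i] * R[a][i]).sum() for a in range(n_actions)]
              for i in range(n_states)])    # expected one-step reward r[i, a]

policy = np.zeros(n_states, dtype=int)      # start from an arbitrary stationary policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = r_pi for the current policy
    P_pi = np.array([P[policy[i]][i] for i in range(n_states)])
    r_pi = np.array([r[i, policy[i]] for i in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    # Policy improvement: choose the greedy action with respect to V
    Q = np.array([[r[i, a] + gamma * P[a][i] @ V for a in range(n_actions)]
                  for i in range(n_states)])
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("Stationary policy:", policy)    # action chosen in each state
print("State values     :", np.round(V, 2))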