
Markov Decision Process (MDP)

October 6, 2025
12 min read

Stochastic Process

A stochastic process is a collection of random variables representing the evolution of some system of random values over time.

A sequence of random variables

$$X = \{X(t),\ t \in T\}$$

is called a stochastic process, where $T$ is a countable or uncountable index set, often representing time, and $X(t)$ is the state of the system at time $t$; a realization of $X$ over $t$ is called a sample path.

Example: Markov Chain (Markov Decision Process)

Note
  • A random variable considers only the random value.
  • A stochastic process considers both time and the random value.

For details, refer to the Probability and Random Processes course notes.

Discrete-Time Markov Chain (DTMC)

A discrete-time Markov chain is a stochastic process that satisfies the Markov property and evolves in discrete time steps.

A stochastic process in discrete time

$$X_n := \text{state of the system at time } n, \qquad \{X_n,\ n = 0, 1, 2, \dots\}$$

Writing $X_n = i$ means the system is in state $i$ at time $n$.

Example: Consecutive coin tossing

  • $X_n$ : number of heads in $n$ tosses
  • A sample outcome is a sequence of tosses such as $(T, H, H, T, \dots)$; the corresponding states of $X_n$ are the head counts $0, 1, 2, \dots$ (see the short simulation sketch below).
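
A minimal simulation of this coin-tossing process (an illustrative sketch of my own, not from the original notes): each toss is a fair Bernoulli draw and $X_n$ is the running count of heads.

import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is reproducible
tosses = rng.integers(0, 2, size=10)     # 1 = head, 0 = tail
X = np.cumsum(tosses)                    # X_n = number of heads in the first n tosses
print("tosses:", tosses)
print("X_n   :", X)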

Basic Properties of DTMC

  1. A Discrete Index Set

    1. This typically is the “time” index.
    2. $N = \{0, 1, 2, \dots\},\ n \in N$
  2. A Countable State Space

    1. Denoted by $S$. Contains the possible states that $X_n$ can take.
    2. Can be finite or countably infinite.
  3. The Markov Property

    1. The Markov property states that the future state depends only on the current state and not on the sequence of events that preceded it.
    2. $X_{n+1}$ depends only on the current state $X_n$, and not on the past states $X_{n-1}, X_{n-2}, \dots$.
    3. In other words, given the present state, the future is independent of the past.
    $$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \dots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i) = P_{ij}$$

    This is the probability that the system transitions into state $j$ at time $n+1$ given that it is in state $i$ at time $n$. A small sampling sketch follows this list.
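
As a quick illustration (my own sketch, not part of the original notes), a DTMC can be simulated directly from its transition matrix: row $P_{i\cdot}$ is the distribution of the next state given the current state $i$. The matrix below is the weather-chain matrix introduced in the example further down.

import numpy as np

P = np.array([[0.4, 0.6],
              [0.7, 0.3]])    # one-step transition probabilities P_ij

rng = np.random.default_rng(1)
state = 0                     # start in state 0
path = [state]
for _ in range(10):
    # the next state depends only on the current state (Markov property)
    state = rng.choice(len(P), p=P[state])
    path.append(state)
print("sample path:", path)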

Chapman-Kolmogorov Equations

The transition probability matrix $P$ is a square matrix whose element at row $i$ and column $j$ is the probability of transitioning from state $i$ to state $j$ in one time step. It has entries

$$P_{ij} = P(X_{n+1} = j \mid X_n = i)$$

Now, the $m$-step transition probability is defined as

$$P_{ij}^{(m)} = P(X_{n+m} = j \mid X_n = i)$$

The Chapman-Kolmogorov equations relate multi-step transition probabilities to one another. They state that the probability of transitioning from state $i$ to state $j$ in $m+n$ steps can be computed by summing over all possible intermediate states $k$ reached after $m$ steps:

$$P_{ij}^{(m+n)} = \sum_{k \in S} P_{ik}^{(m)} P_{kj}^{(n)}$$

for all $m, n \geq 0$ and all states $i, j \in S$.

In matrix form, the Chapman-Kolmogorov equations are simply matrix multiplication of the transition probability matrix $P$ (see the short numerical check below):

$$P \cdot P = P^2, \qquad P^{m+n} = P^m \cdot P^n$$
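
A quick numerical check of this identity (an illustrative sketch of my own, using the weather-chain matrix defined in the next example): the $(i, j)$ entry of $P^{m+n}$ equals $\sum_{k} P^{(m)}_{ik} P^{(n)}_{kj}$.

import numpy as np

P = np.array([[0.4, 0.6],
              [0.7, 0.3]])

m, n = 2, 3
P_m = np.linalg.matrix_power(P, m)
P_n = np.linalg.matrix_power(P, n)
P_mn = np.linalg.matrix_power(P, m + n)

# Chapman-Kolmogorov: entry (i, j) of P^(m+n) is a sum over intermediate states k
i, j = 0, 1
lhs = P_mn[i, j]
rhs = sum(P_m[i, k] * P_n[k, j] for k in range(P.shape[0]))
print(lhs, rhs)    # the two values agree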

Example: Weather Chains

Example (Weather Chain): Suppose that if it is rainy or snowy in Suwon today, the probability that it will rain or snow tomorrow is 40%. If, on the other hand, it is sunny in Suwon today, the probability that it will rain or snow tomorrow is 70%. Let

  • State 0 := Rainy/Snowy
  • State 1 := Sunny

Draw the state transition diagram of Suwon’s weather DTMC.

$$\begin{aligned} P(X_{n+1} = 0 \mid X_n = 0) &= 0.4 \\ P(X_{n+1} = 1 \mid X_n = 0) &= 0.6 \\ P(X_{n+1} = 0 \mid X_n = 1) &= 0.7 \\ P(X_{n+1} = 1 \mid X_n = 1) &= 0.3 \end{aligned}$$

This gives us the transition probability matrix

$$P = \begin{bmatrix} 0.4 & 0.6 \\ 0.7 & 0.3 \end{bmatrix}$$

We can compute the two-step transition probability matrix by squaring the one-step transition probability matrix:

$$P^2 = \begin{bmatrix} 0.4 & 0.6 \\ 0.7 & 0.3 \end{bmatrix} \cdot \begin{bmatrix} 0.4 & 0.6 \\ 0.7 & 0.3 \end{bmatrix} = \begin{bmatrix} 0.58 & 0.42 \\ 0.49 & 0.51 \end{bmatrix}$$

What if we wanted to look at days 4, 8, and 16?

We can compute this by continuing to square the matrix:

$$P^4 = P^2 \cdot P^2, \qquad P^8 = P^4 \cdot P^4, \qquad P^{16} = P^8 \cdot P^8$$

Code Example

import numpy as np

# One-step transition probability matrix of the Suwon weather chain
P = np.array([[0.4, 0.6],
              [0.7, 0.3]])

# Repeated squaring gives the 2-, 4-, 8-, and 16-step transition matrices
P2 = np.matmul(P, P)
P4 = np.matmul(P2, P2)
P8 = np.matmul(P4, P4)
P16 = np.matmul(P8, P8)
print(P2, P4, P8, P16, sep="\n\n")

Output

$$P^2 = \begin{bmatrix} 0.58 & 0.42 \\ 0.49 & 0.51 \end{bmatrix}, \quad P^4 = \begin{bmatrix} 0.5422 & 0.4578 \\ 0.5341 & 0.4659 \end{bmatrix}, \quad P^8 = \begin{bmatrix} 0.5385 & 0.4615 \\ 0.5384 & 0.4616 \end{bmatrix}, \quad P^{16} = \begin{bmatrix} 0.5385 & 0.4615 \\ 0.5385 & 0.4615 \end{bmatrix}$$

As we continue to square the matrix, the rows of the resulting matrix converge to the same values. This indicates that the Markov chain is approaching a steady state, where the probability of being in each state stabilizes over time regardless of the starting state. The limiting row is called the steady-state distribution.

Mathematically,

$$\lim_{n \to \infty} P^{(n)}_{ij} = \lim_{n \to \infty} P(X_{n} = j \mid X_0 = i) = \pi_j$$

This limit exists and is the same for every initial state $i \in S$, for a fixed state $j \in S$ (see the steady-state computation sketch below).
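
As a sketch of my own (not from the original notes), the steady-state distribution of the weather chain can also be obtained directly by solving $\pi P = \pi$ together with $\sum_j \pi_j = 1$, instead of repeatedly squaring $P$:

import numpy as np

P = np.array([[0.4, 0.6],
              [0.7, 0.3]])
n = P.shape[0]

# pi P = pi  <=>  (P^T - I) pi = 0; stack the normalization sum(pi) = 1 on top
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)    # approximately [0.5385, 0.4615], i.e. 7/13 and 6/13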

Example: Smart Phone Brand Preference

There are three categories of smartphone brands: iPhone, Android, or Windows Mobile. A customer who buys their $n$th smartphone in one of these brands prefers the same brand for their next purchase with probability 0.8, 0.6, or 0.4, respectively. When they change brands, they do so uniformly at random between the remaining two brands. Find the long-run proportion of time a customer will have an iPhone, an Android, or a Windows Mobile phone.

Let 0 := iPhone, 1 := Android, 2 := Windows Mobile.

$$P = \begin{bmatrix} 0.8 & 0.1 & 0.1 \\ 0.2 & 0.6 & 0.2 \\ 0.3 & 0.3 & 0.4 \end{bmatrix}$$

Code Example

import numpy as np

# One-step transition probability matrix (0 = iPhone, 1 = Android, 2 = Windows Mobile)
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Repeated squaring to observe convergence of the rows
P2 = np.matmul(P, P)
P4 = np.matmul(P2, P2)
P8 = np.matmul(P4, P4)
P16 = np.matmul(P8, P8)
P32 = np.matmul(P16, P16)
print(P2, P4, P8, P16, P32, sep="\n\n")

Output

$$P^2 = \begin{bmatrix} 0.69 & 0.17 & 0.14 \\ 0.34 & 0.44 & 0.22 \\ 0.42 & 0.33 & 0.25 \end{bmatrix}, \quad P^4 = \begin{bmatrix} 0.593 & 0.238 & 0.169 \\ 0.477 & 0.324 & 0.199 \\ 0.507 & 0.299 & 0.194 \end{bmatrix}, \quad P^8 = \begin{bmatrix} 0.551 & 0.269 & 0.180 \\ 0.538 & 0.278 & 0.184 \\ 0.541 & 0.276 & 0.183 \end{bmatrix}, \quad P^{16} = \begin{bmatrix} 0.5455 & 0.2727 & 0.1818 \\ 0.5455 & 0.2727 & 0.1818 \\ 0.5455 & 0.2727 & 0.1818 \end{bmatrix}, \quad P^{32} = \begin{bmatrix} 0.5455 & 0.2727 & 0.1818 \\ 0.5455 & 0.2727 & 0.1818 \\ 0.5455 & 0.2727 & 0.1818 \end{bmatrix}$$

Now we need to solve for the steady-state distribution $\pi$ from the balance equations $\pi = \pi P$ together with the normalization $\sum_i \pi_i = 1$:

$$\begin{aligned} \pi_1 &= 0.8\pi_1 + 0.2\pi_2 + 0.3\pi_3 \\ \pi_2 &= 0.1\pi_1 + 0.6\pi_2 + 0.3\pi_3 \\ \pi_1 + \pi_2 + \pi_3 &= 1 \end{aligned}$$

from sympy import symbols, Eq, solve

# Unknown steady-state probabilities (pi1 = iPhone, pi2 = Android, pi3 = Windows Mobile)
pi1, pi2, pi3 = symbols('pi1 pi2 pi3')

# Two balance equations plus the normalization condition
system = [
    Eq(pi1, 0.8 * pi1 + 0.2 * pi2 + 0.3 * pi3),
    Eq(pi2, 0.1 * pi1 + 0.6 * pi2 + 0.3 * pi3),
    Eq(pi1 + pi2 + pi3, 1)
]
solution = solve(system, (pi1, pi2, pi3))
print(solution)

Output

{pi1: 0.545454545454545, pi2: 0.272727272727273, pi3: 0.181818181818182}

Google PageRank Algorithm

The Google PageRank algorithm measures the importance of web pages based on their link structure. Instead of counting all incoming links equally, each linking page contributes its own rank divided by its number of outbound links:

$$PR(A) = \frac{1-d}{N} + d \left( \frac{PR(T_1)}{C(T_1)} + \frac{PR(T_2)}{C(T_2)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$

where:

  • $PR(A)$ is the PageRank of page A.
  • $d$ is a damping factor, usually set to 0.85.
  • $N$ is the total number of pages.
  • $T_1, T_2, \dots, T_n$ are the pages that link to page A.
  • $PR(T_i)$ is the PageRank of page $T_i$.
  • $C(T_i)$ is the number of outbound links on page $T_i$.

Consider a simple web with four pages: A, B, C, and D. The link structure is as follows:

Page | Links To
A    | B, C
B    | C
C    | A
D    | A, C

Question 1

Implement the PageRank formula using the damping factor d = 0.85. Print the final PageRank values for all pages and identify the most important page.

import numpy as np

# --- Alternative: write the PageRank equations as a linear system and solve directly.
# The balance equation for D is replaced by the normalization sum(PR) = 1.
A = np.array(
    [[1, 0, -0.85, -0.85*0.5],          # PR(A) = 0.0375 + 0.85*(PR(C)/1 + PR(D)/2)
     [-0.85*0.5, 1, 0, 0],              # PR(B) = 0.0375 + 0.85*(PR(A)/2)
     [-0.85*0.5, -0.85, 1, -0.85*0.5],  # PR(C) = 0.0375 + 0.85*(PR(A)/2 + PR(B)/1 + PR(D)/2)
     [1, 1, 1, 1]]                      # PR(A) + PR(B) + PR(C) + PR(D) = 1
)
b = np.array([(1-0.85)/4, (1-0.85)/4, (1-0.85)/4, 1])
X = np.linalg.lstsq(A, b, rcond=None)[0]   # same values as the iteration below

# --- Iterative computation of the PageRank formula
d = 0.85    # damping factor
N = 4       # number of pages

# Initialize PageRank values equally
PR = {'A': 1/N, 'B': 1/N, 'C': 1/N, 'D': 1/N}

# Define the link structure (page -> pages it links to)
links = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
    'D': ['A', 'C']
}

# Compute incoming links for each page
in_links = {page: [] for page in links}
for src, dests in links.items():
    for dest in dests:
        in_links[dest].append(src)

# Iterate until convergence (or a fixed number of iterations)
for _ in range(100):
    new_PR = {}
    for page in links:
        new_PR[page] = (1 - d) / N
        for in_page in in_links[page]:
            new_PR[page] += d * PR[in_page] / len(links[in_page])
    PR = new_PR

# Print results
for page, value in PR.items():
    print(f"{page}: {value:.4f}")

# Find the most important page
most_important = max(PR, key=PR.get)
print("\nMost important page:", most_important)

Output

A: 0.3797
B: 0.1989
C: 0.3839
D: 0.0375
Most important page: C

Question 2

Implement PageRank as a discrete-time Markov chain and compute the state distribution after 100 moves. The damping factor is 0.85.

import numpy as np

# Adjacency matrix with columns as source pages: A[i, j] = 1 if page j links to page i
# (page order: A, B, C, D)
A = np.array([[0, 0, 1, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 0, 0]], dtype=float)
N = A.shape[0]

# Normalize each column so that P is column-stochastic
# (columns summing to zero are left untouched to avoid division by zero)
col_sums = A.sum(axis=0)
col_sums[col_sums == 0] = 1
P = A / col_sums

# Google matrix: follow a link with probability d, jump uniformly with probability 1 - d
d = 0.85
G = d * P + (1 - d) / N * np.ones((N, N))

# Start from the uniform distribution and run the chain for 100 moves
r = np.ones(N) / N
for _ in range(100):
    r = G @ r          # one step of the DTMC
r = r / r.sum()
print("PageRank:", np.round(r, 4))
print("Sum =", r.sum())

Output

PageRank: [0.3797 0.1989 0.3839 0.0375]
Sum = 1.0000000000000002

Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It provides a formalism for sequential decision-making problems, where an agent interacts with an environment over a series of time steps.

Definitions

  • $X_n$ : state of the system at time $n$
  • $A_n$ : action taken at time $n$
  • $\{(X_n, A_n),\ n = 0, 1, 2, \dots\}$ : Markov decision process
  • $R_{ij}^a$ : reward received when transitioning from state $i$ to state $j$ under action $a$
  • State transition probability when action $a$ is taken in state $i$ : $P_{ij}^a = P(X_{n+1} = j \mid X_n = i, A_n = a)$

Finite Stage Model


What is the optimal decision (action) at each state to maximize the expected total reward over a given time horizon?

  • $f_n(i)$ : optimal expected total reward when starting in state $i$ at time $n$
  • According to the principle of optimality, the optimal expected total reward can be expressed recursively as: $f_n(i) = \max_{a \in A} \left( \sum_{j \in S} P_{ij}^a \left( R_{ij}^a + f_{n+1}(j) \right) \right)$
  • Solve backwards, in the order $f_{N-1}(i), f_{N-2}(i), \dots, f_0(i)$, starting from the given terminal values $f_N(j)$ (e.g. $f_N(j) = 0$ when there is no terminal reward).

Example

There are three possible ad states: 0 (good), 1 (average), and 2 (bad). The advertiser can choose between two actions: a = 0 (high bid) and a = 1 (low bid). The state transition probabilities and rewards for each action are as follows:

$$P^0 = \begin{bmatrix} 0.3 & 0.5 & 0.2 \\ 0 & 0.6 & 0.4 \\ 0 & 0 & 1 \end{bmatrix}, \quad P^1 = \begin{bmatrix} 0.4 & 0.5 & 0.1 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.4 & 0.5 \end{bmatrix}, \quad R^0 = \begin{bmatrix} 7 & 5 & 2 \\ 6 & 4 & 1 \\ 3 & 1 & -1 \end{bmatrix}, \quad R^1 = \begin{bmatrix} 6 & 4 & -1 \\ 5 & 3 & -2 \\ 4 & 3 & -3 \end{bmatrix}$$

import numpy as np

# Transition probability matrices for action 0 (high bid) and action 1 (low bid)
P0 = np.array([[0.3, 0.5, 0.2],
               [0, 0.6, 0.4],
               [0, 0, 1]])
P1 = np.array([[0.4, 0.5, 0.1],
               [0.2, 0.6, 0.2],
               [0.1, 0.4, 0.5]])

# Reward matrices for each action
R0 = np.array([[7, 5, 2],
               [6, 4, 1],
               [3, 1, -1]])
R1 = np.array([[6, 4, -1],
               [5, 3, -2],
               [4, 3, -3]])

num_states = 3  # number of states
num_stages = 3  # number of stages (time horizon)

# Value function f[n][i]; f[num_stages + 1] stays zero (terminal condition)
f = {n: np.zeros(num_states) for n in range(1, num_stages + 2)}

# Backward induction
for n in range(num_stages, 0, -1):
    for i in range(num_states):
        action_0 = sum(P0[i, j] * (R0[i, j] + f[n+1][j]) for j in range(num_states))
        action_1 = sum(P1[i, j] * (R1[i, j] + f[n+1][j]) for j in range(num_states))
        f[n][i] = max(action_0, action_1)

# Recover the optimal action at each stage and state
policy = {}
for n in range(1, num_stages + 1):
    policy[n] = []
    for i in range(num_states):
        action_0 = sum(P0[i, j] * (R0[i, j] + f[n+1][j]) for j in range(num_states))
        action_1 = sum(P1[i, j] * (R1[i, j] + f[n+1][j]) for j in range(num_states))
        policy[n].append(0 if action_0 >= action_1 else 1)

print("Optimal Policy:")
for n in range(1, num_stages + 1):
    print(f"Stage {n}: {['Action 0' if a == 0 else 'Action 1' for a in policy[n]]}")

Output

Optimal Policy:
Stage 1: ['Action 0', 'Action 1', 'Action 1']
Stage 2: ['Action 0', 'Action 1', 'Action 1']
Stage 3: ['Action 0', 'Action 0', 'Action 1']

Infinite Stage


  • In a finite stage model, the optimal current action may depend on the number of remaining periods.
  • However, such a time-dependent policy is not practical in an infinite stage model, since the number of remaining periods is never finite.
  • Instead, the optimal policy should depend only on the state, independently of the time point; such a policy is called a stationary policy.
  • The following methods are available for obtaining a stationary policy (a small policy-iteration sketch follows this list):
    • Enumeration: Enumerate all possible policies and evaluate their performance.
    • Policy Iteration: Start with an arbitrary policy and iteratively improve it.
    • Linear Programming: Formulate the problem as a linear program and solve it.
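
As a rough illustration (my own sketch, not from the original notes), here is policy iteration applied to the ad-bidding example above. An infinite-horizon formulation needs a discount factor; the value 0.9 used here is an assumption, not something specified in the notes. Expected one-step rewards are $r_i^a = \sum_j P^a_{ij} R^a_{ij}$.

import numpy as np

# Transition and reward matrices from the finite-stage example above
P = [np.array([[0.3, 0.5, 0.2], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]]),   # action 0
     np.array([[0.4, 0.5, 0.1], [0.2, 0.6, 0.2], [0.1, 0.4, 0.5]])]   # action 1
R = [np.array([[7, 5, 2], [6, 4, 1], [3, 1, -1]]),
     np.array([[6, 4, -1], [5, 3, -2], [4, 3, -3]])]

gamma = 0.9                          # assumed discount factor (not given in the notes)
n_states, n_actions = 3, 2
r = np.array([[(P[a][i] * R[a][i]).sum() for a in range(n_actions)]
              for i in range(n_states)])    # expected one-step reward r[i, a]

policy = np.zeros(n_states, dtype=int)      # start from an arbitrary stationary policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = r_pi for the current policy
    P_pi = np.array([P[policy[i]][i] for i in range(n_states)])
    r_pi = np.array([r[i, policy[i]] for i in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    # Policy improvement: choose the greedy action with respect to V
    Q = np.array([[r[i, a] + gamma * P[a][i] @ V for a in range(n_actions)]
                  for i in range(n_states)])
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("Stationary policy:", policy)    # action chosen in each state
print("State values     :", np.round(V, 2))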