2-1. Markov Decision Process (MDP)

Tommy Kim · September 8, 2023

What is an MDP?

Decision means the agent decides what action to take in each state.

Every action ($a_t$) taken in a state ($s_t$) leads to the next state ($s_{t+1}$).

The first important property of an MDP is that every action is taken randomly. This means each state and action is a random variable with a (discrete) probability distribution.
$p(a_1 \mid s_0, a_0, s_1)$
The arrows shown in the figure represent the information needed to find the probability distribution of the random variable.
So if we know $s_1$, then we don’t need $s_0$ and $a_0$ anymore (the probability distribution becomes $p(a_1 \mid s_1)$). This is because $s_1$ already contains the information about $s_0$ and $a_0$.

On the same principle, we can simplify the distribution of $s_2$.
It is first represented as $p(s_2 \mid s_0, a_0, s_1, a_1)$.
Since $s_1$ already contains the information about $s_0$ and $a_0$, we do not need them anymore. Therefore we only need $s_1$ and $a_1$ to determine the distribution of $s_2$:
$p(s_2 \mid s_1, a_1)$
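To make the Markov property concrete, here is a minimal Python sketch (the states, actions, and transition probabilities are made up purely for illustration): sampling the next state consults only the current state and action, never the earlier history.

```python
import random

# Hypothetical transition model p(s' | s, a): for each (state, action) pair,
# a list of (next_state, probability) entries. The numbers are illustrative.
P = {
    ("s0", "a0"): [("s1", 0.8), ("s0", 0.2)],
    ("s1", "a1"): [("s2", 0.9), ("s1", 0.1)],
}

def step(state, action):
    """Sample the next state from p(s' | state, action).
    The full history (s0, a0, ...) is never consulted: the Markov property."""
    next_states, probs = zip(*P[(state, action)])
    return random.choices(next_states, weights=probs)[0]

s2 = step("s1", "a1")  # the distribution of s2 depends only on s1 and a1
```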

Policy

We now know how to represent the probability distribution of each state and action. The probability distribution of the action is called the policy, commonly written $\pi(a_t \mid s_t)$. Since the current state already carries all the information from prior states and actions, the instruction on what action to take can be expressed as a probability distribution conditioned on the current state alone.
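As a sketch, a stochastic policy over a small discrete MDP can be stored as a table of probabilities per state (the states, actions, and numbers below are hypothetical):

```python
import random

# Hypothetical policy pi(a | s): for each state, a discrete distribution
# over actions. The states, actions, and probabilities are made up.
pi = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(state):
    """Draw an action from pi(a | state); only the current state is needed."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return random.choices(actions, weights=weights)[0]

a1 = sample_action("s1")  # a1 ~ pi(a1 | s1)
```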

The goal of reinforcement learning is to maximize reward. To be specific, the agent tries to maximize the expected return.
Return can be expressed in a formula as follows:
$G_t \triangleq R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots$
The formula means the return is the sum of the discounted rewards. ($\gamma$ is the discount factor, as we learned in Q-learning.)
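This definition translates directly into code. A minimal sketch that evaluates $G_t$ for a finite list of rewards, accumulating from the last reward backwards:

```python
def discounted_return(rewards, gamma):
    """G_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ...
    Computed backwards via the recursion G_t = R_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards [0, 1] with gamma = 0.5 give 0 + 0.5 * 1 = 0.5
print(discounted_return([0, 1], gamma=0.5))  # 0.5
```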
Let’s recall Q-learning.

Look at the yellow box. The return of this box is 0.5. Since the next box does not have any reward, $R_t$ becomes 0, and the only contribution to the return is $\gamma R_{t+1}$.
In contrast, the return of the green box is $R_t = 1$ (because the next box is the goal).
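Working both cases out explicitly (assuming, as the numbers suggest, $\gamma = 0.5$ and a reward of 1 for reaching the goal):
$G_t^{\text{yellow}} = R_t + \gamma R_{t+1} = 0 + 0.5 \times 1 = 0.5$
$G_t^{\text{green}} = R_t = 1$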

The agent acts randomly -> the next state and reward are random as well.
$\therefore$ we need the expected return.
The policy should make the agent maximize the average of these random returns, as sketched below.
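A minimal sketch of this idea: estimate the expected return $\mathbb{E}[G_t]$ by running many episodes and averaging their discounted returns (the episode generator below is a made-up toy, not a real environment):

```python
import random

def estimate_expected_return(run_episode, n_episodes=1000, gamma=0.9):
    """Monte Carlo estimate of E[G_t]: average the discounted return over
    many randomly sampled episodes. run_episode() is assumed to return
    the list of rewards observed in one episode."""
    total = 0.0
    for _ in range(n_episodes):
        g = 0.0
        for r in reversed(run_episode()):
            g = r + gamma * g
        total += g
    return total / n_episodes

# Toy episode generator (hypothetical): a single reward of 0 or 1.
mean_return = estimate_expected_return(lambda: [random.choice([0, 1])])
print(mean_return)  # approximately 0.5
```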

In conclusion, solving a Markov Decision Process means finding the optimal policy that maximizes the expected return.
