Markov Decision Processes

1. Markov Processes

1.1. Introduction to MDPs

  • Markov Decision Processes formally describe an environment for reinforcement learning.
  • The environment is fully observable: the current state completely characterizes the process.

1.2. Markov Property

  • The future is independent of the past given the present.

Definition
A state \(S_t\) is Markov if and only if \(\mathbb{P}[S_{t+1}|S_t] = \mathbb{P}[S_{t+1}|S_1, ..., S_t]\)

1.3. Markov Chains

Definition
A Markov Process (or Markov Chain) is a tuple \((\mathcal{S}, \mathcal{P})\), where
  • \(\mathcal{S}\) is a finite set of states
  • \(\mathcal{P}\) is a state transition probability matrix, \(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s'|S_t = s]\)
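
To make this concrete, here is a minimal Python sketch that samples episodes from a hypothetical 3-state chain (the state names and transition probabilities below are invented purely for illustration):

```python
import numpy as np

# Hypothetical 3-state Markov chain; "Sleep" is absorbing.
states = ["Class", "Pub", "Sleep"]
P = np.array([
    [0.5, 0.3, 0.2],   # transition probabilities from Class
    [0.4, 0.4, 0.2],   # transition probabilities from Pub
    [0.0, 0.0, 1.0],   # Sleep transitions only to itself
])
rng = np.random.default_rng(0)

def sample_episode(start=0, max_steps=20):
    """Sample a state sequence S_1, S_2, ... from the chain."""
    s, episode = start, [start]
    for _ in range(max_steps):
        s = rng.choice(len(states), p=P[s])
        episode.append(s)
        if states[s] == "Sleep":   # stop once the absorbing state is reached
            break
    return [states[i] for i in episode]

print(sample_episode())  # one sampled trajectory
```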

2. Markov Reward Processes

2.1. Markov Reward Process

Definition
A Markov Reward Process is a tuple \((\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma)\), where
  • \(\mathcal{S}\) and \(\mathcal{P}\) are as in a Markov Process
  • \(\mathcal{R}\) is a reward function, \(\mathcal{R}_s = \mathbb{E}[R_{t+1}|S_t = s]\)
  • \(\gamma\) is a discount factor, \(\gamma \in [0, 1]\)

2.2. Return

Definition
The return \(G_t\) is the total discounted reward from time-step \(t\):
\(G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\), where \(\gamma \in [0, 1]\).
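
A short sketch of how a return is computed for a finite reward sequence (the reward values and \(\gamma\) below are arbitrary):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):   # accumulate back-to-front: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 5.0]))  # 1 + 0.9*0 + 0.81*5 = 5.05
```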

2.3. Value Function

Definition
The value function \(v(s)\) of an MRP is the expected return starting from state \(s\):
\(v(s) = \mathbb{E}[G_t |S_t = s]\)
For example, a sample return starting from \(S_1\) in an episode terminating at time \(T\) is \(G_1 = R_2 + \gamma R_3 + ... + \gamma^{T-2}R_T\).
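
Because \(v(s)\) is an expectation, it can be estimated by averaging sampled returns. A rough Monte Carlo sketch, reusing the hypothetical chain from above and adding an invented reward for each state:

```python
import numpy as np

# Same hypothetical chain as before, plus an invented reward per state.
states = ["Class", "Pub", "Sleep"]
P = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, 1.0, 0.0])   # R_s = E[R_{t+1} | S_t = s]
rng = np.random.default_rng(0)

def estimate_value(s0, gamma=0.9, n_episodes=10_000, max_steps=50):
    """Monte Carlo estimate of v(s0) = E[G_t | S_t = s0]."""
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(max_steps):
            g += discount * R[s]   # collect R_{t+1} on leaving state s
            discount *= gamma
            s = rng.choice(len(states), p=P[s])
            if states[s] == "Sleep":
                break
        total += g
    return total / n_episodes

print(estimate_value(0))  # estimate of v(Class)
```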

2.4. Bellman Equation for MRPs

\[\begin{align*} v(s) &= \mathbb{E}[G_t | S_t = s] \\ &= \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + ... | S_t = s] \\ &= \mathbb{E}[R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + ...) | S_t = s] \\ &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) | S_t = s] \end{align*}\]

The last step uses the law of total expectation: conditioning on \(S_{t+1}\) gives \(\mathbb{E}[G_{t+1} | S_t = s] = \mathbb{E}[v(S_{t+1}) | S_t = s]\).

Writing out the expectation over successor states,
\[v(s) = \mathcal{R}_s + \gamma\sum_{s'\in \mathcal{S}}\mathcal{P}_{ss'}v(s')\]
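
In matrix form this is \(v = \mathcal{R} + \gamma \mathcal{P} v\), a linear system with the direct solution \(v = (I - \gamma\mathcal{P})^{-1}\mathcal{R}\). A sketch using the same hypothetical MRP as above:

```python
import numpy as np

# Same hypothetical MRP as above.
P = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, 1.0, 0.0])
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  <=>  (I - gamma P) v = R
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v.round(3))  # should agree with the Monte Carlo estimates above
```

The direct solve costs \(O(n^3)\) for \(n\) states, so for large problems iterative methods (dynamic programming, Monte Carlo, temporal-difference learning) are used instead.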

3. Markov Decision Processes

3.1. Markov Decision Process

Definition
A Markov Decision Process is a tuple \((\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)\), where
  • \(\mathcal{S}\) is a finite set of states
  • \(\mathcal{A}\) is a finite set of actions
  • \(\mathcal{P}\) is a state transition probability matrix, \(\mathcal{P}_{ss'}^{a} = \mathbb{P}[S_{t+1} = s'|S_t = s, A_t = a]\)
  • \(\mathcal{R}\) is a reward function, \(\mathcal{R}_{s}^{a} = \mathbb{E}[R_{t+1}|S_t = s, A_t = a]\)
  • \(\gamma\) is a discount factor, \(\gamma \in [0, 1]\)

3.2. Policies

Definition
A policy \(\pi\) is a distribution over actions given states:
\(\pi(a|s) = \mathbb{P}[A_t = a | S_t = s]\)

3.3. Value Function

Definition
The state-value function \(v_\pi(s)\) is the expected return starting from state \(s\) and then following policy \(\pi\):
\(v_\pi (s) = \mathbb{E}_\pi[G_t | S_t = s]\)

Definition
The action-value function \(q_\pi(s, a)\) is the expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\):
\(q_\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]\)

3.4. Bellman Expectation Equation

Both value functions can again be decomposed into immediate reward plus discounted value of the successor:

\[\begin{align*} v_\pi (s) &= \mathbb{E}_\pi[G_t | S_t = s] \\ &= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] \end{align*}\]

\[\begin{align*} q_\pi(s, a) &= \mathbb{E}_\pi[G_t | S_t = s, A_t = a] \\ &= \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a] \end{align*}\]

Averaging over the policy expresses \(v_\pi\) in terms of \(q_\pi\):

\[\begin{align*} v_\pi(s) = \sum_{a\in \mathcal{A}}{\pi(a|s)q_\pi(s, a)} \end{align*}\]

Conversely, \(q_\pi\) is the immediate reward plus the discounted value of the successor state:

\[\begin{align*} q_\pi(s, a) = \mathcal{R}_{s}^{a} + \gamma\sum_{s'\in \mathcal{S}}{\mathcal{P}_{ss'}^{a}v_\pi(s')} \end{align*}\]

Substituting the second equation into the first gives a self-consistency equation for \(v_\pi\):

\[\begin{align*} v_\pi(s) = \sum_{a\in \mathcal{A}}{\pi(a|s) \left( \mathcal{R}_{s}^{a} + \gamma\sum_{s'\in \mathcal{S}}{\mathcal{P}_{ss'}^{a}v_\pi(s')} \right)} \end{align*}\]
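
The last identity is a fixed-point equation that can be solved by iterating it as an update rule (iterative policy evaluation). A minimal sketch on a hypothetical 2-state, 2-action MDP, with all numbers invented for illustration:

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] = P^a_{ss'}, R[s, a] = R^a_s, pi[s, a] = pi(a|s).
P = np.array([
    [[0.7, 0.3], [0.2, 0.8]],   # P[0]: transition matrix under action 0
    [[0.1, 0.9], [0.5, 0.5]],   # P[1]: transition matrix under action 1
])
R = np.array([[ 1.0, 0.0],
              [-1.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * (P @ v).T      # q[s, a] = R^a_s + gamma * sum_s' P^a_{ss'} v(s')
    v_new = (pi * q).sum(axis=1)   # v[s]    = sum_a pi(a|s) q(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v.round(3))
```

Each sweep applies the Bellman expectation operator, which is a \(\gamma\)-contraction, so the iteration converges to \(v_\pi\) from any initial guess.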