# Reinforcement Learning

Reinforcement learning describes a sequential decision-making problem. We are given an environment that, for a state $$s$$ and action $$a$$, transitions to a state $$s'$$ and returns a reward $$r$$ with probability $$p(s',r|s,a)$$. We want to find the policy $$\pi$$, which chooses an action $$a$$ in state $$s$$ with probability $$\pi(a|s)$$, such that for every state $$s$$ the value function

$V^\pi(s)=\mathbb{E}_{\substack{a_t\sim \pi(\cdot|s_t) \\ s_{t+1},r_t\sim p(\cdot,\cdot|s_t,a_t)}}\left[\sum\limits_{t=0}^\infty \gamma^t r_t\,\bigg|\,s_0=s\right]$

is maximal. Here $$\gamma\in[0,1)$$ is the discount factor, which weights immediate rewards more heavily than distant future ones and keeps the infinite sum finite.
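The value function above is the fixed point of the Bellman expectation backup, so for a known finite MDP it can be computed by iterating that backup. A minimal sketch on a hypothetical two-state, two-action MDP (the arrays `P`, `R`, the policy `uniform`, and the helper `evaluate_policy` are illustrative, not from any particular library):

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1
])
R = np.array([
    [1.0, 0.0],                 # rewards in state 0
    [0.0, 2.0],                 # rewards in state 1
])
gamma = 0.9                     # discount factor

def evaluate_policy(pi, tol=1e-8):
    """Iterate the Bellman expectation backup until V converges.

    pi[s, a] = probability of choosing action a in state s.
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
        Q = R + gamma * (P @ V)
        # V(s) = sum_a pi(a|s) Q(s, a)
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

uniform = np.full((2, 2), 0.5)  # policy that picks each action with prob 0.5
V = evaluate_policy(uniform)
print(V)
```

Because $$\gamma<1$$, the backup is a contraction and the iteration converges to $$V^\pi$$ regardless of the initial guess.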

We distinguish between discrete and continuous action domains. Sub-categories: