# Reinforcement Learning

Reinforcement learning describes a sequential decision-making problem. We are given an environment that, for a given state \(s\) and action \(a\), transitions to a state \(s'\) and returns a reward \(r\) with probability \(p(s',r|s,a)\). We want to find a policy \(\pi\) that chooses an action \(a\) in state \(s\) with probability \(\pi(a|s)\) such that, for every state \(s\), the value function

\[\begin{split}V^\pi(s)=\mathbb{E}_{\substack{a_t\sim \pi(\cdot|s_t) \\ s_{t+1},r_t\sim p(\cdot,\cdot|s_t,a_t)}}\left[\sum\limits_{t=0}^\infty \gamma^t r_t\bigg|s_0=s\right]\end{split}\]

is maximal. Here \(\gamma\in[0,1)\) is the discount factor, which weighs immediate rewards against future ones and makes the infinite sum converge.
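The expectation above can be approximated by averaging sampled returns. The following is a minimal Monte Carlo sketch for a hypothetical two-state MDP; the dynamics, policy, and all names here are illustrative assumptions, not part of the text above.

```python
import random

rng = random.Random(0)
gamma = 0.9  # discount factor

# Hypothetical dynamics: dynamics[s][a] -> list of (probability, next_state, reward),
# i.e. the support of p(s', r | s, a).
dynamics = {
    0: {0: [(1.0, 0, 0.0)],                  # action 0 in state 0: stay, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},  # action 1: reach state 1 w.p. 0.8
    1: {0: [(1.0, 0, 0.0)],                  # action 0 in state 1: fall back to 0
        1: [(1.0, 1, 2.0)]},                 # action 1: stay in 1, reward 2
}
# A fixed policy pi(a|s), given as action-probability lists.
policy = {0: [0.5, 0.5], 1: [0.0, 1.0]}

def rollout_return(s, horizon=200):
    """Sample one discounted return sum_t gamma^t r_t by following pi from s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):  # truncate the infinite sum; gamma^200 is negligible
        a = rng.choices([0, 1], weights=policy[s])[0]
        outcomes = dynamics[s][a]
        _, s, r = rng.choices(outcomes, weights=[o[0] for o in outcomes])[0]
        g += discount * r
        discount *= gamma
    return g

def value_estimate(s, n=2000):
    """Monte Carlo estimate of V^pi(s) as the average of n sampled returns."""
    return sum(rollout_return(s) for _ in range(n)) / n
```

Under this policy, state 1 loops on itself with reward 2 at every step, so `value_estimate(1)` converges to \(2/(1-\gamma)=20\).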

We distinguish between discrete and continuous action domains.

**Sub-Categories:**