Describes a sequential decision-making problem. We are given an environment that, for a given state \(s\) and action \(a\), transitions to a state \(s'\) and returns a reward \(r\) with probability \(p(s',r|s,a)\). We want to find the policy \(\pi\), which chooses an action \(a\) in a given state \(s\) with probability \(\pi(a|s)\), such that for every state \(s\) the value function
\[ v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \, v_\pi(s') \right] \]
is maximal. Here \(\gamma\) is the discount factor.
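The optimization described above can be made concrete with a small sketch. It assumes a hypothetical two-state, two-action environment whose dynamics \(p(s',r|s,a)\) are fully known (all numbers are made up for illustration), and it uses value iteration, which is one standard way to compute the maximal value function and not necessarily the method intended here. The names `DYNAMICS`, `q_value`, `value_iteration`, and `greedy_policy` are illustrative. The resulting policy is deterministic, i.e. the special case where \(\pi(a|s)\) puts probability 1 on a single action.

```python
# Hypothetical MDP: DYNAMICS[(s, a)] lists (probability, next_state, reward)
# triples, i.e. a tabular form of p(s', r | s, a). All values are invented.
GAMMA = 0.9  # discount factor (assumed value)

STATES = [0, 1]
ACTIONS = ["stay", "move"]
DYNAMICS = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}


def q_value(values, s, a, gamma=GAMMA):
    """Expected return of taking action a in state s, then following `values`."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in DYNAMICS[(s, a)])


def value_iteration(gamma=GAMMA, tol=1e-10):
    """Repeat the Bellman optimality update until the values stop changing."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {
            s: max(q_value(values, s, a, gamma) for a in ACTIONS) for s in STATES
        }
        delta = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if delta < tol:
            return values


def greedy_policy(values, gamma=GAMMA):
    """Pick, in every state, the action with the highest expected return."""
    return {
        s: max(ACTIONS, key=lambda a: q_value(values, s, a, gamma)) for s in STATES
    }


values = value_iteration()
policy = greedy_policy(values)
```

In this toy environment the greedy policy moves from state 0 to state 1 and then stays there, because staying in state 1 yields a reward of 2 at every step. This tabular approach only works when the dynamics are known and the state and action sets are small and discrete.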
We distinguish discrete and continuous action domains. Sub-Categories: