Self-Optimizing and Pareto-Optimal Policies in General Environments based on Bayes-Mixtures

04/17/2002
by Marcus Hutter

The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle t action y_t results in perception x_t and reward r_t, where all quantities in general may depend on the complete history. The perception x_t and reward r_t are sampled from the (reactive) environmental probability distribution μ. This very general setting includes, but is not limited to, (partially observable, k-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if μ is known. Reinforcement learning is usually used if μ is unknown. In the Bayesian approach one defines a mixture distribution ξ as a weighted sum of distributions ν∈ℳ, where ℳ is any class of distributions including the true environment μ. We show that the Bayes-optimal policy p^ξ based on the mixture ξ is self-optimizing in the sense that the average value converges asymptotically for all μ∈ℳ to the optimal value achieved by the (infeasible) Bayes-optimal policy p^μ which knows μ in advance. We show that the necessary condition that ℳ admits self-optimizing policies at all is also sufficient. No other structural assumptions are made on ℳ. As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that p^ξ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments ν∈ℳ and a strictly higher value in at least one.
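A minimal LaTeX sketch of the quantities named in the abstract is given below, written in the notation standard for Bayes-mixture agents; the horizon m, the condition on the weights w_ν, and the exact form of the value V follow that convention and are assumptions, not quotations from the paper.

% Sketch of the Bayes-mixture, value, and self-optimizing property
% (notation assumed from the standard Bayes-mixture setting).
\documentclass{article}
\usepackage{amsmath}
\begin{document}

% Bayes-mixture over a class M of environments containing the true mu:
\[
  \xi(x_{1:m} \mid y_{1:m}) \;=\; \sum_{\nu \in \mathcal{M}} w_\nu \,
  \nu(x_{1:m} \mid y_{1:m}),
  \qquad w_\nu > 0, \quad \sum_{\nu \in \mathcal{M}} w_\nu \le 1 .
\]

% Value: expected total reward of policy p in environment rho up to cycle m,
% and the rho-optimal (Bayes-optimal for rho) policy:
\[
  V^{p}_{\rho}(m) \;=\; \mathbf{E}^{p}_{\rho}\!\left[\, r_1 + r_2 + \dots + r_m \,\right],
  \qquad
  p^{\rho} \;:=\; \arg\max_{p} V^{p}_{\rho}(m).
\]

% Self-optimizing: the average value of the mixture policy converges to the
% optimal average value in every environment of the class:
\[
  \frac{1}{m}\, V^{p^{\xi}}_{\mu}(m) \;\longrightarrow\;
  \frac{1}{m}\, V^{p^{\mu}}_{\mu}(m)
  \quad (m \to \infty), \qquad \text{for all } \mu \in \mathcal{M}.
\]

\end{document}

The last display is the self-optimizing statement of the abstract: averaged over the horizon, the value of the mixture policy p^ξ approaches that of the informed policy p^μ in every environment of the class ℳ.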
