A sojourn-based approach to semi-Markov Reinforcement Learning
In this paper we introduce a new approach to discrete-time semi-Markov decision processes based on the sojourn time process. Several characterizations of discrete-time semi-Markov processes are exploited, and decision processes are constructed by means of these characterizations. With this new approach, the agent may choose different actions depending on how long the process has remained in its current state. Numerical methods based on Q-learning algorithms for finite-horizon reinforcement learning and on stochastic recursive relations are investigated. We consider a toy example in which the reward depends on the sojourn time, in line with the gambler's fallacy, and we prove that the underlying process does not, in general, exhibit the Markov property. Finally, we use this example to carry out numerical evaluations of the previously presented Q-learning algorithms and of a different method based on deep reinforcement learning.
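The core idea of the abstract can be sketched with a minimal tabular experiment. This is not the paper's construction: the two-state chain, the sojourn cap, the gambler's-fallacy-style reward `0.5 * k`, and all learning parameters below are illustrative assumptions; the only point carried over from the text is that the agent's state is augmented with the sojourn time, so learned values (and hence behavior) can depend on how long the process has sat in the current state.

```python
import random

random.seed(0)

STATES = [0, 1]     # toy two-state chain (assumption, not from the paper)
ACTIONS = [0, 1]    # 0 = stay in the current state, 1 = switch state
MAX_SOJOURN = 5     # cap the sojourn clock so the Q-table stays finite
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

# Q-table indexed by the augmented state (s, k), with k the capped sojourn time.
Q = {((s, k), a): 0.0
     for s in STATES for k in range(1, MAX_SOJOURN + 1) for a in ACTIONS}

def step(s, k, a):
    """Toy dynamics: 'switch' moves to the other state and resets the sojourn
    clock; 'stay' keeps the state and increments the clock. The stay reward
    grows with the sojourn time, mimicking a gambler's-fallacy payoff."""
    if a == 1:
        return (1 - s, 1), 1.0
    return (s, min(k + 1, MAX_SOJOURN)), 0.5 * k

def greedy(sk):
    return max(ACTIONS, key=lambda a: Q[(sk, a)])

for episode in range(2000):
    sk = (0, 1)  # start in state 0 with sojourn time 1
    for t in range(20):
        a = random.choice(ACTIONS) if random.random() < EPS else greedy(sk)
        nxt, r = step(*sk, a)
        best_next = max(Q[(nxt, b)] for b in ACTIONS)
        Q[(sk, a)] += ALPHA * (r + GAMMA * best_next - Q[(sk, a)])
        sk = nxt

# Same raw state, different sojourn times: the learned values differ, which a
# plain Markov state representation could not capture.
print(Q[((0, 1), 0)], Q[((0, MAX_SOJOURN), 0)])
```

Since the reward for staying increases with the sojourn time, the Q-value of "stay" at `(0, MAX_SOJOURN)` ends up strictly larger than at `(0, 1)`, illustrating why the raw state process alone is not Markov in such examples.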