Tech Report CS-97-05

Bounded Parameter Markov Decision Processes

Robert Givan, Sonia Leach and Thomas Dean

May 1997

Abstract:

In this paper, we introduce the notion of a {\em bounded parameter Markov decision process\/} as a generalization of the traditional {\em exact\/} MDP. A bounded parameter MDP is a set of exact MDPs specified by giving upper and lower bounds on transition probabilities and rewards (all the MDPs in the set share the same state and action space). Bounded parameter MDPs can be used to represent variation or uncertainty concerning the parameters of sequential decision problems. They can also be used in aggregation schemes to represent the variation in transition probabilities among the different base states aggregated into the same aggregate state.
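To make the definition concrete, the following Python sketch shows one possible encoding of a bounded parameter MDP as interval-valued transition and reward parameters. It is an illustration only: the report does not prescribe a data structure, and all identifiers (BoundedParameterMDP, trans, reward, contains) are ours.

\begin{verbatim}
from dataclasses import dataclass

# Hypothetical encoding of a bounded parameter MDP: every transition
# probability and reward is given as a closed interval (lo, hi).
# All names here are illustrative; the report fixes no data structure.

@dataclass
class BoundedParameterMDP:
    states: list   # shared state space
    actions: list  # shared action space
    trans: dict    # (s, a, s2) -> (lo, hi) bounds on Pr(s2 | s, a)
    reward: dict   # (s, a) -> (lo, hi) bounds on immediate reward

    def contains(self, exact_trans, exact_reward):
        """True iff an exact MDP's parameters fall within all bounds."""
        for key, (lo, hi) in self.trans.items():
            if not lo <= exact_trans.get(key, 0.0) <= hi:
                return False
        for key, (lo, hi) in self.reward.items():
            if not lo <= exact_reward.get(key, 0.0) <= hi:
                return False
        return True
\end{verbatim}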

We introduce {\em interval value functions\/} as a natural extension of traditional value functions. An interval value function assigns a closed real interval to each state, representing the assertion that the value of that state falls within that interval. An interval value function can be used to bound the performance of a policy over the set of exact MDPs associated with a given bounded parameter MDP. We describe an iterative dynamic programming algorithm called {\em interval policy evaluation\/} which computes an interval value function for a given bounded parameter MDP and specified policy. Interval policy evaluation on a policy $\pi$ computes the most restrictive interval value function that is sound, i.e., one that bounds the value function for $\pi$ in every exact MDP in the set defined by the bounded parameter MDP. A simple modification of interval policy evaluation results in a variant of value iteration [Bellman57] that we call {\em interval value iteration\/}, which computes a policy for a bounded parameter MDP that is optimal in a well-defined sense.
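The dynamic programming backup behind interval policy evaluation can be sketched as follows, assuming the hypothetical BoundedParameterMDP structure above. For the lower bound, leftover probability mass is pushed toward low-value successors (the worst MDP in the family), and symmetrically for the upper bound; this greedy selection is our reading of the backup step, not the report's exact pseudocode, and the function names are ours.

\begin{verbatim}
def extreme_expectation(bounds, values, minimize=True):
    # Pick a distribution within the interval bounds that minimizes
    # (or maximizes) the expected successor value: start each successor
    # at its lower bound, then pour the leftover probability mass into
    # the lowest-value (or highest-value) successors first.
    probs = [lo for lo, _ in bounds]
    slack = 1.0 - sum(probs)
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not minimize)
    for i in order:
        lo, hi = bounds[i]
        give = min(hi - lo, slack)
        probs[i] += give
        slack -= give
    return sum(p * v for p, v in zip(probs, values))

def interval_policy_evaluation(bmdp, policy, gamma=0.9, sweeps=100):
    # Iteratively back up a lower and an upper value estimate per
    # state against the worst-case / best-case parameters in the
    # family of exact MDPs.
    lo_v = {s: 0.0 for s in bmdp.states}
    hi_v = {s: 0.0 for s in bmdp.states}
    for _ in range(sweeps):
        new_lo, new_hi = {}, {}
        for s in bmdp.states:
            a = policy[s]
            r_lo, r_hi = bmdp.reward[(s, a)]
            tb = [bmdp.trans.get((s, a, s2), (0.0, 0.0))
                  for s2 in bmdp.states]
            new_lo[s] = r_lo + gamma * extreme_expectation(
                tb, [lo_v[s2] for s2 in bmdp.states], minimize=True)
            new_hi[s] = r_hi + gamma * extreme_expectation(
                tb, [hi_v[s2] for s2 in bmdp.states], minimize=False)
        lo_v, hi_v = new_lo, new_hi
    return lo_v, hi_v
\end{verbatim}

In this reading, interval value iteration differs only in adding a maximization over actions inside each backup rather than following a fixed policy.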
