Research Abstract
A stopped decision process is a combined model of a Markov decision process (MDP) and an optimal stopping problem. The MDP is specified by a countable state space $S$, a compact action space $A(i)$ attached to each state $i \in S$, transition probabilities $q = (q_{ij}(a))$, and a uniformly bounded immediate reward function $r(i, a, j)$, with both $q_{ij}(a)$ and $r(i, a, j)$ continuous in $a \in A(i)$ for all $i, j \in S$. A policy $\pi$ is a sequence of probability measures on $A(i_t)$ conditioned on the history $(i_0, a_0, i_1, \ldots, i_t)$ for $t = 0, 1, \ldots$. Denote by $\sigma$ a stopping time and by $g$ a utility function, and let $B(t) = \sum_{k=1}^{t} r(X_{k-1}, \Delta_{k-1}, X_k)$, where $X_t$ and $\Delta_t$ are the state and the action at time $t$, respectively. The pair $(\pi, \sigma)$ is called $(i_0, \alpha_0)$-optimal if it maximizes $E^{\pi}_{i_0}[g(\alpha_0 + B(\sigma))]$, where $E^{\pi}_{i_0}$ denotes expectation with respect to the probability measure on the sample space $\Omega = (S \times A)^{\infty}$ with initial state $i_0$. It is assumed either that $g$ is non-decreasing, concave, and bounded above, or that $g$ has a bounded derivative on every compact subset of the real line $\mathbb{R}$ and satisfies $E^{\pi}_{i}\big[\sup_{t \ge 0} g^{+}(\alpha_0 + B(t))\big] < \infty$ for all $\pi, i$, where $g^{+}$ is the positive part of $g$. Let $v(i, \alpha) = \max_{(\pi, \sigma)} E^{\pi}_{i}[g(\alpha + B(\sigma))]$. Then the following results hold.

1. For any $i \in S$ and $\alpha$, $v(i, \alpha)$ satisfies the optimality equation
$$v(i, \alpha) = \max\Big\{\, g(\alpha),\ \max_{a \in A(i)} \sum_{j \in S} q_{ij}(a)\, v(j, \alpha + r(i, a, j)) \Big\}. \qquad (1)$$
Furthermore, suppose $(\pi, \sigma)$ satisfies $P^{\pi}_{i_0}(\sigma > 1) = 1$.
2. If $(\pi, \sigma)$ is an $(i_0, \alpha_0)$-optimal pair, then $E^{\pi}_{i_0}[g(\alpha_0 + B(\sigma))]$ satisfies (1).
3. If $E^{\pi}_{i_0}[g(\alpha_0 + B(\sigma))]$ satisfies (1), then $(\pi, \sigma)$ is $(i_0, \alpha_0)$-optimal.
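As a concrete illustration of how equation (1) can be solved numerically, the sketch below runs successive approximation on a small finite instance. Everything in it is an illustrative assumption, not taken from the abstract: a two-state chain, two actions, the utility $g(\alpha) = 1 - e^{-\alpha}$ (non-decreasing, concave, and bounded above, matching the stated hypothesis), integer rewards, and a truncated integer grid for the accumulated reward $\alpha$ with clamping at the grid edges. Starting from $v_0(i, \alpha) = g(\alpha)$ (stop immediately), the iterates increase monotonically and are bounded above by $\sup g$, so they converge to a fixed point of (1) on the truncated grid.

    import numpy as np

    # Toy instance (assumed, for illustration only).
    S = range(2)
    A = range(2)
    # q[i, a, j]: transition probability i -> j under action a.
    q = np.array([[[0.7, 0.3], [0.4, 0.6]],
                  [[0.5, 0.5], [0.2, 0.8]]])
    # r[i, a, j]: integer-valued rewards, so alpha + r stays on the grid.
    r = np.array([[[1, -1], [2, -2]],
                  [[0,  1], [-1, 1]]])

    def g(alpha):
        # Non-decreasing, concave, bounded above (by 1), as assumed for g.
        return 1.0 - np.exp(-alpha)

    alphas = np.arange(-20, 21)          # truncated grid of accumulated rewards

    def shift(k, dr):
        # Index of alpha + dr on the grid, clamped at the boundaries.
        return min(max(k + int(dr), 0), len(alphas) - 1)

    v = np.tile(g(alphas), (2, 1))       # v_0(i, alpha) = g(alpha)
    for _ in range(1000):
        v_new = np.empty_like(v)
        for i in S:
            for k, alpha in enumerate(alphas):
                # Best continuation value over actions, then equation (1).
                cont = max(sum(q[i, a, j] * v[j, shift(k, r[i, a, j])] for j in S)
                           for a in A)
                v_new[i, k] = max(g(alpha), cont)
        done = np.max(np.abs(v_new - v)) < 1e-12
        v = v_new
        if done:
            break

Once the iteration has converged, an optimal rule can be read off from (1): stop at the first time $t$ at which $g(\alpha_0 + B(t))$ attains $v(X_t, \alpha_0 + B(t))$, and otherwise continue with an action achieving the inner maximum.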