Research Abstract
The results of this study are summarized as follows: (1) A formal model of non-Markovian problems is the partially observable Markov decision problem (POMDP). The most useful way to overcome partial observability is to use memory to estimate the state. In this study, we proposed a new memory architecture for reinforcement learning algorithms to solve a certain type of POMDP. The agent's task is to discover a path leading from a start position to a goal in a partially observable maze, and the agent's lifetime is assumed to be separable into "trials". The basic framework of the algorithm, called labeling Q-learning, is as follows. Let O be the finite set of observations. At each step t, when the agent receives an observation o_t ∈ O from the environment, a label θ_t is attached to it, where θ_t ∈ Θ = {0, 1, 2, ..., M−1} (at the beginning of each trial, the labels of all o ∈ O are initialized to 0). The pair ô_t = (o_t, θ_t) then defines a new observation, and the usual reinforcement learning algorithm TD(λ) with replacing traces is applied to Ô = O × Θ, as if the pair (o_t, θ_t) had the Markov property. (2) Labeling Q-learning was applied to test problems of simple mazes taken from the recent literature; the results demonstrated that it works well, in a near-optimal manner. (3) Most problems have a continuous or large discrete observation space. We studied generalization techniques based on recurrent neural networks (RNNs) and holon networks, which allow compact storage of similar observations. Furthermore, we developed an approximate method for controlling the complexity, i.e., the Lyapunov exponent, of RNNs, and demonstrated the method by applying it to identification problems of certain nonlinear systems. (4) We carried out fundamental experiments on sensor-based navigation for a mobile robot.
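To make the labeling idea in (1) concrete, the following is a minimal tabular sketch, not the authors' exact algorithm. It assumes a hypothetical environment interface with reset()/step() methods, a hypothetical label-update rule (advance the label modulo M whenever an observation recurs within a trial; the abstract does not specify how labels change), and a Watkins-style Q-learning update with replacing eligibility traces standing in for the full TD(λ) machinery mentioned above.

    import random
    from collections import defaultdict

    def run_trial(env, Q, n_actions, M=4, alpha=0.1, gamma=0.95, lam=0.9, eps=0.1):
        """One trial of tabular Q-learning on the augmented observation space O x Theta."""
        labels = defaultdict(int)      # every observation starts the trial with label 0
        seen = set()                   # observations already encountered in this trial
        traces = defaultdict(float)    # replacing eligibility traces
        o = env.reset()                # assumed interface: returns the initial observation
        seen.add(o)
        s = (o, labels[o])             # augmented observation (o_t, theta_t)
        done = False
        while not done:
            # epsilon-greedy action selection over Q(s, .)
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[(s, b)])

            o2, reward, done = env.step(a)   # assumed interface: (observation, reward, done)

            # hypothetical label update (assumption, not from the abstract):
            # advance the label modulo M whenever an observation recurs within the trial
            if o2 in seen:
                labels[o2] = (labels[o2] + 1) % M
            seen.add(o2)
            s2 = (o2, labels[o2])

            # TD error for the augmented pair, treated as if it were Markovian
            best_next = max(Q[(s2, b)] for b in range(n_actions))
            delta = reward + (0.0 if done else gamma * best_next) - Q[(s, a)]

            # replacing trace: set to 1 rather than accumulating
            traces[(s, a)] = 1.0
            for key in list(traces):
                Q[key] += alpha * delta * traces[key]
                traces[key] *= gamma * lam

            s = s2
        return Q

Under these assumptions, Q can be initialized as defaultdict(float) and run_trial repeated over many trials; because labels are reset at the start of each trial, the (o, θ) pairs act as a simple within-trial memory that can disambiguate observations the bare sensor reading cannot.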