2022 Fiscal Year Final Research Report

Reward occurrence probability vector space that visualizes the distribution of whole learning results of multi-objective reinforcement learning

Project/Area Number	20K11946
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Multi-year Fund
Section	General
Review Section	Basic Section 61030: Intelligent informatics-related
Research Institution	Nara National College of Technology
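Principal Investigator	Yamaguchi Tomohiro, Nara National College of Technology, Department of Information Engineering, Professor (00240838)
Co-Investigator (Kenkyū-buntansha)	Takadama Keiki (高玉圭樹), The University of Electro-Communications, Graduate School of Informatics and Engineering, Professor (20345367); Ichikawa Yoshihiro (市川嘉裕), Nara National College of Technology, Department of Information Engineering, Assistant Professor (60805159)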
Project Period (FY)	2020-04-01 – 2023-03-31
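Keywords	machine learning / multi-objective reinforcement learning / reward occurrence probability vector / weight vector / partial computation / multi-objective optimal policy / visualization / vector space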
Outline of Final Research Achievements	First, we parallelized the collection of all reward acquisition policies and the determination of the multi-objective optimal policies, and sped up the process with partial computation. In a stochastic MDP environment with 12 states and 3 rewards, there were 253,000 reward acquisition policies but only 5,430 reward occurrence probability vectors, about 1/50 as many. With 4 rewards, computing the set of occurrence probability vectors corresponding to all reward acquisition policies took 8.8 s with the parallelized method versus 1,590 s with the existing method, about 1/180 of the execution time. Next, for the case of 3 rewards, we used the mesh method to determine the range of weight vectors among the objectives for which each multi-objective optimal policy is optimal, and visualized the average reward of the optimal policy for each weight vector.
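The reduction from policies to vectors described above can be illustrated with a minimal sketch. It is not the project's implementation. It assumes that a reward occurrence probability vector holds the long-run per-step probability of obtaining each reward under a policy, and that a reward acquisition policy is one for which at least one of those probabilities is positive. The toy MDP (3 states, 2 actions, 2 rewards), the names TRANSITIONS, REWARD_ON and occurrence_vector, and the start state 0 are all hypothetical.

```python
from itertools import product

# Hypothetical toy MDP (not from the report): 3 states, 2 actions per state.
# TRANSITIONS[state][action] maps next_state -> probability.
TRANSITIONS = {
    0: {0: {1: 1.0}, 1: {2: 0.5, 0: 0.5}},
    1: {0: {0: 1.0}, 1: {1: 0.5, 0: 0.5}},
    2: {0: {0: 1.0}, 1: {2: 1.0}},
}
# REWARD_ON[(state, action)] is the index of the reward obtained, if any.
REWARD_ON = {(1, 0): 0, (2, 0): 1}
NUM_REWARDS = 2


def occurrence_vector(policy, steps=2000):
    """Long-run per-step occurrence probability of each reward, from state 0.

    Assumed definition; the report does not spell it out in this record.
    """
    n = len(TRANSITIONS)
    dist = [1.0] + [0.0] * (n - 1)  # all probability mass starts in state 0
    for _ in range(steps):
        nxt = [0.0] * n
        for s, mass in enumerate(dist):
            for s2, p in TRANSITIONS[s][policy[s]].items():
                nxt[s2] += mass * p
        # Lazy update: averaging with the old distribution removes
        # periodicity without changing the stationary distribution.
        dist = [0.5 * a + 0.5 * b for a, b in zip(dist, nxt)]
    vec = [0.0] * NUM_REWARDS
    for s, mass in enumerate(dist):
        r = REWARD_ON.get((s, policy[s]))
        if r is not None:
            vec[r] += mass
    return tuple(round(x, 6) for x in vec)


# Enumerate every deterministic policy (one action per state).
policies = list(product(range(2), repeat=len(TRANSITIONS)))

# Group reward acquisition policies by their vector.
distinct = {}
for pol in policies:
    vec = occurrence_vector(pol)
    if any(v > 0 for v in vec):  # assumed meaning of "reward acquisition"
        distinct.setdefault(vec, []).append(pol)

num_acquiring = sum(len(pols) for pols in distinct.values())
print(num_acquiring, "reward acquisition policies ->", len(distinct), "vectors")
```

In this toy case, policies that differ only in states they never visit share one vector, which is one plausible reason the number of vectors can be much smaller than the number of policies. Each call to occurrence_vector is independent of the others, so the loop over policies is the kind of step that lends itself to parallel execution.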
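Free Research Field	Reinforcement learning
Academic Significance and Societal Importance of the Research Achievements	The academic significance of this research is as follows. With conventional methods, it was difficult to solve analytically for the boundaries of the multi-objective optimal policies that maximize the average reward when the number of objectives is three or more. The present method instead uses Equation (1) to compute the average reward of each policy for each weight vector and selects the policy with the maximum value, so an approximate solution can be computed as far as the computational cost allows. Moreover, in determining the multi-objective optimal policies, the method first computes the reward occurrence probability vectors, which are independent of the weight vector that expresses the relative importance of the objectives. It then uses them with the mesh method to determine approximately the range of weight vectors among the objectives for which each multi-objective optimal policy is optimal, which makes the computation feasible for three or more objectives.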
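The mesh-based procedure described above can be sketched as follows. This is a minimal illustration, not the project's code. Equation (1) is not reproduced in this record, so the sketch assumes it is the weighted sum (inner product) of the weight vector and the reward occurrence probability vector. The three vectors in VECTORS and the function names are hypothetical.

```python
def simplex_mesh(num_objectives, divisions):
    """Yield weight vectors with entries k/divisions (k >= 0) that sum to 1."""
    def rec(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for k in range(remaining + 1):
            for rest in rec(remaining - k, slots - 1):
                yield (k,) + rest

    for ks in rec(divisions, num_objectives):
        yield tuple(k / divisions for k in ks)


def average_reward(weights, occurrence):
    # Assumed form of the report's Equation (1): a weighted sum.
    return sum(w * p for w, p in zip(weights, occurrence))


# Hypothetical reward occurrence probability vectors for three policies.
VECTORS = {
    "A": (0.5, 0.0, 0.0),
    "B": (0.0, 0.4, 0.1),
    "C": (0.1, 0.1, 0.3),
}


def optimal_regions(vectors, divisions=10):
    """Map each policy to the mesh weight vectors for which it is the best."""
    num_objectives = len(next(iter(vectors.values())))
    regions = {}
    for w in simplex_mesh(num_objectives, divisions):
        best = max(vectors, key=lambda name: average_reward(w, vectors[name]))
        regions.setdefault(best, []).append(w)
    return regions


regions = optimal_regions(VECTORS)
for name, weights in sorted(regions.items()):
    print(name, "is best at", len(weights), "mesh points")
```

Because the occurrence vectors do not depend on the weights, they are computed once and reused at every mesh point. A finer mesh (larger divisions) approximates the boundaries between regions more closely at a higher computational cost.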