2018 Fiscal Year Research-status Report

Research on new machine learning method combining Exploitation-oriented Learning and Deep Learning

Research Project

Project/Area Number	17K00327
Research Institution	National Institution for Academic Degrees and Quality Enhancement of Higher Education
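Principal Investigator	Kazuteru Miyazaki, National Institution for Academic Degrees and Quality Enhancement of Higher Education, Research Department, Associate Professor (20282866)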
Project Period (FY)	2017-04-01 – 2020-03-31
Keywords	Reinforcement learning / Exploitation-oriented learning / Deep learning / Deep reinforcement learning / Robot
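Outline of Annual Research Achievements	The principal investigator has previously proposed DQNwithPS, a method that combines exploitation-oriented learning with deep learning. In FY2018, together with graduate students in a joint research effort, we proposed Deep P-Network (DPN), a method obtained by removing the Q-learning (QL) component from DQNwithPS. DPN is the first deep reinforcement learning method that can learn using Profit Sharing (PS) alone, without relying on QL. When evaluated on Pong in the Atari 2600 game environment, DPN was confirmed to learn, under some conditions, in one tenth or less of what DQN requires. This achieves the goal of this research project, "a substantial reduction in the number of trial-and-error steps required for learning." Because DPN does not use QL, it is classified, like PS, as a non-bootstrapping method. It is therefore particularly effective in partially observable Markov decision process (POMDP) environments. We expect the effectiveness of DPN to become clearer as we take on more realistic problems in which handling POMDP environments is the key issue. In addition, this fiscal year we extended Learning Acceleration DQN (LADQN), which was proposed in FY2017. Specifically, we improved LADQN by drawing on insights from Rainbow, which is known as a model that integrates improvements to DQN. Because DPN currently cannot handle penalties, LADQN is the more promising choice for problems in which rewards and penalties are mixed. Alongside these results, we also extended PS itself. Specifically, together with graduate students in a joint research effort, we proposed the Detour Path Suppression Method, which reexamines the PS discount rate, and a new PS-based method called Stable Profit Sharing.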
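Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: In FY2018, together with graduate students in a joint research effort, we proposed Deep P-Network (DPN), a method obtained by removing the Q-learning (QL) component from DQNwithPS. DPN is the first deep reinforcement learning method that can learn using Profit Sharing (PS) alone, without relying on QL. Like PS, it is therefore a non-bootstrapping method and is particularly promising in POMDP environments. Indeed, when evaluated on Pong in the Atari 2600 game environment, DPN was confirmed to learn, under some conditions, in one tenth or less of what DQN requires. This achieves the goal of this research project, "a substantial reduction in the number of trial-and-error steps required for learning." This fiscal year we also extended Learning Acceleration DQN (LADQN), which was proposed in FY2017. Specifically, we improved LADQN by drawing on insights from Rainbow, which is known as a model that integrates improvements to DQN. Furthermore, together with graduate students in a joint research effort, we proposed the Detour Path Suppression Method, which reexamines the PS discount rate, and a new PS-based method called Stable Profit Sharing. Incorporating these results into DPN is expected to improve the method further. Taken together, these results contribute substantially to the project goal of "a substantial reduction in the number of trial-and-error steps required for DQN learning," and they lead toward this fiscal year's initial goal of "completing a concrete method." We therefore judge that this research project is progressing generally smoothly.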
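Strategy for Future Research Activity	This research project initially envisaged a combination with Expected Failure Probability (EFP), an XoL method for avoiding penalties. Placing greater emphasis on efficiency, however, in FY2018 we instead improved LADQN and improved PS as a method. We plan to apply these insights when combining the methods with EFP as well. We will also progressively consider applications beyond game problems, for example domains that require real-time decision making in real environments. For this, we plan to make full use of the humanoid robot NAO purchased in FY2017. In addition, in recent years we have also been conducting text-mining research, such as the analysis of the three policies at universities. Introducing deep reinforcement learning into these areas is also being considered as a future direction.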
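The report contrasts Profit Sharing, a non-bootstrapping method, with Q-learning. The following is a minimal tabular sketch of that contrast, not the DPN architecture, which this report does not specify. The function names, the geometric credit function, and the `decay`, `alpha`, and `gamma` values are all assumptions chosen for illustration.

```python
from collections import defaultdict


def profit_sharing_update(weights, episode, reward, decay=0.5):
    """Distribute one reward backward along the episode that earned it.

    weights: dict mapping (state, action) -> accumulated credit
    episode: list of (state, action) pairs in the order they were taken
    decay:   assumed geometric ratio of the credit function (illustrative)
    """
    credit = reward
    for state, action in reversed(episode):
        # Credit depends only on the observed reward and the distance from it,
        # never on another value estimate: this is the non-bootstrapping part.
        weights[(state, action)] += credit
        credit *= decay
    return weights


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One-step Q-learning, shown for contrast.

    The target uses the current estimate at next_state (bootstrapping).
    """
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q


# Hypothetical three-step episode that ends with a reward of 1.0.
weights = defaultdict(float)
episode = [("s0", "right"), ("s1", "right"), ("s2", "up")]
profit_sharing_update(weights, episode, reward=1.0, decay=0.5)
```

In this sketch the pair closest to the reward receives the full credit and each earlier pair receives half of the next one. Because no update reads the value of a successor state, the rule does not require states to be Markovian, which is the usual argument for non-bootstrapping methods in POMDP settings.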

Research Products
(14 results)

All Journal Article (2 results) (of which Peer Reviewed: 2 results, Open Access: 2 results) Presentation (12 results) (of which Int'l Joint Research: 6 results)

[Journal Article] Proposal and Evaluation of Reward Sharing Method Based on Safety Level (2018)
- Author(s)
  KODAMA Naoki, MIYAZAKI Kazuteru, KOBAYASHI Hiroaki
- Journal Title
  SICE Journal of Control, Measurement, and System Integration
  Volume: 11 Pages: 207-213
- DOI
  https://doi.org/10.9746/jcmsi.11.207
- Peer Reviewed / Open Access
[Journal Article] Proposal of a Deep Q-network with Profit Sharing (2018)
- Author(s)
  Miyazaki Kazuteru
- Journal Title
  Procedia Computer Science
  Volume: 123 Pages: 302-307
- DOI
  https://doi.org/10.1016/j.procs.2018.01.047
- Peer Reviewed / Open Access
[Presentation] Research on Consistency between Diploma Policies and Nomenclature of Major Disciplines: Deep Learning Approach (2019)
- Author(s)
  Kazuteru Miyazaki, Nozomi Takahashi, Rie Mori
- Organizer
  7th International Conference on Information and Education Technology (ICIET 2019)
- Int'l Joint Research
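[Presentation] Proposal of a Deep Reinforcement Learning Algorithm Using a Non-Bootstrapping Method (2019)
- Author(s)
  Naoki Kodama, Taku Harada, Kazuteru Miyazaki
- Organizer
  46th Symposium on Intelligent Systems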
[Presentation] A Proposal for Reducing the Number of Trial-and-Error Searches for Deep Q-Networks Combined with Exploitation-Oriented Learning (2018)
- Author(s)
  Naoki Kodama, Kazuteru Miyazaki, Taku Harada
- Organizer
  17th IEEE International Conference on Machine Learning and Applications (ICMLA 2018)
- Int'l Joint Research
[Presentation] Consistency Assessment between Diploma Policy and Curriculum Policy using Character-level CNN (2018)
- Author(s)
  Kazuteru Miyazaki, Masaaki Ida
- Organizer
  Joint 10th International Conference on Soft Computing and Intelligent Systems and 19th International Symposium on Advanced Intelligent Systems (SCIS&ISIS 2018)
- Int'l Joint Research
[Presentation] Proposal of Detour Path Suppression Method in PS Reinforcement Learning and Its Application to Altruistic Multi-agent Environment (2018)
- Author(s)
  Daisuke Shiraishi, Kazuteru Miyazaki, Hiroaki Kobayashi
- Organizer
  International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2018)
- Int'l Joint Research
[Presentation] On Stable Profit Sharing Reinforcement Learning with Expected Failure Probability (2018)
- Author(s)
  Daisuke Mizuno, Kazuteru Miyazaki, Hiroaki Kobayashi
- Organizer
  Biologically Inspired Cognitive Architectures Meeting (BICA 2018)
- Int'l Joint Research
[Presentation] Proposal and Evaluation of an Indirect Reward Assignment Method for Reinforcement Learning by Profit Sharing (2018)
- Author(s)
  Kazuteru Miyazaki, Naoki Kodama, Hiroaki Kobayashi
- Organizer
  IntelliSys 2018
- Int'l Joint Research
[Presentation] Diploma Policy Matching Test Using a Character-level CNN (2018)
- Author(s)
  Kazuteru Miyazaki, Nozomi Takahashi, Rie Mori
- Organizer
  SICE Systems and Information Division Annual Conference 2018
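[Presentation] Combining the Deep Reinforcement Learning Algorithm Rainbow with Profit Sharing-Based Learning (2018)
- Author(s)
  Naoki Kodama, Taku Harada, Kazuteru Miyazaki
- Organizer
  SICE Systems and Information Division Annual Conference 2018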
[Presentation] Recent Developments in Exploitation-oriented Learning XoL (2018)
- Author(s)
  Kazuteru Miyazaki
- Organizer
  SICE Systems and Information Division Annual Conference 2018
[Presentation] Consistency Assessment between Diploma Policy and Curriculum Policy Using a Character-level CNN (2018)
- Author(s)
  Kazuteru Miyazaki, Masaaki Ida
- Organizer
  System Research Meeting, Intelligent Systems (FAN2018)
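[Presentation] Proposal of an Exploitation-oriented Deep Reinforcement Learning Method with Two Episodes (2018)
- Author(s)
  Naoki Kodama, Taku Harada, Kazuteru Miyazaki
- Organizer
  2018 IEEJ Electronics, Information and Systems Society Conference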
