2023 Fiscal Year Final Research Report

A study of server management technology for sustaining a large scale distributed neural network

Project/Area Number	20K19791
Research Category	Grant-in-Aid for Early-Career Scientists
Allocation Type	Multi-year Fund
Review Section	Basic Section 60060:Information network-related
Research Institution	Kindai University
Principal Investigator	Mizutani Kimihiro Kindai University, Faculty of Informatics, Associate Professor (40845939)
Project Period (FY)	2020-04-01 – 2024-03-31
Keywords	Wide-area distributed computing / Distributed learning / Distributed neural network / Network management
Outline of Final Research Achievements	In this study, we aimed to construct an execution platform for distributed neural networks by developing its core technologies. First, we used structured overlay network technology to restore the distributed platform quickly after failures. The strength of this method lies in estimating the union of failed nodes and quickly propagating failure information to them. This approach reduces unnecessary propagation of failure information and shortens the platform's Mean Time to Repair (MTTR). Second, we integrated distributed federated learning techniques into the platform to manage learning nodes at scale. We proposed an efficient, scalable tree architecture for node management that balances learning efficiency with high fault tolerance. Finally, we developed several schemes for estimating and controlling traffic data within the platform. By combining these technologies, we expect that a robust, fault-tolerant management platform for distributed neural networks can be maintained in the future.
Free Research Field	Information networks
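Academic Significance and Societal Importance of the Research Achievements	In building an autonomous distributed execution platform for neural networks, this study proposed a server coordination technology and a learning-state management method that support the persistent execution of training and inference. For server coordination, we used structured overlay technology to create a method that speeds up the handling of server failures occurring within the platform. For learning-state management, we developed a technology that enables training and inference to run smoothly at the same time on a federated learning framework. We also created technologies for controlling and analyzing the data generated within the distributed execution platform. These technologies make an important contribution to this research field and are expected to serve as a foundation for further research and practical application.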