
Self-supervised graph-based representation for language and speaker detection

Research Project

Project/Area Number 21K17776
Research Category

Grant-in-Aid for Early-Career Scientists

Allocation Type Multi-year Fund
Review Section Basic Section 61010: Perceptual information processing-related
Research Institution National Institute of Information and Communications Technology

Principal Investigator

Shen Peng  National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Universal Communication Research Institute, Senior Researcher (80773118)

Project Period (FY) 2021-04-01 – 2024-03-31
Project Status Completed (Fiscal Year 2023)
Budget Amount
¥4,550,000 (Direct Cost: ¥3,500,000, Indirect Cost: ¥1,050,000)
Fiscal Year 2023: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
Fiscal Year 2022: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000)
Fiscal Year 2021: ¥2,080,000 (Direct Cost: ¥1,600,000, Indirect Cost: ¥480,000)
Keywords language identification / speech recognition / self-supervised learning / speaker recognition / pre-training model / large language models / speaker diarization / cross-domain / language recognition
Outline of Research at the Start

Developing spoken language and speaker detection techniques is an important task for improving the usability of real-time multilingual speech translation systems. However, current state-of-the-art spoken language and speaker detection techniques do not perform well on cross-channel and cross-domain data. In this project, we investigate how to better represent the languages and speakers in a speech signal by developing self-supervised graph-based learning techniques for robust spoken language and speaker detection.

Outline of Final Research Achievements

In this project, we focused on developing self-supervised and pre-training techniques to enhance spoken language and speaker recognition. We experimented with different methods to better capture the characteristics of languages and speakers from speech signals. Our proposed techniques include transducer-based language embeddings, pronunciation-aware character encoding, cross-modal alignment, and generative linguistic representations. These innovations improve language and speaker recognition, as well as speech recognition. Furthermore, we explored multi-task recognition to perform language, speaker, and speech recognition with a single model. The results of this project have been published at top international conferences, including IEEE ICASSP, SLT, ASRU, and Interspeech.
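As a purely illustrative sketch (not the project's actual implementation), self-supervised representation learning of the kind described above is often trained with a contrastive (InfoNCE-style) objective: an embedding of an utterance is pulled toward an embedding of an augmented view of the same utterance and pushed away from embeddings of other utterances in the batch. The function and variable names below are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE contrastive loss: row i of `positives` is the positive
    for row i of `anchors`; all other rows serve as negatives."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature               # cosine-similarity logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on the diagonal

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))                    # 8 toy utterance embeddings
views = emb + 0.01 * rng.normal(size=emb.shape)   # lightly perturbed "augmented" views
loss_matched = info_nce_loss(emb, views)
loss_random = info_nce_loss(emb, rng.normal(size=emb.shape))
print(loss_matched < loss_random)  # matched views yield a lower loss
```

In a real speech system the embeddings would come from a trainable encoder and the objective would be minimized by gradient descent; the sketch only shows that the loss rewards agreement between views of the same utterance.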

Academic Significance and Societal Importance of the Research Achievements

The primary aim of this project is to advance the understanding and representation of speech signals, which carries significant scientific importance. Techniques that improve language and speaker recognition performance also help advance practical applications of the technology.

Report

(4 results)
  • 2023 Annual Research Report / Final Research Report (PDF)
  • 2022 Research-status Report
  • 2021 Research-status Report
  • Research Products

    (8 results)


Journal Article (1 result) (of which Int'l Joint Research: 1, Peer Reviewed: 1, Open Access: 1); Presentation (7 results) (of which Int'l Joint Research: 6)

  • [Journal Article] Coupling a Generative Model With a Discriminative Learning Framework for Speaker Verification (2021)

    • Author(s)
      Lu Xugang, Shen Peng, Tsao Yu, Kawai Hisashi
    • Journal Title

      IEEE/ACM Transactions on Audio, Speech, and Language Processing

      Volume: 29 Pages: 3631-3641

    • DOI

      10.1109/taslp.2021.3129360

    • Related Report
      2021 Research-status Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Presentation] Hierarchical cross-modality knowledge transfer with Sinkhorn attention for CTC-based ASR (2024)

    • Author(s)
      X. Lu, P. Shen, Y. Tsao, H. Kawai
    • Organizer
      IEEE ICASSP
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Generative linguistic representation for spoken language identification (2023)

    • Author(s)
      P. Shen, X. Lu, H. Kawai
    • Organizer
      IEEE ASRU
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Cross-modal alignment with optimal transport for CTC-based ASR (2023)

    • Author(s)
      X. Lu, P. Shen, Y. Tsao, H. Kawai
    • Organizer
      IEEE ASRU
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Investigation on Multi-task Universal Speech Models (2023)

    • Author(s)
      P. Shen, X. Lu, H. Kawai
    • Organizer
      Autumn Meeting of the Acoustical Society of Japan
    • Related Report
      2023 Annual Research Report
  • [Presentation] Partial Coupling of Optimal Transport for Spoken Language Identification (2022)

    • Author(s)
      P. Shen, X. Lu, H. Kawai
    • Organizer
      SLT 2022
    • Related Report
      2022 Research-status Report
    • Int'l Joint Research
  • [Presentation] Transducer-based language embedding for spoken language identification (2022)

    • Author(s)
      P. Shen, X. Lu, H. Kawai
    • Organizer
      Interspeech 2022
    • Related Report
      2022 Research-status Report
    • Int'l Joint Research
  • [Presentation] Siamese Neural Network with Joint Bayesian Model Structure for Speaker Verification (2021)

    • Author(s)
      X. Lu, P. Shen, Y. Tsao, H. Kawai
    • Organizer
      APSIPA ASC
    • Related Report
      2021 Research-status Report
    • Int'l Joint Research

Published: 2021-04-28   Modified: 2025-01-30  
