2023, In Search of Strong Embedding Extractors For Speaker Diarization [ICASSP]

DongKeon Park · May 24, 2023


Published at ICASSP 2023, Naver CLOVA

Introduction

speaker diarization system: four sub-systems
- end point detection
- speaker embedding extraction
  - window size of 1.5s and shift size of 0.5s
  - three state-of-the-art models are adopted: RawNet3, ECAPA-TDNN, and MFA-Conformer
- feature enhancement
  - dimensionality reduction and attention-based embedding aggregation
  - to refine the extracted speaker embeddings
- Clustering
  - assigns speaker labels to each extracted embedding
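
As a rough illustration of the windowing step in the embedding extraction stage, the sketch below splits a recording into 1.5 s windows with a 0.5 s shift. The 16 kHz sample rate and the function name `sliding_windows` are assumptions made for this example and are not taken from the paper.

```python
def sliding_windows(num_samples, sample_rate=16000, window_s=1.5, shift_s=0.5):
    """Return (start, end) sample indices of fixed-size analysis windows.

    sample_rate=16000 is an assumption; the post only gives the window
    size (1.5 s) and the shift size (0.5 s).
    """
    window = int(round(window_s * sample_rate))
    shift = int(round(shift_s * sample_rate))
    if num_samples < window:
        # Recording shorter than one window: use it as a single segment.
        return [(0, num_samples)]
    return [(s, s + window) for s in range(0, num_samples - window + 1, shift)]


# A 4-second recording at 16 kHz.
segments = sliding_windows(num_samples=4 * 16000)
```

Each `(start, end)` pair would then be passed to the embedding extractor, giving one embedding per window.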

Speaker embedding extractors (EEs)
- speaker diarization: the process of separating an audio recording into segments corresponding to individual speakers
- an EE maps an utterance to a latent space where speakers can be discriminated
- EEs play a critical role in speaker diarization systems

Problems

The authors focus on two key problems:
- 1) the difficulty of evaluating the performance of EEs for diarization
  - evaluation protocols differ between speaker verification and diarization
- 2) the fact that EEs have not been trained on inputs with multiple speakers
  - such as overlapped speech and speaker changes
  - this degrades their performance

Proposals
- two data augmentation techniques
- new evaluation protocols

1) the difficulty of evaluating the performance of EEs for diarization
- Problem: the correlation between the equal error rates (EERs) of EEs and the diarization error rate (DER) is not strong

New evaluation protocols

The authors generate and adopt speaker verification evaluation protocols built from diarization data
- composing pairs within the same audio file yields easier positive and harder negative trials
- They crop the input audio into short segments using RTTM files and compose trials using these segments:
  - (a) target and non-target single speaker-single speaker
  - (b) target and non-target overlap-single speaker
  - (c) target and non-target speaker change-single speaker
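
The trial composition can be sketched roughly as follows, covering only type (a), single speaker-single speaker. The helper names and the toy RTTM lines are hypothetical. Dropping any segment that overlaps another speaker is a simplification made for this example, since the post does not spell out how the authors crop the segments.

```python
from itertools import combinations


def parse_rttm(lines):
    """Parse RTTM 'SPEAKER' lines into segment dicts (file, start, end, spk)."""
    segs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start, dur = float(fields[3]), float(fields[4])
        segs.append({"file": fields[1], "start": start,
                     "end": start + dur, "spk": fields[7]})
    return segs


def single_speaker_segments(segs):
    """Keep segments that do not overlap another speaker in the same file.

    Simplification (assumption): an overlapped segment is dropped entirely
    rather than cropped to its single-speaker part.
    """
    keep = []
    for s in segs:
        overlapped = any(
            o["file"] == s["file"] and o["spk"] != s["spk"]
            and o["start"] < s["end"] and s["start"] < o["end"]
            for o in segs
        )
        if not overlapped:
            keep.append(s)
    return keep


def compose_trials(segs):
    """Pair segments within the same file; label 1 = target, 0 = non-target."""
    return [
        (a, b, int(a["spk"] == b["spk"]))
        for a, b in combinations(segs, 2)
        if a["file"] == b["file"]
    ]


# Toy RTTM content (hypothetical file and speaker names).
rttm = [
    "SPEAKER meeting1 1 0.00 2.00 <NA> <NA> spkA <NA> <NA>",
    "SPEAKER meeting1 1 2.50 1.50 <NA> <NA> spkB <NA> <NA>",
    "SPEAKER meeting1 1 5.00 2.00 <NA> <NA> spkA <NA> <NA>",
    "SPEAKER meeting1 1 6.50 1.00 <NA> <NA> spkB <NA> <NA>",
    "SPEAKER meeting1 1 8.00 1.00 <NA> <NA> spkA <NA> <NA>",
]
segments = single_speaker_segments(parse_rttm(rttm))
trials = compose_trials(segments)
labels = [label for _, _, label in trials]
```

Because both segments of every trial come from the same recording, they share channel and acoustic conditions, which is why positives become easier and negatives harder than in a standard verification protocol.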

Two data augmentation techniques
- make EEs aware of overlapped speech and speaker change segments

Overlapped speech augmentation
- adds a minor speaker's scaled and cropped utterance to a major speaker's utterance
- the duration of the added crop is randomly selected between 200 ms and 700 ms
- the crop is scaled to a randomly selected target signal-to-noise ratio (SNR) relative to the major speaker's utterance
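
A minimal NumPy sketch of overlapped speech augmentation, under stated assumptions: the 16 kHz sample rate, the SNR range of 0 to 10 dB, the uniform choice of where the overlap is placed, and the function name `overlap_augment` are illustrative choices, not values given in the post.

```python
import numpy as np


def overlap_augment(major, minor, sample_rate=16000, min_s=0.2, max_s=0.7,
                    snr_db_range=(0.0, 10.0), rng=None):
    """Add a scaled, cropped minor-speaker snippet onto a major utterance.

    snr_db_range and sample_rate are assumptions for this sketch.
    Returns the mixed signal and the (start, end) of the overlapped region.
    """
    rng = np.random.default_rng() if rng is None else rng
    major = major.astype(np.float64)
    minor = minor.astype(np.float64)

    # Overlap duration: random between 200 ms and 700 ms.
    dur = int(rng.uniform(min_s, max_s) * sample_rate)
    dur = min(dur, len(major), len(minor))

    # Crop a random region of the minor speaker's utterance.
    m_start = rng.integers(0, len(minor) - dur + 1)
    crop = minor[m_start:m_start + dur]

    # Scale the crop so that major-to-minor power matches the target SNR.
    snr_db = rng.uniform(*snr_db_range)
    p_major = np.mean(major ** 2) + 1e-12
    p_minor = np.mean(crop ** 2) + 1e-12
    gain = np.sqrt(p_major / (p_minor * 10 ** (snr_db / 10)))

    # Add the scaled crop at a random position of the major utterance.
    start = int(rng.integers(0, len(major) - dur + 1))
    out = major.copy()
    out[start:start + dur] += gain * crop
    return out, (start, start + dur)


rng = np.random.default_rng(0)
major = rng.standard_normal(2 * 16000)  # stand-in for a 2 s major utterance
minor = rng.standard_normal(2 * 16000)  # stand-in for a minor utterance
mixed, (start, end) = overlap_augment(major, minor, rng=rng)
```

The major speaker's label is kept for the augmented utterance, so the extractor learns to stay on the dominant speaker when a short overlap occurs.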

Speaker change augmentation
- replaces a random region of a major speaker's utterance with a scaled and cropped minor speaker's utterance
- first select the type of speaker change among three types:
  - (i) major to minor speaker
  - (ii) minor to major speaker
  - (iii) major to minor to major speaker
    - two different speaker changes occur consecutively
- uses a lower maximum duration of 300 ms than overlapped speech augmentation
  - prevents excessive removal of the major speaker's information
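
Speaker change augmentation could be sketched along these lines. Several details are assumptions made for the example and are not stated in the post: type (i) is modelled by replacing the end of the utterance, type (ii) by replacing the beginning, and type (iii) by replacing an interior region. The 200 ms minimum duration, the power-matching gain, the 16 kHz sample rate, and the function name are also hypothetical.

```python
import numpy as np

CHANGE_TYPES = ("major_to_minor", "minor_to_major", "major_minor_major")


def speaker_change_augment(major, minor, sample_rate=16000,
                           min_s=0.2, max_s=0.3, rng=None):
    """Replace a region of a major utterance with a minor-speaker snippet.

    min_s, the region placement per change type, and the power-matching
    gain are assumptions for this sketch; only the 300 ms maximum comes
    from the post.
    """
    rng = np.random.default_rng() if rng is None else rng
    major = major.astype(np.float64)
    minor = minor.astype(np.float64)

    change_type = str(rng.choice(CHANGE_TYPES))
    dur = int(rng.uniform(min_s, max_s) * sample_rate)
    if len(major) <= dur + 2 or len(minor) < dur:
        raise ValueError("utterances are too short for the requested region")

    # Crop a random region of the minor speaker and match the major's power.
    m_start = rng.integers(0, len(minor) - dur + 1)
    crop = minor[m_start:m_start + dur]
    gain = np.sqrt((np.mean(major ** 2) + 1e-12) / (np.mean(crop ** 2) + 1e-12))

    if change_type == "major_to_minor":      # (i) minor speaker at the end
        start = len(major) - dur
    elif change_type == "minor_to_major":    # (ii) minor speaker at the start
        start = 0
    else:                                    # (iii) minor speaker in the middle
        start = int(rng.integers(1, len(major) - dur))

    out = major.copy()
    out[start:start + dur] = gain * crop     # replace, not add
    return out, change_type, (start, start + dur)


rng = np.random.default_rng(0)
major = rng.standard_normal(2 * 16000)  # stand-in for a 2 s major utterance
minor = rng.standard_normal(2 * 16000)  # stand-in for a minor utterance
changed, change_type, (start, end) = speaker_change_augment(major, minor, rng=rng)
```

Unlike the overlap case, the region is overwritten rather than summed, which is why a short maximum duration matters: a longer replaced region would remove too much of the major speaker's speech.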

Experiment

- three state-of-the-art models (RawNet3, ECAPA-TDNN, and MFA-Conformer)
- evaluated on three datasets (AMI, DIHARD3, and VoxConverse)
- the two proposed data augmentation techniques (overlapped speech and speaker change) are applied

