Discriminative Neural Clustering (DNC)
[Paper] [Code]
2019, SLT conference (Best paper)
Abstract
- Data clustering with a known maximum number of clusters
- No explicit definition of a similarity measure
  - Traditional methods use cosine similarity
- Transformer architecture
- Data scarcity (the AMI dataset has only 147 complete meetings)
- Three data augmentation techniques
  - Sub-sequence randomization
  - Input vector randomization
  - Diaconis augmentation
    - Generates new samples by rotating L2-normalized speaker embeddings
- Assumptions
  - Maximum number of speakers is known
  - Perfect VAD
  - Non-overlapping speech
- Uses extracted d-vectors as input
Introduction
DNC
- Assumptions
  - The maximum number of clusters is known
- As long as each cluster is associated with a unique identity label, permuting the cluster labels should not affect the clustering outcome
- The task of clustering can be considered as a special seq2seq classification problem
  - Each input vector x_i has an underlying identity z_i
  - The model attempts to assign each x_i a cluster label y_i
  - Inputs with the same identity z_i receive the same cluster label y_i
- The targets are not the absolute identities z_i assigned to each x_i, but the relative cluster labels y_(1:N) across X (see the sketch below)
- Multiple data samples (X, z_(1:N)) have to be available for training
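To make "relative cluster labels" concrete, here is a minimal Python sketch (illustrative, not the paper's code) that converts absolute identities z_(1:N) into labels y_(1:N) numbered by order of first appearance, so any permutation of the identity labels yields the same target sequence:

```python
def relative_labels(identities):
    """Map absolute identities to relative cluster labels,
    e.g. ['spk_B', 'spk_B', 'spk_A'] -> [0, 0, 1]."""
    first_seen = {}
    labels = []
    for z in identities:
        if z not in first_seen:
            first_seen[z] = len(first_seen)  # next unused cluster index
        labels.append(first_seen[z])
    return labels

print(relative_labels(["spk_B", "spk_B", "spk_A", "spk_B", "spk_C"]))
# [0, 0, 1, 0, 2]
```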
Data Augmentation for DNC
Two objectives
- Generate as many training sequences (X, y_(1:N)) as possible => sub-sequence randomization
- Match the true data distribution P(X, y_(1:N)) as closely as possible => input vector randomization
Sub-sequence randomization
- Multiple sub-sequences (X_(s:e), y_(s:e)) with random start and end indexes
- In DNC, more input sub-sequences mean the same x_i can be mapped to different y_i depending on which sub-sequence it falls in
- This prevents x_i from being tied to a fixed cluster label (see the sketch below)
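A minimal sketch of sub-sequence randomization, reusing relative_labels from the sketch above; the minimum length is an assumed knob, not a value from the paper:

```python
import random

def random_subsequence(X, z, min_len=50):
    """Sample a random window (start:end) from one meeting and rebuild
    relative cluster labels for that window, so the same x_i can end up
    with a different y_i in different sub-sequences."""
    n = len(X)
    start = random.randrange(0, n - min_len + 1)
    end = random.randrange(start + min_len, n + 1)
    return X[start:end], relative_labels(z[start:end])  # labels restart at 0
```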
Input vector randomization
- Preserves the cluster label sequence
- Each cluster is reassigned to an identity randomly chosen from the training set z_(1:N)
- For each z_i, a feature vector is randomly chosen as x_i
- The identities and vectors can be sampled either from a single meeting or from the whole training set
- The paper therefore evaluates two variants, global and meeting (see the sketch below)
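A minimal sketch of input vector randomization; the dict layout and function name are assumptions for illustration:

```python
import random

def randomize_inputs(y, embeddings_by_speaker):
    """Keep the cluster label sequence y (relative labels 0..K-1), but
    re-sample the inputs: each cluster index is mapped to a randomly
    chosen speaker, and each position draws one of that speaker's
    d-vectors. `embeddings_by_speaker` maps speaker_id -> list of
    d-vectors; fill it from one meeting ('meeting' variant) or from
    the whole training set ('global' variant)."""
    speakers = list(embeddings_by_speaker)         # needs >= K speakers
    chosen = random.sample(speakers, len(set(y)))  # one speaker per cluster
    return [random.choice(embeddings_by_speaker[chosen[label]]) for label in y]
```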
Diaconis Augmentation (Diac-Aug)
- Applicable when the x_i are L2-normalized
- The inputs form clusters on the surface of a hypersphere whose radius is the L2 norm
- Rotates the entire input sequence to a different region of the hypersphere
- Effect: produces unseen x_i', i.e. a new training pair (X', y_(1:N))
- Random rotation matrix R ∈ R^(D×D), X' = XR
- Prevents the model from overfitting (see the sketch below)
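A minimal sketch of Diac-Aug, assuming X is an N×D matrix of L2-normalized embeddings; sampling the rotation via QR decomposition is one standard construction, not necessarily the paper's exact recipe:

```python
import numpy as np

def diac_aug(X, rng=np.random.default_rng()):
    """Rotate an entire L2-normalized embedding sequence X (N x D) to a
    different region of the hypersphere: X' = X @ R with R a random
    D x D rotation. Cluster geometry and labels y_(1:N) are unchanged."""
    D = X.shape[1]
    Q, R = np.linalg.qr(rng.standard_normal((D, D)))
    Q *= np.sign(np.diag(R))   # sign fix for a uniform (Haar) orthogonal matrix
    if np.linalg.det(Q) < 0:   # flip one column to get a proper rotation (det +1)
        Q[:, 0] = -Q[:, 0]
    return X @ Q
```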
5.1. Data and Segmentation
- AMI meeting corpus
- Official train, dev, and eval splits
- 8-channel audio converted to 1 channel with BeamformIt [32]
- Assumption: perfect VAD
- Manual segmentation, with silence stripped at both ends of each utterance
- Done this way to allow a performance comparison with spectral clustering
- Short segments enclosed within longer ones are removed (unrepresentative for output generation)
5.2. Segment-level Speaker Embeddings
Segment-level embeddings (on which clustering is performed)
- x_i is obtained by averaging the window-level speaker embeddings
- L2-normalization
  - Applied both before and after averaging (see the sketch below)
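A minimal sketch of the segment-level pooling described above, assuming window-level d-vectors arrive as a (num_windows, D) array:

```python
import numpy as np

def segment_embedding(window_embeddings):
    """Average window-level d-vectors into one segment-level x_i,
    L2-normalizing both before and after the average."""
    W = np.asarray(window_embeddings, dtype=float)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)  # normalize each window
    x = W.mean(axis=0)
    return x / np.linalg.norm(x)                      # normalize the average
```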
Window-level embedding generator
- Uses a TDNN over 2-second windows (215 frames, context [-107, 106])
- Each DNN layer inside the TDNN uses a context of [-7, 7]
- TDNN output vectors are combined as in [14]
- Trained on AMI training data with angular softmax (ASoftmax); a layer sketch follows below
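A minimal sketch of a single TDNN layer with a [-7, 7] frame context, realized as a 1-D convolution; the feature and channel dimensions (40, 512) are assumptions, and the paper's full architecture follows [14]:

```python
import torch
import torch.nn as nn

# One TDNN layer: kernel_size = 15 covers frames [-7, +7] around each step.
tdnn_layer = nn.Sequential(
    nn.Conv1d(in_channels=40, out_channels=512, kernel_size=15),
    nn.ReLU(),
)

frames = torch.randn(1, 40, 215)  # (batch, feature_dim, frames): one 2 s window
out = tdnn_layer(frames)
print(out.shape)                  # torch.Size([1, 512, 201]) = 215 - 14 frames
```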
DNC model
- Transformer (implemented with ESPnet)
- 4 encoder & 4 decoder layers (7.3M parameters)
- 4 attention heads
- Adam (learning rate ramps up from 0 to 12 over the first 40,000 steps, then decreases)
- Dropout 10%
- Diagonal local attention
  - The input-to-output alignment is strictly one-to-one and monotonic
  - Source attention can therefore be restricted to an identity matrix
  - In practice, the attention matrix is masked to be a tri-diagonal matrix (see the sketch below)
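A minimal sketch of the tri-diagonal mask (illustrative; the actual masking lives inside the ESPnet attention code):

```python
import numpy as np

def tridiagonal_mask(n):
    """Boolean (n x n) mask allowing each output position to attend only
    to its own input position and the two immediate neighbours."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= 1

print(tridiagonal_mask(5).astype(int))
# [[1 1 0 0 0]
#  [1 1 1 0 0]
#  [0 1 1 1 0]
#  [0 0 1 1 1]
#  [0 0 0 1 1]]
```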
Experiments