
- (Mitigates the large-scale data-collection problem) With supervised DNN modeling, speaker-discriminative features can be extracted well even from a moderate amount of data
- (Confirms gains from augmentation) Applying augmentation to the data both improved speaker-feature extraction and strengthened robustness
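As a minimal sketch of the kind of augmentation meant here, additive noise can be mixed into a waveform at a target SNR. The noise source, SNR value, and helper name below are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def add_noise(signal, noise, snr_db):
    """Mix noise into signal at the requested signal-to-noise ratio (dB)."""
    # Trim or tile the noise to match the signal length
    noise = np.resize(noise, signal.shape)
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(signal_power / scaled_noise_power) == snr_db
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
noisy = add_noise(clean, rng.normal(size=8000), snr_db=10)
```

In practice the paper-style augmentation also includes reverberation and music/babble noise; the same scaling idea applies per noise source.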
A traditional GMM-UBM based i-vector system was used as the baseline.
Base MFCCs (20 dims) + deltas (first-order differences of the features, 20 dims) + acceleration (first-order differences of the deltas, 20 dims) = 60-dim features
Procedure
- Extract MFCC, delta, and acceleration coefficients from each speech sample for a combined 60-dim feature vector
- Train a UBM with 2048 components
- Compute Baum-Welch statistics for each speaker
- Extract 600-dim i-vectors from the statistics
- Train a PLDA classifier on the i-vectors and evaluate
Code example

- Loading audio

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def load_audio(file_path):
    # Keep the file's native sampling rate (sr=None)
    signal, sr = librosa.load(file_path, sr=None)
    return signal, sr
```

- Feature extraction
```python
def extract_features(signal, sr):
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)
    delta = librosa.feature.delta(mfcc)            # first-order differences
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (acceleration)
    features = np.vstack((mfcc, delta, delta2))    # (60, T)
    return features.T                              # (T, 60)
```

- UBM training
```python
def train_ubm(features, n_components=2048):
    # Diagonal covariances: fitting a 2048-component full-covariance GMM
    # on 60-dim features is impractical; diagonal UBMs are standard practice
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag', max_iter=100)
    gmm.fit(features)
    return gmm
```

- Baum-Welch statistics
```python
def compute_baum_welch_statistics(gmm, features):
    responsibilities = gmm.predict_proba(features)   # (T, C) posteriors
    n_k = np.sum(responsibilities, axis=0)           # zeroth-order stats, (C,)
    f_k = np.dot(responsibilities.T, features)       # first-order stats, (C, D)
    s_k = np.dot(responsibilities.T, features ** 2)  # second-order stats, (C, D)
    return n_k, f_k, s_k
```

- i-vector extraction
```python
def extract_i_vector(gmm, n_k, f_k, total_variability_matrix):
    # total_variability_matrix T has shape (C*D, R), e.g. R = 600;
    # training T via EM is not shown here
    T = total_variability_matrix
    C, D = gmm.means_.shape
    sigma_inv = 1.0 / gmm.covariances_.flatten()  # diagonal Sigma^-1, (C*D,)
    N = np.repeat(n_k, D)                         # occupancy per supervector dim
    # Center the first-order stats around the UBM means, flatten to a supervector
    centered = (f_k - n_k[:, None] * gmm.means_).flatten()
    # Posterior mean: w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 f~
    TtSi = T.T * sigma_inv                        # (R, C*D)
    precision = np.eye(T.shape[1]) + np.dot(TtSi * N, T)
    ivector = np.linalg.solve(precision, np.dot(TtSi, centered))
    return ivector
```

- PLDA modeling & evaluation
```python
def train_plda(vectors, labels):
    # Simplified backend: LDA projection + a single Gaussian, not a full PLDA model
    scaler = StandardScaler()
    vectors = scaler.fit_transform(vectors)
    lda = LinearDiscriminantAnalysis(n_components=150)
    vectors_lda = lda.fit_transform(vectors, labels)
    mean_vector = np.mean(vectors_lda, axis=0)
    cov_matrix = np.cov(vectors_lda, rowvar=False)
    return mean_vector, cov_matrix, scaler, lda

def plda_score(test_vector, mean_vector, cov_matrix, scaler, lda):
    test_vector = scaler.transform([test_vector])
    test_vector_lda = lda.transform(test_vector)
    # Similarity score (simple Mahalanobis-distance example)
    diff = test_vector_lda - mean_vector
    score = -np.dot(np.dot(diff, np.linalg.inv(cov_matrix)), diff.T)
    return score.item()
```
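As a self-contained toy illustration of the verification decision itself, cosine scoring (a common, simpler stand-in for the PLDA backend) on synthetic embeddings shows a target trial scoring above a non-target trial. The dimensionality and noise scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy dimensionality standing in for 600-dim i-vectors

def cosine_score(a, b):
    # Cosine similarity between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two hypothetical speakers with well-separated embedding means
spk_a, spk_b = rng.normal(size=dim) * 3, rng.normal(size=dim) * 3
enroll_a = spk_a + rng.normal(scale=0.1, size=dim)  # enrollment utterance, speaker A
test_a = spk_a + rng.normal(scale=0.1, size=dim)    # target trial (same speaker)
test_b = spk_b + rng.normal(scale=0.1, size=dim)    # non-target trial
```

With small within-speaker spread, `cosine_score(enroll_a, test_a)` lands near 1.0 while the non-target score does not, so thresholding the score separates the two trial types.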

```python
import torch
import torch.nn as nn

class XVectorModel(nn.Module):
    def __init__(self, input_dim=120, num_speakers=100):
        super().__init__()
        # Frame-level TDNN layers; dilation implements the {t-2, t, t+2} and
        # {t-3, t, t+3} frame splicing of the architecture without an explicit
        # splice step
        self.frame1 = nn.Conv1d(input_dim, 512, kernel_size=5, padding=2)
        self.frame2 = nn.Conv1d(512, 512, kernel_size=3, dilation=2, padding=2)
        self.frame3 = nn.Conv1d(512, 512, kernel_size=3, dilation=3, padding=3)
        self.frame4 = nn.Conv1d(512, 512, kernel_size=1)
        self.frame5 = nn.Conv1d(512, 1500, kernel_size=1)
        # Segment-level layers: 1500-dim mean + 1500-dim std = 3000-dim input
        self.segment6 = nn.Linear(3000, 512)
        self.segment7 = nn.Linear(512, 512)
        # Output layer for training (assuming num_speakers training speakers)
        self.output = nn.Linear(512, num_speakers)

    def forward(self, x):
        # x: (batch, input_dim, T)
        x = torch.relu(self.frame1(x))
        x = torch.relu(self.frame2(x))
        x = torch.relu(self.frame3(x))
        x = torch.relu(self.frame4(x))
        x = torch.relu(self.frame5(x))
        # Statistics pooling: per-channel mean and std over the time axis
        mean = x.mean(dim=2)
        std = x.std(dim=2)
        stats = torch.cat((mean, std), dim=1)
        # At test time the x-vector embedding is taken from segment6's output
        x = torch.relu(self.segment6(stats))
        x = torch.relu(self.segment7(x))
        return self.output(x)
```
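The statistics-pooling step, which turns a variable-length frame sequence into the fixed 3000-dim segment representation, can be sketched in isolation (the batch size and frame count below are illustrative):

```python
import torch

# Frame-level output: batch of 4 utterances, 1500 channels, T = 200 frames
frames = torch.randn(4, 1500, 200)

# Statistics pooling: per-channel mean and standard deviation over time
mean = frames.mean(dim=2)           # (4, 1500)
std = frames.std(dim=2)             # (4, 1500)
stats = torch.cat((mean, std), dim=1)

print(stats.shape)  # torch.Size([4, 3000])
```

Because the pooling reduces over the time axis, utterances of any length map to the same 3000-dim vector, which is what lets the segment-level layers use fixed-size `nn.Linear` weights.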



