YAMNet (audio event classifier)

배배토 · February 7, 2025

YAMNet audio-event-classifier kaggle tensorflow

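Anyone interested in data analysis or predictive modeling has probably heard of Kaggle.

On Kaggle you can enter big-data analysis competitions hosted by companies, create your own kernels to analyze datasets and make predictions, and discuss your analyses with experts in its active communities.

You can also download finished models and try them out, and YAMNet is one of those models.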

yamnet
: An audio event classifier trained on the AudioSet dataset to predict audio events from the AudioSet ontology.

In short, YAMNet is an audio event classifier.

If you have a Kaggle API key, it looks like the following code is all you need to download the model.

import kagglehub

# Download latest version
path = kagglehub.model_download("google/yamnet/tensorFlow2/yamnet")

print("Path to model files:", path)

Apart from that, here is how to load the model in TensorFlow 2 (or TensorFlow 1) and use YAMNet.

Load the model from TensorFlow Hub.

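import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import csv
import io
import scipy.signal  # used by ensure_sample_rate below
import matplotlib.pyplot as plt  # used in the visualization section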

# Load the model.
model = hub.load('https://www.kaggle.com/models/google/yamnet/TensorFlow2/yamnet/1')

Read the label file at model.class_map_path() into the class_names variable.

def class_names_from_csv(class_map_csv_text):
  """Returns list of class names corresponding to score vector."""
  class_map_csv = io.StringIO(class_map_csv_text)
  class_names = [display_name for (class_index, mid, display_name) in csv.reader(class_map_csv)]
  class_names = class_names[1:]  # Skip CSV header
  return class_names

class_map_path = model.class_map_path().numpy()
class_names = class_names_from_csv(tf.io.read_file(class_map_path).numpy().decode('utf-8'))

# Sanity check: run the model on 3 seconds of silence so that scores is defined.
scores, embeddings, spectrogram = model(np.zeros(3 * 16000, dtype=np.float32))
# Find the name of the class with the top score when mean-aggregated across frames.
print(class_names[scores.numpy().mean(axis=0).argmax()])  # Should print 'Silence'.
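
To see what the label parsing does without loading the model, here is a minimal sketch that parses a tiny hand-written CSV in the same three-column layout (index, mid, display_name). The sample rows and the helper name display_names are only illustrative assumptions; the real file is whatever model.class_map_path() points to. It uses csv.DictReader, so the header is consumed automatically instead of being sliced off.

```python
import csv
import io

# Illustrative sample only: the real file is the one at model.class_map_path().
SAMPLE_CLASS_MAP = (
    "index,mid,display_name\n"
    "0,/m/09x0r,Speech\n"
    '1,/m/0ytgt,"Child speech, kid speaking"\n'
)

def display_names(class_map_csv_text):
    """Return display names in row order, using the header to find the column."""
    reader = csv.DictReader(io.StringIO(class_map_csv_text))
    return [row["display_name"] for row in reader]

names = display_names(SAMPLE_CLASS_MAP)
print(names)
```

Because the csv module handles quoting, a display name that contains a comma still comes back as a single entry.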

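Add a function that checks whether the loaded audio has the expected sample_rate (16 kHz) and resamples it if not. (This step is required, because a mismatched sample rate affects the model's results.)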

def ensure_sample_rate(original_sample_rate, waveform,
                       desired_sample_rate=16000):
  """Resample waveform if required."""
  if original_sample_rate != desired_sample_rate:
    desired_length = int(round(float(len(waveform)) /
                               original_sample_rate * desired_sample_rate))
    waveform = scipy.signal.resample(waveform, desired_length)
  return desired_sample_rate, waveform
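
The length calculation inside ensure_sample_rate is easy to check by hand. The sketch below pulls the same formula into a small helper (resampled_length is a name made up for this example) and applies it to 3 seconds of audio recorded at 44.1 kHz.

```python
def resampled_length(num_samples, original_sample_rate, desired_sample_rate=16000):
    """Number of samples after resampling, same formula as ensure_sample_rate."""
    return int(round(float(num_samples) / original_sample_rate * desired_sample_rate))

# 3 seconds of audio recorded at 44.1 kHz.
length_44k = 3 * 44100
length_16k = resampled_length(length_44k, 44100)
print(length_44k, "->", length_16k)
```

The duration stays at 3 seconds, so the 132300 input samples become 48000 samples at 16 kHz.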

Download a sound file and get it ready for use (the audio file must be a mono wav file with a 16 kHz sample rate)
++ wav_data must be normalized to values in [-1.0, 1.0]. See the official documentation.

curl -O https://storage.googleapis.com/audioset/speech_whistling2.wav
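
The mono and [-1.0, 1.0] requirements above can be sketched with NumPy alone. The helper name to_mono_float32 is hypothetical, and the sketch assumes 16-bit PCM input such as the array scipy.io.wavfile.read returns for a typical wav file. It divides by 32768 so that the most negative int16 value maps exactly to -1.0; dividing by 32767 is another common choice.

```python
import numpy as np

def to_mono_float32(wav_data):
    """Convert int16 PCM samples to a mono float32 waveform in [-1.0, 1.0]."""
    samples = np.asarray(wav_data)
    if samples.dtype != np.int16:
        raise ValueError("this sketch only handles 16-bit PCM input")
    if samples.ndim == 2:
        # Shape is (num_samples, num_channels): average the channels into mono.
        samples = samples.astype(np.float32).mean(axis=1)
    # Scale by the int16 range so every value falls within [-1.0, 1.0].
    return (samples / 32768.0).astype(np.float32)

# Two stereo samples as a stand-in for real audio.
stereo = np.array([[32767, -32768], [16384, 16384]], dtype=np.int16)
waveform = to_mono_float32(stereo)
print(waveform, waveform.dtype)
```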

Running the model
: Call the model on the prepared data to get the scores, embeddings, and spectrogram for that data.
(scores is the main result for the data, and spectrogram is used later for visualization)

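# Read the downloaded file, resample to 16 kHz if needed, and normalize to [-1.0, 1.0].
from scipy.io import wavfile
sample_rate, wav_data = wavfile.read('speech_whistling2.wav')
sample_rate, wav_data = ensure_sample_rate(sample_rate, wav_data)
waveform = wav_data / tf.int16.max

# Run the model, check the output.
scores, embeddings, spectrogram = model(waveform)

scores_np = scores.numpy()
spectrogram_np = spectrogram.numpy()
inferred_class = class_names[scores_np.mean(axis=0).argmax()]
print(f'The main sound is: {inferred_class}')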
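
The "mean across frames, then argmax" step is easier to follow on a small made-up example. The sketch below uses a toy score matrix with 3 frames and 4 classes (real YAMNet scores have 521 class columns); the class names and numbers are invented purely for illustration.

```python
import numpy as np

# Made-up scores for 3 frames and 4 classes (real YAMNet output has 521 classes).
toy_class_names = ["Speech", "Whistling", "Music", "Silence"]
toy_scores = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.2, 0.7, 0.1, 0.0],
    [0.1, 0.8, 0.0, 0.1],
])

mean_scores = toy_scores.mean(axis=0)            # one averaged score per class
top_indices = np.argsort(mean_scores)[::-1][:2]  # two highest classes, best first
for i in top_indices:
    print(f"{toy_class_names[i]}: {mean_scores[i]:.2f}")
```

Speech wins the first frame, but Whistling has the higher average over the whole clip, so it is the class the aggregated prediction reports.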

Visualization
YAMNet also returns some additional information that can be used for visualization. The code below lets you look at the waveform, the spectrogram, and the inferred classes.

plt.figure(figsize=(10, 6))

# Plot the waveform.
plt.subplot(3, 1, 1)
plt.plot(waveform)
plt.xlim([0, len(waveform)])

# Plot the log-mel spectrogram (returned by the model).
plt.subplot(3, 1, 2)
plt.imshow(spectrogram_np.T, aspect='auto', interpolation='nearest', origin='lower')

# Plot and label the model output scores for the top-scoring classes.
mean_scores = np.mean(scores_np, axis=0)
top_n = 10
top_class_indices = np.argsort(mean_scores)[::-1][:top_n]
plt.subplot(3, 1, 3)
plt.imshow(scores_np[:, top_class_indices].T, aspect='auto', interpolation='nearest', cmap='gray_r')

# patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
# values from the model documentation
patch_padding = (0.025 / 2) / 0.01
plt.xlim([-patch_padding-0.5, scores.shape[0] + patch_padding-0.5])
# Label the top_N classes.
yticks = range(0, top_n, 1)
plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])
_ = plt.ylim(-0.5 + np.array([top_n, 0]))
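
When reading the score plot, it helps to know which stretch of audio each score frame covers. The sketch below assumes the frame timing described in the YAMNet documentation, namely one score frame per 0.96 s of audio with a 0.48 s hop; the constant and function names are made up for this example, so check the values against the documentation for the model version you use.

```python
import numpy as np

# Assumed from the YAMNet documentation: 0.96 s score frames with a 0.48 s hop.
SCORE_FRAME_SECONDS = 0.96
SCORE_HOP_SECONDS = 0.48

def frame_time_ranges(num_frames):
    """Return an array of (start, end) times in seconds, one row per score frame."""
    starts = np.arange(num_frames) * SCORE_HOP_SECONDS
    return np.stack([starts, starts + SCORE_FRAME_SECONDS], axis=1)

ranges = frame_time_ranges(3)
print(ranges)
```

With the real output, frame_time_ranges(scores.shape[0]) would give the time span behind each column of the bottom plot.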

In the next post, let's take a look at librosa! Go ahead and try YAMNet yourself with the official documentation as a guide ~.~
