Paper review: <Sampling Frequency Independent Dialogue Separation>

이용준 · September 18, 2025

Sampling Frequency Speech separation


Summary

A model trained at a single sampling rate (sampling frequency) can be applied to audio at different sampling rates.

Methods

Sampling frequency independence (SFI) can be achieved when two conditions are met.

First, for a given waveform, the window length and the hop size are set as a constant number of seconds, not a constant number of samples.

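Example: A is an audio sample with an 8 kHz sampling frequency.
B is an audio sample with a 16 kHz sampling frequency.

Set window size = 20 ms and hop size = 4 ms, then convert both to sample counts according to the sampling frequency. At 8 kHz this gives a 160-sample window and a 32-sample hop; at 16 kHz it gives a 320-sample window and a 64-sample hop.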
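The conversion from milliseconds to samples can be sketched as follows. This is only an illustration of the arithmetic, not the paper's code: the function names `stft_params` and `num_frames` are hypothetical, and the frame count assumes an STFT without padding.

```python
def stft_params(sample_rate_hz, window_ms=20.0, hop_ms=4.0):
    """Convert window/hop lengths given in milliseconds to sample counts.

    Hypothetical helper for illustration; not taken from the paper.
    """
    win = round(sample_rate_hz * window_ms / 1000)
    hop = round(sample_rate_hz * hop_ms / 1000)
    # A real-valued frame of `win` samples has win // 2 + 1 one-sided bins.
    n_bins = win // 2 + 1
    return win, hop, n_bins


def num_frames(duration_s, sample_rate_hz, window_ms=20.0, hop_ms=4.0):
    """Number of full frames, assuming no padding at the signal edges."""
    win, hop, _ = stft_params(sample_rate_hz, window_ms, hop_ms)
    n_samples = round(duration_s * sample_rate_hz)
    return 1 + (n_samples - win) // hop


for sr in (8000, 16000, 48000):
    win, hop, n_bins = stft_params(sr)
    frames = num_frames(1.0, sr)
    print(f"{sr} Hz: window={win}, hop={hop}, bins={n_bins}, frames={frames}")
```

For a one-second clip, the frame count is identical at every sampling rate, while the number of frequency bins grows with the sampling rate.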

Second, the operations applied after the STFT must be independent of the channel size.

With the settings above, the number of time steps stays the same across sampling rates, but the number of frequency bins differs. A fully connected (FC) layer therefore cannot be applied directly, because the frequency dimension (= channel) is not the same across inputs. An operation such as a 1D convolution or pooling along the frequency axis should be applied first.
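As a minimal sketch of why pooling along the frequency axis helps, the snippet below averages any number of frequency bins down to a fixed number of outputs, so that a fixed-size layer could follow. The function `adaptive_avg_pool` and the output size of 8 are assumptions made for illustration; the review does not state which pooling or convolution the paper actually uses.

```python
def adaptive_avg_pool(values, out_size):
    """Average-pool a 1D list of frequency-bin values to exactly `out_size` outputs.

    Illustrative helper (not from the paper). Output i averages the input
    range [floor(i*n/out_size), ceil((i+1)*n/out_size)).
    """
    n = len(values)
    pooled = []
    for i in range(out_size):
        start = (i * n) // out_size
        end = -(-((i + 1) * n) // out_size)  # ceiling division
        segment = values[start:end]
        pooled.append(sum(segment) / len(segment))
    return pooled


# One STFT frame at 8 kHz (81 bins) and one at 16 kHz (161 bins),
# filled with dummy magnitudes.
frame_8k = [1.0] * 81
frame_16k = [1.0] * 161

# Both frames end up with the same fixed size.
print(len(adaptive_avg_pool(frame_8k, 8)), len(adaptive_avg_pool(frame_16k, 8)))
```

The pooling has no parameters tied to the number of input bins, which is the property the second condition asks for.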

Results

In the results table, CNN 8 kHz denotes a CNN model trained at an 8 kHz sampling rate, and CNN 48 kHz denotes a CNN model trained at a 48 kHz sampling rate. CNN 8->48 kHz denotes a CNN trained at 8 kHz and only run for inference at 48 kHz. Surprisingly, its performance is better!

Discussions

This approach can be used to adapt a model trained at one sampling frequency to data recorded at other sampling frequencies.
