Introduction
- Real-time speech processing remains challenging.
- Open issues: is the Fourier (STFT) decomposition really the best front-end? Conventional approaches predict only the source magnitude spectrogram and reuse the mixture's phase for reconstruction, which leads to sub-optimal results.
- STFT-based speech separation also needs a relatively long analysis window, which adds latency and hinders real-time processing.
$x(t) = \sum_{i=1}^{C} s_i(t)$
$x(t)$ is the mixture signal, $s_i(t)$ is the $i$-th clean source signal, and $C$ is the number of sources.
First, segment the mixture and the clean sources into K non-overlapping vectors of length L samples.
$x(t) = [x_1, x_2, \dots, x_K], \quad x_k \in \mathbb{R}^{L}$
$s(t) = [s_1, s_2, \dots, s_K], \quad s_k \in \mathbb{R}^{L}$
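As a concrete illustration, a minimal NumPy sketch of the mixing and segmentation step; the values of K and L and the random sources are assumptions for demonstration only.

```python
import numpy as np

# Illustrative sizes only (not the paper's configuration)
L = 40          # segment length in samples
K = 100         # number of non-overlapping segments

# Two hypothetical clean sources; the mixture is their sum: x(t) = sum_i s_i(t)
s1 = np.random.randn(K * L)
s2 = np.random.randn(K * L)
x = s1 + s2

# Segment mixture and sources into K non-overlapping length-L vectors
x_seg = x.reshape(K, L)     # x = [x_1, ..., x_K], x_k in R^L
s1_seg = s1.reshape(K, L)
s2_seg = s2.reshape(K, L)
```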
Then, we represent $x$ and each $s_i$ as a nonnegative weighted sum of $N$ basis signals $B = [b_1, b_2, \dots, b_N]$.
$x = wB$
$s_i = d_i B$
$B \in \mathbb{R}^{N \times L}$ (the basis signals as rows) and the weight vectors $w, d_i \in \mathbb{R}^{1 \times N}$ are all non-negative, with $B$ learned jointly with the network. Intuitively, $w = \sum_i d_i$. Since the weights are non-negative, we can write $d_i = m_i \odot w$, where $m_i \in [0, 1]^{1 \times N}$ is a mask vector for source $i$ and $\sum_i m_i = 1$.
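A small sketch of this weighted-basis view and the mask relation, assuming illustrative sizes N and L and random (rather than learned) non-negative matrices:

```python
import numpy as np

N, L = 500, 40                       # number of basis signals, segment length (assumed)
B = np.abs(np.random.randn(N, L))    # basis signals as rows (learned in practice)

# Non-negative mixture weights for one segment: x = w B
w = np.abs(np.random.randn(1, N))
x = w @ B                            # mixture segment, shape (1, L)

# Per-source masks m_i in [0, 1] that sum to one across sources
m1 = np.random.rand(1, N)
m2 = 1.0 - m1

# Source weights are the masked mixture weights: d_i = m_i ⊙ w, so sum_i d_i = w
d1, d2 = m1 * w, m2 * w
s1_hat, s2_hat = d1 @ B, d2 @ B      # per-source segments: s_i = d_i B
assert np.allclose(d1 + d2, w)
```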
Methods
First, an encoder estimates $w$ for each segment. The $K$ weight vectors $w$ are then fed to a source-estimation network (an LSTM) that outputs $K$ mask vectors $m_i$ for each source $i$.
A decoder then reconstructs the source waveforms from the masked weights $d_i = m_i \odot w$ and the basis signals $B$.
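A minimal PyTorch sketch of this encoder / LSTM separator / decoder pipeline; the class name, layer sizes, and the use of plain linear layers for the encoder and decoder are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TinyTasNetSketch(nn.Module):
    """Illustrative encoder / LSTM separator / decoder sketch (sizes are assumed)."""
    def __init__(self, L=40, N=500, hidden=256, num_sources=2):
        super().__init__()
        self.num_sources = num_sources
        self.N = N
        # Encoder: non-negative mixture weights w for each length-L segment
        self.encoder = nn.Sequential(nn.Linear(L, N), nn.ReLU())
        # Separator: LSTM over the K segments, emits one mask per source
        self.lstm = nn.LSTM(N, hidden, num_layers=2, batch_first=True)
        self.mask_out = nn.Linear(hidden, num_sources * N)
        # Decoder: basis signals B mapping masked weights back to waveforms
        self.decoder = nn.Linear(N, L, bias=False)

    def forward(self, segments):             # segments: (batch, K, L)
        w = self.encoder(segments)            # (batch, K, N), non-negative
        h, _ = self.lstm(w)                   # (batch, K, hidden)
        masks = self.mask_out(h)              # (batch, K, num_sources * N)
        masks = masks.view(*masks.shape[:2], self.num_sources, self.N)
        masks = torch.softmax(masks, dim=2)   # masks sum to 1 across sources
        d = masks * w.unsqueeze(2)            # d_i = m_i ⊙ w
        return self.decoder(d)                # (batch, K, num_sources, L)
```

Calling `TinyTasNetSketch()(segments)` with `segments` of shape `(batch, K, L)` returns per-source segment estimates of shape `(batch, K, num_sources, L)`.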
The three modules are trained jointly, because the time-domain output lets us directly optimize a source-to-distortion objective (SI-SDR here).
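For reference, a sketch of an SI-SDR computation that could serve as the training objective (negated as a loss); the function name and tensor layout are assumptions:

```python
import torch

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR between an estimated and a reference waveform (last dim = time)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to get the scale-invariant target component
    scale = (estimate * target).sum(dim=-1, keepdim=True) / (
        (target ** 2).sum(dim=-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    ratio = (s_target ** 2).sum(dim=-1) / ((e_noise ** 2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

# Training maximizes SI-SDR, e.g. loss = -si_sdr(estimate, target).mean()
```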