Paper review: TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation

이용준 · February 13, 2025

Introduction

  • Real-time speech processing remains challenging.
  • Two issues: is the Fourier (STFT) decomposition really the best front-end? Conventional approaches predict only the source magnitude and reuse the mixture's phase for reconstruction, which leads to sub-optimal results.
  • STFT-based speech separation typically requires a long analysis window, and the resulting latency hinders real-time processing.

Problem Formulation

x(t) = \sum_{i=1}^{C} s_i(t)

x(t) represents the mixture signal and s_i(t) represents the i-th clean source, with C sources in total.

First, segment the mixture and the clean sources into K non-overlapping vectors of length L samples.

x(t) = [x_1, x_2, ..., x_K], \quad x_k \in \Re^{L} \\ s(t) = [s_1, s_2, ..., s_K], \quad s_k \in \Re^{L}
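This segmentation is just a reshape; a minimal numpy sketch (sizes are illustrative, and trailing samples that don't fill a full segment are dropped):

```python
import numpy as np

def segment(signal: np.ndarray, L: int) -> np.ndarray:
    """Split a 1-D signal into K non-overlapping segments of length L."""
    K = len(signal) // L           # number of full segments; the remainder is dropped
    return signal[:K * L].reshape(K, L)

mixture = np.random.randn(16000)   # e.g. 1 s of audio at 16 kHz
X = segment(mixture, L=40)         # TasNet uses very short segments (a few ms)
print(X.shape)                     # (400, 40)
```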

Then, we represent each segment of x and s as a non-negative weighted sum of N basis signals B = [b_1, b_2, ..., b_N].

x = wB \\ s_i = d_i B

B \in \Re^{N \times L} is a learnable basis matrix, while w, d_i \in \Re^{1 \times N} are non-negative weight vectors estimated by the network. Intuitively, w = \sum_i d_i. Since all weights are non-negative, we can write d_i = m_i \odot w, where m_i is a mask vector in [0, 1]^{N}, the masks sum to one element-wise, and \odot denotes the element-wise product.
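The relation between the mixture weights, the per-source weights, and the masks can be checked numerically; a toy example with made-up values (not a trained network):

```python
import numpy as np

# Per-source non-negative weights d_i over N = 4 basis signals (illustrative values)
d = np.array([[0.2, 0.5, 0.1, 0.0],   # d_1: source 1
              [0.3, 0.0, 0.4, 0.6]])  # d_2: source 2
w = d.sum(axis=0)                     # mixture weights: w = sum_i d_i

# Masks m_i = d_i / w (element-wise), so d_i = m_i * w and the masks sum to 1.
m = d / np.maximum(w, 1e-8)           # guard against division by zero where w = 0
print(m.sum(axis=0))                  # ~1 wherever w > 0
```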

Methods

First, an encoder estimates w for each segment. The K weight vectors w are then fed to a source-estimation network (an LSTM), which outputs K mask vectors m_i for each source i.

A decoder then reconstructs each source waveform via s_i = d_i B.
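A shape-level sketch of the encoder → separator → decoder pipeline, using random untrained matrices; all sizes are illustrative, and the LSTM separator is replaced by a fixed softmax masker for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, N, C = 400, 40, 500, 2             # segments, segment length, bases, sources

X = rng.standard_normal((K, L))          # segmented mixture
U = rng.standard_normal((L, N))          # encoder weights (stand-in for the gated conv)
B = rng.standard_normal((N, L))          # decoder basis signals

w = np.maximum(X @ U, 0.0)               # non-negative mixture weights, w = ReLU(XU)
logits = rng.standard_normal((C, K, N))  # stand-in for the LSTM separator's output
m = np.exp(logits) / np.exp(logits).sum(axis=0)  # softmax over sources: sum_i m_i = 1
d = m * w                                # per-source weights d_i = m_i * w
s_hat = d @ B                            # reconstructed segments, shape (C, K, L)
print(s_hat.shape)                       # (2, 400, 40)
```

Because everything after the segmentation is matrix multiplies and element-wise ops, the whole pipeline is differentiable end-to-end, which is what enables joint training.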

All three modules are trained jointly, because producing waveforms end-to-end lets us use SDR (SI-SDR here) directly as the training objective.
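A compact numpy version of scale-invariant SDR (the standard formulation: project the estimate onto the reference, then compare the target energy to the residual; signal names here are illustrative):

```python
import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB between an estimate and a reference signal."""
    est = est - est.mean()                            # remove DC offsets
    ref = ref - ref.mean()
    target = (est @ ref) / (ref @ ref + eps) * ref    # scaled projection onto ref
    noise = est - target                              # residual orthogonal to ref
    return float(10 * np.log10((target @ target) / (noise @ noise + eps)))

ref = np.sin(np.linspace(0, 100, 16000))
print(si_sdr(3.0 * ref, ref))                         # rescaling barely changes SI-SDR (very high)
print(si_sdr(ref + 0.3 * np.cos(np.linspace(0, 50, 16000)), ref))  # interference lowers it
```

Maximizing SI-SDR as the loss avoids the phase-reuse problem entirely, since the network is scored on the waveform it actually outputs.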
