Introduction
- Real-time speech processing remains challenging.
- Open issues: is the Fourier (STFT) decomposition really the best front-end? Conventional approaches predict only the source magnitude spectrogram and reuse the mixture's phase for reconstruction, which leads to sub-optimal results.
- STFT-based speech separation also needs a relatively long analysis window, which adds latency and hinders real-time processing.
$x(t) = \sum_{i=1}^{C} s_i(t)$
$x(t)$ is the mixture signal, $s_i(t)$ is the $i$-th clean source signal, and $C$ is the number of sources.
First, segment the mixture and the clean sources into K non-overlapping vectors of length L samples.
$x(t) = [x_1, x_2, \dots, x_K], \quad x_k \in \mathbb{R}^{L}$
$s(t) = [s_1, s_2, \dots, s_K], \quad s_k \in \mathbb{R}^{L}$
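As a concrete illustration, a minimal NumPy sketch of the mixing and segmentation step; the values of K and L and the random sources are assumptions for demonstration only.

```python
import numpy as np

# Illustrative sizes only (not the paper's configuration)
L = 40          # segment length in samples
K = 100         # number of non-overlapping segments

# Two hypothetical clean sources; the mixture is their sum: x(t) = sum_i s_i(t)
s1 = np.random.randn(K * L)
s2 = np.random.randn(K * L)
x = s1 + s2

# Segment mixture and sources into K non-overlapping length-L vectors
x_seg = x.reshape(K, L)     # x = [x_1, ..., x_K], x_k in R^L
s1_seg = s1.reshape(K, L)
s2_seg = s2.reshape(K, L)
```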
Then, we represent $x$ and each $s_i$ as a nonnegative weighted sum of $N$ basis signals $B = [b_1, b_2, \dots, b_N]$.
$x = wB$
$s_i = d_i B$
$B \in \mathbb{R}^{N \times L}$ (the basis signals as rows) and the weight vectors $w, d_i \in \mathbb{R}^{1 \times N}$ are all non-negative, with $B$ learned jointly with the network. Intuitively, $w = \sum_i d_i$. Since the weights are non-negative, we can write $d_i = m_i \odot w$, where $m_i \in [0, 1]^{1 \times N}$ is a mask vector for source $i$ and $\sum_i m_i = 1$.
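A small sketch of this weighted-basis view and the mask relation, assuming illustrative sizes N and L and random (rather than learned) non-negative matrices:

```python
import numpy as np

N, L = 500, 40                       # number of basis signals, segment length (assumed)
B = np.abs(np.random.randn(N, L))    # basis signals as rows (learned in practice)

# Non-negative mixture weights for one segment: x = w B
w = np.abs(np.random.randn(1, N))
x = w @ B                            # mixture segment, shape (1, L)

# Per-source masks m_i in [0, 1] that sum to one across sources
m1 = np.random.rand(1, N)
m2 = 1.0 - m1

# Source weights are the masked mixture weights: d_i = m_i ⊙ w, so sum_i d_i = w
d1, d2 = m1 * w, m2 * w
s1_hat, s2_hat = d1 @ B, d2 @ B      # per-source segments: s_i = d_i B
assert np.allclose(d1 + d2, w)
```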
Methods
First, an encoder estimates $w$ for each segment. The $K$ weight vectors $w$ are then fed to a source-estimation network (an LSTM) that outputs $K$ mask vectors $m_i$ for each source $i$.
A decoder then reconstructs the source waveforms from the masked weights $d_i = m_i \odot w$ and the basis signals $B$.
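A minimal PyTorch sketch of this encoder / LSTM separator / decoder pipeline; the class name, layer sizes, and the use of plain linear layers for the encoder and decoder are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TinyTasNetSketch(nn.Module):
    """Illustrative encoder / LSTM separator / decoder sketch (sizes are assumed)."""
    def __init__(self, L=40, N=500, hidden=256, num_sources=2):
        super().__init__()
        self.num_sources = num_sources
        self.N = N
        # Encoder: non-negative mixture weights w for each length-L segment
        self.encoder = nn.Sequential(nn.Linear(L, N), nn.ReLU())
        # Separator: LSTM over the K segments, emits one mask per source
        self.lstm = nn.LSTM(N, hidden, num_layers=2, batch_first=True)
        self.mask_out = nn.Linear(hidden, num_sources * N)
        # Decoder: basis signals B mapping masked weights back to waveforms
        self.decoder = nn.Linear(N, L, bias=False)

    def forward(self, segments):             # segments: (batch, K, L)
        w = self.encoder(segments)            # (batch, K, N), non-negative
        h, _ = self.lstm(w)                   # (batch, K, hidden)
        masks = self.mask_out(h)              # (batch, K, num_sources * N)
        masks = masks.view(*masks.shape[:2], self.num_sources, self.N)
        masks = torch.softmax(masks, dim=2)   # masks sum to 1 across sources
        d = masks * w.unsqueeze(2)            # d_i = m_i ⊙ w
        return self.decoder(d)                # (batch, K, num_sources, L)
```

Calling `TinyTasNetSketch()(segments)` with `segments` of shape `(batch, K, L)` returns per-source segment estimates of shape `(batch, K, num_sources, L)`.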
The three modules are trained jointly, because the time-domain output lets us directly optimize a source-to-distortion objective (SI-SDR here).
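For reference, a sketch of an SI-SDR computation that could serve as the training objective (negated as a loss); the function name and tensor layout are assumptions:

```python
import torch

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR between an estimated and a reference waveform (last dim = time)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to get the scale-invariant target component
    scale = (estimate * target).sum(dim=-1, keepdim=True) / (
        (target ** 2).sum(dim=-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    ratio = (s_target ** 2).sum(dim=-1) / ((e_noise ** 2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

# Training maximizes SI-SDR, e.g. loss = -si_sdr(estimate, target).mean()
```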