LEARNING CONTINUOUS REPRESENTATION OF AUDIO FOR ARBITRARY SCALE SUPER RESOLUTION

SHIN · August 30, 2023

1. Introduction

The paper proposes Local Implicit representation for Super resolution of Arbitrary scale (LISA), which obtains a continuous representation of audio and enables super resolution at an arbitrary scale factor. (Current approaches treat audio as discrete data.)

LISA

Encoder

Maps each chunk of audio to a latent code parameterizing the local input signal around the chunk. -> Isn't the local input just the chunk? Not quite: the code describes the continuous signal in a neighborhood of the chunk (the chunk plus surrounding samples), so neighboring codes overlap.

Decoder

Takes (1) a continuous time coordinate and (2) the set of neighboring latent codes around that coordinate, and predicts the value of the signal at the coordinate.

Process

  1. Downsample the original audio to the input resolution.
  2. Generate super-resolution tasks of random scale, up to the original resolution.
  3. Use a stochastic measure of audio discrepancy between the entire reconstructed and original signals, in both waveform and spectrogram.
  4. Assign a higher random weight when the discrepancy is smaller.
    Thus, the local latent code of a chunk captures characteristics of the global audio signal while focusing on the local signal around the chunk. -> HOW? Because the loss compares the entire reconstructed signal to the original, gradients flowing into each local code carry global information, while the code itself is computed only from the samples around its chunk.
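The training steps above can be sketched roughly as follows. All names here (`downsample`, `make_sr_task`, `weighted_discrepancy`) are hypothetical stand-ins for illustration, not the paper's actual implementation, and the "stochastic weight" is a toy version of the idea that closer reconstructions get higher weight:

```python
import random

def downsample(signal, factor):
    """Keep every `factor`-th sample (naive decimation, for illustration)."""
    return signal[::factor]

def make_sr_task(original, max_scale=4):
    """Step 1-2: build one super-resolution task with a random scale.

    Input = the original decimated by a random scale s; target = the
    original itself (hypothetical setup mirroring the steps above)."""
    s = random.randint(2, max_scale)
    return downsample(original, s), original, s

def weighted_discrepancy(recon, target):
    """Steps 3-4: mean absolute discrepancy over the whole signal,
    scaled by a random weight whose expected value is larger when the
    reconstruction is already close to the target."""
    err = sum(abs(r - t) for r, t in zip(recon, target)) / len(target)
    weight = random.uniform(0.0, 1.0) / (1.0 + err)  # closer -> higher expected weight
    return weight * err
```

Because the discrepancy is computed over the entire signal, every local latent code receives a gradient that depends on the global reconstruction quality.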

Advantages

The local implicit representation is advantageous in terms of low latency and arbitrary scale factors, since input audio is often streaming in audio super resolution. -> What does 'streaming' mean here? The audio arrives chunk by chunk in real time, so the model must process each chunk as it comes in (hence low latency) rather than waiting for the whole signal.

2. Method

Compositions

  • $F(t)$ : audio
  • $t \in \mathbb{R}$, $t = t_i$'s at every sampling period $\frac{1}{R_{in}}$, where $R_{in} > 0$ is the input resolution
  • $F(t_i)$ : local part of discrete samples around $t$
  • $g_\phi$ : encoder with parameters $\phi$
  • $f_\theta$ : decoder with parameters $\theta$
  • $z_i \coloneqq g_\phi(F(t_{i-k}), \cdots, F(t_{i+k}))$, i.e. encodes the $(2k+1)$ samples around time $t_i$
  • $z(t) \coloneqq (z_{i(t)-1}, z_{i(t)}, z_{i(t)+1})$, where $i(t) \coloneqq \arg\min_i |t - t_i|$, i.e. $i(t)$ is the index closest to $t$
  • $\hat{F}(t) \coloneqq f_\theta(t - t_{i(t)}; z(t)) \approx F(t)$
    : continuous representation
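The composition above can be sketched in toy form. Here `encode` and the decoder inside `f_hat` are hypothetical placeholders for $g_\phi$ and $f_\theta$ (the real ones are neural networks); only the wiring of $i(t)$, $z(t)$, and the local coordinate $t - t_{i(t)}$ follows the definitions:

```python
def nearest_index(t, t_grid):
    """i(t) := argmin_i |t - t_i| over the input sample times."""
    return min(range(len(t_grid)), key=lambda i: abs(t - t_grid[i]))

def encode(samples, i, k=1):
    """z_i := g_phi(F(t_{i-k}), ..., F(t_{i+k})): here simply the
    (2k+1) raw samples around index i (clamped at the boundaries)."""
    lo, hi = max(0, i - k), min(len(samples), i + k + 1)
    return tuple(samples[lo:hi])

def f_hat(t, samples, t_grid, k=1):
    """F_hat(t) := f_theta(t - t_{i(t)}; z(t)); this toy decoder just
    returns the center sample of the center latent code."""
    i = nearest_index(t, t_grid)
    z = (encode(samples, i - 1, k), encode(samples, i, k), encode(samples, i + 1, k))
    offset = t - t_grid[i]           # local coordinate fed to the decoder
    center = z[1]
    return center[len(center) // 2]  # placeholder prediction
```

The key point is that `f_hat` accepts any real-valued $t$, not just the input sample times.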

LISA

Note that $\tilde{i} = i(t)$ (the figure's $\tilde{i}$ is the closest-index function defined above).

Output prediction

The output is obtained by feeding the sequence of time coordinates spaced every $\frac{1}{R_{out}}$ into $\hat{F}(\cdot)$.
-> Since $\hat{F}$ is continuous in $t$, it can be evaluated on a time grid of any desired output resolution $R_{out}$; that grid of predictions is the super-resolved signal.
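Concretely, prediction at an arbitrary output resolution is just sampling the continuous representation on a finer grid. A minimal sketch (`predict_output` and its parameters are illustrative names, not from the paper):

```python
def predict_output(f_hat, duration, r_out):
    """Sample the continuous representation F_hat every 1/R_out seconds.

    `f_hat` is any callable t -> F_hat(t); because F_hat is defined for
    all t, r_out can be any target resolution (the 'arbitrary scale' part).
    """
    n = int(duration * r_out)
    return [f_hat(i / r_out) for i in range(n)]
```

For example, the same trained model can produce 16 kHz or 48 kHz output simply by changing `r_out`.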

2.1 Model Architecture

Encoder gϕg_\phi

  1. Convolutional network
    • Induces temporal correlation by summarizing a few consecutive data points.