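Before The operation called Multi-head self-attention (hereafter MHSA) is one way of learning data representations (token-mixing), and belongs to the same family as CNN, FCN, Pooling, and so on. CNNs have long been compared with MHSA, and it is generally
Introduction What people have known about ViT so far includes the following. ** When the amount of data is small, ViT struggles to outperform ResNet. ** Because ViT lacks inductive bias, increasing ViT's representational power requires a large amount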
[Rethinking Mamba in Speech Processing by Self-Supervised Models (Xiangyu Zhang et al., 2024)](https://arxiv.org/pdf/2409.07273) Insight: Reconstruct
Paper Are all negatives created equal in contrastive instance discrimination? (arxiv preprint, 2020) Link https://arxiv.org/abs/2010.06682 Introduct
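In August 2023, CVSSP (Center for Vision, Speech and Signal Processing) and Bytedance released AudioLDM2, a SOTA model in Audio Generation, or the multi-modal domain. Appearing in January of the same year
I used to review papers only in my head or in handwritten notes, and I rarely posted them to velog and the like because it felt like a hassle. From now on, though, I plan to review papers consistently and keep writing review posts about the ones I have read before and the ones I newly read.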
As AI models get bigger and deeper, understanding how AI models work has become a growing topic among researchers. A few studies focus on layer-
The paper REVISITING SELF-SUPERVISED LEARNING OF SPEECH REPRESENTATION FROM A MUTUAL INFORMATION PERSPECTIVE by Alexander H. Liu et al., published at ICASSP 2024, Self-Su
Speech enhancement studies have usually been performed in the time-frequency (T-F) domain, since noise patterns can be easily distinguished in the T-F do
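As a rough illustration of what "working in the T-F domain" means, here is a minimal short-time Fourier transform sketch using only the Python standard library. The sample rate, tone frequency, frame length, hop size, and noise level are all arbitrary values chosen for this example, not settings from any of the papers listed here, and `stft_magnitude` is a hypothetical helper name.

```python
import cmath
import math
import random


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return one magnitude spectrum (bins 0..frame_len//2) per frame."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Apply a periodic Hann window to reduce spectral leakage.
        frame = [
            signal[start + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len))
            for n in range(frame_len)
        ]
        # Naive DFT of the windowed frame (fine for small frames).
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Assumed toy signal: a 1 kHz tone at 8 kHz sampling plus weak white noise.
sample_rate = 8000
tone_hz = 1000
rng = random.Random(0)
signal = [
    math.sin(2 * math.pi * tone_hz * t / sample_rate) + rng.gauss(0, 0.1)
    for t in range(512)
]

spec = stft_magnitude(signal)
# Index of the strongest frequency bin in each frame.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spec]
expected_bin = tone_hz * 64 // sample_rate  # bin that holds the tone
```

In the time domain the tone and the noise are mixed in every sample, while in each T-F frame the tone concentrates in a single bin (`expected_bin`) and the noise spreads thinly over all bins, which is the kind of separation that makes T-F processing attractive.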
A Vision Transformer (ViT) encoding a 224 $\times$ 224 image and a vanilla Transformer encoding 196 characters produce feature maps of the same size w
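The 196 figure can be checked with a little arithmetic. The sketch below assumes a patch size of 16 $\times$ 16, which is the common ViT setting but is not stated in the preview above; `num_patches` is a hypothetical helper name.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Assumption: 16 x 16 patches, so 224 / 16 = 14 patches per side.
tokens = num_patches(224, 16)
print(tokens)
```

Under that assumption the image becomes a sequence of 14 $\times$ 14 = 196 patch tokens (before any class token is added), which matches the sequence length of a 196-character input.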
Real-time speech processing remains challenging. Open issues include: is Fourier decomposition the best choice? Vanilla approaches usually predict the source magnitude
This post is a brief review of the paper 'FINALLY: fast and universal speech enhancement with studio-like quality', Babaev et al.,
Speech Restoration is a complicated task that deals with multiple acoustic distortions such as reverberation, band-limitation and more. Such a complica
Current models do not explicitly model degradation information (e.g., type, intensity). Injection of conditions like SSL or Speaker embedding. ACX, a nove
A model trained with a single sampling rate (sampling frequency) can be transferred to a model with different sampling rates. Sampling frequency indepe