Multi-modal self-supervised learning:
1) Cross-modality Semantic Correspondence: Fine-grained Modeling (local) + Coarse-grained Modeling (global)
2) T