Hyperbolic representations and dimensionality reduction for single cell biology

Junha Park · July 7, 2023

Representation learning on hyperbolic latent spaces is especially powerful in computational biology. Nature has inherent hierarchy: molecular graphs, protein structures, 3D chromatin structure, and single-cell sequencing data each carry their own hierarchical organization. Hyperbolic latent space has the unique property of naturally representing hierarchical relationships, making it a good match when we try to learn the high-dimensional geometry of biological data.

This article reviews the scPhere paper, a single-cell application of hyperbolic representation learning. The ECCV 2022 tutorial on hyperbolic representation learning is referenced as a gentle introduction.

1. Introduction

Hyperbolic geometry provides a non-Euclidean view for learning the manifold on which data are distributed. Euclidean geometry is not always the right choice of manifold: spherical and hyperbolic spaces are examples of spaces with positive and negative curvature, respectively.

Hyperbolic geometry is a geometry of constant negative curvature; it provides more space than Euclidean geometry for a given dimension and has intrinsic hierarchical structure. The basic elements that constitute the notion of a space remain well defined. Let's take a closer look at how distance, angle, line, geodesic, volume, and area differ in hyperbolic space.

A key postulate of Euclidean geometry is the parallel postulate: given a line and a point not on it, there exists a unique line through the point that does not intersect the given line. The hyperbolic counterpart is: given a line and a point not on it, there exists more than one line through the point that does not intersect the given line.

If you're not familiar with the basic terminology, you can refer to the following table.

| Terminology | Explanation |
| --- | --- |
| Distance | Given points $x, y$: how far apart are they in the space? |
| Geodesic arc | Distance-minimizing curve from point $x$ to $y$ |
| Geodesic | A geodesic arc extended as far as possible |

2. Two models of hyperbolic geometry and its properties

Poincare ball model

The Poincare ball model represents hyperbolic space as the interior of the unit ball in Euclidean space. It is powerful for visualization, especially the 2-dimensional Poincare disc model.

  • Definition
    $\mathbb{P}=\{\mathbf{z}\in\mathbb{R}^{M+1} \mid \|\mathbf{z}\|<1,\ z_0 = 0,\ M\in \mathbb{Z}^+ \}$, where $\mathbf{z}=(z_0, \dots, z_M)^T$.
  • Distance
    $d_\mathbb{P}(\mathbf{z}_1,\mathbf{z}_2)=\cosh^{-1}\left(1+\frac{2\|\mathbf{z}_1-\mathbf{z}_2\|^2}{(1-\|\mathbf{z}_1\|^2)(1-\|\mathbf{z}_2\|^2)}\right)$
  • Norm
    Setting $\mathbf{z}_2=\mathbf{0}$ gives $\|\mathbf{z}\|_\mathbb{P}=\cosh^{-1}\left(\frac{1+\|\mathbf{z}\|^2}{1-\|\mathbf{z}\|^2}\right)$
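The two formulas above can be checked numerically; here is a minimal Python sketch (plain floats, no ML framework assumed):

```python
import math

def poincare_distance(z1, z2):
    """d_P(z1, z2) for points strictly inside the unit ball."""
    sq_norm = lambda v: sum(c * c for c in v)
    diff2 = sq_norm([a - b for a, b in zip(z1, z2)])
    denom = (1.0 - sq_norm(z1)) * (1.0 - sq_norm(z2))
    return math.acosh(1.0 + 2.0 * diff2 / denom)

def poincare_norm(z):
    """Setting z2 = 0 in the distance reduces it to this norm formula."""
    n2 = sum(c * c for c in z)
    return math.acosh((1.0 + n2) / (1.0 - n2))
```

Note how the denominator shrinks as a point approaches the boundary ($\|\mathbf{z}\| \to 1$), so equal Euclidean steps cover ever larger hyperbolic distances; this is the "extra room" near the boundary that makes tree-like data embed well.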

Lorentz model

The Lorentz model is defined via the Lorentzian inner product, with the special one-hot vector $(1, 0, \dots, 0)^T$ corresponding to the origin of the hyperbolic space. Points in the Lorentz model can easily be mapped to the Poincare ball for visualization.

  • Definition
    $\mathbb{H}^M=\{\mathbf{z}\in\mathbb{R}^{M+1}\mid z_0>0,\ \langle\mathbf{z},\mathbf{z}\rangle_{\mathbb{H}}=-1 \}$
  • Lorentzian inner product
    $\langle\mathbf{z},\mathbf{z}'\rangle_{\mathbb{H}}=-z_0z_0'+\sum_{i=1}^M z_i z_i'$
  • Distance
    $d_{\mathbb{H}}(\mathbf{z}_1,\mathbf{z}_2)=\cosh^{-1}(-\langle \mathbf{z}_1,\mathbf{z}_2\rangle_{\mathbb{H}})$
  • Mapping $\mathbb{H}^M \rightarrow \mathbb{P}$
    $(z_0, z_1, \dots, z_M)^T \in \mathbb{H}^M \mapsto \left(0, \frac{(z_1, \dots, z_M)^T}{z_0+1}\right)$
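These definitions can also be sketched in a few lines of Python; `lift`, which places a Euclidean vector onto the hyperboloid by solving $\langle\mathbf{z},\mathbf{z}\rangle_{\mathbb{H}}=-1$ for $z_0$, is a helper name of my own, not from the paper:

```python
import math

def lorentz_inner(z, zp):
    """Lorentzian inner product: minus sign on the first coordinate."""
    return -z[0] * zp[0] + sum(a * b for a, b in zip(z[1:], zp[1:]))

def lorentz_distance(z1, z2):
    return math.acosh(-lorentz_inner(z1, z2))

def lorentz_to_poincare(z):
    """Map a Lorentz-model point to the Poincare ball (drop z0, rescale)."""
    return [c / (z[0] + 1.0) for c in z[1:]]

def lift(x):
    """Place a Euclidean vector (z1, ..., zM) on the hyperboloid."""
    z0 = math.sqrt(1.0 + sum(c * c for c in x))
    return [z0] + list(x)
```

The mapped points always land strictly inside the unit ball, which is what makes the Lorentz model convenient for optimization and the Poincare ball convenient for visualization.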

3. Overall scheme of ScPhere

scPhere fits a model to sparse single-cell data, with the generative distribution parameterized by a neural network to exploit its optimization power. Since single-cell data are sparse, i.e., count matrices with many zeros, scPhere uses softmax as the activation function to estimate the means of negative-binomial distributions. The softmax output is a vector summing to 1; multiplying it by the size factor of a cell (the sum of UMI counts for that cell) gives the mean of the NB distribution for each gene in that cell.
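The softmax-then-scale step can be sketched as follows (NumPy, with illustrative shapes; the real scPhere decoder is a full network, so this only shows the final output transformation):

```python
import numpy as np

def nb_means(decoder_logits, counts):
    """Turn decoder logits into NB means: softmax over genes,
    scaled by each cell's library size (total UMI count)."""
    # numerically stable softmax along the gene axis
    e = np.exp(decoder_logits - decoder_logits.max(axis=1, keepdims=True))
    proportions = e / e.sum(axis=1, keepdims=True)    # each row sums to 1
    library_size = counts.sum(axis=1, keepdims=True)  # UMIs per cell
    return proportions * library_size                 # one NB mean per gene
```

By construction the NB means of a cell sum to its library size, so the model allocates the observed sequencing depth across genes rather than predicting absolute counts directly.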

Remark

  • Negative-binomial distribution
    $X\sim NB(r,p) \Rightarrow f_X(x)= \binom{x-1}{r-1}p^r(1-p)^{x-r},\quad x=r, r+1, \dots$
  • $E(X)=\frac{r}{p},\quad Var(X)=\frac{r(1-p)}{p^2}$
  • The pmf gives the probability that exactly $x$ trials are needed to reach $r$ successes, each trial succeeding with probability $p$.
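A quick numerical check of the mean and variance under this number-of-trials parameterization, with an arbitrary choice of $r=3$, $p=0.4$:

```python
from math import comb

def nb_pmf(x, r, p):
    """P(X = x): the r-th success occurs exactly on trial x."""
    return comb(x - 1, r - 1) * p ** r * (1 - p) ** (x - r)

r, p = 3, 0.4
support = range(r, 500)  # truncated support; the tail beyond 500 is negligible
mean = sum(x * nb_pmf(x, r, p) for x in support)            # ~ r/p = 7.5
var = sum((x - mean) ** 2 * nb_pmf(x, r, p) for x in support)  # ~ r(1-p)/p^2
```

Be aware that libraries differ: for instance `scipy.stats.nbinom` counts the number of failures before the $r$-th success rather than the number of trials, so the support starts at 0 and the mean is $r(1-p)/p$ there.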

scPhere learns the non-Euclidean geometry of single-cell data, in particular the underlying hierarchies, through hyperbolic representation learning. Given a dataset $D=\{(x_i, y_i)\}^N_{i=1}$ ($x_i\in\mathbb{R}^D$ for $D$ genes, $y_i$ a batch indicator), the goal is to learn a low-dimensional latent vector $z_i$. The joint distribution is factorized as

$p(x_i,y_i,z_i|\theta_i)=p(y_i|\theta_i)p(z_i|\theta_i)p(x_i|y_i,z_i,\theta_i)$

The posterior distribution $p(z_i|x_i,y_i,\theta_i)$ of the latent variable $z_i$ is approximated by a parameterized distribution $q(z_i|x_i,y_i,\phi_i)$ via variational inference. The objective of the hyperbolic VAE is to maximize the ELBO, formulated as

$-\mathbb{KL}(q(z_i|y_i,x_i,\phi_i)\,\|\,p(z_i|\theta_i))+\mathbb{E}_{q(z_i|x_i,y_i,\phi_i)}[\log p(x_i|y_i,z_i,\theta_i)]$

The prior distribution $p(z_i|\theta_i)$ is assumed to be the uniform distribution on the hypersphere, with density $\left(\frac{2\pi^{M/2}}{\Gamma(M/2)}\right)^{-1}$, and the observed UMI counts $x_i$ are modeled with negative-binomial distributions, $p(x_i|y_i,z_i,\theta_i)=\prod_{j=1}^D NB(x_{i,j}\mid\mu_{y_i,z_i},\sigma_{y_i,z_i})$. The posterior is assumed to be a vMF (von Mises-Fisher) distribution on the unit hypersphere of dimensionality $M-1$. For robust representation learning, downsampled UMI counts are also fed into the hyperbolic VAE, and the MSE between the latent representations from the downsampled and raw UMIs is added as a penalty term.
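Putting the pieces together, the per-cell training loss (negative ELBO plus the consistency penalty) can be assembled roughly as below; this is a sketch assuming the KL term, the NB log-likelihood, and both latent embeddings have already been computed, and `penalty_weight` is an illustrative hyperparameter name, not the paper's:

```python
import numpy as np

def scphere_style_loss(kl, nb_log_lik, z_raw, z_down, penalty_weight=1.0):
    """Negative ELBO per cell plus a downsampling-consistency penalty.

    kl:           KL(q(z|x,y) || p(z)) per cell, shape (N,)
    nb_log_lik:   sum_j log NB(x_ij | mu, sigma) per cell, shape (N,)
    z_raw/z_down: latent coordinates from raw and downsampled UMIs, shape (N, M)
    """
    neg_elbo = kl - nb_log_lik                  # minimizing this maximizes the ELBO
    mse = ((z_raw - z_down) ** 2).mean(axis=1)  # consistency penalty
    return (neg_elbo + penalty_weight * mse).mean()
```

When the two embeddings agree, the penalty vanishes and the loss reduces to the usual negative ELBO averaged over cells.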

4. Personal discussion

Dimensionality reduction methods, including t-SNE and UMAP, are powerful techniques for visualizing the sparse geometry of single-cell genomics data. However, there is a trade-off between expressivity and interpretability. Interpretability matters because cell annotation remains quite ambiguous after cells are clustered.

PCA uses a linear transformation to map high-dimensional features into a low-dimensional vector space, thus partially preserving the relationships between raw vectors after elements are aggregated. In contrast, nonlinear transformations such as t-SNE can learn rich representations but do not guarantee that relationships between coordinates in the low-dimensional space correspond to operations on the original vectors.

To compensate for this weakness, manifold learning based on generative models is an emerging approach for single-cell genomics; autoencoders, variational autoencoders, and GANs have been proposed in succession. This paper proposes a hyperbolic VAE to learn the inherent hierarchy of single-cell genomics data. The Poincare ball model is exploited for visualization and exhibits the properties of hyperbolic geometry, which may help with cell-cluster annotation.
