Transformations of deconvolved gene expression distributions

Introduction
Setup
Methods
Results

Introduction

From the perspective of distribution deconvolution, scRNA-seq count data has a mean-variance relationship simply by convolution with a Poisson technical noise model.

Here, we investigate whether the latent gene expression distribution also has a mean-variance relationship, and what happens to it when we transform the distribution.

Setup

import numpy as np
import pandas as pd
import scmodes

%matplotlib inline
%config InlineBackend.figure_formats = set(['retina'])

import matplotlib.pyplot as plt
plt.rcParams['figure.facecolor'] = 'w'

Methods

The key idea of our approach is to deconvolve the gene expression of iPSCs assuming:

\[ x_{ijk} \sim \mathrm{Poisson}(s_{ij} \exp(\mathbf{z}_i' \mathbf{b}_j) \lambda_{ijk}) \]

\[ \lambda_{ijk} \sim g_{ik}(\cdot) = \pi_{ik} \delta_0(\cdot) + (1 - \pi_{ik}) \mathrm{Gamma}(\mu_{ik}, \phi_{ik}) \]

From the fitted \(\hat{g}_1, \ldots, \hat{g}_p\), we can investigate whether there is a mean-variance relationship in latent gene expression.

Results

Untransformed distribution

Read the deconvolved distributions.

log_mu = pd.read_csv('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-mu.txt.gz', sep=' ', index_col=0)
log_phi = pd.read_csv('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-phi.txt.gz', sep=' ', index_col=0)
logodds = pd.read_csv('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-logodds.txt.gz', sep=' ', index_col=0)

Throw out the individual with evidence of contamination.

for x in (log_mu, log_phi, logodds):
  del x['NA18498']

Recover the mean and variance of latent gene expression per gene, per individual.

# Important: log(sigmoid(x)) = -softplus(-x)
mean = np.exp(log_mu - np.log1p(np.exp(logodds)))
variance = np.exp(2 * log_mu + log_phi - np.log1p(np.exp(logodds))) + np.exp(-np.log1p(np.exp(logodds)) - np.log1p(np.exp(-logodds)) + 2 * log_mu)

Plot the mean/variance relationship.

plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.semilogx()
plt.semilogy()
plt.scatter(mean.values.ravel(), variance.values.ravel(), c='k', s=4, alpha=0.1)
plt.xlabel('Latent gene expression mean')
plt.ylabel('Latent gene expression variance')

Text(0, 0.5, 'Latent gene expression variance')

Restrict to the individual with the most cells (\(n=284\)).

plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.semilogx()
plt.semilogy()
plt.scatter(mean['NA18507'], variance['NA18507'].values.ravel(), c='k', s=4, alpha=0.1)
plt.title('NA18507')
plt.xlabel('Latent gene expression mean')
plt.ylabel('Latent gene expression variance')

Text(0, 0.5, 'Latent gene expression variance')

Log transform

Assume \(\lambda_{ijk} \sim g_{ik}(\cdot)\), as above. Plot the approximate relationship between \(E[\ln(\lambda + \epsilon)]\) and \(V[\ln(\lambda + \epsilon)]\) by first-order Taylor expansion.

\[ E[\ln(x + \epsilon)] \approx \ln(E[x] + \epsilon) - \frac{V[x]}{2 (E[x] + \epsilon)^2} \]

\[ V[\ln(x + \epsilon)] \approx \frac{V[x]}{(E[x] + \epsilon)^2} - \frac{V[x]^2}{4 (E[x] + \epsilon)^4} \]

Importantly, \(\epsilon\) needs to be on the same scale as \(\lambda\). But \(x_{ijk} \sim \mathrm{Poisson}(s_{ij} \lambda_{ijk})\), so \(\lambda_{ijk}\) is on the scale of \(1 / s_{ij}\). Instead of assuming \(x_{ijk} \sim F_{ijk}(\cdot)\), assume:

\[ x_{ijk} \sim \frac{1}{n} \sum_j F_{ijk}(\cdot) \]

Then, \(E[x_{ijk}] = \frac{1}{n}\sum_j s_{ij} E[\lambda_{ijk}]\). This gives us a single scaling factor for \(g_{ik}\).

Read the metadata for the iPSCs.

annotations = pd.read_csv('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-annotation.txt', sep='\t')
keep_samples = pd.read_csv('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/quality-single-cells.txt', sep='\t', index_col=0, header=None)
annotations = annotations.loc[keep_samples.values.ravel()]

s = annotations.loc[(annotations['chip_id'] == 'NA18507').values, 'mol_hs'].mean()

plt.clf()
fig, ax = plt.subplots(1, 3, sharey=True)
fig.set_size_inches(7, 3)
for a, eps, t in zip(ax, np.array([0, 0.5, 1]) / s, ['0', '1/(2s)', '1/s']):
  a.set_yscale('log')
  m = np.log(mean['NA18507'] + eps) - variance['NA18507'] / (2 * np.square(mean['NA18507'] + eps))
  v = variance['NA18507'] / np.square(mean['NA18507'] + eps) - np.square(variance['NA18507']) / (4 * (mean['NA18507'] + eps) ** 4)
  a.scatter(m, v, c='k', s=4, alpha=0.1)
  a.set_title(f'$\epsilon = {t}$')
ax[1].set_xlabel('Latent log gene expression mean')
ax[0].set_ylabel('Latent log gene expression variance')
fig.tight_layout()

Zoom in on highly expressed genes.

plt.clf()
fig, ax = plt.subplots(1, 3, sharey=True)
fig.set_size_inches(7, 3)
for a, eps, t in zip(ax, np.array([0, 0.5, 1]) / s, ['0', '1/(2s)', '1/s']):
  a.set_yscale('log')
  a.set_xlim(-8, -6)
  a.set_ylim(1e-2, .5)
  a.scatter(m, v, c='k', s=4, alpha=0.25)
  a.set_title(f'$\epsilon = {t}$')
ax[1].set_xlabel('Latent log gene expression mean')
ax[0].set_ylabel('Latent log gene expression variance')
fig.tight_layout()

Square root transform

Assume \(\lambda_{ijk} \sim g_{ik}(\cdot)\), as above. Plot the approximate relationship between \(E[\lambda^{1/2}]\) and \(V[\lambda^{1/2}]\) by first-order Taylor expansion.

\[ E[x^{1/2}] \approx (E[x])^{1/2} - \frac{V[x]}{8 E[x]^{3/2}} \]

\[ V[x^{1/2}] \approx \frac{V[x]}{4 E[x]} - \frac{V[x]^2}{64 E[x]^3} \]

plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.semilogx()
plt.semilogy()
m = np.sqrt(mean['NA18507']) - variance['NA18507'] / (8 * np.power(mean['NA18507'], 1.5))
v = variance['NA18507'] / (4 * mean['NA18507']) - np.square(variance['NA18507']) / (64 * np.power(mean['NA18507'], 3))
plt.scatter(m, v, c='k', s=4, alpha=0.1)
plt.xlabel('Latent sqrt gene expression mean')
plt.ylabel('Latent sqrt gene expression variance')
plt.gcf().tight_layout()

Variance effects versus mean effects

After convolution of a Gamma with a Poisson, we expect a quadratic mean-variance relationship. But we showed above that after deconvolution, we still see a linear mean-variance relationship within samples.

What if we look between conditions? To investigate this, look at MA plots of latent gene expression for two different donor individuals.

Find the individual with the second-most cells.

annotations['chip_id'].value_counts().sort_values(ascending=False).head(n=2)

NA18507    284
NA18501    221
Name: chip_id, dtype: int64

a = np.log(mean['NA18507'] + 1 / annotations.loc[(annotations['chip_id'] == 'NA18507').values, 'mol_hs'].mean())
b = np.log(mean['NA18501'] + 1 / annotations.loc[(annotations['chip_id'] == 'NA18501').values, 'mol_hs'].mean())
plt.clf()
fig, ax = plt.subplots(1, 2, sharey=True)
fig.set_size_inches(6, 3)
ax[0].scatter((a + b) / 2, a - b, s=2, c='k', alpha=0.2)
ax[0].axhline(y=0, ls=':', lw=1, c='r')
ax[0].set_title('NA18507 / NA18501')
ax[1].scatter((a + b) / 2, a - b, s=2, c='k', alpha=0.2)
ax[1].axhline(y=0, ls=':', lw=1, c='r')
ax[1].set_xlim(-8, ax[0].get_xlim()[1])
ax[1].set_title('NA18507 / NA18501 (zoomed)')
for a in ax:
  a.set_xlabel('Average log latent mean')
ax[0].set_ylabel('Log ratio latent mean')
fig.tight_layout()

Comparison of sampling variance with biological variance

From the perspective of deconvolution, biological variance describes \(\lambda \sim g(\cdot)\), while sampling variance describes \(x \mid s, \lambda \sim \mathrm{Poisson}(s \lambda)\).

Our results above suggest that for the highest expressed genes, \(V[\ln(\lambda + 1 / s)]\) is order \(10^{-1}\). Estimate the sampling variance \(V[\ln(x + 1) \mid s, \lambda=\mu]\) for genes with \(E[\ln(\lambda + 1 / s)] > -8\), where \(\mu = E[\lambda]\).

s = annotations.loc[(annotations['chip_id'] == 'NA18507').values, 'mol_hs'].mean()
m = np.log(mean['NA18507'] + 1 / s) - variance['NA18507'] / (2 * np.square(mean['NA18507'] + 1 / s))
v = variance['NA18507'] / np.square(mean['NA18507'] + 1 / s) - np.square(variance['NA18507']) / (4 * (mean['NA18507'] + 1 / s) ** 4)

query = m[m > -8].index

ipsc = scmodes.dataset.ipsc('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/', query=query, return_df=True)

subset = ipsc[(annotations['chip_id'] == 'NA18507').values.ravel()]
lam = s * mean['NA18507']
vx = lam / np.square(lam + 1) - np.square(lam) / (4 * (lam + 1) ** 4)

plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.scatter(v[m > -8], vx[m > -8], c='k', s=2, alpha=0.25)
plt.plot([0, .25], [0, .25], ls=':', lw=1, c='r')
plt.xlim([0, .25])
plt.ylim([0, .25])
plt.xlabel('Log latent gene expression variance')
_ = plt.ylabel('Log count sampling variance')