Gene expression cartography

Multiplexed RNA sequencing in individual cells is transforming basic and clinical life sciences1–4. Often, however, tissues must first be dissociated, and crucial information about spatial relationships and communication between cells is thus lost. Existing approaches to reconstruct tissues assign spatial positions to each cell, independently of other cells, by using spatial patterns of expression of marker genes5,6—which often do not exist. Here we reconstruct spatial positions with little or no prior knowledge, by searching for spatial arrangements of sequenced cells in which nearby cells have transcriptional profiles that are often (but not always) more similar than cells that are farther apart. We formulate this task as a generalized optimal-transport problem for probabilistic embedding and derive an efficient iterative algorithm to solve it. We reconstruct the spatial expression of genes in mammalian liver and intestinal epithelium, fly and zebrafish embryos, sections from the mammalian cerebellum and whole kidney, and use the reconstructed tissues to identify genes that are spatially informative. Thus, we identify an organization principle for the spatial expression of genes in animal tissues, which can be exploited to infer meaningful probabilities of spatial position for individual cells. Our framework (‘novoSpaRc’) can incorporate prior spatial information and is compatible with any single-cell technology. Additional principles that underlie the cartography of gene expression can be tested using our approach. A new computational framework, novoSpaRc, leverages single-cell data to reconstruct spatial context for cells and spatial expression across tissues and organisms, on the basis of an organization principle for gene expression.


The spatial organization principle and its global character
In this manuscript, we explored the assumption that gene expression between nearby cells is generally more similar than gene expression between cells which are separated by larger distances.Biologically, this phenotype can result from multiple mechanisms -gradients of morphogens and nutrients, trajectory of cell maturation, and communication between neighboring cells.While all of these can induce either smooth gradients or sharp boundaries (or combinations thereof) in gene expression patterns, as long as there are spatial shifts between sharp boundaries exhibited by different genes, our hypothesis would hold since closeness is a combined property of all genes in the transcriptome.In other words, this is an assumption about overall gene expression across space.Individual genes may very well have sharp expression territories from one cell to a neighboring cell.Our assumption just states that overall, expression of individual genes should only rarely look like salt and pepper patterns but should be organized, for most genes, in (gene specific) spatial territories.
This assumption can be readily tested.Indeed, we showed that at different stages of the developmental process of organisms, or in different tissues in matured organisms, cells that are physically close are also close in expression space, and vice versa (Figs.2b,f, and 3b).This occurs despite existing sharp boundaries in expression patterns for different genes, since closeness, which is properly defined below, depends on the combined effect of genes composing the full transcriptome.

Integrating the continuity assumption into reference-guided reconstructions
Our framework enables the incorporation of both structural and reference guided information.When reconstructing only by a reference atlas (which corresponds to alpha=1 in the manuscript), however, the structural correspondence assumption is already integrated to a certain extent.Although counter-intuitive, this stems from the fact that novoSpaRc exploits a new framework to reconstruct spatial information based on marker genes.There are two major advances implemented into novoSpaRc.First, existing methods (Seurat [1] and DistMap [2]) require binarization of the reference atlas: a gene is considered to be either ON or OFF in a given location of the tissue.In contrast, novoSpaRc works with continuous values and therefore exploits subtle gradients in gene expression that might be present in the data.Second, Seurat and DistMap map individually one cell at a time.By using the framework of optimal transport, novoSpaRc finds the optimal reconstruction by mapping all cells simultaneously, choosing the reconstruction that best satisfies the constraints and is consistent with the marginal distributions (see "Single cell embedding using optimal transport" supplementary section, and "Mathematical formulation of novoSpaRc" Methods section).Taking into account the use of gradients, this process favors reconstructions that best respect the continuity assumption in gene expression.This rather intricate issue will be further discussed in a future manuscript.

Choosing the target space shape
A geometrical representation of the tissue will be in general unknown.In these cases, we can flexibly represent the target space as a regular lattice covering the shape of the original tissue (or, in fact, any distribution of finite support with predefined desirable properties).novoSpaRc supports target spaces of any shape and density, but we found that the reconstruction is greatly benefited when an appropriate target shape is selected.Therefore, any prior information or a good educational guess will result in better reconstruction of the investigated tissue.
If the (effective) dimension of the tissue to be reconstructed is unknown, it can be possibly approximated by computing the intrinsic dimensionality of the manifold spanned by single cells lying in the high dimensional expression space (by using for example a maximum likelihood-based approach [3] ).

Computing graph-based distances in expression and physical space
As the expression profiles are represented in high-dimensional space, metric distances are prone to multiple limitations.Instead, we use steps motivated by non-linear dimensionality reduction methods (e.g., Isomap [4]).However, at this stage we do not require finding low-dimensional coordinates, but rather constructing a robust distance matrix.For symmetry we apply the same procedure to both cells and locations independently (Extended Data Fig. 1a, first column).We start by computing pairwise distances between entities.We chose as a distance metric the Euclidean distance for the physical space (locations) and the correlation-based distance for the expression space (cells), but other measures can be used.These however do not capture the true geometry of nonlinear low-dimensional manifolds.Thus, we use these pairwise distances to construct a k-nearest neighbors graph (Extended Data Fig. 1a, second column).From these graphs, we compute the shortest path lengths for each pair of cells, resulting in graph-based distance matrices for cells and for locations.(Extended Data Fig. 1a, third column).

Single cell embedding using optimal transport
As was discussed in the main text, the optimal probabilistic coupling  * ∈  !!×! between N single cell expression profiles and M cellular locations can be framed as the solution to the following optimization problem: where ,  is a non-negative regularization constant, and  ∈ [0,1] is a constant interpolating between the first two objectives, and can be set to  = 0 when no reference atlas is available.The set of coupling between the distribution over expression profiles,  ∈ { ∈  !! ;  != 1 !}, and the distribution over locations, To retrieve the coupling  * , we extend upon the results for entropically regularized optimal transport [5] and Gromov-Wasserstein distance-based mapping between metric-measure spaces [6], and use projected gradient descent, where the projection is based on the Kullback-Leibler (KL) metric.Each iteration of the projected exponentiated gradient method consists of two steps; in the first step the current estimate of T is updated by exponentiated gradient descent step, similarly to [6], to yield : where ⊙ is an element-wise multiplication,  (!) is element-wise operation, and  > 0 is a small step size.In the second step,  is projected back into the set  !,! according to the KL metric: where the KL projection is It was shown in [7] that the KL projection can be rewritten as an instance of entropically-regularized optimal transport: The gradient of the objective function can be written as where log() is an element-wise operation, and the tensor product is defined as .
Altogether, we have: Therefore, if we set  = 1/ , each iteration of the algorithm can be simplified to a Sinkhorn projection, Each of these iteration steps can be computed using Sinkhorn's fixed point algorithm [5].Specifically, where the Gibbs kernel associated with { Finally,  ∈  !! and  ∈  !!can be computed using Sinkhorn's fixed point iteration [8] involving element-wise division: We provide a Python package for the implementation of novoSpaRc at https://github.com/rajewskylab/novosparc.Parts of the code are based on modifications of the Python Optimal Transport Package (https://pot.readthedocs.io/).

Justification of probabilistic mapping
We posed the spatial mapping problem as finding a probabilistic embedding between the cells and locations.That is, each single cell is to be assigned a distribution over cellular locations.A probabilistic mapping is preferable for several reasons.
Single cell data does not yield an exact 1-to-1 matching problem.(i) When a tissue is dissociated into single cells, we would generally not be able to retrieve information for the full batch of single cells, but only for a certain fraction of them, due to experimental constraints.(ii) There would generally not be information about the number of original single cells in the tissue and their exact location, meaning we would need to resort to assignment of single cells over a grid.(iii) Even in cases where there are known, reproducible cellular locations, and there is the possibility to dissociate many nearly-identical tissues to increase the single cell coverage, we would still expect to have cellular locations that correspond to multiple single cells, and cellular locations that do not correspond to any of the single cells in the dataset.
Probabilistic mapping would yield smoother expression patterns and would be more robust to the noisy, partially sampled single cell data.Given imperfect data, as is the case for experimental setups, we may be uncertain about the exact location of a dissociated single cell and would rather place it in a certain neighborhood of the tissue (or, probabilistically spread it over several locations in that area).This is motivated both by noise and dropouts in the original data, and the fact that if we are mapping single cells to a grid, their true original location may be in between several nodes (cellular locations) on the grid, in which case their true mapping should be distributed over the grid nodes surrounding the original location, weighted by their corresponding distance from that location.
Probabilistic matching is more efficient computationally.Intuitively, we replace a discrete optimization problem over a large combinatorial space with a continuous optimization of a smooth function, which allows us to employ more efficient optimization methods.Details can be found in the Supplementary section 'Single cell embedding using optimal transport'.
Finally, we are interested in the reconstructed expression patterns over stereotypical tissues, and not necessarily in assigning single cells their exact original location.

Evaluation of spatial reconstruction.
We evaluate the quality of reconstruction by novoSpaRc by three different measures: (a) Correlation of expression patterns.The reconstructed spatial gene expression of all genes (vISH) can be compared to the original expression patterns by computing the Pearson correlation between them, averaged over all genes, such as in Fig. 3c.(b) Alignment of single cell assignment.For the tissues with 1d symmetry we also compute the fraction of cells correctly assigned to their original spatial zone.To do this, we compare for each cell its original spatial zone to its reconstructed zone according to novoSpaRc.More specifically, the zone that the cell is assigned to with highest probability.This notion can be extended to the fraction of cells assigned to a spatial zone that is found at most at a certain distance from their original zone.We show this evaluation for increasing distances for the reconstruction of the intestinal epithelium and the liver (Extended Data Figs.2a,b).(c) Probability heatmap.In Fig. 2c,g we quantify the assignment of single cells to their corresponding 1d spatial zones by a probabilistic version of a confusion matrix (the probability heatmap).For each original zone (on the x-axis), we average over the reconstructed spatial probability distribution of single cells originating from that zone and display that on the y-axis.

Generative model for spatial gene expression.
To systematically evaluate novoSpaRc's performance, we generated synthetic spatial expression data using a simple generative model that is based on independent Gaussian spatial expression patterns for each gene, for either a 1d (line), 2d (squre) or 3d (cube) shaped synthetic tissue.For 1d tissues, the expression  of each gene  over the spatial zones is proportional to a gaussian distribution, (| !, σ ! ) ∝ , where  ! is the mean of the gaussian, sampled uniformly across the 1d grid, and  ! is the standard deviation.For 2d and 3d tissues, the expression is proportional to a multivariate normal distribution, (| , where  ! is the mean vector (sampled uniformly across the 2d or 3d grid), and  ! is the covariance matrix.After generating the synthetic expression matrix, we add gaussian noise to the expression values with 0 mean and  !"#$%  !"#$!%%&'( standard deviation, where  !"#$!%%&'( is the standard deviation of the entire expression matrix, and  !"#$% is a parameter that sets the signal to noise ratio.The expression of 'spatially informative' genes is set according to the model above, while the expression of 'spatially non-informative' genes is randomly permuted across the synthetic tissue.The default parameters for the simulations and novoSpaRc reconstructions are: 1000 single cells (or closest approximation for the 2d grid), 100 grid locations (or closest approximation for the 3d grid), 100 genes,  = 10,  !(where  is the identity matrix),  = 0.5, number of marker genes = 5, and  !"#$!%%&'( = 0.1.When reconstructing a tissue de novo, meaning without any reference atlas, the reconstruction can be established up to the symmetry axes of the target space.This is not specific to novoSpaRc, but an inherent the problem.For example, when reconstructing a 1-dimensional tissue, then without additional prior knowledge there is no way of distinguishing between a 'right-to-left' and 'left-to-right' reconstruction (see illustration in the Figure below, left).This can be remedied by including prior information that would break the symmetry, such as marker gene expression information (as long as they do not exhibit the same underlying symmetries).Such prior information would correctly "anchor" the reconstruction and might also appear more intuitive to the user.

Genes contained within each nutrient class in the intestinal epithelium
Below are a few examples for such symmetries (specified using a double-arrowed line on each target space) that fundamentally cannot be resolved de novo: 11.Gene ontology analysis for genes extracted as highly zonated in the intestine and liver Based on novoSpaRc's de novo reconstruction of the mammalian intestine [9] and liver [10] single cell datasets, we extracted a list of genes whose expression is localized to different layers of the two tissues.We used gene ontology (GO) enrichment analysis [11] to show that these groups of genes are enriched for distinct biological processes, many of which are consistent with the respective expression localization.
We filtered for genes whose maximum value in the sDGE is among the top 20% of all genes.We then chose the genes whose maximum spatial expression is localized at one of the two ends of the 1d tissue and whose associated Kendall Tau p-value < 0.05, to be set as the groups of genes highly expressed in either the 'top' or the 'bottom' of the tissue.We chose the genes whose maximum spatial expression is localized at either one of the middle layers of the tissue (one layer apart from the two border layers of the tissue at either side), to be set as the 'intermediate' gene group.We used the Gorilla package [11] to run GO enrichment analysis, by performing pairwise comparisons between two unranked lists of genes at a time: a target set and a background set (composed of the complement of the target set).
In the intestine, the reconstructed crypt was enriched for transcription, translation, RNA splicing, and cell cycle related genes.The middle part of the reconstructed intestinal crypt-to-villus axis was enriched for amino acid and carbohydrates transport and processing, ion transport and intestinal absorption related genes.The reconstructed tip of the villus (V6) was enriched for lipoprotein metabolic processes, catabolic processes, extracellular matrix organization, cholesterol transport and processing, and stress related processes.In the liver, the reconstructed pericentral layer was enriched for xenobiotic metabolism, fatty acid metabolism and catabolic processes, while the reconstructed periportal layer was enriched for carboxylic processes, oxidation-reduction processes and ATP biosynthesis.The full lists of enriched genes and their associated p-values and FDR q-values, for both the intestine and the liver, can be found as Supplementary Files.

Sample sizes for main text figures
Respective sample sizes for data along x-axis:

Supplementary Discussion
novoSpaRc's advantages, limitations, and direct comparison to existing reconstruction methods.novoSpaRc offers several features which cannot be exploited as a whole by existing methods: (a) it enables incorporation and interpolation of both structural information (such as the structural correspondence assumption) and a reference atlas, (b) it naturally provides probabilistic embedding of single cells onto their original spatial context, which provides a more robust reconstruction, (c) it allows to incorporate prior structural information regarding the structure of the tissue from which the cells were dissociated, (d) it does not require any tailored pre-processing steps and can utilize continuous expression data directly, (e) and finally, it is flexible in terms of its structural assumption (which can be potentially adjusted in future work) and allows to incorporate marginal information (effectively incorporating prior knowledge about different aspects such as varying local density of cells across the tissue and varying quality of sequenced single cells).
We directly compare novoSpaRc to two available spatial reconstruction methods that fully rely on a reference atlas: Seurat [1] and DistMap [2].A comparison of the intrinsic characteristics of the three approaches, as well as their corresponding reconstruction results for the BDTNP data [12], scRNA-seq data of the Drosophila [2] and zebrafish embryos [1] and the cerebellum [13] are shown in Extended Data Fig. 10.This comparative analysis is performed for varying numbers of marker genes and shows how, for the same number of marker genes, novoSpaRc generally outperforms other available methods.Both DistMap and Seurat require a large number of marker genes to reconstruct the BDTNP dataset, whereas the Pearson correlations for novoSpaRc saturate at perfect reconstruction with only 2 marker genes.novoSpaRc outperforms Seurat and DistMap in the case of the Drosophila embryo and performs comparably to them for the zebrafish embryo, while it should be stressed that DistMap and Seurat were developed and tailored for these two datasets, respectively.Finally, novoSpaRc substantially outperforms DistMap and Seurat for the reconstruction of the brain cerebellum, where both DistMap and Seurat struggle to form meaningful reconstructions.It should be noted that DistMap requires a threshold to produce the expression patterns, which is in principle unknown.We selected the threshold which maximizes the Pearson correlations, thus giving DistMap an unfair advantage in these comparisons.
It is important, however, to keep in mind novoSpaRc's limitations.novoSpaRc works by embedding the single cells into a predefined shape, and so does not allow to learn a latent representation of the data that was not used as input.In addition, as mentioned in the main text, do novo reconstruction can be achieved up to global transformations relative to symmetries of the shape of the target space.This is not a limitation specific to novoSpaRc but inherent to the problem of de novo reconstruction without additional prior information, such as marker gene data (see "De novo reconstruction is possible up to inherent symmetries of the target space" supplementary section).Finally, novoSpaRc employs an assumption about spatial gene expression (here we use the structural correspondence assumption) to reconstruct cellular locations.In general, we found the structural correspondence assumption to hold to a certain extent in all tissues and organisms we looked into so far, including highly heterogeneous and challenging tissues like the brain.We believe this hints that spatial gene expression is much more structured and informative than currently believed, and that external signaling gradients and cell-to-cell communication provide stronger signals for spatial patterning than expected.In cases where this is a weak assumption, challenged for example by complex tissues with multiple cell types or multiple domains, novoSpaRc may struggle.However, it is important to stress that novoSpaRc's flexibility allows it to employ alternative principles or assumptions that would fit different biological scenarios or incorporate diverse experimental prior information.zonated genes (10 top zonated genes towards the crypt, and 10 top zonated genes towards V6), were either independently reconstructed (based on a reference atlas) to be zonated, and/or have direct experimental support for their zonation profiles, and/or were shown to be functionally related to processes associated with their respective zonation profiles.Selection of top zonated genes is described in Methods.zonation profiles, and/or were shown to be functionally related to processes associated with their respective zonation profiles.Selection of top zonated genes is described in Methods.
Table corresponding to Fig. 2d.De novo reconstruction is possible up to inherent symmetries of the target space