1 Introduction

1.1 Background and Motivation

Automated image analysis methods are often used to extract important information from image databases. Common applications include mining information in sets of microscopy images (Loo et al. 2007), understanding the mass distribution of celestial objects from telescopic images (Shamir 2009), and analyzing facial expressions (Stegmann et al. 2003). The prevalent technique for quantifying information in such large datasets has been to reduce the entire information (described by pixel intensity values) contained in each image in the database to a numerical feature vector (e.g. size, form factor, etc.). Coupled with clustering, classification, and other machine learning techniques, this approach remains the standard method through which researchers extract quantitative information about image databases and cluster important image subgroups (Bengtsson 1999; Yang et al. 2009; Kong et al. 2009).

In addition to feature-based methods, several techniques for analyzing image databases based on explicit modeling approaches have recently emerged. Examples include contour-based models (Pincus and Theriot 2007), medial axis models (Blum et al. 1967), and model-based deconvolution methods (Gardner et al. 2010). Moreover, when analyzing images of a particular type (e.g. brain images), researchers have often used a more geometric approach, in which the entire morphological exemplar depicted in an image is viewed as a point in a suitably constructed metric space (see for example Beg et al. 2005), often facilitating visualization. These approaches have been used to characterize the statistical variation of a particular object in a given population (or set of populations) (Beg et al. 2005; Miller et al. 2009). Their main idea is to understand the variation of similar objects through analysis of the deformation fields required to warp one object (as depicted in its image) onto another.

Alternatively, transportation-based metrics have also been used to analyze image data in problems where pixel intensities (or derived quantities) can be interpreted as ‘mass’ free to move without strict geometric constraints (Rubner et al. 2000; Haker et al. 2004; Chefd’hotel and Bousquet 2007; Grauman and Darrell 2004; Wang et al. 2011). They are interesting alternatives to other methods since, when applied directly to pixel intensities, they have the potential to quantify texture and, to some extent, shape information in a combined manner (see for example Haker et al. 2004; Wang et al. 2011). In particular, there has been a continuing effort to develop fast and reliable methods for computing transportation-related distances (Orlin 1993; Ling and Okada 2007; Shirdhonkar and Jacobs 2008; Pele and Werman 2009; Angenent et al. 2003; Barrett and Prigozhin 2009; Benamou and Brenier 2000; Delzanno and Finn 2010; Haber et al. 2010). While the computational complexity of these methods ranges from quadratic to linear with respect to image size (for smooth enough images), the computations remain expensive time-wise, in particular for rough images (with large gradients), and there are issues with convergence (e.g. due to local minima in PDE-based variational implementations).

1.2 Overview of Our Approach and Contributions

Our contribution is to develop a linear framework closely related to the optimal transportation metric (OT) (Wang et al. 2011) for analyzing sets of images. This framework, which we call the linear optimal transportation (LOT), not only provides a fast way for computing a metric (and geodesics) between all pairs in a dataset, permitting one to use pixel intensities directly, but it also provides an isometric embedding for a set of images. Therefore our method takes as input a potentially large set of images and outputs an isometric embedding of the dataset (endowed with the LOT metric) onto the standard Euclidean space. Our approach achieves this task by utilizing the following series of steps:

  • Step 1: compute a template image that will serve as a reference point for analyzing the given image dataset.

  • Step 2: for each image in the input dataset, as well as the estimated template, compute a particle approximation that will enable a linear programming-based computation of the OT distance between it and the template computed in Step 1.

  • Step 3: normalize each particle approximation with respect to translation, rotation, and coordinate inversions.

  • Step 4: compute the OT distance (with quadratic cost) between the particle approximation of each image and the template.

  • Step 5: from the output of Step 4, compute the LOT distances and embedding.

As compared to previous works that make use of transportation-related metrics in image analysis, we highlight the following innovations of our approach. The first is that, given a database of M images, the number of transportation-related optimizations required for computing the distance between all pairs of images is M when utilizing our approach (versus M(M−1)/2 when utilizing other approaches). This provides a substantial increase in speed, especially when performing pattern recognition tasks (e.g. classification) in large databases. Secondly, as mentioned above, our LOT framework also provides an isometric linear embedding that has two convenient properties. One is that the embedding of an image newly added to the database can be computed exactly and with only one transportation optimization. Another is that any point in the embedded space can be visualized as an image; this includes measured points (existing images in the database) but also any other point in this space. Below we show how this embedding greatly facilitates the use of standard geometric data analysis techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) for visualizing interesting variations in a set of images.

1.3 Paper Organization

In Sect. 2 we begin by reviewing the mathematical underpinnings of the traditional optimal transportation framework and then describe its linearized version and some of its properties. Equation (1) gives the definition of the traditional OT metric, and formulas (2) and (3) the linearized version of the metric. Then in Sect. 3 we describe our computational approach in the discrete setting. This includes an algorithm for ‘estimating’ the content of a relatively sparse image with particles (full details of this algorithm are provided in the Appendix). The definition of the OT distance in the discrete setting is provided in (5), while our approximation of its linearized version in the discrete setting is given in formula (9). Equations (10) and (11) provide the isometric linear embedding for a given particle approximation according to a reference point. Section 4 describes how one can utilize the linear embedding provided in the LOT framework to extract useful information from sets of images using principal component analysis and linear discriminant analysis. The applications of LOT towards decoding subcellular protein patterns and organelle morphology, galaxy morphology as well as facial expression differences are presented in Sect. 5.

2 Optimal Transportation Framework

Transportation-based metrics are especially suited for quantifying differences between structures (depicted in quantitative images) that can be interpreted as a distribution of mass with few strict geometric constraints. More precisely, we utilize the optimal transportation (Kantorovich-Wasserstein) framework to quantify how much mass, in relative terms, is distributed in different regions of the images. We begin by describing the mathematics of the traditional OT framework, and in particular the geometry behind it, which is crucial to our subsequent introduction of the linearized OT distance.

2.1 Optimal Transportation Metric

Let Ω represent the domain (the unit square [0,1]2, for example) over which images are defined. To describe images we use the mathematical notion of a measure. While this is somewhat abstract, it enables us to treat simultaneously the situation when we consider the image to be a continuous function and when we deal with actual computations and consider the image as a discrete array of pixels. It is important for us to do so, because many notions related to optimal transportation are simpler in the continuous setting, and in particular we can give an intuitive explanation for the LOT distance we introduce. On the other hand the discrete approximation necessitates considering a more general setting because the mass (proportional to intensity) from one pixel during transport often needs to be redistributed over several pixels.

We note that, in the current version of the technique, we normalize all images in a given dataset so that the intensity of all pixels in each image sums to one. Thus we may interpret images as probability measures. The assumption is adequate for the purpose of analyzing shape and texture in the datasets analyzed in this paper. We note that OT-related distances can be used when masses are not the same (Rubner et al. 2000; Pele and Werman 2008).

Recall that probability measures are nonnegative and that the measure of the whole set Ω is 1: μ(Ω)=ν(Ω)=1. Let c:Ω×Ω→[0,∞) be the cost function. That is, c(x,y) is the ‘cost’ of transporting unit mass located at x to the location y. The optimal transportation distance measures the least possible total cost of transporting all of the mass from μ to ν. To make this precise, consider Π(μ,ν), the set of all couplings between μ and ν, that is, the set of all probability measures on Ω×Ω with first marginal μ and second marginal ν. More precisely, if π∈Π(μ,ν) then for any measurable set A⊂Ω we have π(A×Ω)=μ(A) and π(Ω×A)=ν(A). Each coupling describes a transportation plan, that is, π(A_0×A_1) tells one how much ‘mass’ originally in the set A_0 is being transported into the set A_1.

We consider optimal transportation with quadratic cost c(x,y)=|x−y|². The optimal transportation (OT) distance, also known as the Kantorovich-Wasserstein distance, is then defined by

$$ d_W(\mu,\nu) = \biggl(\inf_{\pi\in\varPi(\mu,\nu)} \int_{\varOmega \times\varOmega} |x-y|^2 d\pi \biggr)^{\frac{1}{2}} . $$
(1)

It is well known that the above infimum is attained and that the distance defined is indeed a metric (satisfying the positivity, the symmetry, and the triangle inequality requirements), see Villani (2003). We denote the set of minimizers π above by Π_OT(μ,ν).

2.2 Geometry of Optimal Transportation in Continuous Setting

The construction of the LOT metric which we introduce below can be best motivated in the continuous setting. Consider measures μ and ν which have densities α and β, that is

$$d\mu= \alpha(x) dx \quad\textrm{and}\quad d\nu= \beta(x) dx. $$

Then the following mathematical facts, available in Villani (2003), hold. The OT plan between μ and ν is unique and, furthermore, the mass from each point x is sent to a single location, given as the value of the function φ(x), called the optimal transportation map (see Fig. 1). The relation between φ and the optimal transportation plan π introduced above is that π(A_0×A_1)=μ({x∈A_0 : φ(x)∈A_1}); that is, π is concentrated on the graph of φ. We note that φ is a measure preserving map from μ to ν, that is, for any A, \(\int_{\varphi^{-1}(A)} \alpha(x) dx = \int_{A} \beta(y) dy\). The Kantorovich-Wasserstein distance is then \(d_W(\mu,\nu)^2=\int_{\varOmega} |\varphi(x)-x|^2 \alpha(x) dx\).

Fig. 1 Optimal transport map φ between measures μ and ν whose supports are outlined

In addition to being a metric space, the set of measures is formally a Riemannian manifold (do Carmo 1992), that is, at any point there is a tangent space endowed with an inner product. In particular, the tangent space at the measure σ with density γ (i.e. dσ=γ(x)dx) is the set of vector fields T_σ = {v:Ω→ℝ^d : ∫_Ω |v(x)|² γ(x)dx<∞}, and the inner product is the weighted L²:

$$\langle v_1, v_2 \rangle_\sigma= \int _\varOmega v_1(x) \cdot v_2(x) \gamma(x) dx. $$

The OT distance is just the length of the shortest curve (geodesic) connecting two measures (Benamou and Brenier 2000).

A very useful fact is that the geodesics have a form that is simple to understand. Namely, if μ_t, 0≤t≤1, is the geodesic connecting μ to ν, then μ_t is the measure obtained when mass from μ is transported by the map φ_t(x)=(1−t)x+tφ(x), that is, \(\mu_{t}(A) = \int_{\varphi_{t}^{-1}(A)} \alpha(x) dx\).
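In the discrete setting used later in the paper, this displacement interpolation simply moves particles along straight lines. The sketch below illustrates this for two small one-dimensional particle measures with equal masses, in which case the optimal quadratic-cost pairing is simply the monotone (sorted) one; the example and its variable names are illustrative only.

```python
# A minimal 1-D illustration of displacement interpolation along an OT geodesic.
# For equal-mass particles on the line with quadratic cost, the optimal map pairs
# the sorted points, so no solver is needed in this toy example.
import numpy as np

x = np.sort(np.array([0.10, 0.20, 0.70]))   # particles of mu (equal masses)
y = np.sort(np.array([0.30, 0.80, 0.90]))   # particles of nu (equal masses)

for t in (0.0, 0.5, 1.0):
    # geodesic at time t: each particle moves along x -> (1 - t) x + t phi(x)
    mu_t = (1.0 - t) * x + t * y
    print(f"t = {t:.1f}: particle locations {np.round(mu_t, 3)}")
```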

2.3 Linear Optimal Transportation Metric

Computing a pairwise OT distance matrix is expensive for large datasets. In particular, let T_OT be the time it takes to compute the OT distance and T_2 the time it takes to compute the Euclidean distance between two moderately complex images. Using the method described below, T_OT is typically on the order of tens of seconds, while T_2 is on the order of milliseconds. For a dataset with M images, computing the pairwise distances takes time on the order of M(M−1)T_OT/2. Here we introduce a version of the OT distance that is much faster to compute: computing the distance matrix takes approximately M·T_OT + M(M−1)T_2/2. For a large set of images, in which the number of pixels in each image is fixed, as the number of images in the set tends to infinity, the dominant term is M²T_2.
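As a rough, back-of-the-envelope illustration of this scaling (using assumed rather than measured timings), the snippet below compares the two cost models for a hypothetical dataset of 10,000 images.

```python
# Back-of-the-envelope comparison of the two cost models above, using assumed
# (illustrative) timings rather than measurements from this paper.
M = 10_000      # number of images (hypothetical)
T_OT = 30.0     # seconds per transportation optimization (assumed)
T_2 = 1e-3      # seconds per Euclidean distance in the embedding (assumed)

pairwise_ot = M * (M - 1) / 2 * T_OT            # ~1.5e9 s, i.e. decades
lot_cost = M * T_OT + M * (M - 1) / 2 * T_2     # ~3.5e5 s, i.e. a few days
print(f"all-pairs OT: {pairwise_ot / 86400:,.0f} days")
print(f"LOT         : {lot_cost / 86400:,.1f} days")
```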

The distance we compute is motivated by the geometric nature of the OT distance. Heuristically, instead of computing the geodesic distance on the manifold we compute a ‘projection’ of the manifold to the tangent plane at a fixed point and then compute the distances on the tangent plane. This is the main reason we use the term linear when naming the distance. To consider the projection one needs to fix a reference image σ (where the tangent plane is set). Computing the projection for each image requires computing the OT plan between the reference image σ and the given image. Once computed, the projection provides a natural linear embedding of the dataset, which we describe in Sect. 3.5.

We first describe the LOT distance in the continuous setting. Let σ be a probability measure with density γ. We introduce the identification (‘projection’), P, of the manifold with the tangent space at σ. Given a measure μ, consider the optimal transportation map ψ between σ and μ. Then

$$P(\mu) = v \quad \textrm{where } v(x) = \psi(x) - x. $$

Note that v∈T_σ and that P is a mapping from the manifold to the tangent space. See Fig. 2 for a visualization. Also P(σ)=0 and \(d_{W}(\sigma, \mu)^{2} = \int_{\varOmega}|\psi(x) - x |^{2} \gamma(x) dx = \int_{\varOmega}|v(x)|^{2} \gamma(x) dx = \langle v , v \rangle_{\sigma}= \|P(\mu) - P(\sigma)\|_{\sigma}^{2}\), where \(\|v \|_{\sigma}^{2}\) is defined to be 〈v,v〉_σ. So the mapping preserves distances to σ. Let us mention that in cartography such a projection is known as the azimuthal equidistant projection. In differential geometry this would be the inverse of the exponential map. We define the LOT as:

$$ d_{\mathit{LOT}}(\mu, \nu) = \big\| P(\mu) - P(\nu) \big\|_\sigma. $$
(2)
Fig. 2 The identification of the manifold with the tangent plane. Distances to σ as well as angles at σ are preserved

When σ is not absolutely continuous with respect to the Lebesgue measure the situation is more complicated. The purely discrete setting (when the measures are made of particles) is the one relevant for computations, and we discuss it in detail in Sect. 3.3. In the general setting, the proper extension of the LOT distance above is via the shortest generalized geodesic (as defined in Ambrosio et al. 2008) connecting μ and ν. More precisely, given a reference measure σ, and measures μ and ν, with optimal transportation plans π_μ∈Π_OT(σ,μ) and π_ν∈Π_OT(σ,ν), let Π(σ,μ,ν) be the set of all measures on the product Ω×Ω×Ω whose projection onto the first two coordinates is π_μ and whose projection onto the first and third coordinates is π_ν. Writing (x,y,z) for the σ-, μ- and ν-coordinates, respectively, the linearized version of the OT distance between μ and ν is given by:

$$ d_{\mathit{LOT}, \sigma}(\mu, \nu)^2 = \inf_{\pi\in\varPi(\sigma, \mu, \nu )} \int_{\varOmega\times\varOmega\times\varOmega} |y-z|^2 \,d\pi(x,y,z). $$
(3)

2.4 Translation and Rotation Normalization

It is important to note that the OT metric defined in (1) is not invariant under translations or rotations. It can be rendered translation invariant by simply aligning all measures in a dataset μ_1,…,μ_N so that their centers of mass are at a common coordinate. This is based on the fact that if μ is a measure on Ω with center of mass

$$x_\mu= \frac{1}{\mu(\varOmega)} \int_{\varOmega} x d\mu $$

and ν a measure of the same mass, then among all translates of the measure ν, ν_x(A)=ν(A−x), the one that minimizes the distance to μ is the one whose center of mass is x_μ.
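To see why this is the case, note the following standard calculation for the quadratic cost, sketched here for completeness. The translate ν_x has center of mass x_ν+x, where x_ν is the center of mass of ν. For any coupling π of μ and ν_x, expanding the square and using that the cross terms vanish (the marginals fix the means) gives

$$ \int_{\varOmega\times\varOmega} |u-v|^2 \,d\pi(u,v) = \int_{\varOmega\times\varOmega} \bigl| (u-x_\mu) - (v-x_\nu-x) \bigr|^2 \,d\pi(u,v) + | x_\mu - x_\nu - x |^2 . $$

Taking the infimum over couplings shows that d_W(μ,ν_x)² equals the squared OT distance between the centered measures plus |x_μ−x_ν−x|²; the first term does not depend on x, so the minimizing translate is the one whose center of mass coincides with x_μ.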

From now on, we will always assume that the measures are centered at the origin (the implementation of this normalization step is given below). The Euclidean-transformation invariant Kantorovich-Wasserstein distance is then defined in the following way. Given a measure μ and an orthogonal matrix T, define μ_T by μ_T(A)=μ(T^{−1}A). Then the invariant distance is defined by d(μ,ν)=min_{T∈O(n)} d_W(μ_T,ν). In two dimensions we have developed an algorithm for finding the minimum above. However, we know of no reasonable algorithm for computing the rotation-invariant linearized OT distance (d_LOT). Below we present an algorithm that greatly reduces the effect of rotation, but does not eliminate it completely.

3 Computing LOT Distances from Image Data

To compute the LOT distance for a set of images we need to first compute the OT distance from a template image to each of the images. There exist several approaches for computing OT distances. Perhaps the most direct idea is to discretize the problem defined in (1). This results in a linear programming problem (see (5)). While this approach is very robust and leads to the global minimizer, it is computationally expensive (the fastest strongly polynomial algorithm for this case is the Hungarian algorithm, with time complexity O(n³), where n is the number of pixels in an image). Other approaches are based on continuum ideas, where PDEs and variational techniques are often used (Angenent et al. 2003; Barrett and Prigozhin 2009; Benamou and Brenier 2000; Delzanno and Finn 2010; Haber et al. 2010). Some of them achieve close to linear scaling with the number of pixels. On the other hand, their performance can deteriorate as one considers images that lack regularity (are not smooth, have large gradients). A particular difficulty with some of the approaches that are not based on linear programming is that they may not converge to the global solution if they arrive at a local minimum first.

Our approach is based on combining linear programming with a particle approximation of the images. It is a refinement of the algorithm we used in Wang et al. (2011). That is, each image is first carefully approximated by a particle measure (a convex combination of delta masses) that has far fewer particles than there are pixels in the image. This significantly reduces the complexity of the problem. Furthermore, it takes advantage of the fact that many images we consider are relatively sparse, so fewer particles are needed for an accurate approximation of the image, which accelerates the computation. In addition, as the number of particles approaches the number of pixels in the image, the approximation error tends to zero. We now present the details of the algorithm. In particular, we describe the particle approximation, the translation and rotation normalization, and the computation of the LOT distance. The full details pertaining to the particle approximation algorithm, which is more involved, are presented in the Appendix.

3.1 Particle Approximation

The first step in our computational approach is to represent each image as a weighted combination of ‘particles’, each with mass (intensity) m_i and location x_i:

$$ \mu= \sum_{i=1}^{N_\mu}m_{i} \delta_{x_{i}} $$
(4)

with N_μ being the number of masses used to represent the measure (image) μ. Since the images are digital, each image could be represented exactly using the model above by selecting m_i to be the pixel intensity values and x_i the pixel grid coordinates. Given that images can contain a potentially large number of pixels, and that the cost of a linear programming solution for computing the OT is generally \(O(N_{\mu}^{3})\), we do not represent each image exactly and instead choose to approximate it.

The goal of the algorithm is to use at most, approximately, N particles to approximate each image, while maintaining a similar approximation error (by approximation error we mean the OT distance between the image and the particle approximation) for all images of a given dataset. For a dataset of M images, our algorithm requires the user to specify a number N for the approximate number of particles that could be used to approximate any image, and consists of these four steps:

  • Step 1: use a weighted K-means algorithm (Lloyd 1982) to approximate each image, with the number of clusters set to the chosen N. As we describe in the Appendix, the weighted K-means is an optimal local step for reducing the OT error between a given particle approximation and an image.

  • Step 2: improve the approximation of each image by investigating regions where each image is not approximated well, and add particles to those regions to improve the approximation locally. Details are provided in the Appendix.

  • Step 3: compute the OT distance between each digital image and the approximation of each image. For images not well approximated by the current algorithm (details provided in the Appendix), add particles to improve the approximation.

  • Step 4: for images where the approximation error is significantly smaller than the average error (for the entire dataset), remove (merge) particles until the error between the digital image and its approximation is more similar to that of the other images. This reduces the time to compute the distances without increasing the typical error.

The algorithm above was chosen based on mathematical observations pertaining to the OT metric. These observations, including a careful description of each of the steps above, are provided in the Appendix. We have experimented with several variations of the steps above, and through this experience we have converged on the algorithm presented. A few of the details of the algorithm could be improved further, though we view further improvements to be beyond the scope of this paper. We emphasize that this particle approximation algorithm works particularly well for relatively sparse images, and less so for non-sparse (e.g. flat) images. Below we demonstrate, however, that the algorithm can nonetheless be used to extract meaningful quantitative information for datasets of images which are not sparse.
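To make Step 1 above concrete, the sketch below turns a grayscale image into a particle measure using intensity-weighted K-means (Lloyd iterations). The refinement, error-checking, and merging steps of the full algorithm (Steps 2–4 and the Appendix) are omitted, and all function and variable names are illustrative rather than taken from our implementation.

```python
# Minimal sketch of Step 1: approximate a grayscale image by roughly N particles
# via intensity-weighted K-means. Refinement and merging steps are omitted.
import numpy as np

def particle_approximation(image, N, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(image > 0)
    coords = np.column_stack([rows, cols]).astype(float)    # pixel locations
    weights = image[rows, cols].astype(float)
    weights /= weights.sum()                                 # normalize to a probability measure

    # initialize cluster centers by sampling pixels proportionally to intensity
    centers = coords[rng.choice(len(coords), size=N, p=weights)]
    for _ in range(n_iter):
        # assign each pixel to its nearest center (quadratic ground cost)
        d2 = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each center to the intensity-weighted centroid of its pixels
        for k in range(N):
            mask = labels == k
            if mask.any():
                centers[k] = np.average(coords[mask], axis=0, weights=weights[mask])
    masses = np.bincount(labels, weights=weights, minlength=N)
    keep = masses > 0
    return centers[keep], masses[keep]    # particle locations x_i and masses m_i

# usage: locations, masses = particle_approximation(img, N=500) for a 2-D array img
```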

Translation and Rotation Normalization

In two dimensions, it suffices to consider translation, rotation, and mirror symmetry. To achieve this we simply align the center of mass of each particle approximation to a fixed coordinate, and rotate each particle approximation to a standard reference frame according to a principal axis (Hotelling) transform. Each particle approximation is then flipped left to right, and up and down, simply by reversing its coordinates, until the skewness (the third standardized moment) of the coordinates of each sample has the same sign.
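A sketch of this normalization for a single particle approximation is given below; the principal axes are taken from the mass-weighted covariance of the particle locations, and the flips follow the skewness rule just described. The names are illustrative and the sketch omits the bookkeeping used in practice.

```python
# Sketch of the translation/rotation/flip normalization for one particle set:
# x has shape (N, 2) (locations) and m has shape (N,) (masses summing to one).
import numpy as np

def normalize_particles(x, m):
    # translation: move the center of mass to the origin
    x = x - np.average(x, axis=0, weights=m)
    # rotation: align with the principal axes of the weighted covariance (Hotelling transform)
    cov = (x * m[:, None]).T @ x
    _, vecs = np.linalg.eigh(cov)
    x = x @ vecs[:, ::-1]            # columns ordered from largest to smallest variance
    # coordinate inversions: flip each axis so that the weighted skewness is nonnegative
    for d in range(x.shape[1]):
        std = np.sqrt(np.average(x[:, d] ** 2, weights=m))
        skew = np.average((x[:, d] / (std + 1e-12)) ** 3, weights=m)
        if skew < 0:
            x[:, d] = -x[:, d]
    return x
```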

3.2 OT Distance in Particle Setting

We start by explaining how the general framework of Sect. 2 applies to particle measures (such as the particle approximations obtained above). A particle probability measure μ which approximates the image is given as \(\mu= \sum_{i=1}^{N} m_{i} \delta_{x_{i}}\) where x_i∈Ω, m_i∈(0,1], and \(\sum_{i=1}^{N} m_{i} =1\).

We note that for A⊂Ω, \(\mu(A) = \sum_{i \,:\, x_{i} \in A} m_{i}\). An integral with respect to the measure μ is \(\int_{A} f(x) d\mu(x) = \sum_{i \,:\, x_{i} \in A} f(x_{i}) m_{i}\).

Now we turn to our main goal of the section: obtaining the OT distance. For \(\mu= \sum_{i=1}^{N_{\mu}} m_{i} \delta_{x_{i}}\) and \(\nu= \sum_{j=1}^{N_{\nu}} p_{j} \delta_{y_{j}}\) the set of couplings Π(μ,ν) is given by the set of N_μ×N_ν matrices f=[f_{i,j}] with nonnegative entries whose marginals are μ and ν, that is,

$$ f_{i,j} \geq0, \qquad\sum_{j=1}^{N_\nu} f_{i,j} = m_i \ \textrm{for all } i, \qquad\sum_{i=1}^{N_\mu} f_{i,j} = p_j \ \textrm{for all } j. $$

Since it is clear from the context, we will make no distinction between measures in Π(μ,ν) and matrices f=[f_{i,j}] that satisfy the conditions above.

The optimal transportation distance between μ and ν defined in (1) is then the solution to the following linear programming problem:

$$ d_{W}^2(\mu,\nu) = \min_{f \in\varPi(\mu, \nu)} \sum_{i=1}^{N_\mu } \sum _{j=1}^{N_\nu} |x_i-y_j|^2 f_{i,j}. $$
(5)

See Fig. 3 for a visualization. For convenience, we utilize Matlab’s implementation of a variation of Mehrotra’s dual interior point method (Mehrotra 1992) to solve the linear program. We note, however, that better suited alternatives, which take advantage of the special structure of the linear programming problem, exist and could be used (Orlin 1993; Rubner et al. 2000) to increase the computational efficiency of the method even further. The solution of the linear program above is relatively expensive (Wang et al. 2011), with typical computation times around 30 seconds (with 500 particles per image) on an Apple laptop with a 2.2 GHz Intel dual-core processor and 2 GB of RAM.
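For completeness, a small self-contained sketch of the linear program (5) using a generic solver (here scipy’s linprog rather than the Matlab routine mentioned above) is shown below. It is intended only as an illustration: the names are ours, and for realistic particle counts a solver specialized to the transportation structure would be preferable, as noted above.

```python
# Sketch of the linear program (5) for two particle measures, solved with a
# generic LP solver; specialized transportation solvers scale much better.
import numpy as np
from scipy.optimize import linprog

def ot_distance(x, m, y, p):
    """x:(Nmu,2), m:(Nmu,), y:(Nnu,2), p:(Nnu,); m and p each sum to one."""
    Nmu, Nnu = len(m), len(p)
    # cost vector: squared Euclidean distances |x_i - y_j|^2, flattened row-major
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2).ravel()
    # marginal constraints: sum_j f_ij = m_i and sum_i f_ij = p_j
    A_eq = np.zeros((Nmu + Nnu, Nmu * Nnu))
    for i in range(Nmu):
        A_eq[i, i * Nnu:(i + 1) * Nnu] = 1.0
    for j in range(Nnu):
        A_eq[Nmu + j, j::Nnu] = 1.0
    b_eq = np.concatenate([m, p])
    res = linprog(C, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    f = res.x.reshape(Nmu, Nnu)      # optimal transportation plan f_{i,j}
    return np.sqrt(res.fun), f       # d_W(mu, nu) and the plan

# toy usage:
# d, f = ot_distance(np.array([[0., 0.], [1., 0.]]), np.array([0.5, 0.5]),
#                    np.array([[0., 1.], [1., 1.]]), np.array([0.5, 0.5]))
```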

Fig. 3 When measures are discrete as in (4), finding the transportation plan which minimizes the transportation cost (5) necessitates splitting the particles. The arrows are drawn only for positive f_{i,j}

For datasets containing tens of thousands of images this is practical only if we need to compute a single OT distance per image (the one to the reference image), and is prohibitive if we need the pairwise distance between all images. This is the reason we introduce the linear optimal transportation framework to compute a related distance between μ and ν.

3.3 LOT Distance in Particle Setting

Computing the linear optimal transportation (LOT) distance requires setting a reference template σ, which we also choose to be a particle measure \(\sigma= \sum_{k=1}^{N_{\sigma}} q_{k} \delta_{z_{k}}\) (how the reference template is computed is described in Sect. 3.4). The distance between μ and ν given by (3) in a discrete setting becomes:

$$ d_{\mathit{LOT},\sigma}(\mu, \nu)^2 = \min_{f, g, h} \sum_{k=1}^{N_\sigma} \sum_{i=1}^{N_\mu} \sum_{j=1}^{N_\nu} h_{k,i,j} |x_i - y_j|^2 $$
(6)

where the minimum is taken over \(f \in\varPi_{\mathit{OT}}(\sigma, \mu)\), \(g \in\varPi_{\mathit{OT}}(\sigma, \nu)\), and all nonnegative \(h_{k,i,j}\) satisfying \(\sum_{j} h_{k,i,j} = f_{k,i}\) and \(\sum_{i} h_{k,i,j} = g_{k,j}\).

We remark that the sets of optimal transportation plans, Π_OT(σ,μ) and Π_OT(σ,ν), typically have only one element, so there is only one possibility for f and g. See Fig. 4 for a visualization. In the case of particle measures, however, there is usually more than one possibility for h.

Fig. 4 Given f and g in (6), for each fixed k, h_{k,i,j} gives the optimal transportation plan between the ‘f-image’ and the ‘g-image’ of the particle at z_k. Arrows are drawn only for positive coefficients, and only for the particle at z_k

We now introduce the ‘distance’ which is an approximation of the one above (hence we denote it d_aLOT) and which is used in most computations in this paper. Namely, given OT plans f and g, as indicated in Fig. 5, we replace the f-image of the particle at z_k, namely the measure \(\sum_{i} f_{k,i} \delta_{x_{i}}\), by one particle at its center of mass, namely the measure \(q_{k} \delta_{\bar{x}_{k}}\). We recall from Sect. 2.2 that if σ, μ and ν were measures which had a density function (and hence no particles) then there would exist an optimal transportation map and the image of any z would be just a single x. On the discrete level we have an approximation of that situation, and typically the f-image of the particle at z_k is spread over a few nearby particles. Thus the error made by replacing the f-image by a single particle at the image’s center of mass is typically small.

Fig. 5 For the LOT distance we replace the full f-image and the g-image of the particle at z_k by their centers of mass \(\bar{x}_{k}\) and \(\bar{y}_{k}\), (7). When there are many particles the images of the particle at z_k tend to be concentrated on a few nearby particles. Thus the error introduced by replacing them by their centers of mass is small

To precisely define the new distance, let f be an optimal transportation plan between \(\sigma= \sum_{k=1}^{N_{\sigma}} q_{k} \delta_{z_{k}}\) and \(\mu= \sum_{i=1}^{N_{\mu}} m_{i} \delta_{x_{i}}\) obtained in (5), and let g be an optimal transportation plan between \(\sigma= \sum_{k=1}^{N_{\sigma}} q_{k} \delta_{z_{k}}\) and \(\nu= \sum_{j=1}^{N_{\nu}} p_{j} \delta_{y_{j}}\). Then

$$ \bar{x}_k = \frac{1}{q_k} \sum _{i=1}^{N_\mu} f_{k,i} x_i \quad\textrm{and}\quad \bar{y}_k = \frac{1}{q_k} \sum_{j=1}^{N_\nu} g_{k,j} y_j $$
(7)

are the centroids of the forward images of the particle \(q_{k} \delta_{z_{k}}\) by the transportation plans f and g, respectively (see Fig. 5). We define

$$ d_{a\mathit{LOT},\sigma}(\mu, \nu)^2 =\min_{\substack{f \in\varPi _{\mathit{OT}}(\sigma, \mu) \\ g \in\varPi_{\mathit{OT}}(\sigma, \nu)}} \sum_{k=1}^{N_\sigma} q_k | \bar{x}_k - \bar{y}_k|^2 $$
(8)

We clarify that computing (8) in practice does not require a minimization over f and g, since f and g are unique with probability one. The reason is that, in the discrete case, the condition for optimality of transportation plans can be formulated in terms of cyclic monotonicity (see Villani 2009). Nonuniqueness of the OT plan can only happen if the inequalities in some cyclic monotonicity conditions become equalities, which means that the coordinates of the particles satisfy an algebraic equation, and this can only happen with probability zero. For example, if both measures have exactly two particles of the same mass then the condition for nonuniqueness is that |x_1−y_1|²+|x_2−y_2|²=|x_2−y_1|²+|x_1−y_2|².

Thus we compute

$$ d_{a\mathit{LOT},\sigma}(\mu, \nu)^2 = \sum _{k=1}^{N_\sigma} q_k | \bar{x}_k - \bar{y}_k|^2. $$
(9)

One should notice that the d_aLOT,σ distance between two particle measures may be zero. This is related to the ‘resolution’ achieved by the base measure σ. In particular, if σ has more particles then it is less likely that d_aLOT,σ(μ,ν)=0 for μ≠ν. It is worth observing that d_aLOT,σ(μ,ν)² computed in (9) is always less than or equal to d_LOT,σ(μ,ν)² introduced in (6).

Furthermore if we consider measures with continuous density σ, μ, and ν and approximate them by particle measures σ n , μ n , and ν n , then as the number of particles goes to infinity \(d_{\mathit{LOT}, \sigma_{n}}(\mu_{n}, \nu_{n}) \to d_{\mathit{LOT}, \sigma}(\mu, \nu)\) where the latter object is defined in (2). This follows from the stability of optimal transport (Villani 2009).

Let us also remark that, while we use the linear programming problem as described in (5) to compute the optimal transportation plan, other methods can be used as well. In particular the LOT distance is applicable even when a different approach to computing the OT plan is used.
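Assuming the optimal plans f and g to the template have already been computed (for instance with a solver such as the one sketched in Sect. 3.2), the computation of (7) and (9) reduces to a few lines; the sketch below uses illustrative names and plain numpy.

```python
# Sketch of (7) and (9): given plans f (sigma -> mu) and g (sigma -> nu) and the
# particle locations, compute the centroids x_bar, y_bar and the d_aLOT distance.
import numpy as np

def alot_distance(q, f, x, g, y):
    """q:(Ns,) template masses; f:(Ns,Nmu), g:(Ns,Nnu) OT plans; x:(Nmu,2), y:(Nnu,2)."""
    x_bar = (f @ x) / q[:, None]   # centroids of the f-images of the template particles, (7)
    y_bar = (g @ y) / q[:, None]   # centroids of the g-images of the template particles, (7)
    return np.sqrt(np.sum(q * np.sum((x_bar - y_bar) ** 2, axis=1)))    # (9)
```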

3.4 Template Selection

Selecting the template (reference) σ above is important since, in our experience, a template that is distant from all images (particle approximations) is likely to increase the difference between d_W and d_aLOT. In the results presented below, we use a reference template computed from the ‘average’ image. To compute this average image, we first align all images to remove translation, rotation, and horizontal and vertical flips, as described in the translation and rotation normalization procedure above. All images are then averaged, and the particle approximation algorithm described above is used to obtain a particle representation of the average image. We denote this template by σ and note that it contains N_σ particles. Once the particle approximation for the template is computed, it is also normalized to a standard position and orientation as described above.

3.5 Isometric Linear Embedding

When applying our approach to a given set of images (measures) I_1,…,I_M, we first compute the particle approximation of each image with the algorithm described in Sect. 3.1. We then compute a template σ as described in Sect. 3.4 and compute the OT distance (5) between the template σ (also chosen to be a particle measure) and the particle approximation of each image. Once these are computed, the LOT distance between any two particle sets is given by (9).

The lower bound d_aLOT,σ of the linear optimal transportation distance, defined in (9), provides a method to map a sample measure ν_n (estimated from image I_n) into a linear space. Let \(\nu_{n} = \sum_{j=1}^{N_{\nu_{n}}} m_{j} \delta_{y_{j}}\), and recall that the reference measure is \(\sigma= \sum_{k=1}^{N_{\sigma}} q_{k} \delta_{z_{k}}\). The linear embedding is obtained by applying the discrete transportation map between the reference measure σ and ν_n to the coordinates y_j via

$$ \mathbf{ x}_n = \bigl( \sqrt{q_1} a^1_n \cdots\sqrt {q_{N_\sigma}} a^{N_\sigma}_n \bigr)^T $$
(10)

where \(a^k_n\) is the centroid of the forward image of the particle \(q_{k} \delta_{z_{k}}\) by the optimal transportation plan g_{k,j} between σ and ν_n:

$$ a^k_n = \sum _{j=1}^{N_{\nu_n}} g_{k,j} y_j / q_k $$
(11)

This results in an N_σ-tuple of points in ℝ², which we call the linear embedding x_n of ν_n. That is, \(\mathbf{ x}_{n} \in\mathbb{R}^{N_{\sigma}\times2}\). We note that the embedding is interpretable in the sense that any point in this space can be visualized by simply plotting its vector coordinates (each in ℝ²) in the image space Ω.
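In code, the embedding of (10)–(11) amounts to weighting the centroids by √q_k and flattening; Euclidean distances between the resulting vectors then reproduce the d_aLOT distances of (9). The sketch below makes the same naming assumptions as the one in Sect. 3.3.

```python
# Sketch of the embedding (10)-(11): each image becomes the sqrt(q_k)-weighted
# centroids of its transportation plan to the template, flattened into a vector.
import numpy as np

def lot_embedding(q, g, y):
    """q:(Ns,) template masses; g:(Ns,Nnu) plan to the template; y:(Nnu,2) particle locations."""
    a = (g @ y) / q[:, None]                    # centroids a_n^k of (11)
    return (np.sqrt(q)[:, None] * a).ravel()    # x_n of (10), a vector in R^(2*Ns)

# With this embedding, np.linalg.norm(lot_embedding(q, f, x) - lot_embedding(q, g, y))
# equals the d_aLOT distance of (9) between the two images.
```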

4 Statistical Analysis

When a linear embedding for the data can be assumed, standard geometric data processing techniques such as principal component analysis (PCA) can be used to extract and visualize major trends of variation in morphology (Blum et al. 1967; Rueckert et al. 2003; Vaillant et al. 2004). Briefly, given a set of data points x_n, for n=1,…,M, we can compute the covariance matrix \(S = \frac{1}{M}\sum_{n} (\mathbf{ x}_{n}-\bar{\mathbf{ x}})(\mathbf{ x}_{n}-\bar{\mathbf{ x}})^{T}\), with \(\bar{\mathbf{ x}}=\frac {1}{M}\sum_{n=1}^{M}\mathbf{ x}_{n}\) representing the center of the entire dataset. PCA is a method for computing the major trends (directions over which the projection of the data has the largest variance) of a dataset via the solution of the following optimization problem:

$$ \mathbf{w}^*_{\mathit{PCA}} = \arg\max_{\Vert \mathbf{w}\Vert =1} \mathbf{ w}^T S \mathbf{ w} $$
(12)

The problem above can be solved via eigenvalue decomposition of S, with each eigenvector corresponding to a major mode of variation in the dataset (its corresponding eigenvalue being the variance of the data projected onto that direction).
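A compact sketch of this step on the embedded vectors, including mapping a point along one mode back to particle coordinates so that it can be rendered as an image, is given below (illustrative names; plotting is omitted).

```python
# Sketch of PCA on the LOT embeddings X of shape (M, 2*Ns), and of sampling a
# point along one mode of variation back in particle coordinates.
import numpy as np

def pca_modes(X, n_modes=3):
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    S = Xc.T @ Xc / X.shape[0]                  # covariance matrix as in Sect. 4
    vals, vecs = np.linalg.eigh(S)              # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_modes]    # keep the largest modes
    return x_mean, vals[order], vecs[:, order]

def sample_along_mode(x_mean, var, w, q, t):
    # point at t standard deviations along mode w, mapped back to particle
    # locations by undoing the sqrt(q_k) weighting of (10); masses remain q_k
    point = x_mean + t * np.sqrt(var) * w
    return point.reshape(-1, 2) / np.sqrt(q)[:, None]
```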

In addition to visualizing the main modes of variation for a given dataset, important applications involve visualizing the modes of variation that best discriminate between two or more separate populations (e.g. as in control vs. effect studies). To that end, we apply the methodology we developed in Wang et al. (2011), based on the well known Fisher linear discriminant analysis (FLDA) technique (Fisher 1936). As explained in Wang et al. (2011), simply applying the FLDA technique in morphometry visualization problems can lead to erroneous interpretations, since the directions computed by the FLDA technique are not constrained to pass through the data. To alleviate this effect we modified the method as follows. Briefly, given a set of data points x_n, for n=1,…,M, with each index n belonging to a class c, we modified the original FLDA by adding a least squares-type representation penalty to the function to be optimized. The representation-constrained optimization can then be reduced to

$$ \mathbf{w}^*_{\mathit{LDA}} = \arg\max_{\mathbf{w}} \frac{\mathbf{ w}^TS_T\mathbf{ w} }{\mathbf{ w}^T ( S_W + \alpha\mathbf{ I} ) \mathbf{ w}} $$
(13)

where \(S_{T}=\sum_{n} (\mathbf{ x}_{n}-\bar{\mathbf{ x}})(\mathbf{ x}_{n}-\bar{\mathbf{ x}})^{T}\) represents the ‘total scatter matrix’, \(S_{W}=\sum_{c}\sum_{n\in c}(\mathbf{ x}_{n}-\bar{\mathbf{ x}}_{c})(\mathbf{ x}_{n}-\bar{\mathbf{ x}}_{c})^{T}\) represents the ‘within class scatter matrix’, and \(\bar{\mathbf{ x}}_{c}\) is the center of class c. The solution of the problem above is given by the well-known generalized eigenvalue decomposition S_T w=λ(S_W+α I)w (Bishop 2006). In short, the penalized LDA method above seeks a direction w that both has a low reconstruction error and best discriminates between the two populations. We use the method described in Wang et al. (2011) to select the penalty weight α.
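In practice, the penalized direction of (13) can be obtained from the generalized eigenvalue problem stated above. A short sketch is given below; labels gives the class of each embedded sample, the names are illustrative, and the choice of α follows Wang et al. (2011).

```python
# Sketch of the penalized LDA direction (13): solve S_T w = lambda (S_W + alpha I) w
# and keep the eigenvector with the largest generalized eigenvalue.
import numpy as np
from scipy.linalg import eigh

def penalized_lda_direction(X, labels, alpha):
    """X:(M, d) embedded samples; labels:(M,) class of each sample; alpha > 0 penalty weight."""
    x_mean = X.mean(axis=0)
    S_T = (X - x_mean).T @ (X - x_mean)                  # total scatter matrix
    S_W = np.zeros_like(S_T)
    for c in np.unique(labels):
        Xc = X[labels == c] - X[labels == c].mean(axis=0)
        S_W += Xc.T @ Xc                                 # within-class scatter matrix
    vals, vecs = eigh(S_T, S_W + alpha * np.eye(S_T.shape[0]))
    w = vecs[:, np.argmax(vals)]                         # most discriminating direction
    return w / np.linalg.norm(w)
```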

5 Computational Results

In this section, we describe results of applying the LOT method to quantify the statistical variation of three types of datasets. We analyze (sub)cellular protein patterns and organelles from microscopy images, visualize the variation of shape and brightness in a galaxy image database, and characterize the variation of expressions in a facial image database. We begin by introducing the datasets used. For the cellular image datasets, we then show results that (1) evaluate the particle approximation algorithm described earlier, (2) evaluate how well the LOT distance approximates the OT distance, (3) evaluate the discrimination power of the LOT distance (in comparison to more traditional feature-based methods), and (4) use the linear embedding provided by the LOT to visualize the geometry (summarizing trends), including discrimination properties. For both the galaxy and facial images, we evaluate the performance of the particle approximation and the discrimination power of the LOT metric, and visualize the discriminating information. For the facial image dataset, we also visualize the summarizing trends of variation of facial expressions. In the case of facial expressions, the results show that LOT obtains a quantification of expression similar to that of the original paper (Stegmann et al. 2003), yet LOT has the distinct advantage of being landmark free.

5.1 Datasets and Pre-processing

The biological imaging dataset has two sub-datasets. The nuclear chromatin patterns were extracted from histopathology images obtained from tissue sections. Tissue for imaging was obtained from the tissue archives of the Children’s Hospital of Pittsburgh of the University of Pittsburgh Medical Center. The extraction process included a semi-automatic level set-based segmentation, together with a color to gray scale conversion and pixel normalization. The complete details related to this dataset are available in our previous work (Wang et al. 2010, 2011). The dataset we use in this paper consists of five human cases of liver hepatoblastoma (HB), a liver tumor in children, each containing adjacent normal liver tissue (NL). In total, 100 nuclei were extracted from each case (50 NL, 50 HB). The second sub-dataset consists of fluorescence images of two golgi proteins: giantin and GPP130. The dataset is described in detail in Boland and Murphy (2001). In total, we utilize 81 GPP130 and 66 giantin protein patterns. The galaxy dataset we use contains two types of galaxy images: 225 Elliptical galaxies and 223 Spiral galaxies. The dataset is described in Shamir (2009). The facial expression dataset is the same as described in Stegmann et al. (2003). We use only the normal and smiling expressions, with each expression containing 40 images. Sample images for the nuclear chromatin, golgi, galaxy and facial expression datasets are shown in Fig. 6(A), (B), (C) and (D), respectively.

Fig. 6 Sample images from the four datasets. (A): Sample images from the liver nuclear chromatin dataset. (B): Sample images from the Golgi protein dataset. (C): Sample images from the galaxy image dataset. (D): Sample images from the facial expression dataset

5.2 Particle Approximation Error

Here we quantify how well the particle-based algorithm described above approximates the different datasets used in this paper. Rather than an absolute figure, the quantity of interest is the OT error between each image and its particle approximation, relative to the OT distances present in the dataset. Figure 7(A) gives an indication of this relative error (in percent) for the particle approximation of each image in each dataset. For each image in a dataset, we compute the error \(e_{i} = \frac{\varphi_{i}}{\frac {1}{M}\sum_{j=1,\ldots,M} d_{i,j}}\), where d_{i,j} is the OT distance computed through (5), and φ_i is the upper bound on the OT distance between the particle approximation of image i and the image i itself (defined in (16) in the Appendix). The computation is repeated for M images chosen randomly from a given dataset. Figure 7(A) contains the histogram of the errors for all the datasets. The x-axis represents the relative error in percentage, and the y-axis represents the portion of samples that has the corresponding relative error. Though, by definition, we can always reduce the approximation errors by adding more particles, considering the computational cost we use 500 initial particles for the nuclei, golgi and galaxy datasets, and 2000 initial particles for the face dataset. The data show that the average errors were 14.1 % for the nuclear chromatin dataset, 3.2 % for the golgi protein dataset, 13.1 % for the galaxy image dataset and 18.5 % for the face dataset.

Fig. 7 Histograms of particle approximation error and LOT error. (A): The histogram of relative error generated during the particle approximation process. The x-axis represents the relative error in percentage, and the y-axis represents the portion of samples that has the corresponding relative error. (B): The histogram of relative errors between LOT and OT for the nuclei and golgi datasets, with axes as in (A). The dotted line represents the relative errors for the liver nuclear chromatin dataset, and the solid line represents the relative errors for the Golgi dataset

5.3 Difference Between OT and LOT

We compare the LOT distance given in (9) with the OT distance defined in (5). To that end, we compute the full OT distance matrix (all pairwise distances) for both biological image datasets and compare it to the LOT distance via \(e_{\mathit{OT},\mathit{LOT}} = \frac{d_{\mathit{OT}}-d_{\mathit{LOT}}}{d_{\mathit{OT}}}\). Figure 7(B) contains the relative difference for the nuclear chromatin and golgi datasets. The average absolute difference is e_OT,LOT = 2.94 % for the nuclear chromatin data and e_OT,LOT = 3.28 % for the golgi dataset.

5.4 Discrimination Power

In Wang et al. (2011), we have shown that the OT metric can be used to capture the nuclear morphological information to classify different classes of liver and thyroid cancers. We now show that the linear OT approximation we propose above can be used for discrimination purposes with little or no loss in accuracy when compared to traditional feature-based approaches. To test classification accuracy we use a standard implementation of the support vector machine method (Bishop 2006) with a radial basis function (RBF) kernel.

For the nuclear chromatin in the liver cancer dataset we followed the same leave-one-case-out cross-validation approach described in Wang et al. (2011). The classification results are listed in Table 1. For comparison, we also computed classification results using a numerical feature approach where the same 125 numerical features (including shape parameters, Haralick features, and multi-resolution-type features) as described in Wang et al. (2011) were used. Stepwise discriminant analysis was used to select the significant features for classification (12 of them were consequently selected). In addition, we also provide classification accuracies utilizing the method described in Pele and Werman (2009), which we denote here as EMD-L1. The EMD-L1 in this case was applied to measure the L_1-weighted earth mover’s distance (EMD) between image pairs. The radius above which pixel displacements were not computed was set to 15 for fast computation. When using the EMD-L1 method, each image was downsampled by a factor of four, after Gaussian filtering, to allow the computations to be performed in a reasonable time frame (see the discussion section for a comparison and discussion of computation times).

Table 1 Average classification accuracy in liver data

Classification results are shown in Table 1. We note that because we used a kernel-based SVM method, the only difference between all implementations were the pairwise distances computed (OT vs. LOT vs. EMD-L1 vs. feature-based). In Table 1, each row corresponds to a test case, and the numbers correspond to the average classification accuracy (per nucleus) for normal liver and liver cancer. The first column contains the classification accuracy for the feature-based approach, the second column the accuracy for the OT metric, the third column the accuracy for the LOT metric, and the final column contains the computations utilizing the EMD-L1 approach (Pele and Werman 2009). We followed a similar approach for comparing classification accuracies in the golgi protein dataset. In this case we utilized the feature set described in Boland and Murphy (2001). This feature set was specifically designed for classification tasks of this type. A five fold cross validation strategy was used. Results are shown in Tables 2 and 3.

Table 2 Average classification accuracy in golgi protein data
Table 3 Average classification accuracy for EMD-L1 in golgi protein data

We also computed the LOT metric for the galaxy and facial expression datasets and applied the same support vector machine (SVM) strategy. In the galaxy case we utilized the feature set described in Shamir (2009). A five-fold cross-validation strategy was used. Compared with the accuracy of the feature-based approach (93.6 %), the classification accuracy of the LOT metric was 87.7 %. For the facial expression dataset, the classification was performed using an SVM with the LOT metric, and a 90 % classification accuracy was obtained.

5.5 Visualizing Summarizing Trends and Discriminating Information

We applied the PCA technique as described in Sect. 4 to the nuclei and facial expression datasets. The first three modes of variation for the nuclear chromatin dataset (containing both normal and cancerous cells) are displayed in Fig. 8(A). The first three modes of variation for the facial expression dataset are shown in Fig. 8(B). For both datasets, the first three modes of variation correspond to over 99 % of the variance of the data. The modes of variation are displayed from the mean to plus and minus four times the standard deviation of each corresponding mode. For the chromatin dataset we can visually detect size (first mode), elongation (second mode), and differences in chromatin concentration from the center to the periphery of the nucleus (third mode). In the face dataset, since we did not normalize for the size of the face, one can detect variations in size (first mode), facial hair texture (second mode), as well as face shape (third mode). The variations detected from the second and third modes are similar to the results reported in Stegmann et al. (2003).

Fig. 8 The modes of variation given by the PCA method combined with the LOT framework, for the nuclear chromatin (A) and facial expression (B) datasets. Each row corresponds to a mode of variation, starting from the first PCA mode (top)

We also applied the method described in Sect. 4 to visualize the most significant differences between the two classes in each dataset. The most discriminating modes for the nuclear chromatin, golgi, galaxy and facial expression datasets are shown in Fig. 9(A)–(D), respectively. In the nuclear chromatin case, the normal tissue cells tend to look more like the images on the left, while the cancerous ones tend to look more like the images on the right. In this case we can see that the most discriminating effect seems to be related to differences in chromatin placement (near the periphery vs. concentrated at the center). The p value for this direction was 0.019. For the golgi protein dataset, the discriminating direction (p=0.049) shows that the giantin golgi protein tends to be more scattered than the GPP130 protein, which tends to be more elongated in its spatial distribution. For the galaxy dataset, the discriminant direction (p=0.021) seems to vary from a spiral structure to a bright dense disk. For the facial expression dataset, the discriminating direction (p=0.011) shows a trend from a smiling face to a serious or neutral facial expression.

Fig. 9 Modes of discrimination computed using the penalized LDA method in combination with the LOT framework. Each row contains the mode of variation that best discriminates the two classes in each dataset. Parts (A), (B), (C), (D) refer to discrimination modes in the nuclear morphology, golgi protein, galaxy morphology, and facial expression datasets, respectively

6 Summary and Discussion

We described a new method for quantifying and visualizing morphological differences in a set of images. Our approach, called the LOT, is based on the optimal transportation framework with quadratic cost, and we utilize a linearized approximation based on a tangent space representation to make the method amenable to large datasets. Our current implementation is based on a discrete ‘particle’ approximation of each pattern, with a subsequent linear programming optimization, and is therefore best suited for images which are sparse. We evaluated the efficacy of our method in several aspects. As our results show, the error between an image and its particle approximation, relative to the distances in the dataset, is relatively small for a moderate number of particles when the images are sparse (e.g. golgi protein images), and can be made small for general images if more time is allowed for computation. We also showed that the LOT distance closely approximates the standard OT distance between particle measures, with absolute percentage errors on the order of a few percent in the datasets we used. In experiments not shown, we also evaluated the reproducibility of our particle-based LOT computation with respect to the random particle initializations required by the algorithm. These experiments revealed that the average coefficient of variation was on the order of a couple of percent. Finally, we also evaluated how well the LOT distance described can perform in classifying sets of images in comparison to standard feature-based methods, the traditional OT distance (Wang et al. 2011), and the EMD-L1 method of Pele and Werman (2009). Overall, considering all classification tests, no significant loss of accuracy was detected.

A major advantage of the LOT framework is the reduced cost of computing the LOT distance between all image pairs in a dataset. For a database of M images, Sect. 2.3 explains that the number of transportation-related optimization problems that need to be solved for computing all pairwise distances is M. In comparison, to our knowledge, the number of transportation-related optimizations necessary for this purpose with all other available methods is M(M−1)/2. In concrete terms, if all computations were to be performed using a single Apple laptop computer with 2 GB of RAM and a 2.2 GHz dual-core processor, the total computation time for producing all pairwise distances for computing the results shown in Table 1 of our paper, for example, would be approximately 4.1 hours for the LOT method, 15.6 hours for the EMD-L1 method (Pele and Werman 2009), and 1200 hours for the OT method described in Wang et al. (2011). We clarify that in order to utilize the EMD-L1 method, images had to be downsampled to allow the computations to be performed in reasonable times. The time quoted above (and the results shown in Tables 1, 2 and 3) was computed by downsampling each image by a factor of four (after spatial filtering). We clarify again, however, that we have used a generic linear programming package for solving the OT optimizations associated with our LOT framework. The computation times for the LOT framework could be decreased further by utilizing linear programming techniques more suitable for this problem.

In addition to fast computation, a significant innovation of the LOT approach is the convenient isometric linear embedding it provides. In contrast to other methods that also obtain linear embeddings, see for example Roweis and Saul (2000), Tenenbaum et al. (2000), the embedding LOT space allows for one to synthesize and visualize any point in the linear space as an image. Therefore the LOT embedding can facilitate visualization of the data distribution (via PCA) as well as differences between distributions (via penalized LDA). Hence, it allows for the automatic visualization of differences in both shape and texture in different classes of images. In biological applications, such meaningful visualizations can be important in helping elucidate effects and mechanisms automatically from the image data. For example, the results with the nuclear chromatin dataset suggest that, in this disease, cancer cells tend to have their chromatin more evenly spread (euchromatin) throughout the nucleus, whereas normal cells tend to have their chromatin in more compact form (heterochromatin). This suggests that the loss of heterochromatin is associated with the cancer phenotype in this disease. This finding is corroborated by earlier studies (Wang et al. 2010, 2011) (which used different methodology), as well as other literature which suggests that cancer progression is associated with loss of heterochromatin (Moss and Wallrath 2007; Dialynas et al. 2008).

We note that the use of the quadratic cost in (1) allows for a Riemannian geometric interpretation of the space of images. With this point of view, the LOT framework we describe can be viewed as a tangent plane approximation of the OT manifold. Thus the linear embedding we mentioned above can be viewed as a projection onto the tangent plane. While computing theoretical bounds for the difference between the OT and LOT distances is difficult, since it would depend on the curvature of the OT manifold (which we have no way of estimating for a real distribution of images), several important facts about this approximation are worth noting. Firstly, we note that the LOT is a proper mathematical distance on its own, even when the tangent plane approximation is poor. Secondly, the projection is a one-to-one mapping, so information cannot be lost by projecting two images onto the same point. Finally, we note that utilizing the LOT distances (instead of the OT ones) results in no significant loss in classification accuracy. This suggests that, for these datasets, there is no evidence to prefer the OT space over the LOT one.

As mentioned above, the LOT embedding obtained via the tangent space approximation can facilitate the application of geometric data analysis methods such as PCA to enable one to visualize interesting variations in texture and shape in the dataset. As compared to applying the PCA method directly on pixel intensities (data not shown for brevity), the PCA method applied with the use of the LOT method yields sharper and easier to interpret variations, since it can account for both texture and shape variations via the transportation of mass. As already noted above, several other graph-based methods also exist for the same purpose (Roweis and Saul 2000; Tenenbaum et al. 2000). However, in contrast to our LOT method, these are not analytic, meaning that only measured images can be visualized. Moreover, the LOT distances we provide here can be used as ‘local’ distances, based on which such graphs can be constructed. Thus both classes of methods could be used jointly, for even better performance (one example is provided in Rohde et al. 2008).

The LOT framework presented here relies on a ‘particle’ approximation of the input images. The advantage of such a particle approximation is that the underlying OT optimization becomes a linear program whose global optimum can be found exactly with standard methods. Drawbacks include the fact that such an approach can become computationally intensive for non-sparse images. We investigated the average classification accuracy for the experiments described in the previous section as a function of the number of particles used to approximate each image. For the experiment involving nuclear chromatin patterns, for example, the average classification accuracy for the LOT method using 300, 500, and 900 particles was 79 %, 83.3 %, and 83.4 %, respectively. We note that for the golgi protein dataset, utilizing the LOT with 100 particles produced nearly the same accuracy as the ones reported in Table 2. These results, together with the already provided comparison to other methods, suggest the particle approximation was sufficient to classify these datasets.

Finally, we wish to point out several limitations inherent in the approach we describe here. First, as already noted, the particle approximation is best able to handle images which are sparse. For images that are not sparse, many more particles are needed to approximate the image well, thus significantly increasing the computational cost of the method. In addition, like all transportation-related distances, our approach is not able to handle transport problems which include boundaries. Instead, the approach is best suited for analyzing images whose intensities can be viewed as a distribution of mass that is ‘free’ to move about the image. We also note that transportation-type distances are not good at telling the difference between smooth and punctate patterns; for this specific task, other approaches (such as Fourier analysis, for example) may be more fruitful. Finally, we note that for some types of images the traditional OT and the LOT can yield very different distances. This could occur when the images to be compared have very sharp transitions, of high frequency, and the phase of these transitions is mismatched. We note that this is rarely the case in the images we analyze in this paper. These and other aspects of our current LOT framework will be the subject of future work.