
1 Introduction

In traditional literature on pattern recognition and machine learning, the so-called perceptron, introduced by Rosenblatt [21], has been the dominant model of neuronal computation. A neuron is a computational unit whose activation is a “multiply-accumulate” product of the input and a set of associated synaptic weights, optionally fed through a non-linearity. This model has been challenged, in terms of both biological and mathematical plausibility, by the morphological paradigm, which is widely used in computer vision and related disciplines. The latter has lately attracted stronger interest from researchers in computational intelligence, motivating further theoretical and practical advances in morphological neural networks, even though learning methods based on lattice algebra and mathematical morphology can be traced back at least to the 90s (e.g. [6, 19]).

In this paper, we revisit the model of the morphological perceptron [24] in Sect. 3 and relate it to the framework of \((\max , +)\) and \((\min , +)\) algebras. In Sect. 3.1, we investigate its potential as a classifier, providing some fundamental geometric insight. We present a training algorithm for binary classification that uses the Convex-Concave Procedure, along with a more robust variant utilizing a simple form of outlier ablation. We also consider more general models such as maxout activations [11], relating the number of linear regions of a maxout unit with the Newton Polytope of its activation function, in Sect. 4. Finally, in Sect. 5, we present some experimental results pertinent to the efficiency of our proposed algorithm and provide some insight on the use of morphological layers in multilayer architectures.

We briefly describe the notation that we use. Denoting by \(\mathbb {R}\) the line of real numbers, \((-\infty , \infty )\), let \(\mathbb {R}_{\max } = \mathbb {R} \cup \{-\infty \}\) and \(\mathbb {R}_{\min } = \mathbb {R} \cup \{+\infty \}\). We use lowercase symbols for scalars (like x), lowercase symbols in boldface for vectors (like \({\varvec{w}}\)) and uppercase symbols in boldface for matrices (like \({\varvec{A}}\)). Vectors are assumed to be column vectors, unless explicitly stated otherwise.

We will focus on the \((\max , +)\) semiring, which is the semiring with underlying set \(\mathbb {R}_{\max }\), using \(\max \) as its binary “addition” and \(+\) as its binary “multiplication”. We may also refer to the \((\min , +)\) semiring, defined analogously on \(\mathbb {R}_{\min }\); the two semirings are isomorphic via the trivial mapping \(\phi (x) = -x\). Both fall under the category of idempotent semirings [10] and are considered examples of so-called tropical semirings.

Finally, we will use the symbol \( \boxplus \) to refer to matrix and vector “multiplication” in \((\max , +)\) algebra and \(\boxplus '\) for its dual in \((\min , +)\) algebra, following the convention established in [15]. Formally, we can define matrix multiplication as:

$$\begin{aligned} ({\varvec{A}} \boxplus {\varvec{B}})_{ij} = \bigvee _{q=1}^k A_{iq} + B_{qj} \qquad ({\varvec{A}} \boxplus ' {\varvec{B}})_{ij} = \bigwedge _{q=1}^k A_{iq} + B_{qj} \end{aligned}$$
(1)

for matrices of compatible dimensions.
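For concreteness, the two products in (1) can be implemented in a few lines of NumPy. The sketch below is purely illustrative; the helper names `maxplus` and `minplus` are our own.

```python
import numpy as np

def maxplus(A, B):
    """(max, +) product of Eq. (1): (A ⊞ B)_ij = max_q (A_iq + B_qj)."""
    # Broadcasting builds an (m, k, n) tensor of sums A_iq + B_qj, reduced over q.
    return np.max(A[:, :, None] + B[None, :, :], axis=1)

def minplus(A, B):
    """(min, +) product of Eq. (1): (A ⊞' B)_ij = min_q (A_iq + B_qj)."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

A = np.array([[0., 3.], [-1., 2.]])
B = np.array([[1., 0.], [4., -2.]])
print(maxplus(A, B))  # [[7. 1.] [6. 0.]]
print(minplus(A, B))  # [[1. 0.] [0. -1.]]
```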

2 Related Work

In [20], the authors argued for the biological plausibility of nonlinear responses such as those introduced in Sect. 3. They proposed neurons computing max-sums and min-sums in an effort to mimic the response of a dendrite in a biological system, and showed that networks built from such neurons can approximate any compact region in Euclidean space to any desired degree of accuracy. They also presented a constructive algorithm for binary classification. Sussner and Esmi [24] introduced an algorithm based on competitive learning, combining morphological neurons to enclose training patterns in bounding boxes; it achieves low response times and is independent of the order in which training patterns are presented to the training procedure.

Yang and Maragos [29] introduced the class of min-max classifiers, boolean-valued functions appearing as thresholded minima of maximum terms or maxima of minimum terms:

$$\begin{aligned} f_{\text {min-max}}(x_1, x_2, \dots x_d)&= \bigwedge _j \bigvee _{i \in I_j} l_i, \quad l_i \in \{x_i, 1 - x_i\} \end{aligned}$$
(2)

and vice-versa for \(f_{\text {max-min}}\). In the above, \(I_j\) is the set of indices corresponding to term j. These classifiers produce decision regions similar to those formed by a \((\max , +)\) or \((\min , +)\) perceptron.

Barrera et al. [3] tackled the problem of statistically optimal design of set operators (morphological operators acting on binary images). They introduced an interval-splitting procedure for learning boolean concepts and applied it to binary image analysis tasks such as edge detection and texture recognition.

With the exception of [29], the works above introduce constructive training algorithms which may produce complex decision regions, as they fit models precisely to the training set. They may create superfluous decision areas to include outliers that would be disregarded by gradient-based training methods, a fact that motivates the work in Sect. 3.2.

In a recent technical report, Gärtner and Jaggi [8] proposed the concept of a tropical support vector machine. Its response and j-th decision region are given by:

$$\begin{aligned} y({\varvec{x}}) = \bigwedge _{i=1}^n w_i + x_i\ , \quad \mathcal {R}^j = \left\{ {\varvec{x}} : w_j + x_j \le w_i + x_i, \ \forall i \right\} \end{aligned}$$
(3)

instead of a “classical” decision region (e.g. defined by some discriminant function).

Cuninghame-Green’s work on minimax algebra [5] provides much of the matrix-vector framework for the finite-dimensional morphological paradigm. A fundamental result behind Sussner and Valle’s article [25] on morphological analogues of classical associative memories such as the Hopfield network, states that the “closest” under-approximation of a target vector \({\varvec{b}}\) by a max-product in the form \({\varvec{A}} \boxplus {\varvec{x}}\) can be found by the so-called principal solution of a max-linear equation.

Theorem 1

[5] If \({\varvec{A}} \in \mathbb {R}_{\max }^{m \times n}\) and \({\varvec{b}} \in \mathbb {R}_{\max }^{m}\), then

$$\begin{aligned} \overline{{\varvec{x}}} = {\varvec{A}}^{\sharp } \boxplus ' {\varvec{b}} \qquad ({\varvec{A}}^{\sharp } \triangleq -{\varvec{A}}^T) \end{aligned}$$
(4)

is the greatest solution to \({\varvec{A}} \boxplus {\varvec{x}} \le {\varvec{b}}\), and furthermore \({\varvec{A}} \boxplus {\varvec{x}} = {\varvec{b}}\) has a solution if and only if \(\overline{{\varvec{x}}}\) is a solution.

3 The Morphological Perceptron

Classical literature defines the perceptron as a computational unit whose output is the result of applying an activation function, usually nonlinear, to its linear activation \(\phi ({\varvec{x}})\). Popular examples of such functions are the logistic sigmoid and the rectified linear unit, which has grown in popularity among deep learning practitioners [17]. The morphological neuron of [20] instead computes its response to an input \({\varvec{x}}\) as

$$\begin{aligned} \tau ({\varvec{x}}) = p \cdot \bigvee _{i=1}^n r_i (x_i + w_i)\ , \qquad \tau '({\varvec{x}}) = p \cdot \bigwedge _{i=1}^n r_i (x_i + m_i) \end{aligned}$$
(5)

for the cases of the \((\max , +) \text { and } (\min , +)\) semirings respectively. Parameters \(r_i\) and p take values in \(\{+1, -1\}\) depending on whether the synapses and the output are excitatory or inhibitory. We adopt a much simpler version:

Definition 1

(Morphological Perceptron). Given an input vector \({\varvec{x}} \in \mathbb {R}_{\max }^{n}\), the morphological perceptron associated with weight vector \({\varvec{w}} = (w_1, \dots , w_n)^T \in \mathbb {R}_{\max }^{n}\) and activation bias \(w_0 \in \mathbb {R}_{\max }\) computes the activation

$$\begin{aligned} \tau ({\varvec{x}}) = w_0 \vee (w_1 + x_1) \vee \dots \vee (w_n + x_n) = w_0 \vee \left( \bigvee _{i=1}^n w_i + x_i \right) \end{aligned}$$
(6)

We may define a “dual” model on the \((\min , +)\) semiring, as the perceptron with parameters \({\varvec{m}} = (m_1, \dots , m_n)^T \in \mathbb {R}_{\min }^{n}\) and \(m_0 \in \mathbb {R}_{\min }\) that computes the activation

$$\begin{aligned} \tau '({\varvec{x}}) = m_0 \wedge (m_1 + x_1) \wedge \dots \wedge (m_n + x_n) = m_0 \wedge \left( \bigwedge _{i=1}^n m_i + x_i \right) \end{aligned}$$
(7)

The models defined by (6, 7) may also be referred to as the \((\max , +)\) and \((\min , +)\) perceptron, respectively. They can be treated as instances of morphological filters [14, 22], as they define a (grayscale) dilation and erosion over a finite window, computed at a certain point in space or time. Note that \(\tau ({\varvec{x}})\) is a nonlinear, convex (being a pointwise maximum of affine functions) function of \({\varvec{x}}, {\varvec{w}}\) that is continuous everywhere but not differentiable everywhere (points where multiple terms attain the maximum in \(\tau ({\varvec{x}})\) are singular).
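A minimal numerical sketch of the activations (6) and (7) follows; the weight vectors and input are arbitrary example values of our own choosing.

```python
import numpy as np

def tau(w0, w, x):
    """(max, +) perceptron activation, Eq. (6): w0 ∨ max_i (w_i + x_i)."""
    return max(w0, np.max(w + x))

def tau_dual(m0, m, x):
    """(min, +) perceptron activation, Eq. (7): m0 ∧ min_i (m_i + x_i)."""
    return min(m0, np.min(m + x))

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.0, 3.0, -1.0])
print(tau(-np.inf, w, x))       # 1.0   (w0 = -inf: no bias, as in Sect. 3.1)
print(tau_dual(np.inf, w, x))   # -0.5  (reusing w as m for brevity)
```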

3.1 Geometry of a \((\max , +)\) Perceptron for Binary Classification

Let us now put the morphological perceptron into the context of binary classification. We first investigate the perceptron’s geometrical properties, drawing on some background from tropical geometry.

Let \({\varvec{X}} \in \mathbb {R}^{K \times n}\) be a matrix containing the K patterns to be classified as its rows, let \({\varvec{x}}^{(k)}\) denote the k-th pattern (row) and let \(\mathcal {C}_1, \mathcal {C}_0\) be the two classes of the relevant decision problem. Without loss of generality, we may choose \(y_k = 1 \text { if } {\varvec{x}}^{(k)} \in \mathcal {C}_1\) and \(y_k = -1 \text { if } {\varvec{x}}^{(k)} \in \mathcal {C}_0\). Using the notation in (1), the \((\max , +)\) perceptron with parameter vector \({\varvec{w}}\) computes the output

$$\begin{aligned} \tau ({\varvec{x}}) = {\varvec{w}}^T \boxplus {\varvec{x}} \end{aligned}$$
(8)

Note that the variant we study here has no activation bias (\(w_0 = -\infty \)). If we assign class labels to patterns based on the sign function, we have \(\tau ({\varvec{x}}) > 0 \Rightarrow {\varvec{x}} \in \mathcal {C}_1\), \(\tau ({\varvec{x}}) < 0 \Rightarrow {\varvec{x}} \in \mathcal {C}_0\). Therefore, the decision regions formed by that perceptron have the form

$$\begin{aligned} \mathcal {R}^1 = \left\{ {\varvec{x}} \in \mathbb {R}^n : {\varvec{w}}^T \boxplus {\varvec{x}} \ge 0 \right\} , \qquad \mathcal {R}^0 = \left\{ {\varvec{x}} \in \mathbb {R}^n : {\varvec{w}}^T \boxplus {\varvec{x}} \le 0 \right\} \end{aligned}$$
(9)

As it turns out, these inequalities are collections of so-called affine tropical halfspaces and define tropical polyhedra [9, 13], which we now introduce.

Definition 2

(Affine tropical halfspace). Let \({\varvec{a}}, {\varvec{b}} \in \mathbb {R}_{\max }^{n+1}\). An affine tropical halfspace is a subset of \(\mathbb {R}_{\max }^{n}\) defined by

$$\begin{aligned} T({\varvec{a}}, {\varvec{b}}) = \left\{ {\varvec{x}} \in \mathbb {R}_{\max }^{n} : \bigvee _{i=1}^{n} (a_i + x_i) \vee a_{n+1} \ge \bigvee _{i=1}^{n} (b_i + x_i) \vee b_{n+1} \right\} \end{aligned}$$
(10)

We can further assume that \(\min (a_i, b_i) = -\infty \ \ \forall i \in \{ 1, 2, \dots , n+1\}\), as per [9, Lemma 1].

A tropical polyhedron is the intersection of finitely many tropical halfspaces (and comes in signed and unsigned variants, as in [1]). In our context, we will deal with tropical polyhedra like the following: assume \({\varvec{A}} \in \mathbb {R}_{\max }^{p \times n}\), \({\varvec{B}} \in \mathbb {R}_{\max }^{q \times n}\), \({\varvec{c}} \in \mathbb {R}_{\max }^{p}\) and \({\varvec{d}} \in \mathbb {R}_{\max }^{q}\). The inequalities

$$\begin{aligned} {\varvec{A}} \boxplus {\varvec{x}} \ge {\varvec{c}} \ , \quad {\varvec{B}} \boxplus {\varvec{x}} \le {\varvec{d}} \end{aligned}$$
(11)

define a subset of \(\mathbb {R}_{\max }^{n}\) that is a tropical polyhedron; it can be empty if some of the inequalities cannot be satisfied, which leads us to our first remark.

Proposition 1

(Feasible Regions are Tropical Polyhedra). Let \({\varvec{X}} \in \mathbb {R}^{K \times n}\) be a matrix containing input patterns of dimension n as its rows, partitioned into two distinct matrices \({\varvec{X}}_{\mathrm {pos}}\) and \({\varvec{X}}_{\mathrm {neg}}\), which contain all patterns of classes \(\mathcal {C}_1, \mathcal {C}_0\) respectively. Let \(\mathcal {T}\) be the tropical polyhedron defined by

$$\begin{aligned} \mathcal {T} = \left\{ {\varvec{w}} \in \mathbb {R}_{\max }^{n} : {\varvec{X}}_{\mathrm {pos}} \boxplus {\varvec{w}} \ge {\varvec{0}}, \ {\varvec{X}}_{\mathrm {neg}} \boxplus {\varvec{w}} \le {\varvec{0}} \right\} \end{aligned}$$
(12)

Patterns \({\varvec{X}}_{\mathrm {pos}}, {\varvec{X}}_{\mathrm {neg}}\) can be completely separated by a \((\max , +)\) perceptron if and only if \(\mathcal {T}\) is nonempty.

Remark 1

In [9], it has been shown that the question of a tropical polyhedron being nonempty is polynomially equivalent to an associated mean payoff game having a winning initial state.

Using the notion of the Cuninghame-Green inverse from Theorem 1, we can restate the separability condition in Proposition 1. As we know that \(\overline{{\varvec{w}}} = {\varvec{X}}_{\mathrm {neg}}^{\sharp } \boxplus ' {\varvec{0}}\) is the greatest solution to \({\varvec{X}}_{\mathrm {neg}} \boxplus {\varvec{w}} \le {\varvec{0}}\), that condition is equivalent to

$$\begin{aligned} {\varvec{X}}_{\mathrm {pos}} \boxplus ({\varvec{X}}_{\mathrm {neg}}^{\sharp } \boxplus ' {\varvec{0}}) \ge {\varvec{0}} \end{aligned}$$
(13)
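The condition (13) is easy to check numerically. The sketch below uses ad-hoc NumPy helpers for \(\boxplus \) and \(\boxplus '\); all names and example patterns are our own and purely illustrative.

```python
import numpy as np

def maxplus(A, B):
    # (A ⊞ B)_ij = max_q (A_iq + B_qj)
    return np.max(A[:, :, None] + B[None, :, :], axis=1)

def minplus(A, B):
    # (A ⊞' B)_ij = min_q (A_iq + B_qj)
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def maxplus_separable(X_pos, X_neg):
    """Check Eq. (13): X_pos ⊞ (X_neg# ⊞' 0) >= 0 componentwise."""
    zero = np.zeros((X_neg.shape[0], 1))
    w_bar = minplus(-X_neg.T, zero)   # greatest w with X_neg ⊞ w <= 0 (Theorem 1)
    return bool(np.all(maxplus(X_pos, w_bar) >= 0)), w_bar.ravel()

X_pos = np.array([[2., 0.], [1., 3.]])
X_neg = np.array([[-1., -2.], [-3., 0.5]])
print(maxplus_separable(X_pos, X_neg))  # (True, array([ 1. , -0.5]))
```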

3.2 A Training Algorithm Based on the Convex-Concave Procedure

In this section, we present a training algorithm that uses the Convex-Concave Procedure [30], in a manner similar to how traditional Support Vector Machines use convex optimization to determine the optimal weight assignment for a binary classification problem. We state the training task as an optimization problem with a convex cost function and constraints consisting of inequalities involving difference-of-convex (DC) functions; such problems can be solved (at least approximately) by the Convex-Concave Procedure:

$$\begin{aligned} \text { Minimize }&J({\varvec{X}}, {\varvec{w}}) = \sum _{k=1}^K \max (\xi _k, 0)&\nonumber \\ \text { s. t. }&{\left\{ \begin{array}{ll} \ \displaystyle \bigvee \limits _{i=1}^n w_i + x^{(k)}_i \le \xi _k &{} \text { if } {\varvec{x}}^{(k)} \in \mathcal {C}_0 \\ \ \displaystyle \bigvee \limits _{i=1}^n w_i + x^{(k)}_i \ge -\xi _k &{} \text { if } {\varvec{x}}^{(k)} \in \mathcal {C}_1 \end{array}\right. } \end{aligned}$$
(14)

The slack variables \(\xi _k\) in the constraints are used to ensure that only misclassified patterns will contribute to J. In our implementation, we use [23, Algorithm 1.1], utilizing the authors’ DCCP library that extends CvxPy [7], a modelling language for convex optimization in Python. An application on separable patterns generated from a Gaussian distribution can be seen in Fig. 1.

So far, we have not addressed the case where patterns are not separable or contain “abnormal” entries and outliers. Although many ways have been proposed to deal with the presence of outliers [28], the method we use is to “penalize” patterns that are more likely to be outliers. We introduce a simple weighting scheme that assigns to each pattern a factor inversely proportional to its distance (measured by some \(\ell _p\)-norm) from its class’s centroid:

$$\begin{aligned} \varvec{\mu }_i&:= \frac{1}{|\mathcal {C}_i|} \sum _{{\varvec{x}}^{(k)} \in \mathcal {C}_i} {\varvec{x}}^{(k)},\ \lambda _k := \frac{1}{|| {\varvec{x}}^{(k)} - \varvec{\mu }_i ||_p} \end{aligned}$$
(15)
$$\begin{aligned} \nu _k&:= \frac{\lambda _k}{\max _k \lambda _k} \end{aligned}$$
(16)

Equation (16) above serves as a normalization step that scales the coefficients to the (0, 1] range. We arrive at a reformulated optimization problem, which can be stated as

$$\begin{aligned} \text { Minimize }&J({\varvec{X}}, {\varvec{w}}, \varvec{\nu }) = \sum _{k=1}^K \nu _k \cdot \max (\xi _k, 0)&\nonumber \\ \text { s. t. }&{\left\{ \begin{array}{ll} \ \displaystyle \bigvee \limits _{i=1}^n w_i + x^{(k)}_i \le \xi _k &{} \text { if } {\varvec{x}}^{(k)} \in \mathcal {C}_0 \\ \ \displaystyle \bigvee \limits _{i=1}^n w_i + x^{(k)}_i \ge -\xi _k &{} \text { if } {\varvec{x}}^{(k)} \in \mathcal {C}_1 \end{array}\right. } \end{aligned}$$
(17)

To illustrate the practical benefits of this method (which we will refer to as WDccp), we apply both versions of the optimization problem to a set of randomly generated data which is initially separable, but has a percentage r of its class labels flipped. Comparative results for a series of percentages r are found in Fig. 2. The results for \(r = 20\%\) can be seen in Fig. 3, with the dashed line representing the weights found by WDccp. This weighting method can be extended to complex or heterogeneous data; for example, one could fit a mixture of Gaussians to the patterns or perform clustering to obtain the coefficients \(\varvec{\nu }\).
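For reference, the following is a minimal CvxPy sketch of the weighted problem (17), assuming the DCCP extension of [23] is installed. Variable names, the small constant guarding against division by zero in (15), and solver settings are illustrative rather than the exact configuration used in our experiments.

```python
import numpy as np
import cvxpy as cp
import dccp  # difference-of-convex extension of CvxPy [23]

def train_wdccp(X, y, p=2):
    """Fit a (max, +) perceptron via the weighted formulation (17).

    X: (K, n) matrix of patterns, y: labels in {+1, -1}.
    """
    K, n = X.shape
    # Outlier weights, Eqs. (15)-(16): inverse distance to the class centroid.
    nu = np.empty(K)
    for c in (+1, -1):
        idx = np.where(y == c)[0]
        mu = X[idx].mean(axis=0)
        nu[idx] = 1.0 / (np.linalg.norm(X[idx] - mu, ord=p, axis=1) + 1e-9)
    nu /= nu.max()

    w = cp.Variable(n)
    xi = cp.Variable(K)
    constraints = []
    for k in range(K):
        act = cp.max(w + X[k])                 # ⋁_i (w_i + x_i^(k)), convex in w
        if y[k] == -1:
            constraints.append(act <= xi[k])   # convex <= affine (ordinary DCP)
        else:
            constraints.append(act >= -xi[k])  # convex >= affine (handled by DCCP)
    objective = cp.Minimize(cp.sum(cp.multiply(nu, cp.pos(xi))))
    prob = cp.Problem(objective, constraints)
    prob.solve(method='dccp')                  # convex-concave procedure
    return w.value
```

Setting all \(\nu _k = 1\) recovers the unweighted problem (14).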

It is possible to generalize the morphological perceptron to combinations of dilations (\(\max \)-terms) and erosions (\(\min \)-terms). In [2], the authors introduce the Dilation-Erosion Linear Perceptron, which contains a convex combination of a dilation and an erosion, as:

$$\begin{aligned} M({\varvec{x}}) = \lambda \tau ({\varvec{x}}) + (1 - \lambda ) \tau '({\varvec{x}}), \quad \lambda \in [0, 1] \end{aligned}$$
(18)

plus a linear term, employing gradient descent for training. The formulation in (17) can be used here too, as constraints in difference-of-convex programming can be (assuming \(f_i\) convex):

$$\begin{aligned} f_i({\varvec{x}}) - g_i({\varvec{x}}) \le 0, \ g_i \text { convex, or } f_i({\varvec{x}}) + g'_i({\varvec{x}}) \le 0, \ g'_i \text { concave } \end{aligned}$$
(19)

This observation is exploited in the first experiment of Sect. 5.
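As an illustration of this, the dilation-erosion response (18) is a convex-plus-concave expression and therefore fits the constraint forms (19) directly. The fragment below only shows how such constraints are constructed in the same DCCP setup; all names and values are placeholders.

```python
import numpy as np
import cvxpy as cp

n, lam = 2, 0.5
w, m = cp.Variable(n), cp.Variable(n)
xi = cp.Variable()
x_k = np.array([1.0, -0.5])      # one training pattern

# M(x) = lam * dilation + (1 - lam) * erosion: convex plus concave, cf. (19).
response = lam * cp.max(w + x_k) + (1 - lam) * cp.min(m + x_k)
neg_constraint = response <= xi    # pattern in C0
pos_constraint = response >= -xi   # pattern in C1
```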

Fig. 1. Decision surface

Fig. 2. Method accuracy

Fig. 3. Optimal weights found

Fig. 4. \(\text {Newt}(p)\) of Eq. (23)

4 Geometric Interpretation of Maxout Units

Maxout units were introduced by Goodfellow et al. [11]. A maxout unit is associated with a weight matrix \({\varvec{W}} \in \mathbb {R}^{k \times n}\) as well as an activation bias vector \({\varvec{b}} \in \mathbb {R}^{k}\). Given an input pattern \({\varvec{x}} \in \mathbb {R}^{n}\) and denoting by \({\varvec{W}}_{j,:}\) the j-th row vector of \({\varvec{W}}\), a maxout unit computes the following activation:

$$\begin{aligned} h({\varvec{x}}) = \bigvee _{j=1}^k {\varvec{W}}_{j,:} {\varvec{x}} + b_j = \bigvee _{j=1}^k \left[ \left( \sum _{i=1}^n W_{ji} x_i \right) + b_j \right] \end{aligned}$$
(20)

Essentially, a maxout unit generalizes the morphological perceptron using k terms (referred to as the unit’s rank) that involve affine expressions. In tropical algebra, such expressions are called tropical polynomials [13] or maxpolynomials [4] when specifically referring to the \((\max , +)\) semiring. In [16], maxout units are investigated geometrically in an effort to obtain bounds for the number of linear regions of a deep neural network with maxout layers:

Proposition 2

([16], Proposition 7). The maximal number of linear regions of a single layer maxout network with n inputs and m outputs of rank k is lower bounded by \(k^{\min (n, m)}\) and upper bounded by \(\min \left\{ \sum _{j=0}^n \left( {\begin{array}{c}k^2 m\\ j\end{array}}\right) , k^m \right\} \).

This result readily applies to layers consisting of \((\max , +)\) perceptrons, as a \((\max , +)\) perceptron has rank \(k = n\).
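For illustration, the bounds of Proposition 2 can be evaluated with a small helper of our own:

```python
from math import comb

def maxout_region_bounds(n, m, k):
    """Bounds of Proposition 2 for a single-layer maxout network:
    n inputs, m outputs, rank k."""
    lower = k ** min(n, m)
    upper = min(sum(comb(k * k * m, j) for j in range(n + 1)), k ** m)
    return lower, upper

print(maxout_region_bounds(n=2, m=3, k=2))   # (4, 8)
```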

For a maxout unit of rank k, the authors argued that the number of its linear regions is exactly k if every term is maximal at some point. We provide an exact result using tools from tropical geometry, namely the Newton Polytope of a maxpolynomial. For definitions and fundamental results on polytopes the reader is referred to [31]; we begin our investigation by omitting the bias term \(b_j\) appearing in (20).

Definition 3

(Newton Polytope). Let \(p: \mathbb {R}^{n} \rightarrow \mathbb {R}\) be a maxpolynomial with k terms, given by

$$\begin{aligned} p({\varvec{x}}) = \max _{i \in 1, 2, \dots , k} \left\{ c_{i1} x_1 + c_{i2} x_2 + \dots + c_{in}x_n \right\} = \bigvee _{i=1}^k {\varvec{c}}_i^T {\varvec{x}} \end{aligned}$$
(21)

The Newton Polytope of p is the convex hull of the coefficient vectors \({\varvec{c}}_i\):

$$\begin{aligned} \mathrm {Newt}(p) = \mathrm {conv}\{ {\varvec{c}}_i: i \in 1, \dots , k \} = \mathrm {conv}\{(c_{i1}, c_{i2}, \dots , c_{in}): i \in 1, \dots , k\} \end{aligned}$$
(22)

For an illustrative example, see Fig. 4. The maxpolynomial in question is

$$\begin{aligned} p({\varvec{x}}) = 0 \vee (x + y) \vee 3x \vee (2x + 2y) \vee 3y \end{aligned}$$
(23)

and its terms can be matched to the coefficient vectors (0, 0), (1, 1), (3, 0), (2, 2) and (0, 3) respectively. The Newton Polytope’s vertices give us information about the number of linear regions of the associated maxpolynomial:

Proposition 3

Let \(p({\varvec{x}})\) be a maxout unit with activation given by (21). The number of p’s linear regions is equal to the number of vertices of its Newton Polytope, \(\mathrm {Newt}(p)\).

Proof

A proof can be given using the fundamental theorem of Linear Programming [26, Theorem 3.4]. Consider the linear program:

$$\begin{aligned} \text { Maximize }&{\varvec{x}}^T {\varvec{c}} \nonumber \\ \text { s.t. }&{\varvec{c}} \in \text{ Newt }(p) \end{aligned}$$
(24)

Note that, for our purposes, \({\varvec{c}}\) is the variable to be optimized. Letting \({\varvec{c}}\) run over assignments of coefficient vectors, we know that for every \({\varvec{x}}\), Problem (24) is a linear program for which the maximum is attained at one of the vertices of \(\text {Newt}(p)\). Therefore, points \({\varvec{c}}_i \in \text {int}(\text {Newt}(p))\) map to coefficient vectors of non-maximal terms of p.    \(\square \)
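As a quick numerical check of Proposition 3 on the maxpolynomial of (23), one can compute the vertices of \(\text {Newt}(p)\) with SciPy's convex hull routine; this is an illustrative sketch, not part of our experimental code.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Coefficient vectors of the five terms of p in Eq. (23).
coeffs = np.array([[0, 0], [1, 1], [3, 0], [2, 2], [0, 3]], dtype=float)
hull = ConvexHull(coeffs)

print(len(hull.vertices))      # 4: the number of linear regions (Proposition 3)
print(coeffs[hull.vertices])   # (1, 1) is missing: the term x + y is never maximal
```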

By Proposition 3, we conclude that the term \(x + y\) can be omitted from \(p({\varvec{x}})\) in (23) without altering it as a function of \({\varvec{x}}\). Proposition 3 can be extended to maxpolynomials with constant terms, such as maxout units with bias terms \(b_j\). Let the extended Newton Polytope be

$$\begin{aligned} p({\varvec{x}}) = \bigvee _{j = 1}^k b_j + {\varvec{c}}_j^T {\varvec{x}} \Rightarrow \text {Newt}(p) = \text {conv}\left\{ (b_j, {\varvec{c}}_j): j \in 1, \dots , k \right\} \end{aligned}$$
(25)

Let \( {\varvec{c}}' = (b, {\varvec{c}})\) and \({\varvec{x}}' = (1, {\varvec{x}})\). Note that the relevant linear program is now

$$\begin{aligned} \text { Maximize }&({\varvec{x}}')^T {\varvec{c}}' \nonumber \\ \text { s.t. }&{\varvec{c}}' \in \text{ Newt }(p) \end{aligned}$$
(26)

The optimal solutions of this program lie in the upper hull of \(\text {Newt}(p)\), \(\text {Newt}^{\max }(p)\), with respect to b. For a convex polytope P, its upper hull is

$$\begin{aligned} P^{\max } := \left\{ (\lambda , {\varvec{x}}) \in P: (t, {\varvec{x}}) \in P \Rightarrow t \le \lambda \right\} \end{aligned}$$
(27)

Therefore, the number of linear regions of a maxout unit given by (20) is equal to the number of vertices on the upper hull of its Newton Polytope. These results are easily extended to the following models:

Proposition 4

Let \(h_1, \dots h_m\) be a collection of maxpolynomials. Let

$$\begin{aligned} g_{\vee }({\varvec{x}}) = \bigvee _{i=1}^m h_i({\varvec{x}}), \qquad g_{+}({\varvec{x}}) = \sum _{i=1}^m h_i({\varvec{x}}) \end{aligned}$$
(28)

The Newton Polytopes of the functions defined above are

$$\begin{aligned} \mathrm {Newt}(g_{\vee })&= \mathrm {conv}(\mathrm {Newt}(h_1), \dots \mathrm {Newt}(h_m)) \end{aligned}$$
(29)
$$\begin{aligned} \mathrm {Newt}(g_{+})&= \mathrm {Newt}(h_1) \oplus \mathrm {Newt}(h_2) \dots \oplus \mathrm {Newt}(h_m) \end{aligned}$$
(30)

where \(\oplus \) denotes the Minkowski sum of the Newton Polytopes.
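Both constructions of Proposition 4 operate directly on the terms' coefficient vectors, as the following sketch illustrates; the helper names and example maxpolynomials are our own.

```python
import numpy as np
from scipy.spatial import ConvexHull

def newt_vertices(coeffs):
    """Vertices of Newt(p), where the rows of `coeffs` are the coefficient vectors."""
    return coeffs[ConvexHull(coeffs).vertices]

def newt_max(coeffs1, coeffs2):
    """Newt(h1 ∨ h2) = conv(Newt(h1) ∪ Newt(h2)), Eq. (29)."""
    return newt_vertices(np.vstack([coeffs1, coeffs2]))

def newt_sum(coeffs1, coeffs2):
    """Newt(h1 + h2) = Newt(h1) ⊕ Newt(h2), the Minkowski sum of Eq. (30)."""
    pairwise = coeffs1[:, None, :] + coeffs2[None, :, :]
    return newt_vertices(pairwise.reshape(-1, coeffs1.shape[1]))

h1 = np.array([[0., 0.], [2., 0.], [0., 2.]])   # terms of max(0, 2x, 2y)
h2 = np.array([[1., 0.], [0., 2.]])             # terms of max(x, 2y)
print(newt_max(h1, h2))   # the triangle (0,0), (2,0), (0,2)
print(newt_sum(h1, h2))   # a pentagon with 5 vertices
```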

5 Experiments

In this section, we present results from a few numerical experiments conducted to examine the efficiency of our proposed algorithm and the behavior of morphological units as parts of a multilayer neural network.

5.1 Evaluation of the WDCCP Method

Our first experiment uses a dilation-erosion or max-min morphological perceptron, whose response is given by

$$\begin{aligned} y({\varvec{x}}) = \lambda \left( \bigvee _{i=1}^n w_i + x_i \right) + (1 - \lambda ) \left( \bigwedge _{i=1}^n m_i + x_i \right) \end{aligned}$$
(31)

We set \(\lambda = 0.5\) and trained it using both stochastic gradient descent with MSE cost and learning rate \(\eta \) (Sgd) and the WDccp method, on Ripley's Synthetic Dataset [18] and the Wisconsin Breast Cancer Dataset [27]. Both are 2-class, non-separable datasets. For simplicity, we fixed the number of epochs for the gradient method at 100 and set \(\tau _{\max } = 0.01\) and stopping criterion \(\epsilon \le 10^{-3}\) for the WDccp method. We repeated each experiment 50 times to obtain the mean and standard deviation of the classification accuracy, shown in Table 1. In all cases, the WDccp method required fewer than 10 iterations to converge and exhibited far better results than gradient descent. The negligible standard deviation of its accuracy hints at its robustness in comparison to other methods.

Fig. 5. Dilation layer

Fig. 6. Active filters

5.2 Layers Using Morphological Perceptrons

We experimented on the MNIST dataset of handwritten digits [12] to investigate how morphological units behave when incorporated in layers of neural networks. After some unsuccessful attempts using a single-layer network, we settled on the following architecture: a layer of \(n_1\) linear units followed by a \((\max , +)\) output layer of 10 units with softmax activations. The case \(n_1 = 64\) is illuminating: plotting the morphological filters as grayscale images yields Fig. 5. Plotting the linear units resulted in noisy images, except for those shown in Fig. 6, which correspond to maximal weights in the dilation layer. The dilation layer takes into account just one or two linear activation units per digit (pictured as bright dots), so we re-evaluated the accuracy after “deactivating” the rest of them and obtained the same accuracy, as shown in Table 2.
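For reference, a dilation layer of this kind can be written in a few lines of PyTorch. The sketch below is our own minimal rendition of the architecture described above, not the exact code used in the experiments; layer sizes and initialization are illustrative.

```python
import torch
import torch.nn as nn

class MaxPlusLayer(nn.Module):
    """Dilation layer: output_j = max_i (x_i + w_ji), cf. Eq. (6) without bias."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(out_features, in_features))

    def forward(self, x):
        # (batch, 1, in) + (out, in) broadcasts to (batch, out, in); reduce over inputs.
        return torch.max(x.unsqueeze(1) + self.weight, dim=-1).values

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 64),   # n1 = 64 linear units
    MaxPlusLayer(64, 10),     # (max, +) output layer of 10 units
)
```

The softmax on the output layer is typically folded into the cross-entropy loss during training.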

Table 1. Ripley's/WDBC test set results
Table 2. MNIST results

6 Conclusions and Future Work

In this paper, we examined some properties and the behavior of morphological classifiers and introduced a training algorithm based on a well-studied optimization problem. We aim to further investigate the potential of both our model and others, such as that proposed in [8]. A natural next step would be to examine their performance as parts of deeper architectures, possibly taking advantage of their tendency towards sparse activations to simplify the resulting networks.

The subtle connections with tropical geometry that we were able to identify make us believe that it could also aid others in the effort to study fundamental properties of deep, nonlinear architectures. We hope that the results of this paper will further motivate researchers active in those areas towards that end.