
1 Introduction

Category theory has been applied in various domains of mathematical science, allowing us to discover universal principles unifying diverse mathematical phenomena and thereby enabling knowledge transfer between them [7]. Applications to machine learning have been pursued recently [21]; however, there remains a large gap between the foundational mathematics and applicability in concrete machine learning tasks. This work begins to fill that gap. We introduce DisCoPyro, a categorical structure learning framework consisting of a probabilistic generative model with amortized variational inference. We both provide its mathematical foundations and compare it with other neurosymbolic models on an example application.

Fig. 1. Example experiment. In each epoch of training, DisCoPyro learns variational autoencoder structures by sampling them from its skeleton according to a wiring diagram, then learning their faithful inverses as approximate posteriors.

Here we describe why we believe that DisCoPyro could contribute, in the long run, to developing human-level artificial general intelligence. Human intelligence supports graded statistical reasoning [15], and evolved to represent spatial (geometric) domains before being applied to symbolic (algebraic) domains. Symmetric monoidal categories provide a mathematical framework for constructing both symbolic computations (as in this paper) and geometric spaces (e.g. [17]). We follow Lake's [15] suggestion to represent graded statistical reasoning via probability theory, integrating neural networks into variational inference for tractability. In terms of applications, we achieve competitive performance (see Subsect. 3.2 below) by variational Bayes, without resorting to reinforcement learning of structure as with modular neural networks [13, 20].

The rest of the paper is organized as follows. In Sect. 2 we first introduce the mathematical foundations of DisCoPyro, including how to train it by variational Bayesian inference (Subsect. 2.1). In Sect. 3 we then describe an example application (Subsect. 3.1) and provide experimental results and performance comparisons (Subsect. 3.2). Figure 1 demonstrates the flow of execution during the training procedure for the example task. We conclude and discuss further applications in Sect. 4. We provide an example implementation at https://github.com/neu-pml/discopyro with experiments at https://github.com/esennesh/categorical_bpl. DisCoPyro builds upon Pyro [2] (a deep universal probabilistic programming language), DisCoCat [4] (a distributional compositional model for natural language processing), and the DisCoPy [6] library for computing with categories.

1.1 Notation

This paper takes symmetric monoidal categories (SMCs) \(\mathcal {C}\) and their corresponding operads \(\mathcal {O}\) as its mathematical setting. The reader is welcome to see Fong [7] for an introduction to these. SMCs are built from objects \(Ob(\mathcal {C})\) and sets of morphisms \(\mathcal {C}(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2})\) between objects \(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2} \in Ob(\mathcal {C})\). Operads are built from types \(Ty(\mathcal {O})\) and sets of morphisms \(\mathcal {O}(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2})\) between types \(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2} \in Ty(\mathcal {O})\). An SMC is usually written \((\mathcal {C}, \otimes , I)\) with a product operation \(\otimes \) over objects and morphisms and a unit I of \(\otimes \). In both settings, every object/type \(\boldsymbol{\tau }_{}\) has a unique identity morphism \(id_{\boldsymbol{\tau }_{}}\). Categories support composition \(g \circ f\) on morphisms, and operads support indexed composition \(g\circ _{i} f\) (for \(i \in \mathbb {N}\)) on morphisms.
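Since DisCoPyro builds on DisCoPy [6], the notation above has a direct computational counterpart. The following is a minimal sketch, assuming a DisCoPy release in which Ty and Box are exported from discopy.monoidal; the object and morphism names are purely illustrative.

```python
# Minimal sketch of SMC notation in DisCoPy (assumes Ty/Box live in discopy.monoidal).
from discopy.monoidal import Ty, Box

# Objects of the SMC, with Ty() as the monoidal unit I.
x, y, z = Ty('x'), Ty('y'), Ty('z')
unit = Ty()

# Morphisms (generators) between objects.
f = Box('f', x, y)   # f : x -> y
g = Box('g', y, z)   # g : y -> z

# Sequential composition and the monoidal product.
h = f >> g           # h = g . f : x -> z
k = f @ g            # k = f (x) g : x (x) y -> y (x) z
```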

2 Foundations of DisCoPyro

In essence, Definition 1 below exposes a finite number of building blocks (generators) from an SMC; the morphisms of interest are those constructed by composing these generators with \(\circ \) and \(\otimes \). For example, in categories of executable programs, a monoidal signature [6] specifies a domain-specific programming language.

Definition 1 (Monoidal signature in an SMC)

Given a symmetric monoidal category (SMC) \(\mathcal {C}\) with the objects denoted by \(Ob(\mathcal {C})\), a monoidal signature \(\mathcal {S}=(O, M)\) in that SMC consists of

  • A finite set \(O \subseteq Ob(\mathcal {C})\); and

  • A finite set M consisting of elements \(m: \mathcal {C}(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2})\) for some \(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2} \in O\), such that \(\forall \boldsymbol{\tau }_{} \in O, m \ne id_{\boldsymbol{\tau }_{}}\).
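As a concrete illustration of Definition 1, a monoidal signature can be recorded as plain data: a finite set of objects and a finite set of typed generators. The following sketch uses invented object and generator names for a tiny domain-specific language; it is not DisCoPyro's internal representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Generator:
    """A generator m : dom -> cod drawn from the SMC."""
    name: str
    dom: tuple  # domain objects, as a tuple of object names
    cod: tuple  # codomain objects

# A hypothetical signature S = (O, M).
O = {"img", "latent"}
M = {
    Generator("encode", ("img",), ("latent",)),
    Generator("decode", ("latent",), ("img",)),
    Generator("merge", ("latent", "latent"), ("latent",)),
}
```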

The following free operad over a monoidal signature represents the space of all possible programs synthesized from the above building blocks (generators specified by the monoidal signature). Employing an operad rather than just a category allows us to reason about composition as nesting rather than just transitive combination; employing an operad rather than just a grammar allows us to reason about both the inputs and outputs of operations rather than just their outputs.

Definition 2 (Free operad over a signature)

The free operad \(\mathcal {O}_{\mathcal {S}}\) over a signature \(\mathcal {S}=(O, M)\) consists of

  • A set of types (representations) \(Ty(\mathcal {O}_{\mathcal {S}})=\{I\} \cup O^{\otimes }\);

  • For every \(n \in \mathbb {N}^{+}\), a set of operations (mappings) \(\mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{0}, \ldots , \boldsymbol{\tau }_{n-1}; \boldsymbol{\tau }_{n})\) consisting of all trees with finitely many branches and leaves, in which each vertex v with k children is labeled by a generator \(m(v) \in M\) such that \(\textrm{dom}(m(v))\) has product length k;

  • An identity operation \(id_{\boldsymbol{\tau }_{}}: \mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{}; \boldsymbol{\tau }_{})\) for every \(\boldsymbol{\tau }_{} \in Ty(\mathcal {O}_{\mathcal {S}})\); and

  • A substitution operator \(\circ _{i}\) defined by nesting a syntax tree \(\mathcal {O}_{\mathcal {S}}(\boldsymbol{\sigma }_1, \ldots , \boldsymbol{\sigma }_m; \boldsymbol{\tau }_{i})\) inside another \(\mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{1}, \ldots , \boldsymbol{\tau }_{n-1}; \boldsymbol{\tau }_{n})\) when \(i \in [1...n-1]\) to produce a syntax tree \(\mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{1}, \ldots , \boldsymbol{\tau }_{i-1}, \boldsymbol{\sigma }_{1}, \ldots , \boldsymbol{\sigma }_{m}, \boldsymbol{\tau }_{i+1}, \ldots , \boldsymbol{\tau }_{n-1}; \boldsymbol{\tau }_{n})\).
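For instance, nesting a unary syntax tree into the first argument of a binary one is a routine instance of this substitution operator:

$$\begin{aligned} f \in \mathcal {O}_{\mathcal {S}}(\boldsymbol{\sigma }_{1}; \boldsymbol{\tau }_{1}), \quad g \in \mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2}; \boldsymbol{\tau }_{3}) \quad \Longrightarrow \quad g \circ _{1} f \in \mathcal {O}_{\mathcal {S}}(\boldsymbol{\sigma }_{1}, \boldsymbol{\tau }_{2}; \boldsymbol{\tau }_{3}). \end{aligned}$$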

Intuitively, free operads have a lot in common with context-free grammars, and in fact Hermida [10] proved that they share a representation as directed acyclic hypergraphs. The definition of a signature in an SMC already hints at the structure of the appropriate hypergraph, but Algorithm 1 will make it explicit and add edges to the hypergraph corresponding to nesting separate operations in parallel (or, equivalently, to monoidal products in the original SMC).

[Algorithm 1: constructs the hypergraph representation of the free operad over a signature, adding edges corresponding to monoidal products.]

In the hypergraph produced by Algorithm 1, each vertex corresponds to a non-product type and each hyperedge has a list of vertices as its domain and codomain. Each such hypergraph admits a representation as a graph as well, in which the hyperedges serve as nodes and the lists in their domains and codomains serve as edges. We will use this graph representation \(G\simeq H\) to reason about morphisms as paths between their domain and codomain.

We will derive a probabilistic generative model over morphisms in the free operad from this graph representation’s directed adjacency matrix \(A_G\).

Definition 3 (Transition distance in a directed graph)

The “transition distance” between two indexed vertices \(v_i, v_j\) is the negative logarithm of the ij entry in the exponentiated adjacency/transition matrix

$$\begin{aligned} d(v_i, v_j)&= -\log \left( \left[ e^{A_G} \right] _{i, j} \right) , \end{aligned}$$
(1)

where the matrix exponential is defined by the series

$$\begin{aligned} e^{A_G}&= \sum _{n=0}^{\infty } \frac{(A_G)^{n}}{n!}. \end{aligned}$$
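A minimal numerical sketch of Definition 3, for a hypothetical four-vertex graph (the adjacency matrix below is invented for illustration):

```python
import numpy as np
from scipy.linalg import expm  # the standard matrix exponential

# Hypothetical directed adjacency matrix A_G for a 4-vertex graph representation.
A_G = np.array([[0., 1., 0., 0.],
                [0., 0., 1., 1.],
                [0., 0., 0., 1.],
                [0., 0., 0., 0.]])

def transition_distance(A, i, j):
    """d(v_i, v_j) = -log([e^A]_{ij}), as in Eq. 1; infinite when the entry is zero."""
    weight = expm(A)[i, j]
    return np.inf if weight == 0.0 else -np.log(weight)

print(transition_distance(A_G, 0, 3))  # finite: vertex 3 is reachable from vertex 0
print(transition_distance(A_G, 3, 0))  # inf: no directed path back to vertex 0
```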

A soft-minimization distribution over this transition distance will, in expectation and holding the indexed target vertex constant, define a probabilistic generative model over paths through the hypergraph.

Definition 4 (Free operad prior)

Consider a signature \(\mathcal {S}=(O, M)\) and its resulting graph representation \(G=(V,E)\) and recursion sites R, and then condition upon a domain and codomain \(\boldsymbol{\tau }_{-}, \boldsymbol{\tau }_{+} \in Ty(\mathcal {O}_{\mathcal {S}})\) represented by vertices in the graph. The free operad prior assigns a probability density to all finite paths \(\textbf{e}=(e_1, e_2, \ldots , e_n)\) with \(\textrm{dom}(\textbf{e})=\boldsymbol{\tau }_{-}\) and \(\textrm{cod}(\textbf{e})=\boldsymbol{\tau }_{+}\) by means of an absorbing Markov chain. First the model samples a “precision” \(\beta \) and a set of “weights” \({\textbf{w}}\)

$$\begin{aligned} \beta&\sim \textrm{Gamma}(1, 1)&{\textbf{w}}&\sim \textrm{Dirichlet}\left( \vec {\textbf{1}}^{(|M| + |R|)}\right) . \end{aligned}$$

Then it samples a path (from the absorbing Markov chain in Algorithm 2) by soft minimization (biased towards shorter paths by \(\beta \)) of the transition distance

$$\begin{aligned} \pi (e \in E \mid \boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2}; \beta )&:= \frac{\exp {\left( -\frac{1}{\beta } d(\textrm{cod}(e), \boldsymbol{\tau }_{2}) \right) }}{ \sum _{e' \in E : \textrm{dom}(e')=\boldsymbol{\tau }_{1}} \exp {\left( -\frac{1}{\beta } d(\textrm{cod}(e'), \boldsymbol{\tau }_{2}) \right) } }. \end{aligned}$$
(2)

Equation 2 induces a transition operator T which, by Theorem 2.5.3 in Latouche and Ramaswami [16], almost surely reaches its absorbing state corresponding to \(\boldsymbol{\tau }_{2}\). This path can then be filled in according to Algorithm 3. The precision \(\beta \) increases at each recursion so that deeper recursions terminate with shorter paths. We denote the induced joint distribution as

$$\begin{aligned} p(f, {\textbf{w}}, \beta ; \boldsymbol{\tau }_{-}, \boldsymbol{\tau }_{+})&= p(f \mid \beta , {\textbf{w}}; \boldsymbol{\tau }_{-}, \boldsymbol{\tau }_{+}) p({\textbf{w}}) p(\beta ). \end{aligned}$$
(3)
[Algorithm 2: samples a path through the graph representation from the absorbing Markov chain induced by Eq. 2.]

[Algorithm 3: fills in the sampled path to yield an operation in the free operad.]
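As a rough illustration of the sampler that Eq. 2 and Algorithm 2 describe, the following is a minimal sketch; the edge encoding, the `dist` lookup table, and the helper name `sample_path` are assumptions made for illustration rather than DisCoPyro's actual API.

```python
import numpy as np

def sample_path(edges, dist, start, target, beta, rng):
    """Sample a path of edges from `start` to the absorbing `target` vertex.

    edges: list of (name, dom, cod) triples over vertex labels (hypothetical encoding).
    dist:  dict mapping (vertex, vertex) -> transition distance d from Definition 3.
    beta:  precision drawn from the free operad prior.
    Assumes `target` is reachable from `start`.
    """
    path, current = [], start
    while current != target:
        candidates = [e for e in edges if e[1] == current]
        # Soft minimization of the transition distance to the target (Eq. 2).
        scores = np.array([-dist[(e[2], target)] / beta for e in candidates])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        choice = candidates[rng.choice(len(candidates), p=probs)]
        path.append(choice)
        current = choice[2]
    return path

# usage: sample_path(edges, dist, "img", "latent", beta=1.0, rng=np.random.default_rng())
```

In the full prior, \(\beta \) and \({\textbf{w}}\) are drawn first as in Definition 4, and recursive calls re-invoke the sampler with an increased precision.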

Having a probabilistic generative model over operations in the free operad over a signature, we now need a way to specify a structure learning problem. Definition 5 provides this by specifying what paths to sample (each box specifies a call to Algorithm 2) and how to compose them.

Definition 5 (Wiring diagram)

An acyclic, O-typed wiring diagram [7, 22] is a map from a series of internal boxes, each one defined by its domain and codomain pair \((\boldsymbol{\tau }_{i}^{-}, \boldsymbol{\tau }_{i}^{+})\), to an outer box defined by its domain and codomain \((\boldsymbol{\tau }_{n}^{-}, \boldsymbol{\tau }_{n}^{+})\)

$$\begin{aligned} \varPhi&: \mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{1}^{-}, \boldsymbol{\tau }_{1}^{+}) \times \ldots \times \mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{n-1}^{-}, \boldsymbol{\tau }_{n-1}^{+}) \rightarrow \mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{n}^{-}, \boldsymbol{\tau }_{n}^{+}). \end{aligned}$$

Acyclicity requires that connections (“wires”) can extend only from the outer box’s domain to the domains of inner boxes, from the inner boxes’ codomains to the outer box’s codomain, and between internal boxes such that no cycles are formed in the directed graph of connections between inner boxes.
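Concretely, an acyclic wiring diagram of this kind can be specified as plain data, for example (the box types below are hypothetical):

```python
# One outer box and its inner boxes, each given as a (domain, codomain) pair.
outer_box = (("img",), ("img",))
inner_boxes = [
    (("img",), ("latent",)),   # inner box 1
    (("latent",), ("img",)),   # inner box 2
]
# Wires: outer domain -> box 1 domain, box 1 codomain -> box 2 domain,
# box 2 codomain -> outer codomain.  No cycles among the inner boxes.
```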

Given a user-specified wiring diagram \(\varPhi \), we can then write the complete prior distribution over all latent variables in our generative model.

$$\begin{aligned} p(f, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})&= p(\beta ) p({\textbf{w}}) \prod _{(\boldsymbol{\tau }_{i}^{-}, \boldsymbol{\tau }_{i}^{+}) \in \varPhi } p(f_{i} \mid \beta , {\textbf{w}}; \boldsymbol{\tau }_{i}^{-}, \boldsymbol{\tau }_{i}^{+}). \end{aligned}$$
(4)
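The prior of Eq. 4 can be sketched as a Pyro program; `sample_morphism` below is a hypothetical callable standing in for Algorithms 2–3, not part of DisCoPyro's actual API.

```python
import pyro
import pyro.distributions as dist
import torch

def free_operad_prior(boxes, generators, recursion_sites, sample_morphism):
    """Sketch of Eq. 4: sample (f, w, beta) for a wiring diagram of (dom, cod) boxes.

    `sample_morphism` is a hypothetical stand-in for the path sampler of Algorithms 2-3.
    """
    beta = pyro.sample("beta", dist.Gamma(1.0, 1.0))
    w = pyro.sample("w", dist.Dirichlet(torch.ones(len(generators) + len(recursion_sites))))
    morphisms = []
    for i, (dom, cod) in enumerate(boxes):
        # One path sample per box of the wiring diagram.
        morphisms.append(sample_morphism(f"f_{i}", dom, cod, beta, w))
    return morphisms, w, beta
```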

If a user provides a likelihood \(p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}\mid f)\) relating the learned structure f to data \({\textbf{x}}\) (via latents \({\textbf{z}}\)), we will have a joint density

$$\begin{aligned} p({\textbf{x}}, {\textbf{z}}, f, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})&= p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}\mid f)\, p(f, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}), \end{aligned}$$
(5)

and Eq. 5 then admits inference from the data \({\textbf{x}}\) by Bayesian inversion

$$\begin{aligned} p({\textbf{z}}, f, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})&= \frac{ p({\textbf{x}}, {\textbf{z}}, f, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) }{ p_{{\boldsymbol{\theta }}}({\textbf{x}}; \varPhi , \mathcal {S}) }. \end{aligned}$$
(6)

Section 2.1 will explain how to approximate Eq. 6 by stochastic gradient-based optimization, yielding a maximum-likelihood estimate of \({\boldsymbol{\theta }}\) and an optimal approximation, within the parametric family indexed by \({\boldsymbol{\phi }}\), to the true Bayesian inverse.

2.1 Model Learning and Variational Bayesian Inference

Bayesian inversion relies on evaluating the model evidence \(p_{{\boldsymbol{\theta }}}({\textbf{x}}; \varPhi , \mathcal {S})\), which typically has no closed-form solution. However, we can transform the high-dimensional integral over the joint density into an expectation

$$\begin{aligned} p_{{\boldsymbol{\theta }}}({\textbf{x}}; \varPhi , \mathcal {S})&= \int p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) d{\textbf{z}}\, df_{{\boldsymbol{\theta }}}\, d{\textbf{w}}\, d\beta \\&= \mathbb {E}_{p(f_{\boldsymbol{\theta }}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})} \left[ p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) \right] , \end{aligned}$$

and then rewrite that expectation into one over the proposal

$$\begin{aligned}&\mathbb {E}_{p(f_{\boldsymbol{\theta }}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})} \left[ p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) \right] = \\&\qquad \qquad \qquad \qquad \qquad \mathbb {E}_{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \left[ \frac{p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})}{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \right] . \end{aligned}$$

To construct the proposal \(q_{{\boldsymbol{\phi }}}\), DisCoPyro provides both the functorial inversion described in Sect. 3.1 and an amortized form of Automatic Structured Variational Inference [1] suitable for any universal probabilistic program.

Jensen’s inequality says that the expectation of the log density ratio lower-bounds the log of the expected density ratio

$$\begin{aligned}&\mathbb {E}_{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \left[ \log \frac{p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})}{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \right] \le \\&\qquad \qquad \qquad \qquad \qquad \qquad \log \mathbb {E}_{p(f_{\boldsymbol{\theta }}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})} \left[ p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) \right] , \end{aligned}$$

so that the left-hand side provides a lower bound to the true model evidence

$$\begin{aligned} \mathcal {L}({\boldsymbol{\theta }}, {\boldsymbol{\phi }}) = \mathbb {E}_{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \left[ \log \frac{p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})}{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \right]&\le \log p_{{\boldsymbol{\theta }}}({\textbf{x}}; \varPhi , \mathcal {S}). \end{aligned}$$
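Writing \(p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}\mid f_{{\boldsymbol{\theta }}}) = p_{{\boldsymbol{\theta }}}({\textbf{x}}\mid {\textbf{z}}, f_{{\boldsymbol{\theta }}})\, p_{{\boldsymbol{\theta }}}({\textbf{z}}\mid f_{{\boldsymbol{\theta }}})\), this bound decomposes, by the standard identity, into a reconstruction term and a KL regularizer

$$\begin{aligned} \mathcal {L}({\boldsymbol{\theta }}, {\boldsymbol{\phi }})&= \mathbb {E}_{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}})} \left[ \log p_{{\boldsymbol{\theta }}}({\textbf{x}}\mid {\textbf{z}}, f_{{\boldsymbol{\theta }}}) \right] - D_{KL}\left( q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}) \,\Vert \, p_{{\boldsymbol{\theta }}}({\textbf{z}}\mid f_{{\boldsymbol{\theta }}})\, p(f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) \right) . \end{aligned}$$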

Maximizing this evidence lower bound (ELBO) by Monte Carlo estimation of its values and gradients (using Pyro’s built-in gradient estimators) will estimate the model parameters \({\boldsymbol{\theta }}\) by maximum likelihood and train the proposal parameters \({\boldsymbol{\phi }}\) to approximate the Bayesian inverse (Eq. 6) [12].
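In Pyro, this optimization is what SVI with a Trace_ELBO loss performs. The following is a minimal sketch, assuming `model` and `guide` wrap \(p_{{\boldsymbol{\theta }}}\) and \(q_{{\boldsymbol{\phi }}}\) as ordinary Pyro programs taking a batch \({\textbf{x}}\), and `data_loader` yields such batches (all three names are assumptions, not DisCoPyro's API).

```python
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

# `model` and `guide` stand for p_theta(x, z, f, w, beta; Phi, S) and
# q_phi(z, f, w, beta | x; Phi, S) written as Pyro programs.
svi = SVI(model, guide, Adam({"lr": 1e-3}), loss=Trace_ELBO())

for epoch in range(10):        # number of epochs chosen arbitrarily for illustration
    for x in data_loader:      # data_loader is assumed to yield batches of x
        loss = svi.step(x)     # one stochastic gradient step on the negative ELBO
```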

3 Example Application and Training

The framework of connecting a morphism to data via a likelihood with intermediate latent random variables allows for a broad variety of applications. This section demonstrates the resulting capabilities of the DisCoPyro framework. Section 3.1 describes an example application of the framework to deep probabilistic program learning for generative modeling. Section 3.2 then evaluates that application’s performance as a generative model.

3.1 Deep Probabilistic Program Learning with DisCoPyro

As a demonstrative experiment, we constructed an operad \(\mathcal {O}\) whose generators implemented Pyro building blocks for deep generative models \(f_{\boldsymbol{\theta }}\) (taken from work on structured variational autoencoders [12, 19, 24]) with parameters \({\boldsymbol{\theta }}\). We then specified the one-box wiring diagram \(\varPhi : (I, \mathbb {R}^{28\times 28}) \rightarrow (I, \mathbb {R}^{28\times 28})\) to parameterize the DisCoPyro generative model. We trained the resulting free operad model on MNIST as a sanity check, and on the downsampled (\(28 \times 28\)) Omniglot dataset for few-shot learning [14] as a more challenging benchmark. Since the data \({\textbf{x}}\in \mathbb {R}^{28 \times 28}\), our experimental setup induces the joint likelihood

$$\begin{aligned} p_{{\boldsymbol{\theta }}}({\textbf{x}}\mid {\textbf{z}}, f_{\boldsymbol{\theta }})&= \mathcal {N}({\boldsymbol{\mu }}_{{\boldsymbol{\theta }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}), {\boldsymbol{I}}{\boldsymbol{\tau }}) \\ p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}\mid f_{{\boldsymbol{\theta }}})&= p_{{\boldsymbol{\theta }}}({\textbf{x}}\mid {\textbf{z}}, f_{{\boldsymbol{\theta }}}) p_{{\boldsymbol{\theta }}}({\textbf{z}}\mid f_{{\boldsymbol{\theta }}}). \end{aligned}$$
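As a sketch, the Gaussian likelihood above corresponds to a Pyro sample statement of roughly the following form, where `decoder` is a hypothetical stand-in for applying the morphism \(f_{{\boldsymbol{\theta }}}\) to the latents and `tau` is the fixed observation-noise variance.

```python
import pyro
import pyro.distributions as dist
import torch

def likelihood(x, z, decoder, tau=0.1):
    """p_theta(x | z, f_theta) = Normal(mu_theta(z, f_theta), tau * I) over a 28x28 image."""
    mu = decoder(z)                        # mu_theta(z, f_theta), shape (28, 28)
    scale = torch.sqrt(torch.tensor(tau))  # standard deviation from the variance tau
    return pyro.sample("x", dist.Normal(mu, scale).to_event(2), obs=x)
```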

DisCoPyro provides amortized variational inference over its own random variables via neural proposals \(q_{{\boldsymbol{\phi }}}\) for the “confidence” \(\beta \sim q_{{\boldsymbol{\phi }}}(\beta \mid {\textbf{x}})\) and the “preferences” over generators \({\textbf{w}}\sim q_{{\boldsymbol{\phi }}}({\textbf{w}}\mid {\textbf{x}})\). Running the core DisCoPyro generative model over structures \(f_{\boldsymbol{\theta }}\) then gives a proposal over morphisms in the free operad, providing a generic proposal for DisCoPyro’s latent variables

$$\begin{aligned} q_{{\boldsymbol{\phi }}}(f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})&= p(f_{{\boldsymbol{\theta }}} \mid {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) q_{{\boldsymbol{\phi }}}(\beta \mid {\textbf{x}}) q_{{\boldsymbol{\phi }}}({\textbf{w}}\mid {\textbf{x}}). \end{aligned}$$

Since the morphisms in our example application are components of deep generative models, each generating morphism can be simply “flipped on its head” to get a corresponding neural network design for a proposal. We specify that proposal as \(q_{{\boldsymbol{\phi }}}({\textbf{z}}\mid {\textbf{x}}, f_{{\boldsymbol{\theta }}})\); it constructs a faithful inverse [23] compositionally via a dagger functor (for further description of Bayesian inversion as a dagger functor, please see Fritz [8]). Our application then has a complete proposal density

$$\begin{aligned} q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})&= q_{{\boldsymbol{\phi }}}({\textbf{z}}\mid f_{{\boldsymbol{\theta }}}, {\textbf{x}}) q_{{\boldsymbol{\phi }}}(f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S}). \end{aligned}$$
(7)
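A guide reflecting the factorization in Eq. 7 might be sketched as follows; `encode_beta`, `encode_w`, `propose_structure`, and `invert` are hypothetical stand-ins for the amortization networks, the structure proposal run under the sampled \(({\textbf{w}}, \beta )\), and the dagger-functor inverse of \(f_{{\boldsymbol{\theta }}}\), respectively.

```python
import pyro

def guide(x, encode_beta, encode_w, propose_structure, invert):
    """Sketch of Eq. 7: q(z, f, w, beta | x) = q(z | f, x) p(f | w, beta) q(beta | x) q(w | x)."""
    beta = pyro.sample("beta", encode_beta(x))  # q_phi(beta | x), e.g. an amortized Gamma
    w = pyro.sample("w", encode_w(x))           # q_phi(w | x), e.g. an amortized Dirichlet
    f = propose_structure(w, beta)              # reruns the structure prior p(f | w, beta)
    z = pyro.sample("z", invert(f, x))          # q_phi(z | f, x): the faithful inverse of f
    return z, f, w, beta
```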

3.2 Experimental Results and Performance Comparison

Table 1 compares our free operad model’s performance to other structured deep generative models. We report the estimated log model evidence. Our free operad prior over deep generative models achieves the best log-evidence per data dimension, although standard deviations for the baselines do not appear to be available for comparison. Some of the older baselines, such as the sequential attention model and the variational homoencoder, fix a composition structure ahead of time instead of learning it from data as we do. Figure 2 shows samples from the trained model’s posterior distribution, including reconstruction of evaluation data (Fig. 2a) and an example structure for that data (Fig. 2b).

Table 1. Average log-evidence on the Omniglot evaluation set across models. Our free operad model obtains the highest log-evidence per data dimension.

Historically, Lake [14] proposed the Omniglot dataset to challenge the machine learning community to achieve human-like concept learning by learning a single generative model from very few examples; the Omniglot challenge requires that a model be usable for classification, latent feature recognition, concept generation from a type, and exemplar generation of a concept. The deep generative modeling community has focused on producing models capable of few-shot reconstruction of unseen characters. The models of [11, 19] fixed their architectures as constant, attempting to account for the compositional structure in the data with a static latent dimensionality. In contrast, [5, 9] performed joint structure learning, latent-variable inference, and data reconstruction, as we do.

Fig. 2. Reconstructions (left) generated by inference in the diagrammatic generative model (right) on handwritten characters in the Omniglot evaluation set. The string diagram shows a model that generates a glimpse, decodes it into an image canvas via a variational ladder decoder, and then performs a simpler process to generate another glimpse and insert it into the canvas.

4 Discussion

This paper described the DisCoPyro system for generative Bayesian structure learning, along with its variational inference training procedures and an example application. Section 2 described DisCoPyro’s mathematical foundations in category theory, operad theory, and variational Bayesian inference. Section 3 showed DisCoPyro to be competitive against other models on a challenge dataset.

As Lake [15] suggested, (deep) probabilistic programs can model human intelligence across more domains than handwritten characters. Beyond programs, neural network architectures, or triangulable manifolds, investigators have applied operads and SMCs to chemical reaction networks, natural language processing, and the systematicity of human intelligence [3, 18]. This broad variety of applications motivates our interest in representing the problems a generally intelligent agent must solve in terms of operadic structures, and learning those structures jointly with their contents from data.