
1 Introduction

Category theory has been applied in various domains of mathematical science, allowing us to discover universal principles unifying diverse mathematical phenomena and thereby enabling knowledge transfer between them [7]. Applications to machine learning have been pursued recently [21]; however, there remains a large gap between the foundational mathematics and applicability in concrete machine learning tasks. This work begins to fill that gap. We introduce DisCoPyro, a categorical structure learning framework consisting of a probabilistic generative model with amortized variational inference. We both provide its mathematical foundations and compare it with other neurosymbolic models on an example application.

Fig. 1. Example experiment. In each epoch of training, DisCoPyro learns variational autoencoder structures by sampling them from its skeleton according to a wiring diagram, then learning their faithful inverses as approximate posteriors.

Here we describe why we believe that DisCoPyro could contribute, in the long run, to developing human-level artificial general intelligence. Human intelligence supports graded statistical reasoning [15], and evolved to represent spatial (geometric) domains before being applied to symbolic (algebraic) domains. Symmetric monoidal categories provide a mathematical framework for constructing both symbolic computations (as in this paper) and geometric spaces (e.g. [17]). We follow Lake's [15] suggestion to represent graded statistical reasoning via probability theory, integrating neural networks into variational inference for tractability. In terms of applications, we achieve competitive performance (see Subsect. 3.2 below) by variational Bayes, without resorting to reinforcement learning of structure as with modular neural networks [13, 20].

The rest of the paper is organized as follows. In Sect. 2 we first introduce the mathematical foundations of DisCoPyro, including how to train it by variational Bayesian inference (Subsect. 2.1). In Sect. 3 we then describe an example application (Subsect. 3.1) and provide experimental results and performance comparisons (Subsect. 3.2). Figure 1 demonstrates the flow of execution during the training procedure for the example task. We conclude and discuss further applications in Sect. 4. We provide an example implementation at https://github.com/neu-pml/discopyro with experiments at https://github.com/esennesh/categorical_bpl. DisCoPyro builds upon Pyro [2] (a deep universal probabilistic programming language), DisCoCat [4] (a distributional compositional model for natural language processing), and the DisCoPy [6] library for computing with categories.

1.1 Notation

This paper takes symmetric monoidal categories (SMCs) \(\mathcal {C}\) and their corresponding operads \(\mathcal {O}\) as its mathematical setting. The reader is welcome to see Fong [7] for an introduction to these. SMCs are built from objects \(Ob(\mathcal {C})\) and sets of morphisms \(\mathcal {C}(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2})\) between objects \(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2} \in Ob(\mathcal {C})\). Operads are built from types \(Ty(\mathcal {O})\) and sets of morphisms \(\mathcal {O}(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2})\) between types \(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2} \in Ty(\mathcal {O})\). An SMC is usually written \((\mathcal {C}, \otimes , I)\) with a product operation \(\otimes \) over objects and morphisms and a unit I of \(\otimes \). In both settings, every object/type \(\boldsymbol{\tau }_{}\) has a unique identity morphism \(id_{\boldsymbol{\tau }_{}}\). Categories support composition \(g \circ f\) on morphisms, and operads support indexed composition \(g\circ _{i} f\) (for \(i \in \mathbb {N}\)) on morphisms.
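Since DisCoPyro builds on DisCoPy [6], the notation above has a direct computational counterpart. The following is a minimal sketch, assuming a DisCoPy release in which Ty and Box are exported from discopy.monoidal; the object and morphism names are purely illustrative.

```python
# Minimal sketch of SMC notation in DisCoPy (assumes Ty/Box live in discopy.monoidal).
from discopy.monoidal import Ty, Box

# Objects of the SMC, with Ty() as the monoidal unit I.
x, y, z = Ty('x'), Ty('y'), Ty('z')
unit = Ty()

# Morphisms (generators) between objects.
f = Box('f', x, y)   # f : x -> y
g = Box('g', y, z)   # g : y -> z

# Sequential composition and the monoidal product.
h = f >> g           # h = g . f : x -> z
k = f @ g            # k = f (x) g : x (x) y -> y (x) z
```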

2 Foundations of DisCoPyro

In essence, Definition 1 below exposes a finite number of building blocks (generators) from an SMC; the morphisms of interest are those constructed by composing these generators with \(\circ \) and \(\otimes \). For example, in categories of executable programs, a monoidal signature [6] specifies a domain-specific programming language.

Definition 1 (Monoidal signature in an SMC)

Given a symmetric monoidal category (SMC) \(\mathcal {C}\) with the objects denoted by \(Ob(\mathcal {C})\), a monoidal signature \(\mathcal {S}=(O, M)\) in that SMC consists of

  • A finite set \(O \subseteq Ob(\mathcal {C})\); and

  • A finite set M consisting of elements \(m: \mathcal {C}(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2})\) for some \(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2} \in O\), such that \(\forall \boldsymbol{\tau }_{} \in O, m \ne id_{\boldsymbol{\tau }_{}}\).
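As a concrete illustration of Definition 1, a monoidal signature can be recorded as plain data: a finite set of objects and a finite set of typed generators. The following sketch uses invented object and generator names for a tiny domain-specific language; it is not DisCoPyro's internal representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Generator:
    """A generator m : dom -> cod drawn from the SMC."""
    name: str
    dom: tuple  # domain objects, as a tuple of object names
    cod: tuple  # codomain objects

# A hypothetical signature S = (O, M).
O = {"img", "latent"}
M = {
    Generator("encode", ("img",), ("latent",)),
    Generator("decode", ("latent",), ("img",)),
    Generator("merge", ("latent", "latent"), ("latent",)),
}
```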

The following free operad over a monoidal signature represents the space of all possible programs synthesized from the above building blocks (generators specified by the monoidal signature). Employing an operad rather than just a category allows us to reason about composition as nesting rather than just transitive combination; employing an operad rather than just a grammar allows us to reason about both the inputs and outputs of operations rather than just their outputs.

Definition 2 (Free operad over a signature)

The free operad \(\mathcal {O}_{\mathcal {S}}\) over a signature \(\mathcal {S}=(O, M)\) consists of

  • A set of types (representations) \(Ty(\mathcal {O}_{\mathcal {S}})=\{I\} \cup O^{\otimes }\);

  • For every \(n \in \mathbb {N}^{+}\), a set of operations (mappings) \(\mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{0}, \ldots , \boldsymbol{\tau }_{n-1}; \boldsymbol{\tau }_{n})\) consisting of all trees with finitely many branches and leaves, in which each vertex v with k children is labeled by a generator \(m(v) \in M\) such that \(\textrm{dom}(m(v))\) has product length k;

  • An identity operation \(id_{\boldsymbol{\tau }_{}}: \mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{}; \boldsymbol{\tau }_{})\) for every \(\boldsymbol{\tau }_{} \in Ty(\mathcal {O}_{\mathcal {S}})\); and

  • A substitution operator \(\circ _{i}\) defined by nesting a syntax tree \(\mathcal {O}_{\mathcal {S}}(\boldsymbol{\sigma }_1, \ldots , \boldsymbol{\sigma }_m; \boldsymbol{\tau }_{i})\) inside another \(\mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{1}, \ldots , \boldsymbol{\tau }_{n-1}; \boldsymbol{\tau }_{n})\) when \(i \in [1...n-1]\) to produce a syntax tree \(\mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{1}, \ldots , \boldsymbol{\tau }_{i-1}, \boldsymbol{\sigma }_{1}, \ldots , \boldsymbol{\sigma }_{m}, \boldsymbol{\tau }_{i+1}, \ldots , \boldsymbol{\tau }_{n-1}; \boldsymbol{\tau }_{n})\).
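For instance, nesting a unary syntax tree into the first argument of a binary one is a routine instance of this substitution operator:

$$\begin{aligned} f \in \mathcal {O}_{\mathcal {S}}(\boldsymbol{\sigma }_{1}; \boldsymbol{\tau }_{1}), \quad g \in \mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2}; \boldsymbol{\tau }_{3}) \quad \Longrightarrow \quad g \circ _{1} f \in \mathcal {O}_{\mathcal {S}}(\boldsymbol{\sigma }_{1}, \boldsymbol{\tau }_{2}; \boldsymbol{\tau }_{3}). \end{aligned}$$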

Intuitively, free operads have a lot in common with context-free grammars, and in fact Hermida [10] proved that they share a representation as directed acyclic hypergraphs. The definition of a signature in an SMC already hints at the structure of the appropriate hypergraph, but Algorithm 1 will make it explicit and add edges to the hypergraph corresponding to nesting separate operations in parallel (or, equivalently, to monoidal products in the original SMC).

[Algorithm 1: constructs the hypergraph representation of the free operad over a signature, adding edges corresponding to monoidal products.]

In the hypergraph produced by Algorithm 1, each vertex corresponds to a non-product type and each hyperedge has a list of vertices as its domain and codomain. Each such hypergraph admits a representation as a graph as well, in which the hyperedges serve as nodes and the lists in their domains and codomains serve as edges. We will use this graph representation \(G\simeq H\) to reason about morphisms as paths between their domain and codomain.

We will derive a probabilistic generative model over morphisms in the free operad from this graph representation’s directed adjacency matrix \(A_G\).

Definition 3 (Transition distance in a directed graph)

The “transition distance” between two indexed vertices \(v_i, v_j\) is the negative logarithm of the ij entry in the exponentiated adjacency/transition matrix

$$\begin{aligned} d(v_i, v_j)&= -\log \left( \left[ e^{A_G} \right] _{i, j} \right) , \end{aligned}$$
(1)

where the matrix exponential is defined by the series

$$\begin{aligned} e^{A_G}&= \sum _{n=0}^{\infty } \frac{(A_G)^{n}}{n!}. \end{aligned}$$
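A minimal numerical sketch of Definition 3, for a hypothetical four-vertex graph (the adjacency matrix below is invented for illustration):

```python
import numpy as np
from scipy.linalg import expm  # the standard matrix exponential

# Hypothetical directed adjacency matrix A_G for a 4-vertex graph representation.
A_G = np.array([[0., 1., 0., 0.],
                [0., 0., 1., 1.],
                [0., 0., 0., 1.],
                [0., 0., 0., 0.]])

def transition_distance(A, i, j):
    """d(v_i, v_j) = -log([e^A]_{ij}), as in Eq. 1; infinite when the entry is zero."""
    weight = expm(A)[i, j]
    return np.inf if weight == 0.0 else -np.log(weight)

print(transition_distance(A_G, 0, 3))  # finite: vertex 3 is reachable from vertex 0
print(transition_distance(A_G, 3, 0))  # inf: no directed path back to vertex 0
```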

A soft-minimization distribution over this transition distance will, in expectation and holding the indexed target vertex constant, define a probabilistic generative model over paths through the hypergraph.

Definition 4 (Free operad prior)

Consider a signature \(\mathcal {S}=(O, M)\) and its resulting graph representation \(G=(V,E)\) and recursion sites R, and then condition upon a domain and codomain \(\boldsymbol{\tau }_{-}, \boldsymbol{\tau }_{+} \in Ty(\mathcal {O}_{\mathcal {S}})\) represented by vertices in the graph. The free operad prior assigns a probability density to all finite paths \(\textbf{e}=(e_1, e_2, \ldots , e_n)\) with \(\textrm{dom}(\textbf{e})=\boldsymbol{\tau }_{-}\) and \(\textrm{cod}(\textbf{e})=\boldsymbol{\tau }_{+}\) by means of an absorbing Markov chain. First the model samples a “precision” \(\beta \) and a set of “weights” \({\textbf{w}}\)

$$\begin{aligned} \beta&\sim \textrm{Gamma}(1, 1)&{\textbf{w}}&\sim \textrm{Dirichlet}\left( \vec {\textbf{1}}^{(|M| + |R|)}\right) . \end{aligned}$$

Then it samples a path (from the absorbing Markov chain in Algorithm 2) by soft minimization (biased towards shorter paths by \(\beta \)) of the transition distance

$$\begin{aligned} \pi (e \in E \mid \boldsymbol{\tau }_{1}, \boldsymbol{\tau }_{2}; \beta )&:= \frac{\exp {\left( -\frac{1}{\beta } d(\textrm{cod}(e), \boldsymbol{\tau }_{2}) \right) }}{ \sum _{e' \in E : \textrm{dom}(e')=\boldsymbol{\tau }_{1}} \exp {\left( -\frac{1}{\beta } d(\textrm{cod}(e'), \boldsymbol{\tau }_{2}) \right) } }. \end{aligned}$$
(2)

Equation 2 induces a transition operator T which, by Theorem 2.5.3 in Latouche and Ramaswami [16], almost surely reaches its absorbing state corresponding to \(\boldsymbol{\tau }_{2}\). This path can then be filled in according to Algorithm 3. The precision \(\beta \) increases at each recursion so that deeper recursions terminate with shorter paths. We denote the induced joint distribution as

$$\begin{aligned} p(f, {\textbf{w}}, \beta ; \boldsymbol{\tau }_{-}, \boldsymbol{\tau }_{+})&= p(f \mid \beta , {\textbf{w}}; \boldsymbol{\tau }_{-}, \boldsymbol{\tau }_{+}) p({\textbf{w}}) p(\beta ). \end{aligned}$$
(3)
[Algorithm 2: samples a path through the graph representation from the absorbing Markov chain induced by Eq. 2.]

[Algorithm 3: fills in the sampled path to yield an operation in the free operad.]
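As a rough illustration of the sampler that Eq. 2 and Algorithm 2 describe, the following is a minimal sketch; the edge encoding, the `dist` lookup table, and the helper name `sample_path` are assumptions made for illustration rather than DisCoPyro's actual API.

```python
import numpy as np

def sample_path(edges, dist, start, target, beta, rng):
    """Sample a path of edges from `start` to the absorbing `target` vertex.

    edges: list of (name, dom, cod) triples over vertex labels (hypothetical encoding).
    dist:  dict mapping (vertex, vertex) -> transition distance d from Definition 3.
    beta:  precision drawn from the free operad prior.
    Assumes `target` is reachable from `start`.
    """
    path, current = [], start
    while current != target:
        candidates = [e for e in edges if e[1] == current]
        # Soft minimization of the transition distance to the target (Eq. 2).
        scores = np.array([-dist[(e[2], target)] / beta for e in candidates])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        choice = candidates[rng.choice(len(candidates), p=probs)]
        path.append(choice)
        current = choice[2]
    return path

# usage: sample_path(edges, dist, "img", "latent", beta=1.0, rng=np.random.default_rng())
```

In the full prior, \(\beta \) and \({\textbf{w}}\) are drawn first as in Definition 4, and recursive calls re-invoke the sampler with an increased precision.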

Having a probabilistic generative model over operations in the free operad over a signature, we now need a way to specify a structure learning problem. Definition 5 provides this by specifying what paths to sample (each box specifies a call to Algorithm 2) and how to compose them.

Definition 5 (Wiring diagram)

An acyclic, O-typed wiring diagram [7, 22] is a map from a series of internal boxes, each one defined by its domain and codomain pair \((\boldsymbol{\tau }_{i}^{-}, \boldsymbol{\tau }_{i}^{+})\), to an outer box defined by its domain and codomain \((\boldsymbol{\tau }_{n}^{-}, \boldsymbol{\tau }_{n}^{+})\)

$$\begin{aligned} \varPhi&: \mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{1}^{-}, \boldsymbol{\tau }_{1}^{+}) \times \ldots \times \mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{n-1}^{-}, \boldsymbol{\tau }_{n-1}^{+}) \rightarrow \mathcal {O}_{\mathcal {S}}(\boldsymbol{\tau }_{n}^{-}, \boldsymbol{\tau }_{n}^{+}). \end{aligned}$$

Acyclicity requires that connections (“wires”) can extend only from the outer box’s domain to the domains of inner boxes, from the inner boxes’ codomains to the outer box’s codomain, and between internal boxes such that no cycles are formed in the directed graph of connections between inner boxes.
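Concretely, an acyclic wiring diagram of this kind can be specified as plain data, for example (the box types below are hypothetical):

```python
# One outer box and its inner boxes, each given as a (domain, codomain) pair.
outer_box = (("img",), ("img",))
inner_boxes = [
    (("img",), ("latent",)),   # inner box 1
    (("latent",), ("img",)),   # inner box 2
]
# Wires: outer domain -> box 1 domain, box 1 codomain -> box 2 domain,
# box 2 codomain -> outer codomain.  No cycles among the inner boxes.
```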

Given a user-specified wiring diagram \(\varPhi \), we can then write the complete prior distribution over all latent variables in our generative model.

$$\begin{aligned} p(f, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})&= p(\beta ) p({\textbf{w}}) \prod _{(\boldsymbol{\tau }_{i}^{-}, \boldsymbol{\tau }_{i}^{+}) \in \varPhi } p(f_{i} \mid \beta , {\textbf{w}}; \boldsymbol{\tau }_{i}^{-}, \boldsymbol{\tau }_{i}^{+}). \end{aligned}$$
(4)
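The prior of Eq. 4 can be sketched as a Pyro program; `sample_morphism` below is a hypothetical callable standing in for Algorithms 2–3, not part of DisCoPyro's actual API.

```python
import pyro
import pyro.distributions as dist
import torch

def free_operad_prior(boxes, generators, recursion_sites, sample_morphism):
    """Sketch of Eq. 4: sample (f, w, beta) for a wiring diagram of (dom, cod) boxes.

    `sample_morphism` is a hypothetical stand-in for the path sampler of Algorithms 2-3.
    """
    beta = pyro.sample("beta", dist.Gamma(1.0, 1.0))
    w = pyro.sample("w", dist.Dirichlet(torch.ones(len(generators) + len(recursion_sites))))
    morphisms = []
    for i, (dom, cod) in enumerate(boxes):
        # One path sample per box of the wiring diagram.
        morphisms.append(sample_morphism(f"f_{i}", dom, cod, beta, w))
    return morphisms, w, beta
```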

If a user provides a likelihood \(p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}\mid f)\) relating the learned structure f to data \({\textbf{x}}\) (via latents \({\textbf{z}}\)), we will have a joint density

$$\begin{aligned} p({\textbf{x}}, {\textbf{z}}, f, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})&= p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}\mid f)\, p(f, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}), \end{aligned}$$
(5)

and Eq. 5 then admits inference from the data \({\textbf{x}}\) by Bayesian inversion

$$\begin{aligned} p({\textbf{z}}, f, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})&= \frac{ p({\textbf{x}}, {\textbf{z}}, f, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) }{ p_{{\boldsymbol{\theta }}}({\textbf{x}}; \varPhi , \mathcal {S}) }. \end{aligned}$$
(6)

Section 2.1 will explain how to approximate Eq. 6 by stochastic gradient-based optimization, yielding a maximum-likelihood estimate of \({\boldsymbol{\theta }}\) and an optimal approximation, within the parametric family indexed by \({\boldsymbol{\phi }}\), to the true Bayesian inverse.

2.1 Model Learning and Variational Bayesian Inference

Bayesian inversion relies on evaluating the model evidence \(p_{{\boldsymbol{\theta }}}({\textbf{x}}; \varPhi , \mathcal {S})\), which typically has no closed-form solution. However, we can transform the high-dimensional integral over the joint density into an expectation

$$\begin{aligned} p_{{\boldsymbol{\theta }}}({\textbf{x}}; \varPhi , \mathcal {S})&= \int p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) d{\textbf{z}}\, df_{{\boldsymbol{\theta }}}\, d{\textbf{w}}\, d\beta \\&= \mathbb {E}_{p(f_{\boldsymbol{\theta }}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})} \left[ p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) \right] , \end{aligned}$$

and then rewrite that expectation into one over the proposal

$$\begin{aligned}&\mathbb {E}_{p(f_{\boldsymbol{\theta }}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})} \left[ p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) \right] = \\&\qquad \qquad \qquad \qquad \qquad \mathbb {E}_{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \left[ \frac{p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})}{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \right] . \end{aligned}$$

To construct the proposal \(q_{{\boldsymbol{\phi }}}\), DisCoPyro provides both the functorial inversion described in Sect. 3.1 and an amortized form of Automatic Structured Variational Inference [1] suitable for any universal probabilistic program.

Jensen’s inequality says that the expectation of the log density ratio lower-bounds the log of the expected density ratio

$$\begin{aligned}&\mathbb {E}_{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \left[ \log \frac{p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})}{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \right] \le \\&\qquad \qquad \qquad \qquad \qquad \qquad \log \mathbb {E}_{p(f_{\boldsymbol{\theta }}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})} \left[ p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) \right] , \end{aligned}$$

so that the left-hand side provides a lower bound to the true model evidence

$$\begin{aligned} \mathcal {L}({\boldsymbol{\theta }}, {\boldsymbol{\phi }}) = \mathbb {E}_{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \left[ \log \frac{p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S})}{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})} \right]&\le \log p_{{\boldsymbol{\theta }}}({\textbf{x}}; \varPhi , \mathcal {S}). \end{aligned}$$
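Writing \(p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}\mid f_{{\boldsymbol{\theta }}}) = p_{{\boldsymbol{\theta }}}({\textbf{x}}\mid {\textbf{z}}, f_{{\boldsymbol{\theta }}})\, p_{{\boldsymbol{\theta }}}({\textbf{z}}\mid f_{{\boldsymbol{\theta }}})\), this bound decomposes, by the standard identity, into a reconstruction term and a KL regularizer

$$\begin{aligned} \mathcal {L}({\boldsymbol{\theta }}, {\boldsymbol{\phi }})&= \mathbb {E}_{q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}})} \left[ \log p_{{\boldsymbol{\theta }}}({\textbf{x}}\mid {\textbf{z}}, f_{{\boldsymbol{\theta }}}) \right] - D_{KL}\left( q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}) \,\Vert \, p_{{\boldsymbol{\theta }}}({\textbf{z}}\mid f_{{\boldsymbol{\theta }}})\, p(f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) \right) . \end{aligned}$$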

Maximizing this evidence lower bound (ELBO) by Monte Carlo estimation of its values and gradients (using Pyro’s built-in gradient estimators) will estimate the model parameters \({\boldsymbol{\theta }}\) by maximum likelihood and train the proposal parameters \({\boldsymbol{\phi }}\) to approximate the Bayesian inverse (Eq. 6) [12].
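In Pyro, this optimization is what SVI with a Trace_ELBO loss performs. The following is a minimal sketch, assuming `model` and `guide` wrap \(p_{{\boldsymbol{\theta }}}\) and \(q_{{\boldsymbol{\phi }}}\) as ordinary Pyro programs taking a batch \({\textbf{x}}\), and `data_loader` yields such batches (all three names are assumptions, not DisCoPyro's API).

```python
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

# `model` and `guide` stand for p_theta(x, z, f, w, beta; Phi, S) and
# q_phi(z, f, w, beta | x; Phi, S) written as Pyro programs.
svi = SVI(model, guide, Adam({"lr": 1e-3}), loss=Trace_ELBO())

for epoch in range(10):        # number of epochs chosen arbitrarily for illustration
    for x in data_loader:      # data_loader is assumed to yield batches of x
        loss = svi.step(x)     # one stochastic gradient step on the negative ELBO
```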

3 Example Application and Training

The framework of connecting a morphism to data via a likelihood with intermediate latent random variables allows for a broad variety of applications. This section demonstrates the resulting capabilities of the DisCoPyro framework. Section 3.1 describes an example application of the framework to deep probabilistic program learning for generative modeling. Section 3.2 then evaluates that application’s performance as a generative model.

3.1 Deep Probabilistic Program Learning with DisCoPyro

As a demonstrative experiment, we constructed an operad \(\mathcal {O}\) whose generators implemented Pyro building blocks for deep generative models \(f_{\boldsymbol{\theta }}\) (taken from work on structured variational autoencoders [12, 19, 24]) with parameters \({\boldsymbol{\theta }}\). We then specified the one-box wiring diagram \(\varPhi : (I, \mathbb {R}^{28\times 28}) \rightarrow (I, \mathbb {R}^{28\times 28})\) to parameterize the DisCoPyro generative model. We trained the resulting free operad model on MNIST as a sanity check, and on the downsampled (\(28 \times 28\)) Omniglot dataset for few-shot learning [14] as a more challenging benchmark. Since the data \({\textbf{x}}\in \mathbb {R}^{28 \times 28}\), our experimental setup induces the joint likelihood

$$\begin{aligned} p_{{\boldsymbol{\theta }}}({\textbf{x}}\mid {\textbf{z}}, f_{\boldsymbol{\theta }})&= \mathcal {N}({\boldsymbol{\mu }}_{{\boldsymbol{\theta }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}), {\boldsymbol{I}}{\boldsymbol{\tau }}) \\ p_{{\boldsymbol{\theta }}}({\textbf{x}}, {\textbf{z}}\mid f_{{\boldsymbol{\theta }}})&= p_{{\boldsymbol{\theta }}}({\textbf{x}}\mid {\textbf{z}}, f_{{\boldsymbol{\theta }}}) p_{{\boldsymbol{\theta }}}({\textbf{z}}\mid f_{{\boldsymbol{\theta }}}). \end{aligned}$$
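As a sketch, the Gaussian likelihood above corresponds to a Pyro sample statement of roughly the following form, where `decoder` is a hypothetical stand-in for applying the morphism \(f_{{\boldsymbol{\theta }}}\) to the latents and `tau` is the fixed observation-noise variance.

```python
import pyro
import pyro.distributions as dist
import torch

def likelihood(x, z, decoder, tau=0.1):
    """p_theta(x | z, f_theta) = Normal(mu_theta(z, f_theta), tau * I) over a 28x28 image."""
    mu = decoder(z)                        # mu_theta(z, f_theta), shape (28, 28)
    scale = torch.sqrt(torch.tensor(tau))  # standard deviation from the variance tau
    return pyro.sample("x", dist.Normal(mu, scale).to_event(2), obs=x)
```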

DisCoPyro provides amortized variational inference over its own random variables via neural proposals \(q_{{\boldsymbol{\phi }}}\) for the “confidence” \(\beta \sim q_{{\boldsymbol{\phi }}}(\beta \mid {\textbf{x}})\) and the “preferences” over generators \({\textbf{w}}\sim q_{{\boldsymbol{\phi }}}({\textbf{w}}\mid {\textbf{x}})\). Running the core DisCoPyro generative model over structures \(f_{\boldsymbol{\theta }}\) then gives a proposal over morphisms in the free operad, providing a generic proposal for DisCoPyro’s latent variables

$$\begin{aligned} q_{{\boldsymbol{\phi }}}(f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})&= p(f_{{\boldsymbol{\theta }}} \mid {\textbf{w}}, \beta ; \varPhi , \mathcal {S}) q_{{\boldsymbol{\phi }}}(\beta \mid {\textbf{x}}) q_{{\boldsymbol{\phi }}}({\textbf{w}}\mid {\textbf{x}}). \end{aligned}$$

Since the morphisms in our example application are components of deep generative models, each generating morphism can be simply “flipped on its head” to get a corresponding neural network design for a proposal. We specify that proposal as \(q_{{\boldsymbol{\phi }}}({\textbf{z}}\mid {\textbf{x}}, f_{{\boldsymbol{\theta }}})\); it constructs a faithful inverse [23] compositionally via a dagger functor (for further description of Bayesian inversion as a dagger functor, please see Fritz [8]). Our application then has a complete proposal density

$$\begin{aligned} q_{{\boldsymbol{\phi }}}({\textbf{z}}, f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S})&= q_{{\boldsymbol{\phi }}}({\textbf{z}}\mid f_{{\boldsymbol{\theta }}}, {\textbf{x}}) q_{{\boldsymbol{\phi }}}(f_{{\boldsymbol{\theta }}}, {\textbf{w}}, \beta \mid {\textbf{x}}; \varPhi , \mathcal {S}). \end{aligned}$$
(7)
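A guide reflecting the factorization in Eq. 7 might be sketched as follows; `encode_beta`, `encode_w`, `propose_structure`, and `invert` are hypothetical stand-ins for the amortization networks, the structure proposal run under the sampled \(({\textbf{w}}, \beta )\), and the dagger-functor inverse of \(f_{{\boldsymbol{\theta }}}\), respectively.

```python
import pyro

def guide(x, encode_beta, encode_w, propose_structure, invert):
    """Sketch of Eq. 7: q(z, f, w, beta | x) = q(z | f, x) p(f | w, beta) q(beta | x) q(w | x)."""
    beta = pyro.sample("beta", encode_beta(x))  # q_phi(beta | x), e.g. an amortized Gamma
    w = pyro.sample("w", encode_w(x))           # q_phi(w | x), e.g. an amortized Dirichlet
    f = propose_structure(w, beta)              # reruns the structure prior p(f | w, beta)
    z = pyro.sample("z", invert(f, x))          # q_phi(z | f, x): the faithful inverse of f
    return z, f, w, beta
```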

3.2 Experimental Results and Performance Comparison

Table 1 compares our free operad model’s performance to other structured deep generative models. We report the estimated log model evidence. Our free operad prior over deep generative models achieves the best log-evidence per data dimension, although standard deviations for the baselines do not appear to be available for comparison. Some of the older baselines, such as the sequential attention model and the variational homoencoder, fix a composition structure ahead of time instead of learning it from data as we do. Figure 2 shows samples from the trained model’s posterior distribution, including reconstruction of evaluation data (Fig. 2a) and an example structure for that data (Fig. 2b).

Table 1. Average log-evidence on the Omniglot evaluation set across models. Our free operad model obtains the highest log-evidence per data dimension.

Historically, Lake [14] proposed the Omniglot dataset to challenge the machine learning community to achieve human-like concept learning by learning a single generative model from very few examples; the Omniglot challenge requires that a model be usable for classification, latent feature recognition, concept generation from a type, and exemplar generation of a concept. The deep generative modeling community has focused on producing models capable of few-shot reconstruction of unseen characters. The models of [11, 19] fixed their architectures as constant, attempting to account for the compositional structure in the data with a static latent dimensionality. In contrast, [5, 9] performed joint structure learning, latent-variable inference, and data reconstruction, as we do.

Fig. 2. Reconstructions (left) generated by inference in the diagrammatic generative model (right) on handwritten characters in the Omniglot evaluation set. The string diagram shows a model that generates a glimpse, decodes it into an image canvas via a variational ladder decoder, and then performs a simpler process to generate another glimpse and insert it into the canvas.

4 Discussion

This paper described the DisCoPyro system for generative Bayesian structure learning, along with its variational inference training procedures and an example application. Section 2 described DisCoPyro’s mathematical foundations in category theory, operad theory, and variational Bayesian inference. Section 3 showed DisCoPyro to be competitive against other models on a challenge dataset.

As Lake [15] suggested, (deep) probabilistic programs can model human intelligence across more domains than handwritten characters. Beyond programs, neural network architectures, or triangulable manifolds, investigators have applied operads and SMCs to chemical reaction networks, natural language processing, and the systematicity of human intelligence [3, 18]. This broad variety of applications motivates our interest in representing the problems a generally intelligent agent must solve in terms of operadic structures, and learning those structures jointly with their contents from data.