1 Introduction

The problem of how to analyze neural network models has been an increasingly important focus as the efficacy of deep networks continues to be demonstrated in many applications. This is a unique aspect of solving problems using structured feedback systems such as machine learning models, where the designer’s task is to design a system that will exhibit certain properties upon being restructured by feedback according to update rules (e.g. stochastic gradient descent), as opposed to designing a solution via a robust understanding of relevant first principles.

Analyzing a very high-dimensional model evokes the classic problem of dimensionality reduction, or determining how to simplify a model with potentially millions of variables into something meaningful to a human being. The choice of what aspects of the model to focus on and how to bring meaningful information to the forefront is non-trivial in this case. One would ascertain very different information about the network from a technique that focuses on analyzing which regions of the input are most significant to the output, such as Relevance Propagation [1], than by using an approach focused on evaluating model dimensionality and latent similarity, such as SVCCA [2].

In this paper we introduce Neurodynamical Agglomerative Analysis (NAA), an analysis model that takes as input a collection of network activations sorted according to input classes, and outputs a hierarchy of network clusters that gives insight into the class relationships in the hidden representation. The goal of this analysis pipeline is to allow designers to transform a collection of seemingly random network activations into a hierarchical graphical model that gives insight into how the network models classes with respect to one another, and which classes are best differentiated by the model. This way of analyzing neural networks opens the door to new ways of considering issues such as class imbalance and error analysis in the context of training. Further, since most supervised deep neural network models to date have not been developed with any consideration for higher-order class relationships, a computationally efficient and intuitive way to analyze the implied class relationships naturally raises the question of how one would enforce arbitrary relationships in machine learning methods, thus introducing a key aspect of top-down reasoning into bottom-up models.

The proposed model considers the cross-correlation matrix of network activations for a given class as a lower dimensional encoding of the neural relationships that have been developed at a particular layer. We then take the generalized cosine of the correlation matrices as a similarity metric between the class representations and use the matrix defined by these class similarities in order to compare network representations using both agglomerative clustering and the generalized cosine between the similarity matrices themselves. Training experiments with the MNIST [3] dataset show that class relationships exhibit invariance to random initialization, tending toward a particular state as training progresses. Underparameterized networks exhibit dynamic relationships with higher class similarities that never reach a steady state. Additionally, NAA was used to derive class hierarchies based on the ImageNet [4] dataset which were compared to the WordNet [5] ground-truth ontology in order to consider how this notion of similarity relates to a human-made one.

2 Related Work

As the interest in neural networks has increased over the last decade, analysis techniques for deep networks have become an increasing focus in the field. In [1], Bach et al. develop Relevance Propagation, a method to visualize the contributions single pixels make to output predictions by calculating the extent to which an individual pixel impacted the model in choosing or not choosing a particular output class. This analysis method visualizes the saliency of input pixels via a heat map that enables a human to view which regions of an image contributed most to the output.

In [6], Li et al. explored to what extent invariant features are learned from different random initializations by computing the within-net and between-net correlation of network activations. They then inferred corresponding feature maps by sorting the feature maps according to their between-net correlation values, as well as looking for higher level correspondences among features via spectral clustering on within-net correlation matrices. Their model collects a matrix composed of vectorized activations across the entire dataset from each layer and treats each layer of each input instance as a separate feature in the analysis model, thus it effectively compares similar and dissimilar features at the level of individual feature maps of individual input instances. This can be a prohibitively high level of granularity in the case of deep neural networks, as insight still must be gained via the inspection of individual features. In contrast, NAA generates a correlation matrix of activations across a particular class and compares averaged class representations, as well as performing agglomerative clustering on output classes in a bottom-up fashion. In essence, we trade feature-level granularity for hierarchy-level granularity with respect to the class structure.

Another recent work using a similar approach is SVCCA [2], which forms matrices of network activations across a dataset and then performs canonical correlation analysis [7] on filtered approximations of the activation matrices via singular value decomposition. Raghu et al. show relationships between SVCCA and network parameterization that are used to develop networks that use the parameter space more efficiently.

Our method differs from SVCCA in a similar way as it does to [1] and [6]. SVCCA compares representations across architectures for a given dataset by evaluating the correlation values of maximally-correlated subspaces of features across network representations. In contrast, we compare the network to itself using the generalized cosine of correlation matrices across classes at a given layer to evaluate how well-oriented the correlation matrices are with respect to one another, and therefore how similar the class representations are. Unlike SVCCA, our method is only invariant to global rotations in the feature subspace. Intuitively, this means that our method does not allow for flexible neural correspondences in its notion of representational similarity.

The idea to use the generalized cosine of the cross-correlation matrices was inspired by [8], in which Jaeger similarly uses the generalized cosine between cross-correlation matrices as a similarity metric for Excited Network Dynamics in Echo State Networks. Jaeger shows that, in some cases, evaluating the similarity of the conceptor matrices he defines produces a better correspondence to a human notion of similarity than do the raw matrices.

3 Methods

3.1 Dimensionality Reduction for Analysis

Dimensionality reduction is often brought up in the context of computational feasibility, but many scientific methods can be viewed in terms of dimensionality reduction. For example, representing a collection of data as a probability distribution could be seen as representing a vast collection of data points by a few parameters.

The dimensionality reduction approach used by NAA is to take the cross-correlation matrix of network activations, which is treated here as a low-dimensional representation of the network “dynamics”. Consider a set of neural activation vectors specific to class c in layer l, \(Z_{c,l} = \{\mathbf {z_1}, \ldots , \mathbf {z_m}\}\), where \(\mathbf {z_k}\) represents the activations on the \(k^{th}\) datapoint and has dimensionality n, the number of output neurons at the given layer. The cross-correlation matrix R is defined according to

$$\begin{aligned} r_{ij} = E[z_i z_j] \end{aligned}$$
(1)

where, in this case, \(z_i\) and \(z_j\) represent the \(i^{th}\) and \(j^{th}\) (scalar) activations in a given vector, respectively, and \(E[z_iz_j]\) is the expectation over the product of scalar neural activations \(z_i\) and \(z_j\). The cross-correlation matrix is a way of assessing the “neurodynamics” of the network. The term is used here by analogy: although there is no time variable in the convolutional networks under consideration, we treat the way neurons vary together across datapoints as a low-dimensional representation of the network’s class representations. The analysis pipeline can also be applied to any data composed of vectors of network activations. Stacking the m activation vectors as the columns of an \(n \times m\) matrix Z, the cross-correlation matrix can be estimated for a dataset according to

$$\begin{aligned} R = \frac{1}{m}ZZ^T. \end{aligned}$$
(2)
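
As an illustration of the estimate in (2), the following is a minimal NumPy sketch that groups activation vectors by class label and forms one matrix per class. The function name `class_correlation_matrices` and the row-per-datapoint input layout are assumptions made for this example, and the normalization divides by the number of datapoints in the class; any constant factor cancels in the similarity metric introduced in Sect. 3.2.

```python
import numpy as np

def class_correlation_matrices(activations, labels):
    """Estimate one cross-correlation matrix per class (hypothetical helper).

    activations: array of shape (num_samples, n), one activation vector per row
    labels:      integer array of shape (num_samples,)
    Returns a dict mapping class label -> (n, n) matrix.
    """
    matrices = {}
    for c in np.unique(labels):
        Z = activations[labels == c].T   # shape (n, m): columns are datapoints
        m = Z.shape[1]                   # number of datapoints in class c
        matrices[int(c)] = Z @ Z.T / m   # sample estimate of E[z_i z_j]
    return matrices

# Tiny example: two datapoints in class 0, one in class 1, layer width n = 2.
acts = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]])
labs = np.array([0, 0, 1])
R = class_correlation_matrices(acts, labs)
```

Note that the activations are not mean-centered, in keeping with the definition in (1).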

3.2 Generalized Cosine for Assessing Neurodynamical Similarity

Definition. Singular Value Decomposition (SVD) is a common tool that forms the basis of dimensionality reduction techniques such as Principal Component Analysis, and is also used for applications such as solving homogeneous linear equations. The Singular Value Decomposition of a real, symmetric, positive semidefinite matrix R, such as the cross-correlation matrix in (2), is defined as

$$\begin{aligned} SVD(R) = U \varSigma U^T \end{aligned}$$
(3)

where U is a unitary matrix composed of the eigenvectors of R, and \(\varSigma \) is a diagonal matrix composed of the eigenvalues of R. This decomposition defines a hyperellipsoid, with axes oriented according to the unitary basis defined by U and radii corresponding to the diagonal entries of \(\varSigma \). A generalized cosine [8] between two such hyperellipsoids is defined according to

$$\begin{aligned} cos^2(R_i, R_j) = \frac{\bigg \Vert \varSigma _i^{1/2}U_i^TU_j\varSigma _j^{1/2}\bigg \Vert ^2}{\bigg \Vert diag\bigg \{ \varSigma _i\bigg \}\bigg \Vert \bigg \Vert diag\bigg \{ \varSigma _j\bigg \}\bigg \Vert } \end{aligned}$$
(4)

where \(diag\{\varSigma \}\) is the vectorized diagonal entries of \(\varSigma \) and \(\Vert \varvec{\cdot }\Vert \) represents the Frobenius norm, or the \(L^2\) vector norm. The relationship in (4) defines a similarity metric between two matrices and in this case is used for comparing two class representations of a given layer of a neural network.

Computational Optimization and Implementation. Calculating the generalized cosine defined in (4) requires performing singular value decomposition on the matrix, which would require \(\mathcal {O}(n^3)\) computation time, where n is the dimensionality of the square matrix, R, or dimensionality of the layer of interest in our case. However, an equivalent relationship can be derived that avoids the SVD step by using the trace of matrix products, resulting in the equivalent formula

$$\begin{aligned} cos^2(R_i, R_j) = \frac{Tr\bigg \{ R_i R_j\bigg \}}{\sqrt{Tr\bigg \{ R_i R_i\bigg \}Tr\bigg \{ R_j R_j\bigg \}}} \end{aligned}$$
(5)

which brings the computational cost of calculating the similarity metric down to \(\mathcal {O}(n^2)\), since only the n diagonal elements of each matrix product are needed for the trace, and each of them is an inner product of length n.
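
The equivalence of (4) and (5) can be checked numerically. The sketch below is illustrative only: the function names `cos2_svd` and `cos2_trace` are hypothetical, the test matrices are random, and the decomposition is computed with `numpy.linalg.eigh`, which coincides with the SVD for symmetric positive semidefinite matrices.

```python
import numpy as np

def cos2_svd(Ri, Rj):
    """Generalized cosine via Eq. (4); needs a decomposition, O(n^3)."""
    si, Ui = np.linalg.eigh(Ri)
    sj, Uj = np.linalg.eigh(Rj)
    si = np.clip(si, 0.0, None)   # guard against tiny negative round-off
    sj = np.clip(sj, 0.0, None)
    M = np.diag(np.sqrt(si)) @ Ui.T @ Uj @ np.diag(np.sqrt(sj))
    return np.linalg.norm(M, "fro") ** 2 / (np.linalg.norm(si) * np.linalg.norm(sj))

def cos2_trace(Ri, Rj):
    """Generalized cosine via Eq. (5); elementwise products only, O(n^2)."""
    num = np.sum(Ri * Rj)   # equals Tr{Ri Rj} because Rj is symmetric
    return num / np.sqrt(np.sum(Ri * Ri) * np.sum(Rj * Rj))

# Random symmetric positive semidefinite test matrices (assumed data).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))
B = rng.standard_normal((5, 20))
Ri, Rj = A @ A.T / 20, B @ B.T / 20
print(np.isclose(cos2_svd(Ri, Rj), cos2_trace(Ri, Rj)))
```

The trace form never builds the product \(R_iR_j\) explicitly; it sums the elementwise product of the two matrices, which is where the saving comes from.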

Proof

The numerator of (4) can be simplified utilizing the following properties of the trace product and Frobenius norm

$$\begin{aligned} \bigg \Vert A\bigg \Vert ^2 = Tr\bigg \{ A^TA\bigg \} \end{aligned}$$
(6)
$$\begin{aligned} Tr\bigg \{ AB\bigg \} = Tr\bigg \{ BA\bigg \} \end{aligned}$$
(7)

along with the property that

$$\begin{aligned} U^T= U^{-1} \end{aligned}$$
(8)

for any unitary matrix U. In order to derive the relationship in (5) we exploit the fact that the trace of matrix products is invariant with respect to cycling the matrices being multiplied, in accordance with (7), as well as the fact that \(UU^T = I\), the identity matrix. Thus, using the relationship in Eq. (6), the numerator of (4) can be rewritten as follows:

$$\begin{aligned} \bigg \Vert \varSigma _i^{1/2}U_i^TU_j\varSigma _j^{1/2}\bigg \Vert ^2&= Tr\bigg \{ \big (\varSigma _i^{1/2}U_i^TU_j\varSigma _j^{1/2}\big )^T\varSigma _i^{1/2}U_i^TU_j\varSigma _j^{1/2}\bigg \} \\&= Tr\bigg \{ \big (U_i\varSigma _iU_i^T\big )\big (U_j\varSigma _jU_j^T\big )\bigg \} \\&= Tr\bigg \{ R_iR_j\bigg \} \end{aligned}$$

Similarly, since \(\varSigma \) is a diagonal matrix, the factors in the denominator can be expressed as

$$\begin{aligned} \bigg \Vert diag\bigg \{ \varSigma _i\bigg \}\bigg \Vert ^2&= \bigg \Vert \varSigma _i\bigg \Vert ^2 \\&= Tr \bigg \{ \varSigma _i^T \varSigma _i\bigg \} \\&= Tr \bigg \{ U_i\varSigma _iU_i^T U_i\varSigma _i U_i^T\bigg \} \\&= Tr \bigg \{ R_i R_i\bigg \} \end{aligned}$$

and so a similarity metric equivalent to the generalized cosine defined in (4) can be formed according to (5) which brings the computational complexity of our metric from \(\mathcal {O}(n^3)\) to \(\mathcal {O}(n^2)\). The similarity defined in (5) is then used to construct a distance matrix, D, according to

$$\begin{aligned} d_{ij} = 1 - s_{ij} \end{aligned}$$
(9)

where the indices i and j refer to classes i and j, respectively. The class similarities \(s_{ij}=cos^2(R_i, R_j)\) are used to build a similarity matrix, S, which is used to compare different network representations. Thus, D defines the distance matrix which can be used to generate a hierarchy of network classes via agglomerative clustering, as in Sect. 4.1, or the similarity matrices can be used to compare entire representations by taking the generalized cosine between the similarity matrices calculated from different models, as in Sect. 4.2.
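
The construction of S and D from a stack of class correlation matrices can be sketched as follows. This is a minimal example under stated assumptions: the function name `similarity_and_distance` and the `(num_classes, n, n)` input layout are hypothetical, and the input matrices are random stand-ins for real class correlation matrices.

```python
import numpy as np

def similarity_and_distance(corr_mats):
    """Build S (Eq. 5) and D (Eq. 9) for all class pairs (hypothetical helper).

    corr_mats: array of shape (num_classes, n, n), one symmetric matrix per class.
    """
    flat = corr_mats.reshape(len(corr_mats), -1)   # vectorize each R_c
    gram = flat @ flat.T                           # gram[i, j] = Tr{R_i R_j}
    norms = np.sqrt(np.diag(gram))                 # sqrt(Tr{R_i R_i})
    S = gram / np.outer(norms, norms)              # class similarities s_ij
    D = 1.0 - S                                    # class distances d_ij
    np.fill_diagonal(D, 0.0)                       # remove round-off on the diagonal
    return S, D

# Three random positive semidefinite matrices standing in for class matrices.
rng = np.random.default_rng(1)
mats = np.stack([
    (lambda A: A @ A.T / 10)(rng.standard_normal((4, 10))) for _ in range(3)
])
S, D = similarity_and_distance(mats)
```

Vectorizing each matrix first means all pairwise traces come from a single matrix product, which is convenient when the number of classes is large.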

3.3 Agglomerative Clustering

Agglomerative clustering is a hierarchical clustering method that merges clusters one pair at a time in a greedy fashion until only one supercluster remains. In this case we are concerned with a sort of mean representation of the clusters formed by similar classes, so we chose the average linkage rule, which defines the distance between two clusters u and v according to [9]

$$\begin{aligned} d(u,v) = \frac{1}{|u|\,|v|} \sum _{i,j} d(u[i], v[j]) \end{aligned}$$
(10)

where |u| and |v| represent the cardinalities of clusters u and v. Linkage calculations were done using the scikit-learn library [10]. Neural network training and data collection were done using TensorFlow 2.0, with data processing performed in Python 3.6.
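
For readers who want to reproduce the merging rule in (10), the sketch below applies average linkage to a precomputed distance matrix. It uses SciPy's `scipy.cluster.hierarchy.linkage` rather than the scikit-learn routines used for the experiments, purely as an assumption of convenience, and the four-class distance matrix is made-up toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def average_linkage(D):
    """Average-linkage hierarchy from a symmetric distance matrix with zero diagonal."""
    condensed = squareform(D, checks=False)   # upper triangle as a flat vector
    return linkage(condensed, method="average")

# Toy example with four classes: 0 and 1 are close, 2 and 3 are close.
D = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])
merges = average_linkage(D)
# Each row of `merges` is (cluster a, cluster b, distance, size of merged cluster).
```

In this toy case the final merge joins {0, 1} with {2, 3} at distance 0.8, the mean of the four cross-cluster distances 0.8, 0.9, 0.7 and 0.8, as (10) prescribes.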

4 Experimental Results and Discussion

Fig. 1. Method Overview: This flow diagram represents the analysis methods used in Sects. 4.1 and 4.2. In both experiments, network activations from hidden layers are sorted by class membership to calculate correlation matrices. Class similarities are then calculated using the generalized cosine between the correlation matrices and converted to distances by subtracting from one. In Sect. 4.1, agglomerative clustering is performed on the distance matrix, resulting in a class hierarchy implied by the hidden representation at a given layer. In Sect. 4.2, different network representations are compared by performing the generalized cosine metric on the similarity matrices themselves.

4.1 Comparing Network Models with the WordNet Ontology

In order to compare some trained hierarchies with a more human-minded view of the world, we performed NAA on the VGG16 [11], Resnet50 [12], and InceptionV3 [13] architectures pretrained on the ImageNet [4] benchmark dataset, the class structure of which is conveniently based on the WordNet [5] ontology. By treating the resulting clusters as a sort of classification problem, we can get an idea of the similarity with human-made ontologies. That is, for each supercluster in the WordNet subtree, we find the closest matching cluster in the agglomerative hierarchy by searching for the cluster with the highest F1 score. The results can be seen in Fig. 2. The trees are compared in this way because of limitations of standard graph distance measures, which generally assume a well-defined superclass structure. In order to show that these correspondences are not merely a matter of random chance, the results were compared with the clustering obtained when the distance matrix was randomly permuted. This randomized baseline keeps the distribution of distance values identical to the experimental distribution while removing any class-specific structure.
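
The matching step described above can be sketched as follows. This is an illustrative reading rather than the exact procedure used for the experiments: it assumes the hierarchy is stored as a SciPy-style linkage array, it takes the best match to be the cluster with the highest F1 score, and the helper names and the toy hierarchy are hypothetical.

```python
import numpy as np

def hierarchy_clusters(merges):
    """Leaf-index set of every merged cluster in a SciPy-style linkage array."""
    n = merges.shape[0] + 1                      # number of leaves (classes)
    members = [frozenset([i]) for i in range(n)]
    for a, b, _, _ in merges:
        members.append(members[int(a)] | members[int(b)])
    return members[n:]                           # merged clusters only

def f1(pred, truth):
    """F1 score of a predicted class set against a ground-truth class set."""
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def best_match(truth, clusters):
    """Cluster with the highest F1 score against `truth`, and that score."""
    scores = [f1(c, truth) for c in clusters]
    k = int(np.argmax(scores))
    return clusters[k], scores[k]

# Toy hierarchy over four classes: {0,1}, {2,3}, then all four together.
merges = np.array([
    [0, 1, 0.1, 2],
    [2, 3, 0.2, 2],
    [4, 5, 0.8, 4],
], dtype=float)
superclass = frozenset({0, 1, 2})                # made-up ground-truth grouping
cluster, score = best_match(superclass, hierarchy_clusters(merges))
```

Here the full cluster {0, 1, 2, 3} wins with an F1 score of 6/7, ahead of {0, 1} at 0.8, because it recovers all three members at the cost of one false positive.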

Fig. 2. Results from performing agglomerative clustering on the fully connected layers of the VGG16 architecture trained on the ImageNet dataset; 2(a) normalized histogram with the best fit distribution of resulting similarities from both fully connected layers of the VGG16 architecture; 2(b) normalized histogram with the best fit distribution resulting from clustering the permuted distance matrix

Table 1. Top-10 F1 score results from searching NAA hierarchy of the final fully connected layer in VGG16, Resnet50, and InceptionV3 for best match to each superclass in the WordNet ground-truth hierarchy; note the significant overlap in groupings between VGG16 and Resnet50

Figure 2(a) shows the distribution of F1 scores resulting from the fully connected layers of VGG16 and Fig. 2(b) shows the distribution resulting from the permuted distance matrix. Only classes from the WordNet ontology with at least 15 true class members were included in the analysis in order to eliminate bias introduced by including groups with low member counts.

The clusters with the top-10 and bottom-10 F1 scores for the final hidden layers of all three architectures can be seen in Tables 1 and 2, respectively. The significant overlap in both top and bottom scoring clusters in the VGG16 and Resnet50 architectures may indicate an invariance even across architectures, although the significantly different InceptionV3 results indicate that this would not hold across all architectures.

Table 2. Bottom-10 F1 score results, as in Table 1

4.2 MNIST Training Experiments

In order to evaluate how the class relationships evolve as a model trains, generic CNNs were trained to classify the MNIST dataset [3]. The number of convolutional layers was varied from 1 to 2, with channel depth varying from 2 to 8. The number of fully connected layers was set at 2, each layer having between 2 and 16 nodes. ReLU activations were used for all hidden layers. Distances relative to digit 6 for two example networks can be seen in Figs. 3(a) and (b). The minimum distance across all digits, which appears to be a good indicator of validation accuracy, is plotted in Figs. 3(c) and (d), with accuracy and loss curves included for comparison. The heat maps indicate training epoch, with the darkest line corresponding to epoch 100.

As can be seen by comparing the plots in Fig. 3, the relationships in the underparameterized network are relatively unstable, varying significantly across training epochs compared with the relational stability exhibited in the sufficiently parameterized network in Fig. 3(b). In general, sufficiently parameterized networks showed stable class relationships and increased minimum distances, while underparameterized network relationships oscillated throughout training and often reflected a lack of effective delineation between classes.

Fig. 3. NAA results on the final hidden layer across training for CNNs with 1–2 convolutional layers and two fully connected layers with varying degrees of parameterization; 3(a) and 3(b) class distances relative to digit 6 plotted across training for 2 sample networks, parameterization increasing from left to right (network parameterization shown above plot); 3(c) and 3(d) minimum distances across all digits for the respective networks for the same layer, with accuracy and loss curves plotted in red and blue for reference (Color figure online)

Fig. 4. Evaluating invariance across training experiments - 4(a) average cross-network similarity of final hidden layers; all networks were composed of 2 convolutional layers and 2 fully connected layers, with increasing parameterizations. The legend specifies the total convolutional filters plus fully connected nodes prior to the classification layer; 4(b) average validation accuracy across the 10 training experiments for each architecture; comparing 4(a) and 4(b) shows that as validation accuracy increases, so does invariance across training experiments.

Comparing Representations Across Networks. One useful component of other analysis methods that has not yet been addressed is the ability to compare across networks. As mentioned previously, our method is invariant only to global rotations in the feature subspace. Applying the generalized cosine directly to the R matrices defined in (2) in order to compare across networks would therefore require a step to bring the R matrices to the same dimensionality, and would likely be a poor way to evaluate relationships since it does not allow for flexible neural correspondences. However, we can compare representations in different layers by calculating the generalized cosine between the similarity matrices themselves, each being a symmetric matrix in \(\mathbb {R}^{n \times n}\), where n is here the dimensionality of the classification layer, i.e. the number of classes. We therefore define the similarity between two network representations i and j at layers \(l_i\) and \(l_j\) analogously to (5), using the cosine between \(S_{l_i}\) and \(S_{l_j}\), where \(S_{l_i}\) is the class similarity matrix at layer \(l_i\) of network i and \(S_{l_j}\) is defined similarly for network j.
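
Written out in the same notation, and assuming the form of (5) carries over unchanged with the class similarity matrices in place of the correlation matrices, this comparison reads

$$\begin{aligned} cos^2(S_{l_i}, S_{l_j}) = \frac{Tr\bigg \{ S_{l_i} S_{l_j}\bigg \}}{\sqrt{Tr\bigg \{ S_{l_i} S_{l_i}\bigg \}Tr\bigg \{ S_{l_j} S_{l_j}\bigg \}}} \end{aligned}$$

Because both matrices are indexed by class rather than by neuron, they have the same size regardless of the width of the layers they were computed from.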

Figure 4 shows the results from comparing network representations across training experiments. The figure shows experiments in which the number of parameters in networks consisting of one to two convolutional layers and two fully connected layers was gradually increased in order to evaluate cross-training invariance as the level of parameterization increases. Each configuration was used to train 10 different models with the same architecture, which were then compared as described above. Figure 4(a) shows the average similarity of the final fully connected layers at different points in training, while Fig. 4(b) shows the average validation accuracy at each epoch. Comparing the validation accuracy and average network similarity shows that the relationships appear to be invariant to training initialization in sufficiently parameterized networks.

5 Summary

In this work, we introduced Neurodynamical Agglomerative Analysis, a hierarchical analysis pipeline that takes as input a subset of network activations sorted by class and layer, and outputs an agglomerative hierarchy of output classes based on the generalized cosine of cross-correlation matrices as a similarity metric.

Using the ImageNet benchmark dataset, we showed the extent to which this similarity corresponds to a human-made view of the world in some architectures, likely corresponding to the extent to which class delineation relates to textural similarity. Further, by comparing the resulting hierarchies across different well-known architectures, we showed that some relationships in NAA may even be invariant across some architectures, based on similarities between results for the VGG16 and Resnet50 architectures, although this invariance is not global, since significant differences were apparent in the InceptionV3 architecture.

Using MNIST training experiments, we also showed how some class relationships in NAA evolve over training in networks with varying degrees of parameterization. Preliminary analysis shows that underparameterized networks exhibit more erratic relationships between classes, while overparameterized networks tend toward more or less steady-state relationships over time. This steady-state relationship also appears to be relatively invariant to training initializations, showing that this type of analysis may be a step in the direction of uncovering properties of the network that are invariant across training experiments. Future work with these insights will include further exploration into how invariant these relationships are, as well as how the relationships in the correlation space relate to class-delineated error analysis. Additionally, in order to add another dimension of experimental variation, methods of embedding arbitrary hierarchies into the correlation space will be explored.