Abstract
We propose a new analysis technique for neural networks, Neurodynamical Agglomerative Analysis (NAA), an analysis pipeline designed to compare class representations within a given neural network model. The proposed pipeline results in a hierarchy of class relationships implied by the network representation, i.e. a semantic hierarchy analogous to a human-made ontological view of the relevant classes. We use networks pretrained on the ImageNet benchmark dataset to infer semantic hierarchies and show the similarity to human-made semantic hierarchies by comparing them with the WordNet ontology. Further, we show using MNIST training experiments that class relationships extracted using NAA appear to be invariant to random weight initializations, tending toward equivalent class relationships across network initializations in sufficiently parameterized networks.
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - GRK 2340.
1 Introduction
The problem of how to analyze neural network models has been an increasingly important focus as the efficacy of deep networks continues to be demonstrated in many applications. This is a unique aspect of solving problems using structured feedback systems such as machine learning models, where the designer’s task is to design a system that will exhibit certain properties upon being restructured by feedback according to update rules (e.g. stochastic gradient descent), as opposed to designing a solution via a robust understanding of relevant first principles.
Analyzing a very high-dimensional model evokes the classic problem of dimensionality reduction, or determining how to simplify a model with potentially millions of variables into something meaningful to a human being. The choice of what aspects of the model to focus on and how to bring meaningful information to the forefront is non-trivial in this case. One would ascertain very different information about the network from a technique that focuses on analyzing which regions of the input are most significant to the output, such as Relevance Propagation [1], than by using an approach focused on evaluating model dimensionality and latent similarity, such as SVCCA [2].
In this paper we introduce Neurodynamical Agglomerative Analysis (NAA), an analysis model that takes as input a collection of network activations sorted according to input classes, and outputs a hierarchy of network clusters that gives insight into the class relationships in the hidden representation. The goal of this analysis pipeline is to allow designers to transform a collection of seemingly random network activations into a hierarchical graphical model that gives insight into how the network models classes with respect to one another, and which classes are best differentiated by the model. This way of analyzing neural networks opens the door to new ways of considering issues such as class imbalance and error analysis in the context of training. Further, since most supervised deep neural network models have so far been developed without any consideration for higher-order class relationships, a computationally efficient and intuitive way to analyze the implied class relationships naturally raises the question of how one would approach enforcing arbitrary relationships in machine learning methods, thus introducing a key aspect of top-down reasoning into bottom-up models.
The proposed model considers the cross-correlation matrix of network activations for a given class as a lower dimensional encoding of the neural relationships that have been developed at a particular layer. We then take the generalized cosine of the correlation matrices as a similarity metric between the class representations and use the matrix defined by these class similarities in order to compare network representations using both agglomerative clustering and the generalized cosine between the similarity matrices themselves. Training experiments with the MNIST [3] dataset show that class relationships exhibit invariance to random initialization, tending toward a particular state as training progresses. Underparameterized networks exhibit dynamic relationships with higher class similarities that never reach a steady state. Additionally, NAA was used to derive class hierarchies based on the ImageNet [4] dataset which were compared to the WordNet [5] ground-truth ontology in order to consider how this notion of similarity relates to a human-made one.
2 Related Work
As the interest in neural networks has increased over the last decade, analysis techniques for deep networks have become an increasing focus in the field. In [1], Bach et al. develop Relevance Propagation, a method to visualize the contributions single pixels make to output predictions by calculating the extent to which an individual pixel impacted the model in choosing or not choosing a particular output class. This analysis method visualizes the saliency of input pixels via a heat map that enables a human to view which regions of an image contributed most to the output.
In [6], Li et al. explored to what extent invariant features are learned from different random initializations by computing the within-net and between-net correlation of network activations. They then inferred corresponding feature maps by sorting the feature maps according to their between-net correlation values, as well as looking for higher level correspondences among features via spectral clustering on within-net correlation matrices. Their model collects a matrix composed of vectorized activations across the entire dataset from each layer and treats each layer of each input instance as a separate feature in the analysis model, thus it effectively compares similar and dissimilar features at the level of individual feature maps of individual input instances. This can be a prohibitively high level of granularity in the case of deep neural networks, as insight still must be gained via the inspection of individual features. In contrast, NAA generates a correlation matrix of activations across a particular class and compares averaged class representations, as well as performing agglomerative clustering on output classes in a bottom-up fashion. In essence, we trade feature-level granularity for hierarchy-level granularity with respect to the class structure.
Another recent work using a similar approach is SVCCA [2], which forms matrices of network activations across a dataset and then performs canonical correlation analysis [7] on filtered approximations of the activation matrices via singular value decomposition. Raghu et al. show relationships between SVCCA and network parameterization that are used to develop networks that use the parameter space more efficiently.
Our method differs from SVCCA in much the same way as it differs from [1] and [6]. SVCCA compares representations across architectures for a given dataset by evaluating the correlation values of maximally-correlated subspaces of features across network representations. In contrast, we compare the network to itself using the generalized cosine of correlation matrices across classes at a given layer to evaluate how well-oriented the correlation matrices are with respect to one another, and therefore how similar the class representations are. Unlike SVCCA, our method is only invariant to global rotations in the feature subspace. Intuitively, this means that our method does not allow for flexible neural correspondences in its notion of representational similarity.
The idea to use the generalized cosine of the cross-correlation matrices was inspired by [8], in which Jaeger similarly uses the generalized cosine between cross-correlation matrices as a similarity metric for Excited Network Dynamics in Echo State Networks. Jaeger shows that, in some cases, evaluating the similarity of the conceptor matrices he defines produces a better correspondence to a human notion of similarity than do the raw matrices.
3 Methods
3.1 Dimensionality Reduction for Analysis
Dimensionality reduction is often brought up in the context of computational feasibility, but many scientific methods can be viewed in terms of dimensionality reduction. For example, representing a collection of data as a probability distribution could be seen as representing a vast collection of data points by a few parameters.
NAA reduces dimensionality by taking the cross-correlation matrix of network activations, which is treated here as a low-dimensional representation of the network “dynamics”. Consider a set of neural activation vectors specific to class c in layer l, \(Z_{c,l} = \{\mathbf {z_1}, \ldots , \mathbf {z_m}\}\), where \(\mathbf {z_k}\) represents the activations on the \(k^{th}\) datapoint and has dimensionality n, the number of output neurons at the given layer. The cross-correlation matrix R is defined according to
\[ R_{ij} = E[z_i z_j] \tag{1} \]
where, in this case, \(z_i\) and \(z_j\) represent the \(i^{th}\) and \(j^{th}\) (scalar) activations in a given vector, respectively, and \(E[z_iz_j]\) is the expectation over the product of scalar neural activations \(z_i\) and \(z_j\). The cross-correlation matrix is a way of assessing the “neurodynamics” of the network. The term is used here in an analogous sense: although there is no time variable in the convolutional networks under consideration per se, we use the way neurons would vary together in a dynamical sense as a low-dimensional representation of the network’s class representations. The analysis pipeline can also be applied generally to any data composed of vectors of network activations. The cross-correlation matrix can be estimated for a dataset according to
\[ \hat{R} = \frac{1}{m} \sum_{k=1}^{m} \mathbf {z_k} \mathbf {z_k}^T \tag{2} \]
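As a minimal sketch of this estimation step, assuming the class-specific activations are stacked row-wise into an array of shape (m, n), the estimate is a single matrix product. The function name `class_correlation` and the random stand-in data are illustrative assumptions, not part of the original pipeline. Note that no mean is subtracted, since the quantity is a cross-correlation rather than a covariance.

```python
import numpy as np

def class_correlation(Z):
    """Estimate R = (1/m) * sum_k z_k z_k^T from an (m, n) activation array."""
    Z = np.asarray(Z, dtype=float)
    m = Z.shape[0]          # number of datapoints of this class
    return Z.T @ Z / m      # (n, n) symmetric, positive semi-definite

# Hypothetical stand-in for layer activations: 200 samples, 8 neurons.
rng = np.random.default_rng(0)
Z_c = rng.standard_normal((200, 8))
R_c = class_correlation(Z_c)
```

The result has one row and column per neuron in the layer, regardless of how many datapoints the class contains.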
3.2 Generalized Cosine for Assessing Neurodynamical Similarity
Definition. Singular Value Decomposition (SVD) is a common tool that forms the basis of dimensionality reduction techniques such as Principal Component Analysis, and is also used for applications such as solving homogeneous linear equations. The Singular Value Decomposition of a real, symmetric matrix R is defined as
\[ R = U \varSigma U^T \tag{3} \]
where U is a unitary matrix composed of the eigenvectors of R, and \(\varSigma \) is a diagonal matrix composed of the eigenvalues of R. This decomposition defines a hyperellipsoid, with axes oriented according to the unitary basis defined by U and radii corresponding to the diagonal entries of \(\varSigma \). A generalized cosine [8] between two such hyperellipsoids is defined according to
\[ \cos^2(R_i, R_j) = \frac{\Vert \varSigma_i^{1/2} U_i^T U_j \varSigma_j^{1/2} \Vert^2}{\Vert diag\{\varSigma_i\} \Vert \, \Vert diag\{\varSigma_j\} \Vert} \tag{4} \]
where \(diag\{\varSigma \}\) is the vectorized diagonal entries of \(\varSigma \) and \(\Vert \varvec{\cdot }\Vert \) represents the Frobenius norm, or the \(L^2\) vector norm. The relationship in (4) defines a similarity metric between two matrices and in this case is used for comparing two class representations of a given layer of a neural network.
Computational Optimization and Implementation. Calculating the generalized cosine defined in (4) requires performing singular value decomposition on each matrix, which requires \(\mathcal {O}(n^3)\) computation time, where n is the dimensionality of the square matrix R, or the dimensionality of the layer of interest in our case. However, an equivalent relationship can be derived that avoids the SVD step by using the trace of matrix products, resulting in the equivalent formula
\[ \cos^2(R_i, R_j) = \frac{tr(R_i R_j)}{\Vert R_i \Vert \, \Vert R_j \Vert} \tag{5} \]
which brings the computational cost of calculating the similarity metric down to \(\mathcal {O}(n^2)\), since for the trace of matrix products only the diagonal elements are required.
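The equivalence between the SVD-based form and the trace-based form can be checked numerically. The sketch below assumes symmetric positive definite inputs, so that the left singular vectors coincide with the eigenvectors. The function names `cos2_svd` and `cos2_trace` and the random test matrices are illustrative assumptions.

```python
import numpy as np

def cos2_svd(R_i, R_j):
    """Generalized cosine via explicit decomposition, O(n^3)."""
    U_i, s_i, _ = np.linalg.svd(R_i)
    U_j, s_j, _ = np.linalg.svd(R_j)
    M = np.diag(np.sqrt(s_i)) @ U_i.T @ U_j @ np.diag(np.sqrt(s_j))
    return np.linalg.norm(M, "fro") ** 2 / (np.linalg.norm(s_i) * np.linalg.norm(s_j))

def cos2_trace(R_i, R_j):
    """Same quantity via the trace of the product, O(n^2)."""
    # For symmetric matrices, tr(R_i R_j) is the sum of elementwise products,
    # so the full matrix product is never formed.
    num = np.sum(R_i * R_j)
    return num / (np.linalg.norm(R_i, "fro") * np.linalg.norm(R_j, "fro"))

# Hypothetical correlation matrices for two classes with 6 neurons each.
rng = np.random.default_rng(0)
Z_a = rng.standard_normal((300, 6))
Z_b = rng.standard_normal((300, 6)) + 1.0
R_a = Z_a.T @ Z_a / 300
R_b = Z_b.T @ Z_b / 300

slow = cos2_svd(R_a, R_b)
fast = cos2_trace(R_a, R_b)
```

Both values agree up to floating-point error, and the trace form returns 1 when a matrix is compared with itself.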
Proof
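The numerator of (4) can be simplified using the following property relating the trace of a matrix product to the Frobenius norm
\[ \Vert A \Vert^2 = tr(A^T A) \tag{6} \]
along with the property that
\[ tr(U A U^T) = tr(A U^T U) = tr(A) \tag{7} \]
for any unitary matrix U. In order to derive the relationship in (5) we exploit the fact that the trace of matrix products is invariant with respect to cycling the matrices being multiplied, in accordance with (7), as well as the fact that \(UU^T = I\), the identity matrix. Thus, using the relationship in Eq. (6), the numerator of (4) can be rewritten as
\[ \Vert \varSigma_i^{1/2} U_i^T U_j \varSigma_j^{1/2} \Vert^2 = tr(\varSigma_j^{1/2} U_j^T U_i \varSigma_i U_i^T U_j \varSigma_j^{1/2}) = tr(U_i \varSigma_i U_i^T U_j \varSigma_j U_j^T) = tr(R_i R_j) \]
Similarly, since \(\varSigma \) is a diagonal matrix, the factors in the denominator can be expressed as
\[ \Vert diag\{\varSigma_i\} \Vert = \sqrt{tr(\varSigma_i^2)} = \sqrt{tr(U_i \varSigma_i U_i^T U_i \varSigma_i U_i^T)} = \sqrt{tr(R_i^2)} = \Vert R_i \Vert \]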
and so a similarity metric equivalent to the generalized cosine defined in (4) can be formed according to (5), which brings the computational complexity of our metric from \(\mathcal {O}(n^3)\) to \(\mathcal {O}(n^2)\). The similarity defined in (5) is then used to construct a distance matrix, D, according to
\[ D_{ij} = 1 - \cos^2(R_i, R_j) \]
where (i, j) refers to classes i and j, respectively. The class similarities \(s_{ij}=cos^2(R_i, R_j)\) are used to build a similarity matrix, S, which is used to compare different network representations. Thus, D defines the distance matrix which can be used to generate a hierarchy of network classes via agglomerative clustering, as in Sect. 4.1, or the similarity matrices can be used to compare entire representations by taking the generalized cosine between the similarity matrices calculated from different models, as in Sect. 4.2.
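A minimal sketch of building S and D for a set of classes follows. It relies on the observation that the trace-based similarity is the ordinary cosine between the flattened matrices, so all pairs can be computed with one matrix product. The conversion D = 1 - S is an assumption made for illustration, and the helper names and random data are hypothetical.

```python
import numpy as np

def similarity_matrix(Rs):
    """S[i, j] = tr(R_i R_j) / (||R_i|| ||R_j||) for symmetric matrices R_i."""
    V = np.stack([np.asarray(R, dtype=float).ravel() for R in Rs])
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit Frobenius norm
    return V @ V.T

def distance_matrix(S):
    # Assumption: similarity is turned into a distance as D = 1 - S.
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)   # remove rounding noise on the diagonal
    return D

# Hypothetical activations for three classes in a 5-neuron layer.
rng = np.random.default_rng(1)
Rs = []
for shift in (0.0, 0.5, 2.0):
    Z = rng.standard_normal((100, 5)) + shift
    Rs.append(Z.T @ Z / 100)

S = similarity_matrix(Rs)
D = distance_matrix(S)
```

S is symmetric with ones on the diagonal, and D can be passed to any hierarchical clustering routine that accepts a precomputed distance matrix.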
3.3 Agglomerative Clustering
Agglomerative clustering is a hierarchical clustering method that merges clusters one pair at a time in a greedy fashion until only one supercluster remains. In this case we are concerned with a sort of mean representation of the clusters formed by similar classes, so the merging rule we chose was average linkage, which defines the distance between two clusters u and v according to [9]
\[ d(u, v) = \sum_{i,j} \frac{d(u[i], v[j])}{|u| \cdot |v|} \]
where |u| and |v| represent the cardinalities of clusters u and v. Linkage calculations were done using the scikit-learn library [10]. Neural network training and data collection were done using TensorFlow 2.0, with data processing performed in Python 3.6.
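To make the averaging rule concrete, the following is a deliberately naive NumPy sketch of average-linkage agglomeration on a precomputed distance matrix. It is not the library routine used for the experiments, and the function name `average_linkage` and the toy distances are assumptions for illustration.

```python
import numpy as np

def average_linkage(D):
    """Greedy average-linkage merging on a symmetric distance matrix D.

    Returns a list of merges (members_u, members_v, distance), in merge order.
    """
    D = np.asarray(D, dtype=float)
    clusters = [(i,) for i in range(D.shape[0])]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                u, v = clusters[a], clusters[b]
                # Mean of all pairwise distances between members of u and v.
                d = D[np.ix_(u, v)].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        u, v = clusters[a], clusters[b]
        merges.append((u, v, float(d)))
        rest = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters = rest + [tuple(sorted(u + v))]
    return merges

# Toy distance matrix for four classes.
D_toy = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 1.5],
    [5.0, 4.0, 1.5, 0.0],
])
merges = average_linkage(D_toy)
```

In the toy example, classes 0 and 1 merge first at distance 1.0, classes 2 and 3 merge next at 1.5, and the two pairs join last at the mean of their four cross distances, 4.0.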
4 Experimental Results and Discussion
4.1 Comparing Network Models with the WordNet Ontology
In order to compare some trained hierarchies with a more human-minded view of the world, we performed NAA on the VGG16 [11], Resnet50 [12], and InceptionV3 [13] architectures pretrained on the ImageNet [4] benchmark dataset, the class structure of which is conveniently based on the WordNet [5] ontology. By treating the resulting clusters as a sort of classification problem, we can get an idea of the similarity with human-made ontologies. That is, for each supercluster in the WordNet subtree, we find the closest matching cluster in the agglomerative hierarchy by searching for the cluster with the highest F1 score. The results can be seen in Fig. 2. Comparing the trees is done in this way due to limitations of standard graph distance measures, which generally assume a well-defined superclass structure. In order to show that these correspondences are not merely a matter of random chance, the results were compared with the clustering obtained when the distance matrix was randomly permuted. This randomized baseline ensures that the distribution of distance values is identical to the experimental distribution.
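The matching step can be sketched as follows, treating each cluster in the hierarchy as a predicted member set and each WordNet supercluster as the true member set. The function names and the toy labels are hypothetical, and the sketch takes the best match to be the cluster that maximizes F1.

```python
def f1_score(cluster, target):
    """F1 between a cluster's members and a superclass's true members."""
    cluster, target = set(cluster), set(target)
    tp = len(cluster & target)          # members in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(cluster)
    recall = tp / len(target)
    return 2 * precision * recall / (precision + recall)

def best_match(target, clusters):
    """Return (f1, cluster) for the cluster that best matches the target."""
    return max(((f1_score(c, target), c) for c in clusters), key=lambda t: t[0])

# Hypothetical superclass and hierarchy clusters.
canine_like = {"dog", "fox", "wolf"}
hierarchy = [("dog", "fox"), ("cat", "dog", "fox"), ("car", "truck")]
score, cluster = best_match(canine_like, hierarchy)
```

Here the pair ("dog", "fox") wins with precision 1 and recall 2/3, giving an F1 of 0.8.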
Figure 2(a) shows the distribution of F1 scores resulting from the fully connected layers of VGG16 and Fig. 2(b) shows the distribution resulting from the permuted distance matrix. Only classes from the WordNet ontology with at least 15 true class members were included in the analysis in order to eliminate bias introduced by including groups with low member counts.
The clusters with the top-10 and bottom-10 F1 scores for the final hidden layers of all three architectures can be seen in Tables 1 and 2, respectively. The significant overlap in both top and bottom scoring clusters in the VGG16 and Resnet50 architectures may indicate an invariance even across architectures, although the significant difference between the InceptionV3 results indicates this would not be true across all architectures.
4.2 MNIST Training Experiments
In order to evaluate how the class relationships evolve as a model trains, generic CNNs were trained to classify the MNIST dataset [3]. The number of convolutional layers was varied from 1 to 2, with channel depth varying from 2 to 8. The number of fully connected layers was set at 2, each layer having between 2 and 16 nodes. ReLU activations were used for all hidden layers. Distances relative to digit 6 for two example networks can be seen in Fig. 3, as well as the minimum distance across all digits, which appears to be a good indicator of validation accuracy. Accuracy and loss plots are included in Figs. 3(c) and (d) for comparison. The heat maps indicate training epoch, with the darkest line corresponding to epoch 100.
As can be seen by comparing the plots in Fig. 3, the relationships in the under-parameterized network are relatively unstable, varying significantly across training epochs compared with the relational stability exhibited in the sufficiently parameterized networks in Fig. 3(b). In general, sufficiently parameterized networks showed stable class relationships and increased minimum distances while underparameterized network relationships oscillated back and forth through training and often reflected a lack of effective delineation between classes.
Comparing Representations Across Networks. One useful component of other analysis methods that has not been addressed is the ability to compare across networks. As mentioned previously, our method is not invariant to rotations in the feature subspace, so using the relationship defined in (2) to compare across networks would require a step to bring the R matrices to the same dimensionality, and would likely be a poor way to evaluate relationships since it does not allow for flexible neural correspondences. However, we can compare representations in different layers by calculating the generalized cosine between the similarity matrices themselves, each being a symmetric matrix in \(\mathbb {R}^{n \times n}\), where n is the dimensionality of the classification layer. We therefore define the similarity between two network representations i and j at layers \(l_i\) and \(l_j\) analogously to (5), using the cosine between \(S_{l_i}\) and \(S_{l_j}\), where \(S_{l_i}\) is the class similarity matrix at layer \(l_i\) of network representation i and \(S_{l_j}\) is defined similarly for network j.
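A short sketch of this network-level comparison, assuming two class similarity matrices of equal size have already been computed, applies the same trace-based cosine to the similarity matrices. The function name `representation_similarity` and the example matrices are illustrative assumptions.

```python
import numpy as np

def representation_similarity(S_a, S_b):
    """Cosine between two symmetric class similarity matrices of equal shape."""
    S_a = np.asarray(S_a, dtype=float)
    S_b = np.asarray(S_b, dtype=float)
    # tr(S_a S_b) equals the sum of elementwise products for symmetric inputs;
    # np.linalg.norm defaults to the Frobenius norm for 2-D arrays.
    return np.sum(S_a * S_b) / (np.linalg.norm(S_a) * np.linalg.norm(S_b))

# Hypothetical class similarity matrices from two trained networks (3 classes).
S_net1 = np.array([[1.0, 0.8, 0.2],
                   [0.8, 1.0, 0.3],
                   [0.2, 0.3, 1.0]])
S_net2 = np.array([[1.0, 0.7, 0.1],
                   [0.7, 1.0, 0.4],
                   [0.1, 0.4, 1.0]])
sim = representation_similarity(S_net1, S_net2)
```

Because only the class-by-class matrices are compared, the two networks may have layers of different width, as long as they share the same set of classes.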
Figure 4 shows the results from comparing network representations across training experiments. The figure shows experiments where the number of parameters in networks consisting of one to two convolutional layers and two fully connected layers was gradually increased in order to evaluate cross-training invariance as the level of parameterization increases. Each configuration was used to train 10 different models with the same architecture, which were then compared as described above. Figure 4(a) shows the average similarity of the final fully connected layers at different points in training, while Fig. 4(b) shows the average validation accuracy at each epoch. Comparing the validation accuracy and average network similarity shows that the relationships appear to be invariant to training initialization in sufficiently parameterized networks.
5 Summary
In this work, we introduced Neurodynamical Agglomerative Analysis, a hierarchical analysis pipeline that takes as input a subset of network activations sorted by class and layer, and outputs an agglomerative hierarchy of output classes based on the generalized cosine of cross-correlation matrices as a similarity metric.
Using the ImageNet benchmark dataset, we showed the extent to which this similarity corresponds to a human-made view of the world in some architectures, likely corresponding to the extent to which class delineation relates to textural similarity. Further, by comparing the resulting hierarchies across different well-known architectures, we showed that some relationships in NAA may even be invariant across some architectures based on similarities between results for VGG16 and Resnet50 architectures, as well as that this invariance would not be global since there were significant differences apparent in the InceptionV3 architecture.
Using MNIST training experiments, we also showed how some class relationships in NAA evolve over training in networks with varying degrees of parameterization. Preliminary analysis shows underparameterized networks exhibit more erratic relationships between classes while overparameterized networks tend toward more or less steady-state relationships over time. This steady-state relationship appears to also be relatively invariant to training initializations, showing that this type of analysis may be a step in the direction of uncovering properties of the network that are invariant across training experiments. Future work with these insights will include further exploration into how invariant these relationships are, as well as how the relationships in the correlation space relate to class-delineated error analysis. Additionally, in order to add another dimension of experimental variation, methods of embedding arbitrary hierarchies into the correlation space will be explored.
References
1. Bach, S., et al.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7), e0130140 (2015)
2. Raghu, M., et al.: SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In: Advances in Neural Information Processing Systems (2017)
3. Deng, L.: The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process. Mag. 29(6), 141–142 (2012)
4. Deng, J., et al.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE (2009)
5. Fellbaum, C.: WordNet. In: Poli, R., Healy, M., Kameas, A. (eds.) Theory and Applications of Ontology: Computer Applications, pp. 231–243. Springer, Dordrecht (2010). https://doi.org/10.1007/978-90-481-8847-5_10
6. Li, Y., et al.: Convergent learning: do different neural networks learn the same representations? In: FE@NIPS (2015)
7. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)
8. Jaeger, H.: Controlling recurrent neural networks by conceptors. arXiv preprint arXiv:1403.3369 (2014)
9. Müllner, D.: Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378 (2011)
10. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(Oct), 2825–2830 (2011)
11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
12. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
13. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
© 2020 Springer Nature Switzerland AG
Cite this paper
Marino, M., Schröter, G., Heidemann, G., Hertzberg, J. (2020). Hierarchical Modeling with Neurodynamical Agglomerative Analysis. In: Farkaš, I., Masulli, P., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2020. ICANN 2020. Lecture Notes in Computer Science(), vol 12396. Springer, Cham. https://doi.org/10.1007/978-3-030-61609-0_15
DOI: https://doi.org/10.1007/978-3-030-61609-0_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61608-3
Online ISBN: 978-3-030-61609-0
eBook Packages: Computer Science (R0)