
Network Architecture Type Abbreviations
CDBN:

Convolutional Deep Belief Networks

CNN:

Convolutional Neural Networks

DBN:

Deep Belief Networks

DCNN:

Deep Convolution Neural Network

DNNs:

Deep Neural Networks

MCDNN:

Multicolumn Deep Neural Networks

Dataset Name Abbreviations
CASIA-HWDB:

Institute of Automation of Chinese Academy of Sciences-Hand Writing Databases

DTD:

Describable Textures Dataset

FLIC:

Frames Labeled In Cinema

FMD:

Flickr Material Database

GTSRB:

German Traffic Sign Recognition Benchmark

ILSVRC:

ImageNet Large Scale Visual Recognition Challenge

LFW:

Labeled Faces in the Wild

LSP:

Leeds Sports Pose

MNIST:

Modified National Institute of Standards and Technology

VOC:

Visual Object Classes

WAF:

We Are Family

YTF:

YouTube Faces

Other Abbreviations
CVPR:

Computer Vision and Pattern Recognition

ICCV:

International Conference on Computer Vision

IEEE:

Institute of Electrical and Electronics Engineers

NIPS:

Neural Information Processing Systems

t-SNE:

t-Distributed Stochastic Neighbor Embedding

1 Introduction

Artificial Neural Networks for learning mathematical functions were introduced in 1943 [48]. Despite being theoretically able to approximate any function [7], their popularity decreased in the 1970s because their computationally expensive training was not feasible with the computing resources available at the time [49]. With the increase in computing power in recent years, neural networks again became a subject of research as Deep Neural Networks (DNNs). DNNs, artificial neural networks with multiple layers combining supervised and unsupervised training, have since been shown to outperform the state of the art in multiple areas, such as visual object recognition, genomics and speech recognition [36]. Despite their empirically superior performance, DNNs have one disadvantage: their trained models are not easily understandable, because information is encoded in a distributed manner.

However, understanding and trust have been identified as desirable properties of data mining models [65]. In most scenarios, experts can assess model performance on data sets, including gold standard data sets, but have little insight into how and why a specific model works [81]. This lack of understandability is one of the reasons why less powerful but easy-to-communicate classification models such as decision trees are in some applications preferred to very powerful classification models, like Support Vector Machines and Artificial Neural Networks [33]. Visualization has been shown to support understandability for various data mining models, e.g. for Naive Bayes [1] and Decision Forests [66].

In this chapter, we review literature on visualization of DNNs in the computer vision domain. Although DNNs have many application areas, including automatic translation and text generation, computer vision tasks are the earliest applications [35]. Computer vision applications also provide the most visualization possibilities due to their easy-to-visualize input data, i.e., images. In the review, we identify the questions authors ask about neural networks that a visualization should answer (visualization goal) and the visualization methods they apply for that purpose. We also characterize the application domain by the computer vision task the network is trained for, the type of network architecture and the data sets used for training and visualization. Note that we only consider visualizations which are automatically generated. We do not cover manually generated illustrations (like the network architecture illustration in [35]). Concretely, our research questions are:

  1. RQ-1

    Which insights can be gained about DNN models by means of visualization?

  2. RQ-2

    Which visualization methods are appropriate for which kind of insights?

To collect the literature we pursued the following steps: since deep architectures became prominent only a few years ago, we restricted our search to publications from the year 2010 onward. We searched the main conferences, journals and workshops in the areas of computer vision, machine learning and visualization, such as: IEEE International Conference on Computer Vision (ICCV), IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Visualization Conference (VIS), Advances in Neural Information Processing Systems (NIPS). Additionally, we used keyword-based search in academic search engines, using the following phrases (and combinations): “deep neural networks”, “dnn”, “visualization”, “visual analysis”, “visual representation”, “feature visualization”.

This chapter is organized as follows: the next section introduces the classification scheme and describes the categories we applied to the collected papers. Section 3 reviews the literature according to the introduced categories. We discuss the findings with respect to the introduced research questions in Sect. 4, and conclude the work in Sect. 5.

2 Classification Scheme

In this section we present the classification scheme used to structure the literature: we first introduce a general view, and then provide detailed descriptions of the categories and their values. An overview of the classification scheme is shown in Fig. 1.

Fig. 1 Classification Scheme for Visualizations of Deep Neural Networks. The dotted border subsumes the categories characterizing the application area

First, we need to identify the purpose the visualization was developed for. We call this category visualization goal. Possible values are for instance general understanding and model quality assessment. Then, we identified the visualization methods used to achieve the above-mentioned goals. Such methods can potentially cover the whole visualization space [51], but the literature review shows that only a very small subset has been used so far in the context of DNNs, including heat maps and visualizations of confusion matrices. Additionally, we introduced three categories to describe the application domain. These categories are the computer vision task, the architecture type of the network and the data sets the neural network was trained on, which are also used for the visualization.

Note that the categorization is not exclusive: one paper can be assigned multiple values in one category. For instance, a paper can use multiple visualization methods (CNNVis uses a combination of node-link diagrams, matrix displays and heatmaps [44]) on multiple data sets.

Related to the proposed classification scheme is the taxonomy of Grün et al. for visualizing learned features in convolutional neural networks [25]. The authors categorize the visualization methods into input modification, de-convolutional, and input reconstruction methods. In input modification methods, the output of the network and of intermediate layers is measured while the input is modified. De-convolutional methods adopt a reverse strategy to calculate the influence of a neuron’s activation from lower layers. This strategy demonstrates which pixels are responsible for the activation of neurons in each layer of the network. Input reconstruction methods try to assess the importance of features by reconstructing input images. These input images can be either real or artificial images that either maximize the output of a unit of interest or leave it invariant. This categorization is restricted to feature visualizations and is therefore narrower than the proposed scheme. For instance, it does not cover the general application domain, and is restricted to specific types of visualizations, because it categorizes the calculation methods used for Pixel Displays and heatmaps.
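To make the input modification idea concrete, the following minimal sketch occludes one image window at a time and records how much a score drops. The function `toy_score` is a hypothetical stand-in for a network output, and the patch size and fill value are assumptions chosen for illustration; this is not the procedure of any specific surveyed paper.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Return, per pixel, the score drop caused by occluding its window."""
    height, width = len(image), len(image[0])
    baseline = score_fn(image)
    drops = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Coordinates of the current (possibly clipped) window.
            window = [(r, c)
                      for r in range(top, min(top + patch, height))
                      for c in range(left, min(left + patch, width))]
            occluded = [row[:] for row in image]  # copy, keep input intact
            for r, c in window:
                occluded[r][c] = fill
            drop = baseline - score_fn(occluded)
            for r, c in window:
                drops[r][c] = drop
    return drops


def toy_score(image):
    """Hypothetical model output: mean intensity of the top-left 2x2 block."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
sensitivity = occlusion_map(image, toy_score)
```

Large values in `sensitivity` mark windows whose removal changes the output most; rendering the result as a Pixel Display or heat map gives the kind of visualization discussed in this chapter.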

2.1 Visualization Goals

This category describes the various goals of the authors visualizing DNNs. We identified the following four main goals:

  • General Understanding: This category encompasses questions about general behavior of the neural network, either during training, on the evaluation data set or on unseen images. Authors want to find out what different network layers are learning or have learned, on a rather general level.

  • Architecture Assessment: Work in this category tries to identify how the network architecture influences performance in detail. Compared to the first category the analyses are on a more fine-grained level, e.g. assessing which layers of the architecture represent which features (e.g., color, texture), and which feature combinations are the basis for the final decision.

  • Model Quality Assessment: In this category authors have focused their research on determining how the number of layers and the role played by each layer affect the visualization process.

  • User Feedback Integration: This category comprises work in which visualization is the means to integrate user feedback into the machine learning model. Examples for such feedback integration are user-based selection of training data [58] or the interactive refinement of hypotheses [21].

2.2 Visualization Methods

Only a few visualization methods [51] have been applied to DNNs. We briefly describe them in the following:

  • Histogram: A histogram is a very basic visualization showing the distribution of univariate data as a bar chart.

  • Pixel Displays: The basic idea is that each pixel represents a data point. In the context of DNNs, the (color) value of each pixel is based on network activations, reconstructions, or similar quantities, yielding 2-dimensional rectangular images. In most cases, pixels next to each other in the display space are also next to each other in the semantic space (e.g., nearby pixels of the original image). This nearness criterion is what distinguishes them from Dense Pixel Displays [32]. We further distinguish whether the displayed values originate from a single image, from a set of images (i.e., a batch), or only from a part of the image.

  • Heat Maps: Heat maps are a special case of Pixel Displays, where the value for each pixel represents an accumulated quantity of some kind and is encoded using a specific coloring scheme [72]. Heat maps are often transparently overlaid over the original data.

  • Similarity Layout: In similarity-based layouts the relative positions of data objects in the low-dimensional display space are based on their pair-wise similarity. Similar objects should be placed nearby in the visualization space, dissimilar objects farther apart. In the context of images as objects, suitable similarity measures between images have to be defined [52].

  • Confusion Matrix Visualization: This technique combines the idea of heatmaps and matrix displays. The classifier confusion matrix (showing the relation between true and predicted classes) is colored according to the value in each cell. The diagonal of the matrix indicates correct classification and all the values other than the diagonal are errors that need to be inspected. Confusion matrix visualizations have been applied to clustering and classification problems in other domains [69].

  • Node-Link Diagrams: These are visualizations of (un-)directed graphs [10], in which nodes represent objects and links represent relations between objects.

2.3 Computer Vision Tasks

In the surveyed papers different computer vision tasks were solved by DNNs. These are the following:

  • Classification: The task is to categorize image pixels into one or more classes.

  • Tracking: Object tracking is the task of locating moving objects over time.

  • Recognition: Object recognition is the task of identifying objects in an input image by determining their position and label.

  • Detection: Given an object and an input image the task in object detection is to localize this object in the image, if it exists.

  • Representation Learning: This task refers to learning features suitable for object recognition, tracking etc. Examples of such features are points, lines, edges, textures, or geometric shapes.

2.4 Network Architectures

We identified six different types of network architectures in the context of visualization. These types are not mutually exclusive, since all types belong to DNNs, but some architectures are more specific, either w.r.t. the types of layers, the type of connections between the layers or the learning algorithm used.

  • DNN: Deep Neural Networks are the general type of feed-forward networks with multiple hidden layers.

  • CNN: Convolutional Neural Networks are a type of feed-forward network specifically designed to mimic the human visual cortex [22]. The architecture consists of multiple layers of smaller neuron collections processing portions of the input image (convolutional layers) generating low-level feature maps. Due to their specific architecture CNNs have much fewer connections and parameters compared to standard DNNs, and thus are easier to train.

  • DCNN: The Deep Convolution Neural Network is a CNN with a special eight-layer architecture [35]. The first five layers are convolutional layers and the last three layers are fully connected.

  • DBN: Deep Belief Networks can be seen as a composition of Restricted Boltzmann Machines (RBMs) and are characterized by a specific training algorithm [27]. The top two layers of the network have undirected connections, whereas the lower layers have directed connections with the upper layers.

  • CDBN: Convolutional Deep Belief Networks are similar to DBNs, containing Convolutional RBMs stacked on one another [38]. Training is performed similarly to DBNs using a greedy layer-wise learning procedure, i.e., the weights of trained layers are fixed and their output is considered as input for the next layer.

  • MCDNN: The Multicolumn Deep Neural Network is basically a combination of several DNNs arranged as columns [6]. The input is processed by all DNNs and their outputs are aggregated to the final output of the MCDNN.
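Two of the statements in the list above can be illustrated with a little arithmetic. The sketch below compares the number of trainable parameters of a convolutional layer with that of a fully connected layer producing an output of the same size, and shows one possible way to aggregate the outputs of several columns. The layer sizes are made-up examples, and aggregating by simple averaging is an assumption made here for illustration; the text above only states that the outputs are aggregated.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases of a convolutional layer (shared across positions)."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def average_columns(column_outputs):
    """Average class scores over columns (assumed aggregation rule)."""
    n_columns = len(column_outputs)
    return [sum(scores) / n_columns for scores in zip(*column_outputs)]


# Hypothetical example: a 32x32 RGB image mapped to 16 feature maps of 32x32.
conv_count = conv_params(5, 5, in_channels=3, out_channels=16)
dense_count = dense_params(32 * 32 * 3, 32 * 32 * 16)

# Hypothetical class scores of two columns for a two-class problem.
combined = average_columns([[0.2, 0.8], [0.4, 0.6]])
```

The convolutional layer reuses the same small kernels at every image position, which is why `conv_count` is several orders of magnitude smaller than `dense_count` in this example.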

In the next section we will apply the presented classification scheme (cf. Fig. 1) to the selected papers and provide some statistics on the goals, methods, and application domains. Additionally, we categorize the papers according to the taxonomy of Grün [25] (input modification methods, de-convolutional methods and input reconstruction) if this taxonomy is applicable.

3 Visualizations of Deep Neural Networks

Table 1 provides an overview of all papers included in this survey and their categorization. The table is sorted first by publication year and then by author name. In the following, the collected papers are investigated in detail, with the subsections corresponding to the categories derived in the previous section.

Table 1 Overview of all reviewed papers

3.1 Visualization Goals

Table 2 provides an overview of the papers in this category. The most prominent goal is architecture assessment (16 papers). Model quality assessment was covered in 8 and general understanding in 7 papers respectively, while only 3 authors approach interactive integration of user feedback.

Table 2 Overview of visualization goals

Authors who have contributed work on visualizing DNNs with the goal general understanding have focused on gaining basic knowledge of how the network performs its task. They aimed to understand what each network layer is doing in general. Most of the work in this category concludes that lower layers of the networks contain representations of simple features like edges and lines, whereas deeper layers tend to be more class-specific and learn complex image features [41, 47, 61]. Some authors developed tools to get a better understanding of the learning capabilities of convolutional networks [2, 79]. They demonstrated that such tools can provide a means to visualize the activations produced in response to user inputs and showed how the network behaves on unseen data.

Approaches providing deeper insights into the architecture were placed into the category architecture assessment. Authors focused their research on determining how these networks capture representations of texture, color and other features that discriminate an image from another, quite similar image [56]. Other authors tried to assess how these deep architectures arrive at certain decisions [42] and how the input image data affects the decision making capability of these networks under different conditions. These conditions include image scale, object translation, and cluttered background scenes. Further, authors investigated which features are learned, and whether the neurons are able to learn more than one feature in order to arrive at a decision [53]. Also, the contribution of image parts for activation of specific neurons was investigated [85] in order to understand for instance, what part of a dog’s face needs to be visible for the network to detect it as a dog. Authors also investigated what types of features are transferred from lower to higher layers [78, 79], and have shown for instance, that scene centric and object centric features are represented differently in the network [84].

Eight papers contributed work on model quality assessment. Authors have focused their research on how the individual layers can be effectively visualized, as well as on the effect on the network’s performance. The contribution of each layer at different levels greatly influences the role it plays in computer vision tasks. One such work determined how the convolutional layers at various levels of the network show varied properties for tracking purposes [71]. Dosovitskiy and Brox have shown that higher convolutional layers retain details of object location, color, and contour information of the image [12]. Visualization is used as a means to improve tools for finding good interpretations of features learned in higher levels [16]. Krizhevsky et al. focused on the performance of individual layers and how performance degrades when certain layers in the network are removed [35].

Some authors researched user feedback integration. In the interactive node-link visualization in [26] users can provide their own training data using a drawing area. This method is strongly tied to the network and training data used (MNIST handwritten digits). In the Ml-O-Scope system users can interactively analyze convolutional neural networks [2]. Users are presented with a visualization of the current model performance, i.e. the a-posteriori probability distribution for input images and Pixel Displays of activations within selected network layers. They are also provided with a user interface for interactive adaptation of model hyper-parameters. A visual analytics approach to DNN training has been proposed recently [44]. The authors present 3 case studies in which DNN experts evaluated a network, assessed errors, and found directions for improvement (e.g. adding new layers).

3.2 Visualization Methods

In this section we describe the different visualization methods applied to DNNs. An overview of the methods is provided in Table 3. We also categorize the papers according to Grün’s taxonomy [25] in Table 4. In the following we describe the papers for each visualization method separately.

Table 3 Overview of visualization methods
Table 4 Overview of categorization by Grün [25]

3.2.1 Pixel Displays

Most of the reviewed work has utilized pixel-based activations as a means to visualize different features and layers of deep neural networks. The basic idea behind such visualization is that each pixel represents a data point. The color of the pixel corresponds to an activation value, the maximum gradient w.r.t. a given class, or a reconstructed image. The different computational approaches for calculating maximum activations, sensitivity values, or reconstructed images are not within the scope of this chapter. We refer to the survey paper for feature visualizations in DNNs [25] and provide a categorization of papers into Grün’s taxonomy in Table 4.

Mahendran and Vedaldi [46, 47] have visualized the information contained in the image by a process of inversion based on optimization with gradient descent. Visualizations are used to show the representations at each layer of the network (cf. Fig. 2). All the convolutional layers maintain photographically realistic representations of the image. The first few layers are specific to the input images and form a direct invertible code base. The fully connected layers represent data with less geometry and instance-specific information. Activation signals can thus be inverted back to images containing parts similar, but not identical, to the original images. Cao et al. [3] have used Pixel Displays on complex, cluttered, single images to visualize their results of CNNs with feedback. Nguyen et al. [53] developed an algorithm to demonstrate that single neurons can represent multiple facets. Their visualizations show the type of image features that activate specific neurons. A regularization method is also presented to improve the interpretability of the images that maximize activation. The results suggest that visualizations synthesized from activated neurons better represent input images in terms of overall structure and color. Simonyan et al. [61] visualized data for deep convolutional networks. The first visualization is a numerically generated image that maximizes a classification score. As a second visualization, saliency maps for given pairs of images and classes indicate the influence of pixels from the input image on the respective class score, computed via back-propagation.
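The saliency idea described above, the influence of each input pixel on a class score, can be sketched without a deep learning framework. The example below is a simplification: it estimates the gradient by central finite differences instead of back-propagation, and `toy_class_score` with its `WEIGHTS` is a hypothetical score function, not a trained network.

```python
import math

# Hypothetical weights of a tiny 3x3 "class detector" (assumption).
WEIGHTS = [[0.0, 0.5, 0.0],
           [0.5, 2.0, 0.5],
           [0.0, 0.5, 0.0]]


def toy_class_score(image):
    """Hypothetical class score: squashed weighted sum of the pixels."""
    total = sum(WEIGHTS[r][c] * image[r][c]
                for r in range(3) for c in range(3))
    return math.tanh(total)


def saliency_map(image, score_fn, eps=1e-4):
    """Absolute sensitivity of the score to each pixel (finite differences)."""
    height, width = len(image), len(image[0])
    saliency = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            original = image[r][c]
            image[r][c] = original + eps
            score_up = score_fn(image)
            image[r][c] = original - eps
            score_down = score_fn(image)
            image[r][c] = original  # restore the pixel
            saliency[r][c] = abs(score_up - score_down) / (2 * eps)
    return saliency


image = [[0.1] * 3 for _ in range(3)]
saliency = saliency_map(image, toy_class_score)
```

Displaying `saliency` with one pixel per input pixel yields a Pixel Display in the sense used here. For a real network the gradient would come from a single backward pass, which is far cheaper than perturbing every pixel.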

Fig. 2 Pixel based display. Activations of the first convolutional layer generated with the DeepVis toolbox from [79] https://github.com/yosinski/deep-visualization-toolbox/

3.2.2 Heat Maps

In most cases, heat maps were used for visualizing the extent of feature activations of specific network layers for various computer vision tasks (e.g. classification [81], tracking [71], detection [83]). Heat maps have also been used to visualize the final network output, e.g. the classifier probability [63, 81]. The heat map visualizations are used to study the contributions of different network layers (e.g. [71]), compare different methods (e.g. [50]), or investigate the DNNs inner features and results on different input images [83]. Zintgraf et al. [85] used heat maps to visualize image regions in favor of, as well as image regions against, a specific class in one image. Authors use different color codings for their heat maps: blue-red-yellow color schemes [71, 81, 83], white-red schemes [50], blue-white-red schemes [85], and also a simple grayscale highlighting interesting regions in white [63].
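As an illustration of the color codings mentioned above, the following sketch maps signed evidence values to a blue-white-red scheme and blends the result transparently over a grayscale image. The choice of red for positive and blue for negative values, the symmetric normalization, and the blending factor `alpha` are assumptions for this example rather than settings reported in the cited papers.

```python
def diverging_color(value, max_abs):
    """Map a value in [-max_abs, max_abs] to an RGB triple in [0, 1]."""
    t = 0.0 if max_abs <= 0 else max(-1.0, min(1.0, value / max_abs))
    if t >= 0:
        return (1.0, 1.0 - t, 1.0 - t)  # white towards red
    return (1.0 + t, 1.0 + t, 1.0)      # white towards blue


def overlay(gray_pixel, heat_rgb, alpha=0.5):
    """Alpha-blend a heat color over a grayscale pixel."""
    return tuple((1 - alpha) * gray_pixel + alpha * channel
                 for channel in heat_rgb)


def heat_map_overlay(gray_image, evidence, alpha=0.5):
    """Blend a diverging heat map over a grayscale image of equal size."""
    max_abs = max(abs(v) for row in evidence for v in row)
    return [[overlay(g, diverging_color(v, max_abs), alpha)
             for g, v in zip(gray_row, evidence_row)]
            for gray_row, evidence_row in zip(gray_image, evidence)]


gray_image = [[0.0, 0.0]]
evidence = [[2.0, -2.0]]  # hypothetical evidence for / against a class
blended = heat_map_overlay(gray_image, evidence)
```

Normalizing by the largest absolute value keeps zero evidence at white, so regions in favor of and against a class remain visually comparable.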

3.2.3 Confusion Matrix and Histogram

Two authors have shown the confusion matrix to illustrate the performance of the DNN w.r.t. a classification task (see Fig. 3). Bruckner et al. [2] additionally encoded the value in each cell using color (darker color represents higher values). Thus, in this visualization, dark off-diagonal spots correspond to large errors. In [6] the encoding used is different: each cell value is additionally encoded by the size of a square. Cells containing large squares represent large values; a large off-diagonal square corresponds to a large error between two classes. Similarly, in one paper histograms have been used to visualize the decision uncertainty of a classifier, indicating by color whether the most probable class is the correct one [35].
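The data behind such a display is easy to compute. The sketch below builds a confusion matrix from true and predicted labels, derives a per-cell shade in [0, 1] that could drive either the color or the square size, and lists the largest off-diagonal cells. The labels are invented for illustration, and placing true classes in rows and predicted classes in columns is an assumed convention.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def cell_shades(matrix):
    """Scale counts to [0, 1]; 1.0 would be drawn darkest or largest."""
    peak = max(max(row) for row in matrix) or 1
    return [[count / peak for count in row] for row in matrix]


def largest_errors(matrix, top=3):
    """Return the most frequent confusions as (count, true, predicted)."""
    errors = [(count, true, predicted)
              for true, row in enumerate(matrix)
              for predicted, count in enumerate(row)
              if true != predicted and count > 0]
    return sorted(errors, reverse=True)[:top]


true_labels = [0, 0, 1, 1, 2, 2, 2]       # hypothetical ground truth
predicted_labels = [0, 1, 1, 1, 2, 0, 0]  # hypothetical classifier output
matrix = confusion_matrix(true_labels, predicted_labels, n_classes=3)
shades = cell_shades(matrix)
errors = largest_errors(matrix)
```

In this toy example the darkest off-diagonal cell is the one where class 2 is predicted as class 0, which is exactly the kind of spot a reader of the visualization would inspect first.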

Fig. 3 Confusion Matrix example. Showing classification results for the COIL-20 data set. Screenshots reproduced with software from [59]

3.2.4 Similarity Based Layout

In the context of DNNs, similarity based layouts so far have been applied only by Donahue et al. [11], who specifically used t-distributed stochastic neighbor embedding (t-SNE) [67] of feature representations. The authors projected feature representations of different network layers into 2-dimensional space and found a visible clustering for the higher layers in the network, but none for features of the lower network layers. This finding corresponds to the general knowledge of the community that higher layers learn semantic or high-level features. Further, based on the projection the authors could conclude which feature representations are a good choice for generalization to other (unseen) classes and how traditional features compare to feature representations learned by deep architectures. Figure 4 provides an example of a similarity based layout.
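A similarity layout starts from pairwise similarities between feature vectors. The sketch below computes Gaussian conditional affinities of the kind t-SNE uses in the input space, with one simplification that should be kept in mind: a single fixed bandwidth `sigma` replaces the per-point bandwidth that t-SNE derives from a perplexity setting, and the subsequent optimization of 2-dimensional positions is omitted. The feature vectors are invented for illustration.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def conditional_affinities(points, sigma=1.0):
    """Row i holds the probability that point i picks point j as neighbor."""
    n_points = len(points)
    rows = []
    for i in range(n_points):
        weights = [0.0 if j == i else
                   math.exp(-squared_distance(points[i], points[j])
                            / (2 * sigma ** 2))
                   for j in range(n_points)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


# Hypothetical feature vectors forming two well separated groups.
features = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]]
affinities = conditional_affinities(features)
```

A layout algorithm then places points so that pairs with high affinity end up close together, which is what makes clusters of semantically similar images visible in the projection.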

Fig. 4 Similarity based layout of the MNIST data set using raw features. Screenshot was taken with a JavaScript implementation of t-SNE [67] https://scienceai.github.io/tsne-js/

3.2.5 Node-Link Diagrams

Two authors have approached DNN visualization with node-link diagrams (see examples in Fig. 5). In his interactive visualization approach, Adam Harley represented layers in the neural networks as nodes using Pixel Displays, and activation levels as edges [26]. Due to the density of connections in DNNs only active edges are visible. Users can draw input images for the network and interactively explore how the DNN is trained. In CNNVis [44] nodes represent neuron clusters and are visualized in different ways (e.g., activations) showing derived features for the clusters.
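The practice of showing only active edges can be sketched as a simple filter on a graph. In the example below the node names, the activation levels and the threshold are all hypothetical; the sketch illustrates the general idea and is not the implementation of either system.

```python
from collections import defaultdict


def visible_edges(edges, threshold):
    """Keep edges whose absolute activation level reaches the threshold."""
    return [(source, target, level)
            for source, target, level in edges
            if abs(level) >= threshold]


def adjacency(edges):
    """Group edges by source node for drawing or traversal."""
    graph = defaultdict(list)
    for source, target, level in edges:
        graph[source].append((target, level))
    return dict(graph)


# Hypothetical edges: (source node, target node, activation level).
edges = [("input", "conv1_a", 0.9),
         ("input", "conv1_b", 0.05),
         ("conv1_a", "fc", 0.7),
         ("conv1_b", "fc", 0.01)]
shown = visible_edges(edges, threshold=0.1)
graph = adjacency(shown)
```

Filtering before layout keeps the diagram readable, at the cost of hiding weak connections that may still matter in aggregate.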

Fig. 5 Node-link diagrams of DNNs. Top: Example from [26] taken with the online application at http://scs.ryerson.ca/~aharley/vis/conv/. Bottom: screenshot of the CNNVis system [44] taken with the online application at http://shixialiu.com/publications/cnnvis/demo/

3.3 Network Architecture and Computer Vision Task

Table 5 provides a summary of the architecture types. The majority of papers applied visualizations to CNN architectures (18 papers), while 8 papers dealt with the more general case of DNNs. Only 8 papers have investigated more special architectures, like DCNN (4 papers), DBNs (2 papers), CDBN (1 paper) and MCDNNs (1 paper).

Table 5 Overview of network architecture types

Table 6 summarizes the computer vision tasks for which the DNNs have been trained. Most networks were trained for classification (14 papers), some for representation learning and recognition (9 and 6 papers, respectively). Tracking and Detection were pursued the least often.

Table 6 Overview of computer vision tasks

3.4 Data Sets

Table 7 provides an overview of the data sets used in the reviewed papers. In the field of classification and detection, ImageNet is the most frequently used dataset, used around 21 times. Other popular datasets used in tasks involving detection and recognition, such as Caltech101 and Caltech256, have been used 2–3 times (e.g. in [11, 56, 81, 84]).

Table 7 Overview of data sets sorted by usage. Column “#” refers to the number of papers in this survey using this data set

While ImageNet and its subsets (e.g. ILSVRC) are large datasets with around 10,000,000 images each, there are smaller datasets such as the ETHZ stickmen and VOC2010 which are generally used for fine-grained classification and learning. VOC2010, consisting of about 21,738 images, has been used twice, while more specialized data sets, such as Buffy Stickmen for representation learning, have been used only once in the reviewed papers [41]. There are also datasets used in recognition with fewer classes, such as CIFAR10, consisting of 60,000 colour images in 10 classes, and MNIST, used for recognition of handwritten digits.

4 Discussion

In this section we discuss the implications of the findings from the previous section with respect to the research questions. We start the discussion by evaluating the results for the stated research questions.

RQ-1 (Which insights can be gained about DNN models by means of visualization?) has been discussed in detail along with the individual papers in the previous section. We showed by examples which visualizations have previously been shown to lead to which insights. For instance, visualizations are used to learn which features are represented in which layer of a network or which part of the image a certain node reacts to. Additionally, visualizing synthetic input images which maximize activation allows a better understanding of how a network as a whole works. To strengthen our point here, we additionally provide some quotes from authors:

Heat maps:

“The visualisation method shows which pixels of a specific input image are evidence for or against a node in the network.” [85]

Similarity layout:

“[…] first layers learn ‘low-level’ features, whereas the latter layers learn semantic or ‘high-level’ features. […] GIST or LLC fail to capture the semantic difference […]” [11]

Pixel Displays:

“[…] representations on later convolutional layers tend to be somewhat local, where channels correspond to specific, natural parts (e.g. wheels, faces) instead of being dimensions in a completely distributed code. That said, not all features correspond to natural parts […]” [79]

The premise to use visualization is thus valid, as the publications agree that visualizations help to understand the functionality and behavior of DNNs in computer vision. This is especially true when investigating specific parts of the DNN.

To answer RQ-2 (Which visualization methods are appropriate for which kind of insights?) we evaluated which visualizations were applied in the context of which visualization goals. A summary is shown in Fig. 6. It can be seen that not all methods were used in combination with all goals, which is not surprising. For instance, no publication used a similarity layout for assessing the architecture. This provides hints on possibilities for further visualization experiments.

Fig. 6 Relation of visualization goals and applied methods in the surveyed papers following our taxonomy. Size of the circles corresponds to the (square root of the) number of papers in the respective categories. For details on papers see Table 1

Pixel Displays were prevalent for architecture assessment and general understanding. This is plausible since DNNs for computer vision work on the images themselves. Thus, Pixel Displays preserve the spatial context of the input data, making the interpretation of the visualization straightforward. This visualization method, however, has its own disadvantages and might not be the ideal choice in all cases. The visualization design space is extremely limited, i.e. constrained to a simple color mapping. Especially for more complex research questions, extending this space might be worthwhile, as the other visualization examples in this review show.

The fact that a method has not been used w.r.t. a certain goal does not necessarily mean that it would not be appropriate. It merely means that authors so far achieved their goal with a different kind of visualization. The results based on our taxonomy, cf. Fig. 6 and Table 1, hint at corresponding white spots. For example, node-link diagrams are well suited to visualize dependencies and relations. Such information could be extracted for architecture assessment as well, depicting which input images and activation levels correlate highly with activations within individual layers of the network. Such a visualization will be neither trivial to create nor to use, since this three-part correlation requires a suitable hypergraph visualization metaphor, but the information basis is promising. Similar example ideas can be constructed for the other white spots in Fig. 6 and beyond.

5 Summary and Conclusion

In this chapter we surveyed visualizations of DNNs in the computer vision domain. Our leading questions were: “Which insights can be gained about DNN models by means of visualization?” and “Which visualization methods are appropriate for which kind of insights?” A taxonomy containing the categories visualization method, visualization goal, network architecture type, computer vision task, and data set was developed to structure the domain. We found that Pixel Displays were most prominent among the methods, closely followed by heat maps. Neither is surprising, given that images (or image sequences) are the prevalent input data in computer vision. Most of the developed visualizations and/or tools are expert tools, designed for use by DNN/computer vision experts. We found no interactive visualization that allows user feedback to be integrated directly into the model. The closest approach is the semi-automatic CNNVis tool [44]. An interesting next step would be to investigate which of the methods have been used in other application areas of DNNs, such as speech recognition, where Pixel Displays are not the most straightforward visualization. It would also be interesting to see which visualization knowledge and techniques could be successfully transferred between these application areas.