Introduction

The autonomous identification of objects in space imagery is an enabling technology for a variety of space exploration missions. Objects in images can range from point sources that illuminate only a few pixels (which we refer to as unresolved objects) to large extended bodies that span many pixels (which we refer to as resolved objects). Currently, there are established techniques that are used to identify objects at both ends of this spectrum. In some situations, however, the object that we observe is of intermediate size (which we refer to as partially resolved), and identification in this regime is especially challenging.

Deep learning using neural networks is one way of addressing the classification problem. Recently, deep learning has gained popularity within the computer vision community [63]. As neural network architectures have matured, their computational cost has diminished while their accuracy has improved [65]. With applications such as object recognition [36], object tracking [70], and many others [6, 62], it is clear that deep learning has potential in the space domain.

While a wide variety of neural network architectures exist, this study considers neural networks designed with both convolutional and fully-connected layers. This neural network architecture is trained using a large database of images of known space objects and then tested using a separate set of test images. Section “Rendering of Training and Test Images” discusses the rendering of the database of images. Section “Neural Network Design” gives a basic background of neural networks and discusses the architecture of the neural network we have developed. Section “Network Training” examines the training of our network. Lastly, Section “Results” discusses the results for crisp images, blurred images, and novelty detection.

Background

Identification of Unresolved and Fully-Resolved Objects

Objects within space imagery may appear at varying resolutions, ranging from unresolved to fully-resolved. Unresolved objects can be well approximated as a point source and only illuminate a few pixels. A fully-resolved object appears large enough in the image that individual features on the object may be discerned.

The prevailing technique for the characterization and classification of unresolved objects is lightcurve inversion. The use of lightcurves to synthesize information about object shapes has been discussed as far back as 1906 [57] and gained popularity in recent decades with techniques developed by Kaasalainen and Torppa [29] and Kaasalainen et al. [30]. Lightcurve inversion is a mathematical technique in which light intensity over time is used to determine the rotational period and provide insight into the shape model of an object. Past work has shown success in model development and classification for both asteroids [29, 30, 45, 69] and artificial objects [15, 17, 27, 41, 42, 46].

The higher spatial resolution of fully resolved objects permits the use of classical machine vision techniques for object recognition, including: histogram of oriented gradients (HOG) [9, 72], scale invariant feature transform (SIFT) clustering [5, 43, 49], speeded up robust features (SURF) [3, 32, 59], features from accelerated segment test (FAST) [34, 55], oriented FAST and rotated BRIEF (ORB) [50, 71], and others. Many analysts, however, have recently moved away from hand-crafted features in favor of deep learning techniques for object identification [1, 47].

Between the two extremes of object resolution discussed above is the category of partially-resolved objects. These objects are no longer point sources of light, nor do they have the spatial resolution of fully-resolved objects. Partially-resolved objects have a discernible overall shape and generally span 5–30 pixels in their longest direction for a well-focused camera. Camera defocus may cause larger objects (in terms of pixel extent) to still appear “partially resolved”, as this term is used to describe situations where the object's overall shape is apparent but individual surface features are not. Thus far, there has been little work involving the identification of partially resolved objects [53]. This work provides initial steps toward one solution to this problem.

Classification with Neural Networks

Neural networks are a powerful tool for pattern recognition. One of the first successful instances of neural networks comes from Widrow and Hoff’s “Adaline” adaptive linear pattern-classification machine [73]. This network was developed to recognize binary input patterns and output a resulting binary decision.

The first convolutional neural network was developed by Fukushima in 1980 [16]. The neocognitron was inspired by the work of Widrow and Hoff and introduced two types of layers: convolutional layers and downsampling layers. Shortly after this, the method of back-propagation was applied to the problem of machine learning [56]. This method iteratively changes the weights and biases within the neural network in order to minimize the output error, and it is still used by most deep learning algorithms.

One of the first computer vision challenges met by neural networks was to classify images in the Modified National Institute of Standards and Technology (MNIST) database [39]. The database includes tens of thousands of handwritten digits. Early neural networks used to solve this problem were relatively “shallow” by today’s standards, such as LeNet [38], a network with two convolutional and two fully-connected layers. LeNet was highly successful at this classification task when compared to the conventional computer vision methods being used at the time.

As the images requiring classification became progressively more complex, so did the convolutional neural network architectures, leading to deeper neural networks. One of the first “deep” neural networks, AlexNet [36], was trained to classify objects within the ImageNet database [8]. The ImageNet database contains millions of labeled images of thousands of different objects. AlexNet’s architecture includes five convolutional layers and three fully-connected layers. To reduce over-fitting, where a neural network is not able to generalize outside of its training data, the network utilized random dropout [22]. The network performance was also improved by using the Rectified Linear Unit (ReLU) activation function [7], which removes negative activations.

After AlexNet, the depth of convolutional neural networks increased further with notable examples being developed by industrial research laboratories, such as Microsoft’s ResNet [21] and Google’s DeepDream [64]. These networks include millions of parameters and continue to grow with newer iterations. A majority of the current literature for object recognition using deep learning makes use of a variation of convolutional neural networks [31, 40, 60]. Our network architecture utilizes both convolutional and fully-connected layers, which will be discussed in greater detail in Section “Neural Network Design”.

Of special relevance here, convolutional neural networks have been used by astronomers to classify a variety of space objects in images [2, 10, 26, 33, 51].

Other Machine Learning Classification Methods

Although neural networks are one of the most widely used machine learning methods for classification, other classification methods exist. Random decision forests were developed in [23] and use ensembles of decision trees for classification. The method has been used for both terrestrial [4, 61] and space [48, 54] image classification.

The k-nearest neighbors algorithm, developed in 1951 [13], is a non-parametric classification algorithm. Similar to random forest, k-nearest neighbors has also been used for image classification [14, 66, 74].

Rendering of Training and Test Images

Neural networks require a large and diverse set of images for both training and testing. We accomplish this by rendering synthetic images of 14 objects, seven spacecraft and seven asteroids, using 3D triangular mesh models obtained from the NASA 3D Resources repository and the NASA Planetary Data System (PDS). The seven spacecraft include Cassini, the Far Ultraviolet Spectroscopic Explorer (FUSE), Galileo, the Hubble Space Telescope, the International Space Station (ISS), MAVEN, and Voyager. The seven asteroids include 101955 Bennu, 6489 Golevka, (8567) 1996 HW1, 25143 Itokawa, 216 Kleopatra, 4486 Mithra, and 4 Vesta. These 14 objects are known to the network, which attempts to classify any observed object using one of these 14 labels. Since we may often encounter novel objects that are unknown to the network and for which no training has been performed, we render an additional ten objects from other repositories. These additional ten objects are used to evaluate how our network behaves when challenged with an object outside its training set and whether such a scenario can be reliably identified.

All rendering is performed using the open-source Blender software package. The simulation environment is configured to take images of the objects from varying relative attitudes, distances, and phase angles. Test and training images are 8-bit monochrome with dimensions of 30 × 30 pixels. The objects are placed such that they span 5 to 30 pixels in their longest direction within these images. In real imagery, where the full image is substantially larger, we would perform classification on a 30 × 30 window centered on the observed object. Figure 1 demonstrates this process.

Fig. 1 Example of converting from a full image to a 30 × 30 window centered on the observed object

The images for each object were rendered using Blender’s application programming interface (API). For each object, we rendered 60,000 training images and a separate set of 5,000 test images. Both the training and test images have a uniform distribution in range, relative attitude, and phase angle. The range spans the distances at which the object appears 5 to 30 pixels in width, and the phase angle spans 0 to 138 deg. Once a range is chosen for an image, we randomly sample a location on a sphere with its origin at the centroid of the object and a radius equal to the chosen range. We then adjust the camera’s attitude such that the object is always at the center of the 30 × 30 patch. This allows us to render each object from a uniform distribution of relative attitudes. The illumination direction is then calculated using the known phase angle and the camera quaternion.
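As an illustration of this sampling geometry, the sketch below draws a uniformly random viewing direction on a sphere of the chosen radius and points the camera boresight back at the object centroid. It is a schematic reconstruction only; the function name and the range values are hypothetical and not taken from our Blender API script.

import numpy as np

rng = np.random.default_rng(0)

def sample_camera_pose(r_min, r_max, centroid):
    # Range at which the object spans roughly 5 to 30 pixels (limits depend on the object)
    r = rng.uniform(r_min, r_max)
    v = rng.normal(size=3)                     # isotropic random direction
    direction = v / np.linalg.norm(v)          # uniform point on the unit sphere
    position = centroid + r * direction        # camera location on the sampling sphere
    boresight = (centroid - position) / r      # camera points back at the object centroid
    return position, boresight

position, boresight = sample_camera_pose(r_min=100.0, r_max=600.0, centroid=np.zeros(3))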

The object ID, relative attitude, and range are all unknown to the network when it is challenged with a test (or operational) image. In contrast, a good estimate of the phase angle, g, is generally known since the Sun is presumed to be much farther away than the distance between the camera and the observed object. Therefore, taking advantage of this knowledge, the images for each object are separated into five overlapping bins based on the known phase angle. This results in 12,000 training images and 1,000 test images per bin, per object. Figure 2 provides a visual representation of the phase angle ranges used by the five training bins. Although the training data is uniformly sampled over phase angle, the trained networks may not be fully developed at their boundaries. The operating bins therefore reduce the phase angle range by 8 degrees, ensuring that test images conform to the constraints of the trained networks. Table 1 gives the exact values for the phase angle ranges of the training and operating bins. Examples of the rendered images are provided in Fig. 3. Previous work has shown that convolutional neural networks trained on the entire phase angle range perform worse than networks trained on a portion of that range [12]. For this reason, each of the five overlapping phase angle bins trains a separate convolutional neural network of the same architecture.
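Because the bins overlap, a rendered image may belong to more than one bin. A minimal helper of the following form could route each image by phase angle; the (lo, hi) bin boundaries come from Table 1 and are passed in as parameters here rather than hard-coded, since their exact values are not repeated in the text.

def assign_bins(phase_angle_deg, bin_ranges):
    # bin_ranges: list of (lo, hi) phase angle limits in degrees, one pair per training bin,
    # taken from Table 1; an image can fall in more than one overlapping bin
    return [i for i, (lo, hi) in enumerate(bin_ranges) if lo <= phase_angle_deg <= hi]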

Fig. 2 A visualization of the five overlapping ranges of phase angles (g) used for the training and operating bins. As the angle between the Sun and the observer increases, the visibility of the object degrades

Table 1 Phase angle ranges for both training and operating bins
Fig. 3 Images are rendered using a 3D model at varying attitudes, ranges, and phase angles. These images are then sorted into five overlapping bins based on phase angle

Additionally, to better understand the effect of defocus on performance, we also develop sets of training and test images that are blurred by varying amounts. These sets are versions of the crisp images already rendered, with defocus simulated by convolving each image with a Gaussian kernel having the appropriate standard deviation. The Gaussian kernel was selected since it is known to be a good approximation of blur from defocus for well-built cameras [28, 52]. Thus, blur is simulated according to

$$ I_{blur} = G\circledast I $$
(1)

where \(\circledast \) is the 2D convolution operator, I is the original image, \(I_{blur}\) is the blurred image, and G is the Gaussian kernel.
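As a concrete example of Eq. (1), the following sketch blurs a rendered 8-bit image with SciPy's Gaussian filter, which performs the equivalent separable convolution given the kernel's standard deviation; the array contents and the sigma value are purely illustrative.

import numpy as np
from scipy.ndimage import gaussian_filter

image = np.random.randint(0, 256, (30, 30)).astype(np.float32)   # stand-in for a crisp rendering
blurred = gaussian_filter(image, sigma=2.0, mode="constant")      # Eq. (1): convolve with Gaussian G
blurred = np.clip(np.rint(blurred), 0, 255).astype(np.uint8)      # back to an 8-bit image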

Neural Network Design

Background of Neural Networks

Although there are a multitude of applications for neural networks, ranging from function approximation [25] to natural language processing [6], this work focuses exclusively on classification and pattern recognition in digital images. Originally, the neural networks used for these problems were made of only fully-connected layers. The typical architecture for these fully-connected neural networks included an input layer consisting of a flattened array of the image. Each pixel value of the flattened array connects to every neuron in the next layer. The layers of neurons between the input layer and the output layer are called the hidden layers of the network. Each of the hidden layers is fully connected to both the previous layer and the next layer. The neurons of the output layer are used to determine how the network has classified the image. This is illustrated in Fig. 4.

Fig. 4 A basic visualization of a fully-connected network showing the three major components of the network

The output of a neuron in a fully-connected network is calculated as

$$ x^{\left( n+1\right)}=\omega^{\left( n+1\right)}x^{\left( n\right)}+b^{\left( n+1\right)} $$
(2)

where x is a vector of the neuron activations, ω is the matrix of weights associated with the connection of neurons between two subsequent layers, and b is the bias vector. The superscript defines the layer.

Convolutional neural networks were developed to improve upon the successes of the fully-connected network for computer vision applications [20]. Unlike the fully-connected neural network, a convolutional neural network allows for shared weights among pixels and neurons rather than a single weight being assigned to every pixel and neuron in the network. Similar to the 2D convolutions used extensively in classical computer vision, a convolutional layer takes a filter and convolves it with an image to create a feature map [44]. The output pixels comprising the feature map are calculated as

$$ x_{l}^{\left( n+1\right)} = \sum\limits_{k}\omega_{k,l}\circledast x_{k}^{\left( n\right)}+b^{\left( n+1\right)}_{l} $$
(3)

where the superscript defines the layer, l indexes the feature map of the current layer, and k indexes the input channel. Since convolutional layers may output multiple feature maps, the number of kernels, ω, used to compute the current feature map equals the number of feature maps in the previous convolutional layer. These kernels make up a filter, and the number of filters equals the number of feature maps desired for the current convolutional layer. Each filter has a bias, b, associated with it. The filters move across the image, building up the feature maps.
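The following NumPy/SciPy sketch evaluates Eq. (3) for a single output feature map, summing a per-channel 2D convolution with zero padding and adding a scalar bias. The channel count and kernel values are illustrative assumptions, not the layer sizes used in our network.

import numpy as np
from scipy.signal import convolve2d

def conv_feature_map(x_prev, kernels, bias):
    # x_prev: (K, H, W) feature maps from layer n; kernels: (K, 3, 3); bias: scalar b_l
    out = np.zeros(x_prev.shape[1:])
    for k in range(x_prev.shape[0]):
        # "same" zero padding keeps the output the same size as the input feature maps
        out += convolve2d(x_prev[k], kernels[k], mode="same", boundary="fill")
    return out + bias

x = np.random.rand(4, 30, 30)            # 4 input feature maps, each 30 x 30
w = 0.1 * np.random.randn(4, 3, 3)       # one 3 x 3 kernel per input feature map
print(conv_feature_map(x, w, bias=0.05).shape)   # (30, 30)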

Both fully-connected and convolutional neural networks include parameters that can be changed to improve the performance of the networks. These parameters include their weights, either from connections between fully-connected layers or from filters in a convolutional layer, and each neuron or feature map's associated bias. The way these parameters are optimized in training is discussed in Section “Network Training”.

A more detailed discussion of fully-connected and convolutional neural networks can be found in most texts on deep learning, e.g. [19].

Convolutional Neural Network Architecture

The neural network architecture developed for this work was implemented using the PyTorch machine learning library and includes convolutional layers, max pool layers, and fully-connected layers. A convolutional layer may be followed by a max pooling layer, which reduces the dimensions of the feature maps by replacing each pooling window with its maximum element. The final feature map array is then flattened and connected to a fully-connected neural network.

The new network we designed for the present task includes five convolutional layers, with each layer using a 3 × 3 kernel. Input images and feature maps are zero padded, such that each convolutional layer produces feature maps of the same size as its input. The network also includes two max pool layers and three fully-connected layers. It should be noted that ReLU is used as the activation function for each convolutional and fully-connected layer. This convolutional neural network architecture was selected after testing multiple architectures and finding the approach presented here to produce the best results. A visualization of our convolutional neural network architecture can be seen in Fig. 5. Five separate networks of this architecture are then trained on the five phase angle bins.
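A minimal PyTorch sketch consistent with this description is shown below: five 3 × 3 convolutional layers with zero padding, a max pool after every two convolutional layers, three fully-connected layers, and ReLU activations, operating on a 30 × 30 monochrome input and producing 14 class scores. The channel counts and hidden-layer widths are assumptions for illustration, since they are not stated here, and no softmax is applied because the cross-entropy loss in Eq. (4) operates on raw output activations.

import torch
import torch.nn as nn

class PartiallyResolvedNet(nn.Module):
    def __init__(self, n_classes=14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 30 x 30 -> 15 x 15
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 15 x 15 -> 7 x 7
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, n_classes),              # raw class activations (logits)
        )

    def forward(self, x):                          # x: (batch, 1, 30, 30)
        return self.classifier(self.features(x))

logits = PartiallyResolvedNet()(torch.zeros(2, 1, 30, 30))
print(logits.shape)                                # torch.Size([2, 14])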

Fig. 5 Our convolutional neural network architecture includes five convolutional layers, with a max pool layer after every two convolutional layers, and three fully-connected layers at the end of the network

Network Training

Training of a neural network is divided into two pieces: learning and generalization. Learning is the process of adaptively understanding the fundamental aspects of the training data in order to correctly classify it. Generalization is being able to correctly classify data outside of the training data [75]. To train a neural network, a training set of images is used along with a loss function and an optimization method to adjust the parameters in the network.

A network begins its learning process by splitting the training data into batches. Our network uses cross-entropy loss as its loss function. The loss function, L, for each batch of training images is calculated as

$$ L = -\ln\left\{\frac{\exp\left[x^{\left( k\right)}\left( class\right)\right]}{{\sum}_{i}\exp\left[x^{\left( k\right)}\left( i\right)\right]}\right\}=-x^{\left( k\right)}\left( class\right)+\ln\left\{\sum\limits_{i}\exp\left[x^{\left( k\right)}\left( i\right)\right]\right\} $$
(4)

where the superscripts define the neuron layer, which in this case is the output layer k. Thus, \(x^{\left (k\right )}\left (i\right )\) is the activation of the ith neuron in the output layer. The activation \(x^{\left (k\right )}\left (class\right )\) is the activation of the output neuron corresponding to the correct class of the object in the image. The calculated loss is then utilized by the optimizer.
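As a quick numeric check of Eq. (4), the snippet below compares the formula, evaluated for a single image with arbitrary illustrative activations, against PyTorch's built-in cross-entropy loss; the two agree because the cross-entropy of the raw outputs is the negative log of the softmax probability assigned to the correct class.

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # x^(k): output activations for one image (illustrative)
target = torch.tensor([0])                  # index of the correct class

manual = -logits[0, target[0]] + torch.logsumexp(logits[0], dim=0)   # right-hand side of Eq. (4)
builtin = F.cross_entropy(logits, target)                            # PyTorch cross-entropy loss
print(float(manual), float(builtin))                                 # identical values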

Using Eq. 4 for a batch of images, the network parameters are modified through backward propagation with a Stochastic Gradient Descent (SGD) optimizer. We found a momentum of 0 and a static learning rate of 0.0001 to produce the best results for this network. SGD is a common choice for neural network training and is used here to optimize the weight and bias parameters [24].

This process of calculating the loss for a batch of images and then optimizing the weight and bias parameters is repeated until no batches are left. The images are then reshuffled and split into new batches. One full pass through the training data is called an epoch. We trained our network for 250 epochs with a batch size of 10.
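A training loop under these settings might look like the sketch below: cross-entropy loss, SGD with a learning rate of 0.0001 and zero momentum, a batch size of 10, and 250 epochs with the data reshuffled every epoch. The tensors standing in for one phase-angle bin of rendered images are placeholders, and PartiallyResolvedNet refers to the hypothetical architecture sketch given earlier.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors standing in for one phase-angle bin of rendered training images
train_images = torch.rand(12000, 1, 30, 30)
train_labels = torch.randint(0, 14, (12000,))
loader = DataLoader(TensorDataset(train_images, train_labels),
                    batch_size=10, shuffle=True)     # batches are reshuffled every epoch

model = PartiallyResolvedNet()                       # hypothetical architecture sketch from above
criterion = torch.nn.CrossEntropyLoss()              # Eq. (4), averaged over the batch
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001, momentum=0.0)

for epoch in range(250):                             # 250 epochs
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                              # back-propagation of the loss
        optimizer.step()                             # SGD update of weights and biases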

Each of our networks, whether trained on crisp or blurred images, experiences a minor amount of overfitting. Overfitting can be observed by monitoring the training and validation loss at each epoch while the network is training. The network is said to be “overfit” if the loss of the validation set (a small set of training images not used for training) remains higher than the loss of the training set. In general, overfit networks are less able to generalize. Figures 6 and 7 show that there is only a minor amount of overfitting, with the network trained on crisp images experiencing less overfitting than the network trained on blurred images.

Fig. 6 Training and validation cross-entropy loss of a convolutional neural network trained on crisp images. There is a minor amount of overfitting, but in general the validation loss matches the training loss

Fig. 7 Training and validation cross-entropy loss of a convolutional neural network trained on Gaussian-blurred images with a standard deviation of two pixels. Compared to Fig. 6, there is more overfitting, but it is still relatively minor

Results

Crisp Image Results

The convolutional neural networks were trained on overlapping ranges of phase angles as shown in Table 1. Each bin included 12,000 images per object for training and 1,000 images per object for testing. Each image used in this assessment contains no blurring or distortion. Further, the object in each image is placed at a uniformly distributed attitude and at a uniformly distributed range (making it appear 5–30 pixels in length), with both the relative attitude and range being unknown to the neural network. A convolutional neural network was trained and assessed for each bin, with success rates shown in Table 2.

Table 2 Success rate (%) of networks trained and tested on only crisp images. Success rates of each object degrade at higher bin numbers

As can be seen in Table 2, the success rates stay above 98% in bin 1 and above 86% across all five bins. The success rates tend to decrease at higher phase angles, but this is expected since object visibility decreases at higher phase angles. Thus, the success rate of each object degrades as we move from bin 1 to bin 5.

The confusion matrix for the bin 1 convolutional neural network can be found in Table 3. The data show that a majority of our classifications for the bin 1 convolutional neural network are correct, with the largest confusion coming from our network incorrectly classifying Bennu as Vesta in about 2% of the test cases.

Table 3 Confusion matrix for the bin 1 convolutional neural network. The confusion matrix provides information on the correct classifications (diagonal of the matrix) and the incorrect classifications (off diagonals)
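For reference, a confusion matrix such as Table 3 can be accumulated from network predictions along the lines of the sketch below; the model and test_loader names follow the earlier hypothetical sketches and do not represent the evaluation code actually used to produce Table 3.

import torch

def confusion_matrix(model, test_loader, n_classes=14):
    counts = torch.zeros(n_classes, n_classes, dtype=torch.long)
    model.eval()
    with torch.no_grad():
        for images, labels in test_loader:
            preds = model(images).argmax(dim=1)      # predicted label for each test image
            for t, p in zip(labels, preds):
                counts[t, p] += 1                    # rows: true object, columns: prediction
    return 100.0 * counts / counts.sum(dim=1, keepdim=True)   # row-normalized percentages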

Examples of images with objects classified by the convolutional neural networks can be found in Fig. 8.

Fig. 8 Examples of images and the classification results produced by our convolutional neural network from Fig. 5. Correct classifications are outlined in green. Incorrect classifications are outlined in red, along with details of the incorrect assignment

Blurred Image Results

Section “Crisp Image Results” summarized the success rates of the network when trained with crisp images. In actuality, images captured during a spaceflight mission may be blurred due to defocus, jitter, or other effects. The network must be assessed on its ability to generalize not only over varying ranges and phase angles, but also over varying blur. Two networks are assessed: one is a neural network trained with only crisp images, and the other is a neural network trained on images blurred by varying amounts. Examples of the images used to test the networks can be seen in Fig. 9.

Fig. 9 Examples of test images that are blurred using a Gaussian kernel. Each crisp test image for all objects is blurred to varying degrees

Figure 10 shows how the bin 1 convolutional neural network responds to increased blur when trained on crisp images only (left) and when trained on images blurred with a standard deviation between 0 and 2 pixels (right). In all cases, classification performance decreases with increased blur, which makes sense since information is being lost (the blur is effectively a 2D low-pass filter). What is striking, however, is how quickly the performance of the network trained on crisp images deteriorates for blurs above 0.5 pixels. This suggests the network generalizes poorly to unexpected image blur or defocus. When trained with images of appropriate blur, however, the network shows reasonable performance over the entire range of blurs investigated.

Fig. 10 Object classification performance for two convolutional neural networks tested with increasingly blurred images. One was trained with only crisp images for bin 1 (left) and the other was trained with a range of blurred images for bin 1 (right)

We then trained and tested five of our convolutional neural networks on bin 1 images with fixed degrees of Gaussian blur. The standard deviation of the Gaussian blur ranges from zero pixels to two pixels. Each network is trained and tested on the same number of images specified in Section “Crisp Image Results”, but these networks were trained for 150 epochs. As can be seen from Table 4, the success rates stay above 90%, and there is no significant degradation in the success rates as the training and test images are blurred to an increasing extent. A confusion matrix for the fifth convolutional neural network, trained and tested on bin 1 images with a fixed blur width of 2 pixels, is provided in Table 5.

Table 4 Success rate (%) of networks trained and tested on bin 1 images with fixed blur widths
Table 5 Confusion matrix for the bin 1 convolutional neural network both trained and tested on images with a fixed Gaussian blur width of 2 pixels

Novelty Detection Results

A significant problem with conventional neural network architectures (including our own) is that any object within an image will be classified by the neural network, even if the network was never trained with that particular object. The network will give its “best guess” and classify the image incorrectly with the label from the training set that produces the best match. It is unreasonable to train a classifier for all objects in the known universe; therefore, a method must be developed for our neural networks to detect and classify novel objects in imagery. This is where the distinction between generalization and novelty detection becomes important. A network's ability to generalize is its ability to correctly classify a known object in an image it was not trained on and under different operating conditions, while novelty detection is the ability to classify an object outside of the group of known objects as unknown.

In order to assess the network's ability to detect novel objects, 2,000 images of the ten novel objects were rendered in the bin 1 phase angle range.

Here, we explore the use of classifier activations as a means of novelty detection, similar to the method explored by LeCun et al. [37]. This approach classifies objects as unknown by using the activation of the largest classifier and the activation gap between the largest and second largest classifiers. Specifically, using the set of test images, we counted the number of cases observed for each combination of top classifier score and classifier gap as a percentage of top classifier score. Example results are shown as a heat map in Fig. 11 for a few of the known objects and for all of the unknown objects. The aim is to determine whether an inequality constraint exists (red line in Fig. 11) such that scenarios falling above the constraint are known objects and scenarios falling below the constraint are unknown objects. It is immediately evident, however, that the substantial overlap of the classifier activation behavior between the known objects and the unknown objects precludes the effective use of such an approach.
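A schematic version of this activation-based test is sketched below: compute the top activation and the gap to the second-largest activation as a percentage of the top score, then declare the object unknown if the pair falls below a linear inequality constraint. The slope and intercept are hypothetical placeholders, since, as discussed below, each trained network would require its own tuned constraint.

import torch

def is_unknown(activations, slope, intercept):
    # activations: output-layer activations for one image (assumed positive at the top score)
    top2 = torch.topk(activations, k=2).values
    gap_pct = 100.0 * (top2[0] - top2[1]) / top2[0]     # gap as a percentage of the top score
    # The image is declared unknown when (top score, gap) falls below the line
    # gap_pct = slope * top_score + intercept
    return bool(gap_pct < slope * top2[0] + intercept)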

Fig. 11 Heat maps showing the number of cases observed for each combination of top classifier score and classifier gap as a percentage of top classifier score, for four known objects and for all of the unknown objects combined. The red line is the inequality constraint, with everything below it classified as unknown. The top two plots show known objects that are not greatly affected by this method of classifying unknown objects, and the middle two plots show two known objects that are affected by this method

Despite these drawbacks, this method was implemented for our bin 1 convolutional neural network, and the results can be seen in Table 6. Using this method, most of the known object success rates stayed above 90%, with one object dropping to 86%. The method correctly labeled only 30.7% of the unknown objects as unknown (with the remaining cases incorrectly labeling the unknown objects as one of the catalog objects). Moving the inequality constraint to capture more images in the unknown object heat map may increase the success rate of classifying unknown objects, but it also reduces the success rates for the known objects. The top classifier value and gap of an image will be different for every network. Therefore, each trained network would need to be assessed and a new constraint developed. These results, together with the inability of convolutional neural networks to provide a confidence in their classifications, show that alternative methods must be explored.

Table 6 Success rate (%) of network after unknown object method is implemented. This novelty detection method slightly drops the success rate of the known objects while only successfully classifying 30.7% of the unknown objects as unknown

There are a variety of emerging techniques for novelty detection in deep learning that will be explored in future work, such as developing an autoencoder [11, 68] or a generative adversarial network (GAN) [58]. Both methods require an entirely new architecture and method of training since both are considered unsupervised learning. Both the autoencoder and the GAN are used as generative models, with the autoencoder detecting novelties by how poorly the network recreates the novel data. For the GAN, the adversarial segment of the network is able to detect whether an object in the image is unknown to the network. These methods have proven to be more successful than simply evaluating classifier activations, but a separate model must be trained to discern between novel and known objects.

Conclusion

Using a convolutional neural network architecture, we developed a technique for the identification of partially resolved space objects in images at varying distances, relative attitudes, phase angles, and levels of defocus/blur. Three-dimensional triangular mesh models of both asteroids and spacecraft were used to develop synthetic data sets to both train and evaluate our network architecture. The results involving crisp image training and testing showed that using multiple convolutional neural networks trained on overlapping phase angle bins provided high success rates in classifying objects within an image. The results involving blurred images showed that reasonable performance is maintained, so long as the network is trained with a mix of crisp and blurred images. A method for novelty detection was implemented, with results showing a need for further exploration of the topic.