1 Introduction

Galaxies are gravitationally bound celestial objects composed of gas, dust, and billions of stars. The shape and general visual appearance of a galaxy give astronomers much information about its composition and evolution. Galaxy morphology plays a significant role in understanding observed large-scale phenomena and in studying the universe. It is a first step towards a greater understanding of the origin and formation process of galaxies. Many authors conclude that the morphological analysis of galaxies is a problem of interest in astronomy because it is often considered a convenient way to distinguish between galaxies that have different physical properties [1,2,3,4].

Recently, the increasing size of telescopes and advances in CCD cameras have allowed extremely large image datasets to be produced. Modern sky surveys contain millions of galaxies, far too much data to analyze manually; one of the most prominent examples is the Sloan Digital Sky Survey. This has reinforced the need for robust methods that can perform morphological analysis of large galaxy image databases. Artificial neural networks (ANNs) have recently been utilized in astronomy for a wide range of problems, from adaptive optics to galaxy classification. Advances in computational tools and algorithms have also started to allow automatic analysis of galaxy morphology according to the different Hubble types. Such algorithms are widely studied in astronomy and include model-driven methods such as [5, 6] and galaxy morphology methods [7, 8], as well as GALFIT [9], GIM2D and references therein, CAS [10], Gini [11], Ganalyzer [12], SpArcFiRe [13], and [14] and references therein. Automated classification of galaxies uses not only CCD images but also spectra [15, 16]. Classifiers can differentiate not only between the broad morphological types of elliptical and spiral galaxies ([17]; and references therein [18]) but also between four basic objects ([19] and references therein). Clearly, automated classification algorithms will prove invaluable for the analysis of such datasets, but these algorithms have yet to be applied on such scales.

The classification of galaxies based on images and spectra has been a long-term goal in astronomy. Artificial neural networks were first applied to astronomical datasets in order to classify stellar spectra [5]. Galaxies have been classified from spectra with both supervised [16] and unsupervised [15] methods. A common problem facing many of the astronomical sciences today is the increasing amount of data, which makes classification a hard problem. Therefore, a dimension reduction method is needed before a classification method is applied. Since galaxy images are represented by their light intensity, which is measured as a nonnegative value, a reduction method that preserves nonnegativity is a natural choice.

In recent years, nonnegative matrix factorization (NMF) has become a popular dimension reduction method. NMF refers to the problem of approximating a nonnegative matrix by a product of two nonnegative matrices.

The main goal of this paper is to develop an algorithm for the automated classification of galaxy images based on the NMF method. The proposed algorithm is compared with human classifications and with other algorithms. NMF extracts a list of features from galaxy images under imposed nonnegativity constraints, so that each image can be reconstructed from this list of features.

The paper is organized as follows. Galaxy classification is described in Section 2. In Section 3, nonnegative matrix factorization is introduced. The classification method and algorithm are presented in Section 4. We describe the data and results in Section 5. The conclusions are presented in the last section.

2 Galaxy classification

Classification of galaxies was established only a little more than 90 years ago, during the 1920s. Before that, galaxies were listed in catalogs of nebulae: objects that appeared fuzzy in a telescope and were therefore not stars. Edwin Hubble, the American astronomer, showed conclusively for the first time that one of the nebulae (the Andromeda “nebula” M31) is a galaxy in its own right. Hubble was also the first to set out a scheme for classifying galaxies. His classification scheme, known as Hubble’s tuning fork diagram and shown in Fig. 1 (Hubble [20]), is still in use today with some later additions and modifications. Hubble recognized three main types of galaxy: ellipticals, lenticulars, and spirals, with a fourth class, the irregulars, for galaxies that would not fit into any of the other categories. He proposed the following classifications: elliptical: E0, E3, E5, E7; spiral: S0, Sa, Sb, Sc; barred spiral: SBa, SBb, SBc; irregular: Im, IBm. This classification was updated by de Vaucouleurs to obtain the Revised Hubble System (RHS) in 1959. Other schemes were proposed by Morgan in 1958 and van den Bergh in 1960. A global classification known as the revised morphological types was provided by NASA.

Fig. 1 Hubble’s tuning fork diagram of galaxy classification

Classifying galaxies into categories is of great value to astronomers. By studying the structures of galaxies in the same category, we can better understand the processes that shape those structures. In the past, large catalogs of classified galaxies have had many practical applications; astronomers have used such catalogs to test theories about the universe. The availability of these large datasets has introduced the need for tools that can automatically analyze astronomical images.

There is currently much interest in applying classification algorithms to this problem. The challenge is to design an algorithm that reproduces classifications as well as a human expert can. Automated classification algorithms will improve the analysis of datasets, but they are not yet applied in all fields of astronomy. The goal of such algorithms is to learn the characteristics of each category and build a model from the data. This model can then be used to predict the category of future objects for which no classification is known. Such an automated procedure usually involves two steps: (i) feature extraction from the digitized image, e.g., the galaxy profile, the extent of spiral arms, the color of the galaxy, or an efficient compression of the image pixels into a smaller number of coefficients (e.g., Fourier or principal component analysis) and (ii) a classification procedure, in which a computer learns from a training set for which a human expert provided a classification.

Several machine learning algorithms have been applied to photometric data extracted from galaxy images, and directly to the galaxy images themselves, for automatic classification [21,20,23].

Many algorithms classify galaxy images with ANNs; for example, [8] uses artificial neural network techniques for galaxy classification. Data from the Sloan Digital Sky Survey have already been used for morphological classification with automated machine-learning techniques [16]. Artificial neural networks were also applied to astronomical datasets in order to classify stellar spectra [5, 6] and [24, 25]. This has been one of the most successful attempts at automatic classification.

Matrix factorization is a widely used scene-based classification method. There are many types of factorization, such as NMF [26], principal component analysis (PCA) [27], and independent component analysis (ICA) [28].

However, galaxy images are represented by sets of nonnegative numbers, e.g., numbers of occurrences or light intensities, and nonnegativity is a crucial feature that one needs to maintain during the analysis of such objects. PCA and ICA are therefore not well suited to images: their results contain negative data, and they impose properties, such as zero sum and orthogonality, that nonnegative objects by their nature never or hardly possess. To address this, we add a nonnegativity constraint, which is also important in human perception. The NMF method was proposed to construct features that provide not only a good reconstruction of the images but also nonnegativity of the features. Each feature can therefore itself be regarded as an image. Together with the participation of each feature in an image, one can establish the composition of every image in a very comprehensible way.

3 The nonnegative matrix factorization

The NMF method was proposed as a subspace method that obtains a parts-based representation of objects by imposing nonnegativity constraints. It represents data as a linear combination of basis images under these nonnegativity constraints.

The problem addressed by NMF is as follows: given a nonnegative matrix \( V \in \mathbb{R}^{n \times m} \) (the data matrix, consisting of m vectors of dimension n), find nonnegative matrix factors \( W \in \mathbb{R}^{n \times r} \) and \( H \in \mathbb{R}^{r \times m} \), with \( r \ll \min(n, m) \), that approximate the original matrix:

$$ V\approx W H $$
(1)

A standard NMF aims to minimize the squared Euclidean distance between V and WH:

$$ \underset{W, H}{\min}\ f\left( W, H\right)=\frac{1}{2}{\left\Vert V- WH\right\Vert}_F^2\quad \mathrm{subject}\ \mathrm{to}\quad {W}_{ij}\ge 0,\ {H}_{ij}\ge 0\quad \forall i, j $$
(2)

where \( \Vert \cdot \Vert_F \) is the Frobenius norm. The interpretation of W and H depends on the application. For example, in blind source separation (BSS) [29, 30], where V represents the mixed signals, W is a mixing matrix and H is a matrix of original sources. In classification, W is a basis matrix and H is a coefficient matrix, as in [26]. The multiplicative algorithm proposed by Lee and Seung, which alternately updates W and H, has become a highly popular way to perform the decomposition:

$$ \begin{array}{ll}{W}_{k+1}={W}_k\frac{V{\left({H}_k\right)}^T}{W_k{H}_k{\left({H}_k\right)}^T},\hfill & \mathrm{s}.\mathrm{t}\kern0.5em {W}_{ij}\ge 0\hfill \end{array} $$
(3)
$$ \begin{array}{ll}{H}_{k+1}={H}_k\frac{{\left({W}_{k+1}\right)}^T V}{{\left({W}_{k+1}\right)}^T{W}_{k+1}{H}_k},\hfill & \mathrm{s}.\mathrm{t}\kern0.5em {H}_{ij}\ge 0\hfill \end{array} $$
(4)

where \( (\cdot)^T \) denotes the matrix transpose, and the products and quotients between matrices of equal size are taken element-wise. The multiplicative algorithm is commonly used and simple to implement. However, the multiplicative update algorithm lacks a convergence guarantee for large-scale problems [31, 32].
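The multiplicative updates (3) and (4) can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this paper: the helper names (`matmul`, `transpose`, `frob_loss`, `multiplicative_nmf`), the random initialization, the fixed iteration count, and the small constant `eps` added to the denominators to avoid division by zero are all assumptions made for illustration.

```python
import random

def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]

def frob_loss(V, W, H):
    """Objective of Eq. (2): 0.5 * ||V - WH||_F^2."""
    WH = matmul(W, H)
    return 0.5 * sum((v - p) ** 2 for rv, rp in zip(V, WH) for v, p in zip(rv, rp))

def multiplicative_nmf(V, r, n_iter=200, eps=1e-9, seed=0):
    """Alternate the element-wise updates (3) and (4) for n_iter iterations."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start (an assumption; other initializations exist).
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(n_iter):
        # Eq. (3): W <- W * (V H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(matmul(W, H), Ht)
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
        # Eq. (4): H <- H * (W^T V) / (W^T W H), using the updated W.
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
    return W, H
```

Because every factor in the updates is nonnegative, W and H stay nonnegative without any explicit projection, which is what makes this scheme simple to implement.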

Many algorithms have been proposed to address this problem. For example, [31, 32] introduce the projected gradient update for NMF, in which the Armijo rule (Eq. 7) is used to estimate the step size. In projected gradient NMF, the two matrices W and H are updated as:

$$ {W}_{k+1}= \max \left(0,{W}_k-{\alpha}_k{\nabla}_{\mathrm{W}} f\left({W}_k, H\right)\right) $$
(5)
$$ {H}_{k+1}= \max \left(0,{H}_k-{\alpha}_k{\nabla}_H f\left( W,{H}_k\right)\right) $$
(6)

where \( \alpha_k \) is the step size, selected by the Armijo rule along the projection direction (the update direction of the variable H or W), which determines the step length in each iterative update. For the computation of H, the value of \( \alpha_k \) is defined as:

$$ {\alpha}_k={\beta}^{t_k} $$
(7)

where \( \beta \in (0, 1) \) and \( t_k \) is the first nonnegative integer t that satisfies

$$ f\left( W,{H}_{k+1}\right)- f\left( W,{H}_k\right)\le \sigma \left\langle {\nabla}_H f\left( W,{H}_k\right),\ {H}_{k+1}-{H}_k\right\rangle $$
(8)

where \( \langle \cdot , \cdot \rangle \) denotes the sum of the element-wise products of two matrices and \( \sigma \in (0, 1) \) controls the convergence speed. Following [31, 32], we set σ = 0.01 and β = 0.1 as initial values, and the updates are carried out using (6) and (7).
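One projected gradient step for H with W held fixed, following Eqs. (6) to (8), can be sketched as below. The gradient \( \nabla_H f(W, H) = W^T (WH - V) \) follows directly from Eq. (2). The function names, the cap `max_t` on the number of trial exponents, and the fallback of returning H unchanged when no trial step is accepted are assumptions of this sketch, not details taken from [31, 32].

```python
def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]

def loss(V, W, H):
    """Objective of Eq. (2): 0.5 * ||V - WH||_F^2."""
    WH = matmul(W, H)
    return 0.5 * sum((v - p) ** 2 for rv, rp in zip(V, WH) for v, p in zip(rv, rp))

def grad_H(V, W, H):
    """Gradient of f with respect to H: W^T (W H - V)."""
    WH = matmul(W, H)
    R = [[p - v for p, v in zip(rp, rv)] for rp, rv in zip(WH, V)]
    return matmul(transpose(W), R)

def armijo_step_H(V, W, H, sigma=0.01, beta=0.1, max_t=20):
    """Try alpha = beta**t for t = 0, 1, ... until condition (8) holds."""
    G = grad_H(V, W, H)
    f_old = loss(V, W, H)
    for t in range(max_t):
        alpha = beta ** t  # Eq. (7)
        # Eq. (6): gradient step followed by projection onto H >= 0.
        H_new = [[max(0.0, h - alpha * g) for h, g in zip(rh, rg)]
                 for rh, rg in zip(H, G)]
        # Inner product <grad, H_new - H> on the right-hand side of Eq. (8).
        inner = sum(g * (hn - h)
                    for rg, rn, rh in zip(G, H_new, H)
                    for g, hn, h in zip(rg, rn, rh))
        if loss(V, W, H_new) - f_old <= sigma * inner:
            return H_new, alpha
    return H, 0.0  # no acceptable step found within max_t trials
```

The update of W in Eq. (5) has the same form with the roles of the two factors exchanged, so the same routine can be reused on the transposed problem.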

4 The proposed algorithm

The overall process of our algorithm, which is based on projected gradient NMF (the PG-NMF algorithm), is given in Fig. 2; we call it the Galaxy-PGNMF Classifier. The input to the algorithm is a training set and a test set (the algorithm does not need dimension reduction as a preprocessing step, since \( r \ll \min(n, m) \)). The number of galaxy types (groups) is also an input when it is known from prior knowledge about the catalog.

Fig. 2 Nonnegative matrix factorization galaxy classification scheme

When the number of galaxy types is unknown, we determine the best number by using the consensus matrix, whose entries reflect the probability that images i and j belong to the same type. The consensus matrix is defined as the average of the connectivity matrices over all runs:

$$ \overline{C}=\frac{\sum_{k=1}^{NR}{C}_k}{NR} $$
(9)

where NR is the number of runs and \( C_k \) is the connectivity matrix of run k, of size m × m, whose elements are defined as:

$$ {c}_{ij}=\left\{\begin{array}{cc}\hfill 1\hfill & \hfill \mathrm{if}\kern0.5em \mathrm{i}\kern0.5em \mathrm{and}\kern0.5em \mathrm{j}\kern0.75em \mathrm{belong}\kern0.75em \mathrm{to}\kern0.75em \mathrm{the}\kern0.75em \mathrm{same}\kern0.75em \mathrm{type}\hfill \\ {}\hfill 0\hfill & \hfill \mathrm{otherwise}\hfill \end{array}\right. $$
(10)

The best number of types is determined from the off-diagonal entries of \( \overline{C} \).
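Equations (9) and (10) can be made concrete with a short sketch. It assumes that each run yields one type label per image; how those labels are derived from a factorization (for example, from the largest coefficient in each column of H) is not specified here and is left as an assumption. The function names and the label lists are hypothetical.

```python
def connectivity(labels):
    """Eq. (10): c_ij = 1 if images i and j share a type in this run, else 0."""
    m = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(m)] for i in range(m)]

def consensus(runs):
    """Eq. (9): average the connectivity matrices over all NR runs."""
    NR = len(runs)
    m = len(runs[0])
    total = [[0] * m for _ in range(m)]
    for labels in runs:
        C = connectivity(labels)
        for i in range(m):
            for j in range(m):
                total[i][j] += C[i][j]
    return [[total[i][j] / NR for j in range(m)] for i in range(m)]

# Hypothetical labels for four images over three runs. Runs 1 and 2 use
# different label names but group the images identically, so they yield
# the same connectivity matrix.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
C_bar = consensus(runs)
```

Since the connectivity matrix depends only on which images are grouped together, the consensus matrix is unaffected by the arbitrary ordering of types between runs.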

The algorithm then consists of two main steps. The first is the training step, in which each training sample is normalized using the \( \ell_2 \)-norm so that all images have the same scale. The PG-NMF algorithm is then used to decompose the training set into a basis matrix W and a coefficient matrix H. The second is the test (prediction) step, in which the test set S is normalized and its coefficient matrix A is computed from the basis matrix W. The basis matrix W contains the information about the different galaxy types (i.e., spiral, elliptical, lenticular, and irregular), and these bases make it easy to separate the images that belong to the same type, as in blind source separation.

To predict the class of an unknown sample \( S_i \), we use a MAX rule that selects the maximum coefficient in \( a_i \) (the coefficient vector of \( S_i \) in A) and then assigns the class label of the corresponding training sample to the new sample. The pseudocode of our algorithm is given in Algorithm 1.

figure a
figure b
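The normalization and the MAX rule described above can be sketched as follows. This is an illustration rather than a transcription of Algorithm 1: the function names are hypothetical, and the list `labels`, which maps each coefficient index to a class label, is an assumption about how the correspondence between coefficients and training classes is stored.

```python
import math

def l2_normalize(x):
    """Scale a vector to unit l2-norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)

def max_rule(a_i, labels):
    """Return the label attached to the largest coefficient in a_i."""
    k = max(range(len(a_i)), key=lambda idx: a_i[idx])
    return labels[k]

def predict(A_columns, labels):
    """Apply the MAX rule to every coefficient vector (one per test image)."""
    return [max_rule(a_i, labels) for a_i in A_columns]

# Hypothetical coefficient vectors for two test images and four labeled indices.
labels = ["elliptical", "lenticular", "spiral", "irregular"]
A_columns = [[0.1, 0.2, 0.6, 0.1], [0.7, 0.2, 0.05, 0.05]]
predicted = predict(A_columns, labels)
```

If two coefficients tie for the maximum, this sketch picks the one with the smallest index; the source does not state a tie-breaking rule.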

5 Results and discussion

The experiments were implemented in MATLAB and run in a 64-bit Windows environment. The performance of the algorithms was evaluated using several measures: NPV, sensitivity, accuracy, specificity, and F-measure. These measures are defined as follows:

  1. The classification accuracy is defined as:

    $$ \mathrm{Accuracy}=\frac{TP+ TN}{TP+ FP+ FN+ TN}\times 100 $$
    (11)

    where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

  2. Sensitivity (also called recall):

    $$ \mathrm{Sensitivity}=\frac{TP}{TP+ FN}\times 100 $$
    (12)

  3. Specificity:

    $$ \mathrm{Specificity}=\frac{TN}{FP+ TN}\times 100 $$
    (13)

  4. The F-measure or F-score is defined as:

    $$ F\text{-}\mathrm{measure}=\frac{2\times \mathrm{precision}\times \mathrm{recall}}{\mathrm{precision}+\mathrm{recall}} $$
    (14)

    where precision = TP / (TP + FP) and recall is the sensitivity defined above.
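As a worked illustration of these measures, the sketch below computes them from the four counts of a one-versus-rest confusion table. The F-measure is computed in its standard harmonic-mean form, 2 × precision × recall / (precision + recall). The function name and the counts in the example are hypothetical and are not taken from the tables of this paper.

```python
def classification_measures(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision and F-measure, in percent."""
    accuracy = 100.0 * (tp + tn) / (tp + fp + fn + tn)
    sensitivity = 100.0 * tp / (tp + fn)   # also called recall
    specificity = 100.0 * tn / (fp + tn)
    precision = 100.0 * tp / (tp + fp)
    # Harmonic mean of precision and recall.
    f_measure = 2.0 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }

# Hypothetical counts for one galaxy type treated as the positive class.
measures = classification_measures(tp=27, tn=78, fp=2, fn=3)
```

For a four-class problem, the counts would be formed once per galaxy type by treating that type as the positive class and the other three as negative.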

5.1 Data description

The proposed algorithm is applied to classify images from the EFIGI catalog [33], a database that contains a sample of galaxies of all Hubble types. The catalog merges data from standard surveys and catalogs (Principal Galaxy Catalogue, Sloan Digital Sky Survey, Value-Added Galaxy Catalogue, HyperLeda, and the NASA Extragalactic Database). The final EFIGI database is a large sub-sample of the local universe, which it samples densely. Note that images of peculiar, interacting, double, and merging galaxies are not included in our sample, nor are images that contain an unknown object. The main features of the galaxy types used in this paper are summarized in Table 1.

Table 1 Features of elliptical, lenticular, spiral, and irregular galaxies

5.2 Discussion

Since supervised learning is used, the classifier can be biased by the intuition of the person who prepares the standard training data. Therefore, training data for an image classifier that can be used for practical galaxy morphology classification should be selected and reviewed carefully.

5.2.1 Small data

In this section, we illustrate the performance of the proposed algorithm on a small number of galaxy images. We selected 30 images from each of the elliptical, lenticular, and spiral classes, and 20 images from the fourth class, the irregulars.

The results are shown in Tables 2 and 3, where Table 2 reports the accuracy and Table 3 shows the count (confusion) matrix. From Table 2, we conclude that the accuracy of the proposed algorithm for elliptical, lenticular, spiral, and irregular galaxies is 90, 90, 93, and 97%, respectively. The mean accuracy over all galaxy types is 92.7%. In this table, TP denotes the true positives for each galaxy type.

Table 2 The performance of proposed algorithm for small dataset
Table 3 Confusion matrix of the classification of spirals, ellipticals, lenticulars, and irregulars for small dataset

5.3 Large dataset

In this section, we select a larger number of galaxy images from the catalog: 225 ellipticals, 388 spirals, 99 lenticulars, and 34 irregulars. This dataset is divided into a training set and a test set, which represent nearly 75% (∼558 images) and 25% (∼188 images) of the dataset, respectively. The results of the proposed algorithm are given in Tables 4 and 5, where the accuracies for spirals, ellipticals, lenticulars, and irregulars are 92, 94, 88, and 92%, respectively. The average over all types is 91.9%, and the F-measure is 72%, which indicates the high performance of the proposed algorithm.

Table 4 The performance of proposed algorithm for large dataset
Table 5 Confusion matrix of the classification of spirals, ellipticals, lenticulars, and irregulars for large dataset

In our sample, some galaxy images are poorly classified across the different morphological classes. A standard problem in automatic classification is the projection of galaxies on the sky. These poor classifications result not only from overlapping or foreground objects and the sky projection; most of the poorly classified galaxies are also noisy. Our algorithm can work well when the SNR is low. In general, all observed images contain noise, so it would be reasonable to state, e.g., that 90% of the galaxies with an SNR of more than 60 at their effective radius (i.e., half-light radius) could be classified correctly. Some images of different morphological types are not well classified despite a high SNR, as shown in Fig. 3.

Fig. 3 Images which are not well classified

5.3.1 Comparison with other methods

We compare the proposed algorithm with related works in Table 6. From this table we conclude that the average accuracy of most of the other algorithms is nearly 91%. However, those algorithms distinguish only two or three galaxy types, whereas the Galaxy-PGNMF algorithm classifies four types of galaxy images with an accuracy in the same range (or slightly higher).

Table 6 The accuracy of related works

6 Conclusions

In this paper, we described an algorithm that can automatically classify images of spiral, elliptical, lenticular, and irregular galaxies based on the nonnegative matrix factorization (NMF) algorithm. The dataset used for the described experiments consists of galaxy images manually classified by Baillard and also classified visually by the authors (four morphological classes are used in this paper). Our experiments show that the accuracy for the small and large datasets is 92.7 and 91.9%, respectively. Many factors affect the accuracy of the proposed algorithm, such as noise in the image, bias, dust scattering, overlapping or foreground objects, sky projection, and low surface brightness. The rate of correct classification differs across the four Hubble types, from early-type to late-type galaxies, as follows for the small (large) dataset: 90% (94%) for ellipticals, 90% (88%) for lenticulars, 93% (92%) for spirals, and 97% (92%) for the fourth class, the irregulars. These results are compatible with those obtained by the human eye. We found that early types generally have high optimum values, whereas late types have low optimum values.

Finally, we conclude from the results that the Galaxy-PGNMF algorithm is a powerful alternative for galaxy morphology classification in CCD images.

Future work to improve classification will require more training data, the integration of photometric features measured for the different morphological types, and a repetition of the experiments with a larger set of galaxies.