
1 Introduction

The volume of data collected by today's sensors is growing rapidly, especially hyperspectral data, which provide richer physical information about the observed area.

A hyperspectral image represents the same scene across hundreds of contiguous spectral bands in various wavelength ranges. The data of a hyperspectral image are organized as a three-dimensional cube: two dimensions, denoted x and y, represent the spatial dimensions, and a spectral dimension is denoted z (see Fig. 1) [1].

Fig. 1. Acquisition and decomposition of a hyperspectral image

Note that multispectral images are composed of about a dozen bands, while a hyperspectral image exceeds a hundred bands, which implies significant requirements in terms of data processing and storage.

Unlike a classic color image, a hyperspectral image gives more physical information about each observed object in the scene. Hyperspectral imaging is therefore used in several fields, for example geology, agriculture, town planning, forestry, and the military.

To prepare a hyperspectral image for visualization or further analysis such as classification, it is necessary to reduce the dimensions of the image to a number that humans can analyze. Several dimensionality reduction techniques exist, in both iterative and parallel versions [2].

In this paper, we propose a distributed and parallel version of the PCA dimensionality reduction algorithm, tested on the Spark platform using the MapReduce paradigm.

The rest of this paper is organized as follows: In Sect. 2, we give an overview of the parallel and distributed platforms best known in the field of BIG DATA processing. In Sect. 3, we present the classic PCA dimensionality reduction technique and our proposed parallel distributed PCA. Tests of the proposed algorithm are reported in Sect. 4. Finally, we close with a conclusion and future work.

2 Parallel and Distributed Platforms

To deal with BIG DATA such as, in our case, hyperspectral images, we use parallel and distributed computation in order to obtain results in a reasonable time.

Platforms that perform parallel distributed processing have multiplied in recent years. The two best-known tools are Apache Hadoop and Apache Spark.

2.1 Apache Hadoop

Hadoop is among the most widely used distributed platforms in the BIG DATA domain, storing data with its file manager HDFS and processing it with MapReduce across thousands of nodes [3] (see Fig. 2).

Fig. 2. HDFS abstraction of physical storage

2.2 Apache Spark

Apache Spark is an open-source distributed platform for faster and more sophisticated BIG DATA processing. It was developed in 2009 by AMPLab at the University of California, Berkeley, and became an open-source Apache project in 2010 [4].

Compared to other BIG DATA platforms, Apache Spark has many advantages:

  • At the storage level: Spark can store data in memory in the form of Resilient Distributed Datasets (RDDs)

  • At the processing level: Spark extends Hadoop's disk-based MapReduce programming model to process RDDs in memory, allowing it to run programs up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk

  • In addition to the operations that exist in Hadoop (MapReduce), Apache Spark offers the possibility to work with SQL queries, streaming, graph processing, machine learning, and more (see Fig. 3; a minimal code example follows the figure)

    Fig. 3. Spark framework libraries
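
To make the RDD model above concrete, here is a minimal PySpark sketch (the application name and the toy data are our own illustrative choices, not part of the proposed algorithm): a local collection is distributed as an RDD, cached in memory, and processed with a map and a reduce.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")  # hypothetical application name

# Distribute a local collection as an RDD and cache it in memory.
numbers = sc.parallelize(range(1, 1001)).cache()

# map: transform each element in parallel; reduce: aggregate the results.
sum_of_squares = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(sum_of_squares)  # 333833500

sc.stop()
```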

3 Dimensionality Reduction

Now, to make the information hidden in the hyperspectral image cube understandable to humans, or to extract a useful part of the image, we often resort to visualization.

However, human perception cannot take in more than 3 spectral bands at once. So before visualizing a hyperspectral image, we must reduce its spectral bands to 3 without losing the quality of the information.

In recent years, several dimensionality reduction techniques have been developed to map hyperspectral data to a lower-dimensional space; important examples include ISOMAP, LLE, Laplacian eigenmap embedding, Hessian eigenmap embedding, conformal maps, diffusion maps, and Principal Component Analysis (PCA) [5].

In the following, we use PCA, the most popular technique in several domains: dimensionality reduction, image processing, data visualization, and the discovery of hidden patterns in data [6].

3.1 Classic PCA Algorithm

Principal Component Analysis is a technique for reducing the dimensions of a matrix of quantitative data; it extracts the dominant profiles from the matrix [7]. The classical PCA algorithm is described as follows:

We assume that our hyperspectral image is a matrix of size (m = L×C, N), where L is the number of lines in the image, C the number of columns, and N the number of bands, with $m \gg N$.

$$ X = \begin{bmatrix} X_{11} & \cdots & X_{1N} \\ \vdots & \ddots & \vdots \\ X_{m1} & \cdots & X_{mN} \end{bmatrix} $$

Each line of the matrix X represents a pixel vector. For example, the first pixel is represented by the vector [X11, X12, …, X1N], where X1j is the value of pixel 1 in band j.

Each column of the matrix X represents the values of all the pixels of the image in one band. For example, the first column [X11, X21, …, Xm1] represents the data of the image in band 1.
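
For illustration, assuming the hyperspectral cube is available as a NumPy array of shape (L, C, N), the matrix X is obtained with a single reshape; the variable names below are our own:

```python
import numpy as np

# Stand-in dimensions and data; a real cube would be loaded from a file.
L, C, N = 100, 120, 224
cube = np.random.rand(L, C, N)

# Flatten the two spatial dimensions: each row of X is one pixel's spectrum.
X = cube.reshape(L * C, N)  # shape (m, N) with m = L x C
```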

To apply the PCA algorithm to the hyperspectral image X, the following steps are followed (a compact code sketch is given after the list):

  • Step 1: Calculate the reduced centered matrix of X, denoted XRC:

    $$ \mathrm{XRC}_{ij} = \frac{X_{ij} - \overline{X_j}}{\sigma_j} \quad \text{for each } i = 1 \ldots m \text{ and each } j = 1 \ldots N $$
    (1)
    $$ \text{with } \overline{X_j} = \frac{1}{m}\sum\nolimits_{i=1}^{m} X_{ij} \quad \text{and} \quad \sigma_j^2 = \frac{1}{m}\sum\nolimits_{i=1}^{m} \left( X_{ij} - \overline{X_j} \right)^2 $$

    In formula 1, $\overline{X_j}$ denotes the mean of column j and $\sigma_j$ the standard deviation of column j.

  • Step 2: Calculate the correlation matrix of size (N, N), denoted Xcorr:

    $$ \mathrm{Xcorr} = \frac{1}{m}\left( \mathrm{XRC}^{T} \cdot \mathrm{XRC} \right) $$
    (2)

    In formula 2, $\mathrm{XRC}^{T} \cdot \mathrm{XRC}$ denotes the matrix product between the transpose of the matrix XRC and the matrix XRC.

  • Step 3: Calculate the eigenvalues and eigenvectors of the Xcorr matrix, denoted [$\lambda, V$]

  • Step 4: Sort the eigenvectors in descending order of the eigenvalues and take the first k columns of V (k < N)

  • Step 5: Project the matrix X onto the retained eigenvectors: U = X · V

  • Step 6: Use the new matrix U, of size (m, k), for the visualization of the hyperspectral image
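
A compact serial NumPy sketch of these six steps follows (our own illustration; the function name and return values are hypothetical):

```python
import numpy as np

def classic_pca(X, k):
    """Serial PCA over X of shape (m, N), following steps 1-6 above."""
    m = X.shape[0]
    # Step 1: reduced centered (standardized) matrix XRC (formula 1).
    XRC = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: correlation matrix of size (N, N) (formula 2).
    Xcorr = (XRC.T @ XRC) / m
    # Step 3: eigenvalues and eigenvectors (eigh suits symmetric matrices).
    eigvals, V = np.linalg.eigh(Xcorr)
    # Step 4: sort eigenvectors by descending eigenvalue, keep the first k.
    order = np.argsort(eigvals)[::-1]
    Vk = V[:, order[:k]]
    # Step 5: project X onto the retained eigenvectors.
    U = X @ Vk
    # Step 6: U, of size (m, k), is ready for visualization (k = 3 for RGB).
    return U, eigvals[order]
```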

3.2 Distributed and Parallel PCA Algorithm

Related work:

There are currently two popular libraries that provide a parallel distributed implementation of the PCA algorithm: MLlib [8] on Spark and Mahout [9], based on MapReduce. In Spark's MLlib library there is an implementation of parallel distributed PCA, but it is provided only for the Scala and Java languages; no implementation is provided for Python.

In [6], Tarek et al. showed that these two libraries do not handle very large volumes of data well and proposed a new PCA implementation, called sPCA, which offers better scalability and accuracy than its competitors.

In [2], Zebin et al. proposed a new distributed parallel implementation of the PCA algorithm on the Spark platform; the results obtained are compared with a serial implementation in Matlab and a parallel implementation on Hadoop. The comparison shows the efficiency of the proposed implementation in terms of precision and computation time.

In the following, we propose a new implementation of the parallel distributed PCA algorithm based on the Apache Spark platform, written in the Python language and using MLlib's distributed matrices.

Proposed implementation:

Since a hyperspectral image represents the same scene in several spectral bands, we can decompose it into several images, one per band (see Fig. 4).

Fig. 4. Representation of the hyperspectral image by several images

The classic PCA algorithm requires intensive computation because of the large size of hyperspectral images. In this part we present a parallel distributed implementation of the algorithm using the Spark platform.

First, we transform the matrix X (see Fig. 5a) used to represent the hyperspectral image in classic PCA into a vector of images denoted M, where each column of X becomes an image in M (see Fig. 5b; a short sketch of this decomposition follows the figure).

Fig. 5. Representation of the hyperspectral image for classic PCA and distributed PCA
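
As a sketch of this decomposition (reusing a stand-in NumPy cube of shape (L, C, N) in place of the real data), M can be built as a list of per-band RDDs, each record holding one image row:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="distributed-pca")  # hypothetical application name

L, C, N = 100, 120, 224          # stand-in dimensions
cube = np.random.rand(L, C, N)   # stand-in for the real hyperspectral cube

# M: one RDD per spectral band; each RDD record is one image row (length C).
M = [sc.parallelize(cube[:, :, t]) for t in range(N)]
```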

Each image t of M, denoted Mt, is a matrix in our implementation (represented as an RDD in Spark's parallel distributed programming model).

To obtain a parallel distributed implementation of PCA, we use Spark's MapReduce paradigm. The proposed algorithm proceeds as follows:

  • Step 1: Calculate the reduced centered matrix of M:

As seen before, the matrix M contains several images, and each image Mt is represented by a matrix (an RDD in Spark notation) of size (L, C), where L is the number of lines in the image and C the number of columns. Therefore, to calculate the reduced centered matrix of M, denoted MCR, a parallel distributed computation is carried out on each image Mt (see the graphical description of the algorithm in Fig. 6).

Fig. 6. Calculating the average of Mt and σt of MCt with Spark

  • Calculate the centered matrix of M, denoted MC:

    $$ \mathrm{MC}_{t,ij} = M_{t,ij} - \overline{M_t} \quad \text{for each } i = 1 \ldots L \text{ and each } j = 1 \ldots C $$
    (3)
    $$ \text{with } \overline{M_t} = \frac{1}{L \times C}\sum\nolimits_{i=1}^{L} \sum\nolimits_{j=1}^{C} M_{t,ij} $$

    In formula 3, $\overline{M_t}$ denotes the mean of image Mt.

  • Calculate the reduced centered matrix of M, denoted MCR:

    $$ \mathrm{MCR}_{t,ij} = \frac{\mathrm{MC}_{t,ij}}{\sigma_t} \quad \text{for each } i = 1 \ldots L \text{ and each } j = 1 \ldots C $$
    (4)
    $$ \text{with } \sigma_t^2 = \frac{1}{L \times C}\sum\nolimits_{i=1}^{L} \sum\nolimits_{j=1}^{C} \left( M_{t,ij} - \overline{M_t} \right)^2 = \frac{1}{L \times C}\sum\nolimits_{i=1}^{L} \sum\nolimits_{j=1}^{C} \left( \mathrm{MC}_{t,ij} \right)^2 $$

In formula 4, $\sigma_t$ denotes the standard deviation of image t. A PySpark sketch of this step is given below.
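
A possible PySpark rendering of this first step (our own illustration, assuming each band RDD Mt stores one image row per record, as built above):

```python
def standardize_band(Mt, L, C):
    """Return the reduced centered RDD MCRt for one band image RDD Mt."""
    n = L * C
    # Mean of the image (formula 3): sum all pixels, divide by L x C.
    mean_t = Mt.map(lambda row: row.sum()).reduce(lambda a, b: a + b) / n
    # Center the image, then compute its variance (formula 4).
    MCt = Mt.map(lambda row: row - mean_t).cache()
    var_t = MCt.map(lambda row: (row ** 2).sum()).reduce(lambda a, b: a + b) / n
    # Reduce (divide by the standard deviation) to obtain MCRt.
    return MCt.map(lambda row: row / var_t ** 0.5)

# MCR: the vector of N reduced centered band images.
MCR = [standardize_band(Mt, L, C) for Mt in M]
```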

  • Step 2: Calculate the correlation matrix of MCR, of size (N, N), denoted Mcorr

According to step 1, MCR is a vector of N images, each representing a reduced centered matrix.

The next step is to calculate the correlation matrix of size (N, N) by computing the matrix product between $\mathrm{MCR}^{T}$ and MCR, using Spark's distributed parallel MapReduce computation (see the graphical description of the algorithm in Fig. 7).

$$ \mathrm{Mcorr} = \frac{1}{L \times C}\left( \mathrm{MCR}^{T} \cdot \mathrm{MCR} \right) $$
(5)
$$ \mathrm{Mcorr}_{t,k} = \frac{1}{L \times C}\left( \mathrm{MCR}_t \cdot \mathrm{MCR}_k \right) $$
(6)
$$ \text{for each } t = 1 \ldots N \text{ and each } k = 1 \ldots N, \text{ with } \mathrm{MCR}_t \cdot \mathrm{MCR}_k = \sum\nolimits_{i=1}^{L} \sum\nolimits_{j=1}^{C} \mathrm{MCR}_{t,ij} \cdot \mathrm{MCR}_{k,ij} $$
Fig. 7. Calculation of the correlation matrix with Spark

To calculate the value of each Mcorr_{t,k} (formula 6), the image MCRt is multiplied by the image MCRk pixel by pixel, and the mean of the result is then computed (see the graphical description in Fig. 8; a code sketch follows the figure).

Fig. 8. Multiplication of two images
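
The sketch below renders this computation in PySpark (our own illustration): the two band RDDs are zipped row by row, which assumes they share the same partitioning, as they do when built identically from the same cube.

```python
import numpy as np

def corr_entry(MCRt, MCRk, L, C):
    """Mcorr[t, k] (formula 6): mean of the pixel-wise product of two bands."""
    total = (MCRt.zip(MCRk)
                 .map(lambda rows: float(np.dot(rows[0], rows[1])))
                 .reduce(lambda a, b: a + b))
    return total / (L * C)

# Build the N x N correlation matrix (formula 5); Mcorr is symmetric, so
# in practice only the upper triangle needs to be computed.
Mcorr = np.array([[corr_entry(MCR[t], MCR[k], L, C) for k in range(N)]
                  for t in range(N)])
```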

  • Step 3: Calculate the eigenvalues and eigenvectors of the Mcorr matrix: [$\lambda, V$]

  • Step 4: Sort the eigenvectors in descending order of the eigenvalues and take the first k columns of V (k < N)

  • Step 5: Project the matrix X onto the retained eigenvectors: U = X · V

  • Step 6: Use the new matrix U, of size (m, k), for the visualization of the hyperspectral image (a sketch of these last steps follows)
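
Since Mcorr is only of size (N, N), steps 3 and 4 fit comfortably on the driver, and step 5 can be distributed again. A hedged sketch, where X_rdd (an RDD holding one pixel spectrum of length N per record) is our own assumption:

```python
import numpy as np

k = 3  # number of components kept for visualization

# Steps 3-4: eigendecomposition on the driver (eigh: Mcorr is symmetric).
eigvals, V = np.linalg.eigh(Mcorr)
order = np.argsort(eigvals)[::-1]   # descending eigenvalues
Vk = V[:, order[:k]]                # first k eigenvectors

# Step 5: project each pixel spectrum onto the k retained eigenvectors.
U = X_rdd.map(lambda pixel: pixel @ Vk)

# Step 6: collect U and reshape it into an (L, C, k) image for display.
rgb = np.array(U.collect()).reshape(L, C, k)
```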

4 Experimental and Computational Details

To test the validity of the proposed algorithm on hyperspectral images using the Apache Spark platform, we chose a set of open hyperspectral images of different sizes (see Table 1) and tested our serial PCA algorithm, the serial PCA of Python's Sklearn library, and the proposed parallel distributed PCA on these datasets.

Table 1. Datasets

The hyperspectral image used in our experiments, for both the classic PCA algorithm and the proposed distributed PCA algorithm, is the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Moffett Field image, with 224 spectral bands in the 400 nm to 2500 nm range, acquired on August 20, 1992 [10].

The three algorithms are implemented in Python 3 and executed on several configurations (see Table 2); the results of the PCA experiments are given in Table 3.

Table 2. Configuration parameters
Table 3. The three most significant eigenvalues of PCA

The visualization of the hyperspectral image after applying the classic PCA algorithm or the proposed distributed PCA algorithm is given in Fig. 9.

Fig. 9. Dataset visualization before and after application of classic PCA (our serial PCA or Sklearn serial PCA) and the proposed distributed PCA

5 Conclusions

In this work, we proposed a parallel and distributed version of the PCA dimensionality reduction algorithm. The algorithm is developed in Python 3 and tested on hyperspectral images using the Spark platform. Its results coincide with those of classic PCA, and the visualization of the images after applying our reduction algorithm confirms its validity.

In future work, we will evaluate our algorithm in terms of the execution time of each method.