Abstract
In recent years, the field of data storage and processing has undergone a radical evolution because of the huge mass of data generated every minute. As a result, traditional tools and algorithms have become unable to keep up with this exponential growth and to yield results within a reasonable time. Among the solutions that can be adopted to solve this problem are distributed data storage and parallel processing. In our work we used the distributed platform Spark and a massive data set called a hyperspectral image. Indeed, hyperspectral image processing tasks, such as visualization and feature extraction, have to deal with the large dimensionality of the image. Several dimensionality reduction techniques exist in the literature. In this paper, we propose a distributed and parallel version of Principal Component Analysis (PCA).
1 Introduction
The volume of data collected by today's sensors grows rapidly, especially hyperspectral data, which provide richer physical information about the observed area.
A hyperspectral image represents the same scene across hundreds of contiguous spectral bands in various wavelength ranges. The data of a hyperspectral image are organized as a three-dimensional cube: two spatial dimensions denoted x and y, and a spectral dimension denoted z (see Fig. 1) [1].
Note that multispectral images are composed of a dozen bands, while a hyperspectral image exceeds a hundred bands, which implies significant requirements in terms of data processing and storage.
Unlike the classic color image, the hyperspectral image gives more physical information about each observed object in the scene. The technique of hyperspectral imaging is therefore used in several fields, for example geology, agriculture, urban planning, forestry, and the military domain.
To prepare a hyperspectral image for visualization or further analysis such as classification, it is necessary to reduce its dimensions to a number that can be analyzed by humans. Several dimensionality reduction techniques exist, in both iterative and parallel versions [2].
In this paper, we propose a distributed and parallel version of the PCA dimensionality reduction algorithm, tested on the Spark platform using the MapReduce paradigm.
The rest of this paper is organized as follows. Section 2 gives an overview of the distributed parallel platforms best known in the field of Big Data processing. Section 3 presents the classic PCA dimensionality reduction technique and our proposed parallel distributed PCA. Tests of the proposed algorithm are reported in Sect. 4. Finally, we conclude and outline future work.
2 Parallel and Distributed Platforms
In order to deal with Big Data such as, in our case, hyperspectral images, we use parallel and distributed computation to obtain results in a reasonable time.
Platforms that perform parallel distributed processing have multiplied in recent years. The two best-known tools are Apache Hadoop and Apache Spark.
2.1 Apache Hadoop
Hadoop is among the most widely used distributed platforms in the Big Data domain, storing data with its file manager HDFS and processing data with MapReduce on thousands of nodes [3] (see Fig. 2).
2.2 Apache Spark
Apache Spark is an open-source distributed platform for faster and more sophisticated Big Data processing. It was developed by the AMPLab at the University of California, Berkeley, in 2009 and became an open-source Apache project in 2010 [4].
Compared to other Big Data platforms, Apache Spark has many advantages:
- At the storage level: Spark can keep data in memory in the form of Resilient Distributed Datasets (RDDs).
- At the processing level: Spark extends Hadoop's MapReduce model, which works on disk, to process RDDs in memory, allowing it to run programs up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk.
- In addition to the operations available in Hadoop (MapReduce), Apache Spark supports SQL queries, streaming, graph processing, and machine learning (see Fig. 3).
3 Dimensionality Reduction
Now, to let humans understand the information hidden in the hyperspectral image cube, or to extract a useful part of the image, we often resort to visualization.
However, human perception cannot visualize more than three spectral bands at once. So before visualizing our hyperspectral image, we must reduce the spectral bands to three without losing the quality of the information.
In the last few years, several dimensionality reduction techniques have been developed to map hyperspectral data to a lower-dimensional space; important examples include ISOMAP, LLE, Laplacian eigenmaps, Hessian eigenmaps, conformal maps, diffusion maps, and Principal Component Analysis (PCA) [5].
In the following, we use PCA, the most popular technique in several domains: dimensionality reduction, image processing, data visualization, and discovery of hidden patterns in data [6].
3.1 Classic PCA Algorithm
Principal Component Analysis is a technique for reducing the dimensions of a matrix of quantitative data. It allows the dominant profiles to be extracted from the matrix [7]. The classical PCA algorithm is described as follows.
We assume that our hyperspectral image is a matrix X of size (m, N), where m = L×C, L is the number of lines in the image, C is the number of columns, and N is the number of bands, with \( {\text{m}} \gg {\text{N}} \).
Each line of the matrix X represents a pixel vector. For example, the first pixel is represented by the vector [X11, X12, …, X1N], where X1j is the value of pixel 1 in band j.
Each column of the matrix X represents the values of all the pixels of the image in one band. For example, the column [X11, X21, …, Xm1] represents the data of the image in band 1.
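The cube-to-matrix layout described above can be sketched with NumPy (the small dimensions below are purely illustrative): the (L, C, N) cube is flattened into an (m, N) matrix whose rows are pixel spectra.

```python
import numpy as np

# Hypothetical tiny cube: L lines, C columns, N spectral bands
L, C, N = 4, 5, 3
cube = np.arange(L * C * N, dtype=float).reshape(L, C, N)

# Flatten the spatial dimensions: X has shape (m, N) with m = L*C,
# so each row of X is the spectral vector of one pixel
X = cube.reshape(L * C, N)

print(X.shape)  # (20, 3)
```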
To apply the PCA algorithm to the hyperspectral image X, the following steps are followed:
- Step 1: Calculate the centered-reduced matrix of X, denoted XRC:
$$ {\text{XRC}}_{ij} = \frac{X_{ij} - \overline{X_{j}}}{\sigma_{j}}\quad {\text{for each}}\;i = 1 \ldots m\;{\text{and}}\;j = 1 \ldots N \tag{1} $$
$$ {\text{with}}\quad \overline{X_{j}} = \frac{1}{m}\sum\nolimits_{i = 1}^{m} X_{ij} \quad {\text{and}}\quad \sigma_{j}^{2} = \frac{1}{m}\sum\nolimits_{i = 1}^{m} \left( X_{ij} - \overline{X_{j}} \right)^{2} $$
In Formula (1), \( \overline{X_{j}} \) denotes the mean of column j and \( \sigma_{j} \) the standard deviation of column j.
- Step 2: Calculate the correlation matrix of size (N, N), denoted Xcorr:
$$ {\text{Xcorr}} = \frac{1}{m}\left( {\text{XRC}}^{T} \cdot {\text{XRC}} \right) \tag{2} $$
In Formula (2), \( {\text{XRC}}^{T} \cdot {\text{XRC}} \) denotes the matrix product between the transpose of the matrix XRC and the matrix XRC.
- Step 3: Calculate the eigenvalues and eigenvectors of the Xcorr matrix, denoted \( [\lambda, V] \).
- Step 4: Sort the eigenvectors in descending order of their eigenvalues and keep the first k columns of V \( ({\text{k}} < N) \).
- Step 5: Project the matrix X onto V: U = X · V.
- Step 6: Use the new matrix U of size (m, k) for the visualization of the hyperspectral image.
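Steps 1 to 6 of the classic algorithm can be sketched in NumPy as follows (a minimal serial sketch on hypothetical toy data, not the distributed implementation proposed later):

```python
import numpy as np

def classic_pca(X, k=3):
    """Classic PCA: standardize, correlation matrix, eigendecomposition,
    then project onto the top-k eigenvectors."""
    m, N = X.shape
    # Step 1: centered-reduced matrix (population statistics, factor 1/m)
    Xrc = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: correlation matrix of size (N, N)
    Xcorr = (Xrc.T @ Xrc) / m
    # Step 3: eigenvalues/eigenvectors (symmetric matrix, so eigh applies)
    lam, V = np.linalg.eigh(Xcorr)
    # Step 4: sort by descending eigenvalue and keep the first k columns
    order = np.argsort(lam)[::-1]
    Vk = V[:, order[:k]]
    # Step 5: project X onto the selected eigenvectors
    return X @ Vk  # Step 6: (m, k) matrix usable for visualization

# Hypothetical toy data: 100 "pixels" with 6 correlated "bands"
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))
U = classic_pca(X, k=3)
print(U.shape)  # (100, 3)
```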
3.2 Distributed and Parallel PCA Algorithm
Related works:
There are currently two popular libraries that provide a parallel distributed implementation of the PCA algorithm: MLlib [8] on Spark and Mahout, based on MapReduce [9]. Spark's MLlib library includes a parallel distributed PCA implementation, but only for the Scala and Java languages; no implementation is provided for Python.
In [6], Elgamal et al. showed that these two libraries do not allow a proper analysis of very large data sets and proposed a new PCA implementation, called sPCA, which offers better scaling and accuracy than its competitors.
In [2], Zebin et al. proposed a new distributed parallel implementation of the PCA algorithm on the Spark platform; the results obtained are compared with a serial implementation in Matlab and a parallel implementation on Hadoop. The comparison shows the efficiency of the proposed implementation in terms of precision and computation time.
In the following, we propose a new implementation of the parallel distributed PCA algorithm based on the Apache Spark platform, using the Python programming language and MLlib's distributed matrices.
Proposed implementation:
Since the hyperspectral image represents the same scene in several spectral bands, we can decompose it into several images, one per band (see Fig. 4).
The classic PCA algorithm requires intensive computation because of the large size of the hyperspectral image. In this part, we present a parallel distributed implementation of the algorithm using the Spark platform.
First, we transform the matrix X (see Fig. 5a) used to represent the hyperspectral image in the classic PCA into a vector of images denoted M, where each column of X is represented by an image in M (see Fig. 5b).
Now, each image t of M, denoted Mt, is a matrix in our implementation (an RDD in parallel distributed Spark programming).
To make a parallel distributed implementation of PCA, we use the MapReduce paradigm of Spark. The proposed algorithm proceeds as follows:
- Step 1: Calculate the centered-reduced matrix of M:
As seen above, the matrix M contains several images, and each image Mt is represented by a matrix (an RDD in Spark notation) of size (L, C), where L is the number of lines and C the number of columns of the image. Therefore, the centered-reduced matrix of M, denoted MCR, is obtained by a parallel distributed computation over each image Mt (see the graphical description of the algorithm in Fig. 6).
- Calculate the centered matrix of M, denoted MC:
$$ {\text{MC}}_{t,ij} = {\text{M}}_{t,ij} - \overline{{\text{M}}_{t}}\quad {\text{for each}}\;i = 1 \ldots L\;{\text{and}}\;j = 1 \ldots C \tag{3} $$
$$ {\text{with}}\quad \overline{{\text{M}}_{t}} = \frac{1}{L \times C}\sum\nolimits_{i = 1}^{L} \sum\nolimits_{j = 1}^{C} {\text{M}}_{t,ij} $$
In Formula (3), \( \overline{{\text{M}}_{t}} \) denotes the mean of image Mt.
- Calculate the centered-reduced matrix of M, denoted MCR:
$$ {\text{MCR}}_{t,ij} = \frac{{\text{MC}}_{t,ij}}{\sigma_{t}}\quad {\text{for each}}\;i = 1 \ldots L\;{\text{and}}\;j = 1 \ldots C \tag{4} $$
$$ {\text{with}}\quad \sigma_{t}^{2} = \frac{1}{L \times C}\sum\nolimits_{i = 1}^{L} \sum\nolimits_{j = 1}^{C} \left( {\text{M}}_{t,ij} - \overline{{\text{M}}_{t}} \right)^{2} = \frac{1}{L \times C}\sum\nolimits_{i = 1}^{L} \sum\nolimits_{j = 1}^{C} \left( {\text{MC}}_{t,ij} \right)^{2} $$
In Formula (4), \( \sigma_{t} \) denotes the standard deviation of image Mt.
- Step 2: Calculate the correlation matrix of MCR, of size (N, N), denoted Mcorr:
According to Step 1, MCR is a vector of N images, each representing a centered-reduced matrix.
The next step is to calculate the correlation matrix of size (N, N) by taking the matrix product between the vector MCR^T and the vector MCR, using Spark's distributed parallel MapReduce computation (see the graphical description of the algorithm in Fig. 7).
To compute each entry Mcorr_{t,k}, the image MCR_t is multiplied by the image MCR_k pixel by pixel, and the mean of the result is then taken (see the graphical description of the algorithm in Fig. 8).
- Step 3: Calculate the eigenvalues and eigenvectors of the Mcorr matrix, denoted \( [\lambda, V] \).
- Step 4: Sort the eigenvectors in descending order of their eigenvalues and keep the first k columns of V (k \( < N) \).
- Step 5: Project the matrix X onto V: U = X · V.
- Step 6: Use the new matrix U of size (m, k) for the visualization of the hyperspectral image.
4 Experimental and Computational Details
To test the validity of the proposed algorithm on hyperspectral images using the Apache Spark platform, we chose a set of open hyperspectral images of different sizes (see Table 1) and ran three algorithms on these data sets: our serial PCA, the serial PCA of Python's Scikit-learn library, and the proposed parallel distributed PCA.
The hyperspectral image used in our experiments, for both the classic PCA algorithm and the proposed distributed PCA, is the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Moffett Field image, with 224 spectral bands in the 400 nm to 2500 nm range, acquired on August 20, 1992 [10].
The three algorithms are implemented in Python 3 and executed on several configurations (see Table 2); the results of the PCA experiments are given in Table 3.
The visualization of the hyperspectral image after applying the classic PCA algorithm and the proposed distributed PCA is given in Fig. 9.
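Producing such a visualization from the reduced matrix U amounts to rescaling the first three principal components into display channels. A minimal sketch (the dimensions and the `to_rgb` helper are hypothetical, not part of the paper's implementation):

```python
import numpy as np

def to_rgb(U):
    """Scale the first three columns of U (m, k) to 8-bit channels so the
    reduced image can be displayed as a false-color RGB composite."""
    pcs = U[:, :3]
    lo, hi = pcs.min(axis=0), pcs.max(axis=0)
    scaled = (pcs - lo) / (hi - lo)   # each channel mapped to [0, 1]
    return (scaled * 255).astype(np.uint8)

# Hypothetical reduced matrix for a 4x5-pixel image (m = 20, k = 3)
rng = np.random.default_rng(1)
U = rng.normal(size=(20, 3))
rgb = to_rgb(U).reshape(4, 5, 3)      # back to (L, C, 3) for display
print(rgb.shape, rgb.dtype)
```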
5 Conclusions
In this work, we proposed a parallel and distributed version of the PCA dimensionality reduction algorithm. The algorithm is developed in Python 3 and tested on hyperspectral images using the Spark platform. Its results coincide with those of classic PCA, and the visualization of the images after applying our reduction algorithm confirms its validity.
In future work, we will further validate our algorithm by comparing the execution time of each method.
References
Mercier, L.: Système d’analyse et de visualisation d’images hyperspectrales appliqué aux sciences planétaires (2011)
Zebin, W., et al.: Parallel and distributed dimensionality reduction of hyperspectral data on cloud computing architectures. IEEE J. Sel. Top. Appl. Earth Observations Remote Sens. 9(6), 2270–2278 (2016)
Apache Software Foundation. Official apache hadoop. http://hadoop.apache.org/. Accessed 10 July 2017
Apache Spark - Lightning-Fast Cluster Computing. http://spark.apache.org/. Accessed 10 July 2017
Van der Maaten, L., Postma, E., Van den Herik, J.: Dimensionality reduction: a comparative review. J. Mach. Learn. Res. 10, 66–71 (2009)
Elgamal, T., et al.: sPCA: scalable principal component analysis for big data on distributed platforms. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM (2015)
Shlens, J.: A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100 (2014)
MLlib machine learning library. https://spark.apache.org/mllib/. Accessed 10 July 2017
Mahout machine learning library. http://mahout.apache.org/. Accessed 10 July 2017
AVIRIS - Airborne Visible/Infrared Imaging Spectrometer - Data. http://aviris.jpl.nasa.gov/data/image_cube.html. Accessed 10 July 2017
© 2018 Springer International Publishing AG
Zbakh, A., Alaoui Mdaghri, Z., El Yadari, M., Benyoussef, A., El Kenz, A. (2018). Proposition of a Parallel and Distributed Algorithm for the Dimensionality Reduction with Apache Spark. In: Ben Ahmed, M., Boudhir, A. (eds.) Innovations in Smart Cities and Applications. SCAMS 2017. Lecture Notes in Networks and Systems, vol. 37. Springer, Cham. https://doi.org/10.1007/978-3-319-74500-8_46
Print ISBN: 978-3-319-74499-5. Online ISBN: 978-3-319-74500-8.