
1 Introduction

Statistical methods are used today in almost all areas of human activity and belong to the basic toolkit of researchers, engineers, managers, economists, and others. They come in two families: factorial methods and classification methods. In this work, we are interested in studying factorial methods. These belong to the descriptive, or unsupervised, methods of data mining: they project the data onto a space of lower dimension in order to give a clear view of the links between variables while minimizing the loss of information. Factorial methods are classified into two groups:

  1. Principal Component Analysis or PCA

  2. Factor Analysis or FA

Beyond linear data, many other dimensionality reduction techniques address nonlinear structure, such as Self-Organizing Maps (SOM), used to visualize corrective actions in failure modes and effects analysis (FMEA) [1], and Kernel PCA, which can be used for image de-noising, novelty detection, and many other applications [2, 3]. In the context of Independent Component Analysis (ICA), the assumption of source independence is the main reason to apply this method instead of PCA, which only assumes that the sources are uncorrelated. ICA is the best-known solution to the blind source separation (BSS) problem [4, 5]. In this study, the goal of using PCA or FA is to reduce the dimension of the space in order to separate signals from their mixtures (observations); these techniques are applied to a linear dataset as a whitening step before the separation process. To clarify when to use PCA or FA, the key step is to identify the data type: for a table of numerical or ordinal variables, one should apply principal component analysis, whereas for a table of qualitative or nominal variables, factor analysis should be used instead. In the following, a detailed study of these two methods is provided.

1.1 Applications of Signal Separation Algorithms in Telecommunications Systems Based on OFDM

Orthogonal frequency-division multiplexing (OFDM) is a method of encoding digital data on multiple carrier frequencies. OFDM has developed into a popular scheme for wideband digital communication, used in applications such as digital television and audio broadcasting, DSL internet access, wireless networks, power line networks, and 4G mobile communications.

In statistical wireless signal processing, the extraction of unobserved signals from observed mixtures can be achieved using Blind Source Separation (BSS) algorithms. OFDM can be considered a well-established, predominant air-interface communication technique for encoding digital data on multiple carrier frequencies.

Due to its high data-rate transmission and its robustness against frequency-selective fading, OFDM is widely applied in current broadband wireless telecommunication systems.

In the mobile communication environment we have to deal with multipath transmission channels caused by the reflection of wavefronts. In order to apply existing source separation algorithms to mobile communication signals, some modifications of the classical narrowband data model have to be made. In this paper, a modification of the data model using PCA or FA techniques is presented. After the classification of the data, the OFDM technique can be used in many telecommunication systems, such as:

  • Digital audio broadcasting (DAB) (1995).

  • Digital video broadcasting (DVB) (1997).

  • High-definition television (HDTV) terrestrial broadcasting.

  • Wireless LAN and PAN, such as IEEE 802.11a and IEEE 802.11g.

  • Optical communications.

  • OFDM has now been adopted as the new European DAB standard and as an HDTV standard.

  • OFDM/UWB (802.15.3a) (2004).

  • IEEE 802.16 broadband wireless access system (2004).

  • IEEE 802.20 mobile broadband wireless access (MBWA).

  • 4G mobile communication (2005).

Nowadays, OFDM is the key technology for beyond-3G, 4G, and 5G communications, promising robust, high-capacity, high-speed wireless broadband multimedia networks. The source separation algorithms PCA and FA are considered in this paper for data transmission through random multipath channels such as mobile communication channels. Simulation results will show the separation and classification efficiency.

2 Methods

2.1 Principal Component Analysis

Definition. PCA (Hotelling [6]) is a multidimensional descriptive technique that passes from a large, complex data table containing all the information about a studied phenomenon to visual representations (graphs) that are as faithful as possible to the data. This passage aims to reduce the amount of data by projecting the cloud of points onto a principal (factorial) axis, a plane, or a hyperplane, without using any particular hypothesis or model, which allows the user to interpret the results [7]. This reduction in the number of variables forms linear combinations, each of which is related to a principal component [8, 9]. PCA operates through a mathematical process that transforms a number of possibly correlated variables into a number of uncorrelated variables called principal components, so named because they absorb as much as possible of the information, or variance, in the original variables. So Principal Component Analysis is an apt name: it does what it says; PCA finds the principal components of the data.

Problematic 1. The measurement table is presented as follows: the columns contain variables with numerical values, and the rows represent the observations (individuals) on which these variables are measured, in the form of a matrix of size \(p \times q\):

$$\begin{aligned} X= \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1q} \\ x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2q} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{iq} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pj} & \cdots & x_{pq} \end{pmatrix} = \begin{pmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{p}\\ \end{pmatrix} \end{aligned}$$
(1)

The principal components \(Y_{i}\) are then constructed from the rows \(X_{i}\) of this initial table as follows:

$$\begin{aligned} \begin{matrix} Y_{1}=e_{11}X_{1}+e_{12}X_{2}+\cdots +e_{1p}X_{p}\\ Y_{2}=e_{21}X_{1}+e_{22}X_{2}+\cdots +e_{2p}X_{p}\\ \vdots \\ Y_{p}=e_{p1}X_{1}+e_{p2}X_{2}+\cdots +e_{pp}X_{p}\\ \end{matrix} \end{aligned}$$
(2)

These new components are linear combinations of variables and must be uncorrelated.

The coefficients \(e_{ij}\) are collected into the vector:

$$\begin{aligned} e_{i}= \begin{pmatrix} e_{i1}\\ e_{i2}\\ \vdots \\ e_{ip}\\ \end{pmatrix} \end{aligned}$$
(3)
  • Goals of applying PCA:

    • The most important thing is to reduce the dimensions of the data set.

    • Have an idea about the structure of the data set and also point out the similarities or oppositions of behaviour between individuals.

    • Graph the point cloud in the plane or space, respecting:

      \(*\) The distances between individuals.

      \(*\) The structure of correlations between variables.

  • Variance-Covariance matrix

A variance-covariance matrix is a square, symmetric matrix that contains the variances and covariances associated with several variables. The diagonal elements of the matrix contain the variances of the variables, while the off-diagonal elements contain the covariances between all possible pairs of variables.

This matrix is used to evaluate how the variables vary together: the covariance measures the linear link that may exist between a pair of statistical variables or a pair of quantitative random variables. One computes the covariance of each pair of variables and arranges them in a symmetric matrix:

$$\begin{aligned} cov(x,y)=\frac{1}{n}(\sum _{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})) \end{aligned}$$
(4)
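As a minimal sketch of how Eqs. (2)-(4) fit together, the following Python example (the data matrix is synthetic; NumPy is the only dependency) centers a data table, builds the variance-covariance matrix of Eq. (4), and extracts the principal components as uncorrelated linear combinations of the original variables.

```python
import numpy as np

# Synthetic data table: n = 20 individuals (rows), q = 4 numerical variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4)) @ rng.normal(size=(4, 4))  # columns made correlated on purpose

# Center each variable (subtract the mean, as in Eq. (4)).
Xc = X - X.mean(axis=0)

# Variance-covariance matrix: diagonal = variances, off-diagonal = covariances.
C = (Xc.T @ Xc) / X.shape[0]            # the 1/n convention of Eq. (4)

# Eigendecomposition: each eigenvector e_i holds the coefficients e_ij of Eq. (2).
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]        # sort components by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal components Y_i = e_i1 X_1 + ... + e_ip X_p (Eq. (2)); columns are uncorrelated.
Y = Xc @ eigvecs
print("explained variance ratio:", eigvals / eigvals.sum())
print("cov(Y) is diagonal:",
      np.allclose(np.cov(Y.T, bias=True) - np.diag(eigvals), 0, atol=1e-10))
```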

2.2 Factor Analysis

Definition. In general, Factor Analysis is also a data reduction tool. The term factor analysis was first introduced by Thurstone [10]; it is used to model the data set, to detect the relationships between qualitative and nominal variables in order to classify them [11], and to reconstruct the covariance of the variables with a smaller number of latent variables, called factors, that are independent of one another. It describes observed phenomena in many fields such as intelligence research, science, psychology, health, ecology, and sociology. It is similar to Principal Component Analysis in that it reduces the data. In Factor Analysis there are two types of variables: the latent variables (factors) and the observed variables. Note that PCA is a particular type of FA.

More specifically, there are many types of Factor Analysis; the best known is the Factorial Analysis of Correspondence.

The Factorial Analysis of Correspondence (Benzécri [12]) is used to process the information contained in a so-called contingency (dependency) table of qualitative, quantitative, and positive variables of different kinds; it is used mainly for nominal variables. Such a table can be represented by a cloud of points with probabilities [13]. Correspondence analysis is descriptive when applied to two-way or multi-way tables having a correspondence between rows and columns. The final result is similar to that of the Factor Analysis method, exploring the categories of variables contained in the table.

  • So what is “correspondence”?

When the variables are quantitative, a correlation study has to be done (PCA). However, when the variables are qualitative or nominal, one must make a study of the correspondences (FCA).

Problematic 2. The Factor Analysis model is written like a regression model: each data item is a linear function of the unobserved factors \(f_{1},f_{2},\ldots ,f_{m}\), which determine the variation of the data set. In matrix notation, the FA model is:

$$\begin{aligned} X=\mu +Lf+\epsilon \end{aligned}$$
(5)

We have the data X with the expression in Eq. (1), and \(\mu \) is the mean vector of the \(X_{i}\) variables:

$$\begin{aligned} \mu = \begin{pmatrix} \mu _{1}\\ \mu _{2}\\ \vdots \\ \mu _{p}\\ \end{pmatrix} \end{aligned}$$
(6)

f represents the factors collected in the vector of common factors:

$$\begin{aligned} f= \begin{pmatrix} f_{1}\\ f_{2}\\ \vdots \\ f_{m}\\ \end{pmatrix} \end{aligned}$$
(7)

with \(m \ll p\). The matrix of factor loadings is represented as:

$$\begin{aligned} L= \begin{pmatrix} l_{11} & l_{12} & \cdots & l_{1m} \\ l_{21} & l_{22} & \cdots & l_{2m} \\ \vdots & \vdots & & \vdots \\ l_{p1} & l_{p2} & \cdots & l_{pm} \end{pmatrix} \end{aligned}$$
(8)

And finally the measurement error:

$$\begin{aligned} \epsilon = \begin{pmatrix} \epsilon _{1}\\ \vdots \\ \epsilon _{p}\\ \end{pmatrix} \end{aligned}$$
(9)

To know more about the model assumptions for the mean, variance and correlation, see [14] (Table 1).
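To make the model of Eq. (5) concrete, the following sketch simulates \(X=\mu +Lf+\epsilon \) and then recovers a two-factor solution. The dimensions (p = 6, m = 2), the loading values, and the use of scikit-learn's FactorAnalysis are illustrative assumptions, not part of the original study.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n, p, m = 500, 6, 2              # n observations, p variables, m factors (m << p)

L = rng.normal(size=(p, m))      # factor loadings (Eq. (8)), invented for illustration
mu = rng.normal(size=p)          # mean vector (Eq. (6))
f = rng.normal(size=(n, m))      # common factors (Eq. (7)), assumed standard normal
eps = rng.normal(scale=0.3, size=(n, p))   # measurement error (Eq. (9))

X = mu + f @ L.T + eps           # the FA model of Eq. (5), written row-wise

fa = FactorAnalysis(n_components=m).fit(X)
print("estimated loadings (identified only up to rotation):\n", fa.components_.T)
print("estimated noise variances:", fa.noise_variance_)
```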

Now let us consider the example of a data collection table used for the Factorial Analysis of Correspondence. The following table contains variables of two sets I and J (the entries):

Table 1. Contingency table

\(*\) Example:

The FCA technique is mainly used for large data tables whose entries are all expressed in the same unit. In the qualitative case, the preceding table is presented as a table of ones and zeros (depending on whether or not individual i has parameter j).

We then have \(p_{ij}=\frac{x_{ij}}{\sum _{i=1}^{n}\sum _{j=1}^{m}x_{ij}}\), which replaces \(x_{ij}\) in the previous table.
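The following sketch (the contingency counts are invented purely for illustration) computes the correspondence matrix \(p_{ij}\) defined above, the row and column profiles, and the \(\chi ^2\) distance between row profiles that FCA uses as its metric.

```python
import numpy as np

# Invented contingency table: rows = set I, columns = set J (counts x_ij).
X = np.array([[20,  5, 10],
              [ 3, 30,  7],
              [ 8,  6, 25]], dtype=float)

P = X / X.sum()                  # correspondence matrix: p_ij = x_ij / sum of all x_ij
r = P.sum(axis=1)                # row masses
c = P.sum(axis=0)                # column masses

row_profiles = P / r[:, None]    # conditional frequencies within each row
col_profiles = P / c[None, :]    # conditional frequencies within each column

# chi-square distance between two row profiles (the metric used by FCA)
def chi2_row_distance(i, k):
    return np.sqrt(((row_profiles[i] - row_profiles[k]) ** 2 / c).sum())

print(P.round(3))
print("chi2 distance between rows 0 and 1:", chi2_row_distance(0, 1))
```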

  • Goals of Factorial Analysis:

First, Factor Analysis is used for the purpose of measuring unobserved (latent), error-free variables.

    • Reduce the number of variables.

Determine and prioritize all the dependencies between the rows and the columns of the table on one hand; on the other hand, bring out abstract, synthetic, uncorrelated variables (reduction of dimensionality). For this purpose, the transformed cloud must be projected onto a space of smaller dimension.

The observations are classified in the contingency table in cells indexed by two sets presented in rows and columns. In contrast to PCA, the representative cloud of individuals cannot be visualized using a Cartesian coordinate system, since the population in this case is defined by nominal criteria. Instead, the analysis of the correspondences makes it possible to visualize links between the variables on one or two factorial planes using the \(\chi ^2\) metric.

3 PCA and FA Differences

Factor Analysis offers the distinctive feature (unlike PCA) of providing a representation space common to variables and individuals. In addition, FA can process nominal data, which is not possible for PCA.

In Principal Component Analysis we assume that all the variability in an item should be used in the analysis, while in Factor Analysis we only use the variability that an item has in common with the other items.

Factor Analysis studies the link between qualitative and quantitative variables, whereas PCA analyses only quantitative variables.

A double PCA on rows and columns leads to the Factorial Correspondence Analysis.

In PCA, the distance used for the computation is the Euclidean distance, but in FCA it is the chi-square (\(\chi ^2\)) distance [15].

As the number of variables used to study a phenomenon grows, the results of PCA and FA become more and more similar. This observation has been confirmed by many researchers in the field; Snook and Gorsuch [16] found that tables with at least 40 variables yield only minor differences.

Fig. 1. Simple comparison of PCA and FA

The figure shows the principal distinction between PCA and FA in a compact way. Notice that the arrows point from the measured variables to the principal component, and in the opposite direction for FA. In Principal Component Analysis, the variability in the measured variables produces the variance of the principal component; by contrast, in Factor Analysis the latent factors are the main cause of the variance and correlation between the measured variables (Marcoulides and Hershberger [17]) (Fig. 1).

4 Results and Interpretations

4.1 Analyzing General Data

In this section, we show the results of the statistical analyses and the data projections obtained by the two techniques studied in this paper: Principal Component Analysis and Factor Analysis, the latter in the form of Factorial Correspondence Analysis (FCA). We begin with example 1, whose dataset is a table presenting a series of completely fictitious data concerning the stays of several patients in a hospital center. We analyze these data using PCA.

Note that this data is chosen for pedagogical purposes rather than for a comprehensive analysis.

In the first example we present the results of the PCA technique. Our original table contains 10 (ordinal and nominal) variables, presented below:

Table 2. Descriptive statistics

Table 2 shows the descriptive statistics of each variable; here, we have replaced all the missing values by the corresponding variable means.

In this example, the study has been done on 20 samples, and the purpose of analyzing the structure of this data is to perform a meaningful interpretation of the results after applying the PCA technique. The matrix presented in Table 3 groups the correlations between the variables: for example, there is a positive correlation of 0.22 between the variable age and the variable hospitalized, which can be interpreted as older people being hospitalized more often than younger people. This correlation structure determines the variables from which the main components are built; all the variables that are correlated will be grouped into factors.

Table 3. Correlation matrix

The total variance explained in Table 4 gives an idea of the amount of information carried by each component or factor. The 10 variables have been replaced by 6 components; the first component alone represents 29% of the total information of the set of variables, the second represents 23%, and the third 18%, so grouping the three components captures 71% of the total variance. As a result, we no longer need to work with the full set of variables.

Fig. 2. The eigenvalue graph

Table 4. Total variance

Table 5 presents the correlation values between the variables and the principal components, grouped in the component matrix after rotation.

The graph of Fig. 2 shows the eigenvalues of each computed component. As one can observe, the first three components chosen by PCA have the highest values among the six components computed. These three components, PC1, PC2, and PC3, carry the highest variance, with values of 29.26, 23.72, and 18.41 respectively, taken from Table 4.

The rotation type used in this case is Varimax with Kaiser normalization.
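As a hedged sketch of this rotation step: scikit-learn exposes varimax through its FactorAnalysis estimator rather than through PCA, so the example below substitutes FA extraction, and the 20 × 10 data matrix is synthetic rather than the actual hospital table.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 10))          # placeholder for the 20 x 10 data table

# Extract 3 components and apply a varimax rotation to the loadings.
fa = FactorAnalysis(n_components=3, rotation="varimax").fit(X)
loadings = fa.components_.T            # rows = variables, columns = the 3 axes
print(np.round(loadings, 3))           # analogue of the rotated component matrix
```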

After 5 iterations the rotation converged, and we obtained the component matrix shown in Table 5, presenting the three axes that influence each variable: the hour of entry, the disability test, and the cholesterol level have the highest correlations with the first axis; the ID has a correlation of 0.875 with the second axis; and finally the variables age and hospitalized have the highest correlations with the third axis (0.520 and 0.936 respectively).

Table 5. The component matrix after rotation
Fig. 3. Projection of the first two components according to the ID variable

Now one can plot the projection of the data on the first two components according to the ID variable. In Fig. 3, each small circle represents the projection of the data along the two dimensions given by the first two components, those having the highest eigenvalues (variance) of the data.
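A projection in the spirit of Fig. 3 can be sketched as follows, assuming matplotlib is available; the data matrix and the ID-like point labels are placeholders, not the original dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 10))                      # placeholder data table
Xc = X - X.mean(axis=0)                            # center the variables

# Take the two eigenvectors with the highest variance, as in Sect. 2.1.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T, bias=True))
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
Y = Xc @ top2                                      # scores on PC1 and PC2

plt.scatter(Y[:, 0], Y[:, 1])
for i, (u, v) in enumerate(Y):                     # label points by an ID-like index
    plt.annotate(str(i + 1), (u, v))
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.show()
```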

Now we present the results of the second example, analyzed with FCA. This example provides data on the composition of products sold in fast-food outlets in the United States. Here we have 117 types of hamburger products with 16 variables, so the set of all types of hamburgers sold constitutes the population we seek to study by FCA.

To begin, we can show the relation between cholesterol and proteins in the following graph.

This type of analysis is called bivariate analysis.

Fig. 4. Graph of the relation between two variables (cholesterol and proteins)

The graph in Fig. 4 shows a positive linear relationship between the two variables: the more cholesterol in a hamburger, the more protein there is. The Pearson correlation coefficient is 0.966, which indicates a strong but not perfect relationship. We now weight the observations of the 117 types of hamburger brands by the numerical identifier ID and then project the data onto two dimensions.
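For reference, a Pearson coefficient like the 0.966 reported here can be computed in a single call; the cholesterol and protein values below are invented stand-ins, not the real dataset.

```python
import numpy as np

cholesterol = np.array([25., 40., 55., 70., 90., 120.])   # invented values
protein     = np.array([12., 18., 22., 30., 37., 50.])

r = np.corrcoef(cholesterol, protein)[0, 1]
print(f"Pearson r = {r:.3f}")      # close to +1 means a strong positive linear link
```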

Fig. 5 shows the statistical link between three variables: the brand of hamburger, the fast-food chain offering it, and the total calories, recorded into the 4 categories presented in Table 6.

Fig. 5. Graph of projection points in two dimensions

Table 6. Correspondence table

As we can observe from Table 6, Marge active is the fast-food chain that proposes the most hamburgers with a high total of calories (>820); Burger King comes next with 1036, Jack in the Box with 739, McDonald's with no hamburger exceeding 820 calories, and finally Wendy's ranks last with 281 hamburgers. We also notice that Marge active is the only fast-food chain proposing a high number of hamburgers with a low total of calories (<40). This table shows the important correspondence between the fast-food chain and the total number of calories.

4.2 Analyzing Audio Data

In telecommunication systems such as the OFDM technique, audio data is widely used to analyze recordings of different types of signals. Since audio signals belong to classes such as speech, noise, and music, this can be useful for several applications, like audiovisual indexing, retrieval systems, and automatic classification of multimedia content. In our case, we suppose that we have only speech signals, represented in three mixtures of two male speakers. The conditions of the experiment are given below:

The recordings of the male speakers are stored in stereo WAV audio files, where the microphone elements are spaced in a linear arrangement. The spacing of each stereo microphone pair is about 2.15 cm, and the reverberation time is about 150 ms [18].

The channels are synchronized within each file, but no two channels in different files are synchronized to each other.

The source sets do not share the same time offsets, sampling frequencies, or source directions. The sampling frequency mismatches are smaller than 100 ppm (= 0.01%). In this section, we present the result of the PCA technique used to classify the audio data and prepare it for a separation method, to be used afterwards in mobile applications based on the OFDM technique.
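A minimal whitening sketch in the spirit of this preprocessing step is given below. It uses synthetic sinusoidal stand-ins for the speech mixtures (the real WAV recordings are not reproduced), decorrelates the observations, and equalizes their variances, which is the usual step before a BSS algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 64000)                     # 64000 samples, as in the text
s = np.vstack([np.sin(2 * np.pi * 440 * t),      # synthetic stand-ins for speech
               np.sign(np.sin(2 * np.pi * 97 * t))])
A = rng.normal(size=(3, 2))                      # unknown mixing matrix (assumed)
X = A @ s                                        # 3 observed mixtures

Xc = X - X.mean(axis=1, keepdims=True)           # center each mixture
C = np.cov(Xc)                                   # covariance of the mixtures
d, E = np.linalg.eigh(C)
keep = d > 1e-12                                 # drop near-zero directions (rank 2 here)
W = np.diag(d[keep] ** -0.5) @ E[:, keep].T      # whitening matrix
Z = W @ Xc                                       # whitened data: cov(Z) ~ identity

print(np.round(np.cov(Z), 3))                    # approximately the identity matrix
```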

The result of the experiment is shown in Fig. 6 below:

Fig. 6. Principal component analysis of audio data

Fig. 7. Final data

This figure demonstrates that PCA reduces the dataset to 6 components, shown in 6 different colors representing all the original data. In this experiment, we chose 50 samples from each column of the mixtures out of the 64000 samples of each mixture; the full number is of course huge, complicates the computation, and requires much more memory, so to simplify the operation we reduced the number of samples in order to compute quickly and obtain clear results. The final data is shown in Fig. 7.

The covariance matrix is:

$$\begin{aligned} CovMat= \begin{pmatrix} 6.5727\times 10^{-7} & 1.3965\times 10^{-9} \\ 1.3965\times 10^{-9} & 4.0546\times 10^{-9} \end{pmatrix} \end{aligned}$$
(10)

5 Conclusion

To sum up, we have demonstrated that the PCA and FA techniques are both tools for reducing and processing data. PCA aims to group a large number of variables into a limited number of components in order to facilitate the analysis of the data and to detect the relations between the various variables, with the major objective of obtaining the most relevant summary of the initial data. The FA method is also a dimension reduction technique, used especially for measuring the impact of unobserved variables, called factors, on a large number of observed variables; such data are defined by qualitative variables, and notably nominal variables.

Finally, we conclude that the choice of the analysis method depends fundamentally on the type of the data: principal component analysis (PCA) is used to process quantitative variables, while factorial correspondence analysis is used for qualitative and nominal variables. These differences are helpful in deciding which method is most appropriate for given variables; choosing improperly might lead to badly interpreted results or an incorrect understanding of the data.