
1 Introduction

Federated learning (FL) [15], which enables collaborative machine learning in a communication-efficient and privacy-preserving way, has been receiving increasing attention in the literature. FL provides a general-purpose collaborative learning framework that consists of a coordinating central server and multiple participating clients with their local datasets, e.g., organizations (cross-silo setting) or devices (cross-device setting). More recent FL methods also propose fully decentralized learning, where clients communicate directly with their neighbors. This eliminates the need for a coordinating server, which can sometimes expose security and privacy risks to the collaborative training [26]. During FL training, the raw local datasets of all clients are kept locally and are restricted from being exchanged or transferred over the network for privacy preservation. Instead, it suffices to communicate only intermediate training variables (e.g., local gradients or local model parameters). Due to its generality and its advantages in privacy preservation, FL has found its way into a growing number of application areas. Nevertheless, most existing FL methods focus almost exclusively on supervised learning problems under the horizontally partitioned local dataset setting. As far as we know, unsupervised FL under the vertically partitioned dataset setting has yet to be well explored.

In fact, the vertically partitioned local dataset setting is common in many practical applications [16]. Under this setting, different clients hold a local partition of features, instead of samples as in the horizontally partitioned setting. It naturally arises when the features/attributes describing the same group of entities are collected and stored among multiple clients. For example, the financial features of a person can be split among multiple financial companies s/he has dealt with, e.g., banks, credit card companies, and stock markets. Similar to the horizontal FL setting, it is important to collaboratively train the model on all clients' data partitions to maximize the global model performance. For example, when a bank refers to a machine learning (ML) model for deciding whether to grant a loan application of a customer, it is ideal for the model training to take into account all the financial records of the customer, not only the transaction history held by this bank. It is apparent from the example that raw local datasets should not be exchanged, because the vertically partitioned datasets can contain sensitive information, i.e., the financial status of the customer. Furthermore, unsupervised learning is practically appealing because it does not need labels for model training. Labels can be expensive and time-consuming to annotate, and may even require domain expertise [13, 19]. Labels can also contain sensitive information, which will incur privacy leakage if not handled properly, especially under the vertically partitioned setting where the labels need to be shared among all clients [5, 8].

Under these circumstances, this paper concentrates on principal component analysis (PCA), one of the most fundamental and useful unsupervised learning tasks, in the setting where multiple clients hold vertically partitioned datasets under the FL framework. PCA plays a pivotal role in high-dimensional data analysis by extracting principal components, which are uncorrelated and ordered, with the leading ones retaining most of the data variations [18]. Very recently, federated PCA has been studied on sample-wise distributed (also known as horizontally distributed) data [9, 10]. However, to the best of our knowledge, there is no federated PCA tailored to feature-wise distributed (also known as vertically distributed) data. The two data distribution settings, though seemingly similar, have fundamental differences, as summarized in the review [23]. To this end, we propose a vertically partitioned federated PCA method, abbreviated VFedPCA, which features computational efficiency, communication efficiency, and privacy preservation. In VFedPCA, clients keep their datasets local, and only the communication of model parameters is required. Such model exchange needs to occur only periodically, which reduces the communication overhead. Within each communication round, clients run the computationally efficient local power iteration [1, 11, 17], and a warm-start power iteration method can further improve performance. Furthermore, we consider the relationship between each independent subset and the full set based on the weight ratio of eigenvalues, and we introduce a weight scaling method to further improve the accuracy and performance of the algorithm. In addition, we consider two settings: (1) a central server coordinates the learning, and (2) a fully decentralized setting that eliminates the need for the server. In summary, the main contributions of this paper are as follows:

  1. We make the first attempt to study PCA under the vertically partitioned FL setting.

  2. We propose VFedPCA, featuring computational efficiency, communication efficiency, and privacy preservation. VFedPCA comes with two variants: one requires a central server for coordination, and the other performs fully decentralized learning, where each client communicates directly with its neighbors.

  3. Extensive experiments on real datasets justify the favorable features of VFedPCA under the vertically partitioned FL setting, while maintaining accuracy comparable to PCA on the unseparated counterpart dataset.

2 Notations and Background

2.1 Notation

We use boldfaced lower-case letters, e.g., \( \mathbf {x}\), to denote vectors and boldfaced upper-case letters, e.g., \( \mathbf {X}\), to denote matrices. The superscript is associated with the iteration number, e.g., \( \mathbf {X}^{t}\) denotes the decision variable at iteration t, while the subscript is associated with the indices of different parties, e.g., the data matrix \(\mathbf {X}_{i}\in \mathfrak {R}^ {n \times p_i}\) denotes the i-th party's data, which has \(p_{i}\) variables (i.e., features) and \(n\) samples (i.e., users). We use \(\mathbf {X}^{*}\) to denote a dual space, and \(\Vert \mathbf {x}\Vert \) denotes the standard Euclidean norm of the vector \(\mathbf {x}\). For image datasets, the data matrix \(\mathbf {X}_{i} \in \mathfrak {R}^{n\times m\times m} \) is the i-th party's input, where m denotes the pixel size. A summary of the frequently used symbols in this paper is shown in Table 1.

Table 1. Summary of frequently used notation

2.2 PCA and Power Iteration

In this subsection, we describe the basic PCA setup and present the centralized power iteration. Let \(\mathbf {X}\in \mathfrak {R}^{\textit{m}\times \textit{n}}\) be the data matrix. The goal of PCA is to find the top eigenvectors of the symmetric positive semidefinite (PSD) covariance matrix \(\mathbf {A} = \frac{1}{n} \mathbf {X}\mathbf {X}^{\top }\in \mathfrak {R}^{\textit{m}\times \textit{m}}\). We assume that the target matrix \(\mathbf {A}\) has eigenvalues \(1\ge \boldsymbol{\alpha }_{1}\ge \boldsymbol{\alpha }_{2}\ge \cdots \ge \boldsymbol{\alpha }_{m}\ge 0\) with corresponding normalized eigenvectors \(\mathbf {a}_{1},\mathbf {a}_{2},\cdots ,\mathbf {a}_{m}\), so that \(\mathbf {A}\mathbf {a}=\boldsymbol{\alpha }\mathbf {a}\). More generally, \(\mathbf {A}^{k}\mathbf {a}=\boldsymbol{\alpha }^{k}\mathbf {a}\) for all \(k=1,...,m\), which is the basis of the power iteration method. Let \(\mathbf {a}^{(0)}\) be an initial vector, taken as an approximation to the dominant eigenvector of \(\mathbf {A}\) or as a random unit vector. The power method [21] estimates the top eigenvector by repeatedly applying the update step below:

$$\begin{aligned} \boldsymbol{\omega }=\mathbf {A}\mathbf {a}^{(k-1)}, \mathbf {a}^{(k)}=\frac{\boldsymbol{\omega }}{\Vert \boldsymbol{\omega }\Vert }, \end{aligned}$$
(1)

where at each iteration \(\mathbf {a}^{(k)}\) gets closer to the top principal eigenvector \(\mathbf {a}_{1}\). As \(k\rightarrow \infty \), \(\mathbf {a}^{(k)}\) converges to \(\mathbf {a}_{1}\).
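For concreteness, a minimal NumPy sketch of the update in Eq. (1) is given below; the function name, the random initialization, and the tolerance-based stopping rule are illustrative choices of ours rather than details specified in the paper.

```python
import numpy as np

def power_iteration(A, num_iters=100, tol=1e-8):
    """Estimate the top eigenpair of a PSD matrix A via the update in Eq. (1)."""
    a = np.random.randn(A.shape[0])
    a /= np.linalg.norm(a)                 # random unit initial vector a^(0)
    for _ in range(num_iters):
        w = A @ a                          # w = A a^(k-1)
        a_next = w / np.linalg.norm(w)     # a^(k) = w / ||w||
        if np.linalg.norm(a_next - a) < tol:
            a = a_next
            break
        a = a_next
    alpha = a @ A @ a                      # Rayleigh quotient estimate of the top eigenvalue
    return a, alpha
```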

Recently, [17] proposed a distributed SVD algorithm based on the local power iteration, which incorporates periodic communication for communication efficiency. The LocalPower method can save communication by a factor of \(l\) without much impact on the accuracy. To guarantee convergence, they partition the data matrix \(\mathbf {X}\) randomly. They assume the local correlation matrix \(\mathbf {A}_{i} = \frac{1}{{s}_{i}} \mathbf {X}_{i}\mathbf {X}_{i}^{\top }\) is a good approximation to the global correlation matrix \(\mathbf {A} = \frac{1}{n} \mathbf {X}\mathbf {X}^{\top }\), which is equivalent to \(\Vert \mathbf {A}-\mathbf {A}_{i}\Vert \le \rho \Vert \mathbf {A}\Vert \), where \(\mathbf {X}_{i}\in \mathfrak {R}^{\textit{m}\times \textit{s}_{i}}\) is the i-th sample partition and \(\rho \), which bounds the difference between \(\mathbf {A}_{i}\) and \(\mathbf {A}\), is assumed to be as small as \(\epsilon \). In contrast, our setting allows all parties to independently calculate their principal component vectors based on the basic power method; the federated principal component vector is then derived in the global center according to the weights of the parties, which reduces the communication and computation complexity and preserves the privacy of the participants.

2.3 Federated Learning

FL is a data-private collaborative learning paradigm, where multiple clients jointly train a global ML model with the help of a central coordination server. The goal of FL is achieved through the aggregation of intermediate learning variables, while the raw data of each client is stored locally without being exchanged or transferred. According to the data partition structure, FL can be generally categorized into two regimes, namely horizontally and vertically partitioned data [7]. Currently, relatively little attention has been paid to the vertically partitioned data. Most existing cross-silo FL work is based on supervised datasets, including trees [4], linear and logistic regression [20, 25], and neural networks [15]. These supervised FL works rely on labels, which are expensive to acquire and require domain expertise. For example, diabetes patients may wish to contribute to FL with their everyday health monitoring data such as glucose level and blood pressure. Since patients lack advanced medical expertise, their local data are often unlabeled without a medical expert's evaluation. As far as we know, FL on unlabeled and vertically partitioned data has yet to be explored. The only existing related papers focus on semi-supervised FL [12] and weakly-supervised FL [14], respectively.

3 Federated-PCA on Privacy-Preserving Vertical-Partitioned Data

In this section, we first formulate the problem of federated PCA on vertically partitioned datasets. Then, we propose the VFedPCA algorithm to collaboratively extract principal components with the participation of all clients, while ensuring privacy by avoiding the sharing of raw local datasets. Our method employs the local power iteration for efficient local computation, along with periodic communication for better communication efficiency.

In particular, we consider two types of communication topologies. The first is the server-clients topology, which follows most existing FL methods by introducing a semi-trusted server to coordinate the training. The second is the fully decentralized topology, where the clients communicate in a peer-to-peer manner with their neighbors. It eliminates the need for the server, which itself can be malicious, and the assumption that such a semi-trusted server exists is sometimes considered impractical.

3.1 Problem Formulation

Let \(\mathbf {X}\in \mathfrak {R}^{\textit{m}\times \textit{n}}\) be the data matrix, which has \(\textit{m}\) features and \(\textit{n}\) samples. We partition \(\mathbf {X}\) into \(\textit{p}\) clients as \(\mathbf {X}=[\mathbf {X}_{1},\mathbf {X}_{2},\cdots ,\mathbf {X}_{p}]^{\top }\), where \(\mathbf {X}_{i}\in \mathfrak {R}^{\textit{f}_{i}\times \textit{n}}\) contains \(\textit{f}_{i}\) features of \(\mathbf {X}\) and \(\sum _{i=1}^p f_i = m\). Let \(\mathbf {S} = \frac{1}{m} \mathbf {X}^{\top }\mathbf {X}\in \mathfrak {R}^{\textit{n}\times \textit{n}} \) be the covariance matrix, and let \(\mathbf {Z}=\mathbf {X}\mathbf {U}^{\top } \in \mathfrak {R}^{\textit{m}}\), which can be considered as the coordinate of the projection along the direction given by \(\mathbf {U}\). Note that \(\mathbf {Var}(\mathbf {Z})=\mathbf {U}^{\top }\mathbf {S}\mathbf {U}\). Our purpose is to find the leading k-dimensional subspace such that the projection of the data onto the subspace has the largest variance. The problem can be formulated as follows:

$$\begin{aligned} \underset{\mathbf {U} \in \mathbb {R}^{m \times k}, \mathbf {U}^{\top } \mathbf {U}=\mathbf {I}}{\max }\Vert \mathbf {U}^{\top }\mathbf {S}\mathbf {U}\Vert = \boldsymbol{\alpha }_{G}\mathbf {U}^{\top }_G\mathbf {U}_{G}=\boldsymbol{\alpha }_{G}, \end{aligned}$$
(2)

where \(\boldsymbol{\alpha }_{G}\) are the leading eigenvalues of \(\mathbf {S}\) and \(\mathbf {U}_{G}\) are the corresponding eigenvectors. Our aim is to minimize the distance between the global eigenvectors \(\mathbf {U}_G\), computed as if all features were centralized together, and each client's \(\mathbf {U}_{i}\), as follows:

$$\begin{aligned} \underset{\mathbf {U}_G,\mathbf {U}_{i}\in \mathfrak {R}^{\textit{m}\times \textit{k}}}{min} dist(\mathbf {U}_G, \mathbf {U}_{i}),\text { for all } i=1,...,p. \end{aligned}$$
(3)

We define the squared \(\boldsymbol{\chi }\)-distance [3] as the distance between \(\mathbf {U}_G\) and \(\mathbf {U}_i\) by:

$$\begin{aligned} dist(\mathbf {U}_G, \mathbf {U}_{i}) :=\sum _{j=1}^{m} \sum _{h=1}^k\frac{\left( (\mathbf {U}_{G})_{j,h}-\mathbf {(U_\textit{i}})_{j,h}\right) ^{2}}{(\mathbf {U}_{G})_{j,h}+\mathbf {(U_\textit{i}})_{j,h}}, \end{aligned}$$
(4)

where \(\mathbf {U}_G\) contains the leading eigenvectors of the global covariance matrix \(\mathbf {S}\), and \(\mathbf {U}_{i}\) contains the leading eigenvectors of the i-th client's local covariance matrix.
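For illustration, the squared \(\chi\)-distance in Eq. (4) can be computed as in the short sketch below, assuming \(\mathbf{U}_G\) and \(\mathbf{U}_i\) are stored as \(m \times k\) arrays; the small constant guarding against a zero denominator is our own addition and is not part of the paper's definition.

```python
import numpy as np

def chi_square_distance(U_g, U_i, eps=1e-12):
    """Squared chi-distance between global and local eigenvectors, Eq. (4).

    U_g, U_i: arrays of shape (m, k) holding the leading eigenvectors.
    eps is a hypothetical safeguard against division by zero.
    """
    num = (U_g - U_i) ** 2
    den = U_g + U_i + eps
    return float(np.sum(num / den))
```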

3.2 Local Power Method

Let us consider the data partitioned among \(\textit{p}\) clients. In each client, we use the power iteration algorithm (also known as the power method) to produce the greatest (in absolute value) eigenvalue \(\boldsymbol{\alpha }_{i}\) of \(\mathbf {A}_{i}=\frac{1}{{f}_{i}}\mathbf {X}_{i}^{\top }\mathbf {X}_{i}\), together with a nonzero vector \(\mathbf {a}_{i}\), which is a corresponding eigenvector of \(\boldsymbol{\alpha }_{i}\), i.e., \(\mathbf {A}_{i}\mathbf {a}_{i} =\boldsymbol{\alpha }_{i}\mathbf {a}_{i}\). For \( \textit{l} = 1,2,\cdots ,\textit{L} \), each client computes locally until convergence by

$$\begin{aligned} \mathbf {a}_{i}^{l+1} = \frac{\mathbf {A}_{i}\mathbf {a}_{i}^{l}}{\Vert \mathbf {A}_{i}\mathbf {a}_{i}^{l}\Vert }, \end{aligned}$$
(5)

where the vector \(\mathbf {a}_{i}^{l}\) is multiplied by the matrix \(\mathbf {A}_{i}\) and normalized at every iteration. Throughout the paper, we use \(l\in [L]\) to denote local iterations and reserve \(t\in [T]\) to denote global rounds.

If \(\mathbf {a}_{i}^{l}\) is an eigenvector of \(\mathbf {A}_{i}\), its corresponding eigenvalue is given by

$$\begin{aligned} \boldsymbol{\alpha }_{i}^{l} = \frac{(\mathbf {a}_{i}^{l})^{\top }\mathbf {A}_{i}\mathbf {a}_{i}^{l}}{(\mathbf {a}_{i}^{l})^{\top }\mathbf {a}_{i}^{l}}, \end{aligned}$$
(6)

where \(\mathbf {a}_{i}^{l}\) and \(\boldsymbol{\alpha }_{i}^{l}\) represent the estimates of the leading eigenvector and eigenvalue of the i-th client, respectively.
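The following sketch illustrates the local computation of Eqs. (5) and (6) at a single client; the function name, the fixed number of local iterations, and the optional warm-start argument (anticipating Sect. 3.5) are our own assumptions.

```python
import numpy as np

def client_local_update(X_i, a_init=None, num_iters=10):
    """Local power iterations at client i, following Eqs. (5)-(6).

    X_i: local feature partition of shape (f_i, n), so that
    A_i = (1/f_i) X_i^T X_i has shape (n, n).
    a_init: optional warm-start vector (see Sect. 3.5); random if None.
    """
    f_i, n = X_i.shape
    A_i = (X_i.T @ X_i) / f_i
    a = np.random.randn(n) if a_init is None else np.array(a_init, dtype=float)
    a /= np.linalg.norm(a)
    for _ in range(num_iters):
        w = A_i @ a
        a = w / np.linalg.norm(w)                 # Eq. (5)
    alpha = (a @ A_i @ a) / (a @ a)               # Rayleigh quotient, Eq. (6)
    return a, alpha
```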

3.3 Federated Communication

Each client uploads the eigenvector \(\mathbf {a}_{i}^{t}\) and the eigenvalue \(\boldsymbol{\alpha }_{i}^{t}\) to the central server. First, the server computes the weight \(\boldsymbol{\omega }_{i}\) of each client. Then, the server merges all clients' results and computes the federated eigenvector \(\mathbf {u}^{t}\):

$$\begin{aligned} \boldsymbol{\omega }_{i}^{t} = \frac{\boldsymbol{\alpha }_{i}^{t}}{\sum _{i=1}^{p}\boldsymbol{\alpha }_{i}^{t}}, \quad \mathbf {u}^{t} = \boldsymbol{\omega }_{1}^{t}\mathbf {a}_{1}^{t} + \boldsymbol{\omega }_{2}^{t}\mathbf {a}_{2}^{t} + \cdots + \boldsymbol{\omega }_{p}^{t}\mathbf {a}_{p}^{t}, \end{aligned}$$
(7)

where \(\boldsymbol{\omega }_{i}^{t}\) is the weight of the i-th client and \(\mathbf {u}^{t}\) is the shared projection feature vector obtained by merging all clients' \(\mathbf {a}_{i}^{t}\)'s.
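A minimal sketch of this server-side aggregation in Eq. (7) might look as follows; the function name and argument layout are ours.

```python
import numpy as np

def server_aggregate(eigvecs, eigvals):
    """Federated aggregation on the server, Eq. (7).

    eigvecs: list of p local eigenvectors a_i^t (each of length n).
    eigvals: list of p local eigenvalues alpha_i^t.
    Returns the federated eigenvector u^t.
    """
    weights = np.asarray(eigvals, dtype=float)
    weights /= weights.sum()                      # omega_i^t = alpha_i^t / sum_j alpha_j^t
    return sum(w * np.asarray(a) for w, a in zip(weights, eigvecs))
```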

3.4 Server-Clients Architecture

Fig. 1. The framework of VFedPCA with central server.

[Algorithm 1]

All clients share parameters with the help of the central coordination server. In our setting, this third-party coordinator is assumed to be trusted, which means that it will honestly conduct the designated functionality and will not attempt to breach the raw data of any client. After aggregating the parameters, the federated parameter is returned to all clients. Then, each client computes locally based on this new federated parameter. Moreover, this “communication-and-local-computation” cycle is iterative. The framework of this model is shown in Fig. 1, and the algorithm is summarized in Algorithm 1. The key steps of this method are as follows:

Local Training: All clients perform the local power method independently. Each client first calculates its covariance matrix \( \mathbf {A}_i \). Subsequently, the leading eigenvector \(\mathbf {a}_i\) and eigenvalue \(\boldsymbol{\alpha }_i\) are calculated (see steps 4–10 of Algorithm 1);

Model Integration: The central server calculates the weight \( \boldsymbol{\omega }_{i} \) occupied by each client in the federated model based on the received eigenvalue \(\boldsymbol{\alpha }_{i}\) of each client, and then combines the received eigenvectors \(\mathbf {a}_{i}, i \in \{1, \cdots, p\}\), to derive a federated feature vector \(\mathbf {u}^{t}\), where the communication is cyclical (see steps 12–16 of Algorithm 1);

Parameters Broadcasting and Updating: The central coordination server broadcasts the aggregated parameter \(\mathbf {u}^{t}\) to the p clients. Each client updates its local eigenvector with the newly returned federated feature vector \(\mathbf {u}^{t}\) (see step 17 of Algorithm 1). The advantage of this architecture is that it is relatively more efficient, although it relies on the help of the server, which can potentially be malicious.
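Composing the earlier sketches, one communication round of this server-clients cycle could be organized as below (with T = 1, i.e., one-shot communication, as used later in the experiments); this is our own illustrative composition and not the paper's exact pseudocode.

```python
def vfedpca_one_shot(partitions, local_iters=10):
    """One communication round of the server-clients architecture (Algorithm 1 sketch).

    partitions: list of p local matrices X_i, each of shape (f_i, n).
    Uses client_local_update and server_aggregate sketched above.
    """
    # Local training: each client runs the local power method independently.
    results = [client_local_update(X_i, num_iters=local_iters) for X_i in partitions]
    eigvecs, eigvals = zip(*results)
    # Model integration on the server; u is then broadcast back to all clients.
    u = server_aggregate(list(eigvecs), list(eigvals))
    return u
```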

3.5 Local Power Iteration with Warm Start

The general power iteration algorithm starts with an initialization vector \(\mathbf {a}^{(0)}\), which may be an approximation to the dominant eigenvector or a random unit vector. Here, we initialize the algorithm from a “warm-start” vector \(\mathbf {a}^{(0)}\) based on the federated vector from the previous communication round. This naturally leads to faster convergence over consecutive rounds, which reduces the actual running time and improves the performance of the local power iteration algorithm in Sect. 3.2. The algorithm is summarized in Algorithm 2. The main step is as follows:

Local Training: Each client uses the federated feature vector \(\mathbf {u}\) as the initial value to perform the local power method, then calculates the (k + 1)-th to 2k-th eigenvectors \(\mathbf {a}\) and eigenvalues \(\boldsymbol{\alpha }\) and sends them to the server (see steps 2–8 of Algorithm 2).

[Algorithm 2]

This method adopts an adaptive loop that allows repeated communication to reduce the error to a certain extent, especially when the error between the one-shot federated result \(\mathbf {u}\) and the global result \(\mathbf {u}_G\) is large.
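A minimal multi-round sketch of this warm-start variant, reusing the helper functions sketched earlier, is shown below; the loop structure and the fixed number of rounds are our own assumptions.

```python
def vfedpca_warm_start(partitions, num_rounds=5, local_iters=10):
    """Warm-start variant (Algorithm 2 sketch): each round initializes the
    local power method from the federated vector u of the previous round."""
    u = None
    for _ in range(num_rounds):
        results = [client_local_update(X_i, a_init=u, num_iters=local_iters)
                   for X_i in partitions]
        eigvecs, eigvals = zip(*results)
        u = server_aggregate(list(eigvecs), list(eigvals))
    return u
```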

3.6 Weight Scaling Method

The federated result may sometimes not be beneficial to all clients. In general, the client with a larger eigenvalue has a greater influence on the federated result, which hints at a natural intuition: following the direction of the clients with larger eigenvalues tends to reach the global consensus faster and costs fewer communication rounds. Inspired by this intuition, we introduce a weight scaling factor to further improve the federated communication. The improved weight-scaled federated average is formulated by

$$\begin{aligned} \begin{aligned} \mathbf {u}^{t}&= (1+\eta _{1}^t)\boldsymbol{\omega }_{1}^{t}\mathbf {a}_{1}^{t} + \cdots + (1+\eta _{\lceil p/2\rceil }^t)\boldsymbol{\omega }_{\lceil p/2\rceil }^{t}\mathbf {a}_{\lceil p/2\rceil }^{t} \\&+ (1-\eta _{\lceil p/2\rceil +1}^t)\boldsymbol{\omega }_{\lceil p/2\rceil +1}^{t}\mathbf {a}_{\lceil p/2\rceil +1}^{t} + \cdots + (1-\eta _{p}^t)\boldsymbol{\omega }_{p}^{t}\mathbf {a}_{p}^{t}, \end{aligned} \end{aligned}$$
(8)

where \(\sum _{i=1}^{\lceil p/2\rceil } \eta _{i}^t = \sum ^{p}_{i= \lceil p/2\rceil +1} \eta _{i}^t\). In brief, we gradually increase the impact of the first half of the clients, i.e., those with larger eigenvalues, while further decreasing the impact of the second half of the clients, i.e., those with smaller eigenvalues. The algorithm is summarized in Algorithm 3. The main step is as follows:

Model Integration: The central server calculates the weight \( \boldsymbol{\omega }_{i} \) occupied by each client in the federated model based on the received eigenvalue \(\boldsymbol{\alpha }_{i}\) of each client, and then applies the \(\eta \) parameter to further adjust the weight scale of each client. It then combines the received eigenvectors \(\mathbf {a}_{i}, i \in \{1, \cdots, p\}\), to derive a federated feature vector \(\mathbf {u}^{t}\) (see steps 2–8 of Algorithm 3). This method is especially suitable for situations where some clients have a larger weight in the federation process.

[Algorithm 3]
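As an illustration, the weight-scaled aggregation of Eq. (8) could be implemented as in the sketch below; ranking the clients by their eigenvalues and the layout of the \(\eta\) parameters reflect our reading of Sect. 3.6, and the function name is hypothetical.

```python
import numpy as np

def weight_scaled_aggregate(eigvecs, eigvals, eta):
    """Weight-scaled federated average, Eq. (8).

    Clients are ranked by eigenvalue; the first ceil(p/2) contributions are
    scaled by (1 + eta_i) and the remaining ones by (1 - eta_i). eta is an
    array of length p assumed to satisfy the balance constraint below Eq. (8).
    """
    eigvals = np.asarray(eigvals, dtype=float)
    weights = eigvals / eigvals.sum()
    order = np.argsort(eigvals)[::-1]            # clients with larger eigenvalues first
    p = len(eigvals)
    half = (p + 1) // 2                          # ceil(p / 2)
    u = np.zeros_like(np.asarray(eigvecs[0], dtype=float))
    for rank, idx in enumerate(order):
        scale = 1 + eta[rank] if rank < half else 1 - eta[rank]
        u += scale * weights[idx] * np.asarray(eigvecs[idx])
    return u
```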

3.7 Fully Decentralized Architecture

In this variant, we eliminate the need for the central server: all clients only share principal component parameters with each other and compute the results locally. The framework of this model is shown in Fig. 2, and the algorithm is summarized in Algorithm 4. The main step of this method is as follows:

Parameters Communication and Updating: The clients are connected and share the intermediate results of the parameters \(\boldsymbol{\alpha }_{i}\) and \(\mathbf {a}_{i}\) with each other. Each client obtains the extracted final data by calculating the weights \( \boldsymbol{\omega }_{i} \)'s occupied by all clients and updates locally (see steps 3–5 of Algorithm 4).

This model circumvents the need for a third-party agency, helping participants further reduce external risks, and is especially suitable for two-client scenarios. However, each client needs to perform the aggregation of all clients' principal components independently, so the computational efficiency is greatly affected.
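The sketch below shows one round of this fully decentralized variant under the simplifying assumption of a fully connected topology, where every client receives all peers' \((\mathbf{a}_i, \boldsymbol{\alpha}_i)\) pairs and performs the same weighted combination locally; the function name and the topology assumption are ours.

```python
import numpy as np

def decentralized_round(partitions, local_iters=10):
    """One round of fully decentralized VFedPCA (Algorithm 4 sketch).

    partitions: list of p local matrices X_i. Each client computes its local
    eigenpair, exchanges it with its peers, and aggregates locally, so no
    central server is involved.
    """
    results = [client_local_update(X_i, num_iters=local_iters) for X_i in partitions]
    eigvecs, eigvals = zip(*results)
    weights = np.asarray(eigvals, dtype=float)
    weights /= weights.sum()
    # every client evaluates the same weighted combination from the shared parameters
    u_local = [sum(w * np.asarray(a) for w, a in zip(weights, eigvecs))
               for _ in partitions]
    return u_local
```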

3.8 Complexity Analysis

In each client, let \(\textit{L}\) be the total number of local iterations, which depends only on the eigen-gap \(\varDelta \) between the top two eigenvalues of \(\mathbf {A}_{i}\) after full passes over the data. In Algorithm 1, let \(\textit{T}\) be the total number of federated aggregations executed by the server.

Theorem 1

(Local Iteration Complexity). After \(O(\frac{1}{\varDelta }\log {\frac{n}{\epsilon }})\) steps, the local power method can achieve \(\epsilon \)-accuracy.

Fig. 2. The framework of fully decentralized (peer-to-peer) VFedPCA learning.

[Algorithm 4]

Remark 1

We show the proof in the appendix (Footnote 1). According to Theorem 1, the normalized iterate \(\frac{\mathbf {A}_{i}\mathbf {a}_{i}^{l}}{\Vert \mathbf {A}_{i}\mathbf {a}_{i}^{l}\Vert }\) is an \(\epsilon \)-accurate estimate of the top principal component. Hence, the total local power iteration complexity is \(O(\frac{1}{\varDelta }\log {\frac{n}{\epsilon }})\).

4 Experimental Results

This section empirically evaluates the proposed method. In the experiments, we utilized four groups of real-world datasets: 1) structured datasets from different domains [6]; 2) a medical image dataset [24]; 3) a face image dataset [2]; 4) a gait image dataset [22]. The feature and sample size information of all datasets is summarized in Table 2. In addition, illustrative images of the image datasets are shown in Figs. 3, 4 and 5, respectively. In all experiments, we measure the communication cost by counting the bits of all the variables transmitted throughout the algorithm execution. As a preliminary study, we set T = 1, i.e., one-shot communication between the clients and the server, in our experiments unless otherwise stated.

Table 2. Summary of datasets
Fig. 3. The DeepLesion dataset.

Fig. 4. The YaleFace dataset.

Fig. 5. The CASIA Gait database.

4.1 Experiment on Structured Dataset

Semi-synthetic Datasets. We use 8 structured datasets from different domains, which are real data publicly available at the UCI Machine Learning Repository [6]. We compare the communication cost and estimation error under different settings of \(\textit{p}\) and \(\textit{l}\) in the feature-wise setting. We change the number of partitions between \(\textit{p}\) = 3, 5, 10 and \(\textit{p}\) = 10, 50, 100, the number of local iterations between \(\textit{l}\) = 5, 10, 20 and \(\textit{l}\) = 50, 100, 200, and we vary the number of communication periods between \(\textit{t}\) = 5 and \(\textit{t}\) = 10. The number of each client's features is split and configured according to the different settings of \(\textit{p}\).

Fig. 6. The distance error (left vertical axis) and the running time (right vertical axis) of Algorithm 1 with respect to the number of parties on structured datasets: (a) College, (b) GlaucomaM, (c) PimaIndiansDiabetes, (d) Musk, (e) Vehicle, (f) Sonar, (g) Swarm, and (h) TCGA, where \(\textit{l}\) = 10, \(\textit{p}\) = 3, 5, 10 for (a)–(f), and \(\textit{l}\) = 100, \(\textit{p}\) = 10, 50, 100 for (g)–(h).

Communication Cost and Estimation Error. In this part, we simulate the communication between the clients and the server on a single machine, where the CPU times are wall-clock times.

  • The effect of local feature size. In Fig. 6, we fix \(\textit{l}\) = 10 and \(\textit{l}\) = 100 but change the number of involved clients to \(\textit{p}\) = 3, 5, 10 and \(\textit{p}\) = 10, 50, 100. Then, we plot the distance error between the global eigenvector and the federated eigenvector, as well as the running time, under the different partitions \(\textit{p}\). In all experiments, compared with the un-communicated situation, the distance error after communication is significantly reduced and convergence is achieved. It can also be observed that a smaller \(\textit{p}\) leads to less communication cost.

  • The effect of local iterations. In Fig. 7, we fix \(\textit{p}\) = 5 and \(\textit{p}\) = 100, but change the number of local power iterations to \(\textit{l}\) = 5, 10, 20 and \(\textit{l}\) = 50, 100, 200. Then, we plot the distance error between the global eigenvector and the federated eigenvector, as well as the running time, under the different local iterations \(\textit{l}\). All experimental results show that, after communication, the error eventually decreases and the algorithm converges. It can also be seen that the result remains stable once \(\textit{l}\) reaches 20.

  • The effect of warm-start power iterations. In Fig. 8, we fix \(\textit{p}\) = 5 and the number of local power iterations \(\textit{l}\) = 100. Then, we plot the distance error between the global eigenvector and the federated eigenvector after \(\textit{t}\) = 5 communication rounds. The experimental results show that the error further decreases and converges.

  • The effect of \(\eta \). In Fig. 8, we fix \(\textit{p}\) = 5 and the number of local power iterations \(\textit{l}\) = 100. Then, we plot the distance error between the global eigenvector and the federated eigenvector after \(\textit{t}\) = 5, 10 iterations. The experimental results show that the error further decreases and converges compared with the setting without the adjustment factor \(\eta \).

Fig. 7. The distance error (left vertical axis) and the running time (right vertical axis) of Algorithm 1 with respect to the number of local iterations on structured datasets: (a) College, (b) GlaucomaM, (c) PimaIndiansDiabetes, (d) Musk, (e) Vehicle, (f) Sonar, (g) Swarm, and (h) TCGA; for (a)–(f), \(\textit{p}\) = 5, \(\textit{l}\) = 5, 10, 20; for (g)–(h), \(\textit{p}\) = 50, 100, \(\textit{l}\) = 50, 100, 200.

Fig. 8. The distance error of Algorithms 2 and 3 with respect to the number of local iterations on structured datasets: (a) College, (b) GlaucomaM, where \(\textit{p}\) = 5, \(\textit{l}\) = 100.

Fig. 9. The final results of the comparative experiment on image datasets: YaleFace (center-light and surprised), CasiaGait (sequences 1 and 10) and DeepLesion, where (a) PCA on the un-split data, (b) VFedPCA on the split data, (c) PCA on the isolated data. Image segmentation (k = 10) results: (d) PCA on the un-split data, (e) VFedPCA on the split data, (f) PCA on the isolated data.

4.2 Case Studies

Medical Image Dataset. The DeepLesion dataset [24] is a collection of CT slices from 4427 unique patients, which contains a variety of lesions (e.g., lung nodules, liver lesions). We selected 100 samples from the dataset, and each image is normalized to a 512 \(\times \) 512 gray image. We set \(\textit{p}\) = 8 clients, split the features, and assign them equally to each client. We used the common clustering method k-means, with \(\textit{k}\) = 20 clusters.

Yale Face Dataset. We use the Yale Face dataset [2], which contains 165 grayscale images of 15 subjects. Each subject has 11 different facial expressions, with \(\textit{n}\) = 15 samples for each facial expression, where each face image is normalized to a 100 \(\times \) 100 gray image. We set \(\textit{p}\) = 10 clients and use k-means with \(\textit{k}\) = 10.

Gait Estimation. CASIA is a gait database [22] for gait recognition, including 20 persons. It contains image sequences, with 4 sequences for each of three directions and \(\textit{n}\) = 20 samples for each direction, where each image is resized to 256 \(\times \) 256. We set \(\textit{p}\) = 16 clients and used k-means with \(\textit{k}\) = 10.

Comparative Results. We first perform PCA on the image datasets. Figure 9 (a)(b)(c) shows that the final images after federation are almost the same as the final images obtained from the un-split image data. Then, to further verify that the final image results extract useful features, we also perform a clustering experiment on the image datasets. Figure 9 (d)(e)(f) shows the clustering results (also known as image segmentation) after using the federated PCA method, which also show an improvement over isolated PCA (i.e., PCA run independently on the split local datasets). This will help each client to further perform local training tasks.

5 Concluding Remarks

In this paper, we have proposed the VFedPCA algorithm, which can obtain a collaborative model that improves over the local models learned separately by each client. Through extensive comparative studies on various tasks, we have verified that the collaborative model achieves accuracy comparable to the centralized model trained as if the dataset were un-split. In addition, we propose two strategies, i.e., the local power iteration warm-start method and the weight scaling method, to further improve the performance and accuracy of the model. In terms of federated communication, the fully decentralized model has lower communication cost and more flexible application value.

From a practical point of view, it is not uncommon for non-linear relationships to exist among data, especially for image data, most of which possess non-linear structure. Under such circumstances, the PCA method, which is essentially linear, is inapplicable for finding an appropriate representative direction. In the future, we will therefore study vertical federated kernel PCA learning on datasets with different types of non-linearity.