
1 Introduction

Federated learning (FL) [15], which enables collaborative machine learning in a communication-efficient and privacy-preserving way, has been receiving increasing attention in the literature. FL provides a general-purpose collaborative learning framework that consists of a coordinating central server and multiple participating clients with their local datasets, e.g., organizations (cross-silo setting) or devices (cross-device setting). More recent FL methods also propose fully decentralized learning, where clients communicate directly with their neighbors. This eliminates the need for a coordinating server, which can sometimes expose security and privacy risks to the collaborative training [26]. During FL training, the raw local datasets of all clients are kept locally and are restricted from being exchanged or transferred over the network for privacy preservation. Instead, it suffices to communicate only intermediate training variables (e.g., local gradients or local model parameters). Due to its generality and its advantages in privacy preservation, FL has found its way into a growing number of application areas. Nevertheless, most existing FL methods focus almost exclusively on supervised learning problems under the horizontally partitioned local dataset setting. As far as we know, unsupervised FL under the vertically partitioned dataset setting has yet to be well explored.

In fact, the vertically partitioned local dataset setting is common in many practical applications [16]. Under this setting, different clients hold a local partition of features, instead of samples as in the horizontally partitioned setting. It naturally arises when the features/attributes describing the same group of entities are collected and stored among multiple clients. For example, the financial features of a person can be split among multiple financial companies s/he has dealt with, e.g., banks, credit card companies, and stock markets. Similar to the horizontal FL setting, it is important to collaboratively train the model on all clients' data partitions to maximize the global model performance. For example, when a bank refers to a machine learning (ML) model for deciding whether to grant a loan application of a customer, it is ideal for the model training to take into account all the financial records of the customer, not only the transaction history held by this bank. It is apparent from the example that raw local datasets should not be exchanged, because the vertically partitioned datasets can contain sensitive information, i.e., the financial status of the customer. Furthermore, unsupervised learning is practically appealing because it does not need labels for model training. Labels can be expensive and time-consuming to annotate, and may even require domain expertise [13, 19]. Labels can also contain sensitive information, which will incur privacy leakage if not handled properly, especially under the vertically partitioned setting where the labels need to be shared among all clients [5, 8].

Under these circumstances, this paper concentrates on principal component analysis (PCA), one of the most fundamental and useful unsupervised learning tasks, in the setting where multiple clients hold vertically partitioned datasets under the FL framework. PCA plays a pivotal role in high-dimensional data analysis by extracting principal components, which are uncorrelated and ordered, with the leading ones retaining most of the data variations [18]. Very recently, federated PCA has been studied on sample-wise distributed (also known as horizontally distributed) data [9, 10]. However, to the best of our knowledge, there is no federated PCA tailored to feature-wise distributed (also known as vertically distributed) data. The two data distribution settings, though seemingly similar, have fundamental differences, as summarized in the review [23]. To this end, we propose a vertically partitioned federated PCA method, abbreviated VFedPCA, which features computational efficiency, communication efficiency, and privacy preservation. In VFedPCA, clients keep their datasets local, and only the communication of model parameters is required. Such model exchange needs to occur only periodically, which reduces the communication overhead. Within each communication round, clients run the computationally efficient local power iteration [1, 11, 17], and a warm-start power iteration method can further improve performance. Furthermore, we consider the relationship between each independent subset and the full set based on the weight ratio of eigenvalues, and we introduce a weight scaling method to further improve the accuracy and performance of the algorithm. In addition, we consider two settings: (1) a central server coordinates the learning, and (2) a fully decentralized setting that eliminates the need for the server. In summary, the main contributions of this paper are as follows:

  1. We make the first attempt to study PCA under the vertically partitioned FL setting.

  2. We propose VFedPCA, featuring computational efficiency, communication efficiency, and privacy preservation. VFedPCA comes with two variants: one requires a central server for coordination, and the other performs fully decentralized learning, where each client communicates directly with its neighbors.

  3. Extensive experiments on real datasets justify the favorable features of VFedPCA under the vertically partitioned FL setting, while maintaining accuracy comparable to PCA on the unseparated counterpart dataset.

2 Notations and Background

2.1 Notation

We use boldfaced lower-case letters, e.g., \( \mathbf {x}\), to denote vectors and boldfaced upper-case letters, e.g., \( \mathbf {X}\), to denote matrices. The superscript is associated with the iteration number, e.g., \( \mathbf {X}^{t}\) denotes the decision variable at iteration t, while the subscript is associated with the indices of different parties, e.g., the data matrix \(\mathbf {X}_{i}\in \mathfrak {R}^ {n \times p_i}\) denotes the i-th party's data, which has \(p_{i}\) variables (i.e., features) and \(n\) samples (i.e., users). We use \(\mathbf {X}^{*}\) to denote a dual space, and \(\Vert \mathbf {x}\Vert \) denotes the standard Euclidean norm of the vector \(\mathbf {x}\). For image datasets, the data matrix \(\mathbf {X}_{i} \in \mathfrak {R}^{n\times m\times m} \) is the i-th party's input, where m denotes the pixel size. A summary of the frequently used symbols in this paper is shown in Table 1.

Table 1. Summary of frequently used notation

2.2 PCA and Power Iteration

In this subsection, we describe the basic PCA setup and present the centralized power iteration. Let \(\mathbf {X}\in \mathfrak {R}^{\textit{m}\times \textit{n}}\) be the data matrix. The goal of PCA is to find the top eigenvectors of the symmetric positive semidefinite (PSD) covariance matrix \(\mathbf {A} = \frac{1}{n} \mathbf {X}\mathbf {X}^{\top }\in \mathfrak {R}^{\textit{m}\times \textit{m}}\). We assume that the target matrix \(\mathbf {A}\) has eigenvalues \(1\ge \boldsymbol{\alpha }_{1}\ge \boldsymbol{\alpha }_{2}\ge \cdots \ge \boldsymbol{\alpha }_{m}\ge 0\) with corresponding normalized eigenvectors \(\mathbf {a}_{1},\mathbf {a}_{2},\cdots ,\mathbf {a}_{m}\), so that \(\mathbf {A}\mathbf {a}=\boldsymbol{\alpha }\mathbf {a}\). More generally, \(\mathbf {A}^{k}\mathbf {a}=\boldsymbol{\alpha }^{k}\mathbf {a}\) for all \(k=1,...,m\), which is the basis of the power iteration method. Let \(\mathbf {a}^{(0)}\) be an initial vector, taken as an approximation to the dominant eigenvector of \(\mathbf {A}\) or as a random unit vector. The power method [21] estimates the top eigenvector by repeatedly applying the update step below:

$$\begin{aligned} \boldsymbol{\omega }=\mathbf {A}\mathbf {a}^{(k-1)}, \mathbf {a}^{(k)}=\frac{\boldsymbol{\omega }}{\Vert \boldsymbol{\omega }\Vert }, \end{aligned}$$
(1)

where at each iteration \(\mathbf {a}^{(k)}\) gets closer to the top principal eigenvector \(\mathbf {a}_{1}\). As \(k\rightarrow \infty \), \(\mathbf {a}^{(k)}\) converges to \(\mathbf {a}_{1}\).
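For concreteness, a minimal NumPy sketch of the update in Eq. (1) is given below; the function name, the random initialization, and the tolerance-based stopping rule are illustrative choices of ours rather than details specified in the paper.

```python
import numpy as np

def power_iteration(A, num_iters=100, tol=1e-8):
    """Estimate the top eigenpair of a PSD matrix A via the update in Eq. (1)."""
    a = np.random.randn(A.shape[0])
    a /= np.linalg.norm(a)                 # random unit initial vector a^(0)
    for _ in range(num_iters):
        w = A @ a                          # w = A a^(k-1)
        a_next = w / np.linalg.norm(w)     # a^(k) = w / ||w||
        if np.linalg.norm(a_next - a) < tol:
            a = a_next
            break
        a = a_next
    alpha = a @ A @ a                      # Rayleigh quotient estimate of the top eigenvalue
    return a, alpha
```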

Recently, [17] proposed a distributed SVD algorithm based on the local power iteration, which incorporates periodic communication for communication efficiency. The LocalPower method can save communication by a factor of \(l\) without much impact on the accuracy. To guarantee convergence, they partition the data matrix \(\mathbf {X}\) randomly. They assume the local correlation matrix \(\mathbf {A}_{i} = \frac{1}{{s}_{i}} \mathbf {X}_{i}\mathbf {X}_{i}^{\top }\) is a good approximation to the global correlation matrix \(\mathbf {A} = \frac{1}{n} \mathbf {X}\mathbf {X}^{\top }\), which is equivalent to \(\Vert \mathbf {A}-\mathbf {A}_{i}\Vert \le \rho \Vert \mathbf {A}\Vert \), where \(\mathbf {X}_{i}\in \mathfrak {R}^{\textit{m}\times \textit{s}_{i}}\) is the i-th sample partition and \(\rho \), which bounds the difference between \(\mathbf {A}_{i}\) and \(\mathbf {A}\), is assumed to be as small as \(\epsilon \). In contrast, our setting allows all parties to independently calculate their principal component vectors based on the basic power method; the federated principal component vector is then derived in the global center according to the weights of the parties, which reduces the communication and computation complexity and preserves the privacy of the participants.

2.3 Federated Learning

FL is a data-private collaborative learning paradigm, where multiple clients jointly train a global ML model with the help of a central coordination server. The goal of FL is achieved through the aggregation of intermediate learning variables, while the raw data of each client is stored locally without being exchanged or transferred. According to the data partition structure, FL can be generally categorized into two regimes, namely horizontally and vertically partitioned data [7]. Currently, relatively little attention has been paid to the vertically partitioned data. Most existing cross-silo FL work is based on supervised datasets, including trees [4], linear and logistic regression [20, 25], and neural networks [15]. These supervised FL works rely on labels, which are expensive to acquire and require domain expertise. For example, diabetes patients may wish to contribute to FL with their everyday health monitoring data such as glucose level and blood pressure. Since patients lack advanced medical expertise, their local data are often unlabeled without a medical expert's evaluation. As far as we know, FL on unlabeled and vertically partitioned data has yet to be explored. The only existing related papers focus on semi-supervised FL [12] and weakly-supervised FL [14], respectively.

3 Federated-PCA on Privacy-Preserving Vertical-Partitioned Data

In this section, we first formulate the problem of federated PCA on vertically partitioned datasets. Then, we propose the VFedPCA algorithm to collaboratively extract principal components with the participation of all clients, while ensuring privacy by avoiding the sharing of raw local datasets. Our method employs the local power iteration for efficient local computation, along with periodic communication for better communication efficiency.

In particular, we consider two types of communication topologies. The first is the server-clients topology, which follows most existing FL methods by introducing a semi-trusted server to coordinate the training. The second is the fully decentralized topology, where the clients communicate in a peer-to-peer manner with their neighbors. It eliminates the need for the server, which itself can be malicious, and the assumption that such a semi-trusted server exists is sometimes considered impractical.

3.1 Problem Formulation

Let \(\mathbf {X}\in \mathfrak {R}^{\textit{m}\times \textit{n}}\) be the data matrix, which has \(\textit{m}\) features and \(\textit{n}\) samples. We partition \(\mathbf {X}\) into \(\textit{p}\) clients as \(\mathbf {X}=[\mathbf {X}_{1},\mathbf {X}_{2},\cdots ,\mathbf {X}_{p}]^{\top }\), where \(\mathbf {X}_{i}\in \mathfrak {R}^{\textit{f}_{i}\times \textit{n}}\) contains \(\textit{f}_{i}\) features of \(\mathbf {X}\) and \(\sum _{i=1}^p f_i = m\). Let \(\mathbf {S} = \frac{1}{m} \mathbf {X}^{\top }\mathbf {X}\in \mathfrak {R}^{\textit{n}\times \textit{n}} \) be the covariance matrix, and let \(\mathbf {Z}=\mathbf {X}\mathbf {U}^{\top } \in \mathfrak {R}^{\textit{m}}\), which can be considered as the coordinate of the projection along the direction given by \(\mathbf {U}\). Note that \(\mathbf {Var}(\mathbf {Z})=\mathbf {U}^{\top }\mathbf {S}\mathbf {U}\). Our purpose is to find the leading k-dimensional subspace such that the projection of the data onto the subspace has the largest variance. The problem can be formulated as follows:

$$\begin{aligned} \underset{\mathbf {U} \in \mathbb {R}^{m \times k}, \mathbf {U}^{\top } \mathbf {U}=\mathbf {I}}{\max }\Vert \mathbf {U}^{\top }\mathbf {S}\mathbf {U}\Vert = \boldsymbol{\alpha }_{G}\mathbf {U}^{\top }_G\mathbf {U}_{G}=\boldsymbol{\alpha }_{G}, \end{aligned}$$
(2)

where \(\boldsymbol{\alpha }_{G}\) are the leading eigenvalues of \(\mathbf {S}\) and \(\mathbf {U}_{G}\) are the corresponding eigenvectors. Our aim is to minimize the distance between the global eigenvectors \(\mathbf {U}_G\), computed as if all features were centralized together, and each client's \(\mathbf {U}_{i}\), as follows:

$$\begin{aligned} \underset{\mathbf {U}_G,\mathbf {U}_{i}\in \mathfrak {R}^{\textit{m}\times \textit{k}}}{min} dist(\mathbf {U}_G, \mathbf {U}_{i}),\text { for all } i=1,...,p. \end{aligned}$$
(3)

We define the squared \(\boldsymbol{\chi }\)-distance [3] as the distance between \(\mathbf {U}_G\) and \(\mathbf {U}_i\) by:

$$\begin{aligned} dist(\mathbf {U}_G, \mathbf {U}_{i}) :=\sum _{j=1}^{m} \sum _{h=1}^k\frac{\left( (\mathbf {U}_{G})_{j,h}-\mathbf {(U_\textit{i}})_{j,h}\right) ^{2}}{(\mathbf {U}_{G})_{j,h}+\mathbf {(U_\textit{i}})_{j,h}}, \end{aligned}$$
(4)

where \(\mathbf {U}_G\) contains the leading eigenvectors of the global covariance matrix \(\mathbf {S}\), and \(\mathbf {U}_{i}\) contains the leading eigenvectors of the i-th client's local covariance matrix.
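For illustration, the squared \(\chi\)-distance in Eq. (4) can be computed as in the short sketch below, assuming \(\mathbf{U}_G\) and \(\mathbf{U}_i\) are stored as \(m \times k\) arrays; the small constant guarding against a zero denominator is our own addition and is not part of the paper's definition.

```python
import numpy as np

def chi_square_distance(U_g, U_i, eps=1e-12):
    """Squared chi-distance between global and local eigenvectors, Eq. (4).

    U_g, U_i: arrays of shape (m, k) holding the leading eigenvectors.
    eps is a hypothetical safeguard against division by zero.
    """
    num = (U_g - U_i) ** 2
    den = U_g + U_i + eps
    return float(np.sum(num / den))
```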

3.2 Local Power Method

Let us consider the data partitioned among \(\textit{p}\) clients. In each client, we use the power iteration algorithm (also known as the power method) to produce the greatest (in absolute value) eigenvalue \(\boldsymbol{\alpha }_{i}\) of \(\mathbf {A}_{i}=\frac{1}{{f}_{i}}\mathbf {X}_{i}^{\top }\mathbf {X}_{i}\), together with a nonzero vector \(\mathbf {a}_{i}\), which is a corresponding eigenvector of \(\boldsymbol{\alpha }_{i}\), i.e., \(\mathbf {A}_{i}\mathbf {a}_{i} =\boldsymbol{\alpha }_{i}\mathbf {a}_{i}\). For \( \textit{l} = 1,2,\cdots ,\textit{L} \), each client computes locally until convergence by

$$\begin{aligned} \mathbf {a}_{i}^{l+1} = \frac{\mathbf {A}_{i}\mathbf {a}_{i}^{l}}{\Vert \mathbf {A}_{i}\mathbf {a}_{i}^{l}\Vert }, \end{aligned}$$
(5)

where the vector \(\mathbf {a}_{i}^{l}\) is multiplied by the matrix \(\mathbf {A}_{i}\) and normalized at every iteration. Throughout the paper, we use \(l\in [L]\) to denote local iterations and reserve \(t\in [T]\) to denote global rounds.

If \(\mathbf {a}_{i}^{l}\) is an eigenvector of \(\mathbf {A}_{i}\), its corresponding eigenvalue is given by

$$\begin{aligned} \boldsymbol{\alpha }_{i}^{l} = \frac{(\mathbf {a}_{i}^{l})^{\top }\mathbf {A}_{i}\mathbf {a}_{i}^{l}}{(\mathbf {a}_{i}^{l})^{\top }\mathbf {a}_{i}^{l}}, \end{aligned}$$
(6)

where \(\mathbf {a}_{i}^{l}\) and \(\boldsymbol{\alpha }_{i}^{l}\) represent the estimates of the leading eigenvector and eigenvalue of the i-th client, respectively.
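The following sketch illustrates the local computation of Eqs. (5) and (6) at a single client; the function name, the fixed number of local iterations, and the optional warm-start argument (anticipating Sect. 3.5) are our own assumptions.

```python
import numpy as np

def client_local_update(X_i, a_init=None, num_iters=10):
    """Local power iterations at client i, following Eqs. (5)-(6).

    X_i: local feature partition of shape (f_i, n), so that
    A_i = (1/f_i) X_i^T X_i has shape (n, n).
    a_init: optional warm-start vector (see Sect. 3.5); random if None.
    """
    f_i, n = X_i.shape
    A_i = (X_i.T @ X_i) / f_i
    a = np.random.randn(n) if a_init is None else np.array(a_init, dtype=float)
    a /= np.linalg.norm(a)
    for _ in range(num_iters):
        w = A_i @ a
        a = w / np.linalg.norm(w)                 # Eq. (5)
    alpha = (a @ A_i @ a) / (a @ a)               # Rayleigh quotient, Eq. (6)
    return a, alpha
```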

3.3 Federated Communication

Each client uploads the eigenvector \(\mathbf {a}_{i}^{t}\) and the eigenvalue \(\boldsymbol{\alpha }_{i}^{t}\) to the central server. First, the server computes the weight \(\boldsymbol{\omega }_{i}\) of each client. Then, the server merges all clients' results and computes the federated eigenvector \(\mathbf {u}^{t}\):

$$\begin{aligned} \boldsymbol{\omega }_{i}^{t} = \frac{\boldsymbol{\alpha }_{i}^{t}}{\sum _{i=1}^{p}\boldsymbol{\alpha }_{i}^{t}}, \quad \mathbf {u}^{t} = \boldsymbol{\omega }_{1}^{t}\mathbf {a}_{1}^{t} + \boldsymbol{\omega }_{2}^{t}\mathbf {a}_{2}^{t} + \cdots + \boldsymbol{\omega }_{p}^{t}\mathbf {a}_{p}^{t}, \end{aligned}$$
(7)

where \(\boldsymbol{\omega }_{i}^{t}\) is the weight of the i-th client and \(\mathbf {u}^{t}\) is the shared projection feature vector obtained by merging all clients' \(\mathbf {a}_{i}^{t}\)'s.
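A minimal sketch of this server-side aggregation in Eq. (7) might look as follows; the function name and argument layout are ours.

```python
import numpy as np

def server_aggregate(eigvecs, eigvals):
    """Federated aggregation on the server, Eq. (7).

    eigvecs: list of p local eigenvectors a_i^t (each of length n).
    eigvals: list of p local eigenvalues alpha_i^t.
    Returns the federated eigenvector u^t.
    """
    weights = np.asarray(eigvals, dtype=float)
    weights /= weights.sum()                      # omega_i^t = alpha_i^t / sum_j alpha_j^t
    return sum(w * np.asarray(a) for w, a in zip(weights, eigvecs))
```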

3.4 Server-Clients Architecture

Fig. 1. The framework of VFedPCA with central server.

[Algorithm 1]

All clients share parameters with the help of the central coordination server. In our setting, this third-party coordinator is assumed to be trusted, which means that it will honestly conduct the designated functionality and will not attempt to breach the raw data of any client. After aggregating the parameters, the federated parameter is returned to all clients. Then, each client computes locally based on this new federated parameter. Moreover, this “communication-and-local-computation” cycle is iterative. The framework of this model is shown in Fig. 1, and the algorithm is summarized in Algorithm 1. The key steps of this method are as follows:

Local Training: All clients perform the local power method independently. Each client first calculates its covariance matrix \( \mathbf {A}_i \). Subsequently, the leading eigenvector \(\mathbf {a}_i\) and eigenvalue \(\boldsymbol{\alpha }_i\) are calculated (see steps 4–10 of Algorithm 1);

Model Integration: The central server calculates the weight \( \boldsymbol{\omega }_{i} \) occupied by each client in the federated model based on the received eigenvalue \(\boldsymbol{\alpha }_{i}\) of each client, and then combines the received eigenvectors \(\mathbf {a}_{i}, i \in \{1, \cdots, p\}\), to derive a federated feature vector \(\mathbf {u}^{t}\), where the communication is cyclical (see steps 12–16 of Algorithm 1);

Parameters Broadcasting and Updating: The central coordination server broadcasts the aggregated parameter \(\mathbf {u}^{t}\) to the p clients. Each client updates its local eigenvector with the newly returned federated feature vector \(\mathbf {u}^{t}\) (see step 17 of Algorithm 1). The advantage of this architecture is that it is relatively more efficient, although it relies on the help of the server, which can potentially be malicious.
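Composing the earlier sketches, one communication round of this server-clients cycle could be organized as below (with T = 1, i.e., one-shot communication, as used later in the experiments); this is our own illustrative composition and not the paper's exact pseudocode.

```python
def vfedpca_one_shot(partitions, local_iters=10):
    """One communication round of the server-clients architecture (Algorithm 1 sketch).

    partitions: list of p local matrices X_i, each of shape (f_i, n).
    Uses client_local_update and server_aggregate sketched above.
    """
    # Local training: each client runs the local power method independently.
    results = [client_local_update(X_i, num_iters=local_iters) for X_i in partitions]
    eigvecs, eigvals = zip(*results)
    # Model integration on the server; u is then broadcast back to all clients.
    u = server_aggregate(list(eigvecs), list(eigvals))
    return u
```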

3.5 Local Power Iteration with Warm Start

The general power iteration algorithm starts with an initialization vector \(\mathbf {a}^{(0)}\), which may be an approximation to the dominant eigenvector or a random unit vector. Here, we initialize the algorithm from a “warm-start” vector \(\mathbf {a}^{(0)}\) based on the federated vector from the previous communication round. This naturally leads to faster convergence over consecutive rounds, which reduces the actual running time and improves the performance of the local power iteration algorithm in Sect. 3.2. The algorithm is summarized in Algorithm 2. The main step is as follows:

Local Training: Each client uses the federated feature vector \(\mathbf {u}\) as the initial value to perform the local power method, then calculates the (k + 1)-th to 2k-th eigenvectors \(\mathbf {a}\) and eigenvalues \(\boldsymbol{\alpha }\) and sends them to the server (see steps 2–8 of Algorithm 2).

[Algorithm 2]

This method adopts an adaptive loop that allows repeated communication to reduce the error to a certain extent, especially when the error between the one-shot federated result \(\mathbf {u}\) and the global result \(\mathbf {u}_G\) is large.
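A minimal multi-round sketch of this warm-start variant, reusing the helper functions sketched earlier, is shown below; the loop structure and the fixed number of rounds are our own assumptions.

```python
def vfedpca_warm_start(partitions, num_rounds=5, local_iters=10):
    """Warm-start variant (Algorithm 2 sketch): each round initializes the
    local power method from the federated vector u of the previous round."""
    u = None
    for _ in range(num_rounds):
        results = [client_local_update(X_i, a_init=u, num_iters=local_iters)
                   for X_i in partitions]
        eigvecs, eigvals = zip(*results)
        u = server_aggregate(list(eigvecs), list(eigvals))
    return u
```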

3.6 Weight Scaling Method

The federated result may sometimes not be beneficial to all clients. In general, the client with a larger eigenvalue has a greater influence on the federated result, which hints at a natural intuition: following the direction of the clients with larger eigenvalues tends to reach the global consensus faster and costs fewer communication rounds. Inspired by this intuition, we introduce a weight scaling factor to further improve the federated communication. The improved weight-scaled federated average is formulated by

$$\begin{aligned} \begin{aligned} \mathbf {u}^{t}&= (1+\eta _{1}^t)\boldsymbol{\omega }_{1}^{t}\mathbf {a}_{1}^{t} + \cdots + (1+\eta _{\lceil p/2\rceil }^t)\boldsymbol{\omega }_{\lceil p/2\rceil }^{t}\mathbf {a}_{\lceil p/2\rceil }^{t} \\&+ (1-\eta _{\lceil p/2\rceil +1}^t)\boldsymbol{\omega }_{\lceil p/2\rceil +1}^{t}\mathbf {a}_{\lceil p/2\rceil +1}^{t} + \cdots + (1-\eta _{p}^t)\boldsymbol{\omega }_{p}^{t}\mathbf {a}_{p}^{t}, \end{aligned} \end{aligned}$$
(8)

where \(\sum _{i=1}^{\lceil p/2\rceil } \eta _{i}^t = \sum ^{p}_{i= \lceil p/2\rceil +1} \eta _{i}^t\). In brief, we gradually increase the impact of the first half of the clients, i.e., those with larger eigenvalues, while further decreasing the impact of the second half of the clients, i.e., those with smaller eigenvalues. The algorithm is summarized in Algorithm 3. The main step is as follows:

Model Integration: The central server calculates the weight \( \boldsymbol{\omega }_{i} \) occupied by each client in the federated model based on the received eigenvalue \(\boldsymbol{\alpha }_{i}\) of each client, and then applies the \(\eta \) parameter to further adjust the weight scale of each client. It then combines the received eigenvectors \(\mathbf {a}_{i}, i \in \{1, \cdots, p\}\), to derive a federated feature vector \(\mathbf {u}^{t}\) (see steps 2–8 of Algorithm 3). This method is especially suitable for situations where some clients have a larger weight in the federation process.

[Algorithm 3]
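As an illustration, the weight-scaled aggregation of Eq. (8) could be implemented as in the sketch below; ranking the clients by their eigenvalues and the layout of the \(\eta\) parameters reflect our reading of Sect. 3.6, and the function name is hypothetical.

```python
import numpy as np

def weight_scaled_aggregate(eigvecs, eigvals, eta):
    """Weight-scaled federated average, Eq. (8).

    Clients are ranked by eigenvalue; the first ceil(p/2) contributions are
    scaled by (1 + eta_i) and the remaining ones by (1 - eta_i). eta is an
    array of length p assumed to satisfy the balance constraint below Eq. (8).
    """
    eigvals = np.asarray(eigvals, dtype=float)
    weights = eigvals / eigvals.sum()
    order = np.argsort(eigvals)[::-1]            # clients with larger eigenvalues first
    p = len(eigvals)
    half = (p + 1) // 2                          # ceil(p / 2)
    u = np.zeros_like(np.asarray(eigvecs[0], dtype=float))
    for rank, idx in enumerate(order):
        scale = 1 + eta[rank] if rank < half else 1 - eta[rank]
        u += scale * weights[idx] * np.asarray(eigvecs[idx])
    return u
```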

3.7 Fully Decentralized Architecture

In this variant, we eliminate the need for the central server: all clients only share principal component parameters with each other and compute the results locally. The framework of this model is shown in Fig. 2, and the algorithm is summarized in Algorithm 4. The main step of this method is as follows:

Parameters Communication and Updating: The clients are connected and share the intermediate results of the parameters \(\boldsymbol{\alpha }_{i}\) and \(\mathbf {a}_{i}\) with each other. Each client obtains the extracted final data by calculating the weights \( \boldsymbol{\omega }_{i} \)'s occupied by all clients and updates locally (see steps 3–5 of Algorithm 4).

This model circumvents the need for a third-party agency, helping participants further reduce external risks, and is especially suitable for two-client scenarios. However, each client needs to perform the aggregation of all clients' principal components independently, so the computational efficiency is greatly affected.
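The sketch below shows one round of this fully decentralized variant under the simplifying assumption of a fully connected topology, where every client receives all peers' \((\mathbf{a}_i, \boldsymbol{\alpha}_i)\) pairs and performs the same weighted combination locally; the function name and the topology assumption are ours.

```python
import numpy as np

def decentralized_round(partitions, local_iters=10):
    """One round of fully decentralized VFedPCA (Algorithm 4 sketch).

    partitions: list of p local matrices X_i. Each client computes its local
    eigenpair, exchanges it with its peers, and aggregates locally, so no
    central server is involved.
    """
    results = [client_local_update(X_i, num_iters=local_iters) for X_i in partitions]
    eigvecs, eigvals = zip(*results)
    weights = np.asarray(eigvals, dtype=float)
    weights /= weights.sum()
    # every client evaluates the same weighted combination from the shared parameters
    u_local = [sum(w * np.asarray(a) for w, a in zip(weights, eigvecs))
               for _ in partitions]
    return u_local
```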

3.8 Complexity Analysis

In each client, let \(\textit{L}\) be the total number of local iterations, which depends only on the eigen-gap \(\varDelta \) between the top two eigenvalues of \(\mathbf {A}_{i}\) after full passes over the data. In Algorithm 1, let \(\textit{T}\) be the total number of federated aggregations executed by the server.

Theorem 1

(Local Iteration Complexity). After \(O(\frac{1}{\varDelta }\log {\frac{n}{\epsilon }})\) steps, the local power method can achieve \(\epsilon \)-accuracy.

Fig. 2. The framework of fully decentralized (peer-to-peer) VFedPCA learning.

[Algorithm 4]

Remark 1

We show the proof in the appendix (Footnote 1). According to Theorem 1, the normalized iterate \(\frac{\mathbf {A}_{i}\mathbf {a}_{i}^{l}}{\Vert \mathbf {A}_{i}\mathbf {a}_{i}^{l}\Vert }\) is an \(\epsilon \)-accurate estimate of the top principal component. Hence, the total local power iteration complexity is \(O(\frac{1}{\varDelta }\log {\frac{n}{\epsilon }})\).

4 Experimental Results

This section empirically evaluates the proposed method. In the experiments, we utilized four groups of real-world datasets: 1) structured datasets from different domains [6]; 2) a medical image dataset [24]; 3) a face image dataset [2]; 4) a gait image dataset [22]. The feature and sample size information of all datasets is summarized in Table 2. In addition, illustrative images of the image datasets are shown in Figs. 3, 4 and 5, respectively. In all experiments, we measure the communication cost by counting the bits of all the variables transmitted throughout the algorithm execution. As a preliminary study, we set T = 1, i.e., one-shot communication between the clients and the server, in our experiments unless otherwise stated.

Table 2. Summary of datasets
Fig. 3. The DeepLesion dataset.

Fig. 4. The YaleFace dataset.

Fig. 5. The CASIA Gait database.

4.1 Experiment on Structured Dataset

Semi-synthetic Datasets. We use 8 structured datasets from different domains, which are real data publicly available at the UCI Machine Learning Repository [6]. We compare the communication cost and estimation error under different settings of \(\textit{p}\) and \(\textit{l}\) in the feature-wise setting. We change the number of partitions between \(\textit{p}\) = 3, 5, 10 and \(\textit{p}\) = 10, 50, 100, the number of local iterations between \(\textit{l}\) = 5, 10, 20 and \(\textit{l}\) = 50, 100, 200, and we vary the number of communication periods between \(\textit{t}\) = 5 and \(\textit{t}\) = 10. The number of each client's features is split and configured according to the different settings of \(\textit{p}\).

Fig. 6. The distance error (left vertical axis) and the running time (right vertical axis) of Algorithm 1 with respect to the number of parties on structured datasets: (a) College, (b) GlaucomaM, (c) PimaIndiansDiabetes, (d) Musk, (e) Vehicle, (f) Sonar, (g) Swarm, and (h) TCGA, where \(\textit{l}\) = 10, \(\textit{p}\) = 3, 5, 10 for (a)–(f), and \(\textit{l}\) = 100, \(\textit{p}\) = 10, 50, 100 for (g)–(h).

Communication Cost and Estimation Error. In this part, we simulate the communication between the clients and the server on a single machine, where the CPU times are wall-clock times.

  • The effect of local feature size. In Fig. 6, we fix \(\textit{l}\) = 10 and \(\textit{l}\) = 100 but change the number of involved clients to \(\textit{p}\) = 3, 5, 10 and \(\textit{p}\) = 10, 50, 100. Then, we plot the distance error between the global eigenvector and the federated eigenvector, as well as the running time, under the different partitions \(\textit{p}\). In all experiments, compared with the un-communicated situation, the distance error after communication is significantly reduced and convergence is achieved. It can also be observed that a smaller \(\textit{p}\) leads to less communication cost.

  • The effect of local iterations. In Fig. 7, we fix \(\textit{p}\) = 5 and \(\textit{p}\) = 100, but change the number of local power iterations to \(\textit{l}\) = 5, 10, 20 and \(\textit{l}\) = 50, 100, 200. Then, we plot the distance error between the global eigenvector and the federated eigenvector, as well as the running time, under the different local iterations \(\textit{l}\). All experimental results show that, after communication, the error eventually decreases and the algorithm converges. It can also be seen that the result remains stable once \(\textit{l}\) reaches 20.

  • The effect of warm-start power iterations. In Fig. 8, we fix \(\textit{p}\) = 5 and the number of local power iterations \(\textit{l}\) = 100. Then, we plot the distance error between the global eigenvector and the federated eigenvector after \(\textit{t}\) = 5 communication rounds. The experimental results show that the error further decreases and converges.

  • The effect of \(\eta \). In Fig. 8, we fix \(\textit{p}\) = 5 and the number of local power iterations \(\textit{l}\) = 100. Then, we plot the distance error between the global eigenvector and the federated eigenvector after \(\textit{t}\) = 5, 10 iterations. The experimental results show that the error further decreases and converges compared with the setting without the adjustment factor \(\eta \).

Fig. 7. The distance error (left vertical axis) and the running time (right vertical axis) of Algorithm 1 with respect to the number of local iterations on structured datasets: (a) College, (b) GlaucomaM, (c) PimaIndiansDiabetes, (d) Musk, (e) Vehicle, (f) Sonar, (g) Swarm, and (h) TCGA; for (a)–(f), \(\textit{p}\) = 5, \(\textit{l}\) = 5, 10, 20; for (g)–(h), \(\textit{p}\) = 50, 100, \(\textit{l}\) = 50, 100, 200.

Fig. 8. The distance error of Algorithms 2 and 3 with respect to the number of local iterations on structured datasets: (a) College, (b) GlaucomaM, where \(\textit{p}\) = 5, \(\textit{l}\) = 100.

Fig. 9. The final results of the comparative experiment on image datasets: YaleFace (center-light and surprised), CasiaGait (sequences 1 and 10) and DeepLesion, where (a) PCA on the un-split data, (b) VFedPCA on the split data, (c) PCA on the isolated data. Image segmentation (k = 10) results: (d) PCA on the un-split data, (e) VFedPCA on the split data, (f) PCA on the isolated data.

4.2 Case Studies

Medical Image Dataset. The DeepLesion dataset [24] is a collection of CT slices from 4427 unique patients, which contains a variety of lesions (e.g., lung nodules, liver lesions). We selected 100 samples from the dataset, and each image is normalized to a 512 \(\times \) 512 gray image. We set \(\textit{p}\) = 8 clients, split the features, and assign them equally to each client. We used the common clustering method k-means, with \(\textit{k}\) = 20 clusters.

Yale Face Dataset. We use the Yale Face dataset [2], which contains 165 grayscale images of 15 subjects. Each subject has 11 different facial expressions, with \(\textit{n}\) = 15 samples for each facial expression, where each face image is normalized to a 100 \(\times \) 100 gray image. We set \(\textit{p}\) = 10 clients and use k-means with \(\textit{k}\) = 10.

Gait Estimation. CASIA is a gait database [22] for gait recognition, including 20 persons. It contains image sequences, with 4 sequences for each of three directions and \(\textit{n}\) = 20 samples for each direction, where each image is resized to 256 \(\times \) 256. We set \(\textit{p}\) = 16 clients and used k-means with \(\textit{k}\) = 10.

Comparative Results. We first perform PCA on the image datasets. Figure 9 (a)(b)(c) shows that the final images after federation are almost the same as the final images obtained from the un-split image data. Then, to further verify that the final image results extract useful features, we also perform a clustering experiment on the image datasets. Figure 9 (d)(e)(f) shows the clustering results (also known as image segmentation) after using the federated PCA method, which also show an improvement over isolated PCA (i.e., PCA run independently on the split local datasets). This will help each client to further perform local training tasks.

5 Concluding Remarks

In this paper, we have proposed the VFedPCA algorithm, which can obtain a collaborative model that improves over the local models learned separately by each client. Through extensive comparative studies on various tasks, we have verified that the collaborative model achieves accuracy comparable to the centralized model trained as if the dataset were un-split. In addition, we propose two strategies, i.e., the local power iteration warm-start method and the weight scaling method, to further improve the performance and accuracy of the model. In terms of federated communication, the fully decentralized model has lower communication cost and more flexible application value.

From a practical point of view, it is not uncommon for non-linear relationships to exist among data, especially for image data, most of which possess non-linear structure. Under such circumstances, the PCA method, which is essentially linear, is inapplicable for finding an appropriate representative direction. In the future, we will therefore study vertical federated kernel PCA learning on datasets with different types of non-linearity.