1 Introduction

Data availability is essential for machine learning; however, privacy concerns often prevent the direct sharing of data among different parties. Federated learning (FL) addresses this issue by facilitating collaborative model training without sharing private data. This approach allows multiple parties to leverage their data while adhering to privacy protection measures and government regulations, such as the General Data Protection Regulation (GDPR) (Commission, 2016).

Federated Learning (FL) algorithms have evolved into two mainstream subtypes: Horizontal Federated Learning (HFL) (McMahan et al., 2017; Li et al., 2020, 2021; Karimireddy et al., 2020; Mishchenko et al., 2022; Shi et al., 2021; Casado et al., 2023; Badar et al., 2023; Ahmad et al., 2023; Li et al., 2022; Sabater et al., 2022) and Vertical Federated Learning (VFL) (Li et al., 2020; Vepakomma et al., 2018; Chen et al., 2020; Yang et al., 2019; Hu et al., 2019; Wei et al., 2022; Gu et al., 2021). In HFL, each client holds a subset of the data points with the full feature set (horizontally partitioned data), whereas in VFL, each client holds all data points but only a non-intersecting subset of the features (vertically partitioned data).

We focus on VFL, which applies to practical learning scenarios in various industries, such as hospitals, banks, and insurance companies. For example, a government agency (server) collaborates with multiple banks (clients) to develop a model for estimating customers’ credit scores (Wei et al., 2022), where each bank holds a distinct set of customer features. In VFL, each client trains a feature extraction model that maps its local data samples to embeddings. The server then collects the embeddings from all clients and uses them as input to the server model to make a prediction.

To build a practical VFL, it is essential to meet the following fundamental requirements: model applicability (Castiglia et al., 2022; Makhija et al., 2022; Zhang et al., 2021), privacy security (Zhou et al., 2020; Hardy et al., 2017; Fang et al., 2021), computational efficiency (Chen et al., 2020; Hu et al., 2019; Zhang et al., 2021), and communication efficiency (Zhang et al., 2021; Castiglia et al., 2022; Wang et al., 2022).

In a typical VFL framework optimized with FOO (Chen et al., 2020; Vepakomma et al., 2018), as illustrated in Fig. 1a, both the server and the clients utilize FOO to optimize the model, which is fast. However, sharing the gradient with the client poses a serious risk of privacy leakage (Fu et al., 2022; Fredrikson et al., 2015; He et al., 2016; Zhao et al., 2020), and the framework is only applicable to differentiable models.

Fig. 1 The intuition of our VFL framework

A recent study (Zhang et al., 2021) found that applying ZOO on VFL, as depicted in Fig. 1b, offers several advantages in building practical VFL. Firstly, it enhances model applicability by eliminating the requirement for an explicit gradient to update the model. Secondly, it improves privacy security by transmitting black-box information (losses) to the client instead of internal information (gradients). Besides, the client retains the perturbation direction, preventing third parties from obtaining the gradient. As a result, both the server and client can maintain the confidentiality of gradient information during training. However, relying solely on ZOO for model optimization can lead to slow convergence, especially when dealing with large models.

Neither of the frameworks above meets all the requirements of a practical VFL. Although FOO converges rapidly and dependably, the privacy risk associated with transmitting gradients is a significant drawback. On the other hand, ZOO provides high model applicability and privacy security but suffers from slow convergence.

This raises the question: how can we improve the convergence speed while preserving the advantages of ZOO to build a practical VFL?

In this paper, we address this problem by proposing a cascaded hybrid optimization method for asynchronous VFL that maximizes the benefits of both optimization methods.

As depicted in Fig. 1c, we apply distinct optimization methods to the upstream (server) and downstream (client) parts of the global model in a cascaded manner. This approach preserves privacy: the downstream models are updated with ZOO, which guarantees that no gradient is transmitted through the network, while the upstream model is updated locally with FOO, which converges fast and does not compromise privacy.

Our contributions can be summarized as follows:

  • We propose a practical asynchronous VFL framework that cascades two different optimization methods (FOO & ZOO), maximizing the advantages of both. Our VFL framework satisfies, to a significant degree, the fundamental requirements of model applicability, privacy security, computational efficiency, and communication efficiency.

  • We theoretically prove that our VFL framework converges faster than ZOO-based VFL by showing that the convergence is limited only by the size of the clients’ parameters. Additionally, our VFL framework can feasibly train a heavily parameterized model whose major part resides on the server.

  • We conduct extensive experiments on a Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), and a Large Language Model (LLM) to demonstrate the privacy and applicability of our framework on current deep learning tasks.

Justification of the Application Scenario: In our VFL setting, the server uses a larger model compared with the clients. We provide our justification for this application scenario below.

In VFL, the server is typically the initiator and primary beneficiary of the model training process. The client, on the other hand, acts as a follower and only provides the embedding of their local features without disclosing the raw data (Wei et al., 2022). Besides, the server usually possesses more computational resources than the clients, making it more suitable for training large models. Therefore, using a larger model on the server side can lead to better data predictions and reduce the computational burden for all participants in the VFL, making it a more preferable and economical option.

2 Related work

There are several basic metrics to consider when developing a VFL framework:

Model Applicability dictates that the VFL framework can accommodate heterogeneous models. The key aspect of model heterogeneity is whether each party’s model is differentiable. Most VFL approaches explicitly apply gradients (Vepakomma et al., 2018; Chen et al., 2020), which forces each party to use a differentiable model. However, this is not always practical, especially when participants have non-differentiable model architectures. In such cases, where the gradient is not available, the main solutions are to apply a proximal term (Castiglia et al., 2022) or to use ZOO (Zhang et al., 2021).

Privacy is a critical consideration for any VFL algorithm. In VFL, there are two types of private data: the features held by the clients and the labels held by the server. Depending on the target of the attack, privacy inference attacks in VFL can be classified as feature inference attacks (Luo et al., 2021; Jin et al., 2021; Zhu et al., 2019; Fredrikson et al., 2015; Weng et al., 2020) or label inference attacks (Fu et al., 2022; Sun et al., 2022; Zhu et al., 2019; Zhao et al., 2020; Jin et al., 2021).

The mainstream privacy protection scheme is applying privacy computing on VFL. For example, Liu et al. (2020) and Hardy et al. (2017) have applied homomorphic encryption (HE) on the transmission data, where the participant in the VFL framework sends the ciphertext instead of plain text through the network. Other works have used differential privacy (DP) (Shokri & Shmatikov, 2015; Ranbaduge & Ding, 2022; Wei et al., 2020; Sabater et al., 2022) or secure multiparty computation (SMC) (Fang et al., 2021). Although these privacy computing methods have a provable security level, they have several disadvantages. For example, HE restricts the choice of model structure, DP reduces the performance of the global model, and HE and SMC have high communication or computation costs for participants, which limits their application.

Computational Efficiency dictates that the computation resource in VFL is efficiently used. The computational efficiency of synchronous VFL can be low due to the idle time for participants. In synchronous VFL, the server coordinates with all clients by sending a request to all clients for each batch of training data. The server must wait for all clients’ responses to fulfill one global update step before sending the next request to all clients (Liu et al., 2019; Vepakomma et al., 2018; Castiglia et al., 2022; Fang et al., 2021). As a result, all participants must wait for the slowest one, leading to low computational efficiency in synchronous VFL.

Asynchronous VFL (Chen et al., 2020; Hu et al., 2019; Zhang et al., 2021) was proposed to reduce idle time for each participant and improve the computation efficiency. In asynchronous VFL, the client continuously sends its model output to the server without coordination from the server. When the server receives the output from the client, it replies with the necessary information (e.g., partial derivative) to assist the model update of the client. This scheme eliminates most of the idle time for the clients and improves computation efficiency. Our research focuses on asynchronous VFL.

Communication Efficiency is about reducing the communication cost between the parties of VFL. Research has focused on reducing communication rounds (Liu et al., 2019; Wang et al., 2022) or per-round communication overhead (Castiglia et al., 2022). Liu et al. (2019) propose multiple local updates on VFL participants to reduce communication rounds. However, multiple local updates consume more computational resources on clients, which is not favorable in VFL. Wang et al. (2022) apply a better optimization method to speed up convergence and reduce communication rounds. Castiglia et al. (2022) apply compression to the embeddings of client outputs to support efficient communication and multiple local updates, reducing per-round communication overhead and communication rounds.

3 Method

This section introduces the modeling of the VFL problem and proposes our framework that cascades different optimization methods. With the cascaded hybrid optimization method, the advantages of both ZOO and FOO are maximized in one VFL framework.

3.1 Problem definition

We consider a general form of VFL problem (Chen et al., 2020; Hu et al., 2019; Liu et al., 2019; Zhang et al., 2021), which involves a single server and M clients.

Each participant in the VFL possesses n samples within its respective database. Specifically, client m holds a distinct set of features for the i-th sample, denoted as \(x_{i,m}\), while the server holds the corresponding label for the i-th sample, denoted as \(y_i\).

Clients communicate with the server through the network. To preserve the privacy of the local data, the raw data \(x_{i, m}\) and \(y_i\) should not be transmitted through the network. Each client holds a local model \( F_m( w_m;x_{i, m}) \) parameterized by \(w_m \in {\mathbb {R}}^{d_m}\), which takes the sample \(x_{i, m}\) as input, and sends the output \(c_{i, m}\) of the model to the server through the network. The server holds a model \(F_0(w_0; c_{i, 1}, \ldots , c_{i, M})\) parameterized by \(w_0 \in {\mathbb {R}}^{d_0}\), which takes \(c_{i, m}\) from all clients as inputs. The loss function is denoted as \({\mathcal {L}}( \hat{y_i}, y_i)\).

Ideally, all parties in the VFL framework collaborate to solve a finite-sum problem in the composition form:

$$\begin{aligned} f(w_0, \textbf{w}) =&\frac{1}{n} \sum _{i=1}^{n} \underbrace{\left[ {\mathcal {L}}(F_0(w_0, c_{i, 1}, \ldots , c_{i, M}), y_i) + \lambda \sum _{m=0}^M g(w_m) \right] }_{f_i(w_0, \textbf{w}) } \nonumber \\&\text {with} \quad c_{i, m} = F_m ( w_m ;x_{i, m}) \quad \forall m \in [M] \end{aligned}$$
(1)

where g is the regularization function for party m, \([M] = \{1, 2, \cdots , M\}\) denotes the set of all clients’ indices, \(\textbf{w} = \{ w_1, w_2, \cdots , w_M \}\) denotes the parameters of all clients, and \(f_i(w_0, \textbf{w})\) denotes the loss function for the i-th sample.
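To make the composition in Eq. (1) concrete, the sketch below evaluates \(f_i\) for a toy instantiation; the linear client and server maps, squared loss, and \(\ell_2\) regularizer are illustrative assumptions, not the framework’s prescribed models.

```python
import numpy as np

def f_i(w0, ws, x_parts, y_i, lam=1e-3):
    """Per-sample loss f_i from Eq. (1) for an illustrative linear instantiation."""
    # Client-side embeddings c_{i,m} = F_m(w_m; x_{i,m})
    cs = [W @ x for W, x in zip(ws, x_parts)]
    # Server model F_0: here, a linear map on the concatenated embeddings
    y_hat = w0 @ np.concatenate(cs)
    # Squared loss plus the l2 regularizer g summed over all parties
    reg = lam * (np.sum(w0 ** 2) + sum(np.sum(W ** 2) for W in ws))
    return 0.5 * (y_hat - y_i) ** 2 + reg

rng = np.random.default_rng(0)
ws = [rng.standard_normal((4, 8)) for _ in range(3)]          # M = 3 clients
print(f_i(rng.standard_normal(12), ws, [rng.standard_normal(8)] * 3, y_i=1.0))
```

Note that the regularizer is the only term that couples nothing across parties; the composition enters solely through the concatenated embeddings.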

3.2 Cascaded hybrid optimization (ZOO & FOO)

To leverage the advantage of ZOO and FOO in one VFL, we apply a cascaded hybrid optimization method, where the upstream (server) and the downstream (client) of the global model apply different optimization methods simultaneously. Specifically, the clients are updated with ZOO and the communication between the server and the client does not contain internal information, which protects privacy. The server is updated with FOO locally, which speeds up the convergence of the VFL without degrading the privacy security.

3.2.1 Client update with ZOO to ensure privacy security

The models of the clients are trained with the ZOO. The two-point stochastic gradient estimator (Liu et al., 2020; Nesterov & Spokoiny, 2017) w.r.t. the client m’s parameter \(w_m\) is defined as:

$$\begin{aligned}&\hat{\nabla }_{w_m} f_i \left( w_0, \textbf{w} \right) = \frac{\phi (d_m) }{\mu _m} \left[ f_i\left( w_m + \mu _m u_{i, m} \right) - f_i\left( w_m \right) \right] u_{i, m} \end{aligned}$$
(2)

where \(u_{i, m} \sim p\) is a random direction vector drawn from the distribution p. Typically, p is the standard normal distribution \({\mathcal {N}}(\pmb {0}, \textbf{I})\) or the uniform distribution \({\mathcal {U}}({\mathcal {S}}(\textbf{0}, 1) )\) over the unit sphere centered at \(\textbf{0}\) with radius 1. \(\mu _m\) is the smoothing parameter. \(f_i\left( w_m + \mu _m u_{i, m} \right) \) is the simplified form of \(f_i(w_0, w_1, w_2, \cdots , w_m + \mu _m u_{i, m}, \cdots , w_M )\), i.e., the loss of the i-th sample with the model parameter of client m changed to \(w_m + \mu _m u_{i, m}\). \(\phi (d_m)\) is a dimension-dependent factor determined by the choice of p: if p is \({\mathcal {N}}(\pmb {0}, \textbf{I})\) then \(\phi (d_m) = 1\), and if p is \({\mathcal {U}}( {\mathcal {S}}(\textbf{0}, 1))\) then \(\phi (d_m) = d_m\).
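As a sanity check on Eq. (2), the sketch below applies the two-point estimator to a smooth toy function and compares the estimate, averaged over many random directions, with the analytic gradient; the quadratic test function, Gaussian directions, and sample counts are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
f = lambda w: 0.5 * np.sum((A @ w) ** 2)    # smooth toy objective
grad = lambda w: A.T @ (A @ w)              # its analytic gradient

def two_point_estimate(w, mu=1e-4):
    u = rng.standard_normal(d)              # u ~ N(0, I), so phi(d) = 1
    return (f(w + mu * u) - f(w)) / mu * u  # Eq. (2)

w = rng.standard_normal(d)
est = np.mean([two_point_estimate(w) for _ in range(50000)], axis=0)
print(np.linalg.norm(est - grad(w)) / np.linalg.norm(grad(w)))  # small rel. error
```

A single estimate is very noisy (its variance grows with d, which is exactly the slow-convergence issue discussed above); only the average over many directions approaches the true gradient.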

The clients are unable to compute the gradient of the loss function locally because the labels are stored on the server. As illustrated in Fig. 2, a client queries the server for the necessary computation material. The active client computes the model output with and without the perturbation \(\mu _m u_{i,m}\) on its parameter and sends both to the server. Specifically, the client’s outputs are:

$$\begin{aligned} c_{i, m}&= F_m (w_m; x_{i,m}) \\ \hat{c}_{i, m}&= F_m(w_m + \mu _m u_{i, m} ; x_{i,m}) \end{aligned}$$

Upon receiving the query, the server replies to client m with the corresponding loss values \(h_{i, m}\) and \(\hat{h}_{i, m}\):

$$\begin{aligned} h_{i, m}&= {\mathcal {L}}(F_0(w_0, c_{i, 1}, \ldots , c_{i, m}, \ldots , c_{i, M}), y_i) \\ \hat{h}_{i, m}&= {\mathcal {L}}(F_0(w_0, c_{i, 1}, \ldots , \hat{c}_{i, m}, \ldots , c_{i, M}), y_i) \end{aligned}$$

When the client receives \(h_{i, m}\) and \(\hat{h}_{i, m}\) from the server, it can compute the two-point gradient estimator via:

$$\begin{aligned} \hat{\nabla }_{w_m} f_i \left( w_0, \textbf{w} \right) = \frac{\phi (d_m) }{\mu _m} \left[ \hat{h}_{i, m} - h_{i, m} \right] u_{i, m} \end{aligned}$$
(3)

Finally, the client m updates its parameter by gradient descent with the stochastic gradient estimator:

$$\begin{aligned} w_m^{t+1}= w_m^t - \eta _m \hat{\nabla }_{w_m} f_i \left( w_0^t, \textbf{w}^t \right) \end{aligned}$$
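Once the two losses arrive, the client-side update reduces to a few lines; the following is a minimal sketch, assuming the Gaussian direction variant (\(\phi (d_m)=1\)) and hypothetical names for the received quantities.

```python
import numpy as np

def client_zoo_step(w_m, u, h, h_hat, mu=1e-3, eta=1e-2, phi=1.0):
    """One client update: the Eq. (3) estimator followed by a descent step.

    u:        the perturbation direction the client kept locally
    h, h_hat: losses the server returned without / with the perturbation
    phi:      the dimension factor phi(d_m); 1 for Gaussian directions
    """
    grad_est = phi / mu * (h_hat - h) * u   # Eq. (3)
    return w_m - eta * grad_est             # gradient descent on w_m

print(client_zoo_step(np.zeros(3), np.array([1.0, 0.0, -1.0]), h=0.90, h_hat=0.89))
```

Because u never leaves the client, the server (or an eavesdropper) sees only outputs and scalar losses, which is the basis of the privacy argument in Sect. 5.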

There are two parts of private data in the VFL framework that require protection: the features held by the clients and the labels held by the server. Our framework protects the privacy of the data by concealing the internal information of the participants. A comprehensive analysis of the privacy protection of our framework is presented in Sect. 5.

Fig. 2 One round of our VFL framework

3.2.2 Server update with FOO to speed up the convergence

The primary issue with ZOO in machine learning is that the variance of the gradient estimate grows with the parameter dimension, leading to slow convergence, particularly for large models. To address this issue, we apply FOO on the server to speed up convergence. Note that the server update is performed locally and affects neither the communication with the clients nor the clients’ update steps. As a result, convergence is accelerated without compromising the privacy protection of the framework.

The server’s model is trained with the first-order gradient. Whenever the server receives a message from a client, it performs one gradient descent step on its local model. Since the server can access the output embeddings \([c_{i, m}]_{m=1}^M\) of all clients and the label \(y_i\), and naturally has full access to its own model \(F_0\), it can explicitly compute the gradient via backpropagation. Specifically, the local gradient of the server is:

$$\begin{aligned} \nabla _{w_0}f_{i} (w_0, \textbf{w}) = \frac{\partial \left[ {\mathcal {L}}(F_0(w_0, c_{i, 1}, \ldots , c_{i, M}), y_i) + \lambda g(w_0) \right] }{\partial w_0} \end{aligned}$$

And the server’s parameter is updated via gradient descent:

$$\begin{aligned} w_0^{t+1}= w_0^t - \eta _0 \nabla _{w_0}f_{i} (w_0^t, \textbf{w}^t) \end{aligned}$$
(4)
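Because the server holds \(F_0\), the labels, and the cached client embeddings, its first-order step is ordinary backpropagation. A minimal PyTorch sketch follows; the two-layer model, the optimizer settings, and the use of weight decay in place of \(\lambda g(w_0)\) are illustrative assumptions.

```python
import torch

# Assumed stand-in for F_0: a small two-layer network; weight_decay plays the
# role of the regularizer lambda * g(w_0).
server_model = torch.nn.Sequential(
    torch.nn.Linear(256, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))
opt = torch.optim.SGD(server_model.parameters(), lr=1e-2, weight_decay=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

def server_foo_step(c_all, y_i):
    """One local FOO step (Eq. 4) on the concatenated client embeddings."""
    opt.zero_grad()
    loss = loss_fn(server_model(c_all), y_i)
    loss.backward()          # explicit gradient via backpropagation
    opt.step()
    return loss.item()

# Example call with a dummy embedding batch of size 1 and an integer label.
print(server_foo_step(torch.randn(1, 256), torch.tensor([3])))
```

Nothing computed here is sent to the clients; the step consumes only locally available quantities, which is why it does not affect the privacy analysis.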

3.3 Asynchronous updates

The global model is trained without coordination among the parties. We assume that all messages are successfully transmitted and that no participant withdraws during training. A schematic of one round is shown in Fig. 2. In each round, only one client is activated and communicates with the server. After the communication, the activated client and the server update their models. The clients’ update order can be modeled as a sequence of length T: in the t-th iteration, client \(m_t\) is activated and picks the \(i_t\)-th sample for its update.

To model the client delays: if client \(m_t\) is activated at the t-th iteration, it updates its parameter \(w_{m_t}\), and its delay for the selected sample is reset; for all other client-sample pairs, the delay count is incremented by 1. Formally, the delay for client m and sample i is updated using the following equation:

$$\begin{aligned} \tau ^{t+1}_{i, m} = {\left\{ \begin{array}{ll} 1, &{} \quad m = m_t, i = i_t\\ \tau ^{t}_{i, m} + 1, &{}\quad \text {otherwise} \end{array}\right. } \end{aligned}$$
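This bookkeeping is straightforward to implement; below is a sketch with a dense counter array, where the sizes are illustrative assumptions.

```python
import numpy as np

n_samples, M = 1000, 4                    # illustrative sizes
tau = np.ones((n_samples, M), dtype=int)  # tau[i, m]: delay of client m, sample i

def update_delays(tau, m_t, i_t):
    """Delay recursion: every pair ages by one, except the one just refreshed."""
    tau += 1
    tau[i_t, m_t] = 1
    return tau

update_delays(tau, m_t=2, i_t=17)
```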

Taking the client delay \(\tau ^{t}_{i, m}\) into consideration, we can represent the set of parameters for the delayed clients as:

$$\begin{aligned} {\tilde{\textbf{w}}}^t = {{\textbf{w}}}^{t-\tau ^{t}_{i}} = [w_1^{t- \tau ^t_{i, 1} }, \ldots , w_M^{t-\tau ^t_{i, M}} ] \end{aligned}$$

3.4 Algorithm

By combining the ZOO on the client and FOO on the server, we designed an asynchronous VFL framework. The algorithm is presented in Algorithm 1, and the procedure of one update round is summarized in Fig. 2. The procedure of each training round can be summarized as follows: first, the client randomly selects one sample i, computes \(c_{i, m}\) and \(\hat{c}_{i, m}\), and sends them to the server. Upon receiving the query from client m, the server calculates the corresponding losses \(h_{i, m}\) and \(\hat{h}_{i, m}\) and sends them back to the client. The server updates its parameter using gradient descent (Eq. 4) immediately after sending the losses to the client. Finally, upon receiving \(h_{i, m}\) and \(\hat{h}_{i, m}\) from the server, the client updates its parameter using the stochastic gradient estimator given by Eq. 3.

Algorithm 1 Asynchronous VFL with Cascaded Hybrid Optimization
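To make the control flow of Algorithm 1 concrete, the following is a schematic, serially simulated round-by-round run on a toy problem; the linear models, squared loss, and all sizes and step sizes are illustrative assumptions rather than the framework’s prescribed choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: M = 2 clients with linear feature extractors, a linear server
# model, squared loss, and realizable labels (all illustrative assumptions).
n, d_feat, d_emb, M, T = 200, 8, 4, 2, 5000
X = [rng.standard_normal((n, d_feat)) for _ in range(M)]       # vertical split
y = sum(X[m] @ rng.standard_normal(d_feat) for m in range(M))  # server's labels

W = [0.1 * rng.standard_normal((d_emb, d_feat)) for _ in range(M)]  # clients
w0 = 0.1 * rng.standard_normal(M * d_emb)    # server model
cache = [X[m] @ W[m].T for m in range(M)]    # server-side embedding table
eta0, eta_m, mu = 5e-3, 1e-3, 1e-4           # server lr, client lr, smoothing

def loss(i):
    c_all = np.concatenate([cache[m][i] for m in range(M)])
    return 0.5 * (w0 @ c_all - y[i]) ** 2

print("initial loss:", np.mean([loss(i) for i in range(n)]))
for t in range(T):
    m, i = rng.integers(M), rng.integers(n)  # Assumption 6: indep. activation
    u = rng.standard_normal(W[m].shape)      # perturbation direction, kept local
    cache[m][i] = W[m] @ X[m][i]             # client sends c_{i,m} ...
    h = loss(i)                              # ... server replies with h_{i,m}
    cache[m][i] = (W[m] + mu * u) @ X[m][i]  # client sends the perturbed output
    h_hat = loss(i)                          # ... server replies with the loss
    cache[m][i] = W[m] @ X[m][i]             # table keeps unperturbed embedding
    c_all = np.concatenate([cache[k][i] for k in range(M)])
    w0 -= eta0 * (w0 @ c_all - y[i]) * c_all  # server: local FOO step (Eq. 4)
    W[m] -= eta_m * (h_hat - h) / mu * u      # client m: ZOO step (Eq. 3)
print("final loss:", np.mean([loss(i) for i in range(n)]))  # should decrease
```

In a real deployment, the two messages per query are network round trips and the loop body runs concurrently across clients; the serial loop above only mirrors the order of operations within one round.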

4 Convergence analysis

4.1 Theoretical challenges and advantages

The theoretical difficulty of our work comes from the cascaded hybrid optimization in the VFL, where different optimization methods are simultaneously applied to the upstream and downstream parts of the model. To the best of our knowledge, all related works in VFL consider only a single type of optimization method across the entire VFL within one iteration, so their analytic results can be derived by applying the same analytic steps to the whole framework. In contrast, our work requires different analytic procedures for different parts of the model, which poses a significant challenge: the analytic procedures for ZOO and FOO are vastly different, making it difficult to analyze the two optimizations cascaded in a single model.

The theoretical advantage of our framework compared to the ZOO-based VFL (Zhang et al., 2021) is that the convergence rate of our framework is no longer limited by the server’s parameter size, as stated in Remark 3. The complete proof of the convergence analysis is provided in “Appendix 1”.

4.2 Assumptions

Assumptions 1–4 are the basic assumptions for solving non-convex optimization problems with stochastic gradient descent (Ghadimi & Lan, 2013; Liu et al., 2019; Zhang et al., 2021). Assumption 1 states that the global minimum \(f^*\) is not \(-\infty \) (Ghadimi & Lan, 2013; Liu et al., 2018; Zhang et al., 2021). Assumption 2 models the smoothness of the loss function \(f(\cdot )\), with which we can relate the difference between gradients to the difference between the corresponding inputs in the domain. Assumption 3 is a common assumption for stochastic gradient descent, stating that the stochastic gradient estimate for sample i has no systematic error or bias (Ghadimi & Lan, 2013). Assumption 4 states that the variance of the gradient estimate is bounded (Liu et al., 2018).

Assumption 1

(Feasible optimal solution) Function f is bounded below, that is, there exists \(f^*\) such that

$$\begin{aligned} f^*:=\inf _{[w_0,{{\textbf{w}}}]\in {\mathbb {R}}^d} f(w_0,{{\textbf{w}}}) > -\infty . \end{aligned}$$

Assumption 2

(Lipschitz gradient) \(\nabla f_i\) is L-Lipschitz continuous w.r.t. all the parameters, i.e., there exists a constant L such that for all \( [w_0,{\textbf{w}}], [w_0', {\textbf{w}}']\)

$$\begin{aligned}&\left\| \nabla _{[w_0,{\textbf{w}}]} f_i (w_0, \textbf{w}) - \nabla _{[w_0,{\textbf{w}}]} f_i (w'_0, \textbf{w}') \right\| \le L \left\| [w_0,{\textbf{w}}]-[w_0',{\textbf{w}}'] \right\| \end{aligned}$$

Specifically, there exists an \(L_m>0\) for each party \(m=0,\cdots ,M\) such that \(\nabla _{w_m}f_i\) is \(L_m\)-Lipschitz continuous:

$$\begin{aligned}&\left\| \nabla _{w_m}f_i (w_0,\textbf{w}) - \nabla _{w_m}f_i (w_0',\textbf{w}') \right\| \le L_m \left\| [w_0,{\textbf{w}}]-[w_0',{\textbf{w}}'] \right\| \end{aligned}$$

Assumption 3

(Unbiased gradient) For \(m = 0, 1, \cdots , M \) and every data sample i, the stochastic partial derivatives for all participants are unbiased, i.e.,

$$\begin{aligned} \mathbb {E}_{i} \nabla _{w_m}f_i(w_0, \textbf{w}) = \nabla _{w_m}f_{}\left( w_0, \textbf{w} \right) \end{aligned}$$

Assumption 4

(Bounded variance) For \( m = 0, 1, \cdots , M \), there exist constants \(\sigma _m < \infty \) such that the variances of the stochastic partial derivatives are bounded:

$$\begin{aligned} \mathbb {E}_{i} \left\| \nabla _{w_m}f_i(w_0, \textbf{w}) - \nabla _{w_m}f(w_0, \textbf{w}) \right\| ^2 \le \sigma _m^2 \end{aligned}$$

Assumption 5 is a common assumption in the analysis of VFL, used to bound certain terms of the entire model once the remaining parts have been bounded (Castiglia et al., 2022; Gu et al., 2021; Zhang et al., 2021). We only apply this assumption in the parts of the convergence analysis that do not affect the analytic result.

Assumption 5

(Bounded block-coordinate gradient) The gradient w.r.t. each client’s parameters is bounded, i.e., there exist positive constants \({\textbf{G}}_m\) such that for every client \( m = 1, \cdots , M \) the following inequality holds:

$$\begin{aligned} \left\| \nabla _{w_m}{h_m}({w_m}; x_{i, m}) \right\| \le {\textbf{G}}_m \end{aligned}$$

Assumptions 6 and 7 are fundamental assumptions for analyzing asynchronous VFL (Zhang et al., 2021; Chen et al., 2020; Gu et al., 2021).

Assumption 6 states that the activation of each client in asynchronous VFL is independent, without which the convergence result cannot be further simplified. Assumption 7 states that the delay on the clients is bounded, without which the convergence cannot be achieved.

Assumption 6

(Independent client) The activated client \(m_t\) for the global iteration t is independent of \(m_0\), \(\cdots ,\) \(m_{t-1}\) and satisfies \({\mathbb {P}}(m_t=m):=p_m\)

Assumption 7

(Uniformly bounded delay) For each client m and each sample i, the delay at each global iteration t is bounded by a constant \(\tau \), i.e., \(\tau _{i, m}^t \le \tau \).

4.3 Theorems

Theorem 1

Under Assumptions 1–7, when solving Problem (1) with Algorithm 1, the following inequality holds:

$$\begin{aligned}&\frac{1}{T } {\sum _{t=0}^{T-1}} \mathbb {E}_{} \left\| \nabla f_{}\left( w_0^{t}, \textbf{w}^{t} \right) \right\| ^2 \overset{ }{\le }\ \frac{4p_*\mathbb {E}_{}\left( f^0 - f^* \right) }{T \eta } + \eta \left( 4 p_*L_*\sigma _*^2 + 8 p_*L_*d_*\sigma _*^2 + p_*L_*^3 \mu _*^2 d_*^2 \right) \\&\quad + \eta ^2\left( 18 p_*\tau ^2 L_*^2 d_*{\textbf{G}}_*^2 + 5 p_*\tau ^2 L_*^2 \mu _*^2 L_*^2 d_*^2 \right) + \mu _*^2\left( p_*L_*^3 d_*^2 \right) \end{aligned}$$

where \(L_*= \max _m \left\{ L, L_0, L_m \right\} \), \( d_*= \max _m \left\{ d_{m} \right\} \), \(\eta _0 = \eta _m = \eta \le \frac{1}{4 L_*d_*}\), \(\frac{1}{p_*} = \min _m p_m \), \(\mu _*= \max _m \left\{ \mu _m \right\} \), \({\textbf{G}}_*= \max _m \left\{ {\textbf{G}}_m \right\} \), and T is the number of iterations.

Remark 1

Theorem 1 shows that the major factors affecting convergence are the learning rate \(\eta \), the smoothing coefficient \(\mu _*\) of the ZOO, and the largest parameter size \(d_*\) among the clients.

Corollary 1

If we choose \(\eta = \frac{1}{\sqrt{T}}\) and \( \mu _*= \frac{1}{\sqrt{T}}\), we can derive

$$\begin{aligned}&\frac{1}{T} {\sum _{t=0}^{T-1}} \mathbb {E}_{} \left\| \nabla f_{}\left( w_0^{t}, \textbf{w}^{t} \right) \right\| ^2 \overset{ }{\le }\ \frac{1}{\sqrt{T}} \left[ 4p_*\mathbb {E}_{}\left( f^0 - f^* \right) + 4 p_*L_*\sigma _*^2 + 8 p_*L_*d_*\sigma _*^2 \right] \\&\quad + \frac{1}{T}\left( 18 p_*\tau ^2 L_*^2 d_*{\textbf{G}}_*^2 + 5 p_*\tau ^2 \mu _*^2 L_*^4 d_*^2 + p_*L_*^3 d_*^2 \right) \\&\quad + \frac{1}{T^\frac{3}{2}} \left( p_*L_*^3 d_*^2 \right) \end{aligned}$$

where the parameters are the same as that in Theorem 1.

Remark 2

Corollary 1 demonstrates the convergence of our cascaded hybrid optimization framework and shows that it converges in \({\mathcal {O}}\left( \frac{d_*}{\sqrt{T}} \right) \), where \(d_* = \underset{m}{\max }\ \left\{ d_{m} \right\} \) represents the largest model size among the clients, and T denotes the number of iterations.
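To see where this rate comes from, note that for large T the \(\frac{1}{\sqrt{T}}\) line dominates Corollary 1, and \(d_*\) enters it linearly:

$$\begin{aligned} \frac{1}{T} {\sum _{t=0}^{T-1}} \mathbb {E}_{} \left\| \nabla f_{}\left( w_0^{t}, \textbf{w}^{t} \right) \right\| ^2 = {\mathcal {O}}\left( \frac{p_*L_*d_*\sigma _*^2}{\sqrt{T}} \right) = {\mathcal {O}}\left( \frac{d_*}{\sqrt{T}} \right) \end{aligned}$$

treating \(p_*\), \(L_*\), and \(\sigma _*\) as problem-dependent constants; the \(\frac{1}{T}\) and \(\frac{1}{T^{3/2}}\) terms vanish faster.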

Remark 3

Comparing our convergence analysis result and ZOO-VFL (Zhang et al., 2021), our result does not include the parameter size of the server (\(d_0\)) in the constant terms, which demonstrates that the convergence of the global model is not limited by the size of the server’s parameter. Therefore, in our framework, the server can apply a larger model without impacting the convergence of the global model.

5 Security analysis

5.1 Threat model

We discuss the privacy protection of our framework under the “honest-but-curious” and “honest-but-colluded” models.

5.1.1 Honest-but-curious

The “honest-but-curious” threat model refers to a scenario in which a participant is honest and adheres to the protocol, but is curious about the data of other parties. This party may attempt to gain more knowledge about the data of other parties through communication between participants. Specifically, in VFL, clients seek to infer the label from the server, while the server aims to derive the feature from the client.

5.1.2 Honest-but-colluded

The “honest-but-colluded” threat model involves multiple participants colluding to gain more knowledge about the private data from other participants. Specifically, in VFL, clients may work together to infer the label from the server, or the server may collude with some clients to infer the feature from the remaining clients.

5.2 Theorem

Theorem 2

Our framework can defend against existing privacy inference attacks on VFL under the “honest-but-curious” and “honest-but-colluded” scenarios.

Proof

Defend Against Label Inference Attack: Our framework protects the label on the server by concealing its internal information from clients. Specifically, the server responds to the client with the losses of the model, which are limited to a single value for each batch, without revealing the domain of the target task. Moreover, the server keeps the internal details of its model and the domain information associated with the labels confidential from clients. This approach guarantees that the server acts as a black box to clients, allowing them to collaborate with the server without having access to any task-specific information.

In the context of the “honest-but-curious” model, one client in the VFL system attempts to infer the label from the server.

The “direct label inference” attack from Fu et al. (2022) is based on the gradient information provided by the server and relies on strong assumptions about both the attacker and the victim. Specifically, the attack assumes that the server simply sums the outputs from all clients and that the attacker has explicit knowledge of this fact. By exploiting this information, the label can be directly inferred from the sign of the elements in the gradient provided by the server. However, this attack is not feasible in our framework, as we do not transmit gradients to the clients, and the server’s model is agnostic to the clients rather than a simple summation.

The “model completion attack” from Fu et al. (2022) and the “forward embeddings leakage” from Sun et al. (2022) utilize the client’s local model and features to predict the labels held by the server. For these attacks to succeed, the local model and local features must represent the target task well. Moreover, these attacks cannot guarantee a correct label for any given sample. Additionally, they assume that the client has knowledge of the target task, which our proposed framework avoids.

Deep leakage from gradients and its variants (Zhu et al., 2019; Zhao et al., 2020; Jin et al., 2021) use the gradient provided by the server as an optimization objective to reconstruct the true labels of a sample. However, these attacks assume the attacker has access to the server’s model, which is not the case in our framework.

Under the “honest-but-colluded” model, some clients collude to infer the labels from the server; the attacker can access more information in this scenario.

Even if all clients collude, the “direct label inference attack” from Fu et al. (2022) still assumes that the clients know the server uses a simple summation model, which does not hold in our framework. The “model completion attack” from Fu et al. (2022) and the “forward embeddings” attack from Sun et al. (2022) can achieve a better representation of the global task if some clients collude. However, the clients still cannot access the task information from the server, so these attacks do not apply to our framework. In the “honest-but-colluded” model, “deep leakage from gradients” (Zhu et al., 2019) still requires gradient information from the server and assumes a simple summation model on the server, both of which are avoided in our framework.

Defend Against Feature Inference Attack: Our framework protects the client’s features by concealing their internal information from other participants. Clients send the model’s output for each batch to the server without revealing the feature’s domain. Additionally, the server is unable to access the client’s model information. As a result, adversaries view the client as a black box, only able to receive outputs from it. This makes it difficult to infer the feature from the client.

In the “honest-but-curious” model, the server attempts to infer the feature from the clients.

The “deep leakage from gradient” (Zhu et al., 2019) leverages the gradient as the optimization target to infer the feature from the client. However, this method assumes that the server, as the attacker, can access the client’s model, which is not possible through the protocol in our framework.

The model inversion attack (Fredrikson et al., 2015) uses a model’s output to recover its input, which has the potential to be used for feature inference attacks in VFL. However, this attack requires the attacker to adaptively query the target model, a capability the server does not possess in our framework.

The “honest-but-colluded” model allows the server to collude with certain clients to infer features from the remaining clients. Luo et al. (2021) consider a feature inference scenario with two participants, where one participant takes the roles of server and client and attempts to infer the features of the remaining client. They assume that the client uses a logistic regression model, which allows them to invert the model from its output. However, this method is not applicable to our framework because the client’s model is agnostic to the attacker. Weng et al. (2020) consider a similar VFL with an additional HE scheme and assume that the coordinator holding the private key also colludes, enabling the attacker to decrypt the communication. However, this approach is likewise not applicable to our framework, as they also assume a specific model on the client. \(\square \)

6 Experiments

In this section, we conduct extensive experiments to demonstrate the security of our framework, its convergence, and the feasibility of applying it to deep learning tasks.

6.1 Experiment setups

6.1.1 Datasets

We vertically partitioned each dataset among M clients, with each client holding an equal share of the features. The server held the labels. Both the clients and the server knew the sample IDs, enabling them to coordinate training on each sample. For the base experiment, we used the MNIST dataset (LeCun et al., 2010); the features of each image were flattened and equally distributed among the clients. For the image classification task, we used the CIFAR-10 dataset (Krizhevsky, 2009), with each client holding half of each image. For the natural language processing (NLP) task, we used the IMDb dataset (McAuley & Leskovec, 2013), where the client held the review text data.
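The vertical partitioning is straightforward to reproduce; below is a minimal sketch of the column-wise feature split, where the array shapes stand in for flattened MNIST images and the client count is an example.

```python
import numpy as np

def vertical_split(X, M):
    """Split flattened features column-wise among M clients; the server keeps y."""
    return np.array_split(X, M, axis=1)

X = np.random.rand(60000, 784)      # stand-in for flattened MNIST images
parts = vertical_split(X, 4)        # client m receives parts[m]
print([p.shape for p in parts])     # [(60000, 196), (60000, 196), ...]
```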

6.1.2 Models

We used a Multi-Layer Perceptron (MLP) for the base experiment to demonstrate the convergence rate of our framework. Although simple, it showed the advantage of our framework.

The base model for the clients was a single fully connected layer (FCL) with an input size equal to the feature size of the client’s data and an output size of 128 by default. The activation function was ReLU.

The base model for the server was a two-layer FCL whose input was the concatenation of all the clients’ outputs [\(c_{i, 1}, \ldots , c_{i, M}\)]. Since the clients updated asynchronously, the server held a table of [\(c_{i, 1}, \ldots , c_{i, M}\)]. When the server received an update from client m, it updated the corresponding \(c_{i, m}\) in the table and used the table as the input of its model. The embedding size of the first layer was 128 by default, and the output size of the second layer was the number of classes.
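This embedding table can be kept as a per-sample buffer that is overwritten whenever a client message arrives; a minimal PyTorch sketch with assumed sizes follows.

```python
import torch

n, M, d_emb, n_classes = 60000, 4, 128, 10            # assumed sizes
server_model = torch.nn.Sequential(                    # the two-layer base model
    torch.nn.Linear(M * d_emb, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, n_classes))
table = torch.zeros(n, M * d_emb)    # cached [c_{i,1}, ..., c_{i,M}] per sample

def on_client_message(i, m, c_im):
    """Refresh client m's slot for sample i, then evaluate the server model."""
    table[i, m * d_emb:(m + 1) * d_emb] = c_im
    return server_model(table[i:i + 1])

print(on_client_message(0, 2, torch.randn(d_emb)).shape)  # torch.Size([1, 10])
```

Stale slots simply retain the last embedding each client sent, which is exactly the delayed-parameter view modeled in Sect. 3.3.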

For the image classification task, we applied a split ResNet-18 model (He et al., 2016) on the VFL framework. There were two clients and one server. Each client held half of each image while the server held the labels. The clients preprocessed the images and passed them through the first convolutional layer of ResNet-18. The model on the server comprised the remaining parts of the ResNet-18 model.

For the NLP task, we applied a split DistilBERT (Devlin et al., 2018) model on the VFL framework. The network consisted of one client and one server, with the client holding the embedding layer of the transformer and the server holding the remaining parts of the model.

6.1.3 The frameworks for comparison

We conducted a comparative analysis of our asynchronous VFL framework against four baseline methods: VAFL (Chen et al., 2020), ZOO-VFL (Zhang et al., 2021), Split-Learning (Vepakomma et al., 2018), and Syn-ZOO-VFL. All baselines employ a single optimization method across the entire VFL, and we applied the same base models to all frameworks. While ZOO-VFL and Syn-ZOO-VFL share the same message transmission content as our framework, VAFL and Split-Learning transmit partial derivatives through the network, which poses a privacy risk. It is worth noting that our framework offers the same level of privacy security as ZOO-VFL and Syn-ZOO-VFL, whereas VAFL does not. Therefore, we regard the experiments on VAFL and Split-Learning as an upper bound for the convergence rate comparison among these frameworks, though they are not practical due to the privacy risk.

6.1.4 Training procedures

We employed different learning rates for the server and clients in our experiments, as their update times differ. The optimal learning rate \(\eta \) was selected from the range [0.020, 0.015, 0.010, 0.005, 0.001] for all frameworks. We chose this range because \(\eta =0.001\) was too small, resulting in slow convergence, while \(\eta =0.020\) was too large for ZOO to achieve satisfactory test accuracy. We set \(\mu \) to 0.001 for all experiments, which was the optimal parameter selected from the range [0.1, 0.01, 0.001, 0.0001, 0.00001] through preliminary experiments. To make a fair comparison, we applied the vanilla SGD strategy to all VFL frameworks. The number of training epochs was 100 by default to ensure model convergence.

For training the split ResNet-18 on distributed CIFAR-10, we trained the model for 40 epochs. To determine the optimal learning rate \(\eta \) for the framework, we searched within [0.03, 0.01, 0.003, 0.001] and selected the value with the highest test accuracy. For ZOO-VFL and Syn-ZOO-VFL, we searched for the optimal learning rate in an exponential manner, i.e., [\( \cdots , 3\times 10, 10, 3, 1, 0.3, 0.1, \cdots \)]. The upper limit for the search was where the loss kept increasing, and the lower limit was where the model training accuracy did not increase every epoch. We selected the learning rate that allowed the model to train the fastest.

For the NLP task, we finetuned the pre-trained DistilBERT model. Since the model is pre-trained, we set the number of training epochs to 10. The hyperparameter tuning scheme was the same as that used for the CIFAR-10 task. All of the test accuracies presented in this paper (including the Appendix) are derived from five independent runs.

6.2 A demonstration on defending against label inference attack

In this experiment, we aimed to demonstrate the security levels of ZOO-based VFL (ZOO-VFL, Syn-ZOO-VFL, and ours) and FOO-based VFL (Split-Learning and VAFL) against the direct label inference attack from Fu et al. (2022). The attack is only effective for “model without split” VFLs, where the server simply sums up the outputs from all clients. The threat model involves a curious client aiming to infer labels from the victim server. The client can design its query so as to acquire the partial derivative w.r.t. the global model’s output layer, i.e., \(\frac{\partial {\mathcal {L}}(y; y_i) }{\partial y^c}\), where y represents the probability output for all classes, \(y^c\) is the probability for the c-th class predicted by the model, and there are C classes in total. The label can be directly inferred from the sign of \(\frac{\partial {\mathcal {L}}(y; y_i) }{\partial y^c}\): if the sign is negative, then the label of sample i is c; otherwise, it is not. Note that this attack scenario, where the server model simply sums the outputs of the clients, is very strong (the server is extremely vulnerable). Nevertheless, it effectively demonstrates the vulnerability of transmitting gradients in VFL.

To simulate a curious client that wants to infer labels from the server, we designed a dummy client that directly generates a random vector \(c_{i, m} \in {\mathbb {R}}^C\) with elements sampled from \( {\mathcal {N}}(0, 1)\). The client then randomly selects a \(u\in {\mathbb {R}}^C\) to compute \( \hat{c}_{i, m} = c_{i, m} + u \). The server responds with the corresponding losses \(\hat{h}_{i, m}\) and \(h_{i, m}\), and the curious client estimates \(\frac{\partial {\mathcal {L}}(y; y_i) }{\partial y^c}\) using gradient estimation, i.e., \( \hat{\nabla }_{y}{\mathcal {L}}(y; y_i) = \frac{\phi (d)}{\mu }(\hat{h}_{i, m} - h_{i, m}) u \). In addition to the curious client, eavesdroppers also seek to infer labels from the server. However, when the clients are benign, an eavesdropper cannot obtain the client’s u; therefore, in the experiment, the eavesdropper randomly generates its own u to estimate the gradient.
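For intuition, the sketch below reproduces the sign rule the attack relies on, using a toy softmax-cross-entropy server that simply sums client logits; the class count, hidden label, and server model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, y_true = 10, 3                       # toy class count and hidden label

def server_reply_grad(logits):
    """Vulnerable FOO server (summation model): returns dL/dy for softmax-CE."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    g = p.copy()
    g[y_true] -= 1.0                    # softmax-CE gradient w.r.t. the logits
    return g

g = server_reply_grad(rng.standard_normal(C))
# Sign rule: the entry for the true class is the only negative one.
print(int(np.argmin(g)) == y_true)      # True
```

Under ZOO, the attacker sees only the scalar losses and must estimate this gradient from its own perturbation, which is why a single query yields only a slight advantage.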

We conducted the label inference attack on the MNIST dataset with a batch size of 64. The attack success rate was calculated by dividing the number of correctly predicted samples by the total number of samples. The VFL framework was run for a single epoch, during which the attacker predicted the labels of all samples based on the information obtained. The VFL framework consisted of two clients and one server, where the server model summed up the outputs from the clients and replied with the loss values w.r.t. the clients’ outputs. In the trial involving the curious client, there was one curious client and one benign client. In the trial involving the eavesdropper, both clients were benign.

The results are presented in Table 1, where each experiment consists of 5 independent trials. The table indicates that the use of FOO in VFL poses a serious privacy vulnerability, as both curious clients and eavesdroppers can infer labels with certainty. In contrast, when ZOO is applied to VFL, a malicious client that deliberately designs its queries gains only a slight advantage per query. Additionally, eavesdroppers were unable to infer the labels from the messages due to the lack of gradient information from the server.

Table 1 Demonstration with Direct Label Inference Attack

6.3 A demonstration on defending against feature inference attack

In this experiment, we demonstrate the capability of our framework to defend against feature inference attacks based on “deep leakage from gradients” (DLG) (Zhu et al., 2019). We also highlight the vulnerability of gradient-based VFL to such attacks.

We designed an experiment where the VFL involved two clients, each equipped with a Convolutional Neural Network (CNN). In the CNN architecture, the first two layers were convolutional layers with the Sigmoid activation function, and the final layer was a fully connected layer. The server aggregated the logits generated by each client by summation. Each client possessed half of each image from CIFAR-10 as its private dataset.

Without loss of generality, we assume that client 1 is the victim and the server is the curious party. We assume that at some stage of the training, the attacker obtained a snapshot of the model parameters of client 1 and the corresponding gradient w.r.t. sample i. The gradient information obtained by the attacker is \(\nabla _{w_1}f_i(w_0, \textbf{w}) = \frac{\partial f_i(w_0, \textbf{w}) }{ \partial w_1}\) under the FOO case (VAFL and Split-Learning), or \( \hat{\nabla }_{w_1} f_i (w_0, \textbf{w}) = \frac{\phi (d_1) }{\mu _1} \left( \hat{h}_{i, 1} - h_{i, 1} \right) u_{i, 1} \) under the ZOO case (ZOO-VFL and ours). Having obtained the model parameters and the gradient information, the attacker aims to reconstruct the private data \(x_{i, 1}\) held by client 1.
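For reference, DLG reconstructs the input by optimizing a dummy input so that its gradient matches the observed one. A minimal sketch of the matching objective under the FOO case follows; the linear victim model and the surrogate scalar output used in place of the training loss are illustrative assumptions (real DLG matches the gradient of the training loss and also optimizes dummy labels).

```python
import torch

torch.manual_seed(0)
victim = torch.nn.Linear(512, 10)       # stand-in for client 1's model snapshot
x_true = torch.randn(1, 512)            # the private half-image (flattened)
g_true = torch.autograd.grad(victim(x_true).sum(), victim.parameters())

x_dummy = torch.randn(1, 512, requires_grad=True)  # attacker's reconstruction
opt = torch.optim.LBFGS([x_dummy])

def closure():
    opt.zero_grad()
    g_dummy = torch.autograd.grad(victim(x_dummy).sum(),
                                  victim.parameters(), create_graph=True)
    # Gradient-matching objective: make the dummy gradient match the leaked one.
    loss = sum(((gd - gt) ** 2).sum() for gd, gt in zip(g_dummy, g_true))
    loss.backward()
    return loss

for _ in range(20):
    opt.step(closure)
print(torch.norm(x_dummy - x_true).item())  # near zero: the input is recovered
```

Under the ZOO case, the attacker only observes the noisy single-direction estimate \(\hat{\nabla }_{w_1} f_i\), so the matching target itself is randomized, which is what breaks the attack in our framework.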

We randomly selected an image from the CIFAR-10 dataset, specifically choosing the image at index 28, which belongs to the class “deer”. Figure 3 shows the original private data from the two clients, where client 1 (victim) held the left half of the picture. Figure 4a depicts the DLG attack on the First Order Optimization (FOO)-based model, while Fig. 4b showcases the DLG attack on the FOO-based model with Gaussian Noise \( {\mathcal {N}}(0, 0.03)\) added to each dimension of the gradient. Lastly, Fig. 4c illustrates the DLG attack on the ZOO-based model.

Fig. 3 The target data, with the victim client holding the left half

Fig. 4 DLG attack on the VFL framework

Our observations indicate that DLG successfully attacked the VFL model when an accurate gradient and a model snapshot were acquired. However, the DLG attack proved ineffective against our framework trained with ZOO. This outcome is likely attributable to the randomness introduced by ZOO, which prevents the attacker from obtaining accurate gradient information.

6.4 The convergence for different numbers of clients

In this experiment, we compared the convergence curves of our framework and the others with varying numbers of clients. With the base model, we set the number of clients to {4, 6, 8} and plotted the epoch-training accuracy curves in Fig. 5. As illustrated in the figure, our framework exhibited a more stable convergence rate than ZOO-VFL. The curve for ZOO-VFL displayed significant oscillation between the fifth and tenth epochs, primarily due to client delay; this phenomenon was far less pronounced in our framework. Table 2 shows the test accuracy achieved after training. Our framework showed a slight test accuracy loss compared to VAFL, a trade-off for improved privacy security. In contrast, our framework achieved much higher test accuracy than ZOO-VFL, indicating that ZOO-VFL does not possess good convergence characteristics.

Table 2 Test accuracy (%) for the experiments on convergence with different numbers of clients
Fig. 5 Learning curve for different numbers of clients

6.4.1 More robust hyperparameter tuning

When searching for the optimal learning rate, we observed that ZOO-VFL was more sensitive to the learning rate than VFL-Cascaded. This sensitivity is an undesirable characteristic for hyperparameter tuning, especially in federated learning, which introduces more hyperparameters than centralized training (Kairouz et al., 2019).

Assuming we have obtained the optimal learning rate for ZOO-VFL, even a slight increase in the learning rate can lead to a significant reduction in test accuracy, while a minor decrease can slow convergence and also decrease test accuracy. In contrast, our framework demonstrates greater resilience in learning rate selection, resulting in more stable performance under hyperparameter deviations.

To demonstrate this resilience, we report the test accuracy at different learning rates for ZOO-VFL and VFL-Cascaded. We selected the server learning rate from [0.020, 0.015, 0.010, 0.005, 0.001] and trained the model for 200 epochs to ensure convergence. The test accuracy is presented in Fig. 6. Our findings indicate that deviation from the optimal learning rate had a more significant impact on ZOO-VFL than on VFL-Cascaded.

Fig. 6 Robustness of the hyperparameter

In VFL, robust hyperparameters are favorable, as they require less tuning and fewer computational resources. This is particularly important because communication between the server and clients in VFL is costly.

6.5 The convergence for different server model sizes

6.5.1 Base model

In this experiment, we compared the convergence rates of our framework and the other frameworks using a variety of server model sizes. The frameworks were applied to four clients and one server, and we tested different widths of the server model, specifically the embedding size of its first layer. We varied this embedding size from the default value of 128 to 256 and 512, resulting in server model parameter counts of 66,954, 133,898, and 267,786, respectively.

The training curves are presented in Fig. 7a. As shown in the figure, for all model sizes, our framework converges more stably than ZOO-VFL, with the oscillation between the fifth and tenth epochs far less pronounced. Table 3 presents the test accuracy achieved after training. For all model sizes, our framework attains significantly higher test accuracy than ZOO-VFL. However, compared to VAFL, our framework trades approximately \(1\%\) of test accuracy for privacy security.

Fig. 7 Learning curves for different server model sizes

Table 3 The test accuracy (%) for the different model size experiments

To demonstrate the superiority of our framework in training larger models, we conducted tests on deep learning tasks, including image classification and text classification (NLP).

6.5.2 Image classification

The training curve for the image classification task on CIFAR-10 using the split ResNet-18 model is presented in Fig. 7b. As depicted in the figure, our framework maintains a reasonable convergence rate and is robust for the two best learning rates, where the best curve almost overlaps the training curve of VAFL. The training accuracy of ZOO-VFL increases only gradually from 0.10 to 0.22 during training, illustrating the slow-convergence problem of ZOO-VFL with large models. Table 3 shows the test accuracy. With our framework, a modified split ResNet-18 model achieves reasonable test accuracy within 40 training epochs.

6.5.3 Natural language processing

We also demonstrated that a more complex transformer-based model for NLP can be trained with our VFL framework. The training curve is depicted in Fig. 7c. The dataset comprises two classes; therefore, the training accuracy starts at around 50%.

The difference in convergence speed becomes more noticeable with a large model. In our framework, the training accuracy reached 94% in the second epoch, which took approximately 45 min. In contrast, ZOO-VFL’s training accuracy rose only from 50 to 70% over 10 epochs, requiring around 6 h of training time, and its test performance remained close to random guessing. Besides, the learning rate was more robust for VFL-Cascaded, with most of the tuned values proving effective; ZOO-VFL’s second-best learning rate exhibited much slower convergence, and its third-best learning rate failed to converge altogether. The test accuracy of our model is presented in Table 3. Since training for around 6 h runs contrary to the basic idea of fine-tuning, we tested the model after 2 epochs of training. The results demonstrate that our framework is capable of training a very large deep learning model.

7 Limitations and discussions

In our framework, we utilized ZOO and FOO strategically to address the demanding requirements of practical VFL. Specifically, we employed ZOO on the clients to maximize model applicability and privacy protection, and FOO on the server to accelerate convergence. We carefully balanced the advantages and disadvantages of ZOO and FOO across the different parts of the VFL model to ensure that our framework meets all requirements of practical VFL. A detailed comparison of the frameworks is presented in Table 4 (“S” for the server, “C” for the client, “F” for the entire framework). It is important to note that the inherent limitations of ZOO and FOO are not eliminated: ZOO’s slow convergence makes it unsuitable for large models on the client side, while the server can only host differentiable models.

Table 4 Comparison with typical VFL frameworks

However, our framework is more suitable for real-world application scenarios for several reasons. Firstly, in VFL, the server is the initiator and sole beneficiary of the framework, with all clients acting as collaborators. As such, it is more cost-effective for the server to train a larger model to achieve better prediction results, as only the server obtains the prediction. Secondly, the server typically has more computational resources than the clients, making it computationally efficient for the server to train a larger model. Thirdly, as the server is the initiator and has the ability to select its model, the model applicability of the server is not as critical in VFL. Conversely, for clients, their models are unknown to the initiator of the VFL, making the model-agnostic characteristic important. Therefore, our framework is more suitable for real-world applications than other frameworks that use a unified optimization method.

8 Conclusions

We proposed a novel VFL framework in which different optimization methods are applied to the upstream (server) and downstream (client) parts of the VFL in a cascaded manner, maximizing the benefits of both optimization methods. The clients are optimized with ZOO to protect privacy, while the server is optimized with FOO to accelerate convergence without compromising the framework’s privacy. Theoretical results demonstrated that our framework with cascaded hybrid optimization converges faster than ZOO-based VFL and that applying a large model on the server does not hinder convergence. Extensive experiments showed that our framework achieves better convergence characteristics than ZOO-based VFL while maintaining the same level of privacy security.