1 Introduction

With the rapid development of intelligent transportation systems, new computing methods have been widely deployed on the Internet of Vehicles (IoV). In particular, the integration of IoV and artificial intelligence has led to a trend of sharing data among vehicles and infrastructure. The shared data usually includes trajectories, surrounding information, operation information, etc. Vehicles can exploit the shared data to improve the driving experience and service quality; for example, they can build a regional traffic-flow model from it. However, sharing data can compromise the privacy of vehicles, which may lead to serious consequences.

In privacy-sensitive IoV scenarios, the federated learning (FL) mechanism [1, 2] is applied to the IoV to avoid the privacy disclosure caused by sharing data. Existing work [3,4,5] has successfully ensured that private vehicle data is never sent directly over the public channel. However, as previous work [6] has shown, private vehicle information can also be leaked from the gradient. Given this vulnerability, if an aggregation server (AS) or other entities obtain sufficient gradients, the private data of participants (e.g., vehicle positions and trajectories) is seriously threatened. Another concern is that gradients may come from malicious participants. On the one hand, malicious vehicles may upload incorrect gradients that degrade the accuracy of the model or even render the final global model unusable. On the other hand, the AS may forge the aggregated model parameters; if participants cannot recognize a modified global model, the entire FL process is compromised, potentially posing a serious threat to traffic safety.

In this paper, we address the data privacy leakage issue by integrating federated learning into the IoV. On this basis, we use masks to encrypt the model gradient and remove incorrect gradients without learning the gradients themselves. Finally, we verify the correctness of the global model. The contributions of the paper can be summarized as follows.

  • The proposed scheme divides the vehicles into multiple groups containing an appropriate number of vehicles, and vehicles in the same group use negotiated masks to encrypt the model gradient. Moreover, secret sharing is adopted to recover the masks of departed vehicles.

  • The proposed scheme distinguishes correct gradients from incorrect ones by computing the Pearson correlation coefficient on the gradients uploaded by vehicles. Lagrange interpolation is then adopted to verify the global aggregation result on the downlink from the AS to the RSUs.

  • A convolutional neural network (CNN) trained on the MNIST dataset is used to evaluate the performance of the proposed scheme. The experimental results demonstrate that the proposed scheme achieves high accuracy with acceptable overhead for the FL participants.

The remainder of the paper is organized as follows. Section 2 presents related work, and Sect. 3 presents background knowledge, including the system model, cryptographic primitives, and federated learning. Section 4 introduces the details of the mechanism. Analysis of correctness, privacy, and performance is presented in Sect. 5. Section 6 concludes the paper.

2 Related Work

In recent years, given the rising popularity of IoV [7,8,9,10], the data privacy of vehicles has increasingly become the focus of attention. Several studies have been proposed to solve related issues.

Traditional privacy protection schemes mainly apply cryptography to encrypt or hide collected data before sending it to a center. Hui Li et al. [11] designed an architecture for identity and location privacy protection in VANETs based on k-anonymity and dynamic threshold encryption. Han et al. [12] designed a vehicle privacy-preserving algorithm based on local differential privacy to minimize the possibility of exposing the regional semantic privacy of the k-location set. Ma et al. [13] performed homomorphic encryption on the sensitive part of the data and kept the ciphertext on the blockchain to preserve IoV data privacy. Zhang et al. [14] encrypted traffic-flow data with BGN homomorphic encryption to protect the travel direction when arriving at T-junctions or crossroads.

Although these approaches solve the data privacy issue to a certain extent, two issues remain: (1) with differential privacy, the added noise affects the availability of the data, while with homomorphic encryption, the computational overhead is unsuitable for the IoV; (2) a large amount of collected data must be sent, which leads to a potential threat of leaking sensitive data and high network bandwidth usage.

FL, as a distributed artificial intelligence approach, allows each participant to train a local model on its local private database, after which a center aggregates the local models to construct a global model. Compared with traditional privacy protection schemes, FL can enhance communication efficiency and privacy preservation [15]. Lu et al. [16] designed a secure data sharing scheme based on asynchronous FL and blockchain, which improves efficiency. A hierarchical FL algorithm with a multi-leader, multi-player game for knowledge sharing is proposed in [17]. Wu et al. [18] proposed a traffic-aware FL framework to enhance the motion control of vehicles. Although the above FL-based schemes avoid directly uploading large amounts of private data, the uploaded gradient is not protected, and private data is still likely to be exposed [6].

To prevent the leakage of privacy from the gradient, several studies have been proposed. Phong et al. [19] bridged deep learning and homomorphic encryption to ensure that the server cannot obtain user privacy while the accuracy is kept intact. Liu et al. [20] presented a privacy-enhanced FL (PEFL) framework based on homomorphic encryption, in which the gradient is processed only in ciphertext form throughout the FL process. Although homomorphic encryption is useful for privacy preservation, its time cost makes it unsuitable for the IoV. A scheme that adds masks to the gradient was proposed in [21]: two types of masks are added to the gradient, and each participant replies to only one mask-recovery request from the center, which effectively protects the gradient's privacy. A more efficient scheme [22] based on [21] was later proposed, requiring only logarithmic overhead. To verify the global aggregation result in FL, Fu et al. [23] designed a verifiable FL scheme using Lagrange interpolation, and Guo et al. [24] proposed a verifiable aggregation scheme for FL using a linear homomorphic hash and a cryptographic commitment. To achieve privacy preservation, we propose a privacy-enhanced federated learning scheme based on gradient encryption with masks.

Fig. 1. Proposed Federated Learning Framework in IoV

3 Preliminaries

3.1 System Model

As shown in Fig. 1, the proposed scheme consists of vehicles, Roadside Units (RSUs), and one Aggregation Server (AS).

Each vehicle executes federated learning to generate a local gradient on its local private data set; during the whole process, the vehicle transmits the gradient instead of the private data. The RSU is a kind of wireless infrastructure that, as a relay node, is responsible for organizing vehicles to execute the mask agreement and for checking the correctness of the received model parameters. The AS is responsible for constructing the global model; in the proposed scheme, the AS may return forged global model parameters to the other participants.

3.2 Cryptography Block

  • Hard problems. Let \(\mathbb {G}\) denote a cyclic group, \(g\in \mathbb {G}\) a generator of \(\mathbb {G}\), and q the prime order of \(\mathbb {G}\). The computational hard problems named Discrete Logarithm Problem (DLP), Computational Diffie-Hellman Problem (CDHP), and Decisional Diffie-Hellman Problem (DDHP) can be described as follows.

    1. DLP: Given one tuple {P, Q} (\(P,Q \in \mathbb {G}\)), where \(Q=P^x\) and \(x \in \mathbb {Z}_{q}^{*}\), the advantage of any probabilistic polynomial time (PPT) adversary in calculating x is negligible.

    2. CDHP: Given one tuple \(\{g,g^x,g^y \in \mathbb {G}\}\), where \(x, y \in \mathbb {Z}_{q}^{*}\), the advantage of any PPT adversary in calculating \(g^{xy} \in \mathbb {G}\) is negligible.

    3. DDHP: Given two tuples \(\{g,g^x,g^y,g^{xy} \in \mathbb {G}\}\) and \(\{g,g^x,g^y,g^z \in \mathbb {G}\}\), where \(x,y,z \in \mathbb {Z}_q^*\), it is decisionally hard for any PPT adversary to distinguish the two tuples.

  • Secret sharing. Shamir secret sharing is a threshold secret sharing scheme. It first constructs a degree-\(t-1\) polynomial and takes the secret \(k\) as the constant term of the polynomial

$$\begin{aligned} \begin{aligned} f_k(x) = k + \sum _{j=1}^{t-1} a_j x^j, a_j \in GF(q) \end{aligned} \end{aligned}$$
(1)

where \(q\) is a large prime and \(GF(q)\) is a finite field. Thus, according to formula 1, \(f_k(0)\) is the secret \(k\). Selecting n elements \(\{x_i \in GF(q), 1\le i \le n \}\) and feeding each \(x_i\) into \(f_k(x)\) yields \(f_k(x_i)\); any \(t\) pairs \(\{(x_i, f_k(x_i))\}\) can recover the degree-\(t-1\) polynomial \(f_k(x)\) as follows

$$\begin{aligned} \begin{aligned} f_k(x) = \sum _{i=1}^{t} f_k(x_i) \prod _{j=1,j \ne i} ^{t} \frac{x-x_j}{x_i - x_j} \end{aligned} \end{aligned}$$
(2)

Therefore, the secret \(k\) is secure as long as malicious participants cannot obtain \(t\) or more sub-secrets \(\{x_i, f_k(x_i)\}\).
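The sharing and recovery steps above can be sketched as follows; this is a minimal Python illustration of formulas 1 and 2, where the field modulus Q and the share points are arbitrary choices, not values fixed by the scheme.

```python
# Minimal sketch of Shamir (t, n) secret sharing over GF(Q).
import random

Q = 2**61 - 1  # a Mersenne prime used as the field modulus (illustrative)

def make_shares(secret, t, n):
    """Build a degree t-1 polynomial f with f(0) = secret; return n points."""
    coeffs = [secret] + [random.randrange(1, Q) for _ in range(t - 1)]
    def f(x):
        acc = 0
        for a in reversed(coeffs):   # Horner evaluation mod Q
            acc = (acc * x + a) % Q
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at x = 0 over any t shares (formula 2)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if j != i:
                num = num * (-xj) % Q          # factor (0 - x_j)
                den = den * (xi - xj) % Q
        # modular inverse of den via Fermat's little theorem (Q is prime)
        secret = (secret + yi * num * pow(den, Q - 2, Q)) % Q
    return secret
```

Any `t` of the `n` shares recover the secret, while fewer reveal nothing about it.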

3.3 Federated Learning

We leverage federated learning to protect the private data of vehicles. Assume there are N vehicles in the proposed scheme. Vehicle \(v_i\) (\(0 \le i \le N - 1\)) participates in the FL and cooperatively trains a model \(\mathcal {M}\) on its private data set \(D_i = \{(x_j, y_j), 0 \le j \le d_i - 1\}\), where \(d_i\) is the size of \(D_i\). A loss function quantifies the difference between the estimated and real values of the samples in \(D_i\), defined as follows:

$$\begin{aligned} E_i(\mathcal {M}) = \frac{1}{d_i} \sum _{j=0}^{d_i - 1}L(\mathcal {M}, x_j, y_j) \end{aligned}$$
(3)

where \(L(\mathcal {M}, x_j, y_j)\) is the loss on data sample \((x_j, y_j)\), and \(y_j\) is the label of \(x_j\). The global loss function \(E(\mathcal {M})\) can then be calculated as

$$\begin{aligned} E(\mathcal {M}) = \frac{1}{N}\sum _{i=0}^{N - 1}E_i(\mathcal {M}) \end{aligned}$$
(4)

In FL, each vehicle trains the model on its training set by the back-propagation algorithm and obtains the private gradient \(\omega _i = \frac{\partial E_i}{\partial \mathcal {M}}\). Vehicles and RSUs synchronously upload gradients to the AS for aggregation. The AS then returns the result to the RSUs, from which the vehicles download it.
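As a toy illustration of this training loop, the sketch below runs one FL round with a linear least-squares model and synthetic data in place of the paper's CNN and vehicle data sets; the model, data, and learning rate are all assumptions for demonstration, not the paper's setup.

```python
# One illustrative FL round: local gradients (formula 3), averaged
# aggregation (formula 4), and a gradient-descent update.
import numpy as np

rng = np.random.default_rng(0)
N, dim = 4, 8                          # number of vehicles, model dimension
M = np.zeros(dim)                      # global model parameters
data = [(rng.normal(size=(16, dim)), rng.normal(size=16)) for _ in range(N)]

def local_gradient(M, X, y):
    """Exact gradient of the local mean squared error E_i(M)."""
    return 2 * X.T @ (X @ M - y) / len(y)

def global_loss(M):
    """Average of the local losses, analogous to E(M)."""
    return float(np.mean([np.mean((X @ M - y) ** 2) for X, y in data]))

loss_before = global_loss(M)
grads = [local_gradient(M, X, y) for X, y in data]   # omega_i, computed locally
M = M - 0.05 * np.mean(grads, axis=0)                # aggregate and update
loss_after = global_loss(M)
```

Only the gradients `grads` leave the "vehicles"; the raw pairs `(X, y)` never do, which is the privacy argument FL starts from.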

4 The Proposed Scheme

We assume that most vehicles are honest, but a small number of malicious gradients generated by malicious vehicles can still affect the aggregation results of FL. In the proposed scheme, RSUs check the correctness of the gradients uploaded by vehicles in their local region and eliminate malicious gradients. Since the AS may forge the global aggregation result, RSUs apply the Lagrange interpolation function to the local region gradients so that the correctness of the global aggregation result can be verified. RSUs are regarded as honest participants, but throughout the whole process no RSU can learn the vehicles' private data or gradients. The CA generates public parameters for the registered vehicles and RSUs. Each vehicle generates a private-public key pair used for negotiating masks, and each RSU generates a private-public key pair used for negotiating a common integer sequence that serves as the Lagrange interpolation points.

4.1 Initialization

In this phase, the CA initializes all necessary system parameters and publishes them to the participants. The RSUs calculate the common sequences needed to verify the global parameters.


The CA generates a cyclic group \(\mathbb {G}\) with prime order \(q\) and randomly chooses a group generator \(g\) as a public parameter, which is used for calculating the vehicles' agreement keys and the RSUs' public keys. The CA also generates \(m + 1\) positive integers \(P = \{p, p_1, p_2, \cdots , p_{m}\}\), where \(gcd(p_i, p_j) = 1\) (\(i \ne j\), \(1 \le i,j \le m\)) and \(p\) is sufficiently large. Vehicles and RSUs receive the public parameters \(PP = \{\mathbb {G}, q, g, P\} \) to generate private-public key pairs; for example, vehicle \(v_i\) chooses \(SK_i \in \mathbb {Z}_q^*\) and calculates \(PK_{z, i} = g^{SK_{i}}\), and each RSU obtains its private-public pair with the same operations.

RSUs calculate three common sequences \(SeqA\), \(SeqB\) and \(SeqS\) as in Algorithm 1; these sequences are kept confidential from other entities to achieve the verifiability of the aggregated result from the AS. Let \(R = \{RSU_0, RSU_1, \cdots , RSU_{K-1}\}\).


4.2 Gradient Encryption

  • Mask agreement. In the proposed scheme, for any \(RSU_k\), \(0 \le k \le K - 1\), we divide the vehicles in the RSU's area into multiple groups, each containing \(h = \frac{N_k}{n_k}\) vehicles, where \(N_k\) is the number of vehicles in \(RSU_k\) and \(n_k\) is the number of groups in \(RSU_k\). Within a group, every vehicle negotiates masks with its neighbor vehicles. To describe the mask agreement process, we denote the vehicles that have joined it as an ordered sequence \(\{v_0, v_1, \cdots , v_{h-1}\}\) in every group. The following phases introduce the agreement process.

Vehicle \(v_i\) (\(0 \le i \le h - 1\)) in group \(g_z\) (\(0 \le z \le n_k -1 \)) sends \(PK_{z, i}\) to \(RSU_k\) when it enters the communication range of \(RSU_k\). \(RSU_k\) transmits all other public keys to vehicle \(v_i\) to execute the secret sharing within the group. Once a group is formed, \(RSU_k\) calculates the first mask code (FMC) \(m_{k,z,i}^{(R)}\) shared between \(RSU_k\) and vehicle \(v_i\) as follows:

$$\begin{aligned} \begin{aligned} m_{k,z,i}^{(R)} = Hash(PK_{z, i}^{SK_k}) \end{aligned} \end{aligned}$$
(5)

where \(Hash(*)\) is a function that maps \(*\) to an integer sequence of the same size as \(|\omega |\).

Then \(RSU_k\) sends \(\{PK_k || PK_{z,<i+1>} || c_{i,<i+1>} || PK_{z,<i-1>} || c_{i,<i-1>} || \{PK_{u}\}\}\) to each \(v_i\), \(i = 0,1,\cdots ,h-1\), where \(<\cdot > = \cdot \pmod {h}\), \(c_{i,<\cdot>} = -c_{<\cdot >, i}\), and \(\{PK_{u}\}\) is a set containing the other participants' public keys, excluding \(PK_{z,<i+1>}\) and \(PK_{z,<i-1>}\).

After receiving the message, vehicle \(v_i\) calculates the FMC \(m_{k,z,i}^{(R)} = Hash(PK_{k}^{SK_{z,i}})\) as well as \(k_1 = PK_{z,<i+1>}^{SK_{z,i}}\) and \(k_{2} = PK_{z,<i-1>}^{SK_{z,i}}\). To obtain the second mask code (SMC), vehicle \(v_i\) computes \(m_{i,<i+1>}^{(V)} = Hash(k_1)\) and \(m_{i,<i-1>}^{(V)} = Hash(k_2)\); the SMC is then calculated as follows.

$$\begin{aligned} \begin{aligned} m_{i}^{(V)} = c_{i,<i+1>} \cdot m_{i,<i+1>}^{(V)} + c_{i,<i-1>} \cdot m_{i,<i-1>}^{(V)} \end{aligned} \end{aligned}$$
(6)

Note that both \(m_{i}^{(V)}\) and \( m_{k,z,i}^{(R)}\) are added to the gradient to ensure its confidentiality.

Vehicle \(v_{i}\) executes secret sharing to share \(k_{1}\) and \(k_{2}\) with the vehicles \(v_j\), \(j = 0, 1, \cdots , h - 1\). In the proposed scheme, the Shamir algorithm is used for secret sharing. Vehicle \(v_i\) constructs two polynomials \(f_{k_1}(x)\) and \(f_{k_2}(x)\) as in formula 1 and generates the sub-secret \( s_j = \{f_{k_1}(x_j)||f_{k_2}(x_j)||x_j\}\) for vehicle \(v_j\). Then \(v_{i}\) encrypts the sub-secrets

$$\begin{aligned} \begin{aligned} s_{i, j} = Enc(PK_{j}, s_j) \end{aligned} \end{aligned}$$

where \(Enc(\cdot ,*)\) is an encryption function such as RSA, with \(\cdot \) the public key and \(*\) the plaintext. Vehicle \(v_i\) sends \(s_{i, j}\) to \(RSU_k\) for forwarding to vehicle \(v_j\).

  • Local training. The local training is implemented with distributed gradient descent. In the training process, we iteratively improve the accuracy of model \(\mathcal {M}\) by minimizing the global loss function in formula 4.

For every vehicle in the IoV, the goal of training is to find the gradient that updates \(\mathcal {M}\) so as to minimize the local loss function in formula 3.

Vehicle \(v_i\)(\(0 \le i \le h - 1\)) in \(RSU_k\)(\(0 \le k \le K - 1\)) calculates \(\omega _i = \frac{\partial E_i}{\partial \mathcal {M}}\), where \(\omega _i\) is a vector with dimension \(d=|\omega _i|\). Then vehicle \(v_i\) masks \(\omega _i\) as follows:

$$\begin{aligned} \begin{aligned} \omega _{k,z,i}^{(M)} = \omega _i + m_{i}^{(V)} + m_{k,z,i}^{(R)} \end{aligned} \end{aligned}$$
(7)
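A small numeric check of the masking in formulas 6 and 7 is sketched below: random integer vectors stand in for the hash-derived FMC and SMC masks, and once the RSU subtracts its FMCs, summing the masked gradients recovers the true group sum because the pairwise SMCs cancel. The group size, gradient length, and mask values are illustrative assumptions.

```python
# Toy check that the SMCs cancel in a ring-structured group sum.
import numpy as np

rng = np.random.default_rng(1)
h, d = 5, 6                                   # group size, gradient length
grads = rng.integers(0, 100, size=(h, d))     # stand-ins for omega_i

# symmetric pairwise masks m_{i,j} = m_{j,i}: both neighbors derive the
# same value by hashing the shared Diffie-Hellman key
pair = {}
for i in range(h):
    j = (i + 1) % h
    pair[(i, j)] = pair[(j, i)] = rng.integers(0, 100, size=d)

fmc = rng.integers(0, 100, size=(h, d))       # m^{(R)}_{k,z,i}, known to the RSU

# c_{i,<i+1>} = -c_{<i+1>,i}: take +1 toward the next neighbor, -1 toward
# the previous, so each pairwise mask appears once with each sign
masked = [grads[i] + pair[(i, (i + 1) % h)] - pair[(i, (i - 1) % h)] + fmc[i]
          for i in range(h)]

group_sum = sum(masked[i] - fmc[i] for i in range(h))   # RSU removes its FMCs
assert np.array_equal(group_sum, grads.sum(axis=0))     # SMCs cancel pairwise
```

Each individual `masked[i]` is statistically hidden by its masks; only the group sum becomes visible to the RSU, matching formula 9.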

4.3 Region Verification

The difference between a malicious gradient and a correct gradient is perceptible: a low similarity with the correct gradients means that a gradient is malicious with high probability. Therefore, the Pearson correlation coefficient between gradients can be calculated to distinguish correct gradients from malicious ones. The Pearson correlation coefficient is defined as follows:

$$\begin{aligned} \rho _{XY} = \frac{Cov(X,Y)}{\sigma (X)\sigma (Y) } \end{aligned}$$
(8)

where \(Cov(X,Y)\) is the covariance between random variables \(X\) and \(Y\), and \(\sigma (\cdot )\) is the standard deviation.
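The filtering idea behind formula 8 can be sketched as follows; the gradient vectors, the benchmark, and the threshold value 0.5 are illustrative assumptions rather than values from the paper's experiments.

```python
# Flag a group gradient sum as malicious when its Pearson correlation
# with the benchmark (the mean of the group sums) falls below a threshold.
import numpy as np

def pearson(x, y):
    """Formula 8: Cov(X, Y) / (sigma(X) * sigma(Y)), with sample ddof=1."""
    return np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

rng = np.random.default_rng(2)
base = rng.normal(size=100)
group_sums = [base + 0.1 * rng.normal(size=100) for _ in range(4)]
group_sums.append(rng.normal(size=100))      # one uncorrelated (malicious) group

benchmark = np.mean(group_sums, axis=0)      # benchmark gradient
kept = [w for w in group_sums if pearson(w, benchmark) > 0.5]
```

The four correlated sums survive while the uncorrelated one is discarded, mirroring how an RSU drops low-correlation group sums before aggregation.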

  • Verification. Once all vehicles of group \(g_z\) (\(0 \le z \le n_k - 1\)) have uploaded their masked gradients, \(RSU_k\) (\(0 \le k \le K -1\)) calculates the gradient sum of the group. We set

$$\begin{aligned} \begin{aligned} \omega _{k, z} = \sum _{i=0}^{h-1} \omega _{k,z,i}^{(M)} - m_{k,z,i}^{(R)} \end{aligned} \end{aligned}$$
(9)

If a vehicle (say \(v_i\)) leaves \(RSU_k\)'s communication range, the sum of the gradients cannot be recovered because \(\omega _{k, i}^{(M)}\) is missing. To eliminate the mask, \(RSU_k\) sends a recovery request to the remaining vehicles in the group to obtain \(v_i\)'s sub-secrets. Each remaining vehicle \(v_j\) (\(0 \le j \le h -1, j \ne i \)) sends \(dc_{i, j} = Dec(SK_{j}, s_{i, j}) \) to \(RSU_k\), where \(Dec(\cdot , *)\) is the decryption algorithm corresponding to the encryption algorithm \(Enc(\cdot , *)\). After receiving enough sub-secrets, \(RSU_k\) extracts \(\{x_j, f_{k_1}(x_j), f_{k_2}(x_j)\}\) from \(dc_{i, j}\) and executes formula 2 to obtain \(k_1\) and \(k_2\); \(RSU_k\) then calculates \(m_{i,i+1}^{(V)} = Hash(k_1)\) and \(m_{i,i-1}^{(V)} = Hash(k_2)\) to obtain the SMC of \(v_i\) as in formula 6. Finally, \(RSU_k\) adds the recovered SMC to the uploaded gradients to eliminate the masks as follows.

$$\begin{aligned} \begin{aligned} \omega _{k, z, i}&= \sum _{c = 0, c \ne i}^{h-1} (\omega _{k,z,c}^{(M)} - m_{k,z,c}^{(R)}) + m_{i}^{(V)} \end{aligned} \end{aligned}$$
(10)

Note that the value of \(\omega _{k, z, i}\) is the gradient sum \(\omega _{k, z}\) of group \(g_z\) without the departed vehicle's contribution; the RSU cannot obtain any specific gradient when recovering the SMC.

To calculate the Pearson correlation coefficient, the coordinate-wise mean of the group gradient sums, \(\bar{\omega _{k}}\) (\(0 \le k \le K - 1\)), should be calculated as the benchmark.

$$\begin{aligned} \bar{\omega _{k}} = \frac{1}{n_k}\sum _{r=0}^{n_k - 1} \omega _{k, r} \end{aligned}$$

\(RSU_k\) randomly selects \(X = \omega _{k, x}\) and \(Y = \omega _{k, y}\) (\(0 \le x, y \le n_k - 1\)) and calculates \(\rho _{XY}\) according to \(\bar{\omega _{k}}\) and formula 8. \(RSU_k\) discards \(\omega _{k, x}\) if \(\rho _{XY} \le l\), where \(l\) is the threshold value of the correlation coefficient. The number of remaining gradient sums is \(n_{k, re}\). Then \(RSU_k\) broadcasts \(n_{k,re}\) to \(R\). Before uploading gradients to the AS, \(RSU_k\) processes the gradient sums with Lagrange interpolation.

  • Interpolation. \(RSU_k\) converts the sum of gradients \(\omega _{k} = \sum _{r = 0}^{n_{k, re} - 1} \omega _{k, r}\) from floating-point numbers to the finite field as follows:

$$\begin{aligned} \omega _{k}=\left\{ \begin{aligned}&\left\lfloor \lambda \cdot \omega _{k} \right\rceil ,&\quad \left\lfloor \lambda \cdot \omega _{k} \right\rceil \ge 0\\&p + \left\lfloor \lambda \cdot \omega _{k} \right\rceil ,&\quad \left\lfloor \lambda \cdot \omega _{k} \right\rceil < 0\\ \end{aligned} \right. \end{aligned}$$

where \(\left\lfloor * \right\rceil \) denotes rounding and \(\lambda \) is a large integer (e.g., \(10^6\)) used to control accuracy. In this phase, the data uploaded by \(RSU_k\) to the AS are not gradients but Lagrange function results. First, \(RSU_k\) randomly selects \(m - 1\) arrays \(U_k = \{\mathbf {u_{k,i}}, i = 0, 1, \cdots , m - 2\}\) whose sum equals \(\omega _{k}\); note that each element of \(U_k\) is an array of dimension \(d\). \(RSU_k\) holds the parameters \(SeqS = [s_0, s_1, \cdots , s_{d-1}]\) and \(SeqA = [a_0, a_1, \cdots , a_{m-1}]\). According to (\(U_k, SeqA, SeqS\)), we obtain the points \(P_j = \{(a_i, u_{k,i,j}), i \in \{0,1, \cdots , m-2\}\} \cup \{(a_{m-1}, s_j)\}\), where \(u_{k,i,j}\) is the j-th element of \(u_{k, i}\). Lagrange interpolation can therefore be executed on \(P_j\) to generate the degree-\(m-1\) polynomial \(F_{k,j}(x)\) as in formula 2. \(RSU_k\) sends \(Pack_{k,j} = CRT[F_{k,j}(b_0), \cdots ,F_{k,j}(b_{m-1})]\) evaluated on \(SeqB = [b_0, b_1, \cdots , b_{m-1}]\), where \(CRT\) denotes the Chinese remainder theorem as in [23].
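The fixed-point conversion above, together with the signed decoding C(x) of formula 11 that later inverts it at the RSUs, can be sketched as follows; the prime p and the scaling factor \(\lambda\) are illustrative stand-ins, and the point is that addition in the field matches addition of the original signed values.

```python
# Fixed-point encoding of signed floats into GF(p) and its inverse C(x).
P_FIELD = 2**31 - 1          # illustrative large prime p
LAM = 10**6                  # scaling factor lambda controlling accuracy

def encode(w):
    """round(lambda * w), with negative values wrapped to p + value."""
    return round(LAM * w) % P_FIELD

def decode(x):
    """C(x): small residues are positive, large residues encode negatives."""
    return (x if x < (P_FIELD - 1) // 2 else x - P_FIELD) / LAM

# addition of encodings decodes to the sum of the originals
a, b = 1.25, -0.75
s = (encode(a) + encode(b)) % P_FIELD
assert abs(decode(s) - (a + b)) < 1e-6
```

This additive property is what lets the AS aggregate encoded values without ever seeing the underlying signed gradients.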

4.4 Aggregation and Update

After receiving \(\{(Pack_{k,0}, \cdots ,Pack_{k,d-1}), |k = 0, 1, \cdots , K-1\}\), AS executes aggregation as follows:

$$\begin{aligned} \omega ^{(G)} = (\sum _{k = 0}^{K-1} Pack_{k,0}, \sum _{k = 0}^{K-1} Pack_{k,1}, \cdots , \sum _{k = 0}^{K-1} Pack_{k,d-1}) \end{aligned}$$

Because the AS does not know the x-coordinates \(SeqA\) and \(SeqB\) corresponding to the packed function values \(Pack_{k,*}\), the AS cannot forge an aggregation result that would be verified successfully by an RSU. The AS then distributes \(\omega ^{(G)}\) to each RSU. After receiving \(\omega ^{(G)}\) from the AS and \(n_{re} = \sum _{i = 0}^{K-1} n_{i, re}\) from the other RSUs, \(RSU_k\) verifies \(\omega ^{(G)}\) by unpacking it, for each \(j = 0,1,\cdots , d-1\), as follows:

$$\begin{aligned} \begin{matrix} F_j(b_i) \equiv \sum _{k = 0}^{K-1} Pack_{k, j} \pmod {p_i} \end{matrix} \end{aligned}$$

As mentioned before, with the points \(\{(b_0, F_j(b_0)), \cdots , (b_{m-1}, F_j(b_{m-1}))\}\), \(RSU_k\) applies Lagrange interpolation to recover the polynomial \(F_j(x)\). \(RSU_k\) then calculates \(F_j(a_{m-1})\); the aggregation result is correct if \(K \cdot s_j = F_j(a_{m-1})\) holds. Next, \(RSU_k\) calculates the aggregated gradient as follows:

$$\begin{aligned} \omega ^{(G')} = (C(\sum _{c = 0}^{m - 2}F_0(a_c)), C(\sum _{c = 0}^{m - 2}F_1(a_c) ), \cdots , C(\sum _{c = 0}^{m - 2}F_{d-1}(a_c))) \end{aligned}$$

where

$$\begin{aligned} C(x)=\left\{ \begin{aligned}&x ,&x \in [0, \frac{p - 1}{2}) \\&x - p ,&x \in [\frac{p - 1}{2}, p)\\ \end{aligned} \right. \end{aligned}$$
(11)
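The whole pack-aggregate-verify round trip can be checked on a single gradient coordinate with the toy sketch below; the CRT packing is omitted, and the field modulus, the sequences SeqA/SeqB, and the RSU values are illustrative assumptions. Each RSU hides its value in a polynomial pinned to the shared checkpoint \((a_{m-1}, s)\), so the sum polynomial must evaluate to \(K \cdot s\) there.

```python
# End-to-end toy check of the Lagrange verification (one coordinate).
import random

Q = 2**31 - 1   # illustrative prime field modulus

def interp_eval(points, x):
    """Evaluate the Lagrange polynomial through `points` at x (mod Q)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (x - xj) % Q
                den = den * (xi - xj) % Q
        total = (total + yi * num * pow(den, Q - 2, Q)) % Q
    return total

m, K, s = 4, 3, 424242
A = [3, 5, 7, 11]                      # SeqA (last entry is the checkpoint)
B = [13, 17, 19, 23]                   # SeqB (evaluation points sent to AS)
omegas = [1000, 2000, 3000]            # each RSU's encoded gradient sum

packs = []
for w in omegas:
    u = [random.randrange(Q) for _ in range(m - 2)]
    u.append((w - sum(u)) % Q)         # m-1 shares summing to w
    pts = list(zip(A[:m - 1], u)) + [(A[m - 1], s)]
    packs.append([interp_eval(pts, b) for b in B])

agg = [sum(col) % Q for col in zip(*packs)]          # AS-side aggregation
points = list(zip(B, agg))
assert interp_eval(points, A[m - 1]) == (K * s) % Q  # checkpoint verification
recovered = sum(interp_eval(points, a) for a in A[:m - 1]) % Q
assert recovered == sum(omegas) % Q                  # true aggregate recovered
```

A forged aggregate from the AS would almost surely miss the hidden checkpoint, since the AS knows neither SeqA nor SeqB.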

Then \(RSU_k\) sends the global model parameters \(\omega ^{(G')}\) and \(n_{re}\) to each vehicle in its communication region. The vehicle updates the local model as follows:

$$\begin{aligned} \mathcal {M} = \mathcal {M} - \alpha \frac{\omega ^{(G')}}{n_{re}} \end{aligned}$$
(12)

where \(\alpha \) is the learning rate. The vehicles then iteratively perform local training until the global model \(\mathcal {M}\) is available. The proposed FL algorithm is illustrated in Algorithm 2.

5 Analysis

5.1 Correctness and Privacy

The masks FMC and SMC are added to the gradient; after receiving the masked gradients, the RSU performs formula 9 to obtain the sum of the gradients. For any group \(r\) in \(RSU_k\), the correctness of the formula is shown as follows.

$$\begin{aligned} \begin{aligned} \omega _{k,r}&= \sum _{i=0}^{h-1} \omega _{k,r,i}^{(M)} - m_{k,r,i}^{(R)} \\&= \sum _{i=0}^{h-1} \omega _i + c_{i,i +1} \cdot m_{i,i+1}^{(V)} + c_{i,i -1} \cdot m_{i,i-1}^{(V)} \\&= \sum _{i=0}^{h-1} \omega _i \end{aligned} \end{aligned}$$

If the vehicle \(v_i\)(\(0 \le i \le h-1\)) leaves, RSU performs formula 10, and the correctness of the formula is as follows.

$$\begin{aligned} \begin{aligned} \omega _{k,r}&= \sum _{j=0,j \ne i}^{h-1} (\omega _{k,r,j}^{(M)} - m_{k, r,j}^{(R)}) + m_{i}^{(V)} \\&= \sum _{j=0}^{h -1} \omega _{k,r,j}^{(M)} - \sum _{j=0}^{h-1} m_{k,r,j}^{(R)} - \omega _{k,r,i}^{(M)} + m_{k,r,i}^{(R)} + m_{i}^{(V)} \\&= \sum _{j = 0, j \ne i}^{h-1} \omega _j \end{aligned} \end{aligned}$$

In the proposed scheme, to achieve privacy preservation of the vehicles' data, we adopt FL to keep the raw private data local; furthermore, masks are used to protect the gradients.

We combine \(h\) vehicles \(\{v_0, v_1, \cdots , v_{h-1}\}\) into a group. If a vehicle masked its uploaded gradient \(\omega _i\) only with the SMC \(m_{i}^{(V)}\), its private gradient might leak: a vehicle may be curious about a neighbor's private gradient, and in particular, if the two neighbors of a vehicle conspire, the SMC can be eliminated. To address this gradient leakage caused by neighbor collusion, we add both the FMC \(m_{k,z,i}^{(R)}\) and the SMC \(m_i^{(V)}\) to the gradient. The FMC is known only to the RSU and the vehicle, so malicious neighbor vehicles cannot obtain \(v_i\)'s private gradient without knowing the FMC. Meanwhile, adding the SMC prevents the RSU from learning any specific gradient of a vehicle; the RSU only learns the sum of the \(h\) vehicles' gradients. As mentioned above, each private gradient is known only to its owner, and according to the computational and decisional hard problems described earlier, it is difficult to calculate the FMC or SMC without the corresponding private key. Hence, during the whole FL process, the private data and gradients of the vehicles are not leaked.

5.2 Performance

In this section, we give the performance analysis of the proposed scheme. Our simulation experiments are conducted on an Intel(R) Core(TM) i7-10875H at 2.30 GHz with 16 GB of memory.

  • Performance of RSU setup and agreement. RSUs perform Lagrange interpolation on a set of points whose x-coordinates come from the common sequences known only to the RSUs. In the experiments, we use the JPBC library on Java JDK 8 to execute Algorithm 1.

Fig. 2. The sequence generation overhead of RSU

Fig. 3. The mask agreement overhead of vehicle

Fig. 4. The mask agreement overhead with different numbers of gradient parameters

We measure the computational overhead of the proposed algorithm under different numbers of RSUs. As Fig. 2 shows, the overhead increases linearly with the number of RSUs. Since the frequency at which RSUs update the common sequences can be kept low, the cost of calculating the common sequences is acceptable. Next, we measure the cost of the agreement phase: we set \(h = 4, 5, \cdots , 14\) and fix the gradient length at \(10^6\). The computational overhead is shown in Fig. 3 and increases approximately linearly. A vehicle shares the secrets \(k_1\) and \(k_2\) with the vehicles within its group; the cost of the sharing operation is low because the number of vehicles in a group is small, and the most time-consuming operation in this phase is calculating the masks, since the length of the gradient is generally on the order of \(10^5\). Meanwhile, we set the number of CNN model parameters from \(10^5\) to \(12\times 10^5\) with \(h = 9\). The computational overhead for different gradient lengths is shown in Fig. 4; it increases approximately linearly with the gradient length.

Fig. 5. The accuracy with various numbers of vehicles

Fig. 6. The loss with various numbers of vehicles

Fig. 7. The accuracy in various numbers of groups with low correlation coefficient

Fig. 8. The loss in various numbers of groups with low correlation coefficient

Fig. 9. The performance with malicious gradients for scheme [23] and the proposed scheme

  • Performance of FL. We use the MNIST dataset to evaluate the proposed scheme. We divide the dataset into \(50, 80, 100\) parts and assign them to \(50, 80, 100\) vehicles, with \(h = 5\) vehicles per group, i.e., \(10, 16, 20\) groups join the FL. Each vehicle executes local training on its split of the dataset, using a convolutional neural network (CNN). The resulting accuracy and loss are shown in Figs. 5 and 6. With various numbers of vehicles and no low-correlation groups, 100 participating vehicles provide the highest accuracy and lowest loss, while 50 and 80 vehicles achieve almost the same accuracy and loss. The proposed scheme achieves a satisfactory result when no malicious vehicles are present.

To measure the results when malicious vehicles join, we execute FL with 100 data providers (20 groups) under different proportions of low-correlation groups. The resulting accuracy and loss are shown in Figs. 7 and 8: the accuracy is reduced and the loss is not as good as in Fig. 6. As shown in Fig. 9, compared with [23], our scheme performs better in the presence of malicious vehicles.

6 Conclusion

In this paper, we propose a privacy-preserving FL scheme for the IoV. The proposed scheme protects both vehicle data privacy and gradient privacy, and malicious participants' gradients can be removed at the RSUs by calculating the Pearson correlation coefficient. Meanwhile, we use Lagrange interpolation to verify the correctness of the result returned by the AS. To reduce overhead, we form small groups of vehicles to negotiate the masks. Numerical results confirm the effectiveness of the proposed scheme in terms of accuracy. In future work, we plan to reduce the data waste incurred when eliminating malicious gradients.