1 Introduction

The recent development of centralized federated learning has drawn considerable attention to powerful computing platforms such as cloud-based and edge-based systems. Federated learning has been applied in fields such as the Internet of Things (IoT), Natural Language Processing (NLP), and Image Processing. In a cloud-based system, a central server collects massive amounts of information from clients at the cost of excessive communication overhead. In an edge-based system, the central server pushes computation resources to the edge servers, which allows clients to jointly train deep models within communication range (Wang et al., 2019; Liu et al., 2020; Li et al., 2022). In centralized federated learning, each client runs stochastic gradient descent (SGD) locally, and a central server aggregates the clients' parameter updates for the next round until the model converges. Figure 1a shows the diagram of one round of the federated averaging algorithm (FedAvg).

FedAvg (McMahan et al., 2017) is a gradient-based and well-established centralized federated learning algorithm that allows clients to collaboratively train a model without sharing raw data. When client data is independent and identically distributed (IID), local gradients are unbiased estimates of the full gradients, and the method performs well under standard assumptions (Li et al., 2020; Kairouz & McMahan, 2021). However, this method relies heavily on data quantity and data distribution. When client data collected from different sources is Non-IID, averaging the different local models generates biased gradients and deviates from the true optimum. To alleviate this impact, several studies (Zhu et al., 2021; Wang et al., 2021) propose solutions such as sharing part of the private data as public data (Zhao et al., 2018), fine-tuning (Li et al., 2021; Karimireddy et al., 2020), and distillation-based approaches (Jeong et al., 2018; Zhang et al., 2021; Feng et al., 2021). Fine-tuning reduces the weight divergence between local and global models by adding a regularization term to improve model performance, but it is limited by model structure: when clients design different network architectures according to their computing power, the mismatch of gradient dimensions degrades performance during the aggregation stage (Li & Wang, 2019). Model complexity affects the learning ability of the model and is itself affected by model size and data distribution (Mohri et al., 2018; Hu et al., 2021). Distillation-based approaches start with a large pre-trained teacher model and train a smaller student model; they compress the model size and improve the performance of small models without being limited by model structure, but the student model cannot outperform the teacher. Deep mutual learning (DML) (Zhang et al., 2018) is not a one-way transfer between a static teacher and a student; instead, an ensemble of student models learns collaboratively, with all model parameters updated throughout training, and it has been integrated with federated learning (Li et al., 2022; Shen et al., 2020). In Fig. 1b, the client forks the initial global model as a meme model for local mutual training and uploads the meme model to the cloud server.

In this paper, we propose a Heterogeneous Hierarchical Federated Mutual Learning (HFML) method in the edge-based framework. We introduce deep mutual learning to mine knowledge from local data and use partial aggregation to guide local updates per client when local and edge models are heterogeneous. Our contributions are listed as follows:

  • We propose an edge-based federated learning framework and design a model assignment mechanism that allows each client to frequently perform local updates and transfer local knowledge to the edge model by deep mutual learning through the edge layer, achieving fast convergence.

  • We develop an easy-to-implement heterogeneous hierarchical federated mutual learning method named HFML in an edge-based system. We leverage partial model aggregation to reduce the number of local iterations and the training time while maintaining stable accuracy when client data is Non-IID.

  • We conduct multiple experiments in Non-IID settings for image classification. The results show that HFML outperforms FedAvg, FedProx, and FML methods on metrics such as accuracy, training time, and model complexity.

In this paper, Sect. 2 reviews the relevant studies on Non-IID data and model heterogeneity. Section 3 describes related preliminaries. We propose the HFML scheme in Sect. 4. Section 5 describes the experiments and analyzes the results. In Sect. 6, we summarize the content of this paper.

Fig. 1 Three centralized federated learning frameworks. a FedAvg b FML c HFML

2 Related work

2.1 Non-IID data

Data partition strategies are used to simulate real-world data distributions, including the Dirichlet distribution, skewed feature distribution, skewed label distribution, and quantity imbalance (Caldas et al., 2018; Li et al., 2021; Hsieh et al., 2020; Hsu et al., 2020). Note that Non-IID means that the local distribution hardly represents the global distribution. Local and global models can be regarded as containers of knowledge and rely heavily on massive data and the data distribution. Li et al. (2019) prove the convergence of the FedAvg algorithm for strongly convex problems in the Non-IID setting. Fine-tuning reduces the weight divergence between local and global models with the same structure (Munir et al., 2022; Li et al., 2018; Karimireddy et al., 2020). Aggregating multiple local models trained on Non-IID data is affected by inconsistent updates, which reduces model accuracy and convergence speed.

2.2 Model heterogeneity

Model heterogeneity reflects differences in data representation and learning ability. Knowledge distillation integrated with federated learning is used to compress models (Hinton et al., 2015; Anil et al., 2018; Seo et al., 2020; Chan et al., 2021; Jiang et al., 2020; Afonin et al., 2022; Yu et al., 2022); it transfers knowledge from large models to small models and is suited to low-memory devices. FedGKD (Pan & Sun, 2021) fuses global historical information to guide local models and weakens the over-fitting of local models. FedUFO (Zhang et al., 2021) addresses optimization inconsistency and feature divergence by introducing two consensus losses and extracting group data information from global and local models. FedHeNN (Makhija et al., 2022) allows agnostic architectures across peer clients and guides simultaneous training on each client.

2.3 Periodic aggregation

Periodic model aggregations reduce the communication cost in an edge-based system. Increasing parallel computation on clients can reduce the number of communication rounds in a centralized federated learning framework (Konecny et al., 2016; Rothchild et al., 2020; Matsuda et al., 2022). Lin et al. (2018) show that 99% of the gradient exchange is redundant in the communication process, and exchanging a large number of parameters increases unnecessary communication cost and extends the aggregation period. Tier-based federated learning partitions clients according to privacy levels and model performance to accelerate convergence under data heterogeneity (Wu et al., 2021; Chai et al., 2020, 2021; Mhaisen et al., 2022; Luo et al., 2020). HierFAVG (Liu et al., 2020) allows multiple edge servers to perform partial model aggregation in an edge-based system. For periodic aggregation optimization, FedBCD (Liu et al., 2019) lets each client perform a different number of local updates before uploading parameters to adjust the update direction. Lee et al. (2022) propose a partial model averaging method to address the slow convergence caused by model discrepancy across clients.

3 Preliminaries

In this section, we describe the federated averaging algorithm and model complexity. Then, we introduce the deep mutual learning method and related federated learning schemes.

3.1 Federated averaging algorithm

Suppose the network includes K clients. Given a dataset \(S = \{x_i,y_i\}^n_{i=1}\) with n samples from M classes, the k-th client holds \(n_k\) samples drawn from data distribution \(D_k(x_i, y_i)\), where \(n =\sum _{k=1}^{K}n_k\) and \(p_k = \frac{n_k}{n}\). The global objective function f(w) is formulated as follows:

$$\begin{aligned} min_w f(w) = \sum _{k=1}^{K}p_k F_k(w) \end{aligned}$$
(1)
$$\begin{aligned} F_k(w^k) = \frac{1}{n_k} \sum _{i=1}^{n_k}f_i(w^k) \end{aligned}$$
(2)

where \(F_k(\cdot )\) is the local objective function of the k-th client and \(f_i(w^k)=l(x_i,y_i; w^k)\) is the loss on sample \((x_i,y_i)\) under the weight parameters \(w^k\) of the local model. Each client updates its local parameters by SGD and the server aggregates the local models:

$$\begin{aligned} w^k \leftarrow w^k - \eta \nabla F_k(w^k) \end{aligned}$$
(3)
$$\begin{aligned} w^{global} \leftarrow \sum _{k=1}^K p_k w^k \end{aligned}$$
(4)

where \(w^{global}\) is the weight of the global model. Note that when \(D_k\) is IID, \(\mid F_{SUM}-F_{FED}\mid \le \delta \) and \(E_{D_k}[F_k(w)] =f(w)\) hold, where the performance of federated learning approximates that of centralized computing and \(\delta \) is a non-negative real number (Yang et al., 2019). These local update and aggregation steps are repeated to update the global model until it converges.
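As a minimal illustration of Eqs. 1-4, the sketch below runs one FedAvg round in PyTorch; the client data loaders, the model class, and the single-epoch local schedule are assumptions for illustration rather than the original implementation.

```python
import copy
import torch

def fedavg_round(global_model, client_loaders, lr=1e-3, local_epochs=1):
    """One FedAvg round: local SGD on every client (Eq. 3), then the weighted
    average of the local weights becomes the new global model (Eq. 4)."""
    local_states, sample_counts = [], []
    for loader in client_loaders:                      # one DataLoader per client k
        local_model = copy.deepcopy(global_model)      # client forks the global model
        optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)
        criterion = torch.nn.CrossEntropyLoss()
        for _ in range(local_epochs):
            for x, y in loader:
                optimizer.zero_grad()
                criterion(local_model(x), y).backward()
                optimizer.step()
        local_states.append(local_model.state_dict())
        sample_counts.append(len(loader.dataset))      # n_k

    n = float(sum(sample_counts))
    new_state = {key: sum((n_k / n) * state[key].float()            # p_k = n_k / n
                          for state, n_k in zip(local_states, sample_counts))
                 for key in local_states[0]}
    global_model.load_state_dict(new_state)
    return global_model
```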

3.2 Deep mutual learning

Deep mutual learning (Zhang et al., 2018) can be viewed as bidirectional knowledge transfer between student networks and is suitable for training on heterogeneous models. At each iteration, we compute the predictions of the two models and update both models’ parameters according to the predictions of the other.

Suppose there are two models \(\theta _1\) and \(\theta _2\). For a multi-class image classification task, the probability of class m for sample \(x_i\) predicted by model \(\theta _1\) is computed as

$$\begin{aligned} p^m_1(x_i)= \frac{exp(z^m_1)}{\sum ^M_{m=1}exp(z^m_1)} \end{aligned}$$
(5)

where the logit \(z_1^m\) is the output of the last layer of model \(\theta _1\) (the input to the softmax). \(p_1(\cdot )\) represents the prediction of model \(\theta _1\), named the soft targets. Deep mutual learning combines two losses: a conventional supervised cross-entropy loss \(L_{CE}\) between the predictions and the ground-truth labels, and a mimicry loss, the Kullback-Leibler (KL) divergence \(D_{KL}(\cdot )\), which quantifies the match between the soft predictions of the two models.

$$\begin{aligned} L_{CE} = -\sum ^N_{i=1}\sum ^M_{m=1}I(y_i,m)\,log(p^m_1(x_i)) \end{aligned}$$
(6)
$$\begin{aligned} D_{KL}(p_2\Vert p_1) = \sum ^N_{i=1} \sum ^M_{m=1} p^m_2(x_i)\, log\frac{p^m_2(x_i)}{p^m_1(x_i)} \end{aligned}$$
(7)

where I is an indicator function, \( I(y_i,m)=\left\{ \begin{aligned} 1&\quad y_i=m \\ 0&\quad y_i \ne m \\ \end{aligned} \right. \). The loss functions of models \(\theta _1\) and \(\theta _2\) are computed as

$$\begin{aligned} L_{\theta _1} = L_{C_1}+D_{KL}(p_2\Vert p_1) \end{aligned}$$
(8)
$$\begin{aligned} L_{\theta _2} = L_{C_2}+D_{KL}(p_1\Vert p_2) \end{aligned}$$
(9)

Deep mutual learning is integrated with federated learning to learn knowledge from data that is invisible to other parties. In FML (Shen et al., 2020), a meme model acts as a medium between the global model and the local models to address data, objective, and model heterogeneity (DOM), as shown in Fig. 1b. Student models train mutually instead of learning from a pre-trained teacher model. The loss functions \(L(\cdot )\) of \(C_{local}\) and \(C_{meme}\) are defined as follows:

$$\begin{aligned} \begin{aligned} L_{local} = \alpha L_{C_{local}} + (1-\alpha ) D_{KL}( p_{meme}\Vert p_{local}) \\ L_{meme} = \beta L_{C_{meme}} + (1-\beta ) D_{KL}( p_{local}\Vert p_{meme}) \end{aligned} \end{aligned}$$
(10)

where \(\alpha \) and \(\beta \) are hyper-parameters that control the proportion of knowledge transferred from the data and from the other model. When \(\beta =1\), federated mutual learning degenerates into the standard federated averaging algorithm.
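A possible PyTorch rendering of the paired losses in Eq. 10 is sketched below; the function name, the use of detached soft targets, and the absence of a distillation temperature are our assumptions.

```python
import torch.nn.functional as F

def mutual_losses(logits_local, logits_meme, labels, alpha=0.5, beta=0.5):
    """Cross-entropy plus KL mimicry terms for the local and meme models (Eq. 10)."""
    log_p_local = F.log_softmax(logits_local, dim=1)
    log_p_meme = F.log_softmax(logits_meme, dim=1)

    ce_local = F.cross_entropy(logits_local, labels)   # L_{C_local}
    ce_meme = F.cross_entropy(logits_meme, labels)     # L_{C_meme}

    # D_KL(p_meme || p_local): the local model mimics the meme model's soft targets
    kl_local = F.kl_div(log_p_local, log_p_meme.exp().detach(), reduction="batchmean")
    # D_KL(p_local || p_meme): the meme model mimics the local model's soft targets
    kl_meme = F.kl_div(log_p_meme, log_p_local.exp().detach(), reduction="batchmean")

    loss_local = alpha * ce_local + (1 - alpha) * kl_local
    loss_meme = beta * ce_meme + (1 - beta) * kl_meme
    return loss_local, loss_meme
```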

4 Methodology

In this section, we describe the Heterogeneous Hierarchical Federated Mutual Learning (HFML) method in an edge-based federated learning system that includes a cloud server, edge servers, and clients. Figure 1c shows the diagram of one round of HFML.

4.1 Formulation

Edge servers are denoted by \(S_{edge} = \{m_i, i=1, \cdots , M\}\) and clients are denoted by \(S_{client}=\{c_{i_j}, {i_j}=1, \cdots , N\}\). The edge models are denoted by \(\{C_i, i= 1, \cdots , M\}\) and the client models connected to edge server \(m_i\) by \(\{C_{i_j}, i_j = 1, \cdots , K\}\), where \(c_{i_j}\) is the j-th client connected to the i-th edge server and each edge server connects the same number of clients (Table 1).

Table 1 Notations

The current round is denoted by D. T denotes the number of global communication rounds between the cloud and the edge servers, t the number of partial communication rounds between the clients and the edge servers, and E the number of local epochs. \(p_{C_{i_j}}\) is computed as in Eq. 5 and \(p_I\) represents the periodically aggregated prediction:

$$\begin{aligned} p_I=\left\{ \begin{aligned}&\frac{1}{K}\sum ^K_{i_j=1}p_{C_{i_j}} \quad&D \mod E = 0 \\&p_{C_{i_j}} \quad&D \mod E \ne 0 \end{aligned} \right. \end{aligned}$$
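A small sketch of this periodic prediction aggregation; the function name and arguments (a list of the K connected clients' logits, the client index, the current round, and the local epoch count) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def aggregate_predictions(client_logits, j, round_d, local_epochs):
    """Return p_I: the average soft prediction over the K connected clients when
    D mod E == 0, and client j's own prediction p_{C_{i_j}} otherwise."""
    probs = [F.softmax(z, dim=1) for z in client_logits]   # each computed as in Eq. 5
    if round_d % local_epochs == 0:
        return torch.stack(probs).mean(dim=0)
    return probs[j]
```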

We rewrite the edge model loss function \(L_{C_i}\) and local model loss function \(L_{C_{i_j}}\) as follows:

$$\begin{aligned} \begin{aligned} L_{C_{i_j}}&= L_{C_{i_j}} + D_{KL}(p_{C_{i_j}} \Vert p_I) \\ L_{C_i}&= \alpha L_{C_{i}} + \beta D_{KL}(p_I \Vert p_{C_{i_j}}) \end{aligned} \end{aligned}$$
(11)

where \(L_{C_{i_j}}\) and \(L_{C_i}\) on the right-hand side are the cross-entropy losses computed by Eq. 6. The hyper-parameters adjust the strength of learning from the local data and default to 0.5 in all experiments. The edge and local models conduct DML and update their parameters:

$$\begin{aligned} w^{m+1}_{edge} \leftarrow w^m_{edge} - \eta \nabla L_{C_i} (w^m_{edge}, w^k_{local}) \end{aligned}$$
(12)
$$\begin{aligned} w^{k+1}_{local} \leftarrow w^k_{local} - \eta \nabla L_{C_{i_j}}(w^m_{edge}, w^k_{local}) \end{aligned}$$
(13)

Partial periodic aggregation at the edge layer and global model aggregation in the cloud are as follows.

$$\begin{aligned} w^{m_i}_{edge} \leftarrow \frac{1}{K}\sum _{i_j=1}^K w^{c_{i_j}}_{local} \end{aligned}$$
(14)
$$\begin{aligned} w_{global} \leftarrow \frac{1}{M}\sum _{i=1}^M w^{m_i}_{edge} \end{aligned}$$
(15)

In Fig. 1c, the edge servers download the global model (homogeneous or heterogeneous) from the cloud as edge models; each client downloads the edge model, trains it mutually with its local model, and uploads the updated model to the edge layer. Finally, the cloud aggregates the edge models into a global model. HFML is compatible with heterogeneous models on Non-IID data, with the edge layer acting as a knowledge transfer hub between the connected clients, as described in Algorithm 1.

Algorithm 1 HFML
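A minimal single-round sketch of our reading of Algorithm 1, reusing the `mutual_losses` helper from Sect. 3.2; it assumes homogeneous edge and local models so that the averaging in Eqs. 14-15 is well defined, substitutes the edge model's prediction for the periodically aggregated \(p_I\), and takes per-edge lists of PyTorch data loaders as given.

```python
import copy
import torch

def average_states(states):
    """Element-wise average of a list of model state_dicts."""
    return {k: sum(s[k].float() for s in states) / len(states) for k in states[0]}

def hfml_round(global_model, edge_clients, lr=1e-3, local_epochs=1):
    """One HFML round: mutual training on every client, partial aggregation at each
    edge server (Eq. 14), then global aggregation in the cloud (Eq. 15)."""
    edge_states = []
    for client_loaders in edge_clients:               # one list of DataLoaders per edge server m_i
        edge_model = copy.deepcopy(global_model)      # the edge server forks the global model
        local_models = [copy.deepcopy(global_model) for _ in client_loaders]
        for local_model, loader in zip(local_models, client_loaders):
            opt_l = torch.optim.SGD(local_model.parameters(), lr=lr)
            opt_e = torch.optim.SGD(edge_model.parameters(), lr=lr)
            for _ in range(local_epochs):
                for x, y in loader:
                    # Eq. 11, with the edge prediction standing in for p_I (simplification)
                    loss_local, loss_edge = mutual_losses(local_model(x), edge_model(x), y)
                    opt_l.zero_grad(); opt_e.zero_grad()
                    (loss_local + loss_edge).backward()   # soft targets are detached, so each
                    opt_l.step(); opt_e.step()            # model gets its own gradient (Eqs. 12-13)
        edge_states.append(average_states([m.state_dict() for m in local_models]))  # Eq. 14
    global_model.load_state_dict(average_states(edge_states))                       # Eq. 15
    return global_model
```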

5 Experiments

5.1 Models and datasets

We conduct extensive experiments on the CIFAR-10 (Krizhevsky et al., 2014) and CIFAR-100 (Krizhevsky et al., 2009) datasets, which are widely used in image classification tasks. CIFAR-10 consists of 50000 training images and 10000 test images in 10 classes, with 5000 training and 1000 test images per class. CIFAR-100 has the same total number of images as CIFAR-10 but 100 classes. All images in CIFAR-10/100 are 3-channel 32x32 RGB images. We simulate two settings by sorting and assigning label classes.

Global and local models are combinations of convolutional neural networks (CNN1, CNN2) and a Multi-Layer Perceptron (MLP). CNN1 is a convolutional neural network with two 3x3 convolution layers (the first with 6 channels, the second with 16, each followed by 2x2 max pooling and ReLU activation) and two fully connected layers. CNN2 is a convolutional neural network with three 3x3 convolution layers (each with 128 channels, followed by 2x2 max pooling and ReLU activation) and one fully connected layer. MLP consists of three fully connected layers with the nonlinear activation function ReLU. CNN1/CNN2 denotes that the global model is CNN1 and the local models are CNN2; CNN2/CNN1 denotes that the global model is CNN2 and the local models are CNN1; CNN1 (resp. CNN2) denotes that both the global and local models are CNN1 (resp. CNN2); MLP/CNN2 denotes that the global model is MLP and the local models are CNN2; CNN2/MLP denotes that the global model is CNN2 and the local models are MLP.
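A possible PyTorch rendering of CNN1 as described above; the hidden width of 120 and the 576-dimensional flattened feature size (which follows from unpadded 3x3 convolutions on 32x32 inputs) are our assumptions, since the text does not specify them.

```python
import torch.nn as nn

class CNN1(nn.Module):
    """Two 3x3 convolutions (6 and 16 channels), each followed by 2x2 max pooling
    and ReLU, plus two fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=3), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(6, 16, kernel_size=3), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 6 * 6, 120), nn.ReLU(),   # 32x32 input shrinks to 16x6x6 here
            nn.Linear(120, num_classes),
        )

    def forward(self, x):                            # x: (batch, 3, 32, 32)
        return self.classifier(self.features(x))
```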

5.2 Experiment settings

For a fair comparison, we perform image classification on the CIFAR-10/100 datasets in Non-IID settings with skewed label partitions. We compare the proposed HFML with the FedAvg (McMahan et al., 2017), FML (Shen et al., 2020), and FedProx (Li et al., 2020) schemes under the same conditions. The metrics include accuracy, training time, and model size. We consider an edge-based federated learning system and assume each edge server connects the same number of clients. T denotes the total number of communication rounds between the cloud and the edge servers, t the number of partial communication rounds between the clients and the edge servers, and E the number of local epochs. In both settings the clients have overlapping label classes. In setting 1, the label classes and sample sizes are similar: {3:3:4} for CIFAR-10 and {30:30:40} for CIFAR-100. In setting 2, the label classes and sample sizes differ widely: {6:2:2} for CIFAR-10 and {60:20:20} for CIFAR-100. The hyper-parameters are momentum = 0.9, weight_decay = \(5 \times 10^{-4}\), learning rate \(\eta = 10^{-3}\), and batch size \(B= 128\).
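The label-skewed splits could be generated as in the sketch below, which assigns each client a disjoint block of classes whose sizes follow the ratios above (e.g. {6:2:2} on CIFAR-10); the function name and the random class ordering are illustrative assumptions.

```python
import numpy as np
from torchvision import datasets

def label_skew_split(targets, class_counts=(6, 2, 2), seed=0):
    """Return one index array per client, where client j receives every sample whose
    label falls into its own block of class_counts[j] classes."""
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets)
    classes = rng.permutation(np.unique(targets))
    splits, start = [], 0
    for count in class_counts:
        client_classes = classes[start:start + count]
        splits.append(np.where(np.isin(targets, client_classes))[0])
        start += count
    return splits

# Example: three client index sets for CIFAR-10 in setting 2
train_set = datasets.CIFAR10(root="./data", train=True, download=True)
client_indices = label_skew_split(train_set.targets, class_counts=(6, 2, 2))
```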

5.3 Results

To compare the proposed method with the baselines, we run multiple experiments using homogeneous and heterogeneous models in all settings and evaluate training time, accuracy, and model complexity. The accuracy reported in all figures is the running best value over training rounds: whenever the current accuracy exceeds the previous best, the best value is updated; otherwise it remains unchanged.

5.3.1 Comparison of training time

We use deep mutual learning to improve performance during the training phase and partial periodic aggregations to approximate global aggregation during the inference phase. Table 2 shows that HFML spends less training time than the FedAvg, FedProx, and FML methods in setting 1 under both homogeneous and heterogeneous models with the same conditions. Although the training time of the different algorithms under homogeneous models on the same dataset varies widely, HFML reduces training time by up to 30%. The training time of the same method is affected by the dataset size, e.g., CIFAR-10 versus CIFAR-100, while the difference across models (MLP, CNN1, and CNN2) under the same algorithm is only a few minutes.

Table 2 Training time of four approaches in setting 1
Table 3 Top-1 accuracy (%) of the global model on CIFAR-10/100 using homogeneous and heterogeneous models in setting 1 and 2 for four approaches

5.3.2 Accuracy comparison

As shown in Table 3, the accuracy of CNN2 is higher than that of CNN1 and MLP for all four approaches under homogeneous models in all settings. When the global and local models are heterogeneous on CIFAR-10/100, such as CNN1/CNN2, CNN2/CNN1, and CNN2/MLP, HFML achieves a 2.9% accuracy improvement over FML on CIFAR-10/100 (Figs. 2 and 3) and a 2.04% improvement over the other approaches when both the global and local models are MLP in setting 1. Model performance is related to the degree of label skew and the model structure. When the global and local models are homogeneous, the accuracy of the deep global model (CNN2) is higher than that of the shallow models (CNN1, MLP).

FedAvg and FedProx fail to train when the local and global models differ, owing to the mismatch of gradient dimensions. FedAvg initializes the global model and updates model weights according to the sample sizes of the clients. FedProx constrains the divergence between local and global weights by adding a regularization term. FML aggregates the meme models at the cloud server and controls the proportion of knowledge from data and model through hyper-parameters. HFML adjusts knowledge fusion by partially and periodically aggregating edge models to approximate global aggregation in an edge-based system.

Fig. 2 The best accuracy on CIFAR-10 under heterogeneous models. a The global model is CNN1 and local models are CNN2 in setting 1. b The global model is CNN2 and local models are CNN1 in setting 1. c The global model is CNN1 and local models are CNN2 in setting 2. d The global model is CNN2 and local models are CNN1 in setting 2

Fig. 3 The best accuracy on CIFAR-100 under heterogeneous models. a The global model is CNN1 and local models are CNN2 in setting 1. b The global model is CNN2 and local models are CNN1 in setting 1. c The global model is CNN1 and local models are CNN2 in setting 2. d The global model is CNN2 and local models are CNN1 in setting 2

Fig. 4 The accuracy of the four methods on CIFAR-10 under heterogeneous models in setting 1. Node 0 represents the global model and Nodes 1-6 represent the local models

Figure 4 shows the accuracies of the global model and six local models on CIFAR-10 when they are heterogeneous (CNN1/CNN2, CNN2/CNN1, MLP/CNN2, and CNN2/MLP). HFML improves the accuracy of both the global model and the local models. Local performance is affected by the local data distribution and label classes, while the global model is affected by the distribution gap between clients.

Fig. 5 The best accuracy of the global model on CIFAR-10/100 under homogeneous models in setting 1. a The global model and local models are CNN1 on CIFAR-10. b The global model and local models are CNN2 on CIFAR-10. c The global model and local models are CNN1 on CIFAR-100. d The global model and local models are CNN2 on CIFAR-100

Fig. 6 The best accuracy of the global model on CIFAR-10/100 under homogeneous models in setting 2. a The global model and local models are CNN1 on CIFAR-10. b The global model and local models are CNN2 on CIFAR-10. c The global model and local models are CNN1 on CIFAR-100. d The global model and local models are CNN2 on CIFAR-100

Figures 5 and 6 show the best accuracy of the global model for the four algorithms with homogeneous models (CNN1 and CNN2) in settings 1 and 2; the global classification accuracy of the proposed method is higher than that of the baseline methods. Figure 7a and b show that HFML achieves the best results in all cases when the four algorithms use homogeneous MLP models.

Fig. 7 The best accuracy of the global model on CIFAR-10. a The global model and local models are MLP in setting 1. b The global model and local models are MLP in setting 2. c The global model is MLP and local models are CNN2 in setting 1. d The global model is MLP and local models are CNN2 in setting 2

Fig. 8 The best accuracy of HFML with hyper-parameters on CIFAR-10 using homogeneous and heterogeneous models. a The accuracy of homogeneous models in setting 1. b The accuracy of heterogeneous models in setting 2. c The accuracy of partial aggregations in setting 1. d The accuracy of partial aggregations in setting 1

Figure 7c and d show that HFML improves accuracy when the global and local models are heterogeneous. In Fig. 8a and b, the results show that the accuracy is only slightly affected by the hyper-parameters. The edge layer acts as a knowledge transfer hub between clients to guide local updates. We examine the relation between local epochs and partial periodic aggregation in Fig. 8c and d; the results show that the accuracy decreases as the number of edge aggregation rounds increases.

Table 4 Model complexity on Non-IID settings

5.3.3 Model complexity comparison

We run the code in the same hardware and software environments. Table 4 shows that the number of model parameters is related to the model structure and the dataset size; different algorithms trained with the same models in the same setting have the same number of parameters. For example, the global model parameters of CNN2/MLP are the same as those of CNN2; only the local model parameters differ. The number of parameters of CNN2 is 10 times that of CNN1. When the global model is larger than the local models, as in CNN2/MLP or CNN2/CNN1, the accuracy of the global model approximates that of CNN2 while consuming fewer resources (2.11M, 11.22M, and 17.55M parameters, respectively), which reduces the size of the gradients exchanged during the aggregation stage. FedAvg and FedProx collaboratively train a global model under homogeneous models, whereas FML and HFML focus on training on Non-IID data under both homogeneous and heterogeneous models.
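Parameter counts of the kind reported in Table 4 can be obtained with a one-liner; the sketch below reuses the illustrative CNN1 class from Sect. 5.1 (whose count will differ from the exact models used in the paper).

```python
# Count trainable parameters of a model (here the illustrative CNN1 from Sect. 5.1)
model = CNN1(num_classes=10)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"CNN1 parameters: {num_params / 1e6:.2f}M")
```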

5.4 Discussion

In this section, we discuss three aspects: data distribution, model structures, and training time. We compare the performance of the four methods from the perspective of data heterogeneity under heterogeneous and homogeneous models.

5.4.1 Impact of non-IID data

HFML focuses on training on Non-IID data under homogeneous and heterogeneous models. Fine-tuning mainly takes place in the training phase and depends on the model structure, whereas deep mutual learning can transfer the knowledge of the last layer. Performance is affected by model structure and data distribution during the multiple local iterations and the global aggregations. A significant update deviation causes the global model to deviate from the true optimum.

5.4.2 Impact of heterogeneous models

Models are regarded as containers for storing knowledge from different data, and model complexity is affected by model structure, model size, data distribution, and dataset size. Increasing the number of hidden units or parameters increases the generalization error. When different algorithms train models under the same conditions, model complexity can be measured using LANN (Hu et al., 2020), e.g., \(CNN2> MLP > CNN1\); the accuracy of a deep model is better than that of a shallow model, but its training time is longer. Table 4 shows that model complexity depends on the global model structure and parameters; for example, CNN2 has the same number of model parameters as CNN2/CNN1. Deep mutual learning transfers knowledge bidirectionally and is suitable for heterogeneous models. When the global model is smaller than the local models, the accuracy and storage of the global model are better than those of the homogeneous model CNN1, but it loses more accuracy than CNN2 and cannot trade off training time against accuracy.

5.4.3 Impact of edge layer

We connect the same number of clients to each edge server in the edge-based system. The hyper-parameters \(\alpha \) and \(\beta \) are regarded as the knowledge transfer rates from data and from models.

6 Conclusion

In this work, we propose a heterogeneous hierarchical federated mutual learning (HFML) method for edge-based systems, which addresses the inconsistency of gradient updates during the training stage and the mismatch of gradient dimensions during the aggregation stage in Non-IID settings. We use deep mutual learning to transfer and jointly mine invisible knowledge between local models and edge models and to update the models. We leverage partial aggregation to achieve fast convergence and reduce training time. In terms of accuracy and training time, HFML outperforms the FedAvg, FedProx, and FML schemes.