FedPeWS: Personalized Warmup via Subnetworks for Enhanced Heterogeneous Federated Learning

Nurbek Tastan¹   Samuel Horváth¹   Karthik Nandakumar¹,²

1Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
2Michigan State University (MSU)

Abstract

Statistical data heterogeneity is a significant barrier to convergence in federated learning (FL). While prior work has advanced heterogeneous FL through better optimization objectives, these methods fall short when there is extreme data heterogeneity among collaborating participants. We hypothesize that convergence under extreme data heterogeneity is primarily hindered due to the aggregation of conflicting updates from the participants in the initial collaboration rounds. To overcome this problem, we propose a warmup phase where each participant learns a personalized mask and updates only a subnetwork of the full model. This personalized warmup allows the participants to focus initially on learning specific subnetworks tailored to the heterogeneity of their data. After the warmup phase, the participants revert to standard federated optimization, where all parameters are communicated. We empirically demonstrate that the proposed personalized warmup via subnetworks (FedPeWS) approach improves accuracy and convergence speed over standard federated optimization methods.
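The two-phase schedule described above can be sketched in a few lines of code. The snippet below is a minimal, hypothetical NumPy illustration rather than the authors' implementation: local_train stands in for a participant's local optimization, the masks are assumed to be fixed 0/1 vectors over the model parameters, and the masked aggregation rule is one plausible choice.

import numpy as np

def local_train(x, data, lr=0.1):
    # Stand-in for several local SGD steps on the participant's own data.
    return x - lr * (x - data.mean(axis=0))

def fedpews(x0, datasets, masks, rounds, warmup_rounds):
    # Personalized warmup via subnetworks, followed by standard FedAvg (sketch).
    x_g = x0.copy()
    for t in range(rounds):
        local_models, weights = [], []
        for data, m in zip(datasets, masks):
            x_i = local_train(x_g.copy(), data)
            if t < warmup_rounds:
                # Warmup: only the participant's subnetwork moves; all other
                # coordinates keep their current global values.
                x_i = m * x_i + (1.0 - m) * x_g
                weights.append(m)
            else:
                # After warmup: every participant contributes all parameters.
                weights.append(np.ones_like(x_g))
            local_models.append(x_i)
        # Coordinate-wise weighted average of the (possibly masked) local models;
        # coordinates that no participant updated keep the previous global value.
        num = sum(w * x for w, x in zip(weights, local_models))
        den = sum(weights)
        x_g = np.where(den > 0, num / np.maximum(den, 1e-12), x_g)
    return x_g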


Conceptual illustration of training personalized subnetworks in federated learning.

FedPeWS - Main Algorithm


How does FedPeWS work?

Illustration of the proposed FedPeWS algorithm for two participants, which aggregates partial subnetworks ($x_i^t \odot m_i^t$) during the warmup phase to obtain a shared global model $x_g^t$. Here, $x_i^t$ and $m_i^t$ denote the local model and personalized mask of the $i^{\text{th}}$ participant in the $t^{\text{th}}$ round.
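In equation form, one plausible reading of the warmup aggregation depicted above (all operations element-wise; the exact normalization used in the paper may differ) is $x_g^t = \left(\sum_{i=1}^{N} m_i^t \odot x_i^t\right) \oslash \left(\sum_{i=1}^{N} m_i^t\right)$, with any coordinate selected by no participant carried over from the previous global model.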
FedPeWS-Fixed: Fixed Mask Generation
Illustration of manual mask setting in the FedPeWS-Fixed method. The left figure shows the complete network, with all neurons active and fully connected. The middle figure shows subnetwork 1, which uses only the left portion of the full network and corresponds to mask $m_1$. The right figure shows the portion of the network used for subnetwork 2. This setting is employed in all experiments with $N=2$ participants.
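Below is a minimal sketch of how such fixed, disjoint neuron masks could be generated for the $N=2$ setting. The helper is hypothetical (not the released code); in a full implementation, masks over the weight matrices would be derived from these neuron-level masks.

import numpy as np

def fixed_masks_two_participants(hidden_sizes):
    # For every hidden layer, assign the left half of the neurons to
    # participant 1 (mask m1) and the right half to participant 2 (mask m2),
    # so the two subnetworks do not overlap.
    m1, m2 = [], []
    for n in hidden_sizes:
        left = np.zeros(n)
        left[: n // 2] = 1.0    # neurons owned by participant 1
        right = 1.0 - left      # remaining neurons owned by participant 2
        m1.append(left)
        m2.append(right)
    return m1, m2

# Example: a network with two hidden layers of 200 and 100 neurons.
m1, m2 = fixed_masks_two_participants([200, 100])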

Results: Improved communication efficiency and accuracy

The required number of collaboration rounds to reach a target accuracy υ (%) and the final accuracy after T rounds. Results are averaged over 3 seeds. × indicates that the algorithm does not reach the target accuracy υ within T rounds, and NA means that it reaches υ in only one random seed.
Dataset / Batch size                | Synthetic-32K, 32 | Synthetic-32K, 32 | Synthetic-32K, 32 | Synthetic-3.2K, 8
Parameters {ηg / λ / τ}             | {1.0/5.0/0.125}   | {0.5/2.0/0.2}     | {0.25/1.0/0.1875} | {0.1/2.0/0.1}
Target accuracy υ (%)               | 99                | 90                | 75                | 99
Rounds to reach υ: FedAvg           | 148 ± 3.79        | 199 ± NA          | ×                 | 371 ± NA
Rounds to reach υ: FedAvg+PeWS      | 115 ± 7.21        | 182 ± 6.81        | 286 ± 7.93        | 301 ± 10.59
Final accuracy after T: FedAvg      | 99.94 ± 0.05      | 91.40 ± 7.25      | 67.64 ± 0.90      | 97.33 ± 3.89
Final accuracy after T: FedAvg+PeWS | 99.96 ± 0.01      | 99.49 ± 0.60      | 83.50 ± 3.52      | 99.66 ± 0.19


Results on Synthetic-{32, 3.2}K datasets with batch sizes {32, 8}, global learning rates ηg ∈ {1.0, 0.5, 0.25, 0.1}, and communication rounds T ∈ {200, 250, 400, 500}. FedPeWS consistently converges faster and outperforms FedAvg.
Visualization of validation accuracy and loss on the Synthetic-32K dataset with N=4.

Results: Sensitivity analysis

(a) CIFAR-MNIST dataset
(b) {Path-OCT-Tissue}MNIST dataset
Results for experiments with $T=300$ on (a) CIFAR-MNIST and (b) {Path-OCT-Tissue}MNIST datasets. (a) Participant 1 uses MNIST; Participant 2 uses CIFAR-10; ablation study for λ and τ. (b) N=3 participants use {PathMNIST, OCTMNIST, TissueMNIST}; ablation study for λ and τ. FedPeWS-Fixed results appear in the last row; τ=0.0 denotes FedAvg.
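As the caption notes, τ = 0.0 corresponds to plain FedAvg. We read τ as the fraction of the communication budget spent in the personalized warmup phase (λ being the other warmup hyperparameter swept here); under that reading, the schedule can be sketched as:

# Hedged sketch: mapping the warmup fraction tau to a round schedule.
# The tau and T values are illustrative, taken from the ablation grid above.
T = 300
tau = 0.1875
warmup_rounds = round(tau * T)     # rounds that use personalized subnetworks

for t in range(T):
    in_warmup = t < warmup_rounds  # tau = 0.0 -> no warmup, i.e. plain FedAvg
    # run a FedPeWS warmup round or a standard FedAvg round accordingly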

Results: Comparison to SOTA algorithms

Comparison to the SOTA algorithms
Method       | CIFAR-MNIST  | {P-O-T}MNIST
FedAvg       | 71.78 ± 0.66 | 52.83 ± 1.26
FedProx      | 72.27 ± 0.88 | 51.28 ± 1.03
SCAFFOLD     | 71.83 ± 0.24 | 53.05 ± 0.60
FedNova      | 71.63 ± 0.98 | 53.05 ± 0.83
MOON         | 71.84 ± 1.09 | 52.10 ± 0.19
FedAvg+PeWS  | 75.83 ± 0.88 | 55.12 ± 0.56
FedProx+PeWS | 75.04 ± 0.85 | 54.67 ± 0.43

Conclusion

In this work, we introduced personalized warmup via subnetworks for heterogeneous FL, a strategy that improves convergence speed and integrates seamlessly with existing federated optimization techniques. Our results demonstrate that the proposed FedPeWS approach achieves higher accuracy than the relevant baselines, especially under extreme statistical heterogeneity.

Contact

Contact me at nurbek [dot] tastan [at] mbzuai [dot] ac [dot] ae.

Citation

@InProceedings{tastan2025fedpews,
    title={Fed{PeWS}: Personalized Warmup via Subnetworks for Enhanced Heterogeneous Federated Learning},
    author={Nurbek Tastan and Samuel Horv{\'a}th and Martin Tak{\'a}{\v{c}} and Karthik Nandakumar},
    booktitle={The Second Conference on Parsimony and Learning (Proceedings Track)},
    year={2025},
    url={https://openreview.net/forum?id=iYwiyS1YdQ} 
}