1 Introduction

Federated learning is a new technique for training machine learning models across decentralized participants without accessing any party’s private data [1, 2, 3]. It has emerged as a promising paradigm for data-private multi-institutional collaborations: model training is distributed to the data owners and their results are aggregated, which addresses concerns about sharing data [4]. In a federated learning system, the central server (centralized coordinator) coordinates the learning process and aggregates the parameters from local machine learning models trained on each participant’s data [5]. Although such a design minimizes the risk of privacy leakage, the centralized coordinator is vulnerable to attacks and privacy breaches, and it becomes a single point of failure and a source of trust issues.

Blockchain offers assurances of reliability and usage transparency in decentralized settings, and researchers have started to investigate combinations of these two promising technologies [6, 7]. In this study, we took advantage of blockchain and federated learning and proposed a platform called Rahasak-ML [8]. Rather than using a centralized coordinator to aggregate and learn the global model, Rahasak-ML uses an incremental learning technique [9, 10] in which multiple peers in the blockchain network continuously train the models. Each peer in the blockchain manages its local storage and establishes local models [11]. Once a peer generates a model, it can be incrementally trained and aggregated by other peers through the blockchain by using the federated learning approach. Rahasak-ML stores information (e.g., the participating clients that generate and aggregate local models and the model generation times) in the blockchain ledger, which all participating parties can view. This provides a way to audit the system. All actions performed on the model are entirely traceable by each user, giving a clear history of all operations and incremental versions that existed. This system adds more transparency to the federated learning system by providing a traceable record of the model development, potentially alerting to adversarial machine learning attempts or fraudulent actions. Rahasak-ML makes the following contributions:

  • Integrates federated learning with blockchain to enable model sharing and aggregations without having centralized authority, increasing the transparency, trust, and provenance of the model generation;

  • Adds the ability to audit the federated learning system by storing task details (e.g., who generates local models and aggregates them, model generation times, etc.) in the blockchain;

  • Offers different functions in the platform that are implemented as independent services (microservices) that are easy to scale and deploy; and

  • Introduces a way to integrate the models in smart contracts to predict the output of real-time data.

The chapter is organized as follows. In Sect. 2, we briefly introduce federated learning, blockchain, and the role of these two technologies in drug discovery. In Sect. 3, we introduce the architecture of federated learning in the Rahasak-ML platform. In Sect. 4, we further explain the training process and the implementation in a medical use case and offer insights into related work. Finally, in Sect. 5, we discuss the future directions and open questions.

2 Overview of Federated Learning and Blockchain

2.1 Federated Learning

Machine learning represents a set of methods that can automatically uncover patterns in data and then use detected patterns to predict future data. Machine learning models show promise in aiding decision-making in healthcare [12, 13] and finance [14]. However, a large, diverse labeled dataset is the key to making a supervised machine learning model broadly effective. Collaborative learning is an efficient way to increase the data size and diversity, via multi-institutional data sharing for the training of a single model [4]. The current approach to achieving collaborative learning requires sharing the data with a third party to train a global model, such as using data repositories for different purposes (Fiscal Service Data Registry, [15]). However, this centralized approach presents many issues, such as high costs for data transmission and storage, security and privacy at high risk, lack of auditing, data ownership, and restrictions of data sharing, e.g., the Health Insurance Portability and Accountability Act (HIPAA) regulations in healthcare [16].

To address these security and privacy issues, a decentralized machine learning approach, i.e., federated learning [17, 18], has been proposed to build a shared machine learning model without storing or having access to any party's private data. In federated learning, the central server coordinates the learning process and aggregates the information from multiple participants (referred to here as “parties”) while keeping each participant's raw data private. At each iteration, each party downloads the global model parameters from the central server, trains the model locally on its private dataset, and sends its local model parameters to the central server for aggregation. The central server then gathers all the local model parameters, aggregates them, and updates the global model parameters for the next iteration. This learning process continues until pre-defined termination criteria are met. For example, the learning process finishes and exits automatically when the maximum number of iterations is reached or when the model accuracy exceeds a threshold.
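The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular system: the one-parameter linear model, the learning rate, the sample-count-weighted average, and the names `local_update`, `aggregate`, and `train` are all hypothetical, and a convergence tolerance stands in for an accuracy threshold.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs held privately by one party


def local_update(w: float, data: Dataset, lr: float = 0.05, epochs: int = 5) -> float:
    """Train y = w * x on one party's private data with gradient descent."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def aggregate(updates: List[Tuple[float, int]]) -> float:
    """Average local parameters, weighted by each party's sample count (assumed rule)."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def train(parties: List[Dataset], max_iters: int = 100, tol: float = 1e-6) -> float:
    """Repeat download -> local training -> aggregation until a stop criterion is met."""
    w_global = 0.0
    for _ in range(max_iters):  # criterion 1: maximum number of iterations
        updates = [(local_update(w_global, d), len(d)) for d in parties]
        w_new = aggregate(updates)
        if abs(w_new - w_global) < tol:  # criterion 2: assumed convergence tolerance
            return w_new
        w_global = w_new
    return w_global


# Three parties whose private data all follow y = 3x; only parameters are shared.
parties = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)], [(0.5, 1.5), (1.5, 4.5)]]
w = train(parties)
```

Only the scalar `w` crosses party boundaries in this sketch; the `(x, y)` pairs never leave the function that holds them, which is the property the paragraph above describes.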

2.2 Barriers and Challenges in Drug Discovery

Drug discovery involves identifying potential new medicines and requires knowledge of a wide range of scientific disciplines, such as biology, chemistry, and pharmacology. Developing a new drug is a complex, lengthy, and costly process with high uncertainty about whether a drug will succeed. The drug development pipeline includes multiple stages, from identifying targeted therapeutic agents to clinical trial designs, including Phases I, II, and III. Each stage is critical but faces challenges, such as insufficient knowledge about the underlying mechanisms of disease, the heterogeneity of patients who have diverse clinical phenotyping and endotyping, a lack of targets and biomarkers, small or biased samples in clinical trials, and regulatory challenges [19]. These hurdles create barriers to drug development, increasing costs and time and thus the risk of failure. To mitigate these challenges, researchers have moved toward computational approaches to accelerate the pipeline, such as using high-throughput virtual screening and molecular docking to reduce the number of compounds that need to be screened experimentally [20]. However, these approaches suffer from inaccuracy and inefficiency. Therefore, new methods and computing technologies to automate analytical model building for pharmaceuticals are needed and could transform drug discovery.

Today, the advances in high-throughput approaches to biology and disease present opportunities to pharmaceutical research and industry [21]. For example, multi-omics ranging from genome, proteome, transcriptome, metabolome, and epigenome are generated at unprecedented speed, improving the capabilities of systematically measuring and mining biological information. In addition, widely adopted electronic health records (EHR) and smart technologies capture detailed phenotypic patterns, allowing researchers to monitor patient outcomes and study medication treatments. The booming of such “big data,” including omics, images, clinical characteristics, social/environmental information, and literature, has driven much of researchers’ interest in harnessing machine learning to analyze and uncover novel findings and hidden patterns from the massive data [22, 23, 24].

Machine learning and deep learning are fundamental branches of artificial intelligence (AI), which refers to computer systems’ ability to learn from input or past data. AI has been applied successfully in many domains, such as image detection and natural language processing. Recently, AI algorithms have increasingly been applied in all stages of drug discovery, including screening chemical compounds, identifying novel targets [25], examining target–disease associations [26], improving small-molecule compound design and optimization, studying disease mechanisms [27], evaluating drug toxicity and physicochemical properties [28], predicting and monitoring the drug response [29], and identifying new indications for an existing drug, known as drug repositioning. Moreover, researchers have used machine learning models to optimize clinical trials, for example by estimating the risks of clinical trials more accurately [30] and by improving the patient pre-screening process, as well as approaches to feasibility, site selection, and trial selection [31].

From a machine learning viewpoint, it is desirable to have large and diverse data to inform model training, but access to data remains a challenge in drug discovery. Several public databases, such as ChEMBL [32] and PubChem [33], contain millions of biological assay results, which can provide training data for machine learning models that then predict biological activities or physical properties of drug-like molecules. However, these data represent only a small fraction of what has been measured, which might bias the machine learning models and affect model reliability and reproducibility [34]. Furthermore, many larger datasets are proprietary to pharmaceutical companies or publishers and are not publicly and freely available. To overcome these barriers, researchers have turned to federated learning to address the data acquisition and data bias problems faced by AI drug discovery while preserving confidentiality and customizing models for users [35].

Federated learning is a new machine learning paradigm in which multiple sites collaboratively learn a shared machine learning model while each site keeps its training data local [2]. Developing federated health AI technologies is essential and in high demand in medicine [13]. Examples include the European Union Innovative Medicines Initiative’s (https://www.imi.europa.eu/) projects for privacy-preserving federated machine learning. Chang et al. explored data-private collaborative learning methods for medical image classification models [36]. Xiong et al. [37] proposed using a federated learning framework to predict drug-related properties. In this architecture, each participating pharma company (peer) trains the model locally without sharing the training data. Each peer only encrypts and uploads its model updates, and a coordinator server aggregates the updates from all local clients and broadcasts the latest shared global model to them. Thus, individual pharma companies are able to fine-tune the machine learning model and effectively tailor it to their specific field of inquiry, with the individual research data remaining confidential.

2.3 Challenges in Federated Learning

While the federated learning process significantly reduces the risk of privacy leakage by avoiding the storage of raw datasets with a third party, it still presents some major vulnerabilities in the model architecture and the training process.

  • First, the central server that coordinates the shared global model is a single point of failure and a source of trust issues. Malicious behavior or a malfunction of the central server could produce inaccurate global model parameter updates that misrepresent the local model parameter updates sent by the parties. Therefore, decentralization of the entire federated learning process is necessary.

  • Second, during the learning process, malicious parties could send manipulated local model parameters to the central server, affecting the global model parameters. If such malicious local parameters are not detected or removed before aggregation, they will compromise the global model and lower the overall model accuracy [1, 38]. Some studies [39] have proposed approaches to verify model parameters, but they mainly rely on the data sample size and the computation time, which could be easily altered by malicious participants to avoid detection.

  • In addition, these studies do not address the quality of the data samples, which affects the accuracy and the convergence of the federated learning process. A malicious behavior that is harder to detect, the colluding attack, has exposed the vulnerabilities of existing defenses based on Sybil [40]. Thus, verifiable local model parameter updates are important for the accuracy of the global model parameters.

2.4 Blockchain Benefit for Federated Learning

Blockchain provides a shared digital ledger that records data in a public or private peer-to-peer network. It guarantees a decentralized trust system without involving trusted third parties. Multiple partners (nodes) can exist in the blockchain network, and each partner (node) has a copy of the data being maintained [41]. The data on the blockchain are organized into blocks. A block contains a set of records (transactions). Each block is linked to its previous block by containing the previous block's hash in its header. If someone were to tamper with the contents of one block, then all blocks in the blockchain following that block would be invalidated.
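The hash linking described above can be illustrated with a short sketch. The block layout (`index`, `prev_hash`, `transactions`), the use of SHA-256 over a JSON encoding, and the function names are assumptions made for this example; they are not the block format of any specific blockchain.

```python
import hashlib
import json
from typing import List, Optional


def block_hash(block: dict) -> str:
    """Hash a block's full contents (assumed: SHA-256 over canonical JSON)."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: List[dict], transactions: List[str]) -> None:
    """Add a block whose header field stores the hash of the previous block."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev, "transactions": transactions})


def first_invalid_block(chain: List[dict]) -> Optional[int]:
    """Return the index of the first block whose link to its predecessor is broken."""
    for i in range(1, len(chain)):
        if chain[i]["prev_hash"] != block_hash(chain[i - 1]):
            return i
    return None


chain: List[dict] = []
for txs in (["tx1"], ["tx2"], ["tx3"]):
    append_block(chain, txs)
```

Changing the transactions of block 0 changes its hash, so block 1 no longer points at it. Repairing that link would change block 1's own hash and break block 2 in turn, which is why every later block is effectively invalidated.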

Depending on the type of access and on where the nodes that support the blockchain are selected from, there are two primary types of blockchains: permissionless and permissioned. Permissionless blockchains deal with entirely untrusted/byzantine parties; examples are Bitcoin, Ethereum, and Rapidchain. Permissioned blockchains deal with trusted/known parties; examples are BigchainDB, Hyperledger, and HbasechainDB. Many blockchains, such as Bitcoin, are used for cryptocurrencies. Others, such as Ethereum and Hyperledger, support different transaction storage models related to other business or e-commerce activities. Recently, blockchain has quickly been applied to other areas, including the healthcare and drug industry [42, 43]. For example, studies have integrated blockchain with EHRs to allow the different stakeholders to manage EHRs transparently while guaranteeing fairness and usage (records access) consent [44].

To address the challenges of federated learning, we propose integrating blockchain with federated learning to replace the centralized coordinator. The blockchain network can be deployed among different peers, and the peers can train machine learning models with the data in their own local storage (e.g., off-chain storage). The local models generated by different peers can then be aggregated/averaged into a global model using the federated learning approach without a centralized coordinator. In blockchain-enabled federated learning systems, the model parameter sharing, local model generation, incremental model training, and model sharing functions can be implemented with smart contracts. All federated learning tasks that happen in the system (e.g., generating local models and aggregating them) are stored in the blockchain ledger and can be viewed by all participating parties. This provides a means to audit the system and adds more transparency to the federated learning process. Once local models are generated, they can be integrated into blockchain smart contracts (i.e., programs that direct client requests to the blockchain) to predict real-time data. This system adds more transparency to the federated learning system by providing a traceable history of the model development, potentially alerting to adversarial machine learning attempts or fraudulent actions.

2.5 The Benefits of Blockchain-Empowered Federated Learning for Drug Discovery

Blockchain-enabled federated learning enhances such infrastructure by decentralizing the architecture further and making the training process and model sharing more transparent and traceable. As a result, hospitals, institutions, and drug companies can achieve an accurate and generalizable model as more sites contribute their local insights while remaining in full control and possession of their data. This approach allows complete traceability of data access, limiting the risk of misuse by third parties. One example is the Machine Learning Ledger Orchestration for Drug Discovery (MELLODDY, https://www.melloddy.eu/), a consortium of pharmaceutical, technology, and academic partners that uses deep learning methods on the chemical libraries of ten pharma companies to create a modeling platform that can more quickly and accurately predict promising compounds for development, all without sacrificing the data privacy of the participating companies. Specifically, the benefits of a blockchain-empowered federated learning system are as follows:

  • Entails training algorithms across decentralized sites or devices holding data samples without exchanging those samples.

  • Enables small pharmaceutical companies and research institutions to achieve accurate, less biased models by gaining insights from other sites containing diverse data.

  • Provides a platform with more transparency, trust, and provenance for model training and sharing.

  • Provides the ability to audit the system and make sure local data and models are traceable. For example, task information such as who generates models, who aggregates parameters, and the model generation time would be recorded in the blockchain.

  • Offers flexibility with connecting more participating sites and devices.

  • Provides the ability to process real-time data.

3 The Rahasak-ML Platform

3.1 Overview

The Rahasak-ML platform integrates federated learning with blockchain to enable model sharing and model training without a centralized coordinator, which keeps the data private [45, 46]. The proposed platform has been implemented on top of the Rahasak blockchain [5], a highly scalable blockchain system for big data. The architecture of the Rahasak-ML federated learning environment is shown in Fig. 1.

Fig. 1
An illustration shows the microservices offered by different blockchain nodes. Nodes 1 through 5 connect to the Kafka platform, along with the certificate authority and distributed cache. Node 1 has the following microservices: Rahasak-ML service, Storage service, Aplos service, and Lokka service.

Rahasak-ML platform’s microservices-based architecture. Each blockchain node contains four services: Rahasak-ML service, Storage service, Aplos service, and Lokka service

The proposed platform has been designed with a microservice-based distributed system architecture [47]. In Rahasak-ML, all the functionalities are implemented as independent microservices. These services are Dockerized [48] and available for deployment using Kubernetes [49]. The following are the main services/components of the Rahasak-ML platform:

  • Storage service: Apache Cassandra-based block, transaction, and asset storage service [50].

  • Aplos service: smart contract service implemented using Scala functional programming language and Akka actors [51].

  • Lokka service: block creating service implemented using Scala and Akka streams [52].

  • Distributed message broker: Apache Kafka-based distributed publisher/subscriber service used as the consensus and message broker platform in the blockchain.

  • Rahasak-ML service: federated machine learning service.

  • Distributed cache: Etcd-based distributed key-value pair storage (open-source distributed key-value storage system).

  • Certificate authority: certificate authority that issues certificates for peers and clients.

Each peer in the network has its own off-chain storage for storing the raw data. The hash of these data is published to a blockchain ledger and shared with other peers. The blockchain storage on the Rahasak-ML platform keeps all its transactions, blocks, and asset information (hash of the data in off-chain storage) on Cassandra-based Elassandra Storage (https://github.com/strapdata/elassandra). It exposes Elasticsearch application programming interfaces [53] for transactions, blocks, and assets on the blockchain. Each peer in the blockchain can establish supervised or unsupervised machine learning models with the existing data on its own off-chain storage. Once a peer generates a model, it can be incrementally trained and aggregated by other peers through the blockchain by using the federated learning approach. The model parameter sharing, local model generation, incremental model training, and model sharing functions are implemented in the Rahasak-ML platform. Once machine learning models are generated, these models can be integrated into blockchain smart contracts to predict real-time data. Figure 2 shows the architecture of the Rahasak-ML services in a single blockchain peer.

Fig. 2
The flowchart shows the Rahasak ML architecture. The storage or data lake is connected to the Rahasak-ML-modeler and then to the spark model. The real-time data through the gateway and Rahasak-ML-streamer leads to Prometheus through the notification-service and Kibana through analytic-storage.

Rahasak-ML service architecture. Each blockchain peer has its own Rahasak-ML service. Machine learning models will be generated with the data on each peer’s off-chain storage

Each peer in the network runs its own Rahasak-ML service. The Rahasak-ML service contains the following components. All these components are Dockerized and deployed via Kubernetes.

  • Storage Service.

  • Rahasak-ML Modeler Service.

  • Rahasak-ML Streamer Service.

  • Gateway Service.

  • Apache Kafka.

3.2 Key Components

3.2.1 Storage Service

Each peer in the Rahasak-ML platform has two storage mechanisms: off-chain and on-chain storage. Both are built with Apache Cassandra-based Elassandra storage. Off-chain storage stores the data generated by the peers. The hash of these data is published to on-chain storage and shared with other peers. Blockchain keeps all its transactions, blocks, and asset information on this on-chain storage. The on-chain storage in each peer is connected in a ring cluster architecture. The data saved in one node are replicated to other nodes via this ring cluster. After transactions are executed with smart contracts, the state updates in a peer are saved in Cassandra storage and distributed to other peers, as shown in Fig. 3.

Fig. 3
A flowchart of Rahasak-ML storage service architecture is depicted. A client submits a transaction to the smart contract and then to the off-chain-storage 1. Saving hash in the chain leads to storage 1 through 5.

Rahasak-ML storage service architecture. Each peer comes with two types of storage: on-chain storage and off-chain storage. Off-chain storage stores the actual data generated by the peers. The hash of these data is published to on-chain storage and shared with other peers

Blockchain can keep any data structure as blockchain assets since it uses Cassandra as the underlying asset storage. As a use case of Rahasak-ML, the authors built a blockchain-based secure NetFlow network packet storage and network anomaly detection (e.g., network attack) service. It stored actual NetFlow packet data in the blockchain peers’ off-chain storage. The hash of the data was stored in the on-chain storage as a blockchain asset. The smart contracts in the blockchain parsed the NetFlow packets coming through the router and stored them in the blockchain storage. Rahasak-ML can build machine learning models with the data saved in the peers’ off-chain storage. In federated learning scenarios, the local models are stored in the off-chain storage. The hash of the model and storage Uniform Resource Identifier (URI) of the model are stored in on-chain storage and distributed with other peers.
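The split between off-chain model storage and an on-chain hash plus URI can be sketched as follows. This is an illustrative sketch only: plain Python containers stand in for the Cassandra-based storage, and the `ModelAsset` fields, the `offchain://` URI scheme, and the function names are hypothetical rather than part of the Rahasak-ML API.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModelAsset:
    """Hypothetical on-chain record: where the model lives and what it hashes to."""
    uri: str
    sha256: str
    owner: str


def publish_model(off_chain: Dict[str, bytes], ledger: List[ModelAsset],
                  peer: str, name: str, model_bytes: bytes) -> ModelAsset:
    """Keep the model bytes off-chain and record only its hash and URI on-chain."""
    uri = f"offchain://{peer}/{name}"  # assumed URI scheme, for illustration only
    off_chain[uri] = model_bytes
    asset = ModelAsset(uri=uri, sha256=hashlib.sha256(model_bytes).hexdigest(), owner=peer)
    ledger.append(asset)
    return asset


def verify_model(off_chain: Dict[str, bytes], asset: ModelAsset) -> bool:
    """Check that the off-chain bytes still match the hash recorded on-chain."""
    stored = off_chain.get(asset.uri)
    return stored is not None and hashlib.sha256(stored).hexdigest() == asset.sha256


off_chain_a: Dict[str, bytes] = {}   # stand-in for peer A's off-chain storage
ledger: List[ModelAsset] = []        # stand-in for the shared on-chain storage
asset = publish_model(off_chain_a, ledger, "peer-a", "model-v1", b"serialized-weights")
```

A peer that later fetches the model through the URI can call `verify_model` to detect whether the off-chain copy was altered after the asset was recorded.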

3.2.2 Rahasak-ML Modeler Service

The Rahasak-ML modeler service is responsible for building the machine learning model by analyzing the data in the peers’ off-chain storage. It supports building both supervised (e.g., Decision Tree, Random Forest, and Logistic Regression) and unsupervised (e.g., K-Means and Isolation Forest) machine learning models. To build a new machine learning model, the first step is training, which takes a dataset as input and adjusts the model weights to improve the model accuracy. The second step is testing, which takes an independent dataset to evaluate the accuracy.

Figure 4 shows the overall flow of these steps, which are performed by the Rahasak-ML Modeler service. Once the prediction model is built and trained by the Rahasak-ML Modeler service, it can be used to perform tasks on new data. In a federated learning environment, each peer in the network continuously trains the generated model with the data in its off-chain storage using an incremental training approach. The continuous model training can be done with real-time training libraries such as Spark Streams [54]. More information about the continuous model training is given in Sect. 4.

Fig. 4
A flowchart of the Rahasak-ML modeler service architecture is depicted. Seventy percent of the data set is sent to the training data set, which is used to train the model, and the remaining 30 percent is sent to the testing data set; the final production model is then formed.

Rahasak-ML modeler service architecture. Seventy percent of the data is used to train the model, and 30% will be used for testing
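The 70/30 split and the separate testing step can be sketched as below. This is a generic illustration, not the Modeler service's code: the shuffling with a fixed seed, the rounding rule, and the names `split_dataset` and `accuracy` are assumptions made for the example.

```python
import random
from typing import Callable, Iterable, List, Sequence, Tuple


def split_dataset(rows: Iterable, train_fraction: float = 0.7,
                  seed: int = 42) -> Tuple[List, List]:
    """Shuffle the rows and split them into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict: Callable, labeled_rows: Sequence[Tuple]) -> float:
    """Fraction of held-out (features, label) rows that the model predicts correctly."""
    correct = sum(1 for features, label in labeled_rows if predict(features) == label)
    return correct / len(labeled_rows)


# Usage: 100 rows give 70 rows for training and 30 independent rows for testing.
train_rows, test_rows = split_dataset(range(100))
```

Keeping the testing rows out of training is what makes the measured accuracy an estimate of performance on new data.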

Following model generation, the trained models can be used in smart contracts to predict/cluster real-time data. For example, the Rahasak-ML Modeler can be used to build Isolation Forest- and K-Means-based models to detect outliers in network traffic data. Such a model splits network data into two clusters: normal network traffic and suspicious (attack) network traffic. Once local models are built and aggregated, the models can be integrated into blockchain smart contracts to predict real-time network data. When new network packets come to the blockchain, smart contracts can use the model to predict the category (normal or suspicious) of the real-time network traffic.

3.2.3 Rahasak-ML Streamer Service

The Rahasak-ML streamer service clusters the real-time data with the machine learning models built by the Rahasak-ML Modeler service. It uses blockchain smart contracts [55, 56] to run the machine learning model on the newly generated data. Smart contract functions are written to use the model and predict the cluster output. This service consumes real-time data via Kafka (e.g., Kafka Streams and Spark Streams). For example, in the previously mentioned network traffic analysis scenario, the Rahasak-ML streamer consumes real-time network packets via Apache Kafka and runs them through the model built by the Rahasak-ML Modeler service. It decides the clustering output (normal or suspicious) of the new packets, and if a suspicious packet is found, it publishes the entry to a notification service. Alerts are then generated, notifying experts via notification dashboards (e.g., Prometheus and Grafana), as shown in Fig. 5.

Fig. 5
A flowchart of Rahasak-ML streamer service architecture is depicted. Real-time data from Kafka is sent to the Rahasak-ML streamer that predicts the anomaly with the spark model. Then anomalies are reported to the notification service. The alerts are sent via Prometheus.

Rahasak-ML streamer service architecture. Streamer service clusters the real-time data with the machine learning models built by the Rahasak-ML Modeler service
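The streamer's consume, predict, and notify loop can be sketched without any messaging infrastructure. Everything here is a stand-in: an iterable of JSON strings replaces the Kafka consumer, `threshold_model` is a hypothetical placeholder for a model produced by the Modeler service, and a plain callback replaces the notification service.

```python
import json
from typing import Callable, Iterable, List


def threshold_model(record: dict) -> str:
    """Hypothetical stand-in model: flag flows with very large bytes per packet."""
    bytes_per_packet = record["bytes"] / max(record["packets"], 1)
    return "suspicious" if bytes_per_packet > 10_000 else "normal"


def run_streamer(messages: Iterable[str], model: Callable[[dict], str],
                 notify: Callable[[dict], None]) -> List[str]:
    """Decode each message, label it with the model, and report suspicious records."""
    labels = []
    for message in messages:
        record = json.loads(message)
        label = model(record)
        labels.append(label)
        if label == "suspicious":
            notify(record)  # in the platform this would reach the notification service
    return labels


alerts: List[dict] = []
incoming = ['{"bytes": 120, "packets": 2}', '{"bytes": 900000, "packets": 3}']
labels = run_streamer(incoming, threshold_model, alerts.append)
```

Passing the model and the notifier as arguments mirrors the separation between the Modeler, Streamer, and notification services described above.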

3.2.4 Gateway Service

When analyzing real-time data, the Gateway service is used as the entry point to the Rahasak-ML platform. It fetches real-time data (or receives data pushed from other services) from various data sources, such as log fields, NetFlow, TCP, UDP, and databases. For example, the gateway service can receive real-time network traffic data via NetFlow. Once data arrive, they are prepared (by removing noise, parsing the data, etc.) and published to the Rahasak-ML streamer service via Kafka as JSON-encoded objects. When the platform receives NetFlow packets, it extracts relevant fields, aggregates them, constructs a JSON object, and forwards it to the Rahasak-ML streamer service via Kafka, as shown in Fig. 6.

Fig. 6
A flowchart of gateway service architecture is depicted. The real-time data is collected to the gateway service, and then the data is published to Kafka.

Gateway service architecture. Gateway service is used as the entry point to the Rahasak-ML platform. It fetches real-time data (or receives data pushed from other services) from various data sources such as log fields, NetFlow, TCP, UDP, and databases
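The gateway's preparation step (extract relevant fields, aggregate, build a JSON object) can be sketched as follows. The field names (`src_ip`, `dst_ip`, `protocol`, `bytes`, `packets`) and the choice to aggregate by source, destination, and protocol are assumptions for illustration; the fields the platform actually extracts are not specified here.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_flows(raw_records: Iterable[dict]) -> List[str]:
    """Drop malformed records, sum bytes/packets per flow, and emit JSON strings."""
    totals: Dict[Tuple[str, str, str], Dict[str, int]] = defaultdict(
        lambda: {"bytes": 0, "packets": 0})
    for rec in raw_records:
        try:
            key = (rec["src_ip"], rec["dst_ip"], rec["protocol"])
            nbytes, npackets = int(rec["bytes"]), int(rec["packets"])
        except (KeyError, TypeError, ValueError):
            continue  # noise removal: skip records with missing or non-numeric fields
        totals[key]["bytes"] += nbytes
        totals[key]["packets"] += npackets

    messages = []
    for (src, dst, proto), counts in sorted(totals.items()):
        obj = {"src_ip": src, "dst_ip": dst, "protocol": proto, **counts}
        messages.append(json.dumps(obj, sort_keys=True))  # ready to publish to a topic
    return messages


raw = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "protocol": "tcp", "bytes": "100", "packets": 1},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "protocol": "tcp", "bytes": 50, "packets": 2},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "protocol": "udp", "bytes": "n/a", "packets": 1},
]
messages = aggregate_flows(raw)
```

In this sketch the returned strings are what a gateway would hand to the message broker; the publishing call itself is left out because it depends on the Kafka client in use.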

3.2.5 Kafka Message Broker

Apache Kafka is the consensus and message broker service in the Rahasak-ML blockchain environment. The authors use a Reactive Programming and Reactive Streaming model [57] in which the services publish events/messages to Kafka. The relevant services subscribe to these events and take corresponding actions. The real-time data that come through the gateway service are published to Kafka first. The Rahasak-ML streamer service then consumes them and runs them through the model built by the Rahasak-ML Modeler service, as shown in Fig. 7.

Fig. 7
A flowchart of the Rahasak-ML message broker architecture is depicted. The real-time data from the gateway service are published to Kafka. The data are then ordered by Kafka and consumed from Kafka by the Rahasak-ML-streamer.

Rahasak-ML message broker architecture. Apache Kafka is the message broker of the Rahasak-ML platform. Each microservice communicates with other services via Kafka
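The publish/subscribe pattern that the services follow can be illustrated with a tiny in-memory broker. This sketch is not Kafka and does not use a Kafka client: `InMemoryBroker`, its methods, and the topic name `netflow` are hypothetical, and delivery here is synchronous and in publication order within one process.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class InMemoryBroker:
    """Toy topic-based broker: handlers receive messages in publication order."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        """Register a handler that is called for every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        """Deliver the message to every handler subscribed to the topic."""
        for handler in self._subscribers[topic]:
            handler(message)


broker = InMemoryBroker()
received_by_streamer: List[str] = []

# The streamer subscribes; the gateway publishes without knowing who consumes.
broker.subscribe("netflow", received_by_streamer.append)
broker.publish("netflow", '{"bytes": 120}')
broker.publish("netflow", '{"bytes": 900000}')
```

The point of the pattern is decoupling: the publisher and the subscriber only share a topic name, so services can be added or scaled independently.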

4 Rahasak-ML Federated Learning Process

4.1 Overview

Rahasak-ML proposes a blockchain-based federated learning approach to build and share the models. With this approach, model generation, incremental model training, model aggregation, and sharing can be done without a centralized authority. Conventional federated learning approaches increase privacy but still rely on centralized control to manage the process. Centralized control can be compromised, causing a potential weak link in the system and a lack of trust in the authority that owns the centralized server [2]. A blockchain-based decentralized system provides a logical ruleset that all participants are aware of and agree on, allowing participants to audit operations to ensure that all parties follow the rules. It improves the ability to audit and adds more transparency to the federated learning process. Each peer in the blockchain network incrementally trains the machine learning models with the data in its own local off-chain storage. Once all peers (or a majority of peers) have trained the model, the finalized model details are integrated into a block and published to the other peers in the network by the block-generating service of Rahasak-ML (Lokka service).

4.2 Incremental Training Flow

Assume a scenario where blockchain nodes are deployed in three companies, Companies A, B, and C. The blockchain is configured to store the data related to network traffic. Each company has its own off-chain storage, which stores the actual network traffic data. The hash of the network traffic data is published to the blockchain ledger. First, the Lokka service (which generates blocks) creates a genesis block with the incremental learning flow and the model parameters, as shown in Algorithm 1. Each peer in the network has its own Lokka service. The block creator is determined by a round-robin distributed scheduler. Consider the scenario in Fig. 8, which has three Lokka services: if the first block is created by Lokka A, the second block is created by Lokka B, and the third block is created by Lokka C. This process is repeated to generate future blocks.

Fig. 8
A circular flowchart shows the block creation cycle: get transactions from the storage, validate the transactions, create blocks 1 through 3, and save the blocks into the storage.

The block creator is determined by a round-robin distributed scheduler. The block approval process is performed via the federated consensus implemented between Lokka services
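The round-robin rotation described above can be illustrated with a minimal sketch. The function names `block_creators` and `creator_for_height` are hypothetical and not part of Rahasak-ML; the sketch also assumes a fixed, ordered list of Lokka services and 1-based block heights.

```python
from itertools import cycle, islice


def block_creators(lokka_services, num_blocks):
    """Return which Lokka service creates each of the next num_blocks blocks."""
    if not lokka_services:
        raise ValueError("at least one Lokka service is required")
    # cycle() repeats the list forever; islice() takes the first num_blocks turns.
    return list(islice(cycle(lokka_services), num_blocks))


def creator_for_height(lokka_services, height):
    """Return the creator of the block at a 1-based height (assumed convention)."""
    return lokka_services[(height - 1) % len(lokka_services)]


services = ["Lokka A", "Lokka B", "Lokka C"]
print(block_creators(services, 5))
print(creator_for_height(services, 4))
```

With three services, the fourth block falls to Lokka A again, matching the repeated rotation in Fig. 8.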

Algorithm 1 Training pipeline initialization

An algorithm of training pipeline initialization in eight steps.

The incremental learning flow defines the order of the model training process. When defining a learning flow, the Lokka service finds the existing nodes in the network via the distributed cache service in Rahasak-ML. Rahasak-ML uses the Etcd distributed key/value store as its distributed cache and service registry. Etcd stores the health information of the blockchain nodes in the network. When a blockchain node is added to the network, it registers its node name (with meta-information) in Etcd under a key with a time to live (TTL). The node periodically refreshes this key before the TTL expires to prove that it is alive. If a node dies or exits, its key is automatically removed from Etcd when the TTL expires. By reading the TTL keys in Etcd, other nodes can learn which nodes are available in the network. The order of the incremental learning flow is decided by the creation timestamp of each TTL key in Etcd, which records when the blockchain node was added to the network. Assume the Lokka service derives the incremental learning flow A→B→C from the TTL keys in the Etcd registry. This flow means that peer A produces a model, which is then incrementally trained by peer B and then by peer C. Once a miner node publishes the genesis block with the model parameters and incremental flow to the blockchain ledger, the other peers take the block and process it according to the defined flow, as shown in Fig. 9.
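The following sketch illustrates how a learning flow could be derived from TTL keys. It uses an in-memory stand-in for the registry, not the Etcd client API. The class `TTLRegistry`, its methods, and the explicit `now` argument are all assumptions made for illustration.

```python
class TTLRegistry:
    """In-memory stand-in for a TTL-keyed service registry (not the Etcd API)."""

    def __init__(self):
        self._nodes = {}  # node name -> (created_at, expires_at)

    def register(self, name, ttl, now):
        # The creation timestamp is fixed at registration and used for ordering.
        self._nodes[name] = (now, now + ttl)

    def refresh(self, name, ttl, now):
        # Heartbeat: extend the expiry, but only if the key is still alive.
        created_at, expires_at = self._nodes[name]
        if now >= expires_at:
            del self._nodes[name]
            raise KeyError(f"{name} expired before refresh")
        self._nodes[name] = (created_at, now + ttl)

    def alive(self, now):
        # Drop expired keys, mimicking automatic removal when the TTL is reached.
        self._nodes = {n: t for n, t in self._nodes.items() if t[1] > now}
        return set(self._nodes)

    def learning_flow(self, now):
        # Order the live nodes by the time their key was created.
        live = self.alive(now)
        return sorted(live, key=lambda n: self._nodes[n][0])


registry = TTLRegistry()
registry.register("A", ttl=10, now=0)
registry.register("B", ttl=10, now=2)
registry.register("C", ttl=10, now=5)
registry.refresh("A", ttl=10, now=8)
registry.refresh("C", ttl=10, now=9)
print(registry.learning_flow(now=9))   # all three nodes are still alive
print(registry.learning_flow(now=13))  # B never refreshed, so its key has expired
```

In this run, node B stops sending heartbeats, so it drops out of the flow once its key expires, while the relative order of A and C is preserved.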

Fig. 9
A systematic diagram of Rahasak-ML training pipeline shows the initialized model with parameters and flow, lokka trained into peers a, b, and c from off-chain a, b, and c, respectively. The finalized model is shown in peer c as lokka.

Rahasak-ML training pipeline. Once a miner node publishes the genesis block with machine learning model parameters and incremental flow to the blockchain ledger, other peers take the block and process it according to the defined flow

According to the incremental learning flow, peer A first generates the anomaly detection model from the data in its off-chain storage, based on the model parameters in the genesis block. It then saves the built model in its off-chain storage. The actual model is not published to the blockchain ledger or to any central storage. Instead, the hash and URI of the model saved in the off-chain storage are published to the blockchain ledger as a transaction. Peer B then starts to incrementally train the model built by peer A. To do so, peer B fetches the model from peer A’s off-chain storage using the given URI and trains it with the data in peer B’s off-chain storage. The trained model is saved in peer B’s off-chain storage, and peer B publishes the model hash and off-chain storage URI to the blockchain ledger as a transaction. Next, peer C incrementally trains the model trained by peer B and publishes the details to the blockchain ledger as a transaction, as shown in Algorithm 2.

Algorithm 2 Incremental training flow

An Algorithm of Incremental training flow with an if-else loop.
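The steps of the incremental training flow can be sketched as follows. This is an illustration under stated assumptions, not the Rahasak-ML implementation. Off-chain storage is modeled as a dictionary per peer and the ledger as a list. The URI format `peer/key` is hypothetical, and the "model" is a toy running mean standing in for the anomaly detection model.

```python
import hashlib
import json


def model_hash(model):
    """SHA-256 over a canonical JSON encoding of the model."""
    return hashlib.sha256(json.dumps(model, sort_keys=True).encode()).hexdigest()


def train(model, data):
    """Toy incremental update: fold new samples into a running mean."""
    n, mean = model["n"], model["mean"]
    for x in data:
        n += 1
        mean += (x - mean) / n
    return {"n": n, "mean": mean}


def run_flow(flow, datasets, init_model):
    storage = {peer: {} for peer in flow}  # per-peer off-chain storage
    ledger = []                            # transactions hold hash and URI only
    model = init_model
    for step, peer in enumerate(flow):
        if ledger:
            # Fetch the previous model via its URI and check it against the ledger.
            prev = ledger[-1]
            owner, key = prev["uri"].split("/", 1)
            fetched = storage[owner][key]
            if model_hash(fetched) != prev["hash"]:
                raise ValueError("fetched model does not match the ledger hash")
            model = fetched
        model = train(model, datasets[peer])  # train on local data only
        key = f"model-v{step + 1}"
        storage[peer][key] = model            # the model itself stays off-chain
        ledger.append({"peer": peer, "hash": model_hash(model), "uri": f"{peer}/{key}"})
    return ledger, storage


flow = ["A", "B", "C"]
datasets = {"A": [1.0, 2.0], "B": [3.0], "C": [4.0, 5.0]}
ledger, storage = run_flow(flow, datasets, {"n": 0, "mean": 0.0})
for tx in ledger:
    print(tx["peer"], tx["uri"], tx["hash"][:12])
```

Only the hash and URI reach the ledger. The raw data never leaves a peer, and each peer verifies the fetched model against the published hash before training on it.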

The flow of the incremental learning process is described in Fig. 10.

Fig. 10
A chart of Rahasak-ML incremental training flow is depicted. It takes fourteen steps to link peer a, peer b, and peer c from off-chain storages to the model hash ledger that also includes lokka.

Rahasak-ML incremental training flow. Each peer trains the model with the data on the off-chain storage. The state update in each training step will be published to the blockchain ledger

4.3 Finalizing Model

Assume all three companies (or a majority of the companies) have incrementally trained the prediction model and published the model hash and URI to the blockchain ledger as transactions. The Lokka service then takes these transactions and creates a block with the finalized model details, with the final model stored in peer C’s off-chain storage. Currently, the model trained by the last peer (peer C in this scenario) is identified as the finalized model. In future work, we plan to determine the finalized model by evaluating the accuracy of the model trained by each peer. The Lokka service includes the URI of peer C’s off-chain storage (which stores the final model) and the model training transaction details in the block. It then saves the generated block in the ledger and distributes it to the other peers. Once the peers receive the new block, they validate the learning process against the transactions in the block. If the process is valid, the peers fetch the final model stored in peer C’s off-chain storage via the URI given in the block. The sharing process for the incrementally trained model is described in Fig. 11. Once the finalized model is fetched, it can be used in smart contracts for prediction.

Fig. 11
A flow of Rahasak-ML machine learning model is depicted. It takes six steps to link peer a, peer b, and peer c from off-chain storages to the model hash ledger that also includes lokka.

Rahasak-ML finalizes the machine learning model. The final model will be decided by the Lokka service when generating the final block
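A peer-side check of a received block against the learning flow might look like the sketch below. The block layout and the field names `flow`, `transactions`, and `final_uri` are hypothetical. The rule that the transactions must form a prefix of the flow is also an assumption made for illustration, since the text specifies only that the process is validated against the transactions.

```python
def validate_block(block):
    """Check that a block's transactions are consistent with the learning flow."""
    flow, txs = block["flow"], block["transactions"]
    peers = [tx["peer"] for tx in txs]
    # Assumed rule: transactions follow the flow order and cover a majority of peers.
    if peers != flow[:len(peers)] or len(peers) <= len(flow) // 2:
        return False
    # Every training transaction must carry a model hash and an off-chain URI.
    if not all(tx.get("hash") and tx.get("uri") for tx in txs):
        return False
    # The finalized model is the one published by the last peer that trained.
    return block["final_uri"] == txs[-1]["uri"]


block = {
    "flow": ["A", "B", "C"],
    "transactions": [
        {"peer": "A", "hash": "h1", "uri": "A/model-v1"},
        {"peer": "B", "hash": "h2", "uri": "B/model-v2"},
        {"peer": "C", "hash": "h3", "uri": "C/model-v3"},
    ],
    "final_uri": "C/model-v3",
}
print(validate_block(block))
```

Only after such a check succeeds would a peer fetch the final model from the URI in the block.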

For the Lokka service to decide the final model, the majority of the nodes in the network need to complete the incremental learning process. For example, if there are five nodes in the federated learning flow, three of them need to finish the incremental learning flow before the finalized model can be decided. Once the Lokka service has generated the block with the finalized model details, the other Lokka services in the network need to approve that block. When approving, each service first validates the transactions in the block. It then casts a vote that marks the block as valid if all transactions are valid and as invalid otherwise, as shown in Algorithm 3. To cast its vote, the Lokka service digitally signs the block hash and adds the signature to the block header. When the majority of Lokka services have voted for a block, that block is considered valid (approved).

Algorithm 3 Choose final model

An algorithm to choose the final model.
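The majority rule and the signed votes can be sketched as follows. Two simplifications are assumed. An HMAC with a per-service secret key stands in for a real digital signature scheme, and only "valid" votes are recorded in the block header. The function names and block layout are hypothetical.

```python
import hashlib
import hmac


def majority(n):
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def sign(key, block_hash):
    # HMAC-SHA256 as a stand-in for a digital signature (assumption).
    return hmac.new(key, block_hash.encode(), hashlib.sha256).hexdigest()


def cast_vote(block, voter, key, is_valid):
    """Record the voter's signature in the header if it found the block valid."""
    if is_valid:
        block["header"]["signatures"][voter] = sign(key, block["hash"])


def is_approved(block, keys):
    """A block is approved once a majority of known services have validly signed it."""
    valid = sum(
        1
        for voter, sig in block["header"]["signatures"].items()
        if voter in keys and hmac.compare_digest(sig, sign(keys[voter], block["hash"]))
    )
    return valid >= majority(len(keys))


keys = {f"lokka-{i}": f"key-{i}".encode() for i in range(5)}
block = {"hash": hashlib.sha256(b"final-model-block").hexdigest(),
         "header": {"signatures": {}}}

cast_vote(block, "lokka-0", keys["lokka-0"], is_valid=True)
cast_vote(block, "lokka-1", keys["lokka-1"], is_valid=True)
print(majority(5), is_approved(block, keys))  # two of five votes so far
cast_vote(block, "lokka-2", keys["lokka-2"], is_valid=True)
print(is_approved(block, keys))               # three of five votes
```

With five services, the third valid signature tips the block into the approved state, matching the three-of-five example in the text.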

4.4 The Use Case of Blockchain-Empowered Federated Learning in the Medical Field

Blockchain-empowered federated learning provides a secure, transparent, and privacy-preserving computing solution for building accurate and robust predictive models from biomedical data held by multiple parties (e.g., institutions, hospitals, and drug companies). It does not need a centralized server to collect data from the various parties, which are often unable to share data because of HIPAA. As a proof of concept, we built a blockchain-empowered federated learning system for diagnosing acute inflammation of the bladder. We used a bladder inflammation health dataset [58] and chose logistic regression as the prediction model. In this use case, a blockchain network is deployed at five peers (five hospitals). Each peer has its own dataset and trains and validates a local logistic regression model. Finally, these local models are averaged. We computed the loss and accuracy of the models and measured block generation time in the blockchain-enabled federated learning system. This preliminary study can be extended to more scenarios in medicine and drug discovery.

4.4.1 Federated Model Accuracy and Training Loss

In the federated learning scenario, the model was trained for 1000 iterations. In each iteration, a copy of the shared model is sent to all participating peers. Each peer trains its copy locally with its own dataset, so each local model improves in its own direction. The total loss and accuracy were then computed, as shown in Fig. 12. Figure 13 shows how the total training loss varies at different peers in each iteration.

Fig. 12
Four curves are graphed on a model accuracy versus iteration plane. The x axis represents the iterations and the y axis represents the model accuracy. The first curve is for the dataset peer1 starts from the bottom left and ends at the top right of the graph. Similarly, the second, third, and fourth curves are for the datasets peer2, peer3, and peer 4, and it starts from the bottom left and ends at the top right of the graph.

Federated model accuracy in different peers

Fig. 13
Four curves are graphed on a training loss versus iteration plane. The x-axis represents the iterations and the y-axis represents the training loss. The first curve is for the dataset peer1 starts from the top left and ends at the bottom right of the graph. Similarly, the second, third, and fourth curves are for the datasets peer2, peer3, and peer 4, and it starts from the top left and ends at the bottom right of the graph.

Federated model training loss in different peers
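The iteration scheme in this section (send the shared model to every peer, train locally, then average) can be illustrated with a small, self-contained sketch. It uses synthetic one-feature data rather than the dataset in [58], and it assumes one gradient step per peer per iteration, equal peer weights, and an illustrative learning rate. None of these settings is taken from the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def local_step(weights, data, lr=0.5):
    """One gradient-descent step of logistic regression on a single peer's data."""
    grad = [0.0] * len(weights)
    for x, y in data:
        err = predict(weights, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(data)
    return [w - lr * g for w, g in zip(weights, grad)]


def average(models):
    """Element-wise mean of the peers' weight vectors."""
    return [sum(ws) / len(models) for ws in zip(*models)]


def loss(weights, data):
    """Mean binary cross-entropy, with a small epsilon to avoid log(0)."""
    eps = 1e-12
    return -sum(
        y * math.log(predict(weights, x) + eps)
        + (1 - y) * math.log(1 - predict(weights, x) + eps)
        for x, y in data
    ) / len(data)


def accuracy(weights, data):
    return sum((predict(weights, x) >= 0.5) == (y == 1) for x, y in data) / len(data)


def federated_train(peers, dim, iterations):
    shared = [0.0] * dim
    for _ in range(iterations):
        # Each peer starts from a copy of the shared model and trains locally.
        local_models = [local_step(list(shared), data) for data in peers]
        shared = average(local_models)
    return shared


# Synthetic data: each sample is ([bias term, feature], label).
peers = [
    [([1.0, -2.0], 0), ([1.0, 1.5], 1)],
    [([1.0, -1.0], 0), ([1.0, 2.0], 1)],
    [([1.0, -0.5], 0), ([1.0, 0.5], 1)],
]
model = federated_train(peers, dim=2, iterations=200)
for i, data in enumerate(peers, start=1):
    print(f"peer{i}: loss={loss(model, data):.4f} accuracy={accuracy(model, data):.2f}")
```

Evaluating the averaged model on each peer's data separately gives per-peer loss and accuracy values of the kind plotted in Figs. 12 and 13.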

4.4.2 Block Generation Time

Block generation time was measured in the Rahasak-ML federated learning system with different numbers of blockchain peers (up to 7). Figure 14 shows the average block generation time for each network size. Each experiment was repeated 100 times, each time with a different peer set, and the average values were plotted. Every peer in the cluster needs to validate the transactions in the block and recalculate the block header. Accordingly, block generation time increases as peers are added.

Fig. 14
Three bell-shaped curves are graphed on a frequency on the y-axis versus time to generate a block on the x-axis. All the curves initially increase and reach a peak and then decrease.

Average block generation time

5 Future Directions

The proposed platform took full advantage of blockchain and AI technologies to provide a more efficient and secure solution with a promise to accelerate the research in medicine. The following is a summary of future work and several open directions.

The proposed system overcomes several key concerns faced in centralized systems. While individual nodes (peers) develop local models based on their local data, the resulting models and parameters are shared through the blockchain platform. The model parameter sharing, local model generation, model averaging, and model sharing functions are implemented as smart contracts on the platform. Most recently, the Rahasak-ML federated learning system was integrated into Rahasak blockchain version 3.0. The following features of the Rahasak-ML platform are planned for future releases:

  • Decide the finalized model by evaluating the accuracy of each model trained by the peers.

  • Support more supervised/unsupervised machine learning algorithms with Rahasak-ML.

  • Automate the deployment of the Spark cluster in Rahasak-ML with Kubernetes.

  • Integrate TensorFlow-Federated libraries into Rahasak-ML.

5.1 Data Heterogeneity

Medical data are particularly diverse in modality, dimensionality, and characteristics. Even for a specific protocol, there are differences in acquisition, drug brands, and local demographics [59]. Although federated learning can address data bias by drawing on more data sources, inhomogeneous data distribution remains challenging, because many methods assume independently and identically distributed data across peers. Another challenge is that data standards differ among peers. For example, hospitals may adopt EHR systems from different vendors, and different countries use different diagnostic and procedure coding systems: health systems in the United Kingdom use the International Classification of Diseases, 10th Revision (ICD-10), whereas the United States adopted ICD-10-CM. This heterogeneity may lead to a situation where the optimal global solution does not work well for an individual local participant.

5.2 Efficiency and Effectiveness

From a technical point of view, efficiency and effectiveness are the major concerns of federated learning. Federated learning requires peers to share and update models, so the communication cost between peers is an issue. This matters especially when federated learning is integrated with blockchain, where minimizing communication time and improving the efficiency of the training process are important. Studies have focused on frameworks that jointly improve federated learning convergence time and training loss [60], but the tradeoff between accuracy and communication cost should also be considered.

5.3 Model Interpretation

Interpreting machine learning models is important, particularly in healthcare and medicine. The core question of interpretability is whether humans can understand why a model makes a given prediction on unseen instances. Many machine learning models, such as deep learning models, are a “black box” to humans, and thus many studies have explored tools to interpret them [61, 62, 63, 64]. In a federated learning context, where the model is continually updated by multiple parties, interpretation is an additional challenge.

To summarize, federated learning for life sciences will benefit the process of data sharing among multiple organizations without a central authority. The data sharing process can monitor and track data operations efficiently to ensure data integrity and provenance. Still, resolving the data ownership problem remains key to adopting Rahasak-ML in FDA- or EMA-regulated research.

6 Conclusions

Federated learning has emerged as a new technique that uses collaboration and distribution to train machine learning models without sharing local raw data. It promises to benefit the medical field and the drug industry, which require strict data protection. However, most existing federated learning systems rely on centralized coordinators that are vulnerable to attacks and privacy breaches. We proposed a blockchain-empowered, coordinator-less, decentralized federated learning platform named Rahasak-ML, which addresses issues in centralized coordinator-based federated learning systems by providing better transparency and trust. We introduced the architecture and learning process of Rahasak-ML and presented a use case in which Rahasak-ML trains a machine learning model for diagnosis, an approach that could be applied to other biomedical data to facilitate decision-making. Still, data standardization, communication efficiency, and model interpretation need to be resolved.