Warnat-Herresthal et al. [1] argue that a decentralized data model will be the preferred choice for handling, storing, managing, and analysing any kind of large medical dataset.

Hi again! I'm ESR2. For today's blog, I want to share my reading notes on a research article titled "Swarm Learning for decentralized and confidential clinical machine learning" [1], which was published in Nature in May 2021.

After reading this article, I have been thinking about which data model and/or framework is most suitable for medical data sharing. I will give a brief introduction to this paper in three parts: current challenges in precision medicine, the motivations and main contributions, and my feelings and open questions.

Challenges:

There remain many challenges in medical data sharing. Hospitals and individual research institutes collect data locally, so medical data is inherently decentralized, and the volume of local data is usually insufficient to train reliable disease classifiers from an AI perspective.

A centralized data model is usually adopted to address this local limitation; however, it involves tedious handling and duplication of large datasets, and raises regulatory and privacy issues when entities want to share their data. Besides, centralized AI solutions (e.g., central learning, and federated learning, which still relies on a central parameter server) have several disadvantages, including increased data traffic and concerns about data ownership, confidentiality, privacy, security, and the creation of data monopolies that favour data aggregators.

Motivations and main contributions:

The authors are motivated by the goal of integrating medical data from any data owner worldwide without violating privacy laws, and thus introduce Swarm Learning (SL), a decentralized machine-learning approach that unites edge computing with blockchain-based peer-to-peer (p2p) networking and coordination, maintaining confidentiality without the need for a central coordinator.

The Swarm Learning framework proposed in the article mainly consists of the swarm participants, the swarm network, and the Swarm Learning workflow. Each swarm participant can be seen as a decentralized data owner whose data and compute remain on-premise. All participants are connected via the swarm network, in which they have equal rights and share insights through a learning model, while keeping contributions transparent and results immutable. The swarm network is implemented with an Ethereum-based blockchain, which provides p2p networking along with a distributed ledger and smart contracts. In addition, the authors abstract a Swarm Learning workflow that describes the automated execution process, in which swarm node enrolment, local model training, parameter sharing and merging, and stopping-criterion checks all happen within the swarm network; a simplified sketch of one such round follows below.
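To make the workflow more concrete for myself, here is a minimal Python sketch of what a single Swarm Learning round could look like. It is my own simplification, not the actual HPE Swarm Learning API: I assume a toy logistic-regression model, stand in for the merge step with a dataset-size-weighted parameter average, use a fixed number of rounds as the stopping criterion, and omit the blockchain-based enrolment and coordination entirely. The point it illustrates is that only model parameters travel between participants, while the raw data never leaves its owner.

```python
# Minimal sketch of one Swarm Learning round (my own simplification,
# not the HPE Swarm Learning API). Each participant trains on its
# private, on-premise data; only the parameters are shared and merged.
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """On-premise training step: logistic regression via gradient descent."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
        w -= lr * X.T @ (preds - y) / len(y)   # gradient step
    return w

def merge(params, sizes):
    """Merge step: average parameters, weighted by local dataset size."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(params, sizes))

rng = np.random.default_rng(0)
global_w = np.zeros(3)
# Three simulated participants, each holding private local data.
datasets = [(rng.normal(size=(50, 3)), rng.integers(0, 2, size=50))
            for _ in range(3)]
for _ in range(10):                            # stopping criterion: fixed rounds
    local = [local_train(global_w, X, y) for X, y in datasets]
    global_w = merge(local, [len(y) for _, y in datasets])
print("merged model weights:", global_w)
```

In the paper's actual framework, the merge is coordinated through the blockchain (with a leader elected per round) rather than by the fixed central loop in this toy sketch.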

They conducted large-scale scientific analyses of four use cases covering heterogeneous diseases (COVID-19, tuberculosis, leukaemia, and lung pathologies) to illustrate the feasibility of using Swarm Learning to develop disease classifiers over distributed data. Their classifiers outperformed those developed at individual sites and fulfilled local confidentiality regulations by design; however, all simulations were carried out on two HPC servers, i.e., on a relatively simple hardware infrastructure.

Feelings and open questions:

Their work inspires me a lot. Firstly, at first glance, I like the style of the figures in the paper: they are beautiful and neat. Secondly, I was excited by the large-scale data volumes and scientific analyses in the experimental settings. I am also thinking about the similarities and differences between their work and our work in the Clarify project. For instance, can we apply the Swarm Learning framework to develop more disease classifiers (e.g., for TNBC, HR-NMIBC, and SML)? Is it possible for us to build a decentralized data model in the Clarify project with cloud, machine learning, and blockchain technologies? How should we design our own framework for seamless trusted data sharing? What are the challenges with computing and networking infrastructure if we deploy on large-scale real-world edge nodes? TBD.

To sum up, I believe this research article is of high quality and points out promising directions for data sharing, e.g., "a decentralized data model will be the preferred choice for handling, storing, managing, and analysing any kind of large medical dataset." Nevertheless, it will not be easy work. I do not yet have a clear and concrete framework for seamless trusted data sharing, but I am enjoying the investigation and was excited to find this article. The Clarify project offers a great platform with multidisciplinary team collaboration to explore more possibilities. Keep on going!

(Photos) Self-study at the REC H Library Learning Centre, 2021; with my colleagues in Amsterdam, 2021

Yuandou Wang – ESR2

[1] Warnat-Herresthal, S., Schultze, H., Shastry, K. L., Manamohan, S., Mukherjee, S., Garg, V., Sarveswara, R., Händler, K., Pickkers, P., Aziz, N. A., Ktena, S., Tran, F., Bitzer, M., Ossowski, S., Casadei, N., Herr, C., Petersheim, D., Behrends, U., Kern, F., … Schultze, J. L. (2021). Swarm Learning for decentralized and confidential clinical machine learning. Nature, 594, 265–270. https://doi.org/10.1038/s41586-021-03583-3