How to seamlessly manage workflows in a notebook?
A few months ago, we worked on scaling notebooks as re-configurable cloud workflows, an approach we expect to be an effective way to gather preliminary information early when investigating a potential seamless data-sharing platform. I submitted our manuscript to the special issue of the Data Intelligence journal on Canonical Workflow Frameworks for Research (CWFR). Today, I will share the basic ideas and discuss potential application scenarios in the CLARIFY project, especially the data-sharing scenario.
The basic idea of scaling notebooks as re-configurable cloud workflows is shown in the conceptual model in Fig. 1. It consists of eight basic elements: 1) a user, 2) the local Jupyter environment (e.g., Jupyter Notebook) that the user works in, 3) the component containerizer (CC), 4) the experiment manager (EM), 5) the distributed workflow bus (DWB), 6) the remote infrastructure automator (RIA), 7) the catalog of distributed workflow building blocks, and 8) the dedicated remote infrastructure.
According to this design, a user can locally encapsulate arbitrary code fragments in the Jupyter environment as reusable workflow building blocks via the first Jupyter extension module, the component containerizer (CC), and publish these blocks to the community (e.g., DockerHub). The metadata of these blocks is first added to the catalog, and other users can access it with credentials (steps ①②③). The user can also (re)construct his or her workflow logic via the second Jupyter extension module, the experiment manager (EM), to create workflow specifications (steps ④⑤⑥). The workflow specification files (e.g., in CWL format) can then be used for workflow planning and scheduling. Users submit their requests (including the workflow specification, the expected deadline, and the total cost for workflow execution) to the third Jupyter extension module, the distributed workflow bus (DWB), which works closely with the fourth Jupyter extension module, the remote infrastructure automator (RIA), to automatically deploy and execute the re-configurable workflows on dedicated remote infrastructures (steps ⑦⑧⑨⑩). Currently, we use Docker, Kubernetes (a.k.a. k8s), and the Argo workflow engine for the actual workflow deployment and execution. In the future, we will keep improving this model and develop more functional modules for potential applications.
Fig. 1. The overview of the conceptual model.
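To make the flow above a bit more concrete, here is a minimal Python sketch of the idea: building blocks registered with some metadata, composed into a workflow specification, and wrapped in a request that carries a deadline and a cost budget. Everything in it is an assumption for illustration only: the names BuildingBlock, WorkflowRequest, and compose, the metadata fields, and the image names are hypothetical, and the resulting dictionary is neither CWL nor an Argo manifest, nor the actual interface of CC, EM, or DWB.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class BuildingBlock:
    """Catalog metadata for one containerized code fragment (hypothetical schema)."""
    name: str
    image: str  # container image reference; the values used below are placeholders


@dataclass
class WorkflowRequest:
    """What a user would submit: a specification plus deadline and cost budget."""
    spec: dict
    deadline_s: float
    max_cost: float


def compose(blocks, edges):
    """Build a simple workflow specification from blocks and (upstream, downstream) edges."""
    by_name = {b.name: b for b in blocks}
    deps = {name: set() for name in by_name}
    for upstream, downstream in edges:
        if upstream not in by_name or downstream not in by_name:
            raise KeyError(f"unknown block in edge {upstream!r} -> {downstream!r}")
        deps[downstream].add(upstream)
    # static_order() yields each step after its dependencies and raises
    # graphlib.CycleError if the edges form a cycle.
    order = list(TopologicalSorter(deps).static_order())
    return {
        "steps": [
            {"name": n, "image": by_name[n].image, "depends_on": sorted(deps[n])}
            for n in order
        ]
    }


# Usage: three placeholder blocks chained into a linear workflow.
catalog = [
    BuildingBlock("load", "example/load:0.1"),
    BuildingBlock("train", "example/train:0.1"),
    BuildingBlock("report", "example/report:0.1"),
]
spec = compose(catalog, [("load", "train"), ("train", "report")])
request = WorkflowRequest(spec=spec, deadline_s=3600, max_cost=10.0)
print([step["name"] for step in spec["steps"]])  # ['load', 'train', 'report']
```

Ordering the steps by their dependencies is the part a real workflow engine would take over; the sketch only shows that a specification built from reusable blocks is enough information to plan an execution order before anything is deployed.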
Imagine the following scenario. The users are researchers from different hospitals and individual research institutes who use Jupyter Notebook locally to prototype their experiments. Algorithms such as machine learning-based models, as well as datasets, are encapsulated as reusable workflow building blocks and shared as services among trusted entities (via CC). Researchers can collaborate on large-scale experiments by composing distributed workflows from the reusable workflow building blocks and datasets at different scales (via EM). For better workflow performance and scalability, users request dedicated remote infrastructures on which these composed workflows are automatically deployed and executed (via DWB and RIA). Furthermore, we would like to integrate modules for privacy-sensitive workflow management, focusing mainly on privacy-aware workflow scheduling. Decentralized file systems are also relevant here: IPFS, for example, is a protocol and peer-to-peer (p2p) network for storing and sharing data in a distributed file system. Our next step is to work out how to make it available in our conceptual model and use it for data sharing.
The above application scenarios will be further developed, and we will continue with the research topics on seamless data sharing and decentralized workflow management on software-defined infrastructures.
(Amazing! I completed the first half marathon in Amsterdam, the Netherlands, 17 Oct, 2021)
Yuandou Wang – ESR2