Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads
| dc.conference.date | 16-21 Nov 2025 | |
| dc.conference.place | Missouri | |
| dc.conference.title | SC Workshops '25: Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis | |
| dc.contributor.author | Merzky, Andre | |
| dc.contributor.author | Titov, Mikhail | |
| dc.contributor.author | Turilli, Matteo | |
| dc.contributor.author | Jha, Shantenu | |
| dc.contributor.funder | US Department of Energy | |
| dc.contributor.ror | https://ror.org/02jjdwm75 | |
| dc.date.accessioned | 2026-03-10T11:33:08Z | |
| dc.date.issued | 2025-11-15 | |
| dc.description.abstract | Scientific workflows increasingly involve both HPC and machine-learning tasks, combining MPI-based simulations, training, and inference in a single execution. Launchers such as Slurm’s srun constrain concurrency and throughput, making them unsuitable for dynamic and heterogeneous workloads. We present a performance study of RADICAL-Pilot (RP) integrated with Flux and Dragon, two complementary runtime systems that enable hierarchical resource management and high-throughput function execution. Using synthetic and production-scale workloads on Frontier, we characterize the task execution properties of RP across runtime configurations. RP+Flux sustains up to 930 tasks/s, and RP+Flux+Dragon exceeds 1,500 tasks/s with over 99.6% utilization. In contrast, srun peaks at 152 tasks/s and degrades with scale, with utilization below 50%. For the IMPECCABLE.v2 drug discovery campaign, RP+Flux reduces makespan by 30–60% relative to srun/Slurm and increases throughput more than four times on up to 1,024. These results demonstrate hybrid runtime integration in RP as a scalable approach for hybrid AI-HPC workloads. | |
| dc.description.peerreviewed | Yes | |
| dc.description.sponsorship | A. Merzky, M. Titov and M. Turilli equally contributed to this paper. This work is supported in part by the following grants: NSF-2103986 and 1931512, and US DOE DE-AC02-06CH11357 (LUCID). We thank Agastya Bhati and Peter Coveney for insights and discussions on IMPECCABLE workloads. | |
| dc.description.status | Published | |
| dc.format | application/pdf | |
| dc.identifier.citation | Merzky, A., Titov, M., Turilli, M., & Jha, S. (2025, November). Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads. In Proceedings of the SC'25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 2245-2256). https://doi.org/10.1145/3731599.3767587 | |
| dc.identifier.doi | https://doi.org/10.1145/3731599.3767587 | |
| dc.identifier.isbn | 979-8-4007-1871-7 | |
| dc.identifier.officialurl | https://dl.acm.org/doi/10.1145/3731599.3767587#core-tabbed-abstracts | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14417/4264 | |
| dc.language.iso | eng | |
| dc.page.final | 2256 | |
| dc.page.initial | 2245 | |
| dc.page.total | 11 | |
| dc.relation.department | Sci Tech (Data Science) | |
| dc.relation.entity | IE University | |
| dc.relation.projectid | NSF-2103986 | |
| dc.relation.projectid | 1931512 | |
| dc.relation.projectid | DE-AC02-06CH11357 | |
| dc.relation.school | IE School of Science & Technology | |
| dc.rights | Attribution 4.0 International | |
| dc.rights.accessRights | info:eu-repo/semantics/openAccess | |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
| dc.subject.keywords | task runtime systems | |
| dc.subject.keywords | HPC-AI | |
| dc.subject.keywords | workflows | |
| dc.subject.ods | SDG 9 - Industry, innovation and infrastructure | |
| dc.subject.unesco | 33 Technological Sciences | |
| dc.title | Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads | |
| dc.type | info:eu-repo/semantics/conferenceObject | |
| dc.version.type | info:eu-repo/semantics/publishedVersion | |
| dspace.entity.type | Publication | |
| relation.isAuthorOfPublication | 4c105a3d-3b6f-4801-b0a6-7593ef9017d2 | |
| relation.isAuthorOfPublication.latestForDiscovery | 4c105a3d-3b6f-4801-b0a6-7593ef9017d2 |
