Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads

dc.conference.date16-21 Nov 2025
dc.conference.placeMissouri
dc.conference.titleSC Workshops '25: Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
dc.contributor.authorMerzky, Andre
dc.contributor.authorTitov, Mikhail
dc.contributor.authorTurilli, Matteo
dc.contributor.authorJha, Shantenu
dc.contributor.funderUS Department of Energy
dc.contributor.rorhttps://ror.org/02jjdwm75
dc.date.accessioned2026-03-10T11:33:08Z
dc.date.issued2025-11-15
dc.description.abstractScientific workflows increasingly involve both HPC and machine-learning tasks, combining MPI-based simulations, training, and inference in a single execution. Launchers such as Slurm’s srun constrain concurrency and throughput, making them unsuitable for dynamic and heterogeneous workloads. We present a performance study of RADICAL-Pilot (RP) integrated with Flux and Dragon, two complementary runtime systems that enable hierarchical resource management and high-throughput function execution. Using synthetic and production-scale workloads on Frontier, we characterize the task execution properties of RP across runtime configurations. RP+Flux sustains up to 930 tasks/s, and RP+Flux+Dragon exceeds 1,500 tasks/s with over 99.6% utilization. In contrast, srun peaks at 152 tasks/s and degrades with scale, with utilization below 50%. For the IMPECCABLE.v2 drug discovery campaign, RP+Flux reduces makespan by 30–60% relative to srun/Slurm and increases throughput more than four times on up to 1,024 nodes. These results demonstrate hybrid runtime integration in RP as a scalable approach for hybrid AI-HPC workloads.
dc.description.peerreviewedYes
dc.description.sponsorshipA. Merzky, M. Titov and M. Turilli equally contributed to this paper. This work is supported in part by the following grants: NSF-2103986 and 1931512, and US DOE DE-AC02-06CH11357 (LUCID). We thank Agastya Bhati and Peter Coveney for insights and discussions on IMPECCABLE workloads.
dc.description.statusPublished
dc.formatapplication/pdf
dc.identifier.citationMerzky, A., Titov, M., Turilli, M., & Jha, S. (2025, November). Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads. In Proceedings of the SC'25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 2245-2256). https://doi.org/10.1145/3731599.3767587
dc.identifier.doihttps://doi.org/10.1145/3731599.3767587
dc.identifier.isbn979-8-4007-1871-7
dc.identifier.officialurlhttps://dl.acm.org/doi/10.1145/3731599.3767587#core-tabbed-abstracts
dc.identifier.urihttps://hdl.handle.net/20.500.14417/4264
dc.language.isoeng
dc.page.final2256
dc.page.initial2245
dc.page.total11
dc.relation.departmentSci Tech (Data Science)
dc.relation.entityIE University
dc.relation.projectidNSF-2103986
dc.relation.projectid1931512
dc.relation.projectidDE-AC02-06CH11357
dc.relation.schoolIE School of Science & Technology
dc.rightsAttribution 4.0 International
dc.rights.accessRightsinfo:eu-repo/semantics/openAccess
dc.rights.urihttp://creativecommons.org/licenses/by/4.0/
dc.subject.keywordstask runtime systems
dc.subject.keywordsHPC-AI
dc.subject.keywordsworkflows
dc.subject.odsSDG 9 - Industry, Innovation and Infrastructure
dc.subject.unesco33 Technological Sciences
dc.titleIntegrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads
dc.typeinfo:eu-repo/semantics/conferenceObject
dc.version.typeinfo:eu-repo/semantics/publishedVersion
dspace.entity.typePublication
relation.isAuthorOfPublication4c105a3d-3b6f-4801-b0a6-7593ef9017d2
relation.isAuthorOfPublication.latestForDiscovery4c105a3d-3b6f-4801-b0a6-7593ef9017d2

Original bundle

Name:
3731599.3767587.pdf
Size:
1.13 MB
Format:
Adobe Portable Document Format
