49 | [E3SM] MPICH bug related to collectives tuning | https://github.com/pmodels/mpich/issues/7456 | https://github.com/pmodels/mpich/issues/7456 | Abhi | 🚨 | | No response | 2025-06-27 | 2025-06-27 |
48 | Zombie Processes | GSD-11266 | none yet | Servesh M | 🚨 | | No response | 2025-06-25 | 2025-06-25 |
47 | Non-standard MPI knobs suggested for performance | ANL-291 | N/A | Servesh M | | | No response | 2025-06-23 | 2025-06-27 |
45 | DDT issues since Aurora upgrade | No response | /lus/flare/projects/catalyst/world_shared/zippy/ddt | Tim Williams | | | No response | 2025-06-12 | 2025-06-26 |
44 | QMCPACK segfault in libomp | No response | Not yet created | Ye Luo | 🚨 | | No response | 2025-06-12 | 2025-06-25 |
43 | CMake can't find MKL::MKL_SYCL with MPI wrapper compilers | No response | https://github.com/thilinarmtb/onemkl_cmake_mpi_bug | Thilina Ratnayaka, Colleen Bertoni | | | improvements will be part of the next oneMKL release, 2025.3. | 2025-06-11 | 2025-06-25 |
41 | torch.compile segfaults for >2 tiles | MLSL-3728 | /flare/Aurora_deployment/vsastry/torch_compile | Varuni Sastry | | | No response | 2025-06-06 | 2025-06-12 |
39 | Feature request for Aurora runtime to include debugging symbols | ANL-286, HPCS-15374 | feature request | Ye Luo | | | No response | 2025-05-29 | 2025-06-26 |
38 | One application in GRID consistently hangs | No response | /lus/flare/projects/Aurora_deployment/xyjin/W/test_grid_g5r5_paboyle | Xiao-Yong Jin | 🚨 | | No response | 2025-05-27 | 2025-05-29 |
37 | xpu-smi reports "N/A" for GPU Utilization | RITM0428460, ANL-279, GSD-11252 | any run of xpu-smi | Kyle Felker / Colleen Bertoni | | | Implemented, PR under review, ULTs to follow, drop after agama-1146 | 2025-05-22 | 2025-06-25 |
36 | (Occasional Interruptible) hangs in applications | Possibly related to ANL-215 | /lus/flare/projects/Aurora_deployment/xyjin/W/test_example_detar.skel | Xiao-Yong Jin | 🚨 | | No response | 2025-05-15 | 2025-06-11 |
35 | Avoid outputs exceeding a few KB to stdout/stderr from MPI ranks | RITM0425437 First issue | Large MPI writes to stdout | Servesh Muralidharan | | | No response | 2025-05-15 | 2025-06-10 |
34 | Runtime Error: pytorch DDP with CCL_BCAST=<"double_tree, direct, naive, maybe others?"> | MLSL-3729 | In issue | Nathan Nichols | | | No response | 2025-05-15 | 2025-06-10 |
33 | Crash when calling too many MPI_Probe | https://github.com/pmodels/mpich/issues/7427 | https://github.com/pmodels/mpich/issues/7427 | David--Cléris Timothée | | | No response | 2025-05-15 | 2025-05-15 |
32 | PETSc segfaults in sparse matrix calls | IGDB-6516, GSD-10450 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/mkl/csr_gemv_usm/ | Junchao Zhang | 🚨 | | 2025.3 for part malloc_shared in MKL | 2025-05-15 | 2025-06-25 |
31 | GAMESS segfaults with -O0 | GSD-10393, CMPLRLIBS-35345, GSD-11035 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/openmp/gamess_O0_page_fault | Colleen Bertoni | 🚨 | | No response | 2025-05-14 | 2025-06-25 |
30 | 2D/3D copies are broken (zeCommandListAppendMemoryCopyRegion) | NEO-14954 | https://github.com/rpereira-dev/ze-zoo | Romain PEREIRA and Thomas APPLENCOURT | | | No response | 2025-05-10 | 2025-05-27 |
29 | Significant slowdown with LAMMPS in first run, subsequent runs much faster | No response | /flare/catalyst/proj_shared/knight/projects/ExtremeCarbon/snap-carbon-scaling/1B/ | Christopher Knight | | | No response | 2025-05-09 | 2025-06-27 |
27 | Build failures on PVC with Cutlass | GSD-11099, https://github.com/codeplaysoftware/cutlass-sycl/issues/329 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/cutlass-sycl | Abhi | 🚨 | | agama 1133 and agama 1146 (ww24, ~ second week of June) | 2025-05-07 | 2025-06-27 |
26 | L0 memcpy bug | GSD-11142, NEO-14641 | I was doing the same run as QMCPACK SOW runs in the reframe | Ye Luo | 🚨 | | agama 1146 | 2025-05-06 | 2025-06-25 |
25 | Compile fail in Lattice App | Brian reproduced and confirms fixed in 2025.1 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/bug_cgpt_icpx | Xiao-Yong Jin | 🚨 | | Brian confirms fixed in 2025.1 | 2025-05-01 | 2025-05-02 |
22 | SYCL In-order queue broken | NEO-14641 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/in-order | Thomas Applencourt | 🚨 | | fix in ww24 (~second week of June), agama 1146 | 2025-04-23 | 2025-06-25 |
20 | Issue with gpu-bind for mpiexec under ZE_FLAT_DEVICE_HIERARCHY=FLAT mode | ANL-283 | See below | Abhishek, Nathan, Khalid | | | No response | 2025-04-16 | 2025-05-30 |
19 | Severe CPU memory growth in MPICH | No response | /flare/catalyst/world_shared/zippy/reproducers/issue19 | Tim Williams | | | No response | 2025-04-04 | 2025-04-24 |
18 | Ping failures and hangs with production runs using GPT/GRID | ANL-251, RITM0404147, RITM0404148, RITM0405730 | /lus/flare/projects/LatticeFlavor/lehner | Xiao-Yong Jin | | | No response | 2025-04-04 | 2025-04-18 |
17 | Hang with MPI pipelining | https://github.com/pmodels/mpich/issues/7373 | Build and run commands are in the MPICH issue. | James Osborn | | | No response | 2025-04-03 | 2025-04-08 |
16 | Catastrophic memory error in context lmp_aurora_kokkos | No response | public LAMMPS | Chris Knight | | | N/A | 2025-04-03 | 2025-04-03 |
13 | XGC hangs at scale | No response | xgc-es-cpp-gpu app, ES_ITER test case | Tim Williams | 🚨 | | No response | 2025-04-03 | 2025-04-03 |
12 | CXI alloc failed on cxi1: request exceeds ACs limits | No response | None | Not Thomas | | | No response | 2025-04-01 | 2025-04-07 |