Open Issues

| Internal ID | Description | Vendor ID | Reproducer Path | PoC | Priority? | Pre-production? | ETA | Date Opened | Last Updated |
|---|---|---|---|---|---|---|---|---|---|
| 49 | [E3SM] MPICH bug related to collectives tuning | https://github.com/pmodels/mpich/issues/7456 | https://github.com/pmodels/mpich/issues/7456 | Abhi | 🚨 | | No response | 2025-06-27 | 2025-06-27 |
| 48 | Zombie Processes | GSD-11266 | none yet | Servesh M | 🚨 | | No response | 2025-06-25 | 2025-06-25 |
| 47 | Non-standard MPI knobs suggested for performance | ANL-291 | N/A | Servesh M | | | No response | 2025-06-23 | 2025-06-27 |
| 45 | DDT issues since Aurora upgrade | No response | /lus/flare/projects/catalyst/world_shared/zippy/ddt | Tim Williams | | | No response | 2025-06-12 | 2025-06-26 |
| 44 | QMCPACK segfault in libomp | No response | Not yet created | Ye Luo | 🚨 | | No response | 2025-06-12 | 2025-06-25 |
| 43 | CMake can't find MKL::MKL_SYCL with MPI wrapper compilers | No response | https://github.com/thilinarmtb/onemkl_cmake_mpi_bug | Thilina Ratnayaka, Colleen Bertoni | | | Improvements will be part of the next oneMKL release, 2025.3. | 2025-06-11 | 2025-06-25 |
| 41 | torch.compile segfaults for >2 tiles | MLSL-3728 | /flare/Aurora_deployment/vsastry/torch_compile | Varuni Sastry | | | No response | 2025-06-06 | 2025-06-12 |
| 39 | Feature request for Aurora runtime to include debugging symbols | ANL-286, HPCS-15374 | feature request | Ye Luo | | | No response | 2025-05-29 | 2025-06-26 |
| 38 | One application in GRID consistently hangs | No response | /lus/flare/projects/Aurora_deployment/xyjin/W/test_grid_g5r5_paboyle | Xiao-Yong Jin | 🚨 | | No response | 2025-05-27 | 2025-05-29 |
| 37 | xpu-smi reports "N/A" for GPU Utilization | RITM0428460, ANL-279, GSD-11252 | any run of xpu-smi | Kyle Felker / Colleen Bertoni | | | Implemented, PR under review, ULTs to follow, drop after agama-1146 | 2025-05-22 | 2025-06-25 |
| 36 | (Occasional Interruptible) hangs in applications | Possibly related to ANL-215 | /lus/flare/projects/Aurora_deployment/xyjin/W/test_example_detar.skel | Xiao-Yong Jin | 🚨 | | No response | 2025-05-15 | 2025-06-11 |
| 35 | Avoid outputs exceeding a few KBs to stdout/stderr from MPI ranks | RITM0425437 | First issue Large MPI writes to stdout | Servesh Muralidharan | | | No response | 2025-05-15 | 2025-06-10 |
| 34 | Runtime Error: pytorch DDP with CCL_BCAST=<"double_tree, direct, naive, maybe others?"> | MLSL-3729 | In issue | Nathan Nichols | | | No response | 2025-05-15 | 2025-06-10 |
| 33 | Crash when calling too many MPI_Probe | https://github.com/pmodels/mpich/issues/7427 | https://github.com/pmodels/mpich/issues/7427 | David--Cléris Timothée | | | No response | 2025-05-15 | 2025-05-15 |
| 32 | PETSc segfaults in sparse matrix calls | IGDB-6516, GSD-10450 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/mkl/csr_gemv_usm/ | Junchao Zhang | 🚨 | | 2025.3 for part malloc_shared in MKL | 2025-05-15 | 2025-06-25 |
| 31 | GAMESS segfaults with -O0 | GSD-10393, CMPLRLIBS-35345, GSD-11035 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/openmp/gamess_O0_page_fault | Colleen Bertoni | 🚨 | | No response | 2025-05-14 | 2025-06-25 |
| 30 | Copy 2D/3D are broken (zeCommandListAppendMemoryCopyRegion) | NEO-14954 | https://github.com/rpereira-dev/ze-zoo | Romain PEREIRA and Thomas APPLENCOURT | | | No response | 2025-05-10 | 2025-05-27 |
| 29 | Significant slowdown with LAMMPS in first run, subsequent runs much faster | No response | /flare/catalyst/proj_shared/knight/projects/ExtremeCarbon/snap-carbon-scaling/1B/ | Christopher Knight | | | No response | 2025-05-09 | 2025-06-27 |
| 27 | Build failures on PVC with Cutlass | GSD-11099, https://github.com/codeplaysoftware/cutlass-sycl/issues/329 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/cutlass-sycl | Abhi | 🚨 | | agama 1133 and agama 1146 (ww24, ~second week of June) | 2025-05-07 | 2025-06-27 |
| 26 | L0 memcpy bug | GSD-11142, NEO-14641 | I was doing the same run as QMCPACK SOW runs in the reframe | Ye Luo | 🚨 | | agama 1146 | 2025-05-06 | 2025-06-25 |
| 25 | Compile fail in Lattice App | Brian reproduced and confirms fixed in 2025.1 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/bug_cgpt_icpx | Xiao-Yong Jin | 🚨 | | Brian confirms fixed in 2025.1 | 2025-05-01 | 2025-05-02 |
| 22 | SYCL In-order queue broken | NEO-14641 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/in-order | Thomas Applencourt | 🚨 | | fix in ww24 (~second week of June) agama 1146 | 2025-04-23 | 2025-06-25 |
| 20 | Issue with gpu-bind for mpiexec under ZE_FLAT_DEVICE_HIERARCHY=FLAT mode | ANL-283 | See below | Abhishek, Nathan, Khalid | | | No response | 2025-04-16 | 2025-05-30 |
| 19 | Severe CPU memory growth in MPICH | No response | /flare/catalyst/world_shared/zippy/reproducers/issue19 | Tim Williams | | | No response | 2025-04-04 | 2025-04-24 |
| 18 | Ping failures and hangs with production runs using GPT/GRID | ANL-251, RITM0404147, RITM0404148, RITM0405730 | /lus/flare/projects/LatticeFlavor/lehner | Xiao-Yong Jin | | | No response | 2025-04-04 | 2025-04-18 |
| 17 | Hang with MPI pipelining | https://github.com/pmodels/mpich/issues/7373 | Build and run commands are in the MPICH issue. | James Osborn | | | No response | 2025-04-03 | 2025-04-08 |
| 16 | Catastrophic memory error in context lmp_aurora_kokkos | No response | public LAMMPS | Chris Knight | | | N/A | 2025-04-03 | 2025-04-03 |
| 13 | XGC hangs at scale | No response | xgc-es-cpp-gpu app, ES_ITER test case | Tim Williams | 🚨 | | No response | 2025-04-03 | 2025-04-03 |
| 12 | CXI alloc failed on cxi1: request exceeds ACs limits | No response | None | Not Thomas | | | No response | 2025-04-01 | 2025-04-07 |

Closed Issues

| Internal ID | Description | Vendor ID | Reproducer Path | PoC | Priority? | Date Opened | Closed Date |
|---|---|---|---|---|---|---|---|
| 42 | Linking fails with old build environment | No response | /lus/flare/projects/PHASTA_aesp_CNDA/jrwrigh/petsc_build_test | Kris Rowe | | 2025-06-06 | 2025-06-10 |
| 40 | Need SYSMAN support for all modes in recent releases | HPCS-15366, related: GSD-11104 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/l0/leak_zesMemoryGetState | Thomas Applencourt | 🚨 | 2025-05-30 | 2025-06-17 |
| 28 | CMake failures with SYCL | No response | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/ | Abhishek Bagusetty | | 2025-05-09 | 2025-05-09 |
| 24 | Noticeably more "ping failed" than before the 2025.1 SDK + 1099.12 UMD/KMD upgrade | JIRA is: HPCS-15331 | N/A | Xiao-Yong Jin Colleen Bertoni | | 2025-05-01 | 2025-05-16 |
| 23 | Apps stop running after Apr 29 upgrade due to libstdc++ dependency | No response | See details | Ye Luo | | 2025-04-30 | 2025-05-06 |
| 21 | Error during write with Quantum ESPRESSO | No response | see .zip file attached below, also /lus/flare/projects/matml_aesp_CNDA/dir_io_QE_crash | Filippo Simini | 🚨 | 2025-04-17 | 2025-04-18 |
| 9 | Multithreaded data-transfer can cause page-fault | N/A | Full QMCPACK | Ye Luo | | 2025-04-01 | 2025-05-08 |
| 8 | Lots of H2D copies produce CPU I9 error and incorrect value | N/A | Full QMCPACK | Ye Luo | 🚨 | 2025-04-01 | 2025-05-28 |
| 7 | MPI_Bcast gets faster when turning off XPMEM | pmodels/mpich#7334 | see Issue on MPICH GitHub repo | Ye Luo | | 2025-04-01 | 2025-04-24 |
| 6 | MPICH memory allocation slows down at scale | pmodels/mpich#7333 | see MPICH issue | Ye Luo | 🚨 | 2025-04-01 | 2025-04-24 |
| 4 | Incorrect results in receive buffer in GPU memory | MPICH 7312 | grid application (lattice QCD) | Patrick Steinbrecher, Tim Williams | 🚨 | 2025-03-25 | 2025-04-24 |
| 3 | Linker error found by XGC | CMPLRLLVM-66496 | /home/zippy/smalltests/aurora/xgc42/fails | Tim Williams | | 2025-03-19 | 2025-03-28 |