Open Issues
Internal ID | Description | Vendor ID | Reproducer Path | PoC | Priority? | Pre-production? | ETA | Date Opened | Last Updated |
---|---|---|---|---|---|---|---|---|---|
70 | PALS gpu-bind, composite, envall lead to "launch failed" | No response | applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/mpi/envall | Thomas Applencourt | | | No response | 2025-09-10 | 2025-09-10 |
68 | warpx segfaults/hangs with OpenPMD enabled | No response | /lus/flare/projects/catalyst/world_shared/zippy/reproducers/issue52/ | Tim Williams | | | No response | 2025-08-23 | 2025-08-23 |
67 | warpx Debug build crashes oneAPI compiler | CMPLRLLVM-24314 | /lus/flare/projects/catalyst/world_shared/zippy/reproducers/issue52/ | Tim Williams | | | No response | 2025-08-21 | 2025-09-03 |
66 | Compiling with "-g" leads to a much larger binary than without | CMPLRLLVM-69909, CMPLRLLVM-24314 | lammps + -g | Brian Holland | | | No response | 2025-08-20 | 2025-08-27 |
65 | Clarification requested about ZE_DEVICE_PROPERTY_FLAG_ONDEMANDPAGING on PVC | GSD-11510 | source/reproducers/l0/ondemand_paging/ | Colleen | | | No response | 2025-08-20 | 2025-08-21 |
64 | E3SM fortran compile ICE | CMPLRLLVM-69862 | source/reproducers/ifx/e3sm_homme_ICE_error | Abhi | | | 2025.3.0 | 2025-08-18 | 2025-09-12 |
63 | Kokkos kernels fails to build with kokkos built with openmp enabled | CMPLRLLVM-69908 | source/applications/kokkos-kernels | Sean Koyama / Colleen Bertoni | | | gone starting with 4.19 (fixed in 2025.3 branch, any chance of getting it sooner?) | 2025-08-18 | 2025-08-20 |
62 | -ftarget-register-alloc-mode=pvc:large and "-device 12.60.7" for AOT | GSD-11490 | source/reproducers/general/ftarget-register-alloc-mode_flag | Steve Rangel | | | No response | 2025-08-14 | 2025-08-14 |
61 | Failing unit tests on PVCs with 2025.2 oneAPI SDK -- is it expected? | https://github.com/uxlfoundation/oneMath/issues/703 | https://github.com/uxlfoundation/oneMath/issues/703 | Colleen Bertoni | | | No response | 2025-07-30 | 2025-07-30 |
60 | ext_oneapi_memcpy2d is significantly slower with implicit scaling than explicit | CMPLRLLVM-69398, GSD-11459 | source/reproducers/dpcpp/ext_oneapi_memcpy2d_perf | Natalie Beams | | | No response | 2025-07-29 | 2025-08-06 |
58 | kokkos inclusive and exclusive scan giving incorrect answers for 1146.10 | CMPLRLLVM-69285 | source/reproducers/dpcpp/kokkos_optimization_scan | Daniel Arndt | 🚨 | | No response | 2025-07-23 | 2025-07-30 |
57 | GPU segfault in gtensor_bench with 2025.2 | MKLD-18276, CMPLRLIBS-35326, CMPLRLLVM-68696 | source/applications/gtensor_bench | Colleen Bertoni | | | 2025.3 | 2025-07-22 | 2025-08-11 |
56 | RSBench-SYCL incorrect answers with 1146.10 | GSD-11247 | source/applications/RSBench/ | John Tramm, Colleen Bertoni | | | LTS2 branch as part of IGC 2.15 series update (WW34, 3-4 weeks). Need to change priority to get it into LTS2 (since things must be backported into it). Things may be fixed in mainline (rolling release) but not in LTS2. Bug fixes everywhere, no new features for PVC in mainline. | 2025-07-22 | 2025-08-20 |
55 | Linking in LZ causes changes in signal handling | CMPLRLIBS-35385, GSD-11413 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/l0/signal_handler/ | Thomas Applencourt, Colleen Bertoni | | | No response | 2025-07-22 | 2025-07-25 |
54 | oneCCL zeMemGetAddressRange error with alltoallv and zero-sized buffers | oneCCL GitHub Issue: https://github.com/uxlfoundation/oneCCL/issues/174, MLSL-3764 | See instructions on oneCCL GitHub Issue: https://github.com/uxlfoundation/oneCCL/issues/174 | Riccardo Balin | 🚨 | | oneCCL 2021.17, oneAPI 2025.3 | 2025-07-18 | 2025-08-26 |
52 | compiler segfaults linking warpx binary | GSD-11357 | /lus/flare/projects/catalyst/world_shared/zippy/reproducers/issue52/warpx | Tim Williams | 🚨 | | 2025.2 + 1146.10 | 2025-07-07 | 2025-07-18 |
51 | [SYCL] Bug from SYCL peer_access | No response | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/sycl_peer_access | Abhi | | | Fixed internally with 2025.2 | 2025-07-02 | 2025-07-08 |
49 | [E3SM] MPICH bug related to collectives tuning | https://github.com/pmodels/mpich/issues/7456 | https://github.com/pmodels/mpich/issues/7456 | Abhi | 🚨 | | Next Programming Environment (25.190) | 2025-06-27 | 2025-08-21 |
48 | Zombie Processes | GSD-11266 | none yet | Servesh M | 🚨 | | Should be within the LTS2 release (1146.12) | 2025-06-25 | 2025-08-06 |
47 | Non standard MPI knobs suggested for performance | ANL-291 | N/A | Servesh M | | | No response | 2025-06-23 | 2025-06-27 |
45 | DDT issues since Aurora upgrade | No response | /lus/flare/projects/catalyst/world_shared/zippy/ddt | Tim Williams | | | Linaro Forge 2025.0.1 has the workaround. With GDB 2025.3 the root cause will be gone. | 2025-06-12 | 2025-08-05 |
43 | CMake can't find MKL::MKL_SYCL with MPI wrapper compilers | No response | https://github.com/thilinarmtb/onemkl_cmake_mpi_bug | Thilina Ratnayaka, Colleen Bertoni | | | Improvements will be part of the next oneMKL release, 2025.3. | 2025-06-11 | 2025-06-25 |
39 | Feature request for Aurora runtime to include debugging symbols | ANL-286, HPCS-15374, GSD-11460 | feature request | Ye Luo | | | No response | 2025-05-29 | 2025-08-06 |
38 | One application in GRID consistently hangs | GSD-11441 | /lus/flare/projects/Aurora_deployment/xyjin/W/test_grid_g5r5_paboyle | Xiao-Yong Jin | 🚨 | | No response | 2025-05-27 | 2025-08-18 |
37 | xpu-smi reports "N/A" for GPU Utilization | RITM0428460, ANL-279, GSD-11252 | any run of xpu-smi | Kyle Felker / Colleen Bertoni | | | 1146.24 (Implemented, LTS2, WW34) | 2025-05-22 | 2025-08-20 |
36 | (Occasional Interruptible) hangs in applications | Possibly related to ANL-215 | /lus/flare/projects/Aurora_deployment/xyjin/W/test_example_detar.skel | Xiao-Yong Jin | 🚨 | | No response | 2025-05-15 | 2025-07-09 |
34 | Runtime Error: pytorch DDP with CCL_BCAST=<"double_tree, direct, naive, maybe others?"> | MLSL-3729 | In issue | Nathan Nichols | | | 2021.15.1 | 2025-05-15 | 2025-07-31 |
33 | Crash when calling too many MPI_Probe | https://github.com/pmodels/mpich/issues/7427 | https://github.com/pmodels/mpich/issues/7427 | David--Cléris Timothée | | | No response | 2025-05-15 | 2025-05-15 |
32 | PETSc segfaults in sparse matrix calls | IGDB-6516, GSD-10450 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/mkl/csr_gemv_usm/ | Junchao Zhang | 🚨 | | 2025.3 for part malloc_shared in MKL | 2025-05-15 | 2025-06-25 |
31 | GAMESS segfaults with -O0 | GSD-10393, CMPLRLIBS-35345, GSD-11035 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/openmp/gamess_O0_page_fault | Colleen Bertoni | 🚨 | | 1146.24 (Targeted for LTS2 (1146.12+), contained with the IGC 2.16 series / WW34 (2-3 weeks)) | 2025-05-14 | 2025-08-20 |
30 | Copy 2D/3D are broken (zeCommandListAppendMemoryCopyRegion) | NEO-14954, GSD-11132 | https://github.com/rpereira-dev/ze-zoo | Romain PEREIRA and Thomas APPLENCOURT | | | No response | 2025-05-10 | 2025-08-05 |
29 | Significant slowdown with LAMMPS in first run, subsequent runs much faster | No response | /flare/catalyst/proj_shared/knight/projects/ExtremeCarbon/snap-carbon-scaling/1B/ | Christopher Knight | | | No response | 2025-05-09 | 2025-08-20 |
27 | Build failures on PVC with Cutlass | GSD-11099, https://github.com/codeplaysoftware/cutlass-sycl/issues/329 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/cutlass-sycl | Abhi | 🚨 | | agama 1133 and agama 1146 (ww24, ~ second week of June) | 2025-05-07 | 2025-08-13 |
26 | L0 memcpy bug | GSD-11142, NEO-14641 | I was doing the same run as QMCPACK SOW runs in the reframe | Ye Luo | 🚨 | | agama 1146 | 2025-05-06 | 2025-06-25 |
25 | Compile fail in Lattice App | Brian reproduced and confirms fixed in 2025.1 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/bug_cgpt_icpx | Xiao-Yong Jin | 🚨 | | Brian confirms fixed in 2025.1 | 2025-05-01 | 2025-05-02 |
22 | SYCL In-order queue broken | NEO-14641 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/in-order | Thomas Applencourt | 🚨 | | fix in ww24 (~second week of June) agama 1146 | 2025-04-23 | 2025-07-29 |
20 | Issue with gpu-bind for mpiexec under ZE_FLAT_DEVICE_HIERARCHY=FLAT mode | ANL-283/HPE Support Case 5390607860 | See below | Abhishek, Nathan, Khalid | | | Likely March 2026 (Servesh requested HPE's gpu-bind match gpu_tile_compact.sh at least) | 2025-04-16 | 2025-08-20 |
18 | Ping failures and hangs with production runs using GPT/GRID | ANL-251, RITM0404147, RITM0404148, RITM0405730, GSD-11441 | /lus/flare/projects/LatticeFlavor/lehner | Xiao-Yong Jin | 🚨 | | No response | 2025-04-04 | 2025-08-18 |
17 | hang with MPI pipelining | https://github.com/pmodels/mpich/issues/7373 | Build and run commands are in the MPICH issue. | James Osborn | | | Should be fixed in top of aurora_test | 2025-04-03 | 2025-08-20 |
13 | XGC hangs at scale | No response | xgc-es-cpp-gpu app, ES_ITER test case | Tim Williams | 🚨 | | No response | 2025-04-03 | 2025-04-03 |
12 | CXI alloc failed on cxi1: request exceeds ACs limits | No response | None | Not Thomas | | | No response | 2025-04-01 | 2025-08-04 |
Closed Issues
Internal ID | Description | Vendor ID | Reproducer Path | PoC | Priority? | Date Opened | Closed Date |
---|---|---|---|---|---|---|---|
59 | [ISHMEM] Unit test fails with ishmem 1.4.0 | https://github.com/oneapi-src/ishmem/issues/10 | https://github.com/oneapi-src/ishmem/issues/10 and source/applications/ishmem_sos | Abhi | | 2025-07-25 | 2025-07-31 |
53 | IFX Compiler reads and stores floating point values from a text file at single-precision | No response | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/ifx/fp_precision | Victor Anisimov | 🚨 | 2025-07-09 | 2025-07-10 |
50 | OpenMP Thread binding | No response | See below | Romain PEREIRA | | 2025-07-02 | 2025-07-02 |
44 | QMCPACK segfault in libomp | No response | Not yet created | Ye Luo | 🚨 | 2025-06-12 | 2025-07-23 |
42 | Linking fails with old build environment | No response | /lus/flare/projects/PHASTA_aesp_CNDA/jrwrigh/petsc_build_test | Kris Rowe | | 2025-06-06 | 2025-06-10 |
41 | torch.compile segfaults for >2 tiles | MLSL-3728 | /flare/Aurora_deployment/vsastry/torch_compile | Varuni Sastry | | 2025-06-06 | 2025-07-24 |
40 | Need SYSMAN support for all modes in recent releases | HPCS-15366, related: GSD-11104 | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/l0/leak_zesMemoryGetState | Thomas Applencourt | 🚨 | 2025-05-30 | 2025-06-17 |
35 | Avoid outputs exceeding a few KBs to stdout/stderr from MPI ranks | RITM0425437 First issue | Large MPI writes to stdout | Servesh Muralidharan | | 2025-05-15 | 2025-07-23 |
28 | CMake failures with SYCL | No response | /lus/flare/projects/Aurora_deployment/applications.hpc.argonne-national-lab.aurora.anl-testing/source/reproducers/dpcpp/ | Abhishek Bagusetty | | 2025-05-09 | 2025-05-09 |
24 | Noticeably more "ping failed" than before the 2025.1 SDK + 1099.12 UMD/KMD upgrade | JIRA is: HPCS-15331 | N/A | Xiao-Yong Jin, Colleen Bertoni | | 2025-05-01 | 2025-05-16 |
23 | Apps stop running after Apr 29 upgrade due to libstdc++ dependency | No response | See details | Ye Luo | | 2025-04-30 | 2025-05-06 |
21 | Error during write with Quantum ESPRESSO | No response | see .zip file attached below, also /lus/flare/projects/matml_aesp_CNDA/dir_io_QE_crash | Filippo Simini | 🚨 | 2025-04-17 | 2025-04-18 |
19 | Severe CPU memory growth in MPICH | No response | /flare/catalyst/world_shared/zippy/reproducers/issue19 | Tim Williams | | 2025-04-04 | 2025-07-31 |
16 | Catastrophic memory error in context lmp_aurora_kokkos | No response | public LAMMPS | Chris Knight | | 2025-04-03 | 2025-07-23 |
9 | Multithreaded data-transfer can cause page-fault | N/A | Full QMCPACK | Ye Luo | | 2025-04-01 | 2025-05-08 |
8 | Lots of H2D copies produce CPU I9 error and incorrect value | N/A | Full QMCPACK | Ye Luo | 🚨 | 2025-04-01 | 2025-05-28 |
7 | MPI_Bcast gets faster when turning off XPMEM | pmodels/mpich#7334 | see Issue on MPICH GitHub repo | Ye Luo | | 2025-04-01 | 2025-04-24 |
6 | MPICH memory allocation slows down at scale | pmodels/mpich#7333 | see MPICH issue | Ye Luo | 🚨 | 2025-04-01 | 2025-04-24 |
4 | Incorrect results in receive buffer in GPU memory | MPICH 7312 | grid application (lattice QCD) | Patrick Steinbrecher, Tim Williams | 🚨 | 2025-03-25 | 2025-04-24 |
3 | Linker error found by XGC | CMPLRLLVM-66496 | /home/zippy/smalltests/aurora/xgc42/fails | Tim Williams | | 2025-03-19 | 2025-03-28 |
Update tables
The tables are updated automatically every night. To update now, wait 10-15 s after the last change to AuroraBugTracking Issues, then run (anywhere on a machine that has authenticated with gh):
Or execute aurora-bug-table-sync.sh to run every step automatically, so you know exactly when the changes are live online.
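
Both routes assume that gh is installed and authenticated, and that a short wait has passed since the last issue edit. A minimal pre-flight sketch of those two checks follows. The function name `ready_to_sync` and its default 15 s wait are illustrative assumptions, not part of the AuroraBugTracking tooling, and the sketch does not run the update itself.

```shell
# Hypothetical helper (not part of the original tooling): check the
# prerequisites described above before triggering a table update.
ready_to_sync() {
    # Seconds to wait after the last issue change; 15 is an assumed default
    # taken from the upper end of the 10-15 s guidance.
    wait_s="${1:-15}"

    # The GitHub CLI must be on PATH.
    if ! command -v gh >/dev/null 2>&1; then
        echo "gh not found" >&2
        return 2
    fi

    # `gh auth status` exits non-zero when no valid login is configured.
    if ! gh auth status >/dev/null 2>&1; then
        echo "gh not authenticated" >&2
        return 1
    fi

    # Give the issue tracker time to settle before syncing.
    sleep "$wait_s"
}
```

Calling `ready_to_sync && ./aurora-bug-table-sync.sh` would run the script only when both checks pass; the `./` location of the script is likewise an assumption.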