Known Issues
This is a collection of known issues that have been encountered during Aurora's early user phase. Documentation will be updated as issues are resolved. Users are encouraged to email support@alcf.anl.gov to report issues.
A known issues page can be found in the CELS Wiki space used for NDA content. Note that this page requires a JLSE Aurora early hw/sw resource account for access.
Runtime Errors
1. Cassini Event Queue overflow detected.
`Cassini Event Queue overflow detected.` errors may occur for certain MPI communications and can happen for a variety of reasons, both software and hardware: job placement, job routing, and the state of the machine. Simply put, it means one of the network interfaces is receiving messages faster than it can process them. For example:

```
libfabric:16642:1701636928::cxi:core:cxip_cq_eq_progress():531<warn> x4204c1s3b0n0: Cassini Event Queue overflow detected.
```
As a workaround, the following environment variables can be set to try alleviating the problem.
```bash
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20
```
The value of `FI_CXI_DEFAULT_CQ_SIZE` can be set to something larger if issues persist. It is directly impacted by the number of unexpected messages sent, so it may need to be increased as the scale of the job increases.
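For example, one might start by doubling the default; the value below is illustrative, not a tested recommendation:

```bash
# Illustrative only: raise further if overflow warnings persist at scale.
export FI_CXI_DEFAULT_CQ_SIZE=262144
```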
Other libfabric environment settings may also be useful to try. In particular, the values that Cray MPI sets by default are a good starting point; see Cray MPI libfabric Settings.
2. failed to convert GOTPCREL relocation
If you see an error like the following at link time:

```
_libm_template.c:(.text+0x7): failed to convert GOTPCREL relocation against '__libm_acos_chosen_core_func_x'; relink with --no-relax
```

then add the `-flink-huge-device-code` flag to the link step.
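A minimal sketch of such a link line, assuming the oneAPI `icpx` driver and placeholder object file names:

```bash
# Placeholder objects; the relevant addition is -flink-huge-device-code.
icpx -fsycl -flink-huge-device-code main.o kernels.o -o app
```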
3. SYCL Device Free Memory Query Error
Note that if you are querying the free memory on a device with the Intel SYCL extension `get_info<sycl::ext::intel::info::device::free_memory>()`, you will need to set `export ZES_ENABLE_SYSMAN=1` in your environment. Otherwise you may see an error like:
```
x1921c1s4b0n0.hostmgmt2000.cm.americas.sgi.com 0: The device does not have the ext_intel_free_memory aspect -33 (PI_ERROR_INVALID_DEVICE)
x1921c1s4b0n0.hostmgmt2000.cm.americas.sgi.com 0: terminate called after throwing an instance of 'sycl::_V1::invalid_object_error'
  what():  The device does not have the ext_intel_free_memory aspect -33 (PI_ERROR_INVALID_DEVICE)
```
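A minimal sketch of a guarded query, assuming a oneAPI DPC++ toolchain; checking the aspect first avoids the abort shown above:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // The free-memory query needs ZES_ENABLE_SYSMAN=1 set in the environment.
    sycl::queue q{sycl::gpu_selector_v};
    sycl::device dev = q.get_device();

    if (dev.has(sycl::aspect::ext_intel_free_memory)) {
        // Free device memory in bytes, via the Intel SYCL extension.
        auto free_bytes =
            dev.get_info<sycl::ext::intel::info::device::free_memory>();
        std::cout << "Free device memory: " << free_bytes << " bytes\n";
    } else {
        std::cout << "ext_intel_free_memory aspect unavailable; "
                     "is ZES_ENABLE_SYSMAN=1 set?\n";
    }
    return 0;
}
```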
4. No VNIs available in internal allocator.
If you see an error like `start failed on x4102c5s2b0n0: No VNIs available in internal allocator`, pass the `--no-vni` flag to `mpiexec`.
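A minimal sketch of such an invocation; the rank counts and binary name are placeholders:

```bash
# Placeholder job geometry; the relevant part is the --no-vni flag.
mpiexec --no-vni -n 24 --ppn 12 ./my_app
```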
5. PMIX ERROR: PMIX_ERR_NOT_FOUND and PMIX ERROR: PMIX_ERROR
When running on a single node, you may observe these error messages:

```
PMIX ERROR: PMIX_ERR_NOT_FOUND in file dstore_base.c at line 1567
PMIX ERROR: PMIX_ERROR in file dstore_base.c at line 2334
```
Submitting Jobs
Jobs may fail to successfully start at times (particularly at higher node counts). If no error message is apparent, one thing to check is the `comment` field in the full job information for the job, using the command:

```bash
qstat -xfw [JOBID] | grep comment
```

Some example comments follow.
- The job is on hold; `qrls` the job when ready for it to be queued again.
- The user has submitted to a queue that is not currently running; the user should `qmove` the job to an appropriate queue.
- The job tried and failed to start. In this scenario, the user should find that their job was placed on hold. This does not indicate a problem with the user's job script; it indicates that PBS made several attempts to find a set of nodes to run the job and was not able to. Users can `qdel` the job and resubmit, or `qrls` the job to try running it again (see the command sketch after this list).
- There is an insufficient number of nodes online and free for the job to start.
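For reference, a sketch of the PBS commands mentioned above, using the same `[JOBID]` placeholder convention as the `qstat` example (the queue name is likewise a placeholder):

```bash
qrls [JOBID]            # release a held job so it becomes eligible to run again
qmove [QUEUE] [JOBID]   # move the job to a different queue
qdel [JOBID]            # delete the job before resubmitting
```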
In the event of a node going down during a job, users may encounter messages such as `ping failed on x4616c0s4b0n0: Application 047a3c9f-fb41-4595-a2ad-4a4d0ec1b6c1 not found`. The node will likely have started a reboot and won't be included in jobs again until its health checks pass.
To increase the chances that a large job does not terminate due to a node failure, you may choose to interactively route your MPI job around nodes that fail during your run. See this page on Working Around Node Failures for more information.
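One possible approach, sketched under the assumption that the PALS `mpiexec` on Aurora accepts a `--hostfile` argument (the node name, job geometry, and binary are placeholders):

```bash
# Build a host list that omits the failed node, then launch on the remaining nodes.
grep -v x4616c0s4b0n0 $PBS_NODEFILE > good_hosts
mpiexec --hostfile good_hosts -n 24 --ppn 12 ./my_app
```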
Other Issues
- Interim Filesystem: The early access filesystem is not highly performant. Intermittent hangs or pauses should be expected; waiting for I/O to complete is recommended, and I/O should complete without failure. Jobs requiring significant filesystem performance must be avoided at this time.
- A large number of Machine Check Events from the PVC GPUs can cause nodes to panic and reboot.
- HBM mode is not automatically validated. Jobs requiring flat memory mode should verify it by checking that `numactl -H` reports 4 NUMA memory nodes instead of 16 on the nodes (see the check after this list).
- Application failures at single-node scale are tracked in the JLSE wiki/confluence page.
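A quick way to run the `numactl` check, assuming it is available on the compute node:

```bash
# Flat memory mode should report 4 NUMA memory nodes rather than 16.
numactl -H | grep "available:"
```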