Two Worlds Collide:
Trustworthiness and Energy Efficiency for Coupled HPC + AI Simulation


Wednesday, 20 November 2024
12:00pm - 1:45pm EST
Location: B208
Georgia World Congress Center,
Atlanta, Georgia, USA.

Held in conjunction with SC24: The International Conference for High Performance Computing, Networking, Storage and Analysis

Welcome

This Birds of a Feather session, “Two Worlds Collide: Trustworthiness and Energy Efficiency for Coupled HPC+AI Simulation,” is the third installment of a series, started in 2021, devoted to discussing and brainstorming solutions for a new paradigm in HPC: the coupling of simulation with artificial intelligence (AI). In this installment, we continue our discussions on needs, use cases, testing, and reproducibility, and add a new focus on energy efficiency: energy savings from speedups must be weighed against the energy cost of training campaigns, which can be substantial. How can we deliver transformative scientific discoveries while also providing efficiency and assurance of correctness?

Link to SC24 BoF Page

Agenda

Time Topic
12:00-12:05 PM Two Worlds Collide BoF Introduction [Slides]
Mark Coletti (ORNL)
Ada Sedova (ORNL)
12:05-12:20 PM Speaker Introductions (2-3 minutes each)
12:20-1:15 PM Panel Discussion and Audience Q&A
Moderator: Mark Coletti (ORNL)
Panelists:
National Laboratories:
  • Mark Coletti (ORNL)
  • Ada Sedova (ORNL)
  • Venkatram Vishwanath (ANL)
  • Oscar Hernandez (ORNL)
  • Riccardo Balin (ANL)
  • Mathieu Taillefumier (CSCS)
Vendors:
  • J. Austin Ellis (AMD)
  • Sanjif Shanmugavelu (Groq)

Discussion Topics

    • How do ideas about accuracy and correctness differ between traditional HPC simulation and applications that use AI, like deep learning?
    • What are the differences in the way that data science treats machine learning vs. the way science needs it to be?
    • Disconnects between the two fields in terms of knowledge, software, and the problems themselves
    • Hallucinations, black-box issues (lack of analytical estimates of convergence, lack of numerical analysis methods for errors), floating point issues, and sensitivity
    • How does extensive runtime testing increase power/energy use?
    • What support infrastructure can we build to track and understand bugs and expected accuracy?

Organizers

  • Mark Coletti (ORNL)
  • Ada Sedova (ORNL)

Join the Discussion!

  • Overleaf
  • Mailing List
  • Slack Channel