Applying foundation models within the Department of Energy’s (DOE’s) missions presents a multilayered set of technical and operational challenges. These models, which emerged from success in domains such as natural language processing and vision, struggle to transfer directly into DOE’s computational science workflows that require physical consistency, mesh- or geometry-aware representations, and scalable inference across high-dimensional, multiscale partial differential equation systems (Pyzer-Knapp et al. 2025). DOE applications such as reactor modeling, Earth systems prediction, and fusion simulation involve high-dimensional, spatiotemporal fields with millions to trillions of values per instance, placing extreme demands on memory, computational throughput, and architectural efficiency. The absence of embedded physical constraints in standard foundation model architectures, combined with stochastic training dynamics, emergent capabilities, and nondeterministic behaviors, hinders scientific reliability, complicates verification, and reduces confidence in high-stakes scenarios (Babuska and Oden 2004). The core promise of foundation models, pretraining across diverse tasks and modalities to enable broad generalization, is precisely what introduces new risks in scientific domains where accuracy, stability, and reproducibility are paramount (Palmer and Stevens 2019). Scientific foundation models are expected to extrapolate across physical regimes, boundary conditions, and domain geometries with minimal adaptation, yet this capability remains largely aspirational in practice. Fine-tuning on downstream scientific problems often proves computationally expensive, brittle, and sensitive to discretization artifacts, with performance degrading when faced with domain shifts or mesh changes (Radova et al. 2025). The lack of standardized data sets for DOE-relevant
systems further hampers reproducibility, robust benchmarking, and model transferability across simulation codes or physical domains. DOE use cases also demand task interactivity and feedback integration, such as real-time control of plasma confinement or anomaly detection in sensor networks. These agentic and dynamic workflows are not typically reflected in the static pretraining distributions used to develop generalist foundation models, which are often drawn from web-scale or simulation-agnostic corpora. Consequently, adapting pretrained foundation models for DOE environments requires techniques such as domain-specific simulation environments, reward-informed data-relabeling pipelines, digital twin infrastructure, and architectural modifications that encode physical priors or conservation laws (Yuan et al. 2025). Even in data-rich domains, the absence of reward structures, labeled physics, or causal annotations limits the ability to drive meaningful adaptation. In addition, the need to accommodate heterogeneous data types, such as text, sensor streams, video, and mesh-based simulations, introduces architectural challenges in designing foundation models that can jointly align, fuse, and validate across disparate modalities while preserving spatiotemporal and physical coherence (Mukherjee et al. 2025).
Collaboration with industry introduces additional constraints. Proprietary model weights, restricted data access, and closed-source infrastructure often prevent rigorous verification, validation, and uncertainty quantification (VVUQ) and reproducibility practices, especially when security, transparency, or auditability are required.
Finally, the energy and computational costs of training and adapting large foundation models, particularly across diverse scientific regimes, impose significant burdens on DOE facilities (Koch et al. 2025). Addressing these challenges will require coordinated investments in energy-efficient and sustainable foundation model development, physically informed architectures, domain-specific VVUQ methodologies, and infrastructure for transparent, traceable, and reproducible deployment across DOE’s science and national security missions (Teranishi et al. 2025).
AI assurance for foundation models refers to the evidence-based process of demonstrating that a system is reproducible, auditable, and fit for purpose in DOE mission settings. Assurance is tied to acceptance criteria declared in advance for a specific task and operating regime, and results must be repeatable across software environments and hardware platforms. It is not a single evaluation step but a continuous life-cycle discipline spanning model conception, training, deployment, and requalification. This framing echoes emerging life-cycle models for trustworthy AI (Afroogh et al. 2024) and conceptual roadmaps that advocate a “never trust, always verify” paradigm for AI systems (Tidjon and Khomh 2022).
At the requirements stage, DOE programs should specify quantitative criteria for accuracy, stability, and latency. Verification must enforce conformance with
physical conservation laws, invariants, boundary conditions, and unit consistency, as well as portability across meshes and geometries, using both synthetic and experimental data. Validation extends beyond simulation comparison to include closed-loop testing with controllers or optimizers in the loop, where stability margins, constraint-violation rates, and worst-case performance are directly measured. Recent assurance frameworks emphasize that validation should be tied to empirical conditions of use, not only static benchmarks (Bloomfield and Rushby 2024).
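As a minimal illustration of how such physics conformance checks can be automated, the following Python sketch tests whether a surrogate’s predicted density field preserves total mass and positivity on a periodic mesh; the surrogate `predict_density`, the mesh, and the tolerances are hypothetical placeholders rather than any specific DOE code.

```python
import numpy as np

def predict_density(rho):
    """Stand-in for a foundation model surrogate (hypothetical):
    here, a neighbor average on a periodic grid as a placeholder update."""
    return 0.5 * (np.roll(rho, 1) + np.roll(rho, -1))

def check_conservation(rho_in, rho_out, dx, mass_tol=1e-8):
    """Verify that the predicted field conserves total mass and stays nonnegative."""
    mass_in = np.sum(rho_in) * dx
    mass_out = np.sum(rho_out) * dx
    mass_err = abs(mass_out - mass_in) / max(abs(mass_in), 1e-30)
    return {
        "mass_relative_error": mass_err,
        "mass_conserved": mass_err <= mass_tol,
        "nonnegative": bool(np.all(rho_out >= 0.0)),
    }

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 256, endpoint=False)
    rho = 1.0 + 0.1 * np.sin(2 * np.pi * x)            # synthetic density field
    report = check_conservation(rho, predict_density(rho), dx=x[1] - x[0])
    assert report["mass_conserved"] and report["nonnegative"], report
    print(report)
```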
Uncertainty quantification is decision-linked: predictive coverage should be calibrated against DOE-relevant distributions, and provenance must trace uncertainty sources to modalities, training stages, and preprocessing steps. To support this, foundation models must carry reproducibility dossiers documenting data set lineage, hash-verified snapshots, training seeds, hardware and software stacks, and code commits. Determinism budgets should quantify acceptable drift across multinode and mixed-precision runs. This aligns with recent calls for comprehensive trustworthiness assessment across robustness, transparency, and accountability dimensions (Kowald et al. 2024).
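A reproducibility dossier and determinism budget can be represented as simple, machine-checkable records; the sketch below is illustrative only, and the field names and drift tolerance are assumptions rather than an established DOE schema.

```python
from dataclasses import dataclass, asdict

import numpy as np

@dataclass
class ReproducibilityDossier:
    """Illustrative dossier fields; not an established DOE schema."""
    dataset_sha256: str
    training_seed: int
    code_commit: str
    hardware: str
    software_stack: str

def within_determinism_budget(run_a, run_b, budget=1e-6):
    """Determinism budget: maximum acceptable output drift between repeated
    runs on the same inputs (e.g., across nodes or numeric precisions)."""
    drift = float(np.max(np.abs(np.asarray(run_a) - np.asarray(run_b))))
    return drift <= budget, drift

if __name__ == "__main__":
    dossier = ReproducibilityDossier(
        dataset_sha256="<sha256 of the data snapshot>",
        training_seed=1234,
        code_commit="<git commit id>",
        hardware="<node and accelerator description>",
        software_stack="<framework and library versions>",
    )
    print(asdict(dossier))
    print(within_determinism_budget([1.0, 2.0], [1.0, 2.0 + 5e-7]))
```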
Deployment in high-consequence settings such as fusion control or grid operations requires staged test-beds. Models first undergo software-in-the-loop trials with high-fidelity simulators, advance to hardware-in-the-loop testing on the target control stack, and finally, operate in shadow mode with full telemetry in the live environment. Full deployment proceeds only if predeclared acceptance criteria are satisfied in the simulator and hardware-in-the-loop stages; any modification to data, model, controller, or operating envelope triggers mandatory requalification. This staged life cycle reflects broader proposals for trustworthy and safe AI architecture (Chen et al. 2024) and ensures that DOE’s mission applications meet safety and reliability requirements before operational use.
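The staged progression can be expressed as an explicit gate that advances only when predeclared criteria are met and that resets on any change to data, model, controller, or operating envelope; the sketch below is schematic, with stage names taken from the text and criteria names invented for the example.

```python
# Illustrative staged-deployment gate; stage names follow the text, logic is a sketch.
STAGES = ["software_in_the_loop", "hardware_in_the_loop", "shadow_mode", "deployed"]

class StagedDeployment:
    def __init__(self, acceptance_criteria):
        self.criteria = acceptance_criteria   # predeclared per-stage criteria
        self.stage = 0

    def advance(self, evidence):
        """Advance only if every predeclared criterion for the current stage holds."""
        current = STAGES[self.stage]
        if all(evidence.get(k) for k in self.criteria[current]):
            self.stage = min(self.stage + 1, len(STAGES) - 1)
        return STAGES[self.stage]

    def requalify(self):
        """Any change to data, model, controller, or envelope restarts the cycle."""
        self.stage = 0
        return STAGES[self.stage]

if __name__ == "__main__":
    gate = StagedDeployment({
        "software_in_the_loop": ["sim_accuracy_ok", "stability_margins_ok"],
        "hardware_in_the_loop": ["timing_ok", "constraint_violations_ok"],
        "shadow_mode": ["telemetry_agreement_ok"],
        "deployed": [],
    })
    print(gate.advance({"sim_accuracy_ok": True, "stability_margins_ok": True}))
    print(gate.requalify())   # a model change sends the system back to the first stage
```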
Operational safeguards must be integral to the assurance framework. These include watchdogs and admission control for computing resources, fixed profile execution bounds, and certified fallback controllers. Out-of-distribution detection should be paired with safe degradation policies such as hold-last-good. Where counterfactual reasoning is central, training should be coupled with interventional simulators, and validation should include intervention suites and replay of historical logs. The importance of embedding such protections has been emphasized in the broader AI governance literature (Blau et al. 2024) and in proposals for architectural frameworks for AI safety (Chen et al. 2024).
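A hold-last-good policy paired with a simple out-of-distribution detector might look like the following sketch; the Mahalanobis-distance detector, threshold, and fallback action are illustrative assumptions, not a recommended design.

```python
import numpy as np

class HoldLastGood:
    """Safe-degradation policy: replay the last in-distribution action
    whenever the detector flags an out-of-distribution input."""
    def __init__(self, fallback_action):
        self.last_good = fallback_action      # certified fallback to start

    def act(self, proposed_action, in_distribution):
        if in_distribution:
            self.last_good = proposed_action
            return proposed_action
        return self.last_good                 # degrade safely

def mahalanobis_ood(x, mean, cov_inv, threshold):
    """Simple OOD check: flag inputs far from the training feature distribution."""
    d = x - mean
    score = float(d @ cov_inv @ d)
    return score <= threshold, score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(1000, 4))                     # stand-in training features
    mean, cov_inv = train.mean(0), np.linalg.inv(np.cov(train.T))
    policy = HoldLastGood(fallback_action=np.zeros(2))
    ok, _ = mahalanobis_ood(np.array([8.0, 8.0, 8.0, 8.0]), mean, cov_inv, threshold=20.0)
    print(policy.act(proposed_action=np.ones(2), in_distribution=ok))  # falls back
```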
By consolidating VVUQ, reproducibility, robustness testing, and staged deployment into a unified life cycle, DOE can ensure that foundation models are evaluated with the same rigor long applied to scientific codes.
Verification ensures that foundation models are implemented correctly and yield outputs consistent with physical principles (Gurieva et al. 2022). For scientific applications at DOE scale, this requires more than standard software testing.
The size and complexity of modern foundation models, which often contain billions of parameters and are applied to high-dimensional spatiotemporal fields, demand modular verification strategies that address emergent behaviors, stochastic dynamics, and numerical stability. This is especially critical for systems where any violation of conservation laws or symmetry principles may have safety or operational consequences.
Some DOE applications require inference that is not only accurate but also predictable in timing and auditable in operation so that control and protection functions consistently meet strict deadlines. For these settings, foundation model pipelines must be engineered to satisfy fixed execution budgets, deliver deterministic behavior under load, and fail safely when assumptions are violated. A practical assurance profile includes predictable worst-case execution time established through fixed-profile scheduling on the target platform, hardware-in-the-loop staging before any field activation, and phased deployment that begins in shadow mode with full telemetry before actuation authority is granted. Safety must be ensured through conservative fallback controllers when timing bounds or input validity checks are not met. Continuous audit trails should capture timing, inputs, intermediate states, and actions to ensure full traceability. Additional safeguards include admission control for computing resources, watchdogs, and out-of-distribution input tests that automatically trigger safe states. Acceptance criteria must demonstrate that closed-loop stability and protection margins are preserved across the specified disturbance set and operating envelope. Scientific data sets further complicate the task. Inputs such as three-dimensional mesh-based simulation fields often contain trillions of values, overwhelming conventional memory and computing pipelines. Differences introduced by stochastic initialization, hardware platforms, or software libraries can lead to inconsistent model outputs, undermining reproducibility and making fault tracing difficult (Barton et al. 2022). Most foundation model architectures, especially transformer-based models, are trained on data sets with limited fidelity to physical systems or simulation-specific structure. As a result, it is difficult to determine whether their predictions honor physical realism, particularly in applications such as turbulent flow or magnetohydrodynamics. These issues are compounded when industry partnerships restrict access to pretrained weights or codebases, limiting transparency and reproducibility (Yang et al. 2020).
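The execution-budget and fallback behavior described above for real-time settings can be sketched as a guarded inference step; the deadline, the stand-in controllers, and the post hoc timing check below are illustrative simplifications of what a fixed-profile, hardware-enforced watchdog would provide in practice.

```python
import time
import numpy as np

DEADLINE_S = 0.002            # fixed execution budget (illustrative value)

def fallback_controller(state):
    """Conservative certified fallback (hypothetical): hold a safe set-point."""
    return np.zeros_like(state)

def model_controller(state):
    """Stand-in for foundation model inference."""
    return -0.5 * state

def guarded_step(state, input_valid):
    """Watchdog-style guard: use the learned controller only if inputs are valid
    and the deadline is met; otherwise fall back and report the violation."""
    if not input_valid:
        return fallback_controller(state), "invalid_input"
    start = time.perf_counter()
    action = model_controller(state)
    elapsed = time.perf_counter() - start
    if elapsed > DEADLINE_S:
        return fallback_controller(state), f"deadline_miss:{elapsed:.4f}s"
    return action, "nominal"

if __name__ == "__main__":
    print(guarded_step(np.array([1.0, -2.0]), input_valid=True))
```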
Sustainability is another key concern. Verification of large foundation models across multiple scientific domains often involves retraining or revalidation, which incurs high energy and computational costs (Han et al. 2023). As model sizes continue to grow, DOE must evaluate energy-efficient alternatives and sustainability metrics to ensure that foundation model verification remains viable at scale.
Addressing these challenges requires adopting modular model designs that support isolated testing and interpretation of internal components. This approach is already used in several scientific and engineering pipelines. In operator-learning architectures (Hossain et al. 2025; Kobayashi and Alam 2024; Kobayashi et
al. 2025; Lu et al. 2021), the branch-trunk decomposition (e.g., multiple-input operator nets) cleanly separates an encoder branch from a trunk coordinate network, allowing the encoder to be frozen while the trunk is unit-tested on synthetic or gold-standard fields. Neural operator methods with adapter layers create plug-in modules that can be swapped or ablated while holding the core operator architecture fixed (e.g., modular operator learning approaches such as multioperator architecture; Zhang 2024). In hybrid modeling, learned subgrid closures or surrogate modules are routinely inserted into traditional solvers (e.g., turbulence closures in fluid or atmospheric codes) so that the learned module can be validated separately under canonical flow conditions before being integrated into the full solver. (See survey of machine learning closure modeling in turbulence; Beck and Kurz 2021.)
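The following PyTorch sketch illustrates the branch-trunk separation in a DeepONet-style operator model (after Lu et al. 2021), with the branch encoder frozen and the trunk unit-tested in isolation on synthetic coordinates; layer sizes and module names are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.Tanh()]
    return nn.Sequential(*layers[:-1])       # drop the final activation

class DeepONetLike(nn.Module):
    """Branch-trunk decomposition: the branch encodes sensor samples of the input
    function, the trunk encodes query coordinates, and the output is their inner product."""
    def __init__(self, n_sensors=32, coord_dim=1, latent=64):
        super().__init__()
        self.branch = mlp([n_sensors, 128, latent])
        self.trunk = mlp([coord_dim, 128, latent])
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u_sensors, y_coords):
        b = self.branch(u_sensors)           # (batch, latent)
        t = self.trunk(y_coords)             # (n_points, latent)
        return b @ t.T + self.bias           # (batch, n_points)

model = DeepONetLike()
for p in model.branch.parameters():          # freeze the encoder branch
    p.requires_grad_(False)

# Unit-test the trunk alone on synthetic coordinates (shape and finiteness checks).
y = torch.linspace(0.0, 1.0, 50).unsqueeze(-1)
t = model.trunk(y)
assert t.shape == (50, 64) and torch.isfinite(t).all()
```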
Retrieval-augmented pipelines also already evaluate retriever and predictor modules separately, enabling stress tests of the knowledge interface. Mixture-of-experts (MoE) and routing architectures expose per-expert behavior that can be profiled with targeted inputs and compared against reference cases (e.g., recent MoE gating models showing analyzable expert routing; Nabian and Choudhry 2025).
In practice, isolation is enforced using stable interfaces and test harnesses: strict component contracts for inputs and outputs, synthetic data generators to probe edge-case behavior, golden tests on curated benchmarks, and swap-in or swap-out experiments that leave the surrounding system unchanged except for the module under test. These patterns demonstrate that isolated testing and interpretation are not only possible but already in use in modern scientific and machine learning systems, and they can be extended to foundation models intended for DOE mission-critical deployment.
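Golden tests and swap-in/swap-out experiments reduce to a small amount of harness code once component contracts are fixed; in the sketch below, the pipeline, reference module, and candidate module are stand-ins used only to show the pattern.

```python
import numpy as np

def golden_test(module, inputs, golden_outputs, rtol=1e-5, atol=1e-8):
    """Golden test: the module under test must reproduce curated reference
    outputs on a fixed benchmark within declared tolerances."""
    got = module(inputs)
    return np.allclose(got, golden_outputs, rtol=rtol, atol=atol)

def swap_experiment(pipeline, reference_module, candidate_module, inputs):
    """Swap-in/swap-out: only the module under test changes; the surrounding
    pipeline and inputs are held fixed, so output differences are attributable."""
    baseline = pipeline(reference_module, inputs)
    candidate = pipeline(candidate_module, inputs)
    return float(np.max(np.abs(baseline - candidate)))

if __name__ == "__main__":
    pipeline = lambda module, x: module(x) * 2.0           # fixed surrounding system
    ref = lambda x: np.sin(x)                               # reference module
    cand = lambda x: np.sin(x) + 1e-7                       # candidate module
    x = np.linspace(0.0, 1.0, 100)
    print(golden_test(ref, x, np.sin(x)))                   # True
    print(swap_experiment(pipeline, ref, cand, x))          # ~2e-7
```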
Benchmarking across DOE high-performance computing (HPC) environments can reduce platform-induced variability, while federated test-beds enable collaboration with industry partners without compromising sensitive intellectual property. Verification efforts should be tightly integrated with comprehensive uncertainty documentation, capturing both aleatory and epistemic components to support robust deployment decisions. To ensure that foundation models are viable for science and engineering, users must treat verification as a foundational component of trust, aligned with sustainability and reproducibility objectives (Mahmood et al. 2024).
Validation assesses whether foundation model outputs faithfully reflect real-world behavior, particularly in mission-critical DOE applications such as reactor dynamics, grid stability, and materials performance (Wong et al. 2023). This requires systematic comparison of foundation model predictions against experimental observations and high-fidelity simulations (Hsieh et al. 2021), ensuring
alignment with physical laws and constraints, such as energy conservation, continuity, and thermodynamic consistency. For complex systems such as microreactors, where safety margins are narrow and data availability is limited, validation must account for data quality, physical plausibility, and generalizability.
High-quality, representative data sets are essential to foundation model validation. Yet DOE domains often contend with sparse, noisy, or biased data, especially from heterogeneous physical systems such as renewable energy grids or coupled fluid–structure systems. These challenges are compounded by a lack of standardized benchmarks and by the diverse modalities and formats typical of scientific simulations, including scalar, vector, and tensor fields. Furthermore, the geometric dependence of DOE simulations introduces portability concerns, as foundation models trained on one discretization may fail when applied to different meshes or boundary conditions (Brunton et al. 2016; Moscoso et al. 2020).
Validating large-scale, pretrained, multimodal foundation models also entails a significant computational burden. Scientific problems in Earth systems, fusion, or subsurface modeling require validation across spatiotemporal domains and governing equations, often with high-dimensional input–output mappings. Although foundation models are designed to generalize across tasks and scale with data volume, verifying their consistency across multiple physical regimes remains a formidable task (Selin et al. 2024).
To address these issues, DOE can leverage a layered validation strategy. First, experimental cross-validation using real-world data from national user facilities, such as the Advanced Test Reactor, the Advanced Photon Source, or the National Renewable Energy Laboratory, anchors foundation model outputs to physical reality. Second, physics-based benchmarks, such as Monte Carlo neutron transport codes in ExaSMR or SCALE, serve as reference standards for evaluating foundation model fidelity. Where empirical data are sparse, synthetic data sets from validated simulators can support surrogate validation, provided they are curated with traceable metadata and grounded in domain-specific governing equations. For time-critical systems such as fusion control or grid stabilization, validation must also extend to closed-loop behaviors, ensuring stability and performance under uncertainty (Prinn 2013). In turbulence and Earth systems modeling, for example, learned subgrid closures have been validated first on canonical benchmark flows before being integrated into general circulation models, demonstrating that modular surrogate validation is feasible in practice (Beck and Kurz 2021; Hassanian et al. 2025). In nuclear engineering, Monte Carlo neutron transport has long served as a reference standard against which lower-fidelity or surrogate models are calibrated and tested (Leppänen et al. 2015). Similarly, synthetic data from validated simulators are already widely used in fusion and materials science to supplement scarce experimental observations, provided that the synthetic sets carry documented provenance and are grounded in governing equations (Kobayashi et al. 2025). Recent surrogate modeling studies further reinforce this layered approach, including climate emulation with graph neural
networks (Potter et al. 2024), coastal ocean circulation surrogates with physics-based constraints (Xu et al. 2024), adaptive implicit neural representations for high-fidelity scientific simulations (Li et al. 2025), and surrogate-based Bayesian calibration frameworks for climate models (Holthuijzen et al. 2025). Mesh portability challenges have been addressed using graph neural network surrogates on unstructured grids (Shi et al. 2022), and DOE’s Oak Ridge National Laboratory has employed surrogate-based calibration of the E3SM atmosphere model (Yarger et al. 2024). Physics-informed surrogate models have also been demonstrated for groundwater transport forecasting (Meray et al. 2024), while diffusion-based surrogates are emerging for regional climate and sea-ice simulations (Finn et al. 2024). These precedents indicate that the layered validation strategy is not speculative but reflects a growing body of practice across multiple domains.
Importantly, validation is not a binary pass/fail exercise. If a foundation model is shown to be invalid for a given regime, it is not discarded wholesale; instead, its use is confined to conditions where validation evidence is sufficient. In DOE mission settings, this means restricting the model to advisory or exploratory roles until retraining, fine-tuning, or hybridization with physics solvers restores fidelity. Models may also be demoted to shadow-mode operation, where outputs are logged but not acted upon until requalification criteria are met. This mirrors the way traditional simulation codes undergo continuous VVUQ cycles rather than one-time certification. Thus, the layered validation framework both builds on prior evidence and provides structured pathways for handling failure, ensuring that only models with verified domain fidelity are elevated to operational use.
Uncertainty quantification (UQ) is indispensable for the trustworthy use of foundation models in DOE applications (Bilbrey et al. 2025). Unlike traditional simulators with interpretable inputs and outputs, foundation models pretrained on diverse tasks and modalities behave as black-box approximators whose outputs are not explicitly governed by physical laws. This creates deep challenges for UQ, as error sources can propagate across input types, scientific contexts, or temporal regimes without clear attribution or traceability (Wang et al. 2023). Validation cannot rely on predictive fit alone when DOE decisions depend on counterfactuals and operator interventions. Foundation models must preserve causal structure under changes in operating point, control actions, and boundary conditions. Meeting this challenge requires integrating causal formalisms and intervention-based testing into both training and validation. Practical approaches include incorporating physics-based causal graphs or invariance penalties during training, pairing learning with interventional simulators that generate policy-relevant counterfactuals, and extending validation to intervention suites derived from simulation campaigns and historical logs. Evidence of robustness should include not just predictive accuracy but counterfactual fidelity, invariance under
admissible interventions, and stability when the model is exercised in closed-loop control settings. Recognizing causal and interventional robustness as a distinct challenge ensures that DOE foundation models are capable of supporting decision making in safety-critical and policy-relevant environments.
Pretrained foundation models used in DOE settings must often integrate sparse, noisy, or out-of-distribution data to support scientific inference (Moro et al. 2025). This introduces layered uncertainties: aleatory uncertainty from inherent randomness, epistemic uncertainty from incomplete knowledge, and structural uncertainty arising due to domain shift between pretraining and deployment (Moscoso et al. 2020). For example, a foundation model trained on geophysical sensor networks may fail to generalize to grid control scenarios if rare but critical events are underrepresented. Without explicit UQ, narrow predictive intervals may mask failure risks that compromise safety and mission assurance. Multimodal foundation models compound this complexity. Architectures that integrate text, telemetry, simulation output, and high-resolution spatiotemporal fields confront alignment and calibration issues unique to each data type. Classical UQ techniques, which assume homogeneity of inputs and well-defined likelihoods, are poorly suited to these heterogeneous scientific settings. Pretraining on unlabeled corpora also introduces ambiguity about data provenance, fidelity, and representativeness, weakening the basis for uncertainty estimation in downstream DOE applications.
DOE applications demand not only accurate predictions but transparent characterization of uncertainty across heterogeneous data sources. Foundation models must therefore estimate and report uncertainty per modality before composing it at the task level. Each input class, whether text, point sensors, images, or simulation fields, requires its own calibrated noise model and uncertainty head, with ensembles or Bayesian layers providing epistemic estimates of model uncertainty. Out-of-distribution detection should operate at both the single-modality and joint levels to flag inputs outside training distributions. Coverage guarantees must be calibrated with conformal or likelihood-free methods on DOE-relevant distributions to ensure reliability. Every prediction should carry a structured uncertainty record that attributes contributions to specific modalities, training stages, and preprocessing steps. Such provenance enables users, operators, and regulators to understand not only the magnitude of uncertainty but its origin, providing the transparency required for deployment in high-consequence DOE missions. For DOE’s mission-critical use, uncertainty must not only be quantified but also interpretable to domain experts and regulators (NEA 2016). While ensemble methods and Bayesian deep learning offer statistical tools, they do not fully meet DOE’s high-dimensional and context-sensitive requirements (Fort et al. 2020). In domains such as fusion energy or nuclear thermal hydraulics, UQ must resolve sensitivity to mesh discretization, boundary geometry, and initial condition variability (Wang et al. 2022). UQ must be integrated into foundation model pipelines from the outset, rather than retrofitted postdeployment. Early
inclusion allows recursive calibration, scenario-based testing, and adaptive trust assessment as models are transferred across domains or facility environments.
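As one concrete route to calibrated coverage, split-conformal calibration widens predictive intervals using residuals from a held-out, DOE-relevant calibration set and can attach a structured uncertainty record to each prediction; the sketch below assumes a scalar-output surrogate and a single modality purely for illustration.

```python
import numpy as np

def conformal_quantile(residuals, alpha=0.1):
    """Split-conformal calibration: the (1 - alpha) finite-sample quantile of
    calibration residuals sets the half-width needed for target coverage."""
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    return float(np.sort(residuals)[min(k, n) - 1])

def predict_with_record(model, x, q, modality="simulation_field"):
    """Attach a structured uncertainty record to each prediction."""
    y = model(x)
    return {
        "prediction": y,
        "interval": (y - q, y + q),
        "record": {"modality": modality, "method": "split_conformal",
                   "nominal_coverage": 0.9},
    }

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    model = lambda x: 2.0 * x                        # stand-in surrogate
    x_cal = rng.uniform(0, 1, 500)
    y_cal = 2.0 * x_cal + rng.normal(0, 0.1, 500)    # calibration data with noise
    q = conformal_quantile(np.abs(y_cal - model(x_cal)), alpha=0.1)
    print(predict_with_record(model, 0.3, q))
```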
For DOE’s mission-critical settings, predictive fit alone is insufficient. Decision support often requires counterfactual reasoning: how a system responds under interventions such as operator actions, set-point changes, or equipment faults. Foundation models must therefore be validated not just on observed data but on their behavior under interventions and in closed-loop interaction with controllers. Integrating causal robustness into DOE’s assurance framework requires physics-informed causal graphs or invariance penalties during training, coupled with interventional simulators that generate policy-relevant counterfactuals. Validation should extend to structured intervention suites built from simulations and historical logs. Pairing uncertainty quantification with causal checks during pretraining and fine-tuning enables early rejection of models that may replicate passively observed data but collapse under perturbation. Evidence of robustness must include counterfactual fidelity, invariance under admissible interventions, and stability when embedded in control loops. Recognizing causal and interventional robustness as a distinct challenge ensures that foundation models can support DOE operators and regulators with trustworthy, decision-relevant behavior. This alignment with validation and reproducibility workflows gives DOE decision makers a reliable basis for quantifying and managing uncertainty in operational systems (Rudin 2019), with test-beds such as DOE’s Nuclear Energy Advanced Modeling and Simulation program (NEAMS n.d.) and Office of Cybersecurity, Energy Security, and Emergency Response (DOE n.d.) offering structured platforms for future foundation model–UQ integration.
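An intervention suite can be exercised mechanically once an interventional simulator is available; the sketch below compares a stand-in surrogate against a toy linear plant under a set of do-style set-point changes, with all dynamics and error magnitudes invented for illustration.

```python
import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.8]])    # illustrative plant dynamics
B = np.array([[0.0], [0.5]])               # actuation channel

def simulator_step(x, u):
    """Interventional simulator: ground-truth response to a do(u) action."""
    return A @ x + B @ u

def learned_step(x, u):
    """Stand-in for a foundation model surrogate of the same dynamics."""
    return (A + 1e-3) @ x + B @ u           # small structural error for illustration

def counterfactual_fidelity(x0, interventions):
    """Maximum discrepancy between model and simulator over an intervention suite."""
    worst = 0.0
    for u in interventions:
        gap = np.max(np.abs(learned_step(x0, u) - simulator_step(x0, u)))
        worst = max(worst, float(gap))
    return worst

if __name__ == "__main__":
    suite = [np.array([du]) for du in (-1.0, -0.5, 0.0, 0.5, 1.0)]   # set-point changes
    print(counterfactual_fidelity(np.array([1.0, 0.5]), suite))
```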
Ultimately, general-purpose foundation models are not viable for deployment in DOE’s regulatory and high-risk environments without multimodal, physics-aware, and domain-transferable UQ mechanisms that match the complexity and societal stakes of DOE science. Although foundation models offer compelling new capabilities, DOE cannot assume that existing VVUQ practices for traditional simulation codes apply directly. At present, foundation models should be pursued as research assets whose deployment in high-consequence settings depends on the creation of assurance frameworks. This means that near-term use is appropriate for exploratory science, surrogate modeling, and advisory applications, but operational roles in control, protection, or licensing should await the development of DOE-specific VVUQ, reproducibility, and assurance standards. Thus, the immediate recommendation is not to prohibit use but to invest in dedicated research that adapts and extends VVUQ methods to the foundation model context, establishing the evidence base required for safe and certifiable deployment.
Conclusion 5-1: VVUQ methods analogous to those for traditional computational modeling do not exist for, or map directly onto, foundation models.
Reproducibility is the ability to replicate results under consistent conditions, a foundational requirement for scientific integrity and model trustworthiness. In the context of foundation models, especially those pretrained across heterogeneous data modalities and designed for cross-task generalization, reproducibility becomes significantly more complex. These models are often trained on massive, uncurated data sets, under evolving software environments and stochastic training routines (Laine et al. 2021). Such variability introduces silent failure modes that can undermine reliability in DOE’s high-stakes domains (Tian et al. 2018), where model outputs may influence nuclear safety evaluations, advanced material qualification, or infrastructure resilience planning (Wang et al. 2025). Unlike narrowly scoped machine learning models, foundation models function as multipurpose, continuously evolving systems. Their ability to generalize across modalities (e.g., text, simulation data, and sensor fields) and across tasks introduces deeper reproducibility risks. The same model may be applied to subchannel thermal-hydraulics in one instance and to geospatial risk mapping in another, with minimal retraining. Without rigorous documentation of pretraining data sets, transfer learning decisions, and model evolution, the provenance of any single prediction becomes difficult to verify or audit. Moreover, generalist models often operate with latent knowledge acquired during pretraining stages that are difficult to retrace or validate (Pyzer-Knapp et al. 2025).
In DOE contexts, these concerns are not academic. Reproducibility is a precondition for regulatory acceptance, operational deployment, and scientific validation (Allison et al. 2018). Yet, three critical barriers persist. First, nondeterminism due to random weight initialization, floating-point discrepancies, and hardware variability can yield different outputs for the same inputs, especially when dealing with distributed training across heterogeneous platforms (Allison et al. 2018). Second, data and code access are often restricted in national security or proprietary collaborations, making external replication difficult. Third, inconsistent training practices (e.g., undocumented hyperparameters, varying data preprocessing pipelines, or ad hoc fine-tuning) introduce methodological drift across teams and institutions (Nichols et al. 2021).
Addressing these challenges requires intentional infrastructure and cultural shifts. Standardized computing environments, reproducible pipelines using fixed seeds and version-controlled dependencies, and MLOps tooling for experiment lineage must become baseline practices (Nature.com 2021). DOE is uniquely positioned to lead here, leveraging its HPC systems and scientific workflow infrastructure to enforce deterministic model training and versioned data sets. Open science policies, where feasible, should promote model card documentation, training log archival, and reproducibility benchmarks. In secure settings, controlled-access reproducibility test-beds can support internal verification without exposing sensitive materials. Ultimately, the reproducibility of foundation models in science depends on shared codebases, fixed sources of randomness,
and acknowledgment that foundation models are not static endpoints but evolving, reusable artifacts (Nichols et al. 2021). Reproducibility must account for how a model was trained, on what data, for which task, and under which assumptions, while enabling traceable, auditable reuse across new applications. This becomes especially vital as DOE seeks to deploy general-purpose models across institutions and missions, where latent variability may propagate unnoticed and compromise reliability at scale.
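Baseline lineage capture of the kind described above, fixing a single documented seed, hashing the data set snapshot, and recording the software environment, requires little machinery; the sketch below is illustrative and omits the version-controlled dependency and workflow tooling a production pipeline would add.

```python
import hashlib
import json
import platform
import random
import sys

import numpy as np

def fix_seeds(seed=1234):
    """Single, documented source of randomness for the run."""
    random.seed(seed)
    np.random.seed(seed)
    return seed

def lineage_record(seed, dataset_bytes, extra=None):
    """Experiment lineage: environment, seed, and hash-verified data snapshot."""
    return {
        "seed": seed,
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "numpy": np.__version__,
        "notes": extra or {},
    }

if __name__ == "__main__":
    seed = fix_seeds()
    data = np.arange(10, dtype=np.float64).tobytes()     # stand-in data set snapshot
    print(json.dumps(lineage_record(seed, data), indent=2))
```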
In DOE contexts, fit for purpose means that a foundation model can be demonstrated to satisfy acceptance criteria that are explicitly matched to the safety, security, and reliability demands of its intended use. For exploratory science and low-risk applications, this may require only statistical fidelity, convergence under refinement, and reproducibility of results across runs and platforms. For regulatory or mission-relevant applications, fit for purpose raises the bar: models must provide deterministic behavior within specified tolerances, complete provenance of data and training decisions, and calibrated uncertainty estimates with coverage guarantees tied to DOE-relevant distributions. For real-time control or protection functions, fit for purpose requires safety certification: predictable execution under bounded latency and jitter, validated closed-loop stability margins, and robust fallback or fail-safe behavior under disturbance.
Mapping VVUQ to these tiers ensures that DOE foundation models are not treated as “one size fits all,” but are qualified according to the risks they manage. Tiered acceptance criteria might include (1) reproducibility benchmarks and physics-based consistency checks for discovery science; (2) reproducibility dossiers, provenance logging, and validated uncertainty quantification for regulatory use; and (3) hardware-in-the-loop timing guarantees, interventional validation suites, and documented fail-safe policies for mission-critical control. By embedding these criteria, fit for purpose becomes an operational standard rather than a rhetorical goal, aligning model trustworthiness with the concrete safety, security, and reliability needs of DOE missions.
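One way to make such tiers operational is an explicit, machine-checkable mapping from evidence to the highest tier whose criteria are fully satisfied; the tier names follow the text, while the individual criteria and their encoding below are illustrative assumptions rather than DOE policy.

```python
# Illustrative tiered acceptance gate; tier names and criteria are assumptions.
TIERS = {
    "discovery":  {"reproducible": True, "physics_checks_pass": True},
    "regulatory": {"reproducible": True, "physics_checks_pass": True,
                   "provenance_logged": True, "uq_coverage_ok": True},
    "control":    {"reproducible": True, "physics_checks_pass": True,
                   "provenance_logged": True, "uq_coverage_ok": True,
                   "hil_timing_ok": True, "failsafe_documented": True},
}

def qualified_tier(evidence):
    """Return the highest tier whose criteria are all met by the supplied evidence."""
    best = None
    for tier, criteria in TIERS.items():
        if all(evidence.get(k) == v for k, v in criteria.items()):
            best = tier
    return best

if __name__ == "__main__":
    evidence = {"reproducible": True, "physics_checks_pass": True,
                "provenance_logged": True, "uq_coverage_ok": False}
    print(qualified_tier(evidence))     # -> "discovery"
```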
Conclusion 5-2: VVUQ, interpretability, and reproducibility are critical for establishing and maintaining trust in systems that are inherently complex, opaque, and increasingly deployed in high-stakes situations. Integrating VVUQ into foundation models would increase their trustworthiness, reliability, and fitness for purpose, which is essential for future scientific discovery and innovation.
Recommendation 5-1: The Department of Energy (DOE) should lead the development of verification, validation, and uncertainty quantification frameworks tailored to foundation models, with built-in support for physical consistency, structured uncertainty quantification, and reproducible benchmarking in DOE-relevant settings.
Conclusion 5-3: AI for science will demand more and different physical experiments to validate the veracity of the AI predictions. Empirical grounding ensures that foundation model outputs reflect physical laws and domain-specific behavior. This is especially critical in high-stakes DOE applications, where simulations alone cannot guarantee correctness, and where physical experiments provide the only definitive test of predictive validity.
Recommendation 5-2: In line with Recommendation 4-2, the Department of Energy should place high priority on data collection efforts to support reproducible foundation model training and validation, analogous to traditional efforts in verification, validation, and uncertainty quantification.
Recommendation 5-3: The Department of Energy should establish and enforce standardized protocols and develop benchmarks for training, documenting, and reproducing foundation models for science and should participate in defining software standards, addressing randomness, hardware variability, and data access across its laboratories and high-performance computing infrastructure.
Collaborating with AI industry leaders offers both benefits and risks, and DOE should weigh those benefits against the challenges that such collaboration might bring.
The following measures frame responsible use and deployment alongside verification, validation, and uncertainty quantification.
Conclusion 5-4: Partnering of DOE laboratories with industry on AI foundation models will require deliberate effort, including flexible contracting mechanisms, clear intellectual property agreements, data-sharing processes, alignment on VVUQ approaches, responsible AI practices, and a shared understanding of respective missions, objectives, and constraints.
Recommendation 5-4: The Department of Energy should deliberately pursue partnerships with industry and academia to address national mission goals, governed by flexible contracts, responsible artificial intelligence standards, and alignment on reproducibility, verification, validation, and uncertainty quantification approaches and data sharing.
Afroogh, S., A. Akbari, E. Malone, M. Kargar, and H. Alambeigi. 2024. “Trust in AI: Progress, Challenges, and Future Directions.” arXiv:2403.14680. https://ui.adsabs.harvard.edu/abs/2024arXiv240314680A.
Allison, D.B., R.M. Shiffrin, and V. Stodden. 2018. “Reproducibility of Research: Issues and Proposed Remedies.” Proceedings of the National Academy of Sciences of the United States of America 115(11):2561–2562.
Babuska, I., and J.T. Oden. 2004. “Verification and Validation in Computational Engineering and Science: Basic Concepts.” Computer Methods in Applied Mechanics and Engineering 193(36–38): 4057–4066.
Bail, C.A. 2024. “Can Generative AI Improve Social Science?” Proceedings of the National Academy of Sciences of the United States of America 121(21):e2314021121.
Barton, C.M., A. Lee, M.A. Janssen, S. van der Leeuw, G.E. Tucker, C. Porter, J. Greenberg, et al. 2022. “How to Make Models More Useful.” Proceedings of the National Academy of Sciences of the United States of America 119(35):e2202112119.
Beck, A., and M. Kurz. 2021. “A Perspective on Machine Learning Methods in Turbulence Modeling.” GAMM Mitteilungen 44(1):e202100002.
Bilbrey, J.A., J.S. Firoz, M.S. Lee, and S. Choudhury. 2025. “Uncertainty Quantification for Neural Network Potential Foundation Models.” npj Computational Materials 11(1):109.
Blau, W., V.G. Cerf, J. Enriquez, J.S. Francisco, U. Gasser, M.L. Gray, M. Greaves, et al. 2024. “Protecting Scientific Integrity in an Age of Generative AI.” Proceedings of the National Academy of Sciences of the United States of America 121(22):e2407886121.
Bloomfield, R., and J. Rushby. 2024. “Assurance of AI Systems from a Dependability Perspective.” arXiv:2407.13948. https://ui.adsabs.harvard.edu/abs/2024arXiv240713948B.
Brunton, S.L., J.L. Proctor, and J.N. Kutz. 2016. “Discovering Governing Equations from Data by Sparse Identification of Nonlinear Dynamical Systems.” Proceedings of the National Academy of Sciences of the United States of America 113(15):3932–3937.
Chen, C., X. Gong, Z. Liu, W. Jiang, S.Q. Goh, and K.-Y. Lam. 2024. “Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations.” arXiv:2408.12935. https://ui.adsabs.harvard.edu/abs/2024arXiv240812935C.
DOE (Department of Energy). n.d. “About the Office of Cybersecurity, Energy Security, and Emergency Response (CESER).” https://www.energy.gov/ceser/about-office-cybersecurity-energy-security-and-emergency-response, accessed July 31, 2025.
Finn, T.S., C. Durand, A. Farchi, M. Bocquet, P. Rampal, and A. Carrassi. 2024. “Generative Diffusion for Regional Surrogate Models from Sea-Ice Simulations.” Journal of Advances in Modeling Earth Systems 16:e2024MS004395.
Fort, S., H. Hu, and B. Lakshminarayanan. 2020. “Deep Ensembles: A Loss Landscape Perspective.” arXiv. https://doi.org/10.48550/arXiv.1912.02757.
Gurieva, J., E. Vasiliev, and L. Smirnov. 2022. “Application of Conservation Laws to the Learning of Physics-Informed Neural Networks.” Procedia Computer Science 212:464–473.
Han, B.A., K.R. Varshney, S. LaDeau, A. Subramaniam, K.C. Weathers, and J. Zwart. 2023. “A Synergistic Future for AI and Ecology.” Proceedings of the National Academy of Sciences of the United States of America 120(38):e2220283120.
Hassanian, R., Á. Helgadóttir, F. Gharibi, A. Beck, and M. Riedel. 2025. “Data-Driven Deep Learning Models in Particle-Laden Turbulent Flow.” Physics of Fluids 37(2):023348.
Holthuijzen, M.F., A. Chakraborty, E. Krath, and T. Catanach. 2025. “Surrogate-Based Bayesian Calibration Methods for Climate Models: A Comparison of Traditional and Non-Traditional Approaches.” arXiv:2508.13071. https://ui.adsabs.harvard.edu/abs/2025arXiv250813071H.
Hossain, R., F. Ahmed, K. Kobayashi, S. Koric, D. Abueidda, and S.B. Alam. 2025. “Virtual Sensing-Enabled Digital Twin Framework for Real-Time Monitoring of Nuclear Systems Leveraging Deep Neural Operators.” npj Materials Degradation 9(1):21.
Hsieh, A.S., K.A. Brown, N.B. deVelder, T.G. Herges, R.C. Knaus, P.J. Sakievich, L.C. Cheung, B.C. Houchens, M.L. Blaylock, and D.C. Maniaci. 2021. “High-Fidelity Wind Farm Simulation Methodology with Experimental Validation.” Journal of Wind Engineering and Industrial Aerodynamics 218:104754.
Kobayashi, K., and S.B. Alam. 2024. “Deep Neural Operator-Driven Real-Time Inference to Enable Digital Twin Solutions for Nuclear Energy Systems.” Scientific Reports 14(1):2101.
Kobayashi, K., S. Roy, S. Koric, D. Abueidda, and S. Bahauddin Alam. 2025. “From Proxies to Fields: Spatiotemporal Reconstruction of Global Radiation from Sparse Sensor Sequences.” arXiv:2506.12045. https://ui.adsabs.harvard.edu/abs/2025arXiv250612045K.
Koch, F., A. Djuhera, and A. Binotto. 2025. “Intelligent Orchestration for Inference of Large Foundation Models at the Edge.” arXiv. https://doi.org/10.48550/arXiv.2504.03668.
Kowald, D., S. Scher, V. Pammer-Schindler, P. Müllner, K. Waxnegger, L. Demelius, A. Fessl, et al. 2024. “Establishing and Evaluating Trustworthy AI: Overview and Research Challenges.” arXiv:2411.09973. https://ui.adsabs.harvard.edu/abs/2024arXiv241109973K.
Laine, R.F., I. Arganda-Carreras, R. Henriques, and G. Jacquemet. 2021. “Avoiding a Replication Crisis in Deep-Learning-Based Bioimage Analysis.” Nature Methods 18(10):1136–1144.
Leppänen, J., M. Pusa, T. Viitanen, V. Valtavirta, and T. Kaltiaisenaho. 2015. “The Serpent Monte Carlo Code: Status, Development and Applications in 2013.” Annals of Nuclear Energy 82: 142–250.
Li, Z., Y. Duan, T. Xiong, Y.-T. Chen, W.-L. Chao, and H.-W. Shen. 2025. “High-Fidelity Scientific Simulation Surrogates via Adaptive Implicit Neural Representations.” arXiv:2506.06858. https://ui.adsabs.harvard.edu/abs/2025arXiv250606858L.
Lu, L., P. Jin, G. Pang, Z. Zhang, and G.E. Karniadakis. 2021. “Learning Nonlinear Operators via DeepONet Based on the Universal Approximation Theorem of Operators.” Nature Machine Intelligence 3(3):218–229.
Mahmood, S., H. Sun, A.A. Alhussan, A. Iqbal, and E.-S.M. El-kenawy. 2024. “Active Learning-Based Machine Learning Approach for Enhancing Environmental Sustainability in Green Building Energy Consumption.” Scientific Reports 14(1):19894.
Meray, A., L. Wang, T. Kurihana, I. Mastilovic, S. Praveen, Z. Xu, M. Memarzadeh, A. Lavin, and H. Wainwright. 2024. “Physics-Informed Surrogate Modeling for Supporting Climate Resilience at Groundwater Contamination Sites.” Computers & Geosciences 183:105508.
Moro, V., C. Loh, R. Dangovski, A. Ghorashi, A. Ma, Z. Chen, S. Kim, P.Y. Lu, T. Christensen, and M. Soljačić. 2025. “Multimodal Foundation Models for Material Property Prediction and Discovery.” Newton 1(1):100016.
Moscoso, M., A. Novikov, G. Papanicolaou, and C. Tsogka. 2020. “The Noise Collector for Sparse Recovery in High Dimensions.” Proceedings of the National Academy of Sciences of the United States of America 117(21):11226–11232.
Mukherjee, S., J. Lang, O. Kwon, I. Zenyuk, V. Brogden, A. Weber, and D. Ushizima. 2025. “Foundation Models for Zero-Shot Segmentation of Scientific Images Without AI-Ready Data.” arXiv. https://doi.org/10.48550/arXiv.2506.24039.
Nabian, M.A., and S. Choudhry. 2025. “A Mixture of Experts Gating Network for Enhanced Surrogate Modeling in External Aerodynamics.” arXiv:2508.21249. https://ui.adsabs.harvard.edu/abs/2025arXiv250821249N.
Nature.com. 2021. “Moving Towards Reproducible Machine Learning.” Nature Computational Science 1(10):629–630.
NEA (Nuclear Energy Agency). 2016. Review of Uncertainty Methods for Computational Fluid Dynamics Application to Nuclear Reactor Thermal Hydraulics. Organisation for Economic Co-operation and Development.
NEAMS (Nuclear Energy Advanced Modeling and Simulation). n.d. About. https://neams.inl.gov/about-us, accessed July 31, 2025.
Nichols, J.D., M.K. Oli, W.L. Kendall, and G.S. Boomer. 2021. “A Better Approach for Dealing with Reproducibility and Replicability in Science.” Proceedings of the National Academy of Sciences of the United States of America 118(7):e2100769118.
Palmer, T., and B. Stevens. 2019. “The Scientific Challenge of Understanding and Estimating Climate Change.” Proceedings of the National Academy of Sciences of the United States of America 116(49):34390–34395.
Potter, K., C. Martinez, R. Pradhan, S. Brozak, S. Sleder, and L. Wheeler. 2024. “Graph Convolutional Neural Networks as Surrogate Models for Climate Simulation.” arXiv:2409.12815. https://ui.adsabs.harvard.edu/abs/2024arXiv240912815P.
Prinn, R.G. 2013. “Development and Application of Earth System Models.” Proceedings of the National Academy of Sciences of the United States of America 110(Suppl. 1):3673–3680.
Pyzer-Knapp, E.O., M. Manica, P. Staar, L. Morin, P. Ruch, T. Laino, J.R. Smith, and A. Curioni. 2025. “Foundation Models for Materials Discovery—Current State and Future Directions.” npj Computational Materials 11(1):61.
Radova, M., W.G. Stark, C.S. Allen, R.J. Maurer, and A.P. Bartók. 2025. “Fine-Tuning Foundation Models of Materials Interatomic Potentials with Frozen Transfer Learning.” npj Computational Materials 11(1):237.
Rudin, C. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1(5):206–215.
Sarker, I.H. 2022. “AI-Based Modeling: Techniques, Applications and Research Issues Towards Automation, Intelligent and Smart Systems.” SN Computer Science 3(2):158.
ScienceAdviser. 2024. “Accelerating the Discovery of Battery Materials with AI.” https://www.science.org/do/10.1126/science.zw4uuid/full/_20240216_cpub_microsoft_feature-1714408300127.pdf.
Selin, N.E., A. Giang, and W.C. Clark. 2024. “Showcasing Advances and Building Community in Modeling for Sustainability.” Proceedings of the National Academy of Sciences of the United States of America 121(29):e2215689121.
Shi, N., J. Xu, S.W. Wurster, H. Guo, J. Woodring, L.P. Van Roekel, and H.W. Shen. 2022. “GNN-Surrogate: A Hierarchical and Adaptive Graph Neural Network for Parameter Space Exploration of Unstructured-Mesh Ocean Simulations.” arXiv:2202.08956. https://ui.adsabs.harvard.edu/abs/2022arXiv220208956S.
Talirz, L., S. Kumbhar, E. Passaro, A.V. Yakutovich, V. Granata, F. Gargiulo, M. Borelli, et al. 2020. “Materials Cloud, a Platform for Open Computational Science.” Scientific Data 7(1):299.
Teranishi, K., H. Menon, W.F. Godoy, P. Balaprakash, D. Bau, T. Ben-Nun, A. Bhatele, et al. 2025. “Leveraging AI for Productive and Trustworthy HPC Software: Challenges and Research Directions.” arXiv. https://doi.org/10.48550/arXiv.2505.08135.
Tian, D., J. Deng, E. Zio, F. Maio, and F. Liao. 2018. “Failure Modes Detection of Nuclear Systems Using Machine Learning.” In 2018 5th International Conference on Dependable Systems and Their Applications (DSA). IEEE. https://doi.org/10.1109/DSA.2018.00017.
Tidjon, L.N., and F. Khomh. 2022. “The Different Faces of AI Ethics Across the World: A Principle-Implementation Gap Analysis.” arXiv:2206.03225. https://ui.adsabs.harvard.edu/abs/2022arXiv220603225N.
Wang, S., J. González-Cao, H. Islam, M. Gómez-Gesteira, and C. Guedes Soares. 2022. “Uncertainty Estimation of Mesh-Free and Mesh-Based Simulations of the Dynamics of Floaters.” Ocean Engineering 256:111386.
Wang, Z., M. Daeipour, and H. Xu. 2023. “Quantification and Propagation of Aleatoric Uncertainties in Topological Structures.” Reliability Engineering and System Safety 233:109122.
Wang, Z., H. Wei, R. Tian, and S. Tan. 2025. “A Review of Data-Driven Fault Diagnosis Method for Nuclear Power Plant.” Progress in Nuclear Energy 186:105785.
Wong, M.L., C.E. Cleland, D. Arend, S. Bartlett, H.J. Cleaves, H. Demarest, A. Prabhu, J.I. Lunine, and R.M. Hazen. 2023. “On the Roles of Function and Selection in Evolving Systems.” Proceedings of the National Academy of Sciences of the United States of America 120(43):e2310223120.
Xu, Z., J. Ren, Y. Zhang, J.M. Gonzalez Ondina, M. Olabarrieta, T. Xiao, W. He, et al. 2024. “Accelerate Coastal Ocean Circulation Model with AI Surrogate.” arXiv:2410.14952. https://ui.adsabs.harvard.edu/abs/2024arXiv241014952X.
Yang, Y., W. Youyou, and B. Uzzi. 2020. “Estimating the Deep Replicability of Scientific Findings Using Human and Artificial Intelligence.” Proceedings of the National Academy of Sciences of the United States of America 117(20):10762–10768.
Yarger, D., B.M. Wagman, K. Chowdhary, and L. Shand. 2024. “Autocalibration of the E3SM Version 2 Atmosphere Model Using a PCA-Based Surrogate for Spatial Fields.” Journal of Advances in Modeling Earth Systems 16:e2023MS003961.
Yuan, E.C.-Y., Y. Liu, J. Chen, P. Zhong, S. Raja, T. Kreiman, S. Vargas, et al. 2025. “Foundation Models for Atomistic Simulation of Chemistry and Materials.” arXiv. https://doi.org/10.48550/arXiv.2503.10538.
Zhang, Z. 2024. “MODNO: Multi-Operator Learning with Distributed Neural Operators.” Computer Methods in Applied Mechanics and Engineering 431:117229.