Evaluation Theater
NIH's Gold Standard Science Plan Promised a Make-Believe Evaluation Strategy
This image was generated by Google AI (Gemini)
About a year ago, President Trump issued his Executive Order “Restoring Gold Standard Science.” Section 3(d) required agency heads to “report to the OSTP Director on the actions taken to implement Gold Standard Science” within 60 days.
So, on August 22, 2025, the National Institutes of Health released a document called “Leading in Gold Standard Science: An NIH Implementation Plan” including a foreword by NIH Director Jay Bhattacharya. The plan committed NIH to the nine tenets of scientific excellence outlined in the EO: reproducibility, transparency, interdisciplinary collaboration, unbiased peer review, and more. Then it catalogued existing NIH programs (which started long before the Trump administration), described a few new initiatives, and projected confidence about the agency’s commitment to rigorous science.
“Gold standard science isn’t just what we strive for, it is embedded in everything we do, from the research we support to the policies and programs we create.” - Jay Bhattacharya, NIH Director
Curiously, near the end of the document, on pages 13 and 14, the plan made a specific and verifiable promise. It said that NIH would evaluate adherence to all nine tenets using “several frameworks.” It promised to develop “composite measure indexes” to augment evaluation metrics. And it indicated that NIH would “post annual updates of evaluation progress and findings online, sharing widely with public audiences.”
Nine months have passed. No evaluation report exists. This should surprise no one. The NIH “Gold Standard Science” document was simply EO compliance dressed up as institutional substance. But why promise an evaluation? Perhaps an attempt to appear credible and scientific?
As a former NIH Program Officer of 22 years and a scientist who has worked on many real evaluation projects, I frankly laughed out loud at this section of the NIH plan. It was completely ridiculous, just as ridiculous as the EO itself.
Pages 13-14:
Defining Our Success
Evaluation of our programs, policies, and initiatives is foundational to ensuring NIH continues to deliver results for the public. Current and planned NIH initiatives will be periodically assessed for adherence to the nine tenets via initiative-specific retrospective or prospective evaluation. Evaluation metrics will be augmented with several composite measure indexes that are under development.
Evaluation Strategy
NIH may rely upon several frameworks to guide its evaluation activities: the simplified Consolidation [sic] Framework for Implementation Research (CFIR)2 to guide evaluation of the pre-implementation phase; the Reach Effectiveness Adoption Implementation Maintenance (RE-AIM)3 framework to guide evaluation of the implementation and post-implementation phases; and Proctor’s Implementation Outcomes Framework (IOF)4, a framework for translating research into practice and planning programs to improve the odds for successful implementation in “real-world” settings. The use of these frameworks will enable NIH to evaluate adoption, implementation fidelity, and acceptability of aspects of gold standard science as well as sustainability and scalability after initial implementation.
Evaluation Transparency
NIH will post annual updates of evaluation progress and findings online, sharing widely with public audiences. Additionally, through channels such as evaluation presentations at local, regional, and national medical and public health professional meetings, NIH plans to disseminate information on its efforts to support the translation, communication, and incorporation of the nine gold standard science tenets while informing the public of its progress in affirming gold standard science.
Evaluation Structure
Below is a sample evaluation planning table for tenet, Reproducible, subject to refinement based on feasibility, scoping, and resources.
________________________
2 Damschroder LJ, Reardon CM, Widerquist MAO, Lowery J. The updated Consolidated Framework for Implementation Research based on user feedback. Implement Sci. 2022 Oct 29;17(1):75.
n.b. CFIR is a comprehensive typology of constructs likely to influence the implementation of Evidence-Based Innovations (EBIs): (a) intervention characteristics, (b) outer setting, (c) inner setting, (d) characteristics of individuals, and (e) the implementation process.
3 Glasgow RE, Vogt TM, Boles SM. Evaluating the public health impact of health promotion interventions: the RE-AIM framework. Am J Public Health. 1999 Sep;89(9):1322-7.
4 Proctor E, Silmere H, Raghavan R, Hovmand P, Aarons G, Bunger A, Griffey R, Hensley M. “Outcomes for implementation research: conceptual distinctions, measurement challenges, and research agenda.” Adm Policy Ment Health. 2011 Mar;38(2):65-76.
The Promise
To those unfamiliar with legitimate evaluation research, perhaps this appears reasonably constructed. It names three specific scientific frameworks it would use to guide assessment: the Consolidated Framework for Implementation Research (CFIR), as described in Damschroder et al. (2022); the Reach, Effectiveness, Adoption, Implementation, Maintenance (RE-AIM) framework, as described in Glasgow et al. (1999); and Proctor’s Implementation Outcomes Framework (IOF), as described in Proctor et al. (2011). These are all real, peer-reviewed frameworks, published in respected journals, with extensive citation records. Their inclusion in the document creates a strong impression of methodological seriousness.
The plan even organized these frameworks by phase. CFIR would guide evaluation in the “pre-implementation phase.” RE-AIM would guide the “implementation and post-implementation phases.” Proctor’s IOF would help translate research into practice and plan programs for “real-world” settings. Below this description, NIH provided a sample evaluation planning table for the first tenet, Reproducible, listing objectives, outcomes, measures, metrics, analytic tools, data sources, and mechanisms.
It looks like a plan. It reads like a plan. It is not a plan.
What Implementation Science Actually Does
To understand why, it helps to understand what implementation science actually is.
Implementation science, as defined by Eccles and Mittman in 2006, is the scientific study of methods to promote the systematic uptake of research findings and other evidence-based practices into routine clinical practice, in order to improve the quality and effectiveness of health services. A useful plain-language summary comes from a 2024 commentary by Frances Chu in the Journal of the Medical Library Association. Chu describes it this way: the intervention or practice being studied is “The Thing.” Efficacy and effectiveness research asks whether The Thing works. Implementation research then asks how to get people and organizations to do The Thing (Chu, 2024, citing Curran, 2020).
Crucially, implementation science presupposes that The Thing has already been proven to work. It does not evaluate whether agency aspirations are being met. It studies whether a defined, evidence-based practice is being adopted by identifiable implementers in a specific healthcare delivery setting, and what barriers and facilitators shape that adoption. As Chu notes, “the expectation is that an intervention has been demonstrated to be efficacious and effective” before implementation science enters the picture.
This precondition and purpose are not a technicality. They are the entire foundation on which all three of NIH’s cited frameworks rest.
Implementation Science Frameworks Are Completely Irrelevant to the NIH Gold Standard Science Implementation Plan
CFIR is what researchers call a determinant framework. Its purpose, as described by Damschroder et al. (2022), is to identify barriers and facilitators to implementing a defined evidence-based innovation. Before CFIR can be used, users must complete three mandatory steps: define the innovation being implemented, replace broad construct language with project-specific language, and identify the individuals who have power and influence over implementation outcomes.
NIH’s plan identifies no innovation. “Gold standard science” is a set of normative aspirations, not a manualized intervention with developers whose intent could be operationalized. There is no evidence base being translated into practice, no prior efficacy trial of “reproducibility,” no defined setting in which adoption is being measured. Damschroder and colleagues are explicit: “The CFIR must be fully operationalized prior to use in a project.” NIH’s plan skips operationalization entirely, because operationalization is impossible for what the plan actually describes.
RE-AIM was developed by Glasgow et al. (1999) to evaluate the population-level impact of health promotion interventions across five dimensions: Reach, Efficacy (relabeled Effectiveness in later versions of the framework, which is the wording NIH’s plan uses), Adoption, Implementation, and Maintenance. Each dimension requires specific, measurable inputs. Reach is defined as an individual-level measure of the percentage of the target population that participated in the intervention. Adoption refers to the proportion of settings that adopt a given policy or program. Maintenance assesses whether innovations become a stable part of an organization’s behavioral repertoire, with a minimum assessment window of two years.
Now apply these requirements to NIH’s commitment to “reproducible” science. Who are the individual participants whose reach is measured? What is the adoption rate across which defined settings? What behavior at the individual level is being maintained? The plan’s own sample evaluation planning table for tenet 1, Reproducible, proposes metrics like “percentage change in the number of NIH-funded studies meeting protocol standards” and “peer review scores on experimental design.” These may or may not be reasonable things to track. But they have absolutely nothing to do with RE-AIM, which was designed to evaluate smoking cessation programs, diabetes interventions, and worksite health promotion, not the normative commitments of a federal science agency.
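A minimal sketch in Python makes the point concrete. The function names and the counts are hypothetical, invented only for illustration; they do not come from Glasgow et al. or from NIH. As defined above, Reach and Adoption are proportions and Maintenance is a time window, so none of them can be computed until someone specifies a denominator and a start date.

```python
def proportion(count: int, total: int, label: str) -> float:
    """Return count/total, refusing to compute without a defined denominator."""
    if total <= 0:
        raise ValueError(f"{label}: no defined denominator, so nothing to measure")
    if not 0 <= count <= total:
        raise ValueError(f"{label}: count must lie between 0 and the denominator")
    return count / total


def reach(participants: int, target_population: int) -> float:
    """Share of the target population that took part in the intervention."""
    return proportion(participants, target_population, "Reach")


def adoption(adopting_settings: int, eligible_settings: int) -> float:
    """Share of eligible settings that adopted the program."""
    return proportion(adopting_settings, eligible_settings, "Adoption")


def maintenance_assessable(months_since_implementation: int,
                           minimum_months: int = 24) -> bool:
    """Maintenance can only be judged once the two-year window has elapsed."""
    return months_since_implementation >= minimum_months


if __name__ == "__main__":
    # Hypothetical worksite program: 150 of 600 employees enrolled,
    # and 10 of 40 eligible worksites offered it.
    print(reach(150, 600))             # 0.25
    print(adoption(10, 40))            # 0.25
    print(maintenance_assessable(9))   # False
```

For a worksite health program, every argument has an obvious source. For “gold standard science,” the plan never says what would be passed in as `target_population` or `eligible_settings`, which is exactly the problem.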
Proctor’s Implementation Outcomes Framework defines eight conceptually distinct implementation outcomes: acceptability, adoption, appropriateness, feasibility, fidelity, implementation cost, penetration, and sustainability (Proctor et al., 2011). Each of these outcomes requires a specific evidence-based practice as its referent. Fidelity, for example, is defined as the degree to which an intervention was implemented as it was prescribed in the original protocol or as intended by its developers. Penetration is the integration of a practice within a service setting and its subsystems.
The questions that follow are unanswerable given what NIH has written. Fidelity to what protocol? Penetration of what practice into which service setting? There are no developers whose intent could serve as a benchmark. There is no protocol. There is no service setting in the implementation science sense. Proctor’s framework is a tool for understanding whether a specific medical or behavioral health treatment is being delivered consistently across clinics. It cannot evaluate whether a government agency sincerely values unbiased peer review.
The Deeper Problem
It would be charitable to read the citation of these frameworks as a well-intentioned error, the result of a dutiful junior staff member who added these framework citations without fully understanding what they do. That charitable reading is available.
But it strains credulity when applied to Bhattacharya, who is a trained health economist with deep familiarity with research methodology, and whose name and opening quotation appear on the document. It strains credulity further when one notices that the frameworks were not cited generically but with specific journal article references, in a section organized by implementation phase, and yet the plan cannot correctly name the first framework it cites, calling it the “Consolidation Framework” rather than the Consolidated Framework for Implementation Research. Someone knew enough to find the citation. Nobody knew enough to read it carefully. The result is that three legitimate scientific frameworks were erroneously conscripted into an evaluation context for which none of them were designed.
The more uncomfortable interpretation is that the citations are doing what decorative citations always do: creating the appearance of rigor without the substance. They tell a reader who does not look closely that serious methodological thought has gone into this evaluation strategy. They tell a reader who does look closely something else entirely.
This is what evaluation theater looks like. Not the absence of evaluation, which can sometimes reflect genuine resource constraints or bureaucratic delay. Evaluation theater is the performance of evaluation: frameworks cited, phases described, metrics listed, an annual report promised, none of it connected to a coherent methodology that could actually produce findings.
Nine Months of Silence
The plan promised that NIH would “post annual updates of evaluation progress and findings online, sharing widely with public audiences.” It promised dissemination at “local, regional, and national medical and public health professional meetings.” Nine months after publication, none of this exists.
The silence is consistent with what we would expect from a plan that was never operationally serious. It was a political document. An evaluation cannot be conducted using frameworks that do not apply to the thing being evaluated. The absence of a report is not the primary evidence of failure. It is the predictable consequence of a methodology that could never have worked.
There is also a structural irony worth naming. One of the plan’s nine tenets is transparency. The plan itself claimed that “public input and accountability are embedded throughout NIH processes.” The commitment to post annual evaluation findings publicly was, by the plan’s own logic, a transparency commitment. A document that cannot meet its own stated standards for transparency is a document that deserves scrutiny on all its other claims as well.
What a Real Evaluation Would Require
This is not a counsel of impossibility. Evaluating whether a federal science agency is living up to its commitments is hard, but it is not methodologically mysterious.
Appropriate frameworks would include program logic models and theory of change approaches that clearly specify inputs, activities, outputs, and outcomes. The NIH already operates under the Government Performance and Results Act, which has its own performance measurement infrastructure that could be adapted here. Scientometric and bibliometric methods, which are well-established in research evaluation, could track preregistration rates, data sharing compliance, replication study outcomes, and retraction rates across NIH-funded research over time. These methods are not exotic. They are what research evaluation actually looks like.
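To show how ordinary this kind of tracking is, here is a rough Python sketch of one such indicator series. Everything in it is an assumption made for illustration: the `Study` record, its fields, and the four toy records are hypothetical, not NIH data or any existing NIH system. It simply computes, for each year, the share of funded studies that were preregistered and the share that shared data.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Study(NamedTuple):
    """Hypothetical record for one funded study (illustrative fields only)."""
    year: int
    preregistered: bool
    data_shared: bool


def yearly_rates(studies: Iterable[Study]) -> Dict[int, Dict[str, float]]:
    """Return per-year study counts, preregistration rates, and data sharing rates."""
    counts = defaultdict(lambda: {"n": 0, "preregistered": 0, "data_shared": 0})
    for study in studies:
        row = counts[study.year]
        row["n"] += 1
        row["preregistered"] += int(study.preregistered)
        row["data_shared"] += int(study.data_shared)

    # Every year present in `counts` has n >= 1, so the divisions are safe.
    return {
        year: {
            "n": row["n"],
            "preregistration_rate": row["preregistered"] / row["n"],
            "data_sharing_rate": row["data_shared"] / row["n"],
        }
        for year, row in sorted(counts.items())
    }


if __name__ == "__main__":
    # Toy records, invented for this example.
    toy = [
        Study(2023, True, False),
        Study(2023, False, False),
        Study(2024, True, True),
        Study(2024, True, False),
    ]
    for year, rates in yearly_rates(toy).items():
        print(year, rates)
```

A real evaluation would need an agreed definition of each indicator and a reliable data source behind every field, but the arithmetic itself is this simple.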
The choice to invoke implementation science frameworks instead is therefore conspicuous. It is not the choice of a team that surveyed the methodological landscape and selected the best tools. It reads like the choice of a team that needed the section to appear rigorous and found credentialed-sounding names to paste into the document. Their evaluation strategy was word salad.
Why This Matters
I have been documenting the decline of the NIH on this Substack because I believe the erosion of this institution has consequences that will unfold over years and decades, long after the current political moment passes. The “Gold Standard Science” plan has attracted criticism on the most obvious grounds: it was issued by an administration that has cancelled grants for political reasons, suppressed CDC findings on vaccine effectiveness, and systematically shut down entire programs of science it finds ideologically inconvenient. Those criticisms are valid and important.
But the evaluation section of this plan deserves its own attention, for a different reason. It is a case study in how bureaucratic documents can be constructed to look serious while committing to nothing.
This is not how well-intentioned bureaucracies stumble. Bureaucracies that stumble produce flawed evaluations, belated reports, metrics that miss the mark. What this NIH plan produced is something more insidious: a document structured to look like accountability while making accountability impossible. Authoritarian-style governance has long understood that the appearance of scientific rigor is more politically useful than the rigor itself. You cite the frameworks, you name the phases, you promise the annual report, and you collect the credibility without ever submitting to the scrutiny. Evaluation theater does not produce findings that can embarrass anyone. That is not a bug. It is the design.
These are not people who are serious about science. I implore the public: stop treating Jay Bhattacharya as if he is a worthy holder of the NIH Director’s title.
References
Chu, F. (2024). Implementation science: Why should we care? Journal of the Medical Library Association, 112(3), 281-285. https://doi.org/10.5195/jmla.2024.1919
Damschroder, L. J., Reardon, C. M., Opra Widerquist, M. A., & Lowery, J. (2022). The updated Consolidated Framework for Implementation Research based on user feedback. Implementation Science, 17(1), Article 75. https://doi.org/10.1186/s13012-022-01245-0
Exec. Order No. 14,303, 90 Fed. Reg. 22,601 (May 29, 2025). Restoring gold standard science. https://www.federalregister.gov/documents/2025/05/29/2025-09802/restoring-gold-standard-science
Glasgow, R. E., Vogt, T. M., & Boles, S. M. (1999). Evaluating the public health impact of health promotion interventions: The RE-AIM framework. American Journal of Public Health, 89(9), 1322-1327. https://doi.org/10.2105/ajph.89.9.1322
National Institutes of Health. (2025, August 22). Leading in gold standard science: An NIH implementation plan. U.S. Department of Health and Human Services. https://www.nih.gov/sites/default/files/2025-08/2025-gss.pdf
Proctor, E., Silmere, H., Raghavan, R., Hovmand, P., Aarons, G., Bunger, A., Griffey, R., & Hensley, M. (2011). Outcomes for implementation research: Conceptual distinctions, measurement challenges, and research agenda. Administration and Policy in Mental Health and Mental Health Services Research, 38(2), 65-76. https://doi.org/10.1007/s10488-010-0319-7
This essay is part of an ongoing series reflecting on what I learned over more than two decades working inside the U.S. biomedical research enterprise. Each piece stands alone, but together they examine how science is shaped not only by ideas and funding, but by the structures that support or constrain them.


I promoted and conducted program evaluation at NIH for 22 years and, prior to that, did evaluation for 14 years at the request of Congress at the US Government Accountability Office (GAO). How soon we forget the Evidence Act, which was passed in 2018 and was designed to strengthen evaluation across HHS and the federal government overall. Apparently this is being ignored!
http://aspe.hhs.gov/topics/data/evidence-act-0
Liz, this is a great post. It’s a complete misuse of the concept of implementation science. However, I suppose there is one way they could use it: first they mandate that the NIH ICs implement programs that, e.g., fund reproducibility, and set aside funds for these programs. Then they show that the implementation succeeded.