What pre and post tests cannot tell you: A critical review of a widely-used but poorly-understood assessment method

DOI: 10.59350/0jnd1-hg665

By Reda Sadki

This is the first of two articles about assessment, exploring the limitations and misuse of pre and post tests. The second article examines the framework used by The Geneva Learning Foundation to overcome the limitations described here.

The reassuring illusion of the knowledge quiz

Imagine a two-day workshop on menopause for healthcare providers.

Before the first session, participants answer twenty multiple-choice questions about symptoms, diagnosis, and treatment options.

After the last session, they answer the same twenty questions again.

The scores go up.

The average gain is 14 percentage points.

The program manager writes in the final report: “Participants demonstrated significant improvement in knowledge of menopause, confirming the effectiveness of the training.”

It feels like evidence.

It looks like evidence.

The number is real, the instrument was applied consistently, and the change is statistically significant.

For a funder or program officer who has seen this format dozens of times, it reads as rigorous evaluation.

The problem is that this conclusion vastly overstates what the data can support.

The score increase confirms that participants could answer more questions correctly on a particular day.

It does not confirm that the training caused that improvement.

It does not confirm that participants will remember anything in six weeks.

It does not confirm that any of them will practice differently with their patients.

And it does not confirm that patient outcomes will improve.

These are not minor caveats.

They are the entire point of the exercise.

This article explains, in plain terms, what assessment science actually says about pre- and post-tests and why global health and humanitarian professionals who rely on them as their primary measure of impact are, in most cases, measuring the wrong thing, at the wrong time, with an instrument that contains structural biases they may not know about.

What pre and post tests are, formally

In the language of education research, the format described above is known as a one-group pretest-posttest design.

It is classified as a “pre-experimental” design, meaning that it does not meet the standards required for true experimental research.

The pre-test serves as a diagnostic or baseline assessment.

The post-test serves as a summative achievement test.

The difference between the two is interpreted as a measure of learning gain.

This design is the most common form of program evaluation in global health and humanitarian training.

It is administratively simple, inexpensive, and produces numbers that are easy to interpret and communicate.

These practical advantages explain its persistence, but they do not address its fundamental limitations as evidence of impact.

The causation problem

The most basic limitation of the one-group pretest-posttest design is that it cannot establish that the training caused the change in scores.

As Knapp (2016) demonstrated in a widely cited analysis, “all one can say when using a one-group pretest-posttest design is that a change has occurred, but not that an intervention caused it.” This is not a minor methodological footnote.

It is the design’s central weakness.

Researchers use the term “threats to internal validity” for factors other than the intervention that could explain an observed change. Campbell and Stanley (1963), whose taxonomic work on research design remains definitive more than sixty years later, identified the following threats that apply directly to the pre- and post-test format:

  • History: Other events occurred between the pre-test and post-test that could account for the change. A participant who attended the workshop may also have read an article, had a relevant clinical encounter, or discussed the topic with a colleague during the same period.
  • Maturation: Participants naturally develop and mature over time, independent of any training. In short workshops this effect is small, but in programs that span weeks or months it can be substantial.
  • Testing effects: The act of taking the pre-test itself is an educational intervention. Participants who do not know the answer to a question during the pre-test often look it up afterward, or simply remember the question and its answer by the time the post-test arrives.
  • Instrumentation: The test-taking context, the format of questions, or the conditions of administration may differ between pre- and post-test in ways that affect scores independently of learning.
  • Regression to the mean: If participants were selected because they performed unusually poorly on a baseline measure, their scores will tend to improve on subsequent measurement regardless of any intervention (a simulation sketch follows this list).
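
To make the last threat concrete, here is a minimal simulation sketch, using entirely hypothetical numbers: participants are selected for unusually low baseline scores, receive no intervention at all, and still appear to improve on retest.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1000
true_knowledge = rng.normal(60, 10, n)  # stable underlying knowledge (percent), never changes
measure = lambda: true_knowledge + rng.normal(0, 8, n)  # any single quiz adds measurement noise

pre = measure()    # baseline quiz
post = measure()   # follow-up quiz, with NO training in between

# Select the participants who scored lowest at baseline, as programs often do
selected = pre < np.percentile(pre, 25)

print(f"selected group, pre-test mean:  {pre[selected].mean():.1f}")
print(f"selected group, post-test mean: {post[selected].mean():.1f}")
# The post-test mean is several points higher even though nothing changed:
# the group was partly selected for unlucky measurement error, which does not repeat.
```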

All of these threats are present to varying degrees in any field-based training evaluation.

None of them can be controlled in a one-group pretest-posttest design.

This does not mean that the score change is zero.

It means that the score change cannot be attributed to the training.
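
To see how narrow the “statistically significant” claim really is, consider a minimal sketch with hypothetical scores roughly matching the 14-point gain from the opening example. The paired t-test below is the kind of calculation behind such claims; nothing in it speaks to causation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 40                               # workshop participants (hypothetical)
pre = rng.normal(55, 12, n)          # pre-test scores, in percent
post = pre + rng.normal(14, 10, n)   # scores rise by about 14 points on average

t_stat, p_value = stats.ttest_rel(post, pre)   # paired t-test on the same individuals
print(f"mean gain = {np.mean(post - pre):.1f} points, t = {t_stat:.2f}, p = {p_value:.4f}")

# The p-value answers one narrow question: is the mean change unlikely to be zero?
# It says nothing about WHY scores changed. Testing effects, history, maturation,
# or regression to the mean could each produce the same significant result.
```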

The response shift problem

Beyond the causation problem, pre- and post-tests face a measurement validity problem that is particularly acute in professional development contexts: response shift bias.

Response shift bias occurs when the training itself changes the internal standard that participants use to evaluate their own knowledge or competence.

Consider a community health worker who rates her understanding of malaria prevention as “4 out of 5” before attending a training.

She is not lying.

She genuinely believes she understands malaria prevention well, based on her current conception of what that entails.

After a rigorous training program, she now has a much more sophisticated understanding of what competence in malaria prevention actually requires.

Her post-test self-rating is “3 out of 5,” not because she has learned less, but because the training has raised her awareness of how complex the subject is and how much more she still needs to learn.

The pre-test measured one construct.

The post-test measured a different construct.

The comparison between them is therefore not valid.

This effect is not a curiosity of psychology research.

It is pervasive in exactly the settings where pre- and post-tests are most commonly used.

Experienced practitioners who are confident in their work are particularly susceptible to overestimating their baseline competence in domains they have not studied formally.

Well-designed training programs that genuinely shift understanding are precisely the ones most likely to trigger response shift bias in evaluation scores.

This is the reason that some evaluators have moved toward retrospective pretest designs, in which participants are asked, after training, to estimate where they were before the training.

But this approach adds complexity, is rarely implemented in field settings, and has its own validity concerns.

A related phenomenon, sometimes described through the Dunning-Kruger framework, compounds the problem: before training, participants may feel confident because they do not yet know what they do not know.

After training, their confidence may actually decrease as they become aware of the complexity they had previously missed.

A knowledge test administered at these two different points in a learner’s journey can paradoxically show smaller gains, or even apparent declines, for programs that are working exactly as intended.
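
A small simulation sketch, with made-up numbers, illustrates the mechanism: every participant genuinely improves, but because the internal standard also rises, the traditional pre/post self-rating comparison shows almost no change, while a retrospective pre-rating judged against the new standard recovers the gain. The rating rule and parameters below are illustrative assumptions, not an empirical model.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Hypothetical quantities on a 0-100 scale
competence_before = rng.normal(40, 8, n)
competence_after = competence_before + rng.normal(20, 5, n)  # everyone genuinely improves

standard_before = rng.normal(50, 5, n)  # lenient internal standard before training
standard_after = standard_before + 25   # training raises awareness of what competence requires

# Self-rating: perceived competence relative to the rater's CURRENT internal standard (1-5 scale)
rate = lambda comp, std: np.clip(5 * comp / std, 1, 5)

pre_self = rate(competence_before, standard_before)
post_self = rate(competence_after, standard_after)
retro_pre = rate(competence_before, standard_after)  # "where was I before?", judged by the new standard

print(f"traditional self-rated change:   {np.mean(post_self - pre_self):+.2f}")
print(f"retrospective self-rated change: {np.mean(post_self - retro_pre):+.2f}")
# Real learning occurred, but the shifting standard can hide it in the traditional comparison.
```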

The recall problem

Even setting aside validity threats and response shift bias, there is a deeper conceptual problem with pre- and post-tests focused on knowledge recall: they measure the lowest levels of cognitive activity.

Benjamin Bloom’s taxonomy of educational objectives, developed in 1956 and revised in 2001, provides a widely used framework for classifying cognitive tasks.

The taxonomy describes six levels, from most basic to most complex: remembering, understanding, applying, analyzing, evaluating, and creating.

Pre- and post-tests consisting of multiple-choice or short-answer questions almost exclusively assess the two lowest levels, remembering and understanding.

Research confirms that this is the case even when test designers intend otherwise.

Yet the professional outcomes that global health and humanitarian programs are designed to achieve, such as problem-solving in complex field situations, judgment under uncertainty, coordination with colleagues, adaptation of guidelines to local context, and leadership for change, sit at the top of the taxonomy.

Analyzing, evaluating, and creating are the cognitive activities through which professionals make a difference.

A test of individual recall cannot measure whether a participant can do any of these things.

This produces a persistent and widely observed mismatch in program evaluation: learning outcomes are stated in terms of what participants will be able to do, but assessments measure only what they know at a given moment.

The program may be genuinely achieving its intended outcomes while the evaluation instrument is measuring something unrelated to them.

The transfer problem

Research on transfer of learning, accumulated over decades across organizational, educational, and health professions settings, converges on a finding that should be sobering for anyone who relies on post-tests as evidence of impact: knowledge gained in training transfers to workplace practice at an extremely low rate.

Multiple reviews of transfer of learning research suggest that roughly 10 to 12 percent of what is taught in training is subsequently applied on the job in any sustained way.

The gap between what is learned in a classroom or online course and what is actually practiced months later is not a minor implementation challenge.

It is the norm, not the exception.

This finding is reflected in the Kirkpatrick model of training evaluation, which has been the dominant framework in organizational learning since its development in the late 1950s.

(This model has been challenged and, to some extent, discredited. We nevertheless refer to it here because it is widely known.)

The model identifies four levels: reaction (did participants like the training?), learning (did knowledge scores improve?), behavior (did participants change how they work?), and results (did outcomes improve at the organizational or community level?).

Pre- and post-tests operate entirely at Level 2.

They measure learning in the narrow sense of knowledge acquisition.

They provide no information about Level 3 or Level 4 outcomes, which are precisely the levels at which the impact of global health and humanitarian training is supposed to materialize.

Behavior change typically takes three to six months to become observable in practice.

Measuring knowledge retention two weeks after a workshop produces data that has very limited predictive value for whether practice will change at all.

This does not mean that knowledge is irrelevant.

Of course practitioners need to know things.

But knowing something is neither sufficient nor reliably necessary for changing behavior.

The behavioral sciences have made this clear across a wide range of domains: knowing that smoking is harmful does not predict cessation.

Knowing the correct handwashing procedure does not predict compliance.

Knowing vaccine safety evidence does not predict a practitioner’s ability to address vaccine hesitancy in their community.

Knowledge is often a component of behavior change, but it is rarely its cause.

The timing problem

Pre- and post-tests are almost always administered at the worst possible moment for measuring complex learning outcomes: immediately before and immediately after the training event.

This is the moment of maximum knowledge accessibility in short-term memory.

It is also the moment furthest in time from the professional context in which learning is supposed to make a difference.

Genuine behavioral change in a professional context requires time: time to return to the work setting, encounter situations that activate the new knowledge, attempt to apply it, receive feedback from colleagues and patients, refine the approach, fail and try again, and gradually integrate new capabilities into professional practice.

This process takes weeks and months, not hours.

An evaluation instrument applied immediately after training captures something, but what it captures is closer to a snapshot of short-term memory under favorable conditions than a measure of enduring professional capability.

The research literature on retention of medical knowledge, for instance, shows that significant decay begins within days of training and continues substantially over the following weeks unless the knowledge is actively used.
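
As an illustration only, the sketch below uses a simple exponential forgetting-curve model in the spirit of Ebbinghaus, with an arbitrary stability parameter, to show how different the picture looks at the end of a workshop versus weeks or months later. The numbers are not empirical estimates.

```python
import math

def retention(days, stability=10.0):
    """Fraction of newly acquired material still recallable after `days`,
    under a simple exponential forgetting model R = exp(-t / stability).
    `stability` grows when knowledge is actively used; 10 days is an
    arbitrary illustrative value, not an empirical estimate."""
    return math.exp(-days / stability)

for label, days in [("end of workshop", 0), ("one week later", 7),
                    ("six weeks later", 42), ("six months later", 180)]:
    print(f"{label:>18}: {retention(days):5.1%} retained")

# A post-test at day 0 measures near-peak accessibility; the same instrument at
# six weeks or six months would tell a very different story unless the knowledge
# has been used and reinforced in practice.
```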

The pre-test sensitization problem

A final structural problem that is rarely discussed in global health evaluation contexts is pre-test sensitization: the act of administering a pre-test can change how participants respond to the training itself.

When participants take a knowledge quiz before a training, they become primed to attend to specific content.

They notice which questions they could not answer and pay particular attention to relevant content during the training.

They may also discuss questions with fellow participants between the pre-test and the training, especially in small-group settings.

The result is that participants in a pre-tested group may perform better on the post-test not because the training was more effective, but because the pre-test directed their attention in ways that would not occur in the absence of testing.

This effect is particularly important in settings where the same training is delivered repeatedly to successive cohorts.

If pre-test sensitization is operating, the evaluation instrument is not simply measuring the effect of the training.

It is becoming part of the training and progressively inflating apparent gains across cohorts.
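
A minimal sketch, again with hypothetical numbers, shows how this inflation works: if the pre-test adds even a modest attention-priming effect on the tested items, a pre-tested cohort will outscore an otherwise identical cohort on the post-test, and a one-group design will silently fold that difference into the apparent training effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n_items, n_people = 20, 60

baseline_p = 0.45      # probability of answering an item correctly before training (hypothetical)
training_lift = 0.15   # hypothetical true effect of the training on every item
priming_lift = 0.10    # extra lift on items the pre-test drew attention to (hypothetical)

def post_test_mean(pre_tested: bool) -> float:
    """Average post-test score for a cohort, with or without a prior pre-test."""
    p = baseline_p + training_lift + (priming_lift if pre_tested else 0.0)
    scores = rng.binomial(n_items, p, n_people) / n_items
    return scores.mean()

print(f"post-test mean, cohort WITH pre-test:    {post_test_mean(True):.1%}")
print(f"post-test mean, cohort WITHOUT pre-test: {post_test_mean(False):.1%}")
# The gap between the two cohorts is attributable to the pre-test itself,
# not to the training; a one-group design has no way to see this.
```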

What pre and post tests can reasonably do

This critique should not be taken as a blanket dismissal of all uses of pre- and post-tests.

The format has legitimate uses when applied with realistic expectations.

Pre-tests can function as diagnostic instruments to identify what participants already know and do not know, allowing facilitators to calibrate the training accordingly.

They can motivate participants by surfacing knowledge gaps and priming attention before a program begins.

And in contexts where the goal is genuinely limited to assessing knowledge acquisition on a well-defined body of factual content, such as checking whether participants have understood updated clinical guidelines, a post-test alone, administered to a reasonably large sample, can provide useful signal at relatively low cost.
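
As a minimal sketch of this diagnostic use, the following tabulates per-topic correct rates from hypothetical pre-test responses so a facilitator can see where to spend workshop time. The topics and data are invented for illustration.

```python
from collections import Counter

# Hypothetical pre-test responses: one list of (topic, answered_correctly) per participant
responses = [
    [("symptoms", True),  ("diagnosis", False), ("treatment", False)],
    [("symptoms", True),  ("diagnosis", True),  ("treatment", False)],
    [("symptoms", False), ("diagnosis", False), ("treatment", False)],
]

correct, total = Counter(), Counter()
for participant in responses:
    for topic, is_correct in participant:
        total[topic] += 1
        correct[topic] += is_correct

for topic in total:
    print(f"{topic:>10}: {correct[topic] / total[topic]:.0%} correct at baseline")

# Low-scoring topics (here, treatment) tell the facilitator where to spend workshop time.
# Used this way, the pre-test is a planning tool, not a claim about impact.
```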

The problem arises not from using pre- and post-tests but from treating them as evidence of impact in the broader sense.

When a pre- and post-test is used to claim that a training program is effective, that it changes professional behavior, that it leads to better health outcomes, or that it justifies continued investment at scale, the instrument is being asked to support conclusions it is structurally incapable of supporting.

The alternative framing

If pre- and post-tests measure the wrong thing, what should be measured instead?

The research literature on assessment and learning evaluation points toward several principles.

Evidence of complex learning should be collected close to the professional context in which learning is supposed to make a difference, not in the artificial conditions of a test environment.

It should capture multiple dimensions of capability, not only knowledge recall.

It should be collected over time, not at a single point immediately after training.

It should draw on multiple sources, including the learner, their peers, their supervisors, and observation of their work.

And it should be designed to capture behavior change and its consequences, not only knowledge acquisition as a proxy for those consequences.

These principles are harder to operationalize than a standardized knowledge quiz.

They require more design effort, more data collection infrastructure, and more analytical sophistication.

They may not produce the single clean number that a funder can graph over time.

But they produce something much more valuable: credible evidence about whether professional learning is translating into the changed practice and improved outcomes that the learning was designed to produce.

The second article in this series examines how The Geneva Learning Foundation has operationalized exactly these principles through its value creation measurement framework, and what that framework reveals about the true depth and reach of learning impact that pre- and post-tests cannot see.

References

Assessment design and validity

Campbell, D.T. and Stanley, J.C. (1963) Experimental and quasi-experimental designs for research. Chicago: Rand McNally. https://davidpassmore.net/courses/databook/CampandStanley.pdf

Shadish, W.R., Cook, T.D. and Campbell, D.T. (2002) Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin. ISBN: 0-395-61556-9. 

Knapp, T.R. (2016) ‘Why is the one-group pretest-posttest design still used?’, Clinical Nursing Research, 25(5), pp. 467-472. DOI: https://doi.org/10.1177/1054773816666280

Glenn, B.A., Bastani, R. and Maxwell, A.E. (2013) ‘The perils of ignoring design effects in experimental studies: lessons from a mammography screening trial’, Psychology and Health, 28(5), pp. 593-602. DOI: https://doi.org/10.1080/08870446.2012.756880

Response shift bias

Howard, G.S. (1980) ‘Response-shift bias: a problem in evaluating interventions with pre/post self-reports’, Evaluation Review, 4(1), pp. 93-106. DOI: https://doi.org/10.1177/0193841X8000400105

Drennan, J. and Hyde, A. (2008) ‘Controlling response shift bias: the use of the retrospective pre-test design in the evaluation of a master’s programme’, Assessment and Evaluation in Higher Education, 33(6), pp. 699-709. DOI: https://doi.org/10.1080/02602930701773026

Little, T.D., Chang, R., Gorrall, B.K., Waggenspack, L., Fukuda, E., Allen, P.J. and Noam, G.G. (2020) ‘The retrospective pretest-posttest design redux: on its validity as an alternative to traditional pretest-posttest measurement’, International Journal of Behavioral Development, 44(2), pp. 175-183. DOI: https://doi.org/10.1177/0165025419877973

Fernandez-Castilla, B., Declercq, L., Jamshidi, L., Beretvas, S.N., Onghena, P. and Van den Noortgate, W. (2017) ‘Evaluating intervention programs with a pretest-posttest design: a meta-analytic approach’, Frontiers in Psychology, 8, p. 341. DOI: https://doi.org/10.3389/fpsyg.2017.00341

Cognitive levels and assessment

Anderson, L.W. and Krathwohl, D.R. (eds.) (2001) A taxonomy for learning, teaching and assessing: a revision of Bloom’s taxonomy of educational objectives. New York: Longman.

Zaidi, N.L.B., Hwang, C., Festival, J., Cheung, J.J.H., Gomez-Garibello, C. and Tuchman, B. (2021) ‘Examining Bloom’s taxonomy in multiple choice questions: a U.S. medical licensing examination perspective’, Medical Science Educator, 31(4), pp. 1437-1442. DOI: https://doi.org/10.1007/s40670-021-01375-0

Transfer of learning and the Kirkpatrick model

Kirkpatrick, D.L. and Kirkpatrick, J.D. (2006) Evaluating training programs: the four levels. 3rd edn. San Francisco: Berrett-Koehler.

Baldwin, T.T. and Ford, J.K. (1988) ‘Transfer of training: a review and directions for future research’, Personnel Psychology, 41(1), pp. 63-105. DOI: https://doi.org/10.1111/j.1744-6570.1988.tb00632.x

Connectivism and new learning

Siemens, G. (2005) ‘Connectivism: a learning theory for the digital age’, International Journal of Instructional Technology and Distance Learning, 2(1), pp. 3-10. Available at: https://www.ceebl.manchester.ac.uk/events/archive/aligningcollaborativelearning/Siemens.pdf

Cope, B. and Kalantzis, M. (eds.) (2017) E-learning ecologies: principles for new learning and assessment. New York: Routledge. DOI: https://doi.org/10.4324/9781315639215

Cope, B. and Kalantzis, M. (eds.) (2015) A pedagogy of multiliteracies: learning by design. Palgrave Macmillan. DOI: https://doi.org/10.1057/9781137539724

How to cite this article

As the primary source for this original work, this article is permanently archived with a DOI to meet rigorous standards of verification in the scholarly record. Please cite this stable reference to ensure ethical attribution of the theoretical concepts to their origin.

Reda Sadki (2026). What pre and post tests cannot tell you: A critical review of a widely-used but poorly-understood assessment method. Reda Sadki: Learning to make a difference. https://doi.org/10.59350/0jnd1-hg665
