Considerable effort is required to produce quality assessments for medical education. Expertise is required, not only in subject content, but also in exam construction and delivery. For multiple reasons, assessments are not always the best that they could be.
Assessments may be reliable, valid, both, or neither. Reliability describes reproducibility: do examination questions yield similar results each time they are administered and across different groups of students? Validity describes appropriateness: do examination questions measure knowledge and reasoning ability, thereby providing a measure of meaningful achievement?
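As an illustrative sketch (not drawn from this article), the reproducibility of a dichotomously scored multiple-choice examination is often estimated with the Kuder-Richardson formula 20 (KR-20), an internal-consistency coefficient; the function and sample data below are hypothetical.

```python
# Hypothetical sketch: estimating internal-consistency reliability (KR-20)
# for an exam scored 0/1 per item. Data are illustrative.

def kr20(scores):
    """scores: one list of 0/1 item scores per student."""
    n_items = len(scores[0])
    n_students = len(scores)
    # Proportion of students answering each item correctly (p) and incorrectly (q).
    p = [sum(s[i] for s in scores) / n_students for i in range(n_items)]
    q = [1 - pi for pi in p]
    # Population variance of the total scores.
    totals = [sum(s) for s in scores]
    mean = sum(totals) / n_students
    var = sum((t - mean) ** 2 for t in totals) / n_students
    sum_pq = sum(pi * qi for pi, qi in zip(p, q))
    return (n_items / (n_items - 1)) * (1 - sum_pq / var)

# Example: four students, three items.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(scores), 3))  # prints 0.75
```

Values near 1.0 indicate that the items rank students consistently; low values suggest noisy or heterogeneous items.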
There are two major categories of flaws in examinations, particularly examinations in a multiple-choice format:
Construct-irrelevant variance
Construct underrepresentation
Construct Irrelevant Variance
Construct-irrelevant variance (CIV) is the introduction of extraneous, uncontrolled variables that affect assessment outcomes. The meaningfulness and accuracy of examination results are adversely affected, the legitimacy of decisions based on exam results is undermined, and validity is reduced. Sources of CIV include:
Poorly constructed examination questions
Indefensible passing score
When examinations contain flawed items, 'noise' is introduced in the form of badly worded, misleading, and confusing questions that make it more difficult for students to answer correctly, even when they have mastered the content domain of the question. Flawed items are more likely to produce 'false negatives': students who fail the examination but should not have failed. Conversely, the testwise student can exploit flaws in the structure of questions to arrive at the correct answer without knowing the content on which the question is based. Flawed questions also encourage guessing, which introduces randomness and increases the likelihood that chance alone yields the correct answer. Item bias is detected by differential item functioning analysis, which seeks to identify flawed test items that favor one group of students over another. Setting the passing standard requires judgment regarding the level of achievement required and should not be arbitrary. Academic institutions must also take care to create testing environments that preclude irregularities such as cheating.
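One common screen for differential item functioning is the Mantel-Haenszel procedure: students are matched on total score, and an item is flagged when one group outperforms the other within the matched strata. The sketch below is a hypothetical illustration; the data, group labels, and flagging threshold are not from this article.

```python
# Hypothetical sketch: Mantel-Haenszel common odds ratio for screening an
# item for differential item functioning (DIF). Data are illustrative.
from collections import defaultdict

def mh_odds_ratio(records):
    """records: (total_score, group, correct) tuples; group is 'ref' or 'focal'."""
    # Stratify students by total score; count [correct, incorrect] per group.
    strata = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for total, group, correct in records:
        strata[total][group][0 if correct else 1] += 1
    num = den = 0.0
    for cell in strata.values():
        a, b = cell["ref"]    # reference group: correct, incorrect
        c, d = cell["focal"]  # focal group: correct, incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den if den else float("inf")

# In one score stratum, 8 of 10 reference students but only 5 of 10 focal
# students answer correctly; an odds ratio far from 1.0 flags the item.
recs = ([(10, "ref", 1)] * 8 + [(10, "ref", 0)] * 2
        + [(10, "focal", 1)] * 5 + [(10, "focal", 0)] * 5)
print(mh_odds_ratio(recs))  # prints 4.0
```

An odds ratio near 1.0 suggests the item behaves similarly for both groups after matching on ability; values far from 1.0 warrant item review, not automatic deletion.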
Construct underrepresentation (CU) occurs when the examination lacks validity because its content does not adequately reflect the relevant knowledge domain. Examples of construct underrepresentation include:
Rote memorization for factual recall
Few examination items
Maldistribution of examination items
Teaching to the test
Trivial content is unimportant for future learning or the care of patients. Examination items at a low cognitive level require only rote memorization of isolated facts and may not reflect the integrated knowledge that supports clinical reasoning and problem solving in the care of patients with real medical problems. Maldistribution of examination items leads to oversampling of some content areas and undersampling of others. Too few examination items lead to inadequate sampling of the learning content in the desired achievement domain, and the reliability of the examination suffers as well. An examination of sufficient length provides a fairer, more accurate, and more reliable sample of important knowledge. Teaching to the test yields scores that inaccurately reflect the knowledge domain. In summary, CU can be overcome when sufficient examination items require higher-order cognition to solve clinically relevant problems.
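The link between examination length and reliability noted above is conventionally quantified by the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened with comparable items. This is a minimal sketch with illustrative numbers, not an analysis from this article.

```python
# Hypothetical sketch: Spearman-Brown prophecy formula, illustrating how
# lengthening an exam with comparable items raises reliability.

def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)

# Doubling an exam whose reliability is 0.60: 2(0.60) / (1 + 0.60) = 0.75.
print(round(spearman_brown(0.60, 2), 2))  # prints 0.75
```

The gain diminishes as reliability approaches 1.0, so length alone cannot rescue an exam built from flawed or trivial items.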
The students most likely to be affected by CIV and CU are those whose performance is marginal, near the lowest passing level. Hence, improving performance above this level helps prevent the 'false negative' effect of poor examinations.
Students should exhibit professional behavior by seeking appropriate knowledge content for use at higher cognitive levels in clinical problem solving. Even if a test does not reflect the student's true ability, the student moves on to higher levels where the value of such learning can be demonstrated.