Wednesday, 11 September 2013

Methodologies for Evaluating the Validity of Scores Produced by CEFL’s Equivalency Exams


I created CEFL’s only Mathematics 030-1 Equivalency Exam based on an exhaustive analysis of the Mathematics 30-1 Program of Studies, the Mathematics 30-1 Assessment Standards and Exemplars, and item-writing workshops with the exam managers hosted by the CRCPD. The exam was first used in December 2012. Out of sheer curiosity, and building on experience gained from other validity studies I’ve done, I performed my own validity study of the Mathematics 30-1 Equivalency Final Exam.

First, let’s consider reliability. Reliability is the degree of consistency of measurements; it follows that the higher the reliability, the lower the random measurement error. “The concern with reliability in educational and psychological measurements is paramount because of the difficulty of producing consistent measures of achievement” (Violato et al., 1992). The reliability statistic, called a reliability coefficient, rxx, has a maximum value of 1, which indicates a test that produces scores with no measurement error. No such test exists, as all tests are afflicted with at least a small amount of random error. Violato et al. (1992) provide some general guidelines for evaluating the reliabilities of teacher-made tests: 0.50 or greater is acceptable, and 0.70 or greater is excellent. Tests produced by professional testing agencies should have reliabilities of at least 0.80, which means that 80% of the variation in observed scores is caused by real differences in ability and 20% is caused by errors in measurement. Essentially, 20% of the variation in the test-takers’ scores is just noise and tells us nothing about students’ relative abilities.

The easiest method I’ve found for estimating the reliability of dichotomously scored tests (each item is marked as either completely correct or completely wrong) is Kuder-Richardson Formula 21 (KR-21):

$$r_{xx} = \frac{k}{k-1}\left(1 - \frac{\bar{X}\,(k-\bar{X})}{k\,s^2}\right)$$ (Violato et al., 1992),

where $k$ is the number of items, $\bar{X}$ is the mean raw score, and $s^2$ is the variance of the raw scores. For tests composed of only multiple-choice and numerical-response items, all you need to estimate reliability is the set of test scores; the labour of item analysis is unnecessary. (I also collect students’ scores on the Mathematics 30-1 unit exams to estimate their reliabilities.) Here are the estimated reliability coefficients of the Mathematics 30-1 equivalency final exam and the January 2013 Mathematics 30-1 diploma exam:


The reliability of the Mathematics 30-1 equivalency final exam is very high at 0.87 and compares well to the diploma exam, especially considering that Alberta Education has vastly greater resources. KR-21 is a conservative estimate: it never exceeds the KR-20 estimate, and it falls below it whenever the items differ in difficulty, sometimes by a significant amount. So the reliability of the exam is likely at least 0.87.
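For readers who want to try this on their own score lists, here is a minimal Python sketch of Kuder-Richardson Formula 21 as it is usually stated. The function name `kr21` and the sample scores are made up for illustration, and the choice of population variance is an assumption (some texts use the sample variance instead, which changes the result slightly).

```python
from statistics import mean, pvariance


def kr21(scores, k):
    """Estimate reliability with Kuder-Richardson Formula 21.

    scores: raw total scores, one per test-taker (each item scored 0 or 1)
    k:      number of items on the test
    """
    if k < 2:
        raise ValueError("KR-21 needs at least two items")
    m = mean(scores)
    # Assumption: population variance; swap in statistics.variance
    # if your reference text defines s^2 as the sample variance.
    var = pvariance(scores)
    if var == 0:
        raise ValueError("all scores are identical; variance is zero")
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var))


# Hypothetical raw scores on a 40-item exam (not data from this study).
example_scores = [12, 18, 22, 25, 27, 30, 31, 33, 36, 38]
print(round(kr21(example_scores, k=40), 2))
```

Only the list of total scores and the item count are needed, which is what makes KR-21 convenient when item-level data are not available.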

Validity is “the extent to which certain inferences can be made accurately from--and certain actions should be based on--test scores or other measurement” (Schinka, 2003). Reliability sets the upper limit of validity and is a necessary, but not sufficient, condition for the scores produced by a test to be valid (Schinka, 2003). In the trinitarian view, validity is a three-in-one godhead composed of content validity, criterion-related validity, and construct validity. (See http://jalt.org/test/bro_8.htm for detailed information about the trinitarian and the unitarian notions of validity.) Other types of validity sometimes mentioned are face validity (how much the test is liked by test-users) and social validity (how much the test is liked by test-takers). Though these are considered unscientific types of validity (they are more like minor prophets than gods), they are important political considerations.

One common way of building a validity argument is to compare the exam in question to another exam that has strong validity arguments supporting it: “Criterion-related validity usually includes any validity strategies that focus on the correlation of the test being validated with some well-respected outside measure(s) of the same objectives or specifications” (http://jalt.org/test/bro_8.htm). I assume the diploma exams produce scores with the highest validity of any available assessment of the Mathematics 30-1 curriculum. There are many reasons to think this is so: Alberta Education has vast resources at its disposal, teams of psychometricians dissect each exam for item-performance data from huge samples, extensive field testing is conducted, and the top curriculum experts are at hand for consultation. (Whether the diploma exams ought to be our all-in-all is another question.) Ideally, I would like to have a large group of students write the diploma exam and then write the equivalency final exam the very next day (people forget in different ways at different rates as time passes, so the time between writings should be minimized). This would directly compare the equivalency exam to the diploma exam. Of course, this is impractical. Instead, I decided to consider the Pearson product-moment correlation between the unit exam scores and the diploma exam scores, as well as the correlation between the unit exam scores and the equivalency exam scores. If both correlations are high, then the equivalency final exam likely measures the same ability as the diploma exam. This is an indirect comparison that relies on the unit exam scores as the common basis, but it is the best measure of criterion-related validity currently available.
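As a rough illustration of the Pearson product-moment correlation mentioned above, here is a small Python sketch. The function name `pearson_r` and the paired marks are hypothetical; a spreadsheet or statistics package would give the same number.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    mx, my = mean(x), mean(y)
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the lists has no variation")
    return sxy / sqrt(sxx * syy)


# Hypothetical paired marks (unit exam average, final exam), in percent.
unit_marks = [55, 62, 70, 74, 81, 88, 93]
final_marks = [48, 60, 61, 72, 70, 85, 90]
print(round(pearson_r(unit_marks, final_marks), 2))
```

Each pair must belong to the same student, so students who wrote only one of the two exams have to be left out before computing r.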

The unit exam scores are highly correlated with diploma exam scores (r=0.86, n=19). Corrected for attenuation, r is nearly unity. Correcting for attenuation removes the effect of the unreliability (random error) present in each exam and estimates the true, underlying correlation between the constructs the two exams measure. Furthermore, several students with low class grades did not write the diploma exam. If they had, they would likely have received very low scores, which would have raised the correlation coefficient even further. This lowering of r when part of the score range is missing from the sample is termed “range restriction”; there are ways to compensate for it, but I did not do so for lack of time. An r of nearly 1 means the unit exams and the diploma exams measure almost exactly the same underlying construct, presumably ability in Mathematics 30-1, although the sample size is fairly small. The sample is small because the extranet shows that a surprising number of our students never write the diploma exam. But as Yogi Berra said, “Predictions are hard to make, especially about the future,” so this result is still impressive.

The unit exam scores are also highly correlated with the equivalency final exam scores with r=0.81, n=48. Corrected for attenuation, r=0.95. As was the case with the diploma exams, several students with low class grades did not write the equivalency exam; if they had, the correlation would be even higher than at present. The high correlation means the unit exams and the equivalency exams measure almost exactly the same underlying ability.
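The correction for attenuation used above is commonly computed by dividing the observed correlation by the square root of the product of the two reliabilities. Here is a minimal Python sketch of that standard formula. The function name is made up, and the unit exam reliability of 0.84 is a placeholder chosen only for illustration, since this article does not report that figure.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between true scores.

    r_xy: observed correlation between the two tests
    r_xx: reliability of the first test
    r_yy: reliability of the second test
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must be in the interval (0, 1]")
    return r_xy / sqrt(r_xx * r_yy)


r_observed = 0.81       # unit exams vs. equivalency exam (from this article)
r_equivalency = 0.87    # KR-21 of the equivalency exam (from this article)
r_unit = 0.84           # HYPOTHETICAL placeholder, not a reported value
print(round(correct_for_attenuation(r_observed, r_unit, r_equivalency), 2))
```

Because reliabilities are themselves estimates, a corrected value can come out slightly above 1 in small samples; that is a sign of sampling error, not of a perfect relationship.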

One notable difference between the diploma exam results and the equivalency exam results is that the drop in scores from the class mark to the exam mark is greater for the diploma exam. The average drop from the class mark to the diploma exam mark is 16%, whereas the average drop from the class mark to the equivalency exam mark is 6%. (One could do an analysis of variance using Excel to see if this difference is significant.) This is likely a consequence of the sometimes long delay between finishing the class and writing the diploma exam; an exam manager told me this is observed at other colleges as well. The forgetting curve is probably rather steep for mathematics concepts because overlearning is difficult to accomplish in this subject area (however, I have no evidence of this beyond individual observations).
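The analysis of variance suggested above could also be done outside Excel. Here is a sketch of the one-way ANOVA F statistic in Python. The function name and the per-student drops are invented for illustration and are not the study’s data. Turning F into a p-value needs the F distribution, which a statistics package can supply (for example, SciPy’s `scipy.stats.f_oneway` reports both the statistic and the p-value).

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more scores than groups")
    grand_mean = mean(v for g in groups for v in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical drops (class mark minus exam mark, in percentage points).
diploma_drops = [20, 14, 18, 11, 17, 16]
equivalency_drops = [5, 9, 2, 8, 6, 7, 4]
f_stat, df_b, df_w = one_way_anova_f(diploma_drops, equivalency_drops)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With only two groups, this F statistic equals the square of the equal-variance two-sample t statistic, so a t-test would lead to the same conclusion.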

Assuming the diploma exam is the best available test of ability in Mathematics 30-1, these results support the criterion-related validity of the Mathematics 30-1 equivalency final exam as well as the Mathematics 30-1 form A unit exams. The other aspects of the validity trinity, content validity and construct validity, are supported by careful blueprinting, judicious adherence to item-writing research, and thorough reviews by four subject-matter experts. The concord among these lines of evidence supports the unitary conclusion that the Mathematics 30-1 equivalency exam form A produces valid results from which valid inferences and decisions can be made. The methodologies mentioned in this article apply equally well to any course that has an associated equivalency and diploma exam.

Regards,
Michael Gaschnitz

If you have any questions about anything in this article, please contact me at mgaschnitz@bowvalleycollege.ca.

References
Schinka, J. A., Velicer, W. F., Weiner, I. B., & Freedheim, D. K. (2003). Handbook of Psychology, Assessment Psychology. Wiley.

Violato, C., McDougall, D., & Marini, A. (1992). Educational Measurement and Evaluation. Dubuque: Kendall/Hunt.
