Monday, 16 September 2013

Item Writing Guide for Math and Science from Alberta Education

An instructor asked me about a guidebook/primer produced by Alberta Education that is used to train their item writers. I often refer to this guidebook when writing math and science items: on the department drive, navigate to “STAFF FOLDERS\Michael Gaschnitz\OTSP” to find Alberta Education’s Writing Effective Machine-Scored Questions for Mathematics & Science.

This 84-page document is divided into several helpful sections:

In Why Use Multiple-Choice Questions and Why Use Numerical Response Questions, the situations best-suited to each item type are addressed. Positive features and cautions are listed for each type.

In Test Wiseness Quiz, a quiz is provided that demonstrates several items that can be answered by testwise students who may actually be ignorant of the concept the item is trying to test. We want our tests to be as testwiseness-proof as possible in order to maintain the validity of scores. Otherwise, we end up testing a kind of intelligence or street-smarts instead of the relevant content area.

In Guidelines for Writing Multiple-Choice Questions and Guidelines for Writing Numerical-Response Questions, 25 or so specific guidelines are provided, such as “do not use 5 as the deciding factor for rounding.” Most of them also appear in the broader item-writing research and literature.

In Commonly Used Question Formats and Stems, stem templates are provided. This allows for what is called “item modelling,” a strategy item writers use to help prevent the deeply dreaded state of item-writer’s block. If item writers are stumped when trying to create an item for a particular concept, they can look to these templates for some inspiration.

In Blueprints for Tables of Test Specifications, test blueprinting guidelines are provided. One curious bit of advice states that “In general, 40% to 50% of the marks should be targeted for the ‘acceptable standard’ and 25% for the ‘standard of excellence’ with the rest in between.” The remaining marks, roughly 25% to 35%, fall into a category described in the Chemistry 30 Information bulletin as the “intermediate standard.” However, no information bulletin that I know of clearly describes the intermediate standard.

In Example: Unit Test Blueprints, a Physics 30 blueprint is provided that categorizes items as easy, medium or hard. I’ve never seen that schema in any information bulletin for any course, but it seems quite intuitive.

In Fix-It Questions, several problematic items are provided and we are asked to identify the problem and improve the item. Answers are provided on p.56.

This document is from the “Writing Effective Machine Scored Questions” session delivered by the science and mathematics diploma exam managers and hosted by the Calgary Regional Consortium (http://www.crcpd.ab.ca/). I highly recommend these sessions, as you can obtain a wealth of information that is never posted at http://education.alberta.ca/. For example, here are a few “rules” for marking numerical-response items that I’ve never seen in writing:

Blanks (columns without an entry) are generally ignored or are assumed zero. Although the instructions say to left-justify all responses, right-justification is also accepted. For example, if the answer is “42,” then any of the following responses are accepted as correct:

4 2 _ _
_ 4 2 _
_ _ 4 2
4 _ _ 2
4 _ 2 _
_ 4 _ 2

If the answer to a numerical response item works out to be 0.776, then any of the following responses are accepted as correct:

0 . 7 8
_ . 7 8
. 7 8 _
. _ 7 8

In Biology 30 and Chemistry 30, truncated answers are also accepted as correct:

0 . 7 7
_ . 7 7
. 7 7 _
. _ 7 7
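
To make these marking rules concrete, here is a minimal sketch in Python of how such a scheme might be automated. It is purely illustrative and is not Alberta Education’s actual scoring procedure; the function names, the blank handling, and the rounding/truncation tolerance are my own assumptions:

```python
# Hypothetical sketch of the numerical-response marking rules described above.
# This is NOT Alberta Education's scoring code; the blank handling and the
# rounding/truncation tolerance are assumptions for illustration only.

import math

def normalize(response):
    """Collapse a response grid by dropping blank columns ('_' or spaces)."""
    return "".join(ch for ch in response if ch not in "_ ")

def accepted_answers(exact, decimals=2, allow_truncation=False):
    """Build the set of accepted numeric strings for an exact answer."""
    accepted = {f"{round(exact, decimals):.{decimals}f}"}  # rounded, e.g. 0.78
    if allow_truncation:  # Biology 30 and Chemistry 30 also accept truncation
        factor = 10 ** decimals
        accepted.add(f"{math.floor(exact * factor) / factor:.{decimals}f}")  # truncated, e.g. 0.77
    return accepted

def is_correct(response, exact, **kwargs):
    """A response is correct if, ignoring blanks, it matches an accepted answer numerically."""
    try:
        value = float(normalize(response))
    except ValueError:
        return False
    return any(math.isclose(value, float(a)) for a in accepted_answers(exact, **kwargs))

# The examples from above:
print(is_correct("0.78_", 0.776))                        # True (rounded)
print(is_correct("._77", 0.776, allow_truncation=True))  # True (truncated, Bio 30 / Chem 30)
print(is_correct("4__2", 42, decimals=0))                # True (blanks ignored, answer 42)
```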

Furthermore, the psychometricians look at the results for a particular numerical response item and may add an additional answer to the key if many stronger students provide a particular answer, and if the answer makes sense. In other words, the key for numerical-response items has to evolve as results come in so that we are fair to students. We cannot blindly mark numerical-response items. Item writers often cannot predict every possible correct answer because numerical-response items are “constructed-response” items--to some extent that is why they are being used in place of written-response items. Some numerical-response items are becoming incredibly complex! Other testing agencies besides Alberta Education adjust their numerical-response keys as well, so this is, apparently, an accepted practice.

Regards, Michael

ELA Instructors Come Together for Course Enhancement – Lorna Malanik

I am thrilled to announce that under the leadership of Lusine Harutyunyan, Acting Media Lead for Curriculum Development, the English Language Arts instructors across all modes of delivery (RTOL, ATOL, Trad, and Flex) are working together on a Video Curation Project. The purpose of the project is to gather instructional videos that enhance student learning in an engaging, concise, and precise format. The videos will focus on a wide variety of topics at each high school grade level relating to the analysis and deeper appreciation of texts and the writer’s craft. Upon completion of the project, instructors and students alike will have an abundance of relevant resources from which to draw. Using YouTube as our primary resource, we are hoping to find videos for all of our 50-ish topics at each of the three levels (grades 10, 11, and 12) by the end of October 2013. Our next step will be to discern which topics do not have appropriate videos and then work on creating our own instructional videos so that all topics have video resources applicable to each grade level. It is exciting to have ELA instructors of all modes of delivery work together on this project; not only will this enrich student learning, it will strengthen our relationships with each other and ultimately lead to further collaboration.

Wednesday, 11 September 2013

Methodologies for Evaluating the Validity of Scores Produced by CEFL’s Equivalency Exams


CEFL’s only Mathematics 30-1 Equivalency Exam was created by me based on an exhaustive analysis of the Mathematics 30-1 Program of Studies, the Mathematics 30-1 Assessment Standards and Exemplars, and item-writing workshops with the exam managers hosted by the CRCPD. The exam was first employed in December 2012. Out of sheer curiosity, and building on experience garnered from other validity studies I’ve done, I performed my own validity study of the Mathematics 30-1 Equivalency Final Exam.

First, let’s consider reliability. Reliability is the degree of consistency of measurements; from this it follows that the higher the reliability, the lower the random measurement error. “The concern with reliability in educational and psychological measurements is paramount because of the difficulty of producing consistent measures of achievement” (Violato et al., 1992). The reliability statistic, called the reliability coefficient, r_xx, has a maximum value of 1, which would indicate a test that produces scores with no measurement error--no such test exists, as all tests are afflicted with at least a small amount of random error. Violato et al. (1992) provide some general guidelines for evaluating the reliabilities of teacher-made tests: 0.50 or greater is acceptable, and 0.70 or greater is excellent. Tests produced by professional testing agencies should have reliabilities of at least 0.80, which means that 80% of the variation in observed scores is caused by real differences in ability and 20% is caused by measurement error--essentially, 20% of the variation in the test-takers’ scores is just noise and tells us nothing about students’ relative abilities.

The easiest method I’ve found for estimating the reliability of dichotomously scored tests (each item is marked as either completely correct or completely wrong) is Kuder-Richardson Formula 21 (Violato, 1992):

$$ r_{KR21} = \frac{k}{k-1}\left(1 - \frac{\bar{X}\,(k - \bar{X})}{k\,s^{2}}\right) $$

where k is the number of items, X̄ is the mean raw score, and s² is the variance of the raw scores. For tests composed only of multiple-choice and numerical-response items, all you need is the set of test scores in order to estimate reliability--the labour of item analysis is unnecessary. (I also collect students’ scores on the Mathematics 30-1 unit exams to estimate their reliabilities.) Here are the estimated reliability coefficients of the Mathematics 30-1 equivalency final exam and the January 2013 Mathematics 30-1 diploma exam:


The reliability of the Mathematics 30-1 equivalency final exam is very high at 0.87 and compares well to the diploma exam, especially considering that Alberta Education has access to virtually infinite resources. KR-21 underestimates reliability, sometimes by a significant amount, so the true reliability of the exam is likely somewhat higher than 0.87.
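
For anyone who wants to estimate KR-21 for their own exams, here is a minimal sketch in Python. The raw scores and the item count are hypothetical placeholders, not data from any of our exams, and I use the population variance of the raw scores:

```python
# Minimal sketch of a KR-21 reliability estimate from a list of raw test scores.
# The scores and item count below are hypothetical, not actual exam data.

def kr21(scores, k):
    """Kuder-Richardson Formula 21: k is the number of items, scores are raw scores."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / n  # population variance of raw scores
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))

# Example with made-up raw scores on a 40-item exam:
raw_scores = [34, 28, 37, 22, 31, 25, 36, 30, 27, 33]
print(round(kr21(raw_scores, k=40), 2))  # prints an estimate around 0.67
```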

Validity is “the extent to which certain inferences can be made accurately from--and certain actions should be based on--test scores or other measurement” (Schinka, 2003). Reliability sets the upper limit of validity and is a necessary, but not sufficient, condition for the scores produced by a test to be valid (Schinka, 2003). In the trinitarian view, validity is a three-in-one godhead composed of content validity, criterion-related validity, and construct validity. (See http://jalt.org/test/bro_8.htm for detailed information about the trinitarian and the unitarian notions of validity.) Other types of validity sometimes mentioned are face validity (how much the test is liked by test-users) and social validity (how much the test is liked by test-takers). Though these are considered unscientific types of validity--they are more like minor prophets than gods--they are important political considerations.

One common way of building a validity argument is to compare the exam in question to another exam that has strong validity arguments supporting it: “Criterion-related validity usually includes any validity strategies that focus on the correlation of the test being validated with some well-respected outside measure(s) of the same objectives or specifications” (http://jalt.org/test/bro_8.htm). I assume the diploma exams produce scores with the highest validity of any available assessment of the Mathematics 30-1 curriculum. There are many reasons to think this is so: Alberta Education has vast resources at its disposal, teams of psychometricians dissect each exam for item-performance data from huge samples, extensive field testing is conducted, and the top curriculum experts are at hand for consultation. (Whether the diploma exams ought to be our all-in-all is another question.) Ideally, I would like to have a large group of students write the diploma exam and then write the equivalency final exam the very next day (people forget in different ways at different rates as time passes, so the time between writings should be minimized). This would directly compare the equivalency exam to the diploma exam. Of course, this is impractical. Instead, I decided to consider the Pearson product-moment correlation between the unit exam scores and the diploma exam scores, as well as the correlation between the unit exam scores and the equivalency exam scores. If both correlations are high, then the equivalency final exam likely measures the same ability as the diploma exam. This is an indirect comparison that relies on the unit exam scores as the common basis, but it is the best measure of criterion-related validity currently available.
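
As a sketch of the computation (the two score lists below are made-up placeholders, not the actual unit exam, diploma exam, or equivalency exam data), the Pearson product-moment correlation takes only a couple of lines of Python:

```python
# Minimal sketch of the Pearson product-moment correlation between two sets of scores.
# The score lists are hypothetical placeholders, not the actual exam data.

import numpy as np

unit_exam = np.array([78, 65, 90, 55, 82, 70, 60, 88])     # hypothetical unit exam marks (%)
diploma_exam = np.array([70, 58, 85, 48, 75, 66, 52, 80])  # hypothetical diploma exam marks (%)

r = np.corrcoef(unit_exam, diploma_exam)[0, 1]
print(f"observed r = {r:.2f}")
```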

The unit exam scores are highly correlated with the diploma exam scores, with r = 0.86, n = 19. Corrected for attenuation, r is nearly unity. Correcting for attenuation removes the effect of the unreliability (random error) present in each exam and shows the true, underlying correlation between the constructs we are trying to measure. Furthermore, several students with low class grades did not write the diploma exam. If they had written the diploma exam, they would likely have received very low scores, which would have pushed the correlation even higher. The reduction in r that occurs when part of the score range is missing is termed “range restriction”; there are ways to compensate for it, but I did not do so for lack of time. An r of nearly 1 means the unit exams and the diploma exams measure almost exactly the same underlying construct, presumably ability in Mathematics 30-1, although the sample size is fairly small. The sample size is small because the extranet shows a surprising number of our students never write the diploma exam. But as Yogi Berra said, “Predictions are hard to make, especially about the future,” so this result is still impressive.
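
The correction for attenuation itself is a single formula: the observed correlation divided by the square root of the product of the two reliability coefficients. Here is a minimal sketch; the reliability values are placeholders for illustration, not the actual reliabilities of the unit exams and the diploma exam:

```python
# Correction for attenuation: r_true = r_observed / sqrt(r_xx * r_yy).
# The reliability values below are illustrative placeholders only.

import math

def disattenuate(r_observed, rel_x, rel_y):
    """Estimate the correlation between true scores from the observed r and the two reliabilities."""
    return r_observed / math.sqrt(rel_x * rel_y)

print(round(disattenuate(0.86, 0.85, 0.92), 2))  # with these assumed reliabilities, roughly 0.97
```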

The unit exam scores are also highly correlated with the equivalency final exam scores, with r = 0.81, n = 48. Corrected for attenuation, r = 0.95. As was the case with the diploma exam, several students with low class grades did not write the equivalency exam; if they had, the correlation would likely be even higher. The high correlation means the unit exams and the equivalency exam measure almost exactly the same underlying ability.

One notable difference between the diploma exam results and the equivalency exam results is that the drop from the class mark to the exam mark is greater for the diploma exam. The average drop from the class mark to the diploma exam mark is 16%, whereas the average drop from the class mark to the equivalency exam mark is 6%. (One could run an analysis of variance, for example in Excel, to see whether this difference is significant; a rough sketch of such a check appears below.) The larger drop is likely a consequence of the sometimes long delay between finishing the class and writing the diploma exam--an exam manager told me this is observed at other colleges as well. The forgetting curve is probably rather steep for mathematics concepts because overlearning is difficult to accomplish in this subject area (however, I have no evidence of this beyond individual observations).
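
Here is that significance check sketched in Python using a one-way ANOVA (with two groups this is equivalent to a two-sample t-test). The per-student drop values are made-up placeholders, not our actual data:

```python
# Sketch of a significance check on the class-mark-to-exam-mark drops.
# The drop values below are hypothetical placeholders, not actual student data.

from scipy import stats

diploma_drops = [18, 14, 20, 12, 16, 19, 13, 17]  # hypothetical % drops, class mark -> diploma exam
equivalency_drops = [7, 5, 9, 4, 6, 8, 5, 6]      # hypothetical % drops, class mark -> equivalency exam

f_stat, p_value = stats.f_oneway(diploma_drops, equivalency_drops)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```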

Assuming the diploma exam is the best available test of ability in Mathematics 30-1, these results support the criterion-related validity of the Mathematics 30-1 equivalency final exam as well as the Mathematics 30-1 form A unit exams. The other aspects of the validity trinity, content validity and construct validity, are supported by careful blueprinting, judicious adherence to item-writing research, and thorough reviews by four subject-matter experts. The concord among these lines of evidence supports the unitary conclusion that the Mathematics 30-1 equivalency exam form A produces valid results from which valid inferences and decisions can be made. The methodologies mentioned in this article apply equally well to any course that has an associated equivalency exam and diploma exam.

Regards,
Michael Gaschnitz

If you have any questions about anything in this article, please contact me at mgaschnitz@bowvalleycollege.ca.

References
Schinka, J. A., Velicer, W. F., Weiner, I. B., & Freedheim, D. K. (2003). Handbook of Psychology: Assessment Psychology. Wiley.

Violato, C., McDougall, D., & Marini, A. (1992). Educational Measurement and Evaluation. Dubuque: Kendall/Hunt.

Thursday, 5 September 2013

English Equivalency Exam Review Update

Happy Fall Term Everyone!

With the new term comes opportunity to continue work on the English Equivalency Exam Review Project. In the summer, the English Equivalency Exam project team worked with the curriculum team to complete Form A of the English 30-2 Equivalency Exam. The project had been ongoing since January of this year with a lot of hard work being done by some very dedicated and talented instructors. Thanks so much for all your work!

Thank you to Meghan Clayon, Lorna Houck, Susan Lemmer, Jennefer Rousseau, Patricia Pryce, Murray Ronaghan, Tasha Nott, Chris Taylor, and Allen Bobie for their work on reviewing and revising the exam. We had a retreat in March that gave us the opportunity to work together and determine which readings and items would remain on the exam. During the summer months, Meghan Clayon, Lorna Houck, Susan Lemmer, Jennefer Rousseau, and Patricia Pryce worked to create items to fill gaps in the blueprint for Form A. The collaboration in such a short time frame was inspiring! Once the items were written, the exam was reviewed as a whole by the group, and the new Form A Version 202 was born. We are collecting statistics on the exam and its items to ensure reliability, and we will share the data once a large enough sample has been collected. As with all exam review projects, exam maintenance is ongoing and we will continue to modify the exam where necessary.

Now we move to writing items for Form B to replace eliminated items and to create items for new readings. The exam will be reviewed by all project team members, and the instructional designers will create a blueprint to help ensure the exam is valid. Our targeted completion for Form B is December. With components such as the English exam blueprints, templates, item-writing guidelines, and the Foundational Curriculum website already in place, English Language Arts exam development can be even more rapid and collaborative.

Thank you to everyone who worked on and supported the English Equivalency Exam Review! We are happy to have everyone contribute to projects. Please email Maureen or Carey with any questions or concerns. You can also submit a project idea on the Exam Development Project page.