Multiple Choice Exam Scoring

Multiple choice exams are widely used and are here to stay. A previous article covered how to write good MCQ exams. Teachers who use MCQs are often accused of being lazy and poor teachers. That accusation is unfair, but a little more care could often be taken when scoring an MCQ test. Using an MCQ test does not make you a poor teacher; blindly scoring one, however, suggests either that you don't really know how to score an MCQ test properly or that you are a lazy, uncaring teacher. If you are lazy and uncaring (which I believe very few of you are), then you should stop reading here, because this will be a waste of time. If you care, keep going.

When we test students using any type of test, we are trying to discriminate between students who know a lot about the content domain we have been teaching and those who know less. The questions we ask in order to make this discrimination are a sample of the content domain. The purpose of every question we choose, regardless of type, is to provide a measure we can use to discriminate between those who have learned a lot and those who have learned less. An MCQ test is no different: every question has been selected to discriminate between those who know that bit of the content domain and those who do not. The primary advantage of an MCQ test is that it can measure a wide breadth of the content domain, although it has become the default for large classes simply because it is quick to mark.

The simplest method of scoring an MCQ test is an overall raw score: the percentage of questions a student answers correctly. With a four-alternative MCQ (the correct answer plus three distractors), the most common and straightforward type, one of the four alternatives presented to the students is correct. This means that a student who guessed at every question would get about 25% of them correct, assuming the common MCQ writing pitfalls are avoided. Numerous methods have been proposed to correct for this guessing bias, but the simplest is to treat 25% as the zero mark for the exam. If I can guess 25% of the questions correctly, then a score of 25% shows I don't know any of the material; I don't have to know anything to get 25%. This becomes important in determining how well an item discriminates between those who know the material and those who don't.
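Treating 25% as the zero mark amounts to rescaling the raw score so that the guessing rate maps to zero and a perfect score stays at 100%. Here is a minimal sketch of that rescaling; `corrected_score` is a hypothetical helper name, not something from the article:

```python
def corrected_score(raw_pct, n_alternatives=4):
    """Rescale a raw percentage so the chance-guessing rate maps to zero.

    With four alternatives, pure guessing yields about 25%, so 25%
    becomes the new zero while 100% remains 100%.
    """
    guess_rate = 100.0 / n_alternatives            # 25% for a 4-option MCQ
    return max(0.0, (raw_pct - guess_rate) / (100.0 - guess_rate) * 100.0)

print(corrected_score(25.0))    # guessing alone shows no knowledge: 0.0
print(corrected_score(62.5))    # halfway between chance and perfect: 50.0
```

This is only one of the many corrections for guessing mentioned above; it is the simplest, and it clamps at zero so a student who scores below chance is not reported as knowing a negative amount.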

Simple Item Analysis

If any of us were perfect teachers, item analysis would be unnecessary. I have been told by individuals that they are near perfect, but I would never lay claim to that status. I am as fallible as anyone and will continue to make it a lifetime pursuit to learn how people learn so that I can more effectively teach students who want to learn. What I mean by perfect in this context is that if I wrote perfect MCQs, there would never be any need to examine each item to see if it is doing what I think it is doing: discriminating between the students. I owe it to my students to examine every item they answer to see whether it is doing its job.

Item Difficulty

Simple item analysis begins with the raw scores and a difficulty score for each item (question). Dividing the number of students who answered an item correctly by the number of students who answered it gives a proportion correct for that item. Computing this proportion across the entire class gives a difficulty score for each and every question. Walsh's text (and every other psychometric text I have read) gives a simple definition of how difficult a question is: what proportion of students were able to answer the question correctly?
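If you have the raw scores as ones and zeros per student (as suggested later in the article), the difficulty calculation is a column average. A minimal sketch, with made-up illustrative data:

```python
# Each row is one student's answers (1 = correct, 0 = incorrect);
# each column is one question. These numbers are illustrative only.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]

def item_difficulty(responses):
    """Proportion of students answering each item correctly."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_students
            for j in range(n_items)]

print(item_difficulty(responses))   # [1.0, 0.5, 0.25, 0.75]
```

Question 1 here was answered correctly by everyone (difficulty 1.0, i.e. very easy), while question 3 was answered correctly by only a quarter of the class.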

The difficulty score can be artificially inflated by asking simple and obvious questions about the content domain, or deflated by asking obscure or trivial ones. Doing either defeats the purpose of every question in a test: to discriminate between the students who understand the content domain and those who don't. I have known a number of teachers who pride themselves either on writing tests that are really challenging (obscure or trivial questions) or on how well their students do on their tests (either asking the obvious or teaching to the test).

When properly analyzing an item to see if it is doing what you intend, you have to look at the overall probability of the students getting the question correct. If that probability is either too low (less than the guessing rate) or too high (over 90% of the students answered correctly), then the question is not discriminating between the students who know the content domain and those who don't. Before the final analysis of the test results, these two types of questions should be removed from the test.
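Those two cut-offs are easy to apply mechanically. A sketch, assuming the thresholds above (below the 25% guessing rate, or above 90% correct); `flag_items` is a hypothetical helper name:

```python
def flag_items(difficulties, guess_rate=0.25, upper=0.90):
    """Return indices of items whose difficulty falls outside the
    useful range: below the guessing rate or above the upper cut-off."""
    return [i for i, p in enumerate(difficulties)
            if p < guess_rate or p > upper]

# Item 0 (everyone correct) and item 2 (below chance) get flagged.
print(flag_items([1.0, 0.5, 0.20, 0.75]))   # [0, 2]
```

The exact cut-offs are a judgment call; the parameters make it easy to tighten or loosen them for a different number of alternatives.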

Item Discrimination

Another test of an item's suitability to remain in a test is its discriminability. (By the way, all of these item analyses can be carried out easily in a spreadsheet if you can get the raw scores as a series of ones and zeros for each student.) An item discrimination index scores each item on how well it discriminates between the students who know the material (those who score high on the exam) and the students who don't (those who score poorly). To compute it, sort all of your students by their total scores from highest to lowest. Then, for every item, take the average score of the top 25% or 30% of students and the average score of the bottom 25% or 30% of students, and subtract the latter from the former. This is your discrimination index.
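The steps above (rank by total score, take top and bottom groups, subtract group averages per item) can be sketched directly. The 25-30% group size from the article is a parameter here; the function name is my own:

```python
def discrimination_index(responses, fraction=0.25):
    """Top-group minus bottom-group proportion correct for each item.

    Students (rows of 1/0 answers) are ranked by total score; the top
    and bottom `fraction` of the class form the comparison groups.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))      # group size
    top, bottom = ranked[:k], ranked[-k:]
    n_items = len(responses[0])
    return [sum(r[j] for r in top) / k - sum(r[j] for r in bottom) / k
            for j in range(n_items)]

# Eight illustrative students, four items; totals run from 4 down to 0.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(discrimination_index(data))   # [1.0, 1.0, 0.5, 0.5]
```

With `fraction=0.25` and eight students, the top two and bottom two students form the groups; items 1 and 2 separate them perfectly (index 1.0), items 3 and 4 less so (0.5).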

What does an item discrimination index tell you? It lets you know how well an item discriminates between those who understand the content domain and those who do not understand the content domain. If the discrimination index score for an item is zero (or close to zero) that means that the item is not discriminating at all between those who know the material and those who don’t. If you have a negative number, that means that the students who scored the most poorly on the exam as a whole answered the question correctly more often than the students who scored the highest on the exam as a whole. These questions should also be removed from the exam before the final results are tallied.
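Acting on that rule (drop items whose discrimination index is zero or negative before the final tally) can be sketched as follows; both helper names are hypothetical, and the zero threshold is the one implied above:

```python
def flag_nondiscriminating(indices, threshold=0.0):
    """Indices of items whose discrimination index is at or below zero."""
    return [i for i, d in enumerate(indices) if d <= threshold]

def drop_items(responses, bad):
    """Remove flagged item columns so final scores use only sound items."""
    bad = set(bad)
    return [[v for j, v in enumerate(row) if j not in bad]
            for row in responses]

flagged = flag_nondiscriminating([0.6, 0.0, -0.2, 0.4])
print(flagged)                               # [1, 2]
print(drop_items([[1, 0, 1, 1]], flagged))   # [[1, 1]]
```

In practice you might also raise the threshold slightly so that near-zero items are removed as well, as the paragraph above suggests.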

Why should we ever remove items from a test? Well, if you claim to be a perfect teacher who always writes perfect MCQs, then you shouldn’t remove questions – the fault always lies with the students. However, as a mere mortal who regularly makes mistakes, I know that I regularly write questions that I think are going to discriminate between those who understand the content domain and those who don’t, and the questions don’t actually do that. As a result, I regularly remove questions from the MCQs that I give (or gave).

Item analysis and item removal are only fair to your students. They are led to believe that a test is doing what we claim it is doing: discriminating between those who know the material and those who don't. It is my job, as a teacher, to ensure that (to the best of my ability) every question I ask them to answer is doing exactly that.
