ITEM ANALYSIS



Item analysis is a process which examines student responses to individual test items (questions) in order to assess the quality of those items and of the test as a whole. Item analysis is especially valuable in improving items which will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items in a single test administration. In addition, item analysis is valuable for increasing instructors' skills in test construction and for identifying specific areas of course content which need greater emphasis or clarity.

Separate item analyses can be requested for each raw score created during a given ScorePak® run. A basic assumption made by ScorePak® is that the test under analysis is composed of items measuring a single subject area or underlying ability. The quality of the test as a whole is assessed by estimating its "internal consistency." The quality of individual items is assessed by comparing students' item responses to their total test scores.
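The exact internal-consistency estimator used by ScorePak® is not spelled out here; a common choice for dichotomously scored items is KR-20, which is the special case of Cronbach's alpha for 0/1 data. A minimal sketch, assuming a hypothetical examinees-by-items matrix of 0/1 scores:

import numpy as np

def cronbach_alpha(scores):
    """Internal-consistency estimate (Cronbach's alpha; equals KR-20 for 0/1 items).

    scores: 2-D array, rows = examinees, columns = items (1 = correct, 0 = incorrect).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-examinee, 4-item response matrix
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
print(round(cronbach_alpha(responses), 3))   # prints the alpha estimate for this small example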

Item Difficulty Index

The item difficulty index is one of the most useful, and most frequently reported, item
analysis statistics. It is a measure of the proportion of examinees who answered the item correctly; for this reason it is frequently called the p-value. As the proportion of examinees who got the item right, the p-value might more properly be called the item easiness index, rather than the item difficulty. It can range between 0.0 and 1.0, with a higher value indicating that a greater proportion of examinees responded to the item correctly, and it was thus an easier item. For criterion-referenced tests (CRTs), with their emphasis on mastery-testing, many items on an exam form will have p-values of .9 or above. Norm-referenced tests (NRTs), on the other hand, are designed to be harder overall and to spread out the examinees’ scores. Thus, many of the items on an NRT will have difficulty indexes between .4 and .6.

ScorePak® arbitrarily classifies item difficulty as:
"easy" if the index is 85% or above;
"moderate" if it is between 51% and 84%;
"hard" if it is 50% or below.


Item Discrimination Index
The item discrimination index is a measure of how well an item is able to distinguish between examinees who are knowledgeable and those who are not, or between masters and non-masters. There are several ways to compute an item discrimination index, but one of the most common is the point-biserial correlation. This statistic looks at the relationship between an examinee's performance on the given item (correct or incorrect) and the examinee's score on the overall test. For a highly discriminating item, the examinees who responded to the item correctly generally also did well on the test, while those who responded to the item incorrectly generally tended to do poorly on the overall test.
The possible range of the discrimination index is -1.0 to 1.0; however, if an item has a discrimination below 0.0, it suggests a problem. When an item is discriminating
negatively, overall the most knowledgeable examinees are getting the item wrong and the least knowledgeable examinees are getting the item right. A negative discrimination index may indicate that the item is measuring something other than what the rest of the
test is measuring. More often, it is a sign that the item has been mis-keyed.
When interpreting the value of a discrimination index, it is important to be aware that there is a
relationship between an item's difficulty index and its discrimination index. If an item
has a very high (or very low) p-value, the potential value of the discrimination index will
be much lower than if the item has a mid-range p-value. In other words, an item that is either very easy or very hard is not likely to be very discriminating.
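A minimal sketch of the point-biserial approach described above, computed here as the Pearson correlation between the 0/1 item score and the total test score (a common refinement excludes the item from the total; that is omitted for brevity, and the data are hypothetical):

import numpy as np

def point_biserial(item_scores, total_scores):
    """Correlation between a dichotomous item (0/1) and the total test score."""
    item = np.asarray(item_scores, dtype=float)
    total = np.asarray(total_scores, dtype=float)
    return float(np.corrcoef(item, total)[0, 1])

# Hypothetical data: 1 = correct on this item; totals are overall test scores
item = [1, 1, 0, 1, 0, 1, 0, 0]
totals = [38, 35, 22, 30, 25, 33, 18, 27]
print(round(point_biserial(item, totals), 2))   # a positive value means the item discriminates in the expected direction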

Content validity
When a test has content validity, the items on the test represent the entire range of possible items the test should cover. Individual test questions may be drawn from a large pool of items that cover a broad range of topics. In some instances, where a test measures a trait that is difficult to define, an expert judge may rate each item's relevance. Because each judge bases the rating on opinion, two independent judges rate the test separately. Items that are rated as strongly relevant by both judges will be included in the final test.
Criterion-related Validity
A test is said to have criterion-related validity when it has demonstrated its effectiveness in predicting criteria, or indicators, of a construct. There are two different types of criterion validity:
· Concurrent Validity occurs when the criterion measures are obtained at the same time as the test scores. This indicates the extent to which the test scores accurately estimate an individual's current state with regard to the criterion. For example, a test that measures levels of depression would be said to have concurrent validity if it measured the current levels of depression experienced by the test taker.



· Predictive Validity occurs when the criterion measures are obtained at a time after the test. Examples of tests with predictive validity are career or aptitude tests, which are helpful in determining who is likely to succeed or fail in certain subjects or occupations.


Distractor Analysis
One important element in the quality of a multiple choice item is the quality of the item’s distractors. However, neither the item difficulty nor the item discrimination index
considers the performance of the incorrect response options, or distractors. A distractor
analysis addresses the performance of incorrect response options in a multiple choice item.
Just as the key, or correct response option, must be definitively correct, the distractors
must be clearly incorrect (or clearly not the "best" option). In addition to being clearly
incorrect, the distractors must also be plausible. That is, the distractors should seem likely or reasonable to an examinee who is not sufficiently knowledgeable in the content area. If a distractor appears so unlikely that almost no examinee will select it, it is not
contributing to the performance of the item. In fact, the presence of one or more
implausible distractors in a multiple choice item can make the item artificially far easier
than it ought to be.
In a simple approach to distractor analysis, the proportion of examinees who selected each of the response options is examined. For the key, this proportion is equivalent to the item p-value, or difficulty. If the proportions are summed across all of an item's response options, they will add up to 1.0, or 100% of the examinees' selections.
The proportion of examinees who select each of the distractors can be very informative.
For example, it can reveal an item mis-key. Whenever the proportion of examinees who
selected a distractor is greater than the proportion of examinees who selected the key,
the item should be examined to determine if it has been mis-keyed or double-keyed. A
distractor analysis can also reveal an implausible distractor. In CRTs, where the item p-values are typically high, the proportions of examinees selecting all the distractors are, as a result, low. Nevertheless, if examinees consistently fail to select a given distractor, this may be evidence that the distractor is implausible or simply too easy.
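A minimal sketch of this kind of distractor analysis, assuming hypothetical response data (one chosen option letter per examinee and a known key); it reports the proportion choosing each option and flags a possible mis-key when a distractor outdraws the key:

from collections import Counter

def distractor_analysis(responses, key, options="ABCD"):
    """Proportion of examinees selecting each option, with simple warnings."""
    n = len(responses)
    counts = Counter(responses)
    proportions = {opt: counts.get(opt, 0) / n for opt in options}
    for opt, p in proportions.items():
        flag = ""
        if opt == key:
            flag = "  <- key (this proportion is the item p-value)"
        elif p > proportions[key]:
            flag = "  <- outdraws the key: check for a mis-key"
        elif p == 0.0:
            flag = "  <- never chosen: possibly implausible"
        print(f"{opt}: {p:.2f}{flag}")

# Hypothetical item keyed 'B', answered by ten examinees
distractor_analysis(list("BBCBABBDBB"), key="B", options="ABCD")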

Ideal difficulty levels for multiple-choice items in terms of
discrimination potential are:

Format                                        Ideal Difficulty (%)
Five-response multiple-choice                 70
Four-response multiple-choice                 74
Three-response multiple-choice                77
True-false (two-response multiple-choice)     85

Weaknesses of Multiple Choice:
1.      The limited types of knowledge that can be assessed by multiple choice tests. Multiple choice tests are best adapted for testing well-defined or lower-order skills. Problem-solving and higher-order reasoning skills are better assessed through short-answer and essay tests. However, multiple choice tests are often chosen not because of the type of knowledge being assessed, but because they are more affordable for testing a large number of students. This is especially true in the United States, where multiple choice tests are the preferred form of high-stakes testing.
2.      The possibility of ambiguity in the examinee's interpretation of the item. Failing to interpret information as the test maker intended can result in an "incorrect" response, even if the taker's response is potentially valid. The term "multiple guess" has been used to describe this scenario because test-takers may attempt to guess rather than determine the correct answer. A free response test allows the test taker to make an argument for their viewpoint and potentially receive credit.
3.      Even if students have some knowledge of a question, they receive no credit for knowing that information if they select the wrong answer and the item is scored dichotomously. However, free response questions may allow an examinee to demonstrate partial understanding of the subject and receive partial credit. Additionally, if more questions on a particular subject area or topic are asked to create a larger sample, then statistically the examinee's level of knowledge for that topic will be reflected more accurately in the number of correct answers and final results.
4.      A student who is incapable of answering a particular question can simply select a random answer and still have a chance of receiving a mark for it. It is common practice for students with no time left to give all remaining questions random answers in the hope that they will get at least some of them right. Many exams, such as the Australian Mathematics Competition, have systems in place to negate this, in this case by making it more beneficial to leave an answer blank than to give a wrong one. Another such system is formula scoring, in which the score is proportionally reduced based on the number of incorrect responses and the number of possible choices: the score is reduced by W/(C − 1), where W is the number of wrong responses on the test and C is the average number of possible choices for all questions on the test (see the sketch after this list). All exams scored with the three-parameter model of item response theory also account for guessing. This is usually not a great issue, however, since the odds of a student receiving significant marks by guessing are very low when four or more selections are available.
5.      Questions phrased ambiguously may cause test-taker confusion. It is generally accepted that multiple choice questions allow for only one answer, where that one answer may encapsulate a collection of the previous options. However, some test creators are unaware of this and might expect the student to select multiple answers without being given explicit permission, or without providing a trailing option that encapsulates the others. Of course, untrained test developers are a threat to validity regardless of the item format.
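As referenced in item 4 above, formula scoring corrects a number-right score for guessing by subtracting W/(C − 1). A minimal sketch with hypothetical numbers (the function name and values are illustrative only):

def formula_score(num_right, num_wrong, avg_choices):
    """Guessing-corrected score: number right minus W / (C - 1)."""
    return num_right - num_wrong / (avg_choices - 1)

# Hypothetical examinee: 60 right, 20 wrong on a test averaging 4 choices per item
print(formula_score(60, 20, 4))   # 60 - 20/3 ≈ 53.33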
