Item analysis is a process that
examines student responses to individual test items (questions) in order to
assess the quality of those items and of the test as a whole. Item analysis is
especially valuable in improving items which will be used again in later tests,
but it can also be used to eliminate ambiguous or misleading items in a single
test administration. In addition, item analysis is valuable for increasing
instructors' skills in test construction, and identifying specific areas of
course content which need greater emphasis or clarity. Separate item analyses can
be requested for each raw score created during a given ScorePak® run. A basic
assumption made by ScorePak® is that the test under analysis is composed of
items measuring a single subject area or underlying ability. The quality of the
test as a whole is assessed by estimating its "internal consistency."
The quality of individual items is assessed by comparing students' item
responses to their total test scores.
Item Difficulty Index
The item difficulty index is one of the most
useful, and most frequently reported, item
analysis statistics. It is a measure of the proportion
of examinees who answered the item correctly; for this reason it is
frequently called the p-value. As the proportion of examinees who got
the item right, the p-value might more properly be called the item easiness
index rather than the item difficulty index. It can range from 0.0 to 1.0, with
a higher value indicating that a greater proportion of examinees responded to
the item correctly, and thus that the item was easier. For criterion-referenced
tests (CRTs), with their emphasis on mastery-testing, many items on an exam
form will have p-values of .9 or above. Norm-referenced tests (NRTs), on the
other hand, are designed to be harder overall and to spread out the examinees’
scores. Thus, many of the items on an NRT will have difficulty indexes between
.4 and .6.
ScorePak® arbitrarily classifies item difficulty as
"easy" if the index is 85% or above;
"moderate" if it is between 51% and 84%;
"hard" if it is 50% or below.
Item Discrimination Index
The item discrimination index is a measure of
how well an item is able to distinguish between examinees who are knowledgeable
and those who are not, or between masters and non-masters. There are
several ways to compute an item discrimination index, but one of the most common is
the point-biserial correlation. This statistic looks at the relationship
between an examinee’s performance on the given item (correct or incorrect) and
the examinee’s score on the overall test. For a highly discriminating item,
the examinees who responded to the item correctly tended to do well on the
test, while those who responded incorrectly tended to do poorly on the overall test.
The possible range of the discrimination
index is -1.0 to 1.0; however, if an item has a discrimination below 0.0, it
suggests a problem. When an item is discriminating
negatively, overall the most knowledgeable examinees are
getting the item wrong and the least knowledgeable examinees are getting the
item right. A negative discrimination index may indicate that the item is
measuring something other than what the rest of the
test is measuring. More often, it is a sign that the item
has been mis-keyed.
When interpreting the value of a discrimination index, it is
important to be aware that there is a
relationship between an item’s difficulty index and its
discrimination index. If an item
has a very high (or very low) p-value, the potential
value of the discrimination index will
be much less than if the item has a mid-range p-value. In
other words, if an item is either very easy or very hard, it is not likely to
be very discriminating.
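As a rough sketch of the point-biserial approach (the data and names are illustrative; ScorePak®'s exact computation may differ, for example by removing the item from the total score before correlating):

```python
import math

def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total test scores.
    Values near or below 0 flag weak or problematic (possibly mis-keyed) items."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:  # e.g. every examinee answered the item the same way
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Examinees who got the item right also scored higher overall,
# so the discrimination is strongly positive.
item = [1, 1, 0, 0, 1]
totals = [38, 35, 22, 25, 31]
print(round(point_biserial(item, totals), 2))
```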
Content Validity
When
a test has content validity, the items on the test represent the entire range
of possible items the test should cover. Individual test questions may be drawn
from a large pool of items that cover a broad range of topics. In some instances
where a test measures a trait that is difficult to define, an expert judge may
rate each item’s relevance. Because each judge is basing their rating on
opinion, two independent judges rate the test separately. Items that are rated
as strongly relevant by both judges will be included in the final test.
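A small illustrative sketch of that selection step (the rating scale, cutoff, and item names below are hypothetical):

```python
# Hypothetical relevance ratings on a 1-4 scale (4 = strongly relevant).
judge_a = {"item01": 4, "item02": 2, "item03": 4, "item04": 3}
judge_b = {"item01": 4, "item02": 3, "item03": 4, "item04": 4}

STRONGLY_RELEVANT = 4  # assumed cutoff for "strongly relevant"

# Keep only the items that both independent judges rate as strongly relevant.
retained = [item for item in judge_a
            if judge_a[item] >= STRONGLY_RELEVANT and judge_b[item] >= STRONGLY_RELEVANT]
print(retained)  # ['item01', 'item03']
```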
Criterion-related Validity
A test is said to have
criterion-related validity when the test has demonstrated its effectiveness in
predicting criteria or indicators of a construct. There are two different
types of criterion validity:
· Concurrent Validity occurs when the criterion measures are obtained at the
same time as the test scores. This indicates the extent to which the test
scores accurately estimate an individual’s current state with regard to the
criterion. For example, a test that measures levels of depression would be
said to have concurrent validity if its scores reflected the level of
depression the test taker is currently experiencing.
· Predictive Validity occurs when the criterion measures are obtained at a
time after the test. Examples of tests with predictive validity are career or
aptitude tests, which are helpful in determining who is likely to succeed or
fail in certain subjects or occupations.
Distractor Analysis
One important element in the quality of a
multiple choice item is the quality of the item’s distractors. However, neither
the item difficulty nor the item discrimination index
considers the performance of the incorrect response
options, or distractors. A distractor
analysis addresses the performance of the incorrect response options in
multiple choice items.
Just as the key, or correct response option, must be
definitively correct, the distractors
must be clearly incorrect (or clearly not the
"best" option). In addition to being clearly
incorrect, the distractors must also be plausible. That
is, the distractors should seem likely or reasonable to an examinee who is not
sufficiently knowledgeable in the content area. If a distractor appears so
unlikely that almost no examinee will select it, it is not
contributing to the performance of the item. In fact, the
presence of one or more
implausible distractors in a multiple choice item can
make the item artificially easier than it ought to be.
In a simple approach to distractor analysis,
the proportion of examinees who selected each of the response options is
examined. For the key, this proportion is equivalent to the item p-value, or
difficulty. If the proportions are summed across all of an item’s
response options, they will add up to 1.0, or 100% of the examinees' selections.
The proportion of examinees who select each of the
distractors can be very informative.
For example, it can reveal an item mis-key. Whenever the
proportion of examinees who
selected a distractor is greater than the proportion of
examinees who selected the key,
the item should be examined to determine if it has been
mis-keyed or double-keyed. A
distractor analysis can also reveal an implausible
distractor. In CRTs, where the item p-values are typically high, the
proportions of examinees selecting all the distractors are, as a result, low.
Nevertheless, if examinees consistently fail to select a given distractor,
this may be evidence that the distractor is implausible or simply too easy to
rule out.
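A minimal sketch of this simple approach (the response data and option labels are illustrative): tally the proportion of examinees choosing each option and flag any distractor chosen more often than the key as a possible mis-key.

```python
from collections import Counter

def distractor_analysis(selections, key):
    """Proportion of examinees selecting each response option for one item.
    'selections' lists the options chosen (e.g. 'A'-'D'); 'key' is the correct option."""
    n = len(selections)
    counts = Counter(selections)
    proportions = {option: counts[option] / n for option in sorted(counts)}
    # Possible mis-key: a distractor drew a larger proportion than the key.
    flagged = [opt for opt, p in proportions.items()
               if opt != key and p > proportions.get(key, 0.0)]
    return proportions, flagged

selections = ["B", "C", "B", "B", "A", "B", "D", "B", "C", "B"]
proportions, flagged = distractor_analysis(selections, key="B")
print(proportions)  # sums to 1.0; the key's share (.60) is the item p-value
print("possible mis-key via:", flagged)  # empty list here, so no flag
```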
Ideal difficulty levels for multiple-choice items in terms of
discrimination potential are:
Format                                          Ideal Difficulty
Five-response multiple choice                   70
Four-response multiple choice                   74
Three-response multiple choice                  77
True-false (two-response multiple choice)       85
Weaknesses of Multiple Choice:
1. The limited types of knowledge that can be assessed by multiple choice
tests. Multiple choice tests are best adapted for testing well-defined or
lower-order skills. Problem-solving and higher-order reasoning skills are
better assessed through short-answer and essay tests. However, multiple choice
tests are often chosen, not because of the type of knowledge being assessed,
but because they are more affordable for testing a large number of students.
This is especially true in the United States where multiple choice tests are
the preferred form of high-stakes testing.
2. The possibility of ambiguity in the examinee's interpretation of the item.
Failing to interpret information as the test
maker intended can result in an "incorrect" response, even if the
taker's response is potentially valid. The term "multiple guess" has
been used to describe this scenario because test-takers may attempt to guess
rather than determine the correct answer. A free response test allows the test taker to make
an argument for their viewpoint and potentially receive credit.
3. Even if students have some knowledge of a question, they
receive no credit for knowing that information if they select the wrong answer
and the item is scored dichotomously. However, free response questions may allow an examinee to
demonstrate partial understanding of the subject and receive partial credit.
Additionally, if more questions on a particular subject area or topic are
asked to create a larger sample, then their level of knowledge for that topic
will, statistically, be reflected more accurately in the number of correct
answers and the final results.
4. A student who is incapable of answering a
particular question can simply select a random answer and still have a chance
of receiving a mark for it. It
is common practice for students with no time left to give all remaining
questions random answers in the hope that they will get at least some of them
right. Many exams, such as the Australian
Mathematics Competition, have systems in place to negate this, in
this case by making it more beneficial to not give an answer than to give a
wrong one. Another such system is formula scoring, in which a score is
proportionally reduced based on the number of incorrect responses and the
number of possible choices. In this method, the score is reduced by the number
of wrong answers divided by the average number of possible answers for all
questions in the test, W/(c - 1), where W = the number of wrong responses on
the test and c = the average number of possible choices for all questions on
the test (a brief sketch of this calculation appears after this list). All
exams scored with the three-parameter model of item response theory also
account for guessing. This is usually not a great issue, however, since the
odds of a student receiving significant marks by guessing are very low when
four or more selections are available.
5. Questions phrased ambiguously may cause
test-taker confusion. It is
generally accepted that multiple choice questions allow for only one answer,
where the one answer may encapsulate a collection of previous options. However,
some test creators are unaware of this and might expect the student to select
multiple answers without giving explicit permission or providing the trailing
encapsulating option. Of course, untrained test developers are a
threat to validity regardless of the item format.
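As a minimal sketch of the formula scoring described in point 4 (names are illustrative, and it assumes the simple case where every item offers the same number of choices rather than using the average across items):

```python
def formula_score(num_right, num_wrong, choices_per_item):
    """Classic correction for guessing: R - W / (c - 1).
    Omitted items are neither rewarded nor penalized."""
    return num_right - num_wrong / (choices_per_item - 1)

# Hypothetical 40-item test with four options per item:
# 28 right, 8 wrong, 4 omitted.
print(formula_score(28, 8, 4))  # 28 - 8/3, roughly 25.33
```

Under this rule, a purely random guess has an expected value of zero (a 1/c chance of gaining one point against a (c - 1)/c chance of losing 1/(c - 1) of a point), which is what removes the incentive to guess blindly.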