Rater Reliability in Language Testing
Dr Raphael Gamaroff (Assistant Professor of English, Abu Dhabi University, Al Ain)
The 5th Annual English Language Teaching Conference, 25th March, 2004
Reliability is concerned with how we measure; validity is concerned with what we are supposed to measure, i.e. with the purposes of a test. In this paper, I concentrate on reliability; however, it is often difficult to discuss reliability without bringing validity into the picture.
The reliability of a test is concerned with the accuracy of scoring and the accuracy of the administration procedures of the test.
In this paper the following key aspects of reliability are dealt with:
- Facets: These refer to such factors as the (1) testing environment, e.g. the time of testing and the test setting, (2) test organisation, e.g. the sequence in which different questions are presented, and (3) the relative importance of different questions and topic content. Facets also include cultural differences between test takers, the attitude of the test taker, and whether the tester does such things as point out the importance of the test for a test-taker’s future.
- Features or conditions: These refer to such factors as clear instructions, unambiguous questions and items that do or do not permit guessing.
- The manner in which the test is scored. A central factor in this regard is rater reliability, i.e. the agreement between raters on scores and judgements related to a particular test. Rater reliability (consistency) becomes a problem mainly in the kind of tests that involve subjective judgements such as essay tests.
The bulk of the paper will focus on rater reliability.
A major obstacle in test development has been the lack of agreement on what it means to know a language, which affects
WHAT aspects of language knowledge are tested; and
HOW these aspects are tested.
Validity and reliability are closely connected. Validity has to do with WHAT you are measuring, reliability with HOW. The WHAT of testing (validity) deals with the knowledge and skills that we are testing/measuring.
The HOW in reliability refers to THREE main things.
The first two are intrinsic to the test: (1) the form(s) of the test and (2) the conditions under which it is taken (intrinsic reliability); the third is extrinsic to the test and to the test conditions.
The form of the test. There are several things to consider here:
- clear instructions, unambiguous questions and items that do not permit guessing;
- the sequence in which different questions are presented;
- the relative importance of different questions;
- if you give two different forms of a test to two different individuals, or groups, both forms of the test should be testing the same things and be of the same degree of difficulty.
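As a rough illustration of the last point, one can compare the score distributions produced by the two forms: if the forms are parallel, their means and spreads should be close. The scores below are invented for illustration; they do not come from any real test.

```python
from statistics import mean, stdev

# Invented scores (out of 20) from two groups, each taking one form of the test.
form_a = [14, 16, 12, 15, 13, 17, 14, 15]
form_b = [13, 15, 12, 14, 14, 16, 13, 15]

# If the forms are parallel, means and standard deviations should be similar.
print(f"Form A: mean={mean(form_a):.2f}, sd={stdev(form_a):.2f}")
print(f"Form B: mean={mean(form_b):.2f}, sd={stdev(form_b):.2f}")
print(f"Difference in means: {mean(form_a) - mean(form_b):.2f}")
```

A large difference in means would suggest that one form is harder than the other, which undermines the comparability of the two groups' scores.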
Testing conditions. A few examples:
- testing environment, e.g. the time of testing and the test setting
- if test takers in one test group are allowed to talk to one another during a test, and another group is not allowed to do so, this will affect the reliability of the test.
Rater (examiner) reliability.
The manner in which the test is scored. The central factor here is rater consistency (reliability), i.e. the agreement between raters on the scores and judgements awarded for a particular test. Rater consistency becomes a problem mainly in tests that involve subjective judgement, such as essay tests.
In the rest of this paper, I focus on rater reliability.
Scores can only be reliable if you know WHAT you are measuring/marking.
But, if you know what you are measuring, it does not follow that you will automatically measure consistently/accurately/reliably.
So, the teacher has to know his/her subject well, but must also know the problems in assessing a student’s work.
Rater reliability is EXTRINSIC (external) to the test itself, because it doesn’t deal with things such as the forms of the test or the conditions under which they are done. It deals with the way tests are marked.
There are two kinds of rater reliability:
1. Intrarater reliability and
2. Interrater reliability.
In intrarater reliability we aim for consistency within a single rater. For example, if a rater (teacher) has many exam papers to mark and doesn’t have enough time to mark them, he or she might take much more care with the first, say, ten papers than with the rest. This inconsistency will affect the students’ scores; the first ten might get higher scores. In interrater reliability we aim for consistency between different raters marking the same work.
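The kind of intrarater drift just described can be made visible by comparing the mean score of the carefully marked papers with the mean of the rest. The scores below are invented to illustrate the pattern, not taken from any real marking session.

```python
from statistics import mean

# Invented essay scores (out of 20) in marking order: the rater takes care
# with the first ten papers and marks the remainder hastily and more harshly.
scores = [15, 16, 14, 15, 17, 16, 15, 14, 16, 15,   # first ten: marked carefully
          12, 11, 13, 12, 11, 12, 13, 11, 12, 12]   # the rest: marked hastily

first_ten, rest = scores[:10], scores[10:]
drift = mean(first_ten) - mean(rest)
print(f"First ten: {mean(first_ten):.1f}, rest: {mean(rest):.1f}, drift: {drift:.1f}")
```

A drift of several marks between the two halves is a warning sign that the rater, not the students, is responsible for part of the score difference.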
Earlier I mentioned the pressure of time a rater (teacher) may face when marking. There are many other kinds of pressure that teachers may have to endure in the way they mark exams and tests. One of these may be the need to equip as many people as possible to join the real world outside of school.
Besides the pressures that teachers may have to endure in their testing procedures, there are other factors that affect the ability of a teacher to be OBJECTIVE, i.e. to see things as they are, rather than to be SUBJECTIVE, i.e. to see things as one wants/needs them to be.
The human sciences are more subjective than the natural sciences. It is much easier for teachers to agree that water boils at a hundred degrees centigrade at sea level than that a student in “C” level is stupid or cheeky.
It is also much easier to agree on the correctness of a grammar item (objective part) than on the quality of a composition (subjective whole).
Language is meant for real-life, i.e. for communication.
Examples of communication are:
1. Writing a letter, or a composition;
2. Speaking to or listening to someone;
3. Reading a book or a newspaper.
Grammar is not communication; it provides the basic building blocks of communication.
It is easier to measure grammatical knowledge (grammatical competence) accurately than to measure language use (communicative competence).
To return to our two terms OBJECTIVE and SUBJECTIVE, grammatical competence is more objective, while communicative competence is more subjective.
For example, most grammar items are either right or wrong, whereas it is much more difficult to make judgements about large chunks of language such as a composition. There is often a wide range in judgements and scores between different teachers (raters).
This means that grammar scores are often accurate, i.e. reliable, but that composition scores are often inaccurate, i.e. unreliable.
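One common way to put a number on this contrast is to correlate the scores that two raters give to the same set of scripts: near-perfect agreement on grammar items, wide disagreement on essays. The scores below are invented purely to illustrate the pattern.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented scores from two raters marking the same ten scripts.
grammar_r1 = [8, 6, 9, 7, 5, 8, 6, 9, 7, 8]
grammar_r2 = [8, 6, 9, 7, 6, 8, 6, 9, 7, 8]            # near-identical judgements
essay_r1   = [14, 10, 16, 12, 9, 15, 11, 17, 12, 14]
essay_r2   = [11, 13, 14, 15, 10, 12, 14, 13, 15, 11]  # wide disagreement

print(f"Grammar agreement: r = {pearson(grammar_r1, grammar_r2):.2f}")
print(f"Essay agreement:   r = {pearson(essay_r1, essay_r2):.2f}")
```

A correlation close to 1 indicates that the two raters rank the scripts almost identically; a correlation near 0 indicates that their judgements are essentially unrelated.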
To sum up:
The greater the validity of a test, the more it reflects language use. Language use is more difficult to mark than grammar, and the more difficult a test is to mark, the less reliable the scoring.
Grammar tests are more reliable but less valid.
Composition tests are less reliable but more valid.
We must be careful when we say that grammar scores are much more reliable (accurate) than composition scores.
If the test is a multiple choice test where the answers are provided by the test maker, teachers do not need to decide on the accuracy of the grammar item. So, scoring will be more reliable.
However, where the answers to grammar items are not provided, marking can be unreliable. It depends on the grammatical knowledge of the teacher.
To repeat what was said earlier:
Reliability = Accuracy of HOW we assess/test
Validity = WHAT we assess/test
The easier something is to measure, the less complex it is. Language use is more complex than grammar; emotions are more complex than motions.
Composition writing is much more difficult to mark than grammar items because composition is much more authentic/real-life than grammar. In composition we need to look at things like cohesion and coherence, which have to do with how sentences and parts of sentences come together to create paragraphs and longer stretches of discourse.
For this reason, composition writing (real-life language, i.e. language use) gives one a better understanding of a person’s language competence.
Reliability in composition writing is a more fruitful field of study than isolated grammar items, because in composition we find all the three important components of language use, namely:
1. Content
2. Organisation
3. Language
Conclusions and Recommendations
The problem of rater reliability is how to be as fair as possible in the allocation of scores and judgements. The problem of rater reliability in assessment seems to take precedence over all other issues in testing. This is understandable because assessment is the last and most crucial stage in the teaching programme. (Most learners only protest about poor teaching, unclear exam questions, etc. if they fail).
1. At the beginning of an academic year, all the teachers in the department can rate one student’s assignment and discuss the criteria they used and the marks they awarded. This exercise, done repeatedly over a period of time, gradually increases interrater reliability.
2. Before a major test or an exam, questions set by individual teachers can (indeed must) be discussed (formally and/or informally) by the whole department in terms of:
- clarity of the questions
- length of questions, and
- number of marks to be awarded for specific questions or sections.
3. The person who prepares the question should also give a memorandum to others in the department to demonstrate the criteria by which answers will be evaluated.
If these procedures were followed, one would hope that objectivity would increase. Yet it seems that even if one does consult with colleagues (as would be the case with literary or music or culinary critics), hard problems remain. What is worrisome in the assessment of written output is that, in spite of discussions and workshops on establishing common criteria, there remain large differences in the relative weight raters attach to the different criteria, e.g. linguistic structure, content and organisation.
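The effect of weighting can be shown with a small sketch: even when two raters agree exactly on the criterion marks for a composition, different weights for content, organisation and language produce different final scores. The marks and weights below are invented for illustration.

```python
# Invented criterion marks (out of 10) that two raters happen to agree on
# for one composition.
marks = {"content": 8, "organisation": 5, "language": 7}

# The raters agree on the marks but weight the criteria differently.
weights_rater1 = {"content": 0.5, "organisation": 0.2, "language": 0.3}
weights_rater2 = {"content": 0.2, "organisation": 0.3, "language": 0.5}

def final_score(marks, weights):
    """Weighted average of criterion marks, giving a final mark out of 10."""
    return sum(marks[c] * weights[c] for c in marks)

print(f"Rater 1: {final_score(marks, weights_rater1):.1f}/10")
print(f"Rater 2: {final_score(marks, weights_rater2):.1f}/10")
```

Here the disagreement in the final score comes entirely from the weighting, not from the judgements themselves, which is why agreeing on criteria alone is not enough: departments also need to agree on the relative weight of each criterion.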
But one should not stop trying to improve reliability, just as one should not stop trying to be as objective as possible in all the other facets of our lives, especially in our
judgements of others.