Category Archives: Language Testing

Language, Content and Skills in the Testing of English for Academic Purposes

Raphael Gamaroff, South African Journal of Higher Education, 12 (1), 109-116, 1998.

Abstract

Introduction

Review of the literature

Reliability and validity

Method

Results

Discussion

Conclusion

References

Appendix

Abstract

In modern education there is a critical need to understand better the relationship between language, (organisational) skills and content/subject matter. Nowhere is this more evident than in the teaching of academic and scientific discourse. An examination is made of the evaluations of second language lecturers of English for Academic Purposes (EAP) and of the evaluations of Science lecturers at a South African historically black university (HBU). Comparisons are made firstly within each group of raters (lecturers), and secondly between the two groups. Both groups of raters were asked to evaluate first-year students’ essays on the “Greenhouse Effect”. The findings show that there is a wide range of scores and judgements within each group as well as between the two groups of raters, which affects the reliability of the scores. Reasons are offered for this wide range of variability in judgements and scores.

Introduction

In education there is a critical need to understand the relationship between language skills, organisational skills and content/world knowledge/subject matter. Nowhere is this more evident than in the testing of academic and scientific discourse. Language teachers and subject teachers (e.g. biology and history teachers) can learn from one another with regard to the testing of academic discourse. To test, and accordingly to teach, academic and scientific discourse, science teachers need a knowledge of language and language teachers need a knowledge of science.

A comparison is made between the evaluations of second language lecturers of English for Academic Purposes (EAP) and the evaluations of Science lecturers at a historically black university in South Africa, where the mother tongue of the students is a Bantu language. In South Africa the vast majority of learners are not mother tongue users of English; thus the EAP with which educators and researchers in South Africa are mostly concerned is English as a second language.

The article consists of four parts: 1. The problematic distinction between language, content/subject knowledge and organisational skills/argument; 2. Definitions of the terms academic and scientific discourse, as well as of the two key terms in testing, namely reliability and validity; 3. Method of the investigation; 4. Results of the investigation and discussion.

Review of the literature

One of the major findings in the South African Committee of University Principals Report (HSRC, 1981) was that the lack of proficiency in the second language as medium of instruction, namely English, was the main cause of poor achievement among black learners. And one of the preliminary findings of the “Programme for the educationally disadvantaged pupils in South Africa” (Botha & Cilliers, 1993) was that there are three major areas of concern in black schools in South Africa, namely, cognitive deprivation, language inadequacies and consequent scholastic backlogs.

For Young (1987:164) this scholastic backlog, manifested by the poor pass rate of Standard 10 pupils in the Department of Education and Training (DET), “is often rooted in language incompetence, the causes of which cannot be found in the English subject classroom alone, but across the curriculum in every subject taught through English as a medium.” (The DET – now defunct – was the controlling body in South African black education until 1994, the year in which South Africa’s first democratic elections were held).

Young’s opinion is a common one; for Mcintyre (1992:10), for example, the problem in English as the medium of instruction for disadvantaged black learners “lies with language competence and not with subject competence”. Contrary to Mcintyre, other researchers are loath to make such a clear distinction between “language” and “content” (Saville-Troike, 1984; Snow & Brinton, 1988; Spack, 1988; Bradbury, Damerell, Jackson & Searle, 1990; Murray, 1990; Starfield, 1990; Starfield & Kotechka, 1991; Angelil-Carter, 1994). These authors believe that language cannot be taught without content and skills. Consider some of the issues in the problematic distinction between language, content and (organisational) skills:

Mcintyre’s “language competence” could mean linguistic competence or general language proficiency. Linguistic competence is only one part of language proficiency.

Linguistic competence is the knowledge of how to relate sounds to meaning. It is this relationship that we study in pure linguistics, whereas language proficiency is concerned with all of the following (Bialystok, 1978:71-75):

– Language input – exposure to the language.

– Knowledge in language use – storage of input. This knowledge is of three kinds: (1) “explicit linguistic knowledge” (conscious knowledge of the language); (2) “implicit linguistic knowledge” (unconscious knowledge of the language); and (3) “other knowledge” such as mother tongue, other languages, and knowledge of the world. I equate the latter with content knowledge.

– Output – the product of comprehension and production.

One of the great mysteries remains the relationship between input, information-processing and output (Mandler, 1984), which is closely tied up with the often deceptive relationship between the notions of “language” knowledge and “content” (or “subject”) knowledge: deceptive in that researchers (e.g. Hughes, 1989:82) often assume that there is a clear distinction between these two notions, which is not the case. On the one hand, it is recommended (Hughes, 1989:82) that we test language proficiency and nothing else, because language testers, according to Hughes, are “not normally interested in knowing whether students are creative, imaginative, or even intelligent, have wide general knowledge, or have good reasons for the opinions they happen to hold”. Much of the research into language learning has concentrated largely on linguistic knowledge, because it is widely believed that linguistic knowledge can be separated from content knowledge and skills, as in Hughes above (Saville-Troike, 1984:199).

It is difficult, perhaps impossible, to separate the language-specific cognitive structures of language proficiency from content knowledge (Langacker, 1987; Taylor, 1989:81ff) or from problem-solving abilities (Vollmer, 1983:22; Bley-Vroman, 1990) which, according to Bialystok above, are components of language proficiency.

With regard to language and content, it is difficult to sort “knowledge into two discrete epistemic sets: knowledge of language [i.e. linguistic knowledge] and knowledge of the world” (Biggs, 1982:112). As a result of this difficulty, it is often not possible to decide which of these two kinds of knowledge – not neglecting other possible factors – causes academic failure. Bolinger (1965) refers to the attempts to distinguish between these two kinds of knowledge as the “atomisation of meaning”. One might also describe this attempt as a splintering of culture, where culture “as a whole may be characterizable as a vast integrated semiotic in which can be recognized a number of subsemiotics, one of which is language” (Lamb, 1984:96).

With regard to content and skills, content can be learnt by rote – where rote memory does not require any higher order skills – or it can be learnt so that it becomes assimilated into cognitive structure, which requires higher order cognitive skills, e.g. comparison and transfer. Content may not be understood owing to the fact that the higher order skills are not developed enough to make content part of cognitive structure. So if content is a problem, so too are skills, because the one is enmeshed in the other.

We have a dilemma. On the one hand, it is recommended (Hughes, 1989:82) that we test language ability and nothing else, but, on the other hand, it is difficult, perhaps impossible, to separate language-specific cognitive structures from general problem-solving abilities (Vollmer, 1983:22; Bley-Vroman, 1990) or from content (Taylor, 1989:81ff). What Mathews (1964:89) states with regard to reading applies to EAP as a whole:

As one cannot ignore the importance of human variabilities and the situational and existential resonance of every reading [EAP] experience, so he cannot think long about reading [EAP] without thinking about content…Nevertheless we speak of “reading [EAP] teachers” in contradistinction to “content area instructors”. Whom do we fool in asserting such a hopelessly arbitrary dichotomy…For our present purpose, reading [EAP] may be defined as a civilized means of enlarging the mind in one or more identifiable ways: and the character of that enlargement will be controlled by the material – the content. Nowhere in the world do means exist totally isolated from materials and objects; then how can we so often act as if the act of reading [EAP] can be abstracted from content? I am saying that the study of reading [EAP] as something divorced from content is hopelessly limited, and ultimately false, in its approach.

Reliability and validity

Two key concepts in evaluation are reliability and validity. The reliability of a test is concerned with the accuracy of scoring and the accuracy of the administration procedures of the test (e.g. the way the test is compiled and conducted). I focus on interrater reliability, which has to do with the equivalence of scores and of judgements between raters. In the discussion of raters’ judgements I deal with the three components/criteria of language (grammar, vocabulary and punctuation), topic (content knowledge) and organisation.

If reliability is concerned with how we measure, validity is concerned with what we are supposed to measure, i.e. with the “purposes of a test” (Carmines & Zeller, 1979). In this investigation we are measuring the three criteria of language, content and organisation. A distinction is often made between content and skills (where the latter refers to the organisation of content). EAP is meant to teach academic skills. But as the data will show, it is not easy to distinguish between content and skills. Nor is it easy to distinguish between these two and “language”.

It is possible for a test to have high reliability, where raters give equivalent scores, but this does not necessarily mean that the test is valid, i.e. that these scores represent what they are supposed to measure. For instance, if all raters of an essay believe that language is the most important criterion in academic discourse and accordingly give equivalent scores in terms of language, the interrater reliability will be high. But the question is whether language should indeed be the most important criterion in academic discourse. If language is allowed to overshadow other criteria such as content and organisation, then the test is not being used for the intended purpose of measuring academic discourse, because academic discourse involves high proficiency in all three of the abovementioned criteria. If one of these criteria is underplayed, the use of the test would be invalid. To summarise the problem in reliability and validity: raters may choose whatever they fancy to measure and in so doing may obtain equivalent scores, i.e. high interrater reliability, but if they do not measure what they are supposed to measure, then the use of the test would be invalid.

[Oller (1979:272) defines reliability in terms of correlation (the way in which two sets of scores vary together) and rejects equivalence (in scores between raters) as a factor in reliability. I prefer Ebel and Frisbie’s (1991:76) view that equivalence in scores is also an important aspect of reliability].
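The distinction can be illustrated with a small numerical sketch. The scores below are hypothetical (they are not taken from this study); the point is only that two raters can correlate perfectly in Oller’s sense while being far from equivalent in Ebel and Frisbie’s sense:

```python
# Hypothetical scores: rater B ranks the essays exactly like rater A,
# but marks every essay 20 points lower.
from math import sqrt

rater_a = [80, 70, 60, 50, 40]
rater_b = [60, 50, 40, 30, 20]

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

mean_gap = sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(f"correlation (Oller's sense): {pearson_r(rater_a, rater_b):.2f}")   # 1.00
print(f"mean score gap (equivalence): {mean_gap:.0f} points")              # 20
```

On the correlational view the two raters are perfectly consistent; on the equivalence view they disagree by 20 marks on every essay, which is exactly the kind of disagreement that matters to the student.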

Method

The research was carried out at the University of Fort Hare, where English is the medium of instruction. The vast majority of the students are mother tongue speakers of Xhosa and South Sotho.

Data were collected from two sources at Fort Hare on two occasions, one year apart: 1. the English Department and 2. the Science departments (biology, zoology, geography and agriculture). The data from the English Department were collected from an EAP workshop on evaluation held in 1990. The following year I presented the same protocols used in the EAP workshop to a group of Science lecturers for evaluation. I then compared the evaluations of the English lecturers with the evaluations of the Science lecturers. The English group is dealt with first, followed by the Science group.

English group

The EAP evaluation workshop was held for the English Department in the second semester of 1990. The workshop consisted of 11 raters: 10 from the English Department and the workshop leader from another university. I also participated in the workshop, and am designated in this article as English Rater 3. Five of these English lecturers were literature lecturers who, to my knowledge, had little or no training in EAP.

The essays that were evaluated belonged to students from the University of the North West (formerly the University of Bophuthatswana) who had taken a course in Special English (SPEN) based on Murray and Johansen’s “EAP in South Africa Series” (1989), which is now used in several universities in South Africa. The purpose of the SPEN course was to help students to read and write for academic purposes. The following academic skills were taught in three consecutive sections:

1. Reading for a specific purpose to find out the answer to a question, and writing definitions, notes and summaries. These skills prepare the student to write an argument. (Book 1, “Reading for meaning”).

2. Writing for an academic audience, which consists of planning, drafting and editing. (Book 2, “Writing for meaning”).

3. Writing to improve, which focuses on lexis, grammar, structure and style. (Book 3, “Write to improve”).

The final exam essay of the SPEN course, which I deal with in this article, was a response to a question based on Appendix 6 of Murray and Johansen’s Book 3. The title of the essay question was: “Discuss how climatic changes brought about by the Greenhouse Effect are likely to affect the world’s plant and animal species.”

During the SPEN course, students studied Appendix 6 of Murray and Johansen (Book 3), which consisted of 16 readings on different aspects of the Greenhouse Effect. They were instructed to write mind maps and notes on the readings, and to think about the kind of questions that could be set in the exams. The idea was that students would transfer the reading and writing skills practised on other topics in the course work to the examination topic, namely, the Greenhouse Effect. The students had no previous teaching on the background material for the essay question, but were expected – after they had spent a large part of the SPEN course learning about organisational skills – to prepare for the exam during the latter part of the course by studying the 16 readings at the back of their textbook. In the exam they were expected to apply the study skills they had learned to the prepared material. There was only one question, which was presented for the first time at the exam. The fact that notes and reading material were not allowed into the exam room and that the question was not accompanied by any helpful texts needs to be taken into account, because, in my opinion, the absence of these supports made the exam very difficult for a first-year EAP student.

The EAP workshop consisted of the following procedures:

1. A discussion on the criteria of academic essay writing. The following main criteria were listed by the lecturers.

– Knowledge of topic/content/subject matter

– Clear expression

– Confident handling of material

– Argumentation

– Appropriate selection of subject matter (content)

– Cohesion/coherence/clarity.

2. In the main part of the workshop, each rater (lecturer) was presented with photocopies of four protocols, one from each of four students, and was given approximately half an hour to evaluate them. Raters had ample time to study the background reading material to the topic (the readings in Murray and Johansen’s Book 3), which was provided prior to the workshop.

3. Each of the 11 raters provided a score for each of the four protocols; 44 scores in all.

4. Each rater supplied reasons for their respective scores.

I deal mainly with Protocol 1 (see the appendix for a copy of Protocol 1), but also deal briefly with Protocol 2. I have only selected the opinions of seven of the 11 English raters on Protocol 1, because four of them did not make direct comments on this particular protocol.

Science Group

I gave 18 Science lecturers copies of two protocols (Protocols 1 and 2) for their evaluation, as well as instructions containing guidelines on the criteria to be evaluated. I received eight responses, and the data of these eight lecturers are used in this article. The lecturers were requested to comment on the following: 1. content, 2. organisation, and 3. language.

The Science lecturers were not informed that the protocols belonged to EAP students, and assumed that the essays were written by first-year Science students. The instructions to the Science lecturers did not contain the background information that was supplied to the English workshop group. The Science group, therefore, did not know whether the students had previously been exposed to the background knowledge of the examination essay through explicit teaching or through their own private research. This lack of information should not have had any significant effect on their judgements, although it is possible that the scores could have been affected. For example, if a rater judged the student to be hopeless on all three criteria, but was aware that the student had not been taught the material beforehand, the rater would be likely to be more lenient and accordingly award a higher score. But I emphasise that we must keep judgements and scores as two distinct issues, as I shall show shortly.


Results

The table below shows the scores and summaries of the judgements of the English group (seven raters) and Science group (eight raters) on Protocol 1. The three criteria are content/topic, organisation and language. The blank spaces indicate that a rater (lecturer) has apparently ignored a specific criterion. This could either mean that little or no importance was attached to this criterion or that raters who left blank spaces thought that the criteria overlapped.

[Table not reproduced here: scores and summarised judgements of the seven English raters and the eight Science raters on Protocol 1, under the criteria of content/topic, organisation and language.]

Discussion

With regard to the judgements and scores of both groups, the following observations are important:

1. Of the 14 raters who commented on organisation, seven raters (two English, five Science) were positive, and seven raters (four English, three Science) were negative. Thus within the two groups combined there is an equal split in opinion; seven positive and seven negative.

2. Raters emphasise different criteria, and apparently ignore others. If a rater comments on either organisation or content but not both, this does not necessarily mean that the rater ignores one or the other, but perhaps that content is implicit in organisation. The reason is that it is difficult to separate content from organisation. Nor do I think that one can separate language (as defined above) from the other two criteria, except in Du Toit and Orr’s superficial sense of language as “clothes covered with food stains and dirt, with perhaps a few missing buttons” (Du Toit & Orr, 1987:199).

3. Similar scores between raters do not necessarily mean similar judgements. For example, English Raters 2 and 6 have the same score (58%) but opposite judgements concerning the strength of argument (organisation) of the essay. They agree that the content is on the topic, which strengthens my argument: if they agree on content, it is possible to isolate the criterion that is causing the disagreement (i.e. organisation), and thus to infer that radically different views on organisation (strength of argument) can yield identical scores.

4. If similar scores between raters do not necessarily mean similar judgements (as in 3 above), it is also true that different scores between raters do not necessarily mean different judgements. For example, English Raters 3 and 6 have radically different scores (80% and 58%) but similar judgements. Here is an example from the Science group: Science Raters 2 and 5 have radically different scores (30% and zero respectively) yet similar judgements on organisation and language. Science Rater 5 said the content was “poor”. I would think that Science Rater 2, who does not say anything about content, thought along similar lines.
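The arithmetic behind observations 3 and 4 can be sketched as follows. The component marks and weights below are hypothetical, invented purely for illustration (they are not the workshop data), but they show how identical totals can conceal opposite judgements, and how identical judgements can produce different totals once raters weight the criteria differently:

```python
# Hypothetical component marks and weights, for illustration only.
criteria = ("content", "organisation", "language")

def total(marks, weights):
    return sum(marks[c] * weights[c] for c in criteria)

equal_weights = {c: 1 / 3 for c in criteria}

# Two raters who judge organisation very differently...
rater_x = {"content": 70, "organisation": 40, "language": 64}
rater_y = {"content": 55, "organisation": 75, "language": 44}
print(round(total(rater_x, equal_weights)), round(total(rater_y, equal_weights)))  # 58 58

# ...while identical component judgements, weighted differently, diverge.
shared = {"content": 85, "organisation": 80, "language": 40}
language_heavy = {"content": 0.2, "organisation": 0.2, "language": 0.6}
print(round(total(shared, equal_weights)), round(total(shared, language_heavy)))   # 68 57
```

In other words, an aggregate score can agree while the underlying judgements do not, and shared judgements can diverge in the totals once the relative weighting of the criteria shifts.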

5. Some of the descriptions are vague; for example, it is not clear whether comments such as “General and broad” (English Rater 4) and “A few major errors” (English Rater 6) are favourable or unfavourable judgements. In the case of the latter example, it is possible that English Rater 6, who is a second language speaker and sometimes makes a few grammatical faux pas in conversation with his colleagues, has confused “There are a few major errors” with “There are few major errors”. The first remark is negative, the second positive.

6. What is of ultimate interest is the different academic perspectives behind these judgements. In this regard, I quote English Rater 2 (verbatim from the videotape of the workshop; hence the conversational style).

On the topic, but argument seems to be weaker. It didn’t have the strength of argument in it. It is effect-centred rather than cause-and-effect centred. Some bits of logic in paragraphs 1 and 2 I could follow – incoherence.

English Rater 2 maintains that Protocol 1 is more a literary discourse (i.e. “effect-centred”) than a scientific discourse (i.e. “cause-and-effect centred”), where the emphasis in scientific discourse, he claims, should be on logic. This implies that literary discourse does not require logic. Rater 1 did make the point in the English workshop discussion that logic is also required in literary discourse. I would add that logic is not a unitary entity but rather consists of a cluster of diverse “logics”, just as discourse consists of a cluster of diverse “discourses”.

Contrast English Rater 2’s comment quoted above with English Rater 3’s (myself) comment on Protocol 1 below:

Is very well sorted out – the intro is there. In the next paragraph, the student actually discusses animals and plant reproduction. So he [she] takes one aspect, he takes reproduction of animals and plants and gives a general idea about it and then in the next paragraph we have supporting ideas; some animals produce their offspring and so on. So he is taking the reproduction side of animals and plants, and shows how that is affected by the Greenhouse Effect. In the next paragraph, he discusses the effects. He is getting more specific.

English Rater 2 is looking for the right things (e.g. logic) in EAP, but does not find them. His claim is that the student indulges in “effects” (he means literary devices) instead of the cause-and-effect of scientific discourse. On the contrary, Rater 3 (myself) finds little evidence of literary “effects” and much evidence of logical progression. In fact, academic discourse, whether literary (the creation of a literary work as well as the analysis of one) or scientific, requires logical thought.

It is important to note that the essay question was unaccompanied by any textual aids, which means that the student, during the EAP course, had to ingest, chew, digest and assimilate 16 diverse texts on global warming, some of them quite long and complicated, ranging from the causes of global warming to its effects on plants and animals. My judgement above (English Rater 3) takes this into account, as well as the fact that the student had no choice of question in the exam. For these reasons I thought that Protocol 1 deserved high marks. I concede that after considering (being swayed by?) the scores and judgements of the other participants in the English workshop, I thought that my 80% was a bit too generous, and that about 70% would have been a fairer score. Having conceded this point, I nevertheless do not share the general opinion of the Science group that the language of Protocol 1 was poor. On the contrary, I do not think that the quality of the language was spoiled by a few “missing buttons” (Du Toit and Orr above).

It is interesting that Rater 2’s main training is in literature, while Rater 3 is trained in Applied Linguistics as well as in the teaching of General Science. Of course, there is no necessary causal link between the two raters’ different judgements and their different training.

Here is a glaring example of different perspectives that shows the differences between a rater who has been trained in the hard sciences (e.g. biology and chemistry) and a rater who has been trained in the humanities (e.g. literature and philosophy). Consider the title of the essay again.

Discuss how climatic changes brought about by the Greenhouse Effect are likely to affect the world’s plant and animal species

To illustrate my argument I need to refer to Protocol 2, which I introduce into the discussion. Here is the relevant sentence in Protocol 2: “The danger of the use of these gases is the climatic changes which affect the plant,man and animal species.” The student devoted a large part of the essay to the effects of these gases on man.

I introduce English Rater 10 (the leader of the workshop), who was not used in the discussion of Protocol 1, but whose following comment on Protocol 2 is of interest. She comments: “At the end of the 2nd paragraph to last paragraph about people that was not really relevant.” Accordingly, English Rater 10 judged the student to be completely off the topic and awarded the student a score of 10%.

The Science raters had a different interpretation of this sentence. Three of them circled the word “man” in Protocol 2, and one of them commented in the margin of the student’s protocol that “man is an animal species”. These three Science raters did not think that it was necessary to distinguish between man and animal in this particular sentence, owing to the fact that in science “man” is subsumed under “animal”. The Science raters are not saying, as English Rater 10 does, that it is illegitimate for the student to include “man”. On the contrary, they think it is quite legitimate to do so. Their criticism is that it is incorrect in the biological sciences to say “man and animal” as if man were separate from the animal kingdom.

A (biological) scientist regards a human as an animal, whereas an English teacher (in this case, Rater 10) does not usually regard man as an animal. So, for an English teacher (e.g. our English Rater 10), if students specify the effects on “man” in this essay, they could be heavily penalised for writing off the topic. Recall English Rater 10’s score of 10% on this protocol.

7. The specialised background knowledge required was too difficult for first-year EAP students to learn (on their own). Even some of the Science raters differed, quite markedly in some cases, on the truth of the facts presented. The point is that this topic should not have been given to the SPEN students to study, especially without any assistance from the lecturer. The fact is that it is not possible to separate discourse skills from factual knowledge. Therefore it would not be correct, especially in EAP, to expect students to be able to transfer skills from one kind of text to another (e.g. a sociology text to a biology text) without taking into account the content knowledge of the different texts. With regard to the relationship between language skills and content knowledge, consider the following comments of two Science raters on Protocol 1:

Science Rater 6

Comment A is addressed to this researcher; comment B to the student.

A. `I’m far more concerned with the underlying processes than the detailed language problems. Grammar and spelling on borderline.’

B. `Your understanding of the processes is shaky. Vocabulary problems hinder your expression of ideas. Overall sequence of essay is sound.’

There are two problems here. The first is whether it is possible to have a sound sequence of ideas when there are vocabulary problems which hinder the expression (sequence) of ideas. The second problem is that it is not certain whether this rater regards vocabulary as part of language and/or as part of organisation.

The point here is that in comment A Science Rater 6 underplays language, revealing a lack of appreciation of the full meaning of academic discourse, which involves not only a knowledge of the subject matter and the ability to argue, but also academic language proficiency. However, EAP teachers, too, often pay scant attention to language, i.e. grammar, spelling and punctuation, which I think should be a crucial criterion of academic discourse.

Science Rater 8

Comment A is addressed to this researcher about Protocols 1 and 2; comment B to the Protocol 1 student.

A. Both essays show glaring spelling and grammatical errors. However as this is a science exercise and the student is presumably being tested on his knowledge of the subject, this should not affect his mark. However, if his ability to write is being judged, then the approach would be different and should be penalised.

Science Rater 8, in Comment A above, distinguishes between “ability to write” and “knowledge of the subject”. These distinctions are very vague. I am not censuring Rater 8, but rather pointing out the problems in trying to distinguish between the three criteria of content, skills and language.

And Comment B of Science Rater 8, which is directed to the student:

B. A good attempt to answer the question. You have related your answer to what is actually asked for.

If we compare Science Rater 8’s score of 70% with the scores of English Rater 5 and English Rater 3 (68% and 80%, respectively), we notice that both English raters thought the student’s language was good, whereas Science Rater 8 thought its errors were “glaring”, and yet did not take them into account.

Conclusion

The findings show that there is a wide range of scores and judgements within each group and between the two groups of raters, which affects the reliability of the scores. This wide range of opinion shows the difficulties in the testing/teaching of academic discourse, where EAP teachers, if not ignorant of the double connotation of the term academic (scientific) discourse, often do not consider this double connotation to be a problem. The radically different scores and judgements within the Science group are also cause for concern. It is noteworthy that seven of the eight Science lecturers gave negative comments about language in Protocol 1. In my view the language of Protocol 1 was good, and was not significantly affected by a few “missing buttons”.

The fact remains that to ensure high interrater reliability there should be only a narrow range of scores and judgements between raters. This article has discussed the reasons for the wide range of variability in terms of the three criteria of language, organisation and topic. With regard to these criteria, the following observations stand out in the evaluations:

1. Language. All the Science lecturers, except one, think the language is bad. Of the four English lecturers who gave their opinions on language, three of them thought that the language was good.

2. Organisational skills. English lecturers as well as Science lecturers are radically divided on the issue.

3. Content/Topic. The science content is a problem for EAP lecturers. However, the Science lecturers, in several instances, also do not agree on the accuracy of the content in the protocols.

Scientists as well as EAP teachers can help one another to improve the quality of the teaching and testing of academic discourse. Both groups need to understand the fact that the successful evaluation of scientific discourse can only be achieved when one has an adequate understanding of the relationship between language, skills and content.

References

Alderson, J.C. & Clapham, C. 1992. Applied linguistics and language testing: A case study of the ELTS test. Applied Linguistics, 13(2):149-167.

Angelil-Carter, S. 1994. The adjunct model of content-based language learning. South African Journal of Higher Education (SAJHE), 3(2):9-14.

Bialystok, E.1978. A theoretical model of second language learning. Language Learning, 28(1):69-84.

Biggs, C. 1982. In a word, meaning. In: Crystal, D. (ed.). Linguistic controversies. London: Edward Arnold.

Bley-Vroman, R. 1990. The logical problem of foreign language learning. Linguistic Analysis, 20 (1-2):3-49

Bolinger, D. 1965. The atomization of language. Language, 41:555-573.

Botha, H.L. & Cilliers, C.D. 1993. Programme for educationally disadvantaged pupils in South Africa: A multi-disciplinary approach. South African Journal of Education, 13(2):55-60.

Bradbury, J., Damerell, C., Jackson, F. & Searle, R. 1990. ESL issues arising from the “Teach-test-teach” programme. In: Chick, K. (ed.). Searching for relevance: Contextual issues in applied linguistics. South African Applied Linguistics Association.

Brown, K. 1984. Linguistics today. Suffolk: Fontana Paperbacks.

Carmines, E.G. & Zeller, R.A. 1979. Reliability and validity assessment. Beverly Hills, California: Sage Publications.

Child, J. 1993. Proficiency and performance in language testing. Applied Linguistic Theory, 4(1/2):19-54.

Chomsky, N. 1965. Aspects of the theory of syntax. Cambridge, Massachusetts: M.I.T. Press.

Du Toit, A. & Orr, M. 1989. Achiever’s handbook. Johannesburg: Southern Book Publishers.

Ebel, R.L. & Frisbie, D.A. 1991. Essentials of educational measurement. (5th ed.) Englewood Cliffs, New Jersey: Prentice Hall.

HSRC. 1981. Language teaching: Report of the Committee of University Principals. Pretoria: Human Sciences Research Council (HSRC).

Hutchinson, T. & Waters, A. 1987. English for specific purposes: A learning-centred approach. Cambridge: Cambridge University Press.

Hughes, A. 1989. Testing for language teachers. Cambridge: Cambridge University Press.

Lamb, S.M. 1984. Semiotics of language and culture. In: Fawcett, R.P., Halliday, M.A.K., Lamb, S.M. & Makkai, A. 1984. The semiotics of culture and language, Vol. 2. London: Frances Pinter (Publishers).

Langacker, R. 1987. Foundations of cognitive grammar, Vol. 1: Theoretical preliminaries. Stanford: Stanford University Press.

Leech, G. 1981. Semantics. Harmondsworth, Middlesex: Penguin.

Mcintyre, S.P. 1992. Language learning across the curriculum: A possible solution to poor results. Popagano, 9-10 June, Mmabatho.

Mandler, J.M. 1984. Stories, scripts and scenes: Aspects of schema theory. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Mathews, J.H. 1964. The need for enrichment of concept and methodology in the study of reading. In: Thurstone, E.L. & Hafner, L.E. (eds.). New concepts in college-adult reading: Thirteenth yearbook of the National Reading Conference. Milwaukee, Wisconsin: The National Reading Conference, Inc.

Murray, S. 1990. Teaching English for Academic Purposes (EAP) at the University of Bophuthatswana. South African Journal of Higher Education, Special Edition.

Murray, S. & Johanson, L. 1989. EAP series for Southern Africa. Randburg: Hodder & Stoughton.

Porter, D.1983. Assessing communicative proficiency: The search for validity. In: Johnson, K. & Porter, D. (eds.). Perspectives of communicative language teaching. London: Academic Press, Inc.

Snow, M.A. & Brinton, D.M. 1988. The adjunct model of language instruction: An ideal EAP framework. In: Benesch, S. (ed.). Ending remediation: Linking ESL and content in higher education. PLACE: PUBLISHER.

Spack, R. 1984. Invention strategies and the ESL college composition student. TESOL Quarterly, 18(4):649-670.

Starfield, S. 1990. Contextualising language and study skills. South African Journal of Higher Education (SAJHE), Special Edition.

Starfield, S. & Kotechka, P. 1991. Language and learning: The Academic Support Programme’s intervention at the University of the Witwatersrand. Paper presented at the South African Applied Linguistics Association Conference, July.

Taylor, J.R. 1989. Linguistic categorization. Oxford: Oxford University Press.

Vollmer, H.J. 1983. The structure of foreign language competence. In: Hughes, A. & Porter, D. (eds.). Current developments in language testing. London: Academic Press.

Young, D. 1987. A priority in language education: Language across the curriculum in black education. In: Young, D. & Burns, R. (eds.). Education at the crossroads. Rondebosch: University of Cape Town.

Appendix

Title of essay

“Discuss how climatic changes brought about by the Greenhouse Effect are likely to affect the world’s plant and animal species.”

Protocol 1

Climatic changes are true result of the greenhouse gases. These gases are concentrated in the atmosphere and they block the solar heat that travels back to space. They cause climatic changes and even changes in pattern of rainfall. These gases are carbon-dioxide, methane, chlorofluorocarbons, nitrous oxide and water vapour. The consequence of this gases also caused disastrous events like flood, drought and hurricanes.

The rise in temperatures are likely to affect the manner in which plants and animals reproduce. It will disrupt the way in which plants flowers and fruit. In dry seasons, some plants might not be able to fruit. This will cause shortage of food to animals and they eventually die because of hunger. Some of the plants do not grow under extremely high temperatures this will also lead to the destruction of the forest trees.

Some animals reproduce their offsprings according to how the climate is. The turtle for example, produce females when it is warmer and males when it is cooler. If the climate has changed, it will bring about the imbalance of sexes among certain animals. Animals like elephants alter their behaviour according to how wet or dry it is. If it is wet they gather together and the superior bull have a tendency to mate with all females thus transmitting its superior genes to those females. When it is dry they scatter and begin to live in swampy areas. The subordinate bull can now mate with female and transmitting its inferior genes.

The first dangerous effect of climatic change is over the Arctic. The sea ice of the Arctic is essential for animals that live and feed of it, e.g. walruses and the animals that migrate across it. Animal like the corals live near the coast of the seas. If the sea-level rises, the corals might not be able to cope up and this will bring to the total destruction of the corals.

According to the scientists warming is likely to occur more at latitudes near the poles but the tropics which are already hot will warm slightly.. Many animals will migrate to the north where they will not experience severe greenhouse effect e.g. prevailing winds, drought, hurricanes and floods. People living in the north will be able to have more crop yield. And they will be able to help people who are experiencing the changing of the climate.

In order to stop this, people should stop producing gases that contribute to the greenhouse. Trees should be planted in order to reduce certain amount of carbon dioxide. They should introduce a special tax for the emission of the carbon dioxide. In order to eliminate warming that can cause climate change deforestation and pollution should be stopped. The production should be stopped. People should make use of the low CO2 emitting sources.


Rater Reliability in Language Testing

Plenary presentation
Rater Reliability in Language Testing
Dr Raphael Gamaroff (Assistant Professor of English, Abu Dhabi University, Al Ain)

The 5th Annual English Language Teaching Conference, 25 March 2004

Abstract

Reliability is concerned with how we measure; validity is concerned with what we are supposed to measure, i.e. with the purposes of a test. In this paper, I concentrate on reliability; however, it is often difficult to discuss reliability without bringing validity into the picture.

The reliability of a test is concerned with the accuracy of scoring and the accuracy of the administration procedures of the test.

In this paper the following key aspects of reliability are dealt with:

– Facets: These refer to such factors as the (1) testing environment, e.g. the time of testing and the test setting, (2) test organisation, e.g. the sequence in which different questions are presented, and (3) the relative importance of different questions and topic content. Facets also include cultural differences between test takers, the attitude of the test taker, and whether the tester does such things as point out the importance of the test for a test-taker’s future.
– Features or conditions: These refer to such factors as clear instructions, unambiguous questions and items that do or do not permit guessing.
– The manner in which the test is scored. A central factor in this regard is rater reliability, i.e. the agreement between raters on scores and judgements related to a particular test.  Rater reliability (consistency) becomes a problem mainly in the kind of tests that involve subjective judgements such as essay tests.

The bulk of the paper will focus on rater reliability.

A major obstacle in test development has been the lack of agreement on what it means to know a language, which affects

WHAT aspects of language knowledge are tested; and
HOW these aspects are tested.

Validity and reliability are closely connected. Validity has to do with WHAT you are measuring, reliability with HOW. The WHAT of testing (validity) deals with the knowledge and skills that we are testing/measuring.

The HOW in reliability refers to THREE main things.

The first two are intrinsic to the test: (1) the form(s) of the test and (2) the conditions under which it is taken (intrinsic reliability); the third is extrinsic to the test and to the test conditions.

The form of the test. There are several things to consider here:

clear instructions, unambiguous questions and items that do not permit guessing.
the sequence in which different questions are presented
the relative importance of different questions
if you give two different forms of a test to two different individuals, or groups, both forms of the test should be testing the same things and be of the same degree of difficulty.
Testing conditions. A few examples:
– testing environment, e.g. the time of testing and the test setting
– if test takers in one test group are allowed to talk to one another during a test, and another group is not allowed to do so, this will affect the reliability of the test.

Rater (examiner) reliability.
The manner in which the test is scored. A central factor in this regard is rater consistency (reliability), i.e. the agreement between raters on scores and judgements related to a particular test.  Rater consistency becomes a problem mainly in the kind of tests that involve subjective judgements such as essay tests.

In the rest of this paper, I focus on rater reliability.

__________________________________

Scores can only be reliable if you know WHAT you are measuring/marking.

But, if you know what you are measuring, it does not follow that you will automatically measure consistently/accurately/reliably.

So, the teacher has to know his/her subject well, but must also know the problems in assessing a student’s work.

Rater reliability is EXTRINSIC (external) to the test itself, because it doesn’t deal with things such as the forms of the test or the conditions under which they are done. It deals with the way tests are marked.

There are two kinds of rater reliability:

Intrarater reliability and
Interrater reliability.

In intrarater reliability we aim for consistency within the rater. For example, if a rater (teacher) has many exam papers to mark and doesn’t have enough time to mark them, he or she might take much more care with the first, say, ten papers, than the rest. This inconsistency will affect the students’ scores; the first ten might get higher scores.

Interrater reliability

Earlier I mentioned the pressure of time a rater (teacher) may have when marking. There are many other kinds of pressures that teachers may have to endure in the way they mark exams and tests. One of these may be the need to equip as many people as possible to join the real world outside of school.

Besides the pressures that teachers may have to endure in their testing procedures, there are other factors that affect the ability of a teacher to be OBJECTIVE, i.e. to see things as they are, rather than to be SUBJECTIVE, i.e. to see things as one wants/needs them to be.
Teaching and assessment belong to the human sciences, which are more subjective than the natural sciences. It is much easier for teachers to agree that water boils at 100 degrees centigrade at sea level than that a student in “C” level is stupid or cheeky.

It is also much easier to agree on the correctness of a grammar item (objective part) than on the quality of a composition (subjective whole).

Language is meant for real-life, i.e. for communication.
Examples of communication are:
1. Writing a letter, or a composition;
2. Speaking to or listening to someone;
3. Reading a book or a newspaper.

Grammar is not communication; it provides the basic building blocks of communication.
It is easier to accurately measure grammatical knowledge (grammatical competence) and harder to measure language use (communicative competence).

To return to our two terms OBJECTIVE and SUBJECTIVE, grammatical competence is more objective, while communicative competence is more subjective.

For example, most grammar items are either right or wrong, whereas it is much more difficult to make judgements about large chunks of language such as a composition. There is often a wide range in judgements and scores between different teachers (raters).

This means that grammar scores are often accurate, i.e. reliable, but that composition scores are often inaccurate, i.e. unreliable.
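This contrast can be shown with a minimal sketch. The marks below are invented (they loosely echo the range of scores reported in the first article above): five raters mark the same objective grammar item and the same composition, and the spread of their marks serves as a rough index of scoring consistency.

```python
# Hypothetical marks from five raters; the spread (standard deviation)
# is used here as a rough index of how consistently an item is scored.
from statistics import mean, stdev

grammar_item = [1, 1, 1, 1, 1]         # right/wrong item: the raters agree
composition  = [40, 55, 58, 68, 80]    # holistic marks: the raters diverge

print(f"grammar item: mean={mean(grammar_item)}, spread={stdev(grammar_item):.1f}")
print(f"composition:  mean={mean(composition):.0f}, spread={stdev(composition):.1f}")
```

On this picture the grammar item is scored with perfect consistency while the composition marks spread widely, which is the sense in which grammar scores are said to be more reliable.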

To sum up

The greater the validity of a test, the more it reflects language use. Language use is more difficult to mark than grammar, and the more difficult the marking, the lower the reliability.

Therefore

Grammar tests are more reliable but less valid

AND

Composition tests are less reliable but more valid.

We must be careful when we say that grammar scores are much more reliable (accurate) than composition scores.

If the test is a multiple choice test where the answers are provided by the test maker, teachers do not need to decide on the accuracy of the grammar item. So, scoring will be more reliable.

However, where the answers to grammar items are not provided, marking can be unreliable. It depends on the grammatical knowledge of the teacher.

To repeat what was said earlier

Reliability = Accuracy of HOW we assess/test

Validity = WHAT we assess/test

MORE reliability
=
LESS validity

The easier it is to measure something the less complex it is. Language use is more complex than grammar; emotions are more complex than motions.

Composition writing is much more difficult to mark than grammar items because composition is much more authentic/real-life than grammar. In composition we need to look at things like cohesion and coherence, which have to do with how sentences and parts of sentences come together to create paragraphs and longer stretches of discourse.

For this reason, composition writing (real-life language, i.e. language use) gives one a better understanding of a person’s language competence.

Reliability in composition writing is a more fruitful field of study than isolated grammar items, because in composition we find all the three important components of language use, namely:

1. Content, 2. Organisation, 3. Language.

Conclusions and Recommendations

The problem of rater reliability is how to be as fair as possible in the allocation of scores and judgements. The problem of rater reliability in assessment seems to take precedence over all other issues in testing. This is understandable because assessment is the last and most crucial stage in the teaching programme. (Most learners only protest about poor teaching, unclear exam questions, etc. if they fail).

Recommendations

1. At the beginning of an academic year, all the teachers in the department can rate one student’s assignment and discuss the criteria they used and the marks they awarded. This exercise, done repeatedly over a period of time, gradually increases interrater reliability.

2. Before a major test or an exam, questions set by individual teachers can (indeed must) be discussed (formally and/or informally) by the whole department in terms of:

– clarity of the questions
– length of questions, and
– number of marks to be awarded for specific questions or sections.

3. The person who prepares the question should also give a memorandum to others in the department to demonstrate the criteria by which answers will be evaluated.

If one were to follow these procedures, one would hope that objectivity would be increased. Yet it seems that even if one does consult with colleagues (as would be the case with literary or music or culinary critics), hard problems remain. What is worrisome in the assessment of written output is that, in spite of discussions and workshops on establishing common criteria, there remain large differences in the relative weight raters attach to the different criteria, e.g. linguistic structure, content and organisation.

But one should not stop trying to improve reliability, just as one should not stop trying to be as objective as possible in all the other facets of our lives, especially in our judgements of others.

Psychometrics and Reductionism in Language Assessment

Author: Raphael Gamaroff

Much of the content of this unpublished article appears in “Paradigm lost, paradigm regained: Statistics in language testing”, on this WordPress site.

Abstract

Choosing a research paradigm

Opposition to psychometrics

Norm-referenced tests

What did the others get?

Conclusion

References

Abstract

The basic problem in language assessment remains how to assess individual differences within language-specific abilities. In modern democracies, psychometrics in language assessment, and in educational assessment in general, is eschewed by many, and an “ethnographical”, or “naturalist”, method of assessment is preferred. This paper discusses the opposition to psychometric (statistical) assessment in testing, with special reference to language testing, and defends the use of psychometrics. It is argued that psychometric methods and ethnographical, or “naturalist”, methods of assessment both have a crucial, complementary and valuable role to play. The history of the quantitative/qualitative controversy can be viewed from two diametrically opposite angles: (1) qualitative research has been dominated by quantitative research for many decades and is only in recent years becoming accepted as a legitimate scientific approach; or (2) qualitative research has for more than two decades been challenging quantitative methods and setting itself up as the only legitimate form of research.

Choosing a research paradigm

Philosophy and science are saddled with two contrasting paradigms: the empiricist/objective/reductionist paradigm and the ethnographical/subjective/holistic paradigm. The first paradigm, the “standard account”, involves putting questions directly to Nature and letting it answer: the paradigm of empiricist, or normal, science and the Age of Enlightenment. This paradigm is based on three assumptions: (i) naive realism, i.e. the reality of objects is separate from observation; (ii) the existence of a universal scientific language; and (iii) the correspondence theory of truth, i.e. propositions about the world are true if they correspond to what is out there; theories about the world must be inferred from observation. An alternative paradigm, the “seamless web”, provides different answers to those offered by the first paradigm. This alternative paradigm has various sectarian aliases: naturalistic, inductivist, postpositivistic, ethnographical, phenomenological, subjective, qualitative, hermeneutic, humanistic and actor-network. The “seamless web” protagonists accuse reductionists of tearing things from their context.

The kind of assessment procedure one uses is a window onto what abilities are valued and rewarded (Rowntree, 1977:1). I shall argue that it is not only possible to reconcile the two antagonistic paradigms described above but also necessary to do so, if we wish to discover any truth about language assessment, and assessment in general.

According to Nunan (1992:20),

[u]nderpinning quantitative research is the positivistic notion that the basic function of research is to uncover facts and truths which are independent of the researcher. Qualitative researchers question the notion of an objective reality.

Opposition to psychometrics

The increasing number of studies in a purely “ethnographical/sociolinguistic approach to language proficiency assessment” is witness to the opposition to the “positivistic” notion of “quantitative research” (Nunan above).

The opposition to psychometrics is closely connected to the opposition to “reductionist approaches to communicative competence” (Lantolf & Frawley, 1988:182).

Spolsky threw decorum to the wind and referred to “psychometrists” as “hocus-pocus” scientists. For Spolsky, psychometrics was no more than sleight-of-hand psychometry. Spolsky’s recent “postmodern” approach to psychometrics is that it should be used in conjunction with “humanist” approaches (such as those of Lincoln and Guba, and Groddeck described above).

There is another reason – less philosophical than the reasons given above – why psychometric measurement is eschewed in language testing and in language research in general, namely that most language teachers (and many language researchers) have a poor knowledge of language testing and educational measurement and are consequently “metrically naive” (Stevenson, 1985:112; Bonheim, 1997). Yeld (1987:78) speaks of those “who have not been trained in the use of techniques of statistical analysis and are suspicious of what they perceive as ‘number-crunching’” and for this reason prefer “face validity”. There is also a certain fear of the objectivity of numbers – of not getting (or of not being seen to be getting) them right: mistakes of judgement are much easier to detect in (“objective”) quantitative assessment than in (“subjective”) qualitative assessment.

Norm-referenced tests

The main problem in assessment is how to assess individual people. The individual needs the norm and the norm needs the individual: one without the other is an abstraction from social reality.

Norm-referenced tests can be distinguished from criterion-referenced and individual-referenced tests (Ur, 1996:245-246):

1. Norm-referenced tests are concerned with how well an individual performs compared with the group of which he or she is a member. This is traditional psychometric testing.

2. Criterion-referenced tests are concerned with how well an individual performs relative to a fixed criterion, e.g. how to ask questions. This is what Cziko (1982:27) calls “edumetric” testing.

3. Individual-referenced tests are concerned with how individuals perform relative to their previous performance or to an estimate of their ability.

The emphasis in this discussion is on norm-referenced tests. Norm-referenced tests are important because without data on the variance between individuals within a group, it is not possible to separate what an individual knows (which is the concern of criterion-referenced tests) from what other people know. Individual-referenced tests, too, cannot be separated from what other people know.
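As a minimal sketch of the three orientations, consider the following invented numbers (the class scores, the cut-off and the learner’s marks are hypothetical, not taken from any study):

```python
# The same raw mark read in three ways: against the group norm, against
# a fixed criterion, and against the learner's own previous performance.
from statistics import mean, stdev

group_scores = [45, 52, 58, 63, 70, 74, 80]   # hypothetical class results
learner_now, learner_before = 70, 62
pass_criterion = 60                            # fixed cut-off score

# Norm-referenced: where does the learner stand relative to the group?
z = (learner_now - mean(group_scores)) / stdev(group_scores)
print(f"norm-referenced: z-score {z:+.2f} relative to the group")

# Criterion-referenced: has the learner reached the fixed standard?
print(f"criterion-referenced: {'pass' if learner_now >= pass_criterion else 'fail'}")

# Individual-referenced: has the learner improved on his or her own record?
print(f"individual-referenced: change of {learner_now - learner_before:+d} points")
```

Even the criterion-referenced and individual-referenced readings lean on group experience in practice, since the cut-off and the expectation of improvement are themselves set with past groups in mind, which is the point Rowntree makes below.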

Rowntree (1977:185) explains the importance of the norm in assessment:

Consider a test whose results we are to interpret by comparison with criteria. To do so we must already have decided on a standard of performance and we will regard students who attain it as being significantly different from those who do not…The question is: How do we establish the criterion level? What is to count as the standard? Naturally, we can’t wait to see how students actually do and base our criterion on the average performance of the present group: this would be to go over into blatant norm-referencing. So suppose we base our criterion on what seems reasonable in the light of past experience? Naturally, if the criterion is to be reasonable, this experience must be of similar groups of students in the past. Knowing what has been achieved in the past will help us avoid setting the criteria inordinately high or low. But isn’t this very close to norm-referencing? It would even be closer if we were to base the criterion not just on that of previous students but on students in general.

What is occurring in South Africa is an effort to downplay psychometric measurement, which is linked to the resistance to the unpopular notion of the one-off (norm-referenced) test and to the preference for process-oriented measures (Mclean, 1996:48; Docking, 1994:15). Docking (1994:15) contrasts the “rigorous and detailed management of competency development” with the “‘loosely’ defined evidence which is ‘doctored’ and legitimated through statistical procedures on the other (traditional teaching and assessment).” I find it hard to understand how one can establish any principles of testing without some – indeed, a large – recourse to norms, no matter how “process-oriented” the task is claimed to be. And norms imply psychometric measurement.

What did the others get?

Psychometrics has much to do with “context”, which the “postpositivist”, or “naturalist”, paradigm claims to be absent in the “positivist” paradigm. In naturalistic inquiry “realities are wholes that cannot be understood in isolation from their contexts” (Lincoln and Guba, 1985:39). Thus, owing to different contexts and interactions, generalisations should be made with caution, if at all. The emphasis, “naturalists” argue, should be on time-bound and context-bound working hypotheses (“idiographic” statements) rather than time-free and context-free generalizations (“nomothetic” statements). The point is that one cannot make idiographic statements without reference to nomothetic statements, i.e. the individual and the group become abstractions when isolated from each other. This is no less true in psychometrics. A simple example: if my daughter comes home and tells me that she got 80% for a test, a predictable contextual question would be: “What did the others get?” Further, what is a whole if not a bit of a larger whole, and what is a bit but a whole of a smaller bit? In other words, the notion of a gestalt is a relative term and therefore can only exist in terms of something (a context) greater and smaller than itself. The paradox of knowledge is that it is impossible to understand the bits – of culture, theory, language, etc. – unless we understand how the whole fits together in its function; and, without an understanding of the structure of the discrete bits, we won’t be able to understand how the whole works (Rorty, 1980:319).

Psychometrics, like language, is “fictional” in the sense of being evocations-representations-constructions of reality. But neither statistics nor language is a deliberate error or a lie. We try our best to measure with and up to the brains we have been given.

Conclusion

Weir (1993:68) believes that there is a more pressing need for research in formative testing, i.e. in process-oriented methods, than for research in summative testing, i.e. “quantitative summaries” (Messick, 1987:3). I believe that there is still a pressing need for research in summative testing, even though mainstream language testing, certainly in South Africa, is taking a different turn. I suggest that the rejection of psychometric measurement in South Africa in the name of restoring individuality to learning is misguided and is consequently having a negative influence on education in South Africa.

Having said that, there is no doubt that “true ethnography demands as much training skill” (Nunan, 1992:53) as psychometric measurement. It is also true that it is much easier to be proved wrong in a psychometric judgement than in an ethnographic one. This could be one reason for avoiding psychometric research.

What is important is that psychometricians and ethnographers both realise that each has a crucial – and complementary – contribution to make to the human sciences, where the “accumulation of data is at best the humble soil in which the tree of knowledge can grow” (Lorenz, 1969:77).

Being mindful of Hesse’s (1980:5) caveat that “if all theories are dangerous and likely to be superseded, so are the present theories in terms of which the inductivist judges the past”, we need to be humble in any claims we may have to ultimate truth (a good example was set by Spolsky [1995] in his dilution of his strong negative attitude towards psychometrics mentioned earlier) because the search for truth is a never-ending path towards understanding and stability of meanings, which is indispensable for individual freedom and social equilibrium.

According to Lincoln and Guba (1985:114) generalizations presuppose facts, but there is no necessity that only one generalization must emerge: “There are always (logically) multiple possibility generalizations to account for any set of particulars, however extensive and inclusive they may be” (Lincoln & Guba, 1985:114). Yet, if this were so, there could be no knowledge, therefore no stability of meanings, because ambiguity would be the normal attribute of language. If theories could mean anything, we wouldn’t have unhappy theories that lead to disagreements and misunderstandings. All we’d have is happy hot air.

References

Bonheim, H. Language Testing Panel, European Society for the Study of English (ESSE) conference, (Debrecen, Hungary, September 1997).

Docking, R. “Competency-based curricula – the big picture”, Prospect, 9,2(1994):15.

Hesse, M. Revolutions and reconstructions in the philosophy of science, (Bloomington: Indiana University Press, 1980).

Lincoln, Y.S. and E.G. Guba, Naturalistic enquiry, (Newbury Park, California: Sage Publications, 1985).

Lorenz, K. “On the biology of learning”, in J. Kagan, On the biology of learning. (New York: Harcourt, Brace and World, Inc., 1969).

Mclean, D. “Language education and the national qualifications framework: An introduction to competency-based education and training”, in HSRC. Language assessment and the National Qualifications Framework, (Pretoria: Human Sciences Research Council Publishers, 1996).

Nunan, D. Research methods in language learning, (Cambridge, New York: Cambridge University Press. 1992).

Rorty, R. Philosophy and the mirror of nature, (Princeton: Princeton University Press, 1980).

Rowntree, D. Assessing students: How shall we know them, (London: Harper and Row, Publishers, 1977).

Spolsky, B. Measured words, (Oxford: Oxford University Press, 1995).

Stevenson, D.K. “Pop validity and performance testing”, in Y. Lee, A. Fok, R. Lord and G. Low (eds.), New directions in language testing, (Oxford: Pergamon. 1985).

Ur, P. A course in language teaching: Practice and theory, (Cambridge: Cambridge University Press, 1996).

Weir, C.J. Understanding and developing language tests, (London: Prentice Hall, 1993).

Yeld, N. “Communicative language testing and validity”, Journal of the South African Association of Language Teaching (SAALT), 21,3(1987):78.

Paradigm lost, Paradigm regained: Statistics in Language Testing

Journal of Language Teaching, 31 (2), 131-139, 1997.

Author: Raphael Gamaroff

ABSTRACT

1. INTRODUCTION

2. “OLD PARADIGM” VERSUS “NEW PARADIGM” RESEARCH

3. NEGOTIATING THE TASK-DEMANDS

4. CONCLUSION

REFERENCES

Abstract

The main issue in educational testing is how to measure individual differences within language-specific abilities and academic abilities accurately, i.e. how to recognise performance, which has to do with the setting of valid standards. Valid standards should be concerned with fulfilling the relevant purposes of education. In South Africa the use of statistics in evaluation is often linked to the oppression of the disadvantaged. This view seems to be gaining influence among academics and policy makers in South Africa, and for this reason the importance of statistical methods in evaluation needs serious reconsideration in terms of relevance. It is on the issue of relevance that people differ. This is the reason why educational issues in the context of evaluation (e.g. admission tests, placement tests and promotion tests) are beginning to play second fiddle to the more imperious need for sociopolitical transformation. Much is at stake in testing, where evaluations have to be made by human beings of other human beings, where judgements (often the occasion, if not the cause, of much distress) have to be made about whether somebody should be admitted to an education programme or to a job, or promoted to a higher level. Within the sociopolitical and multi-lingual-cultural-racial-ethnic context of South Africa, these judgements assume an intense poignancy.

1. INTRODUCTION

Language testing is closely related to one’s theory of what language is, which in turn is closely related to one’s theory of how languages are learnt. Thus in order to answer the question “what are we testing?”, we need to answer the question “what is being learnt?”. And to answer the question “what is being learnt?”, we also need to ask “what are we testing?”

[In order to] arrive at a greater specificity [of language proficiency], it will now be advantageous to look at the issue from the point of view of the field that is most directly concerned with the precise description and measurement of second language knowledge, namely second language testing (Spolsky, 1989:59; my square brackets).

An important reason why endeavours are made to improve learning and teaching is in order to improve performance on tests, which is not the same thing as teaching to the test. In the former, one is concerned with improving the ability to perform one’s competence; in the latter, one is merely concerned with the ability to “perform” (a test).

A major part of testing is its measurement. Owing to the fact that statistics in educational measurement is such a controversial issue, it is necessary to consider rigorously and dispassionately the value of statistics in evaluation, keeping in mind that one of the major challenges in the improvement of education is the creation of a more appropriate and effective system of evaluation (King and Van Den Berg, 1993:207), or to use Rowntree’s (1977:1) term “assessment”:

If we wish to discover the truth about an educational system, we must look into its assessment procedures. What student qualities and achievements are actively valued and rewarded by the system? How are its purposes and intentions realised? To what extent are the hopes and ideals, aims and objectives professed by the system ever truly perceived, valued and striven for by those who make their way within it? The answers to such questions are to be found in what the system requires students to do in order to survive and prosper. The spirit and style of student assessment defines the de facto curriculum.

What the system “requires students to do” (Rowntree above) is what validity is concerned with; in other words, with the purpose of (test) behaviour. I adopt the view of validity as a “unitary concept that describes an integrated evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1988:2).

What is occurring in South Africa is an effort to downplay statistical (psychometric) measurement, which is linked to the resistance to the unpopular notion of the one-off test. Many language researchers and psychologists oppose the use of statistics (Spolsky, 1978,1985; Macdonald, 1990,1990a). This opposition, I believe, is having a negative influence on educational policy in South Africa. I shall argue that this opposition to psychometric measurement has not been reasoned out in a cogent way.

According to Nunan (1992:20), “underpinning quantitative research is the positivistic notion that the basic function of research is to uncover facts and truths which are independent of the researcher.” Qualitative researchers question the notion of an objective reality.

As Rist (1977:43) asserts: “ultimately, the issue is not research strategies, per se. Rather, the adherence to one paradigm as opposed to another predisposes one to view the world and the events within it in profoundly different ways.”

In a similar vein, Macdonald (1990:21, 40) contrasts the qualitative and “illuminative” (Parlett, 1981) paradigm of ethnographical research, which she favours, with the psychometric paradigm, which she rejects. The paradigm one chooses predisposes one to view the world in a certain way. Indeed, and in deed, the paradigm is the view.

2. “OLD PARADIGM” VERSUS “NEW PARADIGM” RESEARCH

I examine the psychometric controversy, and its sociopolitical consequences and ethical implications, particularly for South Africa. The controversy involves the following two opposing views: the one view, which holds that statistical measurement is of limited value, is represented by authors such as Spolsky (1978, 1985), Lantolf and Frawley (1988) and Macdonald (1990, 1990a); the other view, which holds that statistical measurement is of considerable value, is represented by such authors as Popham (1981), Oller (1979, 1983) and Stevenson (1985).

Spolsky (1978,1985), Lantolf & Frawley (1988) and Macdonald (1990) maintain that the psychometric paradigm reduces humans to objects. Lantolf & Frawley (1988:181) assert that psychometrics is an “imposition” upon the reality of an open-ended system, because it is “criterion-reductive, analytically-derived and norm-referenced”. Spolsky (1985:33-34) denounces psychometrists as “hocus pocus” scientists:

In the approach of scientific modern tests, the criterion of authenticity of task is generally submerged by the greater attention given to psychometric criteria of validity and reliability. The psychometrists are ‘hocus-pocus’ scientists in the fullest sense; in their arguments, they sometimes even claim not to care what they measure provided that their measurement predicts the criterion variable: face validity receives no more than lip service… It is in what I have labeled the postmodern period and approach to language testing that the criterion of authenticity of task has come to have special importance. (My emphasis)

For Harrison (1983:84), statistical measurement is inappropriate due to its subjective nature:

Testing is traditionally associated with exactitude, but it is not an exact science… The quantities resulting from test-taking look like exact figures – 69 percent looks different from 68 percent but cannot be so for practical purposes, though test writers may imply that they are distinguishable by working out tables of precise equivalences of test and level, and teachers may believe them. These interpretations of scores are inappropriate even for traditional testing but for communicative testing they are completely irrelevant. The outcome of a communicative test is a series of achievements, not a score denoting an abstract ‘level’.

Harrison seems to mean that “the quantities resulting from test-taking [which] look like exact figures” appear to measure objectively, but in fact they do nothing of the kind – rather, they measure subjectively. For Morrow (1981:12)

[o]ne of the most significant features of psychometric tests as opposed to those of ‘prescientific’ days is the development of the twin concepts of reliability and validity… The basis of reliability claimed by Lado is objectivity. The rather obvious point has, however, not escaped observers… that Lado’s tests are objective only in terms of actual assessment. In terms of the evaluation of the numerical score yielded, and perhaps more importantly, in terms of the construction of the test itself, subjective factors play a large part.

Contrary to the negative attitudes towards statistics mentioned above, Oller (1983:37) believes that statistical procedures play an important role in language testing, while Stevenson (1985:112) maintains that most language teachers have a poor knowledge of language testing and educational measurement. The problem, however, is larger than ignorance, and often involves a certain degree of quantophobia. Popham (1981:203) maintains that there exists the misconception among those who don’t know or like statistics that statistics makes the research more difficult. It seems more reasonable to argue that it is more difficult to obtain useful results from qualitative data alone, because they do not give a complete enough picture of the informants under investigation (Popham, 1981:203).

3. NEGOTIATING THE TASK-DEMANDS

Macdonald’s “Threshold Project” in primary schools in the former Bophuthatswana (now the North West Province), which involved several researchers, has had a strong influence on recent attitudes towards testing in South Africa. The views expressed in the ‘Threshold Project’ are also relevant to all testing, from the primary school through to tertiary education.

Macdonald (1990a:46) rejects the Human Sciences Research Council’s norm-referenced test which serves as a diagnostic tool for pupils entering Grades 4, 5 and 6. She mentions the following: the average scores of pupils entering these three grades were 22%, 44% and 66% respectively; the HSRC recommends that the test be converted into a criterion-referenced test, where Grade 4 pupils should score a minimum of 80% to gain admission to Grade 5.

Macdonald has two objections to this approach. Firstly, the majority of prospective Grade 5 pupils would probably get far less than 80% on such a test, which would mean rejection for admission to Std 3 (Grade 5). The predicament, maintains Macdonald, would then be what to do with these unsuccessful Grade 4 children. This is ultimately a social problem that cannot be resolved by the HSRC or by Macdonald.
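
Macdonald’s first objection can be illustrated with a rough sketch. Only the Grade 4 average of 22% is reported; the spread of scores is not, so the standard deviation below is an invented assumption, used solely to show why an 80% criterion would exclude almost every pupil:

import numpy as np

rng = np.random.default_rng(0)
# Reported Grade 4 average of 22%; the standard deviation of 15 points is assumed, not reported.
scores = np.clip(rng.normal(loc=22, scale=15, size=10_000), 0, 100)
criterion = 80  # the proposed cut-off for admission to Std 3 (Grade 5)
print(f"Simulated proportion reaching {criterion}%: {(scores >= criterion).mean():.2%}")

Under these assumptions virtually no simulated pupil reaches the cut-off, which is the substance of Macdonald’s objection.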

Her second objection is that the causal or correlational link between language proficiency and academic achievement is not clear. For Macdonald (1990a:42)

the most difficult connection to make is that between different aspects of English communicative competence and their relation – causal or correlational – to formal school learning through EMI (English as a medium of instruction). If one is able to set up these relationships in a reasoned way – and nobody to our knowledge has gone very far in this task – then the significance of the current test scores would be absolutely transparent. There is a way through this conundrum, and that is to change the nature of the question.

Macdonald’s predicament of what to do with the large number of unsuccessful applicants to higher grades connects with her “conundrum” of the opaque relationship between language proficiency and academic achievement. What these two problems have in common is the question of individual success, which is “one of the major conundrums in the Second Language Acquisition (SLA) field” (Larsen-Freeman & Long, 1990: 153; see also Diller, 1981).

Macdonald claims that the HSRC’s tests are invalid, firstly, because there would be many children excluded from school if the HSRC’s recommendation were to be followed, and secondly, because the relationship between language proficiency and academic achievement is not clear.

With regard to the HSRC’s tests, the HSRC equates failing Grade 4 with a lack of sufficient academic ability. This lack, I suggest, in company with many other authors (for example Cummins, 1979, 1980; Collier, 1987), is intimately tied up with the lack of development of what Cummins (1983) calls Cognitive and Academic Language Proficiency (CALP), which is closely connected to the development of cognitive abilities through the mother tongue. There is a close relationship between the CALP strategies (which are usually learnt in an artificial, i.e. tutored, situation) developed through a first or second language and the strategies that are used to learn any other academic subject. All these strategies are rooted in the ability of learning how to learn.

Macdonald rejects the HSRC’s “old” paradigm of psychometric testing and suggests that all these children should be allowed into Std 3 (and the higher standards?) regardless of their ability as measured by tests. The implication of the HSRC’s test policy, however, seems to be that passing a standard should not be equated with academic success, because in spite of the fact that children may pass all their standards (which may often be through automatic promotion), this cannot be regarded as authentic success. According to the editorial of Educamus (1990), there is a low failure rate from preschool to Grade 11 in DET secondary schools, because low ability pupils in many DET schools are automatically promoted through the standards, except for the final Grade 12 external examination.

A third problem for Macdonald (1990a:46) is that

doing things in such a post hoc way [namely, the HSRC’s psychometric tests] would fail to force us into analyzing the nature of the learning that the child has to be able to meaningfully participate in… we would have described a test and some external criteria and identified children through the use of these – but we would have failed to explain what it is the children have to be able to do. (My square brackets)

Macdonald (above) is contrasting the “post hoc” psychometric paradigm of the HSRC, which “fail[s] to explain what it is the children have to be able to do”, with her “negotiating the task-demands”, which she claims does explain what children have to be able to do. This raises the question: what is a real, authentic, natural task? For Macdonald the answer to this question lies in “negotiating the task-demands”.

Macdonald’s (1990a:46) solution to her three problems mentioned above is to replace the “outdated and rigid modes of curriculum development in South Africa” such as psychometric measurement (norm-referenced and criterion-referenced tests) and the general ability of communicative proficiency with “negotiating the task-demands of Std 3 [Grade 5]”, which involves “going from one situation (and knowledge domain) to another to see how the curriculum in its broadest sense has been constituted, and which aspects are negotiable”. Examples of such task-demands are (Macdonald, 1990a:47):

1. Following a simple set of instructions for carrying out a task.
2. Showing command of a range of vocabulary (in semantic clusters) from across the curriculum.

3. Solving problems involving logical connectives.

4. Being able to show comprehension of simple stories and information books.

But do we know what a real-life task is (not merely what it looks like), or, if we knew, do we know whether it is necessary to do real-life tasks in order to learn or to prove that we are proficient to do them? This question lies at the heart of the problem of what an authentic language activity, that is, an authentic test, is. Alderson (1983:90), in the context of communicative (that is, real-life) tests, maintains that we do not yet know what communicative tests are, owing to the fact that “we stand before an abyss of ignorance”. The criteria that may provide some help come from “theory” and “from what one knows and believes about language and communicative abilities, and from what one knows and believes about communication with and through language” (Alderson, 1983:91). What Alderson maintains about communicative tests may also be true of negotiating Macdonald’s “task-demands” above. Thus it doesn’t seem wise to try and separate – as Macdonald suggests – (general) communicative proficiency from a task-demand such as “showing command of a range of vocabulary (in semantic clusters) from across the curriculum” (Macdonald, 1990a:47). After all, the most demanding part of “negotiating the task-demands” is often the (general) communicative proficiency part, especially for limited English proficiency pupils. Black students often have more problems with general background knowledge than with new knowledge. For this reason a radical separation should not be made between a Language for Specific Purposes task and a general proficiency task, because the harder part is often the general language proficiency part, especially for low English proficiency pupils.

Fodor (1980:149) suggests that theory, so far, has not been of much help in redeeming our knowledge and beliefs from the abyss: “there simply isn’t any theory of how learning can affect concepts”. This implies that we are not clear about how to test concepts, because if we are not clear about how concepts are learnt, we cannot be clear about how they are tested. And to test language is to test concepts. Thus, from the theoretical perspective, the HSRC’s psychometric paradigm is, to say the least, not worse than Macdonald’s “negotiating the task-demands”. Macdonald’s (1990a:46) argument, as mentioned earlier, is that the HSRC tests do not tap what learners “have to be able to do”. The problem is that the connection between the activity of doing “old” paradigm tests, such as those used by the HSRC, and the “new” paradigm activity of “negotiating task-demands” is far from clear.

Macdonald’s (1990a:15, 28, 31, 39) statistical data, oddly, are dealt with under the rubric of “qualitative” data, which explains why Macdonald appears to be paying lip service to quantitative data.

The difficulty in statistical research is trying to be both group-oriented and individual-oriented. Whatever the inadequacies of statistics, the best argument for its usefulness is the fact that much of academic evaluation ultimately ends up as a score, and if that is the brutish fact of the matter, we might as well try and measure this score properly. Having said that, it is undeniable that “true ethnography demands as much training skill” (Nunan, 1992:53) as statistical measurement. What is important is that statisticians and ethnographers both realise that each has a crucial – and complementary – contribution to make to the human sciences. Lip service, either to “face validity” (that is, “real-life”; see Spolsky, 1978, 1985) or to psychometrics, does a disservice to both.

In the last decade there have been attempts towards making educational research more “human”, and through these attempts has sprung the conflict between the orthodox scientific and objective methods of experimental research and statistical analysis, on the one hand, and “new paradigm research” (Reason & Rowan, 1981), on the other.

Below is a summary of the salient features of “new paradigm research” (Reason & Rowan, 1981:xiv-xvi):

1. There is too much “quantophrenia” going on. The emphasis should fall on human significance, not on statistical significance. Researchers should become involved in the human side of the phenomenon under study, because the person behind the data can often upset the neat statistics. This means that people should not be reduced to variables or to operational definitions in order to be manipulated into a research design.

2. Care must be taken not to make outlandish generalisations from unrepresentative samples.

3. Safe, respectable research should be avoided.

4. Fear of victimisation may cause the researcher to pick only those bits of research that will impress and please.

5. Science requires the humility to change one’s views in the light of better theories or new observations.

Reason and Rowan’s (1981) view is that statistical (quantitative, objective) research and “human” (qualitative, subjective) research are complementary. Rutherford (1987) echoes Reason and Rowan’s misgivings about the danger of reducing humans to objects. Rutherford (1987:65) quotes the physicist Niels Bohr: “Isolated material particles are abstractions, their properties being definable and observable only through their interaction with other systems.” Rutherford’s message is that (the testing of) humans cannot be isolated into parts. This is probably true. But what should be in dispute as far as language is concerned is not whether language should be tested through its (reductive, mechanistic) parts or through the (organic) whole, but how the parts and the whole interact; which is not only the basic problem of testing, but also of learning, knowing, and of being (human).

4. CONCLUSION

Statistics is a contentious and often an odious issue in the human sciences. However, without (hard) statistical evidence, that is, quantitative evidence, language evaluation – in fact all educational evaluation, in my view – would be reduced to a fistful of profiles, case studies and anecdotes. Thus, to split quantitative and qualitative research into separate paradigms is symptomatic of the urge to find strict oppositions where there are none; an urge that originates from the human and humanistic fear of “reductionism”, of the fear of disempowerment. Ironically, this fear of reductionism, and efforts to prevent it, end up being the most reductionist – and antihumanist – effort of all. In South Africa, the use of statistics in evaluation is often linked to the oppression and the reduction of power of the “disadvantaged” (a euphemism for “blacks”). This view seems to be gaining influence among policy makers in South Africa, and for this reason the importance of statistical methods in evaluation needs serious reconsideration.

The paradigm of qualitative research is safe and respectable, because it describes; the paradigm of quantitative research is neither safe nor respectable, because it prescribes. Yet, in order to make moral, political and economic sense of evaluation, both paradigms (a “holodigm”!) are necessary. The reduction of one paradigm leads to the reduction of the other. Accordingly, the suggestion that quantitative (psychometric) measurement be replaced by qualitative methods such as “negotiating the task-demands” (Macdonald 1990a:46) might be unwise. Both, namely, psychometrics and “negotiating the task-demands”, should work hand in hand.

The main issue in educational testing is how to measure individual differences within language-specific abilities and academic abilities accurately, i.e. how to recognise performance, which has to do with the setting of valid standards, i.e. with what one considers relevant to fulfilling the purposes of education. And it is on this issue of relevance that people differ. This is the reason why educational issues in the context of evaluation (e.g. admission tests, placement tests and promotion tests) are beginning to play second fiddle to the more imperious need for sociopolitical transformation, whose nom de guerre is “empowerment”.

It seems that there are two irreconcilable world views, or paradigms: the HSRC’s and Macdonald’s. However, this is no cause for alarm, because science and academia are generated by – and seemingly thrive on (unlike politics) – incompatible theories, e.g. Chomsky versus Piaget, Piaget versus Vygotsky, Vygotsky versus Chomsky; and in testing, Spolsky, Macdonald, Lantolf and Frawley versus Oller, Stevenson and Popham.

Much is at stake in testing, where evaluations have to be made by human beings of other human beings; where judgements (often the occasion, if not the cause, of much distress) have to be made about whether somebody should be admitted to an education programme or to a job; or promoted to a higher level. Within the sociopolitical and multi-lingual-cultural-racial-ethnic context of South Africa, these judgements assume an intense poignancy.

REFERENCES

Alderson, J.C. (1983). Who needs jam? In: Hughes, A. & Porter, D. (1983) Current developments in language testing. London: Academic Press.

Collier, V.P. (1987). Age and rate of acquisition of second language for academic purposes. TESOL Quarterly, 21/4, pp. 617-641.

Cummins, J. (1979). Linguistic interdependence and the educational development of bilingual children. Review of Educational Research, 49, pp. 222-251.

Cummins, J. (1980). The cross-lingual dimensions of language proficiency: Implications for bilingual education and the optimal age issue. TESOL Quarterly, 14/2, pp. 175-87.

Cummins, J. (1983). Language proficiency and academic achievement. In: Oller, J.W. (Jr.), (ed.). (1983). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.

Diller, K.C. (ed.). (1981). Individual differences and universals in language learning aptitude. Rowley, Massachusetts: Newbury House.

Educamus. (1990). Editorial: Internal promotions, 36/9, p. 3. Pretoria: Department of Education and Training.

Fodor, J.R. (1980). Fixation of belief and concept acquisition. In: Piatelli-Palmarini, M. (ed.). (1980). Language and learning: The debate between Jean Piaget and Noam Chomsky. London: Routledge, Kegan & Paul.

Harrison, A. (1983). Communicative testing: Jam tomorrow? In: Hughes, A. & Porter, D. (eds.). (1983). Current developments in language testing. London: Academic Press.

Hutchinson, T. & Waters, A. (1987). English for special purposes; A learner-centred approach. Cambridge: Cambridge University Press.

King, M. & Van den Berg, O. (1993). The Independent Examinations Board, August 1989 – February 1992: A narrative. In: Taylor, N. (ed.). (1993). Inventing knowledge: Contests in curriculum construction. Cape Town: Maskew Miller Longman.

Lantolf, J.P. & Frawley, W. (1988). Proficiency: Understanding the construct. Studies in Second Language Acquisition (SSLA), 10/2, pp. 181-195.

Larsen-Freeman, D. & Long, M. H. (1990). An introduction to second language acquisition research. New York: Longman.

Macdonald, C.A. (1990). Crossing the threshold into Standard Three in black education: The consolidated main report of the Threshold Project. Pretoria: Human Sciences Research Council (HSRC).

Macdonald, C.A. (1990a). English language skills evaluation (A final report of the Threshold Project), Report Soling-17. Pretoria: Human Sciences Research Council (HSRC).

Messick, S. (1988). Meaning and values in test validation: The science and ethics of measurement. Princeton, New Jersey: Educational Testing Service.

Morrow, K. (1981). Communicative language testing: Revolution or evolution. In: Alderson, J.C. (ed.). (1981). Issues in language testing. ELT Documents, The British Council.

Nunan, D. (1992). Research methods in language learning. Cambridge, New York. Cambridge University Press.

Oller, J.W. (Jr.). (1979). Language tests at school. London: Longman.

Rist, R. (1977). On the relations among educational research paradigms: From disdain to détente. Anthropology and Education Quarterly, 8, pp. 42-49.

Rowntree, D. (1977). Assessing students: How shall we know them, London: Harper & Row Publishers.

Rutherford, W.E. (1987). Second language grammar: Learning and teaching. London: Longman.

Spolsky, B. (1978). Approaches to language testing. In: Spolsky, B (ed.). Advances in Language Testing Series, 2. Arlington, Virginia. Center for Applied Linguistics.

Spolsky, B. (1985). The limits of authenticity in language testing. Language Testing, 2, pp. 31-40.

Spolsky, B. (1989). Conditions for second language learning. Oxford: Oxford University Press.

Stevenson, D.K. (1985). Pop validity and performance testing. In: Lee, Y; Fok, A; Lord, R; & Low, G. (eds.). (1985). New directions in language testing. Oxford: Pergamon.

“Old” Paradigm Language Proficiency Tests as Predictors of Long-term Academic Achievement

Per Linguam, 13(2), 1-23, 1997.

Author: Raphael Gamaroff

Abstract

Without the differences revealed between strong and weak learners, one would have an unclear idea of what one was measuring. It is the differences between groups that are the meeting point of construct and predictive validity. That is the reason why psychometrics (where the emphasis is on norm-referenced tests) plays a vital role in assessment. In this article a battery of English proficiency tests was given to a mixture of L1 and L2 Grade 7 learners and subsequently used to predict long-term academic achievement. It was found that the tests, in combination or singly, were good predictors of academic achievement and distinguished well between levels of language proficiency. The results also show that learners from former DET (Department of Education and Training) schools do not perform well in schools that use a Joint Matriculation Board syllabus or its equivalent.
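
The predictive logic summarised in the abstract can be sketched as follows. The scores below are invented, not the study’s data, but the procedure (correlate battery totals with later achievement and fit a least-squares prediction line) is the conventional way of estimating predictive validity:

import numpy as np

# Invented data: Grade 7 proficiency-battery totals and later academic aggregates (both in %).
battery = np.array([38, 45, 52, 55, 61, 64, 70, 74, 79, 85], dtype=float)
achievement = np.array([41, 48, 50, 58, 60, 67, 69, 75, 78, 88], dtype=float)

r = np.corrcoef(battery, achievement)[0, 1]             # predictive-validity coefficient
slope, intercept = np.polyfit(battery, achievement, 1)  # least-squares prediction line
print(f"r = {r:.2f}; predicted achievement = {slope:.2f} x battery + {intercept:.2f}")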

The Discrete-Point/Integrative Controversy and Authenticity in Language Testing

Journal of Language Teaching, 32(2):94-104, 1998.

Author: Raphael Gamaroff

For a statistical analysis of language tests found in this article,  see here

1. INTRODUCTION
2. DIRECT AND INDIRECT TESTS
3. THE UNITARY COMPETENCE HYPOTHESIS AND THE “ONE BEST TEST”
4. DISCRETE-POINT AND INTEGRATIVE TESTS
5. CONCLUSION
6. REFERENCES

Abstract
For “real-life” testers the discrete-point/integrative controversy is out of fashion and long dead and buried. It is argued that there is still a lot of life in the “old beast”, and the questions the controversy raises are as pertinent as ever. The overarching problem in language testing is how to validly and reliably assess authentic language reception and production. It is widely believed that “integrative” tests such as cloze tests, dictation tests and essay tests test authentic communicative language, while “discrete-point” tests such as error recognition tests and grammar tests merely test the elements of language. Such a distinction between the two kinds of tests is an oversimplification, which is largely due to characterising “integrative” tests as authentic, real-life and naturalistic, while characterising “discrete-point” tests as unauthentic, unreal and artificial. Some even argue that “integrative” tests such as cloze and dictation are also unauthentic. It is argued that tests do not have to be “naturalistic” or “direct” to be authentic tests.

1. INTRODUCTION

The basic question in second language acquisition is: “What does it mean to know and use a second language?” The basic question in testing is: “How do we test this knowledge and use?” The first question is concerned with the nature of knowledge, the second with the selection of methods for testing this knowledge. Although these are distinct questions, the validity of tests depends on understanding how languages are learnt and used.

Do researchers know what a real-life task is, not merely what it looks like; or, if they knew, is it necessary for learners to do real-life tasks in order to prove that they are proficient to do them? This question lies at the heart of the problem of what an authentic language activity or test is meant to be (Gamaroff, 1996). Although language proficiency has ultimately to do with language use, with authentic, or communicative, or direct, language, it doesn’t follow that language proficiency can only be validly tested “on the wing” (Harrison, 1983:82), i.e. naturalistically. The implication of the arguments presented is that until we know more about testing, it is legitimate to follow the practical route of using “discrete-point” tests or “integrative” tests to predict “real-life” language proficiency.

2. DIRECT AND INDIRECT TESTS

For language “naturalists”, the only authentic tests are those presented in a direct real-life situation, because they are based on “naturalistic contexts” (Omaggio, 1986:312-313; see also Ingram, 1985:239ff). For “direct” testers, tests such as grammar tests, cloze tests and dictation tests are regarded as indirect tests, while essay tests and interviews would be regarded as direct tests (Hughes, 1989; Omaggio, 1986).

Many studies have found high correlations between “direct” and “indirect” tests (e.g. Oller, 1979; Henning et al, 1981; Hale et al, 1984; Fotos, 1991; Haussmann, 1992). Henning et al (1981) found high correlations between composition tests and error identification tests (.76). Several studies in Hale et al (1984:120, 152) report high correlations between cloze tests and grammar tests (.82 to .93), cloze tests and essay tests (.78 to .94), and error recognition tests and essay tests (.75 to .93). Darnell (1968), Oller (1973:114; 1976:282) and Oller and Conrad (1971) found high correlations between written cloze and listening comprehension, and between listening comprehension and oral cloze.
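
Coefficients of the kind reported above are ordinary Pearson correlations between two sets of scores. A minimal sketch with invented cloze and essay marks shows how such a figure is obtained:

import numpy as np

# Invented marks for twelve learners on a cloze test and an essay test (both out of 100).
cloze = np.array([35, 42, 48, 50, 55, 58, 62, 66, 70, 75, 80, 88], dtype=float)
essay = np.array([40, 45, 44, 55, 52, 60, 65, 63, 72, 78, 77, 90], dtype=float)

print(f"Pearson r between cloze and essay: {np.corrcoef(cloze, essay)[0, 1]:.2f}")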

Cloze and dictation reveal similar production errors in writing (Oller, 1976:287ff, 1979:57), and a combination of cloze tests and dictation tests have been used effectively in determining general language proficiency (Stump, 1978; Hinofotis, 1980). Oller (1979:61) maintains that all “pragmatic” tasks such as cloze tests or dictation tests probe the same underlying skill. In contrast to Oller, Savignon (1983:264) does not believe that cloze and dictation tests test pragmatic language, that is, language use.

How is the construct able to account for these differences in views described in the previous paragraphs? Shouldn’t supposedly similar types of tests relate more to each other than to supposedly different types of tests? An adequate response presupposes four further questions: (1) What are similar/different types of tests? (2) Wouldn’t it be more correct to speak of so-called discrete-point tests and so-called integrative tests? (3) Isn’t the discrete/integrative dichotomy irrelevant to what the cloze test (or any test) is measuring? And most importantly: (4) Is it necessary to use direct tests to predict direct language proficiency? In the next section I suggest some answers to these questions.

3. THE UNITARY COMPETENCE HYPOTHESIS AND THE “ONE BEST TEST”

The debate of whether language proficiency consists of a unitary or general factor (analogous to a g factor in intelligence), or of a number of independent factors has straddled three decades, receiving prominence in the work of authors such as Carroll (1961, 1983a) and Oller (1979, 1983, 1983a). The old beast has still not been put to rest (Oller, 1983a) but is very much alive (Davies, 1990).

Protagonists of the “unitary competence hypothesis” (UCH) – spearheaded by such writers as Oller (1979) and Oller and Khan (1981) – believed that each of the four language skills manifested a holistic language ability, and that accordingly it was possible to predict general proficiency from any one of these skills. For example, a high proficiency in writing would indicate proficiency in all the other language skills.

The UCH has a strong form and a weak form (Oller and Khan, 1981). In the strong form, a single proficiency test could serve as a valid measure of general language proficiency. In the weak form, a unitary factor accounts for a large portion of the variance in language tests, but differentiated components also need to be taken into account. Oller (1983a) has since opted for the weak form of the Unitary Competence Hypothesis, which adopts an interactionist approach between “global” and discrete components of language. Oller (1983a:36; see also Oller and Khan, 1981) describes this approach:

…not only is some sort of global factor dependent for its existence on the differentiated components which comprise it, but in their turn, the components are meaningfully differentiated only in relation to the larger purpose(s) to which all of them in some integrated (integrative? – original brackets) fashion contribute.

(See Carroll [1983a:82] and Bachman and Palmer [1981:54] for similar views).
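
The weak form of the UCH is, in effect, a claim about how much of the variance in a battery of tests a single general factor absorbs. The sketch below uses an invented score matrix and a principal-components shortcut (a full factor analysis would use specialised software), so the figures are illustrative only:

import numpy as np

rng = np.random.default_rng(1)
# Invented battery: 100 learners x 4 tests (essay, cloze, dictation, grammar),
# generated so that a shared ability plus test-specific noise drives the scores.
general_ability = rng.normal(size=(100, 1))
scores = 50 + 10 * general_ability + 5 * rng.normal(size=(100, 4))

corr = np.corrcoef(scores, rowvar=False)      # 4 x 4 correlation matrix of the tests
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # largest eigenvalue first
print(f"Variance absorbed by the first (general) component: {eigenvalues[0] / eigenvalues.sum():.0%}")

With scores generated this way the first component absorbs well over half of the variance, while the remainder is spread over test-specific components – the pattern the weak form describes.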

Physicists are ever-searching for that grand unified theory (GUT), or theory of everything (TOE). Many applied linguists, in contrast, especially “real-life” testers, have given up the search for unitary theories and have happily buried the UCH. But the UCH is far from dead, because it is closely related to the problem of “whether skills (production and reception) are equal and how many separate tests are needed to assess proficiency, [and] that is the ‘one best test question’” (Davies, 1990:76; my brackets and italics).

Alderson (1981b:190) suggests that

regardless of the correlations, and quite apart from any consideration of the lack of face validity of the One Best Test, we must give testees a fair chance by giving them a variety of language tests, simply because one might be wrong: there might be no Best Test, or it might not have the one we chose to give, or there might not be one general proficiency factor, there may be several.

It would be very difficult to find the “One Best” or “perfect” test. The problem has to do not only with construct validity but also with face validity, because even if one were convinced that one had found the “One Best” test, it would not find general acceptance, owing to the fact that it would probably lack face validity. For example, Henning et al (1981) in their Egyptian study found that the highest correlation with Composition was with Error Identification (.76). They accordingly maintained that “Error Identification may serve as an indirect measure of composition writing ability” (Henning et al, 1981:462). Even though Henning et al (1981:464) established that Reading Comprehension “like Listening Comprehension was of little psychometric value in predicting general proficiency”, they conceded that they had to include reading in order for their battery to “find acceptance” (Henning et al, 1981:464). Accordingly, Henning et al (1981) replaced their Error Identification test with a Reading Comprehension test. Thus, Henning et al had to choose between the psychometric evidence and “acceptance”. They capitulated to the latter, because decisions based on testing need to look right, i.e. they need to have face validity.

4. DISCRETE-POINT AND INTEGRATIVE TESTS

The terms “integrative” and “discrete-point” are rejected by some applied linguists. Alderson (1979) prefers to distinguish between “low order” and “higher order” tests rather than between “discrete-point” and “integrative” tests. Fotos (1991:318) equates “integrative” skills with “advanced skills and global proficiency”, which are contrasted with Alderson’s (1979) “basic skills”. These “basic skills” are Alderson’s (1979) “low order” skills.

There are very few tests that do not involve some kind of integrative meaning. Consider the following examples from Rea (1985), Canale and Swain (1980) and Bloor et al (1970:35-40). The following are two examples from Rea (1985:22):

1. How….milk have you got?

(a) a lot (b) much of (c) much (d) many

2. …. to Tanzania in April, but I’m not sure.

(a) I’ll come (b) I’m coming (c) I’m going to come (d) I may come.

Item 1 is testing a discrete element of grammar. All that is required is an understanding of the “collocational constraints of well-formedness” (Rea, 1985:22; see also Canale & Swain, 1980:35), i.e. to answer the question it is sufficient to know that “milk” is a mass noun. Item 2 relates form to global meaning, therefore all parts of the sentence must be taken into account, which makes it an integrative task. To use Rea’s (1985:22) terminology, her item 1 (above) is testing “non-communicative performance”, while her item 2 (above) is testing “communicative performance” (also called “communicative competence” [Canale & Swain, 1980:34]).

Consider Canale and Swain’s (1980:35) examples, which are similar to Rea’s examples above. The first example they regard as a discrete-point item, and the second as an integrative item:

1. Instruction – Select the correct preposition to complete the following sentence:

We went….the store by car. (a) at; (b) on; (c) for; (d) to

2. Instruction – The sentence underlined below may be either grammatically correct or incorrect. If you think it is correct go on to the next item; if you think it is incorrect, correct it by changing, adding or deleting only one element.

We went at the store by car.

The complex instructions of the second item and the fact that one has to produce the correct answer and not merely select the correct answer make such items more complex (more integrative) than the first item.

Consider the following three items from the mixed grammar test of Bloor et al (1970:35-40):

Item 38. My friend always goes home….foot.

A) by

B) with

C) on a

D) on

Answer: D

In item 38 knowledge of the correct preposition does not depend on global understanding.

Item 50. We….our meat from that shop nowadays.

A) were never buying

B) do never buy

C) never buy

D) never bought

Answer: C

In contrast to item 38, item 50 requires the understanding of more elements in the sentence, but does not require the understanding of all the elements.

Item 30. When the door-bell…., I was having a bath.

A) rang

B) rings

C) rung

D) ringed

Answer: A

Item 30 is more difficult than the others, because it requires not only knowledge of an idiosyncratic past tense formation, but also an understanding of more elements of the sentence (e.g. “when”, “was having”) than in the case of the previous two items. However, one isn’t required to know the meaning of all the elements in item 30, e.g. “bath”. These examples show that there are few tests that do not involve some degree of integrative meaning (Rea, 1985).

I would like to elaborate on the discrete-point/integrative controversy. It is widely believed that “pragmatic” tests (often mistakenly called “integrative” tests) such as dictation tests and essay tests test the “use” of language (Widdowson, 1979), i.e. authentic communicative language, while “discrete-point” tests such as error recognition tests and grammar accuracy tests test the “usage” of language (Widdowson, 1979), i.e. the elements of language. Such a distinction between the two kinds of tests, which Farhady (1983) describes as the “disjunctive fallacy”, is an oversimplification. Two opposing positions exist with regard to language proficiency tests: the one position maintains that the only authentic tests are “real-life”, or “communicative”, tests, because only such tests are able to measure individual performance (e.g. Morrow, 1979; Harrison, 1983; Lantolf & Frawley, 1988). Finding directions or interpreting maps are examples of “real-life” tasks. The contrary position maintains that non-communicative tests, i.e. “discrete-point” tests, can successfully test “real-life” tasks (e.g. Rea, 1985; Politzer & McGroarty, 1983).

For language “naturalists” the only authentic tests – whether communicative tests or grammar tests – are those presented in a direct real-life situation. Spolsky (1985:33-34) maintains that the criterion of “authenticity of task is generally submerged by the greater attention given to psychometric criteria of validity and reliability”, where face validity receives no more than “lip service” (see Gamaroff, 1997). For Spolsky and others, e.g. Hughes (1989:15), authenticity is closely related to communicative language, i.e. to direct language. Authentic tests for Spolsky would be “direct” tests in contradistinction to “indirect” tests.

Indirect testing for Hughes (1989:15) is both “discrete testing”, i.e. “testing item by item”, and “integrative testing”, which combines “a variety of language elements at the same time” (Hughes, 1989:15), e.g. dictation and cloze tests. Hughes (1989:150) believes that the relationship between performances on indirect and direct tests is both weak and uncertain. Owing to the lack of clarity on the relationship between a global skill like composition writing and the components of composition writing, e.g. vocabulary, punctuation and grammar, Hughes (1989:15) believes that it is best, in terms of one’s present knowledge, to try and be as comprehensive as possible in the choice of tests, where direct tests would be favoured. Alderson (1983a) maintains, however, that there is no clarity on what communicative tests measure, and therefore there is no cogent reason why one should only use direct, i.e. communicative, tests. Alderson (1983a:88) maintains that “‘communicative testers’ only talk about face validity, at the expense of other validities.” Rea (1985) gives the following reasons why indirect tests (e.g. cloze tests) should be used:

– There is no such thing as a pure direct test.

– Direct tests are too expensive and involve too much administration.

– Direct tests only sample a restricted portion of the language, which makes valid inferences difficult.

Even if indirect performance is accepted to be a valid predictor of direct performance, one may still not be comfortable with the idea that direct performance, which one may regard as natural, can be predicted by indirect performance, which one may regard as unnatural, or artificial. There is a misunderstanding here in that the dichotomy between “natural” and “unnatural” is a spurious one with regard to the laws of learning. As Butzkamm (1992) points out with regard to language teaching approaches, it is incorrect to assume that “natural” approaches (Krashen & Terrell, 1983) and immersion programmes mirror natural language acquisition and that the ordinary classroom doesn’t. The playground or the cocktail party or the cooking club is neither more nor less natural than the traditional classroom (Gamaroff, 1986). The laws of learning, and of testing, apply to all contexts, “naturalistic” (Omaggio, 1986:312-313) and otherwise. Granted, the quality of learning depends on the quality of input; but this is as trite as the fact that one wouldn’t be able to learn a language without verbal input, or live without food.

5. CONCLUSION

All language testing has a certain arbitrary character. To establish consistency, testers need to decide how to control and develop this arbitrariness. The basic problem of communicative/direct tests is that it is difficult to make them real-life and authentic, for the simple reason that they are tests. One can, however, have authentic tests, because tests are authentic activity types in their own right (Alderson, 1983a:89). A key issue in language testing should be whether the test is an authentic test, not whether it is “natural”, if by the latter we mean “spontaneous”. Much of education in a tutored context, i.e. much of one’s young life, is “unspontaneous”, but not less natural for being so.

Even when there exists a strong psychometric justification for using indirect tests as predictors of communicative tasks, “communicative” testers will argue that indirect tests are not authentic, because they do not test real life. One might as well argue that an eye-test – say for a driver of a vehicle – is not authentic because it is done at the optometrist instead of on the street. If it could be proven that “discrete-point” tests are valid predictors of direct performance, this would be a good reason for using “discrete-point” tests. The practical implication of the arguments presented is that, until we know more about testing, it is legitimate to follow the practical route of using “discrete-point” tests to predict “real-life” language.

What is so difficult about the interpretation of tests such as essay tests and other “real-life/communicative tasks” (Weir, 1993) is that their “evidential basis” (Messick, 1988:19) is very subjective. Owing to the subjective nature of “real-life” tests, each protocol is the product of a unique web of meanings: the test-taker’s, entangled in another web of meanings: the rater’s.

All language testing theories are inadequate owing to the difficulties involved in devising tests that test authentic language reception and production (Oller, 1983a:269). This does not mean that you should stop measuring until you’ve decided what you are measuring (Spolsky, 1981:46). You do the best you can by taking account of generally accepted views of the nature of language proficiency (Alderson and Clapham, 1992:149), and disagreeing if you feel sensibly compelled to do so.

For a statistical analysis of language tests, which is based on this article,  see here.

6. REFERENCES

Alderson, J.C. (1979). The cloze procedure and proficiency in English as a foreign language. TESOL Quarterly, 13, 219-227.

Alderson, J.C. (1981a). Reaction to the Morrow paper. In J.C. Alderson & A. Hughes (Eds.). Issues in language testing: ELT Documents III. The British Council.

Alderson, J.C. (1981b). Report of the discussion on general language proficiency. In J.C. Alderson & A. Hughes (Eds.). A. Issues in language testing: ELT Documents III. The British Council.

Alderson, J.C. (1983a). Who needs jam? In A. Hughes & D. Porter. Current developments in language testing. London: Academic Press.

Alderson, J.C. (1983b). The cloze procedure and proficiency in English as a foreign language. In J.W. Oller, Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers. (A republication of Alderson, 1979)

Alderson, J.C. & Clapham, C. (1992). Applied linguistics and language testing: A case study of the ELTS test. Applied Linguistics, 13(2), 149-167.

Bachman, L.F. (1990). Fundamental considerations in language testing. Oxford:Oxford University Press.

Bloor, M., Bloor, T., Forrest, R., Laird, E. & Relton, H. (1970). Objective tests in English as a foreign language. London: Macmillan.

Bailey, C.J. (1976). The state of no-state linguistics. Annual review of anthropology, 5, 93-106.

Butzkamm, W. (1992). Review of H. Hammerly, “Fluency and accuracy: Toward balance in language teaching and learning.” System, 20(4), 545-548.

Canale, M. & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1):1-47.

Carroll, J.B. (1961). Fundamental considerations in testing for English language proficiency of foreign language students. Washington, D.C: Center for Applied Linguistics.

Carroll, J.B. (1983). Psychometric theory and language testing. In J.W. Oller, Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.

Carroll, J.B. (1993). Human cognitive abilities: A survey of factor analytic studies. Cambridge: Cambridge University Press.

Cummins, J. (1979). Linguistic interdependence and the educational development of bilingual children. Review of Educational Research, 49:222-51.

Cummins, J. (1980). The cross-lingual dimensions of language proficiency: Implications for bilingual education and the optimal age issue. TESOL Quarterly, 14(2):175-87.

Cummins, J. (1983). Language proficiency and academic achievement. In J.W. Oller, Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.

Darnell, D.K. (1968). The development of an English language proficiency test of foreign students using a clozentropy procedure: Final Report. Boulder: University of Colorado.

Duran, R.P. (1984). Some implications of communicative competence research for integrative proficiency testing. In C. Rivera, (Ed.). Communicative competence approaches to language proficiency assessment: Research and application. Clevedon, England: Multilingual Matters.

Farhady, H. (1983). The disjunctive fallacy between discrete-point tests and integrative tests. In J.W. Oller, Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.

Fotos, S. (1991). The cloze test as an integrative measure of EFL proficiency: A substitute for essays on college entrance examinations. Language Learning, 41(3), 313-336.

Gamaroff, R. (1996). Is the (unreal) tail wagging the (real) dog?: Understanding the construct of language proficiency. Per Linguam, 12(1), 48-58.

Gamaroff, R. (1997). Paradigm lost, paradigm regained: Statistics in language testing. Journal of the South African Association of Language Teaching (SAALT), 31(2), 131-139.

Hale, G.A., Stansfield, C.W. & Duran, R.P. (1984). TESOL Research Report, 16. Princeton, New Jersey: Educational Testing Service.

Halliday, M.A.K. (1975). Learning how to mean. London: Arnold.

Harrison, A. (1983). Communicative testing: Jam tomorrow? In A. Hughes & D. Porter (Eds.). Current developments in language testing. London: Academic Press.

Haussmann, N.C. (1992). The testing of English mother-tongue competence by means of a multiple-choice test: An applied linguistics perspective. Doctoral thesis, Rand Afrikaans University, Johannesburg.

Henning, G.A., Ghawaby, S.M., Saadalla, W.Z., El-Rifai, M.A., Hannallah, R.K. & Mattar, M. S. (1981). Comprehensive assessment of language proficiency and achievement among learners of English as a foreign language. TESOL Quarterly, 15(4), 457-466.

Hinofotis, F.B. (1980). Cloze as an alternative method of ESL placement and proficiency testing. In J.W. Oller, Jr. & K. Perkins. Research in language testing. Rowley, Massachusetts: Newbury House.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

Krashen, S. & Terrell, T. (1983). The natural approach: Language acquisition in the classroom. Hayward, California: Alemany Press.

Lantolf, J.P. & Frawley, W. (1988). Proficiency: Understanding the construct. Studies in Second Language Acquisition (SLLA), 10(2), 181-195.

Messick, S. (1988). Meaning and values in test validation: The science and ethics of measurement. Princeton, New Jersey: Educational Testing Service.

Morrow, K. (1979). Communicative language testing: Revolution or evolution. In C.J. Brumfit & K. Johnson (Eds.). The communicative approach to language teaching. London: Oxford University Press.

Oller, J.W., Jr. (1973). Cloze tests of second language proficiency and what they measure. Language Learning, 23(1), 105-118.

Oller, J. W., Jr. (1976). Cloze, discourse, and approximations to English. In K. Burt & H.C. Dulay (Eds.). New directions in second language learning, teaching and bilingual education. Washington, D.C.: TESOL.

Oller, J. W., Jr. (1979). Language tests at school. London: Longman.

Oller, J. W., Jr. (1983a). A consensus for the 80s. In J.W. Oller, Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.

Oller, J.W., Jr. (1983b). “g”: What is it? In A. Hughes & D. Porter (Eds.). Current developments in language testing. London: Academic Press.

Oller, J.W., Jr. (1995). Adding abstract to formal and content schemata: Results of recent work in Peircean semiotics. Applied Linguistics, 16(3), 274-306.

Oller, J.W., Jr. & Conrad, C. (1971). The cloze technique and ESL proficiency. Language Learning, 21, 183-196.

Oller, J.W., Jr. & Kahn, F. (1981). Is there a global factor of language proficiency? In J.A.S. Read. Directions in language testing. Singapore: Singapore University Press.

Oller, J.W., Jr. & Perkins, K. (1980). Research in language testing. Rowley, Massachusetts: Newbury House.

Omaggio, A.C. (1980). Priorities for classroom testing for the 1980s. Proceedings of the national conference on professional priorities, ed. Dale L. Lange. Hastings-on-Hudson, New York: ACTFL.

Omaggio, A.C. (1986). Teaching language in context: Proficiency-orientated instruction. Boston, Massachusetts: Heinle & Heinle.

Politzer, R.L. & McGroarty, M. (1983). A discrete-point test of communicative competence. International Review of Applied Linguistics, 21(3), 179-191.

Popham, W.J. (1981). Modern educational measurement. Englewood Cliffs, New Jersey: Prentice-Hall.

Rea, P. (1985). Language testing and the communicative language teaching curriculum. In Y.P. Lee et al. (Eds.). New directions in language testing. Oxford: Pergamon.

Savignon, S.J. (1983). Communicative competence: Theory and classroom practice. Reading, Mass: Addison-Wesley Publishing Company.

Spolsky, B. (1978). Approaches to language testing. In B. Spolsky (Ed.). Advances in Language Testing Series, 2. Arlington, Virginia: Center for Applied Linguistics.

Spolsky, B. (1981). Some ethical questions about language testing. In Klein-Braley & Stevenson.

Spolsky B. (1985). The limits of authenticity in language testing. Language Testing, 2:31-40.

Stevenson, D.K. (1985). Pop validity and performance testing. In: Y. Lee, A. Fok , R. Lord & G. Low (Eds.). New directions in language testing. Oxford: Pergamon.

Stump, T.A. (1978). Cloze and dictation tasks as predictors of intelligence and achievement scores. In J.W. Oller, Jr. & K. Perkins (Eds.). Language in education: testing the tests. Rowley, Massachusetts: Newbury House.

Weir, C.J. (1993). Understanding and developing language tests. London: Prentice Hall.

Widdowson, H.G. (1979). Explorations in applied linguistics. Oxford: Oxford University Press.

Cloze Tests as Predictors of Global Language Proficiency: A Statistical Analysis

South African Journal of Linguistics, 1998, 16 (1), 7-15.

Author: Raphael Gamaroff

Abstract

1.   Introduction

2.   Literature review of the cloze procedure as a test of reading

3.   Closure and deletion procedures in cloze

4.   Qualitative and quantitative methods

4.1 The limitations of error analysis

4.2 Quantitative methods in assessing levels of proficiency

5.   Method

5.1 Subjects

5.2 Instruments

6.   Results

7.   Discussion

8.   Bibliography

Abstract

The usefulness of the cloze test is examined for assessing levels of language proficiency. The methodology involves a statistical analysis of levels of proficiency using the group-differences method of construct validity. Pienaar’s (1984) cloze tests are used to assess the level of Grade 7 pupils, who represent a wide range of English proficiency. It is argued that although qualitative analysis is important, the role of statistical analysis is crucial in understanding the construct of language proficiency.

1. Introduction

The educational context of this article is Mmabatho High School in the North West Province of South Africa, where I spent over seven years (January 1980 to April 1987) as a teacher of second languages (English and French) and researcher in the learning of English as a second language. In January 1987, I administered, in collaboration with the Grade 7 teachers, a battery of English proficiency tests to find out the level of English proficiency of entrants to Grade 7 at the School, where English was used as the medium of learning and instruction. The battery consisted of two essay tests, two dictation tests (Gamaroff, forthcoming), two cloze tests, an error recognition test and a mixed grammar test. The cloze tests discussed in this article are part of this test battery. Pienaar’s (1984) “reading for meaning” cloze tests were used.

The methodology consists of a statistical analysis of levels of proficiency, where the importance of statistics in the assessment of language proficiency is emphasised.

2. Literature review of the cloze procedure as a test of reading

Cloze tests are deceptively simple devices that have been constructed in so many ways for so many purposes that an overview of the entire scope of the literature on the subject is challenging to the imagination not to mention the memory.

(Oller, 1973:106)

Since 1973 the literature on cloze has more than doubled, adding even more challenges to the imagination if not – thanks to the printed word – to the memory.

The aim of a cloze test is to evaluate (1) readability and (2) reading comprehension. The origin of the cloze procedure is attributed to Taylor (1953), who used it as a tool for testing readability. Of all the formulas of readability that have been devised, cloze tests have been shown, according to Geyer (1968), Weintraub (1968) and Oller (1973:106), to be the best indicators of readability. It is also regarded as a valid test of reading comprehension. Oller (1973:106) cites Bormuth (1969:265) who found a multiple correlation coefficient of .93 between cloze tests and other linguistic variables that Bormuth used to assess the difficulty of several prose passages. Bormuth (1969:265) maintains that cloze tests “measure skills closely related or identical to those measured by conventional multiple choice reading comprehension tests.”

Many standardised reading tests use cloze tests, e.g. the Stanford Proficiency Reading Test. Johnson and Kin-Lin (1981:282) believe that cloze is more efficient and reliable than reading comprehension tests, because it is easier to evaluate and does not, as in many reading comprehension tests, depend on long written answers for evaluation. (But it is also possible to use multiple choice reading tests; see Bormuth [1969] in the previous paragraph). Johnson and Kin-Lin’s implication is that although cloze and reading comprehension are different methods of testing, they both tap reading processes. Anderson (1976:1), however, maintains that as there is no consensus on what reading tests actually measure, all that can be said about a reading test is that it measures reading ability. On the contrary, far more can be said about reading: notions associated with reading are “redundancy utilization” (Weaver & Kingston, 1963), “expectancies about syntax and semantics” (Goodman, 1969:82) and “grammar of expectancy” (Oller, 1973:113). All these terms connote a similar process. This process involves the “pragmatic mapping” of linguistic structures into extralinguistic context (Oller, 1979:61). This mapping ability subsumes global comprehension of a passage, inferential ability, perception of causal relationships and deducing meaning of words from contexts (Schank, 1982:61). According to Bachman (1982:61):

[t]here is now a considerable body of research providing sound evidence for the predictive validity of cloze test scores. Cloze tests have been found to be highly correlated with virtually every other type of language test, and with tests of nearly every language skill or component.

Clarke (1983), in support of Bachman, is cautiously optimistic that the cloze procedure has a good future in reading research. Alderson (1979:225) who is less optimistic, maintains that

individual cloze tests vary greatly as measures of EFL proficiency. Insofar as it is possible to generalise, however, the results show that cloze in general relates more to tests of grammar and vocabulary (ELBA tests 5 and 6) than to tests of reading comprehension (ELBA test 7).

(The ELBA [English Language Battery] originates from the University of Edinburgh [Ingram, 1964, 1973].)

Johnson & Kin-Lin (1981) and Oller (1979), contrary to Alderson, found that a great variety of cloze tests correlates highly with reading tests.

Alderson (1979) also believes, as do Hughes (1981) and Porter (1978), that individual cloze tests produce different results, and that each cloze test “needs to be validated in its own right and modified accordingly” (Alderson, 1979:226). Such a view is contrary to the view of Johnson & Kin-Lin (1981) and Oller (1979), mentioned in the previous paragraph, that there is a high correlation between all kinds of cloze tests, indeed between cloze tests, dictation tests, essay tests and “low order” (Alderson, 1979) grammar tests, which indicates that cloze is a valid test of global, or general, language proficiency (see also Brown, 1983; Fotos, 1991; Hale et al., 1984; Oller, 1973; Stubbs & Tucker, 1974). However, research keeps bringing to light examples that show the difficulty involved in establishing cloze tests as valid tests of global proficiency, for example, the effect of text difficulty, or content, on cloze scores (Alderson, 1979; Piper and McEachern, 1988) and on reading proficiency, specifically in the English for Special Purposes situation (Alderson & Clapham, 1992). With regard to cultural content in cloze testing, Chihara, Sakurai and Oller (1989) found in their English cloze tests for Japanese speakers that minor adjustments, such as changing names from Joe or Nicholas to Japanese names, produced a gain of six per cent over tests that had not been modified (Oller, 1995).

In spite of these problems, the evidence is strong that cloze is a valid and reliable test of global proficiency (Oller & Jonz, 1994; Walker, Rattanavich & Oller, 1992). I shall return to the validity of the cloze test as an indicator of global proficiency in the discussion of the results at the end of the article.

3. Closure and deletion procedures in cloze

Closure is a pivotal concept in cloze theory. Closure does not merely mean filling in items in a cloze, but filling them in in a way that reveals sensitivity to intersentential context, which is what measures “higher-order skills” (Alderson, 1979:225). A cloze test that lacks sufficient closure would not be regarded as a good cloze test. Alderson (1979:225), however, maintains that “the cloze” is sentence-bound:

one must ask whether the cloze is capable of measuring higher-order skills. The finding in Alderson (1978) that cloze seems to be based on a small amount of context, on average, suggests that the cloze is sentence – or indeed clause – bound, in which case one would expect a cloze test to be capable of measuring, not higher-order skills, but rather much low-order skills…as a test, the cloze is largely confined to the immediate environment of a blank.

This means that there is no evidence that increases in context make it easier to complete items successfully. Oller (1976:354) maintains, contrary to Alderson, that subjects “scored higher on cloze items embedded in longer contexts than on the same items embedded in shorter segments of prose”. Oller used five different cloze passages and obtained similar results on all of them.

With regard to methods of deletion, Jacobs (1988:47) lists two basic methods of deletion: fixed deletion and rational deletion. In the fixed deletion method every nth word is deleted, where n usually ranges from five (believed to be the smallest gap permissible without making the recognition of context too difficult) to nine. The rational deletion method is not fixed but is based on “selective” deletion (Ingram, 1985:241). Pienaar’s (1984) “reading” tests, which are used in this article, are rational deletion cloze tests.
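
To make the two deletion methods concrete, here is a minimal sketch in Python (no part of Jacobs’s or Pienaar’s work; the function names, the passage and the choice of n are illustrative only):

# Fixed-ratio deletion blanks every nth word; rational deletion blanks only
# the word positions chosen by the test writer.
def fixed_ratio_cloze(text, n=7, start=7):
    """Delete every nth word, beginning at word number `start` (1-based)."""
    words = text.split()
    key = {}
    for i in range(start - 1, len(words), n):
        key[i + 1] = words[i]        # keep the deleted word for scoring
        words[i] = "____"
    return " ".join(words), key

def rational_cloze(text, positions):
    """Delete only the word positions (1-based) selected by the test writer."""
    words = text.split()
    key = {}
    for p in positions:
        key[p] = words[p - 1]
        words[p - 1] = "____"
    return " ".join(words), key

passage = ("Tabitha was a well-bred Siamese lady who lived with a good family "
           "in a shiny white house on a hill overlooking the rest of the town.")
mutilated_text, answer_key = fixed_ratio_cloze(passage, n=7)
print(mutilated_text)
print(answer_key)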

Alderson (1980:59-60) proposes that the rational deletion procedure should not be referred to as a “cloze” but as a “gap-filling” procedure. Such a proposal has been accepted by some researchers, e.g. Weir (1993:81), but not accepted by others, e.g. Bachman’s (1985) “Performance on cloze tests with fixed-ratio and rational deletions”, Maclean’s (1984) “Using rational cloze for diagnostic testing in L1 and L2 reading” and Markham’s (1985) “The rational deletion cloze and global comprehension in German.” There is nothing wrong with the proposal that the rational-deletion procedure be called a gap-filling test, if it remains nothing more than that – a proposal.

Alderson (1979:226) suggests that what he calls cloze tests (namely, every nth word deletion) should be abandoned in favour of “the rational selection of deletions, based upon a theory of the nature of language and language processing”. Thus, although Alderson proposes that the rational selection of items, his “gap-filling”, should not be called a cloze procedure, he still favours “gap-filling” tests over “cloze” tests. This superiority of the rational deletion method is supported by a substantial body of research, e.g. Bachman (1982) and Clarke (1979). However, it should be kept in mind that language hangs together, and thus the every-nth-word cloze test is also, in my view, a good test of global proficiency. In other words, whether one uses “fixed” deletions or “rational” deletions, both methods test global proficiency.

Having considered the arguments for the validity of the cloze test as a test of reading, it seems that cloze tests are valid tests of reading strategies, i.e. they can test long-range contextual constraints. One must keep in mind, however, that deletion rates, ways of scoring (e.g. acceptable words or exact words), and the types of passages chosen in terms of background knowledge and of discourse devices may influence the way reading strategies are manifested. But it is debatable whether one should make too much of these differences.
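
The difference between exact-word and acceptable-word scoring can be illustrated with a short, hypothetical scoring routine in Python. The routine and its names are mine, not Pienaar’s; the example items echo the answer keys given for Cloze Test 1 later in this article.

def score_cloze(responses, exact_key, acceptable_key=None, method="acceptable"):
    """Score one test taker's cloze responses.

    responses:      dict mapping item number to the word supplied
    exact_key:      dict mapping item number to the exact deleted word
    acceptable_key: dict mapping item number to a set of further acceptable words
    """
    acceptable_key = acceptable_key or {}
    score = 0
    for item, exact_word in exact_key.items():
        answer = responses.get(item, "").strip().lower()
        if answer == exact_word:
            score += 1
        elif method == "acceptable" and answer in acceptable_key.get(item, set()):
            score += 1
    return score

exact = {8: "wagged", 9: "claws"}
acceptable = {8: {"twitched", "waved", "lifted"}, 9: {"nails"}}
print(score_cloze({8: "waved", 9: "nails"}, exact, acceptable))        # 2
print(score_cloze({8: "waved", 9: "nails"}, exact, method="exact"))    # 0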

4. Qualitative and quantitative methods

Before dealing with the method of the investigation, it would be useful to say something about construct validity and the relationship between quantitative methods and qualitative methods in language proficiency testing. Quantitative methods deal with data that involve statistics, while qualitative measurement (one doesn’t really “measure” qualitative data, but analyses them) deals with data that do not involve statistics, e.g. an error analysis.

4.1 The limitations of error analysis

Without some knowledge of the subject’s ability and previous output – revealed through a longitudinal study of different outputs – a valid interpretation of errors is difficult to achieve. It is usually less difficult to infer processes from the production of grammatical errors than processes from the production of lexical items, where there exists no traditional corpus of errors. On a few occasions the subjects gave the same wrong answer to a cloze item, but usually they gave different wrong answers. When there are similar wrong answers, it may be easier to find a principled explanation. However, when there is a variety of wrong answers to the same item, I don’t think that much is to be gained by doing a “linguistic” analysis, owing to the mega-encyclopaedic range of possible interpretations. Testers may, though, gain some insight from a test taker’s self-editing or from interviews with individual test takers after the test. However, owing to time constraints in the testing/teaching situation, it is often not possible to hold interviews. In my testing situation at Mmabatho High School, it would have required at least half an hour with each subject on each cloze passage. Further, to ask 12-year-old L2 English learners – this would be true of many L1 English speakers as well – to explain in a second language, or even in their mother tongue, the information-processing strategies they used in a test or in any other kind of learning behaviour is fraught with problems. So, even if one could interview test-takers after a test – whether in the second language or the mother tongue – one can never be sure whether they have understood the mental processes involved. The interpretation/evaluation problem for raters is not only distinguishing between the processes and products of the test-taker, but between the rater’s own processes and products. The process-product distinction is far more useful in what it should be saying about where process and product meet than what it says about where they separate (Besner, 1985:9).

A pitfall of error analysis is that the more satisfying the explanation, the foggier the idea may be of what is going on in the test taker’s head. Thus, in an error analysis it is indeed possible to label the error in purely linguistic terms, but the more important diagnostic issue of why specific errors are committed remains largely a mystery. Raters are like inquisitive insects wandering around in a “gigantic multi-dimensional cobweb” in which every item requiring an interpretation is attached to a host of others (Aitchison, 1987:72).

4.2 Quantitative methods in assessing levels of proficiency

Although qualitative methods in language testing are useful, qualitative analysis without quantitative measurement (statistics, or psychometrics) would be of limited value (Gamaroff, 1996, 1997a).

According to Nunan (1992:20), there exists a dichotomy between objective-quantitative and subjective-qualitative research. This dichotomy is understandable owing to the danger of the individual getting lost in the thicket of the group, or norm. However, the psychometrist, whose business is norm-referenced testing, is not a “hocus-pocus” scientist, as Spolsky (1985:33-34) thinks, because any interpretation of test results by comparison with criteria must presuppose a criterion level. And the only way to establish what learners/test takers do is to base the criterion on the average performance of a group (Rowntree, 1977:185). Without a comparison of levels of proficiency, it is not possible to establish the construct validity of a test. That is why the group-differences approach is so important in construct validation. This approach is now explained:

The aim of testing is to discern levels of ability. If one uses reading ability as an example of a construct, one would hypothesise that people with a high level of this receptive (but not at all passive) ability would have a good command of sentence structure, cohesion and coherence; while people with a low level of this ability would have a poor command of these. Tests are then administered, e.g. cloze tests, and if it is found that there is a significant difference between a group of high achievers and a group of low achievers, this would be valid evidence for the existence of the construct. In general, second language learners are relatively less competent than mother tongue speakers. If a test fails to discriminate between low-ability and high-ability learners, there are three possible reasons for this:

1. The test may be too easy or too difficult.

2. The theory undergirding the construct is faulty.

3. The test has been inaccurately administered and/or scored, that is, it is unreliable.
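
The logic of the group-differences check can be sketched as follows in Python. The scores are invented for illustration, and the choice of Welch’s t-test is mine; the point is simply that a significant difference in the expected direction is evidence for the construct, while a non-significant one points to one of the three reasons above.

from scipy import stats

# Hypothetical cloze scores (out of 10) for a high-ability and a low-ability group.
high_group = [9, 8, 10, 7, 9, 8, 9, 10, 8, 7]
low_group = [4, 5, 3, 6, 4, 5, 2, 6, 5, 4]

t, p = stats.ttest_ind(high_group, low_group, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.4f}")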

5. Method

5.1 Subjects

The sample consists of 80 subjects, drawn from the “ethnic” groups shown in Table 1. For the purposes of my statistical analysis, I use “L1” and “L2” to distinguish between groups in terms of whether they take English as a first-language or as a second-language course subject at the School, and not in terms of whether English is their mother tongue, or main language, which are the usual meanings of L1 and L2. (When I do use L1 and L2 in their usual meanings, I shall make it clear that I am doing so).

Table 1. Detailed Composition of the Sample


The vast majority of the L1 group originated from Connie Minchin Primary School, Mmabatho, which was the main feeder school for Mmabatho High School during 1980 to 1990. Some of the Tswanas in the L1 group originated from ex-DET (Department of Education and Training) schools. The L2 group of 38 subjects originated from 29 different schools in the North West Province; thus, from each of these schools there was, in general, only one entrant to Mmabatho High School.

Entrants decided themselves whether they belonged to the L1 or the L2 group, i.e. whether they wanted to take English as a first- or a second-language subject. I shall argue that classifications of such a nature should be based on valid and reliable test scores and not on the classifications of the entrants themselves.

In South Africa, it is often not clear which individuals or groups use their mother tongue as their main language. There are two possible reasons for this: 1. several languages may be spoken at home, and because one or both parents speak more than one language, the mother tongue of one of the parents, usually the more powerful parent, begins to predominate at the age of about four or five years (this is often the father’s tongue); and 2. uprooting caused by adverse social and economic circumstances. In such circumstances children may be removed from their mother tongue environment and placed with other families that speak a different language. For example, a seven-year-old Xhosa child from the Eastern Cape might be placed with a Tswana family in the North West Province. Tswana then becomes the replacement language. It is possible that such a pupil might have only limited proficiency in both Tswana and English. The uprooting may have occurred not only among black families but among Coloured and Indian families as well. I shall deal with the replacement language issue in the discussion of the results.

5.2 Instruments

Two cloze tests from Pienaar’s (1984) pilot survey “Reading for meaning” are used. These tests have already been used in many schools in the North West Province and have produced a solid body of results. Many cloze tests consist of 50 deletions, because this number is thought to ensure high reliability and validity. Pienaar’s cloze tests each consist of only 10 items. I shall examine whether tests with so few deletions can be regarded as valid and reliable.
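
Neither Pienaar nor this article appeals to the Spearman-Brown prophecy formula, but it is the standard way of relating test length to reliability and helps to show why 50 deletions are usually preferred. A sketch in Python, with an assumed reliability of .79 for a 10-item test:

def spearman_brown(r, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`."""
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# Lengthening a 10-item cloze with reliability .79 to 50 items (factor of 5)
# projects a reliability of about .95.
print(round(spearman_brown(0.79, 5), 2))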

The question was whether to use the same tests for these two different language groups (L1 and L2) but to give each group its own norms. I decided against this because when the same syllabus (except for the language subjects) is used by L1 and L2 pupils, as was the case at Mmabatho High School, both groups have to contend with the same academic demands and content; and it is the effect that low scores on these cloze tests have on academic achievement that Pienaar and I were ultimately concerned with.

In a review of Pienaar (1984), Johanson (1988:27) refers to the “shocking” low reading levels in many North West Province (ex-Bophuthatswana) schools revealed by Pienaar’s survey. Pienaar’s major finding was that 95% of pupils (Grade 3 to Grade 12) in the North West Province were “at risk”, i.e. they couldn’t cope with the academic reading demands made on them (See also Macdonald, 1990a, 1990b).

Pienaar’s (1984) tests comprise five graded levels – “Steps” 1 to 5, where each Step consists of four short cloze passages (Form A to Form D) with 10 blanks in each passage (Pienaar, 1984:41):

Step 1 corresponds to Grades 3 and 4 (Stds 1 and 2) for English first language and to Grades 5 to 7 for English second language. (Pienaar is using the term first language (L1) in the usual way, i.e. a mother tongue or a language a person knows best).

Step 2 corresponds to Grades 5 and 6 for first language and to Grades 7 to 8 for second language.

Step 3 corresponds to Grades 7 and 8 for first language and to Grades 9 to 11 for second language.

Step 4 corresponds to Grades 9 and 10 for first language and to Grades 11 and 12 for second language.

Step 5 corresponds to Grades 11 and 12 for first language and to Grade 12 + for second language.

If one Step proves too easy or too difficult for a specific pupil, a higher or a lower Step could be administered. For example, if Step 2 is too difficult, the pupil can be tested on Step 1. In this way it is possible to establish the level of English proficiency for each individual pupil. It must be stressed that the purpose of Pienaar’s cloze tests is to serve as predictors of general academic achievement and English achievement.
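
The stepping procedure just described amounts to a simple placement rule. The fragment below is a hypothetical rendering of it in Python; the cut-off scores are mine and are not specified by Pienaar.

def next_step(current_step, score, too_difficult=3, too_easy=9):
    """Move one Step down if the test was too difficult, one Step up if it
    was too easy, otherwise stay at the current Step (Steps run from 1 to 5)."""
    if score <= too_difficult and current_step > 1:
        return current_step - 1
    if score >= too_easy and current_step < 5:
        return current_step + 1
    return current_step

print(next_step(2, 2))    # Step 2 too difficult: test the pupil on Step 1
print(next_step(2, 10))   # Step 2 fully mastered: try Step 3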

As shown above, Pienaar built into his tests an adjustment that distinguishes between L1 and L2 pupils; e.g. Step 2 is meant for Grades 5 and 6 L1 pupils and for Grades 7 to 9 L2 pupils.

It is only after the test has been performed on the “test-bench” (Pienaar, 1984:5) that it is possible to decide whether the test is too easy or too difficult (see also Cziko, 1982:368). If there are L1 and L2 subjects in the same sample, as is the case in this investigation, one might need to consider whether the norms of the L1 and the L2 groups should be separated or interlinked and how to classify precisely the L1 and L2 subjects used for the creation of norms (Baker, 1988:399). At Mmabatho High School, entrants decided themselves whether they wanted to take English first language as a course subject (designated as L1 in this article) or English second language as a course subject (designated as L2 in this article). The point of the proficiency tests, e.g. the cloze test, was to compare the former kind of classification with the test score classification.

According to Pienaar (1984), a perfect score on a cloze test indicates that the pupil has fully mastered that particular level. A score of 50% would indicate that the pupil is not ready for the next stage. Pienaar’s (1984) view is that pupils are expected to do well before they are permitted to move on to the next stage, e.g. a pupil with 50% for a Grade 7 cloze test should be in a lower grade. Recall that Pienaar is claiming that his tests are valid predictors of academic achievement.

Pienaar (1984:41) maintains that L2 pupils, i.e. ESL pupils, are generally two to three years behind English L1 pupils in the acquisition of English proficiency, and that there is often also a greater age range in the English second language classes, especially in the rural areas.

The tests were standardised in 1982 on 1068 final year JSTC (Junior Secondary Teacher’s Certificate) and PTC (Primary Teacher’s Certificate) students from nine colleges affiliated to the University of the Transkei. These standardised results became the table of norms for the tests (Pienaar, 1984:9). Below are the weighted mean scores achieved by the students of the nine colleges (Pienaar, 1984:10):

Weighted means: Step 1: 67%; Step 2: 53%; Step 3: 37%; Step 4: 31%; Step 5: 24%

Most of the colleges performed similarly on all five Steps. These results confirmed the gradient of the difficulty of the various steps.
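
“Weighted” here presumably means that each college’s mean was weighted by the number of its students. The arithmetic, under that assumption and with invented enrolments and Step 2 means (the actual weights are not given in the text), looks like this:

# Hypothetical Step 2 means and enrolments for three of the nine colleges.
step2_means = {"College A": 0.58, "College B": 0.49, "College C": 0.55}
enrolments = {"College A": 150, "College B": 90, "College C": 120}

weighted_mean = (sum(step2_means[c] * enrolments[c] for c in step2_means)
                 / sum(enrolments.values()))
print(f"Weighted Step 2 mean: {weighted_mean:.0%}")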

During 1983 a major part of the test instrument was administered to a smaller group of college students selected from the original large group. No statistically significant difference between the scores of the two administrations was found, which confirmed the test-retest reliability of the instrument (Pienaar, 1984:9).
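
A comparison of this kind is commonly made with a paired t-test (whether this is the test Pienaar used is not stated). A minimal sketch on invented scores for the same students on the two administrations:

from scipy import stats

first_administration = [6, 5, 7, 4, 8, 5, 6, 7]
second_administration = [6, 6, 7, 5, 7, 5, 6, 8]

t, p = stats.ttest_rel(first_administration, second_administration)
print(f"t = {t:.2f}, p = {p:.3f}")   # a large p is consistent with stable scores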

The tests underwent ongoing item analysis and refinement. By the time the final version was submitted to school pupils in the Mmabatho/Mafikeng area in 1984, 30% of the items had been revised. As a result of continuous item analysis, a further 18% of the items were revised (Pienaar, 1984:9).

An important point is that these aforementioned results claim to represent the reading ability of college students, who are supposed to be more proficient in English than school pupils. Final year student teachers only obtained a score of between 40% and 60% on Step 2 – see weighted scores above. (Step 2 is used in this investigation for Grade 7 pupils). These low scores indicate that the reading level of the student teachers, who were to start teaching the following year, was disturbingly no higher than the level of many of the pupils they would eventually teach.

In the test battery I used two tests – Form B and Form D – of Step 2. (Pienaar used four tests per Step). I shall show that two tests are sufficient to distinguish levels of proficiency. The two tests are presented below with the practice exercise:

Pienaar’s Practice exercise

(Pienaar does not provide the answers for this practice exercise. Possible answers are provided in brackets).

The 1 (rain) started falling from the sagging black 2 (clouds) towards evening. Soon it was falling in torrents. People driving home from work had to switch their 3 (headlights) on. Even then the 4 (cars, traffic) had to crawl through the lashing rain, while the lightning flashed and the 5 (thunder) roared.

Cloze Test 1: Form B Step 2 (Pienaar, 1984:59):

A cat called Tabitha

Tabitha was a well-bred Siamese lady who lived with a good family in a shiny white house on a hill overlooking the rest of the town. There were three children in the family, and they all loved Tabitha as much 1 she loved them. Each night she curled up contentedly on the eldest girl’s eiderdown, where she stayed until morning. She had the best food a cat could possibly have: fish, raw red mince, and steak. Then, when she was thirsty, and because she was a proper Siamese and did 2 like milk, she lapped water from a blue china saucer.

Sometimes her mistress put her on a Cat show, and there she would sit in her cage on 3 black padded paws like a queen, her face and tail neat and smooth, her black ears pointed forward and her blue 4 aglow.

It was on one of these cat shows that she showed her mettle. The Judge had taken her 5 of her cage to judge her when a large black puppy ran into the hall. All the cats were furious and snarled 6 spat from their cages. But Tabitha leapt out of the judge’s arms and, with arched 7 and fur erect, ran towards the enemy. The puppy 8 his tail and prepared to play. Tabitha growled, then, with blue eyes flashing, she sprang onto the puppy’s nose. Her 9 were razor-sharp, and the puppy yelped, shook her off, and dashed for the door. Tabitha then stalked back down the row of cages to where she had 10 the judge. She sat down in front of him and started to preen her whiskers as if to say, “Wait a minute while I fix myself up again before you judge me.” She was quite a cat, was Tabitha!

Answers. (The words in round brackets are Pienaar’s suggested alternative answers. The words in square brackets are my suggested alternative answers):

1. as; 2. not; 3. her [four, soft]; 4. eyes (eye); 5. out; 6. and; 7. back (body); 8. wagged, twitched (waved, lifted); 9. claws (nails); 10. left (seen, met).

Cloze Test 2: Form D Step 2 (Pienaar, 1984:61):

A dog of my own

When I was ten all 1 wanted was a dog of my own. I yearned for a fluffy, fat, brown and white collie puppy. We already had two old dogs, but my best friend’s pet collie had 2 had seven fluffy, fat, brown and white puppies, and I longed for one with all my heart. However, my mother said no, so the seven puppies were all sold. I had horses, mice, chickens and guinea-pigs, and as my 3 said, I loved them all, but I wasn’t so keen on finding them food. Since she had five children to look after, it made her angry to 4 hungry animals calling, so she said crossly, “No more dogs.”

This didn’t stop me wanting one though, and I drew pictures of collie dogs, giving 5 all names, and left them lying around where she would find them. As it was 6 Christmas, I was sure that she would relent and give me a puppy for Christmas.

On Christmas morning I woke up very excited, 7 the soft little sleepy bundle that I wanted at the bottom of the bed wasn’t there. My mother had given me a book instead. I was so disappointed that I cried to myself, yet I tried not to 8 her how sad I was. But of course she noticed.

Soon after that my father went off to visit his brother and when he came back he brought me a puppy. Although it 9 a collie it was podgy and fluffy, and I loved him at once. My mother saw that I looked after him properly and he grew up into a beautiful grey Alsatian. We were good friends for eleven happy 10 before he went to join his friends in the Animals’ Happy Hunting Ground.

Answers.

1. I; 2. just, recently; 3. mother (mummy, mum, mom); 4. hear; 5. them; 6. near (nearly, nearer, close to); 7. but, however (though); 8. show (tell); 9. wasn’t (was not); 10. years.

6. Results

The results involve the following statistical data: parallel reliability, means, standard deviations, z-scores, and frequency distributions.

The Pearson r correlation formula measures the parallel reliability between two separate, but equivalent, tests. Henning (1987:82) explains that, in general, “the procedure for calculating reliability for using parallel forms is to administer the tests to the same persons at the same time and correlate the results as indicated in the following formula”:

r_tt = r_A,B (Pearson r formula)

where r_tt = the reliability coefficient, and r_A,B = the correlation of form A (in our case, Cloze Test 1) with form B (in our case, Cloze Test 2) of the test when administered to the same people at the same time. The Pearson r for the two cloze tests in this investigation was .79.
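
The calculation of r_A,B, and of the z-scores mentioned above, can be sketched as follows in Python. The scores are invented, since the raw data appear only in the tables and figures, so the sketch will not reproduce the value of .79 reported above.

import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation between paired scores."""
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

# Hypothetical scores on Cloze Test 1 (Form B) and Cloze Test 2 (Form D).
form_b = [7, 5, 9, 4, 8, 6, 3, 7]
form_d = [6, 5, 8, 5, 9, 6, 4, 6]

print(f"r_tt = {pearson_r(form_b, form_d):.2f}")   # parallel-forms reliability

# z-scores express each pupil's combined total relative to the group.
totals = [b + d for b, d in zip(form_b, form_d)]
m, s = statistics.mean(totals), statistics.pstdev(totals)
z_scores = [round((t - m) / s, 2) for t in totals]
print(z_scores)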

The summary statistics of the L1 and L2 groups are reported in Table 2, followed by the frequency distributions of these two groups in Figures 1 and 2.

Table 2. Summary statistics of the L1 and L2 Groups


Figure 1. Frequency Distribution of Test 1


Figure 2. Frequency Distribution of Test 2


A perfect score on a cloze test indicates that the pupil has fully mastered that particular level (Pienaar, 1984). Thus a score of 70% or lower would indicate that the pupil is not ready for the next stage. In the light of these comments, I examine the individual scores of the L1 Group in Table 3, because one would expect this group to get a score of at least 7, given that they were taking English as a first-language course subject.

Table 3. Cloze Scores (in Ascending Order) of Ethnic Subgroups within the L1 Group

(C = Coloured; I = Indian; Tsw = Tswana; W = White; R = Replacement language)


I first discuss the scores of the Coloureds and Indians in Table 3. The data in Table 3 is summarised in Figure 3.

Figure 3. Summary data of Coloureds and Indians

An appreciable number of Coloureds and Indians use English as a replacement language, which is a language that becomes more dominant than the mother tongue, usually at an early age, but is seldom fully mastered. The situation with Coloured and Indian children is that many of them speak a little of both English and either Afrikaans or an Indian language. A swing might occur towards English, and it might seem that Afrikaans or the Indian language has been replaced by English. What often happens instead is that basic English skills are never fully mastered: the result is a hybrid of English and Afrikaans, or English and an Indian language. A difficulty with replacement language pupils is that cognitive development could be inhibited when basic language skills have not been mastered at an early age in one language or, in some bilingual situations, in two languages. Now consider the scores of six points and below in Figure 3. Of the 13 Indian and Coloured subjects, eight obtained a score of six or below. These were probably replacement language subjects, because it is unlikely that mother tongue subjects, as a rule, would obtain such low scores on a test that was pitched at the L2 level. Indeed the majority of the L1 Tswana subjects, who were mother tongue speakers of Tswana, obtained scores above six. Eight of the Tswana subjects who had a score of six or less changed from English first language as a course subject in Grade 7 (the year in which the cloze tests were given) to English second language as a course subject after Grade 7.

7. Discussion

The main issue in the classification of subjects/pupils is that one’s classification should not be a priori based (i.e. based on pupils’ or teachers’ preconceptions) but empirically based (Baker, 1988:408) on valid and reliable norm-referenced tests. Such a solution may clash with the “outcomes” approach to testing, which tends to eschew comparisons of scores between individuals or between groups (HSRC, 1995a, 1995b).

Also, it might be argued that measuring the difference in means between groups is not useful because it apportions equivalent scores to each item, and accordingly does not take into account the relative level of difficulty of items as would be done in an item analysis. I suggest that the relative difficulty of items is not important in a language proficiency test, but it is indeed important in a diagnostic test, which has remediation as its ultimate purpose (as in Maclean, 1984, mentioned in the literature review above). With regard to proficiency tests, one is concerned with overall, or general, or global proficiency that is specified for a certain level (e.g. elementary, intermediate and advanced) for specific people at a specific time and in a specific situation. These levels are determined by theory. Within each level there are difficult and easy items. To attain a specific level of proficiency one has to get most of the items right – the difficult and the easy ones. In sum, the different bits of language have to hang together.
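
By contrast, the basic statistic of an item analysis is item facility, the proportion of the group answering each item correctly. The fragment below illustrates the idea on an invented response matrix; it is not an analysis of the present data.

# Rows are pupils, columns are items 1-5 (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 1],
]

n_pupils = len(responses)
item_facility = [sum(col) / n_pupils for col in zip(*responses)]
print(item_facility)   # low values flag the difficult items worth targeting in remediation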

Statistics will tell us a great deal about the level of proficiency of individuals and groups, while a diagnosis in the form of an error analysis will help in ascertaining items for remediation.

As the results show, cloze tests with only ten deletions distinguish clearly between levels of proficiency, which is an important factor in the construct validity of a test. And this brings me back to the validity of these cloze tests, or of any cloze test, as an indicator of global, or general, proficiency, as the “One Best Test” (Alderson, 1981). The “One Best Test” notion is closely related to the question of whether language proficiency consists of a unitary factor, analogous to the g factor in intelligence, or of a number of independent factors. The debate has gone on for at least three decades, receiving prominence in the work of Carroll (1961, 1983) and Oller (1979, 1983, 1983a). This question is of immense practical importance in language testing. Alderson (1981:190) discusses the “One Best Test” argument and concludes that

regardless of the correlations, and quite apart from any consideration of the lack of face validity of the One Best Test, we must give testees a fair chance by giving them a variety of language tests, simply because one might be wrong: there might be no Best Test, or it might not have the one we chose to give, or there might not be one general proficiency factor, there may be several.

It would be very difficult to find the “One Best” or “perfect” test. The problem has to do not only with construct validity but also with face validity, because even if one managed to find a “perfect” or “One Best” test – one cloze test with 10 items! – it would not find general acceptance, owing to the fact that it would lack face validity, i.e. it would not look at all as if it could predict global proficiency. Decisions based on testing often affect people’s lives; therefore, one should use a variety of tests.

The ultimate interest in language proficiency lies in its effect on academic achievement. There is little dispute that low English proficiency, where it is the medium of learning, goes together with educational failure (Gamaroff, 1995a). This does not necessarily mean, of course, that low English proficiency is the direct or only cause of educational failure. There is much evidence to indicate that low proficiency must be partly responsible for academic failure. But we are also aware that academic failure is much more than language failure, and indeed than second language failure, i.e. failure in using a main language or a second language as a medium of learning, for example English (Clayton, 1996; Gamaroff, 1995, 1996, 1997b; Winkler, 1997).

In this investigation, neither Pienaar nor I have provided hard data showing that these cloze tests are indeed valid predictors of academic achievement: so why should one be persuaded that they are, just because Pienaar says so? The absence of such data in Pienaar (1984), however, does not detract from the value of Pienaar’s cloze tests as a measure of language proficiency. After all, Pienaar’s (1984) monograph was only a pilot survey.

In unpublished research, I did a longitudinal study (Grade 7 to Grade 12) of these cloze tests as predictors of English achievement and general academic achievement (aggregate scores) with the same sample of subjects used in this article. The results of the longitudinal study are intended for publication in the near future.

8. Bibliography

Aitchison, J. 1987. Words in the mind: An introduction to the mental lexicon. Oxford: Basil Blackwell.

Alderson, J.C. 1978. A study of the cloze procedure with native and non-native speakers of English. Unpublished Ph.D. Dissertation. Edinburgh: University of Edinburgh.

Alderson, J.C. 1979. The cloze procedure and proficiency in English as a foreign language, TESOL Quarterly, 13:219-227.

Alderson, J.C. 1980. Native and nonnative speaker performance on cloze tests, Language Learning, 30(1):59-77.

Alderson, J.C. 1981a. Report of the discussion on general language proficiency. In: Alderson, J.C. & Hughes, A. Issues in language testing: ELT Documents III. The British Council.

Alderson, J.C. and Clapham, C. 1992. Applied linguistics and language testing: A case study of the ELTS test. Applied Linguistics, 13(2):149-167.

Anderson, J. 1976. Psycholinguistic experiments in foreign language testing. Queensland: University of Queensland Press.

Bachman, L.F. 1982. The trait structure of cloze test scores, TESOL Quarterly, 16(1):61-70.

Bachman, L.F. 1985. Performance on cloze tests with fixed-ratio and rational deletions, TESOL Quarterly, 19(3).

Baker, C. 1988. Normative testing and bilingual populations. Journal of Multilingual and Multicultural Development, 9(5):399-409.

Besner, N. 1985. Process against product: A real opposition. English Quarterly, 18(3):9-16.

Bormuth, J. 1964. Mean word depth as a predictor of comprehension difficulty, California Journal of Educational Research, 15:226-231.

Brown, J.D. 1983. A closer look at cloze: Validity and reliability. In Oller, J.W., Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.

Carroll, J.B. 1961. Fundamental considerations in testing for English language proficiency of foreign language students. Washington, D.C: Center for Applied Linguistics.

Carroll, J.B. 1993. Human cognitive abilities: A survey of factor analytic studies. Cambridge: Cambridge University Press.

Chihara, T., Sakurai, T. & Oller, J.W. Jr. 1989. Background and culture as factors in EFL reading comprehension. Language Testing, 6(2):143-151.

Clark, J.L.D. 1983. Language testing: Past and current status – Directions for the future, Language Testing, 64(4):431-443.

Clayton, E. 1996. Is English really the culprit? Investigating the content versus language distinction. Per Linguam, 12(1):24-33.

Corder, S.P. 1981. Error analysis and interlanguage. Oxford: Oxford University Press.

Cziko, G.A. 1982. Improving the psychometric, criterion-referenced, and practical qualities of integrative testing, TESOL Quarterly, 16(3):367-379.

Fotos, S. 1991. The cloze test as an integrative measure of EFL proficiency: A substitute for essays on college entrance examinations. Language Learning, 41(3):313-336.

Gamaroff, R. 1995a. Affirmative action and academic merit. Forum, 1(1). Journal of the World Council of Curriculum Instruction, Region 2, Africa South of the Sahara, Lagos.

Gamaroff, R. 1995b. Solutions to academic failure: The cognitive and cultural realities of English as a medium of instruction among black learners. Per Linguam, 11(2):15-33.

Gamaroff, R. 1996. Is the (unreal) tail wagging the (real) dog?:

Understanding the construct of language proficiency. Per Linguam, 12(1):48-58.

Gamaroff, R. 1997a (Forthcoming). Paradigm lost, paradigm regained: Statistics in language testing. Journal of the South African Association of Language Teaching (SAALT).

Gamaroff, R. 1997b. Language as a deep semiotic system and fluid intelligence in language proficiency. South African Journal of Linguistics, 15(1):11-17.

Gamaroff, R. Forthcoming. Dictation as a test of communicative proficiency, International Review of Applied Linguistics.

Geyer, J.R. 1968. Cloze Procedure as a predictor of comprehension in secondary social studies materials. Olympia, Washington: State Board for Community College Education.

Goodman, K.S. 1969. Analysis of oral reading miscues: Applied psycholinguistics, Reading Research Quarterly, 5:9-30.

Hale, G.A., Stansfield, C.W. & Duran, R.P. 1984. TESOL Research Report 16. Princeton, New Jersey: Educational Testing Service.

Henning, G. 1987. A guide to language testing. Rowley, Massachusetts: Newbury House.

HSRC. 1995. Ways of seeing the National Qualifications Framework. Pretoria: Human Sciences Research Council.

HSRC. 1996. Language assessment and the National Qualifications Framework. Pretoria: Human Science Research Council Publishers.

Hughes, A. 1981. Conversational cloze as a measure of oral ability, English Language Teaching Journal, 35(2):161-168.

Ingram, E. 1964. English Language Battery (ELBA). Edinburgh: Department of Linguistics, University of Edinburgh.

Ingram, E. 1973. English standards for foreign students, University of Edinburgh Bulletin, 9:4-5.

Ingram, E. 1985. Assessing proficiency: An overview on some aspects of testing. In Hyltenstam, K. & Pienemann, M. (Eds.). Modelling and assessing second language acquisition. Clevedon, Avon: Multilingual Matters Ltd.

Jacobs, B. 1988. Neurobiological differentiation of primary and secondary language acquisition, Studies in Second Language Acquisition, 33:247-52.

Jeffery, C.D. 1990. The case for grammar: Opening it wider, South African Journal of Higher Education, Special edition.

Johnson, F.C. & Kin-Lin, C.W.L. 1981. The interdependence of teaching, testing, and instructional materials. In Read, J.A.S. (Ed.). Directions in language testing. Singapore: Singapore University Press.

Macdonald, C.A. 1990a. Crossing the threshold into standard three in black education: The consolidated main report of the Threshold Project. Pretoria: Human Sciences Research Council.

Macdonald, C.A. 1990b. English language skills evaluation (A final report of the Threshold Project). Report Soling-17. Pretoria: Human Sciences Research Council.

Maclean, M. 1984. Using rational cloze for diagnostic testing in L1 and L2 reading, TESL Canada Journal, 2:53-63.

Markham, P. 1985. The rational deletion cloze and global comprehension in German, Language Learning, 35:423-430.

Oller, J.W., Jr. 1973. Cloze tests of second language proficiency and what they measure, Language Learning, 23(1):105-118.

Oller, J.W., Jr. 1976. Cloze, discourse, and approximations to English. In Burt, K. & Dulay, H.C. (Eds.). New directions in second language learning, teaching and bilingual education. Washington, D.C.: TESOL.

Oller, J.W., Jr. 1979. Language tests at school. London: Longman.

Oller, J.W., Jr. 1983. A consensus for the 80s. In: Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.

Oller, J.W., Jr. 1983a. “g”: What is it? In Hughes, A. and Porter, D. (Eds.). Current developments in language testing. London: Academic Press.

Oller, J.W., Jr. 1983b. Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.

Oller, J.W. Jr. 1995. Adding abstract to formal and content schemata: Results of recent work in Peircean semiotics. Applied Linguistics, 16(3):273-306.

Oller, J.W. Jr. & Jonz, J. (Eds.). 1994. Cloze and coherence. Cranbury, N.J.: Bucknell University Press.

Pienaar, P. 1984. Reading for meaning: A pilot survey of (silent) reading standards in Bophuthatswana. Mmabatho: Institute of Education, University of Bophuthatswana.

Piper, T. & McEachern, W.R. 1988. Content bias in cloze as a general language proficiency indicator. English Quarterly, 21(1):41-48.

Porter, D. 1978. Cloze procedure and equivalence, Language Learning, 28(2):333-41.

Schank, R.C. 1982. Reading and understanding: Teaching from the perspective of artificial intelligence. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Spolsky, B. 1985. The limits of authenticity in language testing. Language Testing, 2:31-40.

Stubbs, J. & Tucker, G. 1974. The cloze test as a measure of English proficiency, Modern Language Journal, 58:239-241.

Taylor, W. 1953. Cloze procedure: A new tool for measuring readability, Journalism Quarterly, 30:414-438.

Walker, R., Rattanavich, S. & Oller, J.W. Jr. 1992. Teaching all the children to read. Buckingham, England: Open University Press.

Weaver, W.W. & Kingston, A.J. 1963. A factor analysis of the Cloze procedure and other measures of reading and language ability, Journal of Communication, 13:252-261.

Weintraub, S. 1968. The cloze procedure, The Reading Teacher, 6:21, 567, 569, 571, 607.

Weir, C.J. 1993. Understanding and developing language tests. London: Prentice Hall.

Winkler, G. 1997. The myth of the mother tongue: Evidence from Maryvale College, Johannesburg. South African Journal of Applied Language Studies, 5(1):29-39.