Chapter 3: Sampling, and Structure and Administration of the English Proficiency Tests

3.1 Introduction

3.2 Sampling procedures for the selection of subjects

3.2.1 The two main groups of the sample: First Language (L1) and second Language (L2) groups

3.3 Structure and administration of the English proficiency tests

3.3.1 The cloze tests

3.3.1.1 Theoretical overview

3.3.1.2 The cloze tests used in the study

3.3.2 The essay tests

3.3.2.1 Theoretical overview

3.3.2.2 The essay tests used in the study

3.3.3  Error recognition and mixed grammar tests

3.3.3.1 Theoretical overview

3.3.3.2 Error recognition and mixed grammar tests used in the study

3.3.4 The dictation tests

3.3.4.1 Introduction

3.3.4.2 Theoretical overview

3.3.4.3 The dictation tests used in the study

3.3.4.4 Presentation of the dictation tests

3.3.4.5 Method of scoring of the dictation tests

3.4 Summary of Chapter 3

3.1 Introduction

This chapter describes the sample of subjects and sampling procedures (of subjects and tests) and provides a detailed theoretical overview and description of the battery of English proficiency tests.

The sample of subjects consists of two main groups: First Language (L1) and Second Language (L2). A major issue in this study is the distinction between L1 and L2 levels of language proficiency. This distinction has become controversial in South Africa, where more and more applied linguists and educationists argue that it should be jettisoned. (I discuss specific authors on this issue in Chapter 6). I use the labels L1 and L2 slightly differently from what is normally meant by them; the way they are used is explained shortly.

A literature review, giving an overview of the relevant theoretical issues, is provided for each of the test methods used. Each literature review is followed by a detailed description of the structure and administration of the specific tests.

3.2 Sampling procedures for the selection of subjects

The sampling procedures form a crucial part of the method rationale and are described in detail. The crucial issue is how to classify the subjects into different levels of proficiency.

There were 90 entrants to Grade 7 in January 1987 who also sat the Grade 7 end-of-year examinations. Because the battery of English proficiency tests was administered during the first week of the school year, there was some absenteeism during the three-day testing period. Thus, not all of the learners did all of the tests, and four learners did not do any of the tests. These four learners were not included in the sample. The other 86 learners (44 boys, 42 girls) comprise the sample of subjects.

3.2.1 The two main groups of the sample: First language (L1) and second language (L2) groups

Figures 3.1 and 3.2 provide a clear picture of the details of the L1 and L2 groups. The reader may want to consult these figures in conjunction with the verbal descriptions of the sample below.

At the school there were mother-tongue speakers from diverse linguistic backgrounds, e.g. Tswana, Sotho, English, Afrikaans and some expatriates, e.g. Greek and Filipino. (The exact numbers are provided in Table 3.1, which we shall come to later on). About two thirds were Tswana mother-tongue speakers. All learners had to take English as the medium of instruction at the School.

The Tswana speakers could choose from the following language subject combinations:

Tswana as a First Language and English as a Second Language. (After 1987 Afrikaans was also offered as a first language).

Tswana as a First Language and English as a First Language.

English as a First Language and Afrikaans as a Second Language. Tswana speakers never took this option.

The English and Afrikaans speakers and speakers of other languages (expatriates and those using English as a “replacement” language) could choose from the following language subject combinations:

English as a First Language and Afrikaans as a Second Language.

English as a First Language and French as a Second Language. This combination was taken by the expatriates, because they had not studied Afrikaans in primary school as South Africans had done. The “replacement” learners took Afrikaans as a second language.

All the L2 learners were Bantu mother-tongue speakers, most of whom were Tswana speakers. The L1 learners were a mixture of English mother-tongue speakers, Tswana mother-tongue speakers, and mother-tongue speakers of other languages. The latter consisted of (i) expatriates from other countries and (ii) South Africans who speak other South African non-Bantu languages such as Afrikaans and Gujarati. It was not always certain who among the L1 group (i.e. those who took English as a First Language at the School) were mother-tongue speakers of English, because some of them identified themselves as mother-tongue speakers of English and/or another language, e.g. Afrikaans, Sotho. However, there was little doubt that many in (ii) were using English as a “replacement” language. (I show why this is so at the end of section 4.8). It is, of course, possible to have more than one “mother” (or “father”, or “native”) tongue.

I now say something brief about the notion of mother tongue and relate it to the notions of native language and replacement language. (These notions are discussed in greater depth in section 6.2). The notion of native speaker is not a simple one. According to Paikeday, “the native speaker is dead!” Indeed, it is difficult to identify a native speaker, who is someone who apparently should have a more thorough knowledge of the language than the non-native speaker. When one adds the notions of mother-tongue speaker, first-language speaker and second/additional-language speaker to the pot, it becomes difficult to see the wood for the trees.

The fact that the Tswana learners had a choice between taking English as a First Language or as a Second Language, and that the “replacement” language learners had to take English First Language, becomes important in deciding how to categorise the different levels of proficiency. I explain the problem later on. (A detailed analysis of the sample follows shortly).

All language subjects were taught in separate classes. With regard to the other subjects, L1 and L2 learners were taught together through the medium of English in mixed classes, where each class contained about an equal proportion of L1 and L2 learners. For example, Grade 7 was divided into four classes. Each class bore the initial of the surname of the relevant class teacher. This four-class arrangement was maintained for the administration of the tests.

It was general practice at MHS that on entrance to the School, learners decided themselves whether they preferred to take English as a First Language subject or as a Second Language subject. As far as the battery of English proficiency tests was concerned, the subjects in the sample were requested to indicate on their protocols whether they had decided to take English First Language as a subject or English Second Language as a subject.

The results of the tests were not subsequently used by the School administration to make any changes to the choices the entrants had made regarding the English group (L1 or L2) they wanted to belong to. There were several possible reasons for this:

– The School did not wish to force the label “English Second Language” (L2) onto learners.

– Limited English proficiency learners might benefit in a class of high English proficiency learners, because the former might benefit from listening to a higher standard of English than their own.

– The School might not have been sure of the actual level of English proficiency of each individual entrant, even though it was aware that the level of English proficiency of disadvantaged entrants was generally low; once these entrants had become part of the School, it would have been possible to make more accurate judgements of their English proficiency.

– Finally, the School might have been reluctant to use the results until I had produced solid evidence that these tests were valid predictors of academic achievement.

It was not a simple matter to decide how to classify the subjects. The following variables had to be taken into account (the descriptions are specific to the sample):

(1) Some were or said they were English mother-tongue speakers, while others were mother-tongue speakers of Tswana and other languages.

(2) Some had English as the medium of instruction from Grade 1 (Bantu speakers and non-Bantu speakers), while some had English as the medium of instruction from Grade 5 (only Bantu speakers).

(3) All had the freedom to choose at the beginning of Grade 7 whether they wanted to take English First Language or English Second Language.

The problem is whether one can make a clear separation between these subjects that would indicate a difference in levels of English proficiency. An obvious division in theory is mother-tongue speaker/non-mother-tongue speaker, where mother-tongue English proficiency was regarded as the level of English to aspire to. Thus, when the essays for this study were marked they were judged in terms of mother-tongue proficiency, and so non-mother-tongue English speakers’ essays were not marked more leniently than those of mother-tongue English speakers. There were difficulties in deciding on the norms for the other tests, which were all previously standardised published tests, because it is only after a test has been performed on the test-bench that it is possible to decide whether it is too easy or too difficult. If there are mother-tongue speakers and non-mother-tongue speakers in the same sample, as in this study, one needs to consider whether the norms of the two kinds of speakers should be separated or interlinked. One can only do this if subjects have been precisely classified into mother-tongue/non-mother-tongue groups. This was not a simple matter in the sample, for the following reasons:

In the truly multicultural setting of MHS a composite of the following cultural-ethnic groups said that they were English mother-tongue speakers: Ghanaian, Sri Lankan, Indian (South African and expatriate) and Coloured. There was also a Greek, a South Sotho, and a Filipino who said that they had two mother tongues, one of them being English. Although all the above (N=18) obtained an English proficiency test score (in this case a composite of the cloze, dictation and essay tests) of 60% and over, there were also quite a number of Bantu mother-tongue speakers (N=10, mostly Tswana) who obtained a score of 60% and over. Further, there were five subjects who said that they were mother-tongue speakers but obtained scores between 50% and 55%. As I shall show later, such a score is not a good score for somebody claiming to be an English mother-tongue speaker. In the light of this evidence, it was difficult to tell from the results of the proficiency tests who were mother-tongue or native speakers of English. Although it was difficult in the sample to pinpoint mother-tongue English speakers, this does not mean that the notion of “native speaker” is a figment. (More about this in section 6.2). All I could specify about the sample in this regard was that it consisted of a wide range of English proficiency. (Recall that the labels L1 and L2 in the sample simply refer to those taking English First or Second Language as a subject, respectively).

When the English proficiency test scores were examined without resorting to summary statistics but merely sorted in ascending order, there was a clear distinction between the subjects who had chosen to do English as a First Language and English as a Second Language, respectively. These are called the L1 and L2 groups, respectively. In terms of the test results this meant that the L1 group was on average substantially better than the L2 group. (Six of the subjects who had decided to change from L1 to L2 in Grade 8 had obtained over 60% on the English proficiency test).
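The inspection just described can be pictured with a minimal Python sketch. The subject records and the unweighted composite below are hypothetical illustrations, not the study’s data or weighting: composite scores are simply sorted in ascending order and the 60% mark is flagged.

```python
# Minimal sketch (hypothetical data) of the classification check described above:
# composite proficiency scores are sorted in ascending order and inspected,
# rather than summarised, to see whether the self-chosen L1 and L2 groups separate.

from statistics import mean

# Hypothetical records: (subject_id, group_chosen, cloze %, dictation %, essay %)
subjects = [
    ("S01", "L1", 72, 68, 75),
    ("S02", "L2", 41, 38, 45),
    ("S03", "L1", 58, 62, 55),
    ("S04", "L2", 63, 60, 61),
    ("S05", "L2", 35, 40, 33),
]

def composite(cloze: float, dictation: float, essay: float) -> float:
    """Unweighted mean of the three proficiency tests, as a percentage (an assumption)."""
    return round(mean([cloze, dictation, essay]), 1)

scored = [(sid, grp, composite(c, d, e)) for sid, grp, c, d, e in subjects]

# Sort in ascending order of composite score and print, flagging the 60% mark.
for sid, grp, score in sorted(scored, key=lambda rec: rec[2]):
    flag = ">= 60%" if score >= 60 else ""
    print(f"{sid}  {grp}  {score:5.1f}  {flag}")
```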

Classifications of human groups, especially in the human sciences, are not expected to be objectively “precise”; what matters is that the parameters of the classifications be delineated and used consistently. In this study I use the labels “L1” and “L2” to refer to “high proficiency” and “low proficiency” indicated by English First Language and English Second Language, respectively.

Also important is English as a medium of teaching and learning. The tests in this study aimed “to select lexical and structural items relevant to the demands of the appropriate syllabuses”, i.e. relevant to English as the medium of instruction. What these tests aim at testing is directly related to the task-demands of academic study.

At MHS L1 learners and L2 learners are not treated as two separate groups, because both groups are taught in the same classroom – where English is used as the medium of instruction – and write the same examinations except for the language subjects, i.e. English, Tswana, Afrikaans and French. This has important implications for the interpretation of the results, where it might be argued that the L1 and L2 groups belong to separate populations and therefore cannot be grouped together in a correlational analysis. (See section 4.6 for further discussion of this issue). Tables 3.1 and 3.2 provide a detailed analysis of the sample.

The subjects originated from 36 different schools. The L1 subjects (N=49) originated from (i) CM Primary School (N=37), (ii) a “white” school (N=1), (iii) a “coloured” school (N=4), and (iv) several DET schools (N=7). (One Sri Lankan came from a DET school where his mother was a teacher).

The 37 L2 subjects originated from 28 DET schools, three church schools, a Coloured school and an Indian school.

Of the total sample of 86 subjects, there were 60 South African blacks, of whom 52 were Tswanas and eight were non-Tswanas. These eight non-Tswana South African blacks, like all the Tswana subjects, had to take Tswana as a first language.

TABLE 3.1 Detailed Analysis of the L1 Subjects


There were 10 TL1 subjects who changed from L1 English as a subject to L2 English as a subject in Grade 8 (January 1988). I did not have any information on why these changes were made. One plausible reason could have been that the School recommended to the learners concerned that it was in their best interest to change, because the change to L2 English as a subject at a later stage might have given them a better chance of passing English. Another plausible reason is that the learners decided themselves to make the change, because it was not necessary to take English First Language, since they already had Tswana as a First Language. Whatever the reason for the change to L2, the fact is that other members of the TL1 group who had obtained Grade 7 English achievement scores in the same range as those who changed to English Second Language did not make the change. For example, the Grade 7 English scores of the 10 L1 subjects who changed to L2 in Grade 8 were (in ascending order, in percentages) 50, 51, 53, 53, 55, 55, 58, 58, 61 and 63. (Most of these also had English proficiency scores in the 55% to 70% range). The Grade 7 English scores of five L1 subjects who did not change to L2 in Grade 8 were 45, 52, 53, 59, 62. As a matter of interest, in the first set, eight out of 10 obtained a matriculation exemption, while in the second set, two left the school after passing a grade, two obtained a matriculation exemption, and one failed before reaching Grade 12 and left the school.

The “replacement language” learners were required to take English First Language because they had no other First Language, while the Tswanas in the TL1 group could take Tswana as a First Language. The initial choice of language group (L1 or L2) at the beginning of Grade 7, as pointed out earlier, was voluntary.

Most of the L2 subjects were mother-tongue speakers of Tswana who came from rural or peri-urban schools where English was used on a limited scale in the classroom, hardly at all in the playground, and not at all at home, which was probably the reason for the low level of English proficiency of most of them. A few L2 subjects, however, did have a high level of English proficiency. There were no L2 subjects from CM Primary School, because all learners at this school had English as the medium of instruction from Grade 1. L2 subjects, even if they did very well in English Second Language, did not change to English First Language, probably because there was no need to complicate their lives unnecessarily. Five L2s obtained over 70% in Grade 7 English achievement. Four of these five passed Grade 12 and three of them obtained a matriculation exemption. A detailed analysis of the L2 group follows.

TABLE 3.2. Detailed Analysis of the L2 Subjects

As shown in Table 3.2, four of the black L2 subjects were non-Tswanas. These took Tswana as a first language at the School.

In sum, there are two groups in the sample: the L1 group (a composite of the Tswana L1 and the Non-Tswana L1 groups) and the L2 group. The L2 group is also referred to as the TL2 (Tswana L2) group, because 33 of the 37 L2 subjects are mother-tongue speakers of Tswana.

3.3 Structure and administration of the English proficiency tests

Subjects were divided into four groups and the tests were administered in four classrooms by four Grade 7 teachers, where each classroom contained a combination of L1 and L2 subjects. The time allocated for each test will be indicated in the description of the administration of the individual tests.

The possibility exists that fatigue resulting from a three-day test period may have affected the results of all the tests, but this seems unlikely because subjects were released from all lessons and from all school activities during this three-day period. Also, the test sessions were interspersed with ample rest periods. The structure and administration of the English proficiency tests follows. A theoretical overview precedes the description of each of the tests. Before I describe the tests I would like to point out that no sample of tests can adequately represent the vast variability of language, nor does it have to, “because of the generative nature of language which acts as its own creative source”. The controversy, as far as general language proficiency is concerned, is which sample of tests to use: the “reductionist” kind of tests used in this study or “holistic” (formal and informal) outside-of-the-classroom “cocktail party”, “tea-party”, “cooking club” type tests. In the academic context, what is important is the relationship between general, or overall, proficiency, communicative competence and academic achievement.

3.3.1 The cloze tests

3.3.1.1 Theoretical overview

Cloze tests are deceptively simple devices that have been constructed in so many ways for so many purposes that an overview of the entire scope of the literature on the subject is challenging to the imagination not to mention the memory.

(Oller, 1973:106)

Since 1973 the literature on cloze has more than doubled, adding even more challenges to the imagination if not – thanks to the printed word – to the memory.

The aim of a cloze test is to evaluate (1) readability and (2) reading comprehension. The origin of the cloze procedure is attributed to Taylor (1953), who used it as a tool for testing readability. Of all the formulas of readability that have been devised, cloze tests have been shown, according to Geyer (1968), Weintraub (1968) and Oller (1973:106), to be the best indicators of readability. The cloze test is also regarded as a valid test of reading comprehension. Oller (1973:106) cites Bormuth (1969:265), who found a multiple correlation coefficient of .93 between cloze tests and other linguistic variables that Bormuth used to assess the difficulty of several prose passages. Bormuth (1969:265) maintains that cloze tests “measure skills closely related or identical to those measured by conventional multiple choice reading comprehension tests.”

Many standardised reading tests use cloze tests, e.g. the Stanford Proficiency Reading Test. Johnson and Kin-Lin (1981:282) believe that cloze is more efficient and reliable than reading comprehension tests, because it is easier to evaluate and does not, as in many reading comprehension tests, depend on long written answers for evaluation. (But it is also possible to use multiple choice reading tests; see Bormuth [1969] in the previous paragraph). Johnson and Kin-Lin’s implication is that although cloze and reading comprehension are different methods of testing, they both tap reading processes. Anderson (1976:1), however, maintains that as there is no consensus on what reading tests actually measure, all that can be said about a reading test is that it measures reading ability. On the contrary, far more can be said about reading: notions associated with reading are “redundancy utilization” (Weaver & Kingston, 1963), “expectancies about syntax and semantics” (Goodman, 1969:82) and “grammar of expectancy” (Oller, 1973:113). All these terms connote a similar process, which involves the “pragmatic mapping” of linguistic structures into extralinguistic context (Oller, 1979:61). This mapping ability subsumes global comprehension of a passage, inferential ability, perception of causal relationships and deducing the meaning of words from context (Schank, 1982:61). According to Bachman (1982:61),

[t]here is now a considerable body of research providing sound evidence for the predictive validity of cloze test scores. Cloze tests have been found to be highly correlated with virtually every other type of language test, and with tests of nearly every language skill or component.

Clarke (1983), in support of Bachman, is cautiously optimistic that the cloze procedure has a good future in reading research. Alderson (1979:225), who is less optimistic, maintains that

individual cloze tests vary greatly as measures of EFL proficiency. Insofar as it is possible to generalise, however, the results show that cloze in general relates more to tests of grammar and vocabulary than to tests of reading comprehension.

Hughes (1981), Porter (1978) and Alderson (1979) found that individual cloze tests produce different results. Johnson and Kin-Lin (1981) and Oller (1979), contrary to Alderson (1979), found that a great variety of cloze tests correlates highly with tests such as dictation tests, essay tests and reading tests as well as with “low order” grammar tests.

The concept of closure is important in cloze theory. Alderson (1979:225) asks

whether the cloze is capable of measuring higher-order skills. The finding in Alderson (1978) that cloze seems to be based on a small amount of context, on average, suggests that the cloze is sentence – or indeed clause – bound, in which case one would expect a cloze test to be capable of measuring not higher-order skills, but rather much low-order skills…as a test, the cloze is largely confined to the immediate environment of a blank.

This means that there is no evidence that increases in context make it easier to complete items successfully. Oller (1976:354)  maintains, contrary to Alderson, that subjects “scored higher on cloze items embedded in longer contexts than on the same items embedded in shorter segments of prose”. Oller used five different cloze passages and obtained similar results on all of them.

Closure does not merely mean filling in items in a cloze test, but filling them in a way that reveals sensitivity to intersentential context, which measures “higher-order skills” (Alderson above). A cloze test that lacks sufficient closure would not be regarded as a good cloze test.

There are two basic methods of deletion: fixed deletion and rational (or “selective”) deletion. In the former, every nth word is deleted, where the deletion rate may range from every fifth word (considered to be the smallest gap permissible without making the recognition of context too difficult) to every ninth word. Pienaar’s (1984) tests, which are used in this study, are based on a rational deletion procedure.
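As a minimal illustration of the fixed-ratio method (not of Pienaar’s rational-deletion tests), the following Python sketch deletes every nth word after a short lead-in; the function name and parameters are hypothetical, and punctuation attached to a deleted word is removed with it in this simple version.

```python
# Sketch of the fixed-ratio ("every nth word") deletion procedure described above.
# The first few words are left intact as a lead-in; thereafter every nth word is
# replaced by a numbered blank. Rational deletion, by contrast, selects the gaps
# by hand on linguistic grounds.

def make_fixed_ratio_cloze(text: str, n: int = 7, lead_in: int = 5):
    """Return the mutilated passage and the answer key for an every-nth-word cloze."""
    words = text.split()
    blanks = {}            # blank number -> deleted word
    out = []
    blank_no = 0
    for i, word in enumerate(words):
        position = i - lead_in
        if position >= 0 and (position + 1) % n == 0:
            blank_no += 1
            blanks[blank_no] = word
            out.append(f"__({blank_no})__")
        else:
            out.append(word)
    return " ".join(out), blanks

passage = ("Tabitha was a well-bred Siamese lady who lived with a good family "
           "in a shiny white house on a hill overlooking the rest of the town.")
mutilated, key = make_fixed_ratio_cloze(passage, n=7)
print(mutilated)
print(key)
```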

Alderson (1980:59-60) proposes that the rational deletion procedure should not be referred to as a “cloze” but as a “gap-filling” procedure. Such a proposal has been accepted by some researchers, e.g. Weir (1993:81), but not accepted by others, e.g. Bachman’s (1985) “Performance on cloze tests with fixed-ratio and rational deletions”, Maclean’s (1984) “Using rational cloze for diagnostic testing in L1 and L2 reading” and Markham’s (1985) “The rational deletion cloze and global comprehension in German.” There is nothing wrong with the proposal that the rational-deletion procedure be called a gap-filling test, if it remains nothing more than that – a proposal.

Alderson (1979:226) suggests that what he calls cloze tests (namely, every-nth-word deletion) should be abandoned in favour of “the rational selection of deletions, based upon a theory of the nature of language and language processing”. Thus, although Alderson proposes that the rational selection of items (his “gap-filling”) should not be called a cloze procedure, he still favours “gap-filling” tests over “cloze” tests. This superiority of the rational deletion method is supported by a substantial body of research, e.g. Bachman (1982) and Clarke (1979). However, it should be kept in mind that language hangs together, and thus the every-nth-word cloze test is also, in my view, a good test of global proficiency. In other words, whether one uses “fixed” deletions or “rational” deletions, both methods test global proficiency.

Having considered the arguments for the validity of the cloze test as a test of reading, it seems that cloze tests are valid tests of reading strategies, i.e. they can test long-range contextual constraints. One must keep in mind, however, that deletion rates, ways of scoring (e.g. acceptable words or exact words) and the types of passages chosen, in terms of background knowledge and of discourse devices, may influence the way reading strategies are manifested. But it is debatable whether one should make too much of these differences.

3.3.1.2 The cloze tests used in the study

In a review of Pienaar’s pilot survey called “Reading for meaning”, Johanson refers to the “shocking[ly]” low reading levels in many schools in the North-West Province revealed by Pienaar’s survey.

Pienaar tested a variety of learners from different schools: learners whose (1) first language (i.e. mother tongue or the language the learner knows best) was English, (2) replacement language was English, (3) first language was Afrikaans, or (4) first language was a Bantu language (mostly Tswana speakers). Categories (1), (2) and (3) came from upper middle class families, while category (4) was split into two sub-categories: (4a) Bantu speakers who generally came from working class families and (4b) Bantu speakers who lived in sub-economic settlements in the environs of Mmabatho. Many of the parents of (4b) were illiterate or semi-literate and were either unemployed or semi-employed. (The category labels used by Pienaar are different from mine: I have changed them for clarity’s sake). Pienaar’s major finding was that 95% of learners in the North West Province (Grade 3 to Grade 12), most of whom belonged to category (4b), were “at risk”, i.e. they could not cope with the academic reading demands made on them.

There are four reasons why Pienaar’s cloze tests are used in this study:

(1) they have already been used in many schools in the North West Province and have produced a solid body of results, (2) their purpose is “to select lexical and structural items relevant to the demands of the appropriate syllabuses”, (3) Pienaar’s (1984) cloze tests are based on a rational deletion method, where it is possible to select gaps in such a way that closure, i.e. long-range constraints, is ensured, and (4) Pienaar’s data will be compared with the data in this study.

Pienaar’s (1984) tests comprise five graded levels – “Steps” 1 to 5, where each level consists of four short cloze passages (Form A to Form D) with 10 blanks in each passage (Pienaar 1984:41):

Step 1 corresponds to Grades 3 and 4 for English first language and to Grades 5 to 7 for English second language.

Step 2 corresponds to Grades 5 and 6 for first language and Grades 7 to 9 for second language.

Step 3 corresponds to Grades 7 and 8 for first language and to Grades 9 to 11 for second language.

Step 4 corresponds to Grades 9 and 10 for first language and to Grades 11 and 12 for second language.

Step 5 corresponds to Grades 11 and 12 for first language and to Grades 12 + for second language.

If one Step proves too easy or too difficult for a specific pupil, a higher or a lower Step could be administered. For example, if Step 2 is too difficult, the pupil can be tested on Step 1. In this way it is possible to establish the level of English proficiency for each individual pupil.

Because (1) many of the L1 group were not mother-tongue speakers of English and (2) I had to give the same test to both the L1 and L2 groups in order to make reliable comparisons, I used Step 2 for both groups.

I did not use the other Steps, because doing so was irrelevant to the purpose of the study, which was not to place learners in the level they belong to, i.e. for teaching purposes (which was the purpose of Pienaar’s tests), but to test for proficiency at the Grade 7 level, and in the process to test the tests themselves, i.e. to examine whether the passages chosen for the Step 2 level were valid for that level. Although annual predictions between English proficiency and academic achievement might yield higher correlations than long-term predictions, the aim is to investigate what chance Grade 7 learners who entered the School in 1987 would have of passing Grade 12.

According to Pienaar, a perfect score on a cloze test indicates that the pupil has fully mastered that particular level. A score below 50% would indicate that the learner is at risk.

Pienaar maintains that English second language learners are generally two to three years behind English first language learners in the acquisition of English proficiency, and there is often also a greater age range in the English second language classes, especially in the rural areas. Pienaar’s implication is that to be behind in English language proficiency is also to be behind in academic performance.

Pienaar standardised his tests in 1982 on 1068 final year JSTC (Junior Secondary Teacher’s Certificate) and PTC (Primary Teacher’s Certificate) students from nine colleges affiliated to the University of the Transkei. These standardised results became the table of norms for Pienaar’s tests (Pienaar 1984:9). Below are the weighted mean scores achieved by the students of the nine colleges (Pienaar 1984:10):

Weighted means: Step 1: 67%; Step 2: 53%; Step 3: 37%; Step 4: 31%; Step 5: 24%.

Most of the colleges performed similarly on all five Steps. These results confirmed the gradient of the difficulty of the various steps.
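For clarity, here is a brief Python sketch of how such a weighted mean is computed, with each college’s mean weighted by its number of students; the college figures below are hypothetical placeholders, not Pienaar’s data.

```python
# Sketch of a weighted mean such as Pienaar's per-Step figures: each college's
# mean score is weighted by the number of its students.

def weighted_mean(pairs):
    """pairs: iterable of (mean_score, number_of_students) tuples."""
    total_students = sum(n for _, n in pairs)
    return sum(score * n for score, n in pairs) / total_students

# Hypothetical Step 2 results: (college mean %, number of final-year students)
step2_results = [
    (58.0, 140),   # college A
    (49.5, 95),    # college B
    (52.0, 120),   # college C
]

print(f"Weighted mean for Step 2: {weighted_mean(step2_results):.1f}%")
```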

During 1983 Pienaar administered a major part of the test instrument to a smaller group of college students selected from the original large group. No significant difference between the scores of the two administrations was found, which confirmed the test-retest reliability of the instrument.

The tests underwent ongoing item analysis and refinement. By the time the final version was submitted to school learners in the Mmabatho/Mafikeng area in 1984, 30% of the items had been revised. As a result of continuous item analysis, a further 18% of the items were revised.

An important point is that these results claim to represent the reading ability of college students, who are supposed to be more proficient in English than school learners. However, the final-year student teachers only obtained scores of between 40% and 60% on Step 2 (see Pienaar’s mean scores above). (Step 2 has been used in this study for Grade 7 learners). These low scores indicate that the reading level of the student teachers, who were to start teaching the following year, was probably no higher than the level of many of the learners they would eventually teach. This alarming state of affairs would probably have had a detrimental effect on the academic performance of these learners.

Pienaar’s original tests had four passages for each level, and he tried to establish the equivalence in difficulty between the passages for each level. In this study I used two cloze passages for Step 2 instead of four. This was done for two reasons: (1) to see whether only two passages were sufficient, and (2) the other two passages, in their “unmutilated” form, were used for the dictation tests, because they belonged to the same level, namely Step 2. The question is whether two passages – in the cloze and dictation tests – were enough to ensure reliability and validity. The results of the tests (Chapter 4) deal with this question.

In the test battery I used Pienaar’s Form B and Form D passages of Step 2, as shown below:

Pienaar’s Practice exercise

(Pienaar does not provide the answers for this practice exercise. Possible answers are provided in brackets).

The 1 (rain) started falling from the sagging black 2 (clouds) towards evening. Soon it was falling in torrents. People driving home from work had to switch their 3 (headlights) on. Even then the 4 (cars, traffic) had to crawl through the lashing rain, while the lightning flashed and the 5 (thunder) roared.

Cloze passage 1: Form B Step 2 (Pienaar 1984:59):

A cat called Tabitha

Tabitha was a well-bred Siamese lady who lived with a good family in a shiny white house on a hill overlooking the rest of the town. There were three children in the family, and they all loved Tabitha as much 1 she loved them. Each night she curled up contentedly on the eldest girl’s eiderdown, where she stayed until morning. She had the best food a cat could possibly have: fish, raw red mince, and steak. Then, when she was thirsty, and because she was a proper Siamese and did 2 like milk, she lapped water from a blue china saucer.

Sometimes her mistress put her on a Cat show, and there she would sit in her cage on 3 black padded paws like a queen, her face and tail neat and smooth, her black ears pointed forward and her blue 4 aglow.

It was on one of these cat shows that she showed her mettle. The Judge had taken her 5 of her cage to judge her when a large black puppy ran into the hall. All the cats were furious and snarled 6 spat from their cages. But Tabitha leapt out of the judge’s arms and, with arched 7 and fur erect, ran towards the enemy.

The puppy 8 his tail and prepared to play. Tabitha growled, then, with blue eyes flashing, she sprang onto the puppy’s nose. Her 9 were razor-sharp, and the puppy yelped, shook her off, and dashed for the door. Tabitha then stalked back down the row of cages to where she had 10 the judge. She sat down in front of him and started to preen her whiskers as if to say, “Wait a minute while I fix myself up again before you judge me.” She was quite a cat, was Tabitha!

Answers. (The words in round brackets are Pienaar’s suggested alternative answers. The words in square brackets are alternative answers that I suggest):

1. as; 2. not; 3. her [four, soft]; 4. eyes (eye); 5. out; 6. and; 7. back (body); 8. wagged, twitched (waved, lifted); 9. claws (nails); 10. left (seen, met).

Item 9 could also be “teeth”. (There are cats who jump on to faces and bite, rather than scratch). Even in easy cloze passages, “acceptable” answers can be a problem.
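To make the scoring options concrete, here is a short Python sketch (with a hypothetical set of responses) contrasting the exact-word and acceptable-word scoring methods mentioned earlier, applied to Cloze passage 1; the acceptable sets combine Pienaar’s answers, his alternatives, and the extra alternatives noted above (e.g. “teeth” for item 9).

```python
# Sketch of "exact word" versus "acceptable word" scoring for Cloze passage 1.

EXACT = {1: "as", 2: "not", 3: "her", 4: "eyes", 5: "out",
         6: "and", 7: "back", 8: "wagged", 9: "claws", 10: "left"}

ACCEPTABLE = {
    1: {"as"}, 2: {"not"}, 3: {"her", "four", "soft"}, 4: {"eyes", "eye"},
    5: {"out"}, 6: {"and"}, 7: {"back", "body"},
    8: {"wagged", "twitched", "waved", "lifted"},
    9: {"claws", "nails", "teeth"}, 10: {"left", "seen", "met"},
}

def score(responses, key_exact=EXACT, key_acceptable=ACCEPTABLE):
    """Return (exact-word score, acceptable-word score) as percentages."""
    exact = sum(1 for item, word in responses.items()
                if word.strip().lower() == key_exact[item])
    acceptable = sum(1 for item, word in responses.items()
                     if word.strip().lower() in key_acceptable[item])
    n = len(key_exact)
    return 100 * exact / n, 100 * acceptable / n

# Hypothetical test-taker responses:
responses = {1: "as", 2: "not", 3: "four", 4: "eyes", 5: "out",
             6: "and", 7: "body", 8: "shook", 9: "teeth", 10: "left"}
print(score(responses))   # -> (60.0, 90.0)
```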

Cloze passage 2: Form D Step 2 (Pienaar 1984:61):

A dog of my own

When I was ten all 1 wanted was a dog of my own. I yearned for a fluffy, fat, brown and white collie puppy. We already had two old dogs, but my best friend’s pet collie had 2 had seven fluffy, fat, brown and white puppies, and I longed for one with all my heart. However, my mother said no, so the seven puppies were all sold. I had horses, mice, chickens and guinea-pigs, and as my 3 said, I loved them all, but I wasn’t so keen on finding them food. Since she had five children to look after, it made her angry to 4 hungry animals calling, so she said crossly, “No more dogs.”

This didn’t stop me wanting one though, and I drew pictures of collie dogs, giving 5 all names, and left them lying around where she would find them. As it was 6 Christmas, I was sure that she would relent and give me a puppy for Christmas.

On Christmas morning I woke up very excited, 7 the soft little sleepy bundle that I wanted at the bottom of the bed wasn’t there. My mother had given me a book instead. I was so disappointed that I cried to myself, yet I tried not to 8 her how sad I was. But of course she noticed.

Soon after that my father went off to visit his brother and when he came back he brought me a puppy. Although it 9 a collie it was podgy and fluffy, and I loved him at once. My mother saw that I looked after him properly and he grew up into a beautiful grey Alsatian. We were good friends for eleven happy 10 before he went to join his friends in the Animals’ Happy Hunting Ground.

Answers.

1. I; 2. just, recently; 3. mother (mummy, mum, mom); 4. hear; 5. them; 6. near (nearly, nearer, close to); 7. but, however (though); 8. show (tell); 9. wasn’t (was not); 10. years.

Pienaar allotted six minutes for each passage. I allotted 12 minutes. Ability is dependent on speed of processing, so if a test-taker does badly within an allotted time of six minutes, the results would perhaps not be significantly better with double that time. However, the effect of time on performance is not easy to gauge.

3.3.2 The essay tests

3.3.2.1 Theoretical overview

Language processing involves various components such as linguistic knowledge, content knowledge, organisation of ideas and cultural background. All these factors mesh together into a proficiency network of vast complexity, which makes objective evaluation of essay performance very difficult. It is this vast complexity that makes written discourse, or essay writing, the most pragmatic of writing tasks and the main goal of formal education.

Essay writing is arguably the most complex of human tasks. If getting the better of words in writing is usually a very hard struggle for mother-tongue speakers, the difficulties are multiplied for second language learners, especially for disadvantaged learners such as those described in this study. Many of the disadvantaged Grade 7 subjects are similar to young mother-tongue speakers of English first learning to write, in that much mental energy is expended on attention to linguistic features rather than to content.

What makes essay writing a pragmatic task, in contradistinction to tasks at the sentence level, is that essay writing involves writing beyond the sentence level. This does not mean that non-pragmatic tasks are not integrative. As discussed in section 2.5, all language resides along a continuum of integrativeness, where pragmatic tasks are the most integrative.

Because the production of linguistic sequences in essay writing is not highly constrained, problems of reliability arise in essay scoring. (In this respect, essay tests have much in common with oral tests). One problem is that inferential judgements have to be converted into a score: “[h]ow can essays or other writing tasks be converted to numbers that will yield meaningful variance between learners?” (Oller, 1979). Oller argues that these inferential judgements should be based on intended meaning and not merely on correct structural forms. That is why in essay rating, raters should rewrite (in their minds, but preferably on paper) the intended meaning. Perhaps one can only have an absolutely objective scoring system with lower-order skills (Allen in Yeld). Oller, however, does not claim that his scoring system is absolutely objective, only that, as far as psychometric measurement goes, it is a sensible method for assessing an individual’s level within a group.

Whatever one’s paradigm, structural (“old”) or communicative (“new”), when one marks an essay one can only do so through its structure. The paradox of language is that the structure must “die” so that the meaning may live; yet, if structure is not preserved, language would not be able to mean.

In the normal teaching situation, marking is done by one rater, namely the teacher who teaches the subject. Sometimes if a test is a crucial one, for example an entrance test or an end-of-year examination, more than one rater, usually two, are used. In a research situation, the number of raters depends on the nature of the research and the availability and proficiency of raters. The raters used for the essay tests in this study were all mother-tongue speakers of English, and recognised as such by their colleagues. (In the dictation tests there were three English mother-tongue presenters and one non-mother-tongue presenter).

With regard to the level of English proficiency of raters, it does not follow that because a rater (or anybody else) is not a mother-tongue speaker (of English in this case), his or her English proficiency is necessarily lower than that of a mother-tongue speaker of English. In the academic context, there are many non-mother-tongue speakers of English who have a higher level of academic English proficiency (CALP) than mother-tongue speakers of English. A major reason for this is not linguistic: these non-mother-tongue speakers are often more academically able, i.e. they have better problem-solving abilities and abilities for learning content.

Kaczmarek (1980) reports high correlations between essay raters, while Hartog and Rhodes (1936) and Pilliner (1968), contrary to Kaczmarek, found essay tests to have low interrater and low intrarater reliability. With regard to scoring procedures in essay testing, Mullen (1980:161) recommends the use of four “scales” (criteria) of writing proficiency: structure, organisation, quantity and vocabulary. According to Mullen (1980), a combination of all four scales is required to predict proficiency validly. A major issue in scoring is whether marks should be allocated separately to each of the criteria, i.e. whether one should use an analytic scoring procedure, or whether marks should be allocated globally. Global scoring usually refers to two ways of scoring: (1) “overall impressions” and (2) “major constituents of meaning”, which takes into account global errors, e.g. cohesion and coherence, but not local errors, e.g. grammar and spelling.

The following terms are used interchangeably in the literature: global rating, overall impressions, holistic scoring and global impressions (specific authors will be mentioned). The term holistic scoring is used by Perkins to refer to overall impressions, which takes into account global as well as local errors. With regard to the two ways of global scoring mentioned in the previous paragraph, it is possible that a rater’s “overall impressions” may include quick, yet thorough attention to major constituents of meaning as well as to local errors. In such a case the distinction within global scoring between “overall impressions” and “major constituents of meaning” would no longer be very useful.

With regard to the relative reliability of analytic and global scoring procedures, Kaczmarek and Ingram have shown that analytic scoring procedures are not more reliable than global scoring. According to Perkins, “[w]here there is commitment and time to do the work required to achieve reliability of judgement, holistic evaluation of writing remains the most valid and direct means of rank-ordering students by writing ability.”

Zughoul and Kambal (1983:100) also report “no significant difference between the two methods”, and Omaggio (1986:263) maintains that “holistic scoring has the highest validity” (reliability?).

According to Oller (1979:392), “judges always seem to be evaluating communicative effectiveness regardless whether they are trying to gauge ‘fluency’, ‘accentedness’, ‘nativeness’, ‘grammar’, ‘vocabulary’, ‘content’, ‘comprehension’, or whatever”. Even if one does this quickly, say a minute per page, the brain of the rater still has to consider the trees to get an overall idea of the wood. The greater the experience and competence of the rater, the more unconscious and quicker, but no less rational, is the judgement.

It is arguable whether judges always seem to be evaluating communicative effectiveness, as Oller believes. Although it seems reasonable that in essays one should be looking at the overall impact of a piece of writing (the whole) and that the only way to do this is to look at the various aspects of the writing such as those mentioned by Oller, I would question whether raters do regard communicative effectiveness as the overarching criterion. Unfortunately, I did not manage to obtain the judgements of the raters of the MHS essay protocols and so could not investigate this important issue. I did, however, at a later stage, use some of the MHS protocols with a different set of raters to verify whether Oller’s cautious observation was reasonable. (More about this in section 4.8.1).

If global or holistic scoring is more effective or even as effective as analytic scoring, then for reasons of economy global scoring should be used. However, global scoring ideally requires at least three raters (Ingram 1968:96), who would presumably, and hopefully, balance one another out. The effectiveness of global scoring depends on factors such as availability, willingness and qualifications of raters.

The unavailability of raters is often a problem. In special circumstances such as proficiency tests used for purposes of admission or placement at the beginning of an academic year, it may be possible to obtain the help of three or four raters. However, in the normal testing situation during the school year, only one rater may be available, who is usually the teacher involved in teaching the subject. Hughes (1989:87) recommends four raters because this has been shown to be the best number. Four raters were used in this study.

To ensure high interrater reliability there should be only a narrow range of scores and judgements between raters. If three or four raters are considered to be required for reliability, a serious problem is what to do in the normal educational situation, where at most two, and usually only one, rater is available. I shall discuss this issue in the final chapter, where “improving testing in South Africa” is dealt with.

3.3.2.2 The essay tests used in the study

There were two essays:

Essay 1: Everybody in this world has been frightened at one time or another. Describe a time when you were frightened. Write between 80 and 100 words.

Essay 2: Do (a) or (b) or (c). Do only one of the following topics. Don’t forget to write the letter (a) or (b) or (c) next to the topic you choose. The topic you choose must not be shorter than 80 words and not longer than 100 words.

(a) Describe how a cup of tea is made.

(b) Describe how shoes are cleaned.

(c) Describe how a school book is covered.

Both the L1 and L2 groups’ essays were judged in terms of English mother-tongue proficiency.

The TOEFL Test of Written English (Hughes 1989:86) recommends spending a rapid one and a half minutes per page using a holistic scoring method. I would imagine that when working at such a speed, the scoring criteria are assumed to be known to the point of automaticity. Raters in this study were recommended to spend about one and a half minutes on each protocol, where protocols were much shorter than a page in length.

As I discussed earlier, clarity and consistency of judgements are difficult to ensure. The TOEFL scoring method seems to be the same as the “overall impressions” approach of Perkins (1983), which takes into account global as well as local errors.

Four raters, including myself, rated the 86 protocols. The other three raters were Grade 7 teachers at the School, the same teachers who had also participated in the administration of the test battery. As I mentioned earlier (section 3.3.2.1), these three raters were all mother-tongue speakers of English, and were recognised as such by their colleagues, of whom I was one. Each rater, in turn, was given the original 86 protocols and was requested to give an impressionistic score based on such considerations as topic relevance, content and grammatical accuracy.

Raters did not provide any written judgements, only scores. The reason for this was that these raters, who were also the Grade 7 teachers at the School, were fully involved in the three-day administration of the test battery; accordingly, I did not want to overload them with too much extra work after the three days were over, because they had to return to a full teaching load. Thus, they merely gave a score based on a global impression. It would have been useful to compare raters’ scores and judgements, because this would have provided insights into the knotty problem of interrater reliability. I did manage at a later stage to obtain data on the same essay test (given to the Grade 7 subjects in this study) from a workshop on language testing that I conducted (Gamaroff 1996c, 1998c). (More about this in section 4.8.1).

Bridgeman (1991:9) recommends that each rater assign a holistic score on a six-point scale, where zero is given if the essay is totally off the topic or unreadable. A nine-point scale was used in this study, ranging from Scale 1 (0 to 1 point = totally incomprehensible) to Scale 9 (9 to 10 points = outstanding). The points were converted to percentages.

Raters did not record their scores on the protocols but were each provided with a copy of the list of the names of subjects on which they had to record their scores. Raters were requested not to consult one another on the procedures they used to evaluate the protocols. The results, as with all the tests in the study, are presented in the next chapter.
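A brief Python sketch (with hypothetical rater names and scores, since the study’s data appear only in the next chapter) of how four raters’ percentage scores might be combined per subject and how interrater agreement might be checked with pairwise Pearson correlations:

```python
# Sketch of combining four raters' global-impression scores and checking
# interrater agreement with pairwise Pearson correlations. All data hypothetical.

from itertools import combinations
from statistics import mean, stdev

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Each list: one percentage score per protocol, in the same (subject) order.
ratings = {
    "rater1": [55, 70, 40, 85, 62, 30],
    "rater2": [60, 68, 45, 80, 58, 35],
    "rater3": [50, 75, 38, 90, 65, 28],
    "rater4": [58, 72, 42, 82, 60, 33],
}

# Pairwise interrater correlations.
for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
    print(f"{name_a} vs {name_b}: r = {pearson(a, b):.2f}")

# One possible final essay mark per subject: the mean of the four raters' scores.
final = [round(mean(scores), 1) for scores in zip(*ratings.values())]
print("Final essay scores:", final)
```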

3.3.3  Error recognition and mixed grammar tests

3.3.3.1 Theoretical overview

Grammar is an important component in most standardised test batteries, e.g. English as a Foreign Language (EFL), English Language Testing Service (ELTS) and Test of English as a Foreign Language (TOEFL) (O’Dell 1986). Error recognition has been used in various studies on language proficiency testing. Olshtain, Shohamy, Kemp and Chatow (1990:31) use it as part of a battery of first language proficiency tests to predict second language proficiency. The emphasis in Olshtain et al. is on appropriacy, i.e. language use, and not on acceptability, i.e. grammatical correctness. In contrast to Olshtain et al.’s aim of trying to find a connection between error recognition and language use, Irvine, Atai and Oller (1974:247) use the multiple-choice error recognition test from the TOEFL battery of tests to find out whether integrative tests such as cloze tests and dictation tests correlate more highly with each other than they do with the multiple-choice tests of TOEFL.

Henning, Ghawaby, Saadalla, El-Rifai, Hannallah and Mattar’s (1981) revised GSCE (Egyptian General Secondary Certificate Examination) test battery contains an error identification test and a grammar accuracy test. Henning et al. (1981) found that the highest correlation with their “composition” subtest was with Error identification (.76). They accordingly maintain that “Error Identification may serve as an indirect measure of composition writing ability.” (Henning et al. 1981:462).

A grammar component has always featured prominently in all the standardised English first and second language proficiency and achievement tests of the Human Sciences Research Council (HSRC). The recent tests of the HSRC range across various school levels from junior secondary school to senior secondary school (Barry, Cahill, Chamberlain, Lowe, Reinecke and Roux, undated).

3.3.3.2 Error recognition and mixed grammar tests used in the study

The error recognition test and the mixed grammar test in this study are both multiple choice tests that have been designed for learners who have completed the “elementary” stage (Bloor et al. 1970) of second/foreign language learning. These two tests each comprise 50 items. Bloor et al. (1970:ix – Teacher’s book) state that the tests were analysed by the authors over an extended period, subjected to an item analysis, and validated under test conditions. No data on this item analysis or validation under test conditions were provided by the authors, probably because that kind of data would not appear in a Teacher’s book. I shall show (Chapter 4) that their tests have high (split-half) reliability and high correlations with other test methods.
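As a reminder of what a split-half reliability estimate involves, here is a minimal Python sketch assuming an odd-even split of the items and the Spearman-Brown step-up; the item responses are hypothetical and the example test has only 10 items for brevity (the actual estimates for the 50-item tests are reported in Chapter 4).

```python
# Sketch of a split-half reliability estimate: odd and even items are scored
# separately, the two half-scores are correlated across test-takers, and the
# correlation is stepped up with the Spearman-Brown formula.

from statistics import mean, stdev

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def split_half_reliability(item_matrix):
    """item_matrix: one list of 0/1 item scores per test-taker (odd-even split)."""
    odd = [sum(row[0::2]) for row in item_matrix]
    even = [sum(row[1::2]) for row in item_matrix]
    r_half = pearson(odd, even)
    # Spearman-Brown correction for the full test length.
    return (2 * r_half) / (1 + r_half)

# Hypothetical responses of five test-takers to a 10-item test (1 = correct).
item_matrix = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
]
print(f"Split-half reliability: {split_half_reliability(item_matrix):.2f}")
```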

Bloor et al. (1970) divide their tests into three levels: First Stage/Elementary Stage, Second Stage/Intermediate Stage, and Third Stage/Advanced Stage. Although the levels have been designated relative to one another, these are merely guidelines, and therefore the tester must use discretion in fitting the level to the relevant group of students. I have used the First Stage level for the Grade 7 subjects.

The tests were administered at the same sitting over a one-and-a-half-hour period, the emphasis being on completing the task and not on speed. It is true, however, that ability is dependent on speed of processing. Sample items from the error recognition and mixed grammar tests are now provided.

Error recognition test: Sample items

The test used was Test 1 from Bloor et al. (1970:70-77; Book 2).

Instructions: In some of the following sentences there are mistakes. (There are no mistakes in spelling and punctuation). Indicate in which section of the sentence the mistake occurs by writing its letter on your answer sheet. If there is no mistake, write E.

Example: Although he has lived in England (A) / since he was fifteen, (B) / he still speaks English (C) / much badly (D). Correct – E.

Answer: D.

Item 8. Both Samuel and I (A) / are much more richer (B) / than we (C) / used to be (D). Correct – E.

Answer: B.

Item 19. Some believe that (A) / a country should be ruled (B) / by men who are (C) / too clever than ordinary people (D). Correct – E.

Answer: D.

Item 25. His uncle is owning (A) / no fewer than ten houses, (B) / and all of them (C) / are let at very high rents (D). Correct – E.

Answer: A.

Item 27. As I have now studied (A) / French for over three years (B) / I can be able to (C) / make myself understood when I go to France (D). Correct – E.

Answer: C.

Mixed Grammar test: Sample items

The test used is “First Stage: Test 2” in Bloor et al. (1970:35-40; Book 1). (By “mixed” is meant that a variety of grammatical structures is tested).

The test consists of choosing the correct alternative that fits into the gap within a sentence. The following are the instructions and an example from the test, followed by five selected items from the test:

Instructions: Choose the correct alternative and write its letter on your answer sheet.

Example: His sister is….than his wife. A) more prettier B) prettier C) very pretty D) most pretty. Answer: B.

Item 1. They often discuss…. A) with me B) about whether there is a problem C) the problem D) about the problem with me. Answer: C.

Item 28. This dress was made… A) by hands B) by hand C) with hands D) with hand.

Answer: B.

Item 30. When the door-bell…., I was having a bath. A) rang B) rings C) rung D) ringed. Answer: A.

Item 38. My friend always goes home….foot. A) by B) with C) a D) on. Answer: D.

Item 50. We….our meat from that shop nowadays. A) were never buying B) do never buy C) never buy D) never bought. Answer: C.

The mixed grammar tests and error recognition tests of this study (Bloor et al. 1970) are commonly used tests. Compare these test items with equivalent test items from the Egyptian study of Henning et al. (1981). Consider the following two items from their test battery:

Grammar Accuracy

Ahmed enjoys….us. A. helping B. to help C. help to D. helping to

The item requires the selection of one of the four options. This test has the same format as Bloor et al.’s (1970) mixed grammar test.

Error Identification

(A) In my way to school, / (B) I met a man / (C) who told me that / (D) the school was on fire.

One has to choose the incorrect segment in the sentence, in this case segment A. This format is almost identical to that of Bloor et al.’s (1970) error recognition test. The difference is that Bloor et al.’s test has five options, which makes it more difficult.

I judged Bloor et al.’s “elementary stage” to be appropriate for the Grade 7 subjects. If this judgement is correct, it suggests that the standard of grammatical knowledge required of Egyptian university entrants (Henning et al. 1981) is very similar to the standard required of Grade 7 non-mother-tongue speakers of English, which is quite alarming. Further, as mentioned, Henning et al.’s Error Identification test is easier than Bloor et al.’s, because it has only four options whereas Bloor et al.’s error recognition test has five: the greater the number of options, the smaller the chance of guessing correctly (one in five rather than one in four).

3.3.4 The dictation tests

3.3.4.1 Introduction

Listening comprehension “poses one of the greatest difficulties for learners of English” (Suenobu et al. 1986:239). This section examines language proficiency through the dictation test, which is the most demanding type of listening comprehension test, because it forces the test-taker to focus on structure as well as meaning.

3.3.4.2 Theoretical overview

Language tests involve one or various combinations of the four language modes, namely listening, speaking, reading and writing. Although the most important combination in education, except for the early school years, is usually reading-writing (Lado 1961:28), the listening-writing combination, which is what a dictation test measures, also plays an essential role. The dictation test is a variation of a listening comprehension test where subjects write down verbatim what they listen to.

Some authors regard the dictation test merely as a test of spelling or of grammar (Lado 1961; Froome 1970:30-31). For Lado (1961) dictation was only useful as a test of spelling because dictation, he argued, did not test word order or vocabulary, both of which were already given; neither did it test aural comprehension, because the learner could often guess from the context. For protagonists of the audio-lingual method the dictation test was considered to be a “hybrid test measuring too many different language features, mixing listening and writing, and giving only imprecise information” (Tönnes-Schnier and Scheibner-Herzig 1988:35). Savignon (1983:264) maintains that dictation does not test communicative proficiency. The reasons for its popularity, she suggests, are that it has high concurrent validity, is easy to develop and score, and has high reliability.

Contrary to these negative views, other authors regard dictation as a valid test of “pragmatic” language (Oller 1979:42, 1983; Bacheller 1980:67; Cziko 1982; Larsen-Freeman 1987; Bott and Satithyudhakarn 1986:40; Tönnes-Schnier and Scheibner-Herzig 1988). Dictation for these authors is a robust test of the ability to reconstruct surface forms to express meaning at the sentence level and beyond. For Oller (1976:61) and Tönnes-Schnier and Scheibner-Herzig (1988:37-38), dictation tests are valid measures of communicative proficiency. Spelling, which for Lado was the dictation test’s only justification, is disregarded by some of these authors (e.g. Oller 1979:278; Bacheller 1980:69).

There is substantial evidence from interdisciplinary co-operation between research in linguistic pragmatics and research in reading comprehension that reading and listening employ the same underlying processing strategies (Horowitz and Samuels 1987; Hoover and Gough 1990; Vellutino et al. 1991). Vellutino et al. (1991:107, 124) found significant correlations between reading and listening comprehension in young children and in adults. Of the four language skills, reading performance was found to be the best predictor of listening performance, and vice versa. (For different views on the relationship between listening and reading see Atkin, Bray, Davidson, Herzberger, Humphreys and Selzer 1977a, 1977b; Jackson and McClelland 1979; Carroll 1993:179-80.)

Cloze and dictation have been found to reveal similar production errors in writing (Oller 1976:287ff; Oller 1979:57), and a combination of cloze tests and dictation tests has been used effectively in determining general language proficiency (Stump 1978; Hinofotis 1980). Oller (1979:61) maintains that all pragmatic tasks such as dictation tests probe the same underlying skill. The reason why a dictation test and a cloze test (apparently such different tasks) intercorrelate so strongly is that both are effective devices for assessing the efficiency of the learner’s developing grammatical system, or language ability, or pragmatic expectancy grammar. This underlying skill is overall, or general, language proficiency. Spolsky (1989:72) describes the overall proficiency claim in the following “necessary” condition: “As a result of its systematicity, the existence of redundancy, and the overlap in the usefulness of structural items, knowledge of a language may be characterized as a general proficiency and measured.”

Tönnes-Schnier and Scheibner-Herzig (1988) compared Oller’s method and Bacheller’s (1980:71) “scale of communicative effectiveness” – both of which distinguish between spelling errors and “real” errors – with the relatively much simpler traditional method, in which different kinds of errors are not distinguished. Tönnes-Schnier and Scheibner-Herzig (1988:38) maintain that “this simplified way of marking errors in dictations chosen in accord with the learners’ level of structures and vocabulary proves an effective way to rank a class of learners according to their communicative capacities.” The fact that a superficial and reductionist method such as the traditional method of counting surface errors can rank learners according to their communicative capacities shows that reductionist methods of testing can predict “pragmatic” language, i.e. language that straddles sentences. (Analogously, an eye involved in an eye test is no less alive looking at letters on an optician’s screen than reading a book or looking at the sunset.) Recall (section 2.5) Spolsky’s (1989:61) suggestion that the “microlevel” is “in essence” the “working level of language, for items are added one at a time”, keeping in mind that “any new item added may lead to a reorganisation of the existing system, and that items learnt contribute in crucial, but difficult to define ways to the development of functional and general proficiency.” The micro-level is no less alive than the macro-level, just as an eye in a socket is no less alive than the brain that turns sensation into perception. Accordingly we can talk about a symbiotic relationship between “reductive”, or “discrete”, micro-levels and “holistic”, or “pragmatic”, macro-levels.

For Ur (1996:40), dictation “mainly tests spelling, perhaps punctuation, and perhaps surprisingly, on the face of it, listening comprehension”. Tönnes-Schnier and Scheibner-Herzig’s (1988) “surface” (discrete?) findings expose our “abyss of ignorance” (Alderson 1983:90). Constructs may not be lurking beneath the surface after all, but staring us in the face; or, more accurately, lurking beneath the surface and staring us in the face. The German term aufheben (sublation) illustrates the paradox. The term means “to clear away” as well as “to preserve”: the simultaneous preservation and transcendence of the structure/meaning antithesis. Language (i.e. language structure) has to be cleared away and preserved in order to convey its meaning. Coe (1987:13) uses sublation to explain the paradox of structural preservation and transcendence. According to Scholes (1988), however, sublation remains inconsequential from a practical point of view. It is certainly not inconsequential from the practical point of view of language testing.

3.3.4.3 The dictation tests used in the study

Excerpts from two restored cloze tests of Step 2 of Pienaar’s (1984) “Reading for meaning” were used. Step 2 corresponds to Grades 5 and 6 for English mother-tongue speakers, and to Grades 7 to 9 for English non-mother-tongue speakers. These dictation passages were different from the passages used for the cloze test (Forms B and D); for the dictation tests I used the restored texts of Forms A and C of Pienaar’s Step 2. Thus all four passages – two for the cloze test and two for the dictation test – belong to the same level.

I judged the conceptual difficulty of the word sequences to be within the range of academic abilities required of Grade 7 learners who have to use English as the medium of instruction. I decided to pitch the dictation test at the L2 level and not at the L1 level, because I suspected that Step 3, which was meant for the Grade 7 L1 level, would be too difficult for the Grade 7 L2 subjects. Accordingly, I used the passages of Step 2, which were aimed at the Grade 7 L2 level.

Test 1

The fire

We were returning from a picnic up the river when the fire-engine raced past us. Of course we followed it. We hadn’t gone far when we saw black smoke pouring from an old double-storey house in the high street. When we drew nearer we saw angry tongues of flame leaping from the downstairs windows. There was already a curious crowd watching the fire, and we heard people say that there was a sick child in one of the upstairs bedrooms. A black cat was also mentioned.

(86 words)

Test 2

A close call

It was early evening and we were driving at a steady ninety when a small buck leapt into the road about a hundred metres ahead of us. At the last moment it swerved and ran directly towards us. I flicked on the headlights and swerved at the same time. The car slithered to a halt in a cloud of dust, and it was only then that we saw why the buck had changed direction. A number of sinister shapes were hard on the Duiker’s* heels. Wild dogs!

(87 words)

*Duiker is a South African species of small buck.

The reason for the choice of these dictation passages was the same as Pienaar’s for the choice of his cloze passages (see section 3.3.1.2), which was “to select lexical and structural items relevant to the demands of the appropriate syllabuses” (Pienaar 1984:3), i.e. relevant to English as the medium of instruction. However, the relevant demands of the appropriate syllabuses cannot be separated from general language proficiency, which is often the hardest part of learning English for ESL learners (Hutchinson and Waters 1987).

3.3.4.4 Presentation of the dictation tests

The following procedures were used:

1. The degree of difficulty of the texts was regulated by controlling factors such as speed of delivery and the length of segments between pauses. The text was read at a speed that preserved the integrative nature of the sequences, while catering for subjects who might not have been able to write at the required speed. The length of the sequences between pauses was also controlled, so as to satisfy the requirements of both mechanical writing speed and speed of information-processing.

2. The background noise level was kept to a minimum.

3. The text was presented three times: once straight through, which involved listening only; a second time with pauses; and a third time without pauses, but at a speed that allowed for quick corrections.

4. And very importantly for this study, more than one presenter was used. This procedure is explained in the next section.

It is normal procedure in a dictation test to use one presenter for all subjects – in this case all four groups. It has been argued that a “dictation can only be fair to students if it is presented in the same way to them all” (Alderson, Clapham and Wall 1995:57), i.e. using only one presenter: an old procedure for an “old” method/test.

In this study I used “old” tests, but the procedure of presentation was new. Using indirect tests such as dictation does not mean that one has to stick to “old” procedures; one can still be exploratory.

The normal procedure in a dictation test is to use one presenter, even when subjects are split up into different venues/classrooms. Owing to the exploratory nature of the dictation tests, however, four presenters (including myself) were used. The presenters worked on a rotational basis, so that each of them presented the two dictation tests to all four groups. The dictation scores used in this study were the scores of each presenter’s first presentation; thus no scores from dictations that the subjects had heard more than once were used in the statistical analysis. Table 3.3 shows the procedure of the first rotational presentation.

TABLE 3.3

Presentation of Dictation with Four Presenters: First Presentation of Each Presenter

The reason I chose such a design, instead of the usual and simpler option of using one presenter, was that I wanted to investigate whether different presenters, i.e. different methods of presentation, would have any significant effect on the results. Three of the presenters were English mother-tongue speakers, and one was a Tswana mother-tongue speaker. In order to test for any significant difference between the means of the four presenters’ results, I did an analysis of variance (one-way ANOVA). The ANOVA results are presented and discussed in section 4.1. An ANOVA was used because I had to test whether there was any significant difference between the results of the four procedures of administration: each presenter’s presentation represents a different procedure. An ANOVA deals with the four sets of data simultaneously, whereas a t-test can only deal with two at a time.

The results of an ANOVA are more reliable than the results of the six t-tests (AB, AC, AD, BC, BD, CD) that would be required if the four sets of data (the results of the four different presentations) were submitted to pairwise t-tests. The more t-tests one runs, the less reliable the results: “If means are to be cross-compared, you cannot use a t-test” (Hatch and Farhady 1982:114). The cross-comparison, or what Brown (1988:170) calls “multiple t-tests”, may be between more than two groups, or may involve more than one test on the same two groups. Brown (1988:170) gives the following example of this pitfall:

[A] researcher might want to compare the placement means, including four sets of subtest scores, of the males and females at a particular institution. In such a situation, it is tempting to make the comparisons between the means for males and females on each of the four subtests with four different t tests. Unfortunately, doing so makes the results difficult to interpret. Yet I have seen studies in our literature that make 2, 3, 4, 5, and more comparisons of means on the same groups of test scores.

(See Hatch and Farhady [1982:114], who provide the statistics showing why multiple t-tests inflate the likelihood of rejecting the null hypothesis by chance, i.e. of committing a Type I error.)
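To make the contrast concrete, the following is a minimal sketch, with invented scores and group sizes, of the kind of comparison described above: a single one-way ANOVA across the four presenters’ groups rather than six pairwise t-tests. The data and group sizes are hypothetical; only the choice of test follows the rationale given above (the actual results are reported in section 4.1).

from scipy.stats import f_oneway

# Hypothetical dictation scores (out of a possible 20) for the four groups,
# one group per presenter (A, B, C and D).
group_a = [15, 17, 14, 18, 16]
group_b = [14, 16, 15, 17, 15]
group_c = [16, 18, 15, 17, 16]
group_d = [15, 16, 14, 18, 17]

# One test across all four sets of data simultaneously.
f_stat, p_value = f_oneway(group_a, group_b, group_c, group_d)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# By contrast, six separate t-tests (AB, AC, AD, BC, BD, CD) at alpha = .05
# would carry roughly a 1 - 0.95**6 (about 26%) chance of at least one
# spurious "significant" difference.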

3.3.4.5 Method of scoring the dictation tests

Various methods of scoring dictation are examined and reasons are given for the scoring method used in this study:

Cziko (1982) scores by segment using an exact-spelling criterion: a point is awarded for a segment on condition that there is no mistake in it. Bacheller (1980) created a scale of communicative effectiveness, where spelling, unlike in Cziko’s method, is disregarded. In Bacheller’s (1980) method each segment is rated on a scale of 0 to 5 according to how much meaning is understood. A score of zero indicates that none of the intended meaning of the segment has been captured; a score of 3 indicates that the subject apparently understands the meaning of the segment; a score of 5 indicates that the meaning is understood (see also Tönnes-Schnier and Scheibner-Herzig 1988). Because Bacheller’s emphasis is on the top-down process of coherence/meaning, this method tends to be subjective, especially if only one rater is involved.
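As an illustration of the difference between these two approaches, here is a minimal sketch, in Python and with an invented segmentation and response, of Cziko’s exact-spelling segment criterion: one point per segment, awarded only when the written segment reproduces the dictated segment without any mistake.

def cziko_segment_score(dictated_segments, written_segments):
    # One point per segment, and only for an exact match (spelling included).
    return sum(1 for dictated, written in zip(dictated_segments, written_segments)
               if dictated == written)

dictated = ["We were returning from a picnic up the river",
            "when the fire-engine raced past us."]
written  = ["We were returning from a picnic up the river",
            "when the fire engine raced passed us."]
print(cziko_segment_score(dictated, written))  # 1: the second segment contains mistakes

Bacheller’s 0-to-5 scale, by contrast, is a rater’s judgement of recovered meaning per segment and does not reduce to a simple string comparison, which is precisely why it tends towards subjectivity.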

Tönnes-Schnier and Scheibner-Herzig (1988) compared Bacheller’s method with the “traditional German method” of scoring dictation (henceforth referred to as the traditional method), in which all words are counted – one point for each word – and all errors, including spelling errors, are deducted. Thus the total score is the number of possible correct words minus the number of errors.

In the method used by Oller (1979:276, 282) and Stump (1978:48) each correct word is worth one point. One point is deducted for each deletion, intrusion or phonological or morphological distortion. Spelling errors, punctuation and capitalisation are not counted. As in the case of the traditional method, the total score is the number of possible correct words minus the number of errors. In the traditional method of counting errors, the “different kinds of errors” are not distinguished. Thus spelling errors – unlike in Oller’s method – are lumped together with omissions, intrusions, lexical and grammatical errors.
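The difference between the traditional method and Oller’s method therefore comes down to which error types are deducted. A minimal sketch, using a hypothetical error tally for the 86-word passage “The fire”, might look like this:

def traditional_score(total_words, error_counts):
    # Every error costs one point, spelling included.
    return total_words - sum(error_counts.values())

def oller_score(total_words, error_counts):
    # Spelling, punctuation and capitalisation errors are not counted.
    ignored = {"spelling", "punctuation", "capitalisation"}
    return total_words - sum(count for kind, count in error_counts.items()
                             if kind not in ignored)

errors = {"deletion": 2, "intrusion": 1, "distortion": 1, "spelling": 3}
print(traditional_score(86, errors))  # 79
print(oller_score(86, errors))        # 82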

Cziko (1982:378) found that his “exact-spelling segment scoring procedure … was three to four times faster than an appropriate-spelling word-by-word scoring system”. The reason is that one only has to look for one mistake in each segment, whereas in Oller’s procedure one has to take into account each and every error. Cziko (1982:375) found a correlation of .89 between his method and Oller’s method. In this study I used a variation of the traditional method: instead of counting every word or mistake, I decided (before the test was given) on a maximum possible score of 20, determined by the difficulty of the test, and deducted one point for any kind of error, including spelling errors. In my opinion this method yielded a valid indication of the level of proficiency of individual subjects. If one is only interested in norm-referenced tests, the size of the possible score does not matter, because in norm-referenced tests one is only interested in the relative position of individuals in a group, not in their actual scores. One could then measure the correlation between this shorter procedure and Oller’s procedure; if the correlation proved high, the shorter procedure could be used. I therefore did a correlational analysis on the dictation tests between Oller’s method and my variation of the traditional method (a possible 20 points). High correlations were found; these are reported and discussed after the ANOVA results in section 4.3.
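The correlational check described above can be sketched as follows; the paired scores are invented, and the actual correlations are reported in section 4.3. The point is simply that a high Pearson correlation between the two scoring procedures would license the shorter one for norm-referenced ranking.

from scipy.stats import pearsonr

# Hypothetical paired scores for eight subjects on the same dictation passage:
oller_scores   = [80, 74, 69, 83, 77, 65, 71, 79]  # Oller's method (correct words minus errors)
variant_scores = [18, 16, 14, 19, 17, 12, 15, 18]  # variation of the traditional method (20 minus all errors)

r, p = pearsonr(oller_scores, variant_scores)
print(f"r = {r:.2f}, p = {p:.4f}")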

I marked all the dictation protocols (N=86) myself. The main reason for this was that the teachers/presenters did not have time for extensive marking, because the administration of the tests had to be completed within the limited time allotted in the School’s programme. Recall that the dictation tests were only one part of a large battery of tests, and the teachers/presenters were involved in administering the whole battery. At the end of section 4.2 I discuss the issue of rater reliability when a single rater is used.

3.4 Summary of Chapter 3

The sampling procedures, and the structure, administration and scoring procedures of the tests were described. Although the subjects are divided into an L1 sub-group and an L2 sub-group, the two sub-groups are treated as a composite group in the correlational analyses that follow in Chapters 4 and 5. The reason for this is that the following conditions were the same for all the subjects:

(1) The admission criteria to the School.

(2) The English proficiency tests and their administration (this investigation).

(3) The academic demands of the School.

(4) The treatment they were given at the School. What is relevant to the statistical rationale of this investigation is not that the entrants had received different treatment before entering the School, where some may have been disadvantaged, but only that all entrants received the same treatment after admission to the School.

(5) The proportion of L1 and L2 learners (as I have defined these labels) was similar from year to year at the School.

All five conditions show that the 1987 Grade 7 sample represented subjects drawn from the same population of Grade 7 learners at the School from year to year, specifically from 1980 to 1993, irrespective of their origin and of whether they are divided into “L1” and “L2” groups.

Various methods of administration and scoring procedures for the different tests were appraised in order to justify the methods of administration and scoring chosen for this study. The next chapter presents the results and statistical analysis of the battery of proficiency tests.

