
Chapter 4 – Results of the Proficiency Tests

(The author references are missing in this wordpress copy because in the original thesis they appeared as footnotes, which I couldn’t reproduce here. If you would like to know a reference, please contact me).

Results of the Proficiency Tests

4.1 Introduction

4.2 Reliability coefficients

4.3 Analysis of variance of the dictation tests

4.4 Validity coefficients

4.5 Descriptive data of the L1 and L2 groups

4.6 The L1 and L2 groups: Do these levels represent separate populations?

4.7 Comparing groups and comparing tests

4.8 Error analysis and rater reliability

4.8.1 Rater reliability among educators of teachers of English

4.9 Summary of Chapter 4

4.1 Introduction

There are two basic kinds of statistics: descriptive and inferential. Descriptive statistics summarise a whole array of data; examples of such summary measures are means, standard deviations, analysis of variance, and reliability and validity coefficients. Inferential statistics indicate the extent to which a sample (of anything) represents the population from which it is claimed to have been drawn.

The population in this study refers to the Grade 7 entrants at MHS from its inception in 1980 up to the present day. This study is particularly interested in the L2 learners at MHS and the wider population of Grade 6 Tswana-mother-tongue speakers at DET schools in the North West Province of South Africa who were admitted to Grade 7 at MHS from 1980 onwards.

This chapter provides the descriptive statistics of the English proficiency tests. In the next chapter, descriptive and inferential statistics are provided of the prediction of academic achievement. This chapter also deals with inferential issues regarding the L1 and L2 groups, which have an important bearing on the notion of “levels” of proficiency, a central notion in this study.

From the outset, I need to point out that there was a significant difference between the means of the L1 and the L2 groups. This has important inferential implications, for, if the L1 and L2 groups belong to separate populations (in the statistical sense of the word), one couldn’t consider the two groups as a uniform group for correlational purposes. I shall argue that the L1 and L2 groups do not belong to separate populations.

This chapter contains the following sets of results:

(1) Reliability coefficients of all the tests.

(2) Analysis of variance (one-way ANOVA) of the dictation tests only.

(3) Validity coefficients of all tests.

(4) Means and standard deviations of the L1 and L2 groups on all the tests.

4.2 Reliability coefficients

Two kinds of reliability measurements were used:

– The Pearson r correlation formula measures the parallel reliability between two separate, but equivalent, i.e. parallel, tests. The tests involved are the two cloze tests, the two dictation tests and the two essay tests. The procedure used for calculating the reliability of parallel (forms of) tests is to administer the tests to the same persons at the same time and to correlate the results as indicated in the following formula:

rtt = rA,B (Pearson r formula)

where

rtt = reliability coefficient,

and

rA,B = the correlation of test A with test B when administered to the same people at the same time.

– The Kuder-Richardson 20 (KR-20) formula splits a single test in half, and treats the two halves of the test as if they are parallel tests. The tests involved are the error recognition test and the mixed grammar test. The following parallel and KR-20 reliability coefficients are reported:

TABLE 4.1

Reliability Coefficients of All the Tests
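As a concrete illustration of the parallel-forms procedure described above, here is a minimal sketch in Python (not part of the original thesis; the scores are invented):

import numpy as np

# Invented scores of the same ten learners on two parallel cloze passages (percentages).
cloze_a = np.array([55, 30, 70, 25, 60, 45, 80, 35, 50, 65], dtype=float)
cloze_b = np.array([50, 35, 75, 20, 55, 50, 85, 30, 45, 60], dtype=float)

# Parallel (equivalent-forms) reliability: the Pearson r between the two forms,
# administered to the same people at the same time (rtt = rA,B).
r_tt = np.corrcoef(cloze_a, cloze_b)[0, 1]
print(round(r_tt, 2))

(A corresponding sketch of the KR-20 computation is given in the discussion of Figure 4.1 below.)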

The problem is ensuring that two tests are parallel. For example, it is very difficult to ensure parity of content, not only in “integrative” tests such as cloze, dictation and essay but also in “discrete-point”, or “objective”, tests. This is so because all tests, no matter how “objective” they look, are subjective. Accordingly, it is better to speak of integrative and discrete-point formats than of integrative or discrete-point tests. From this position it is a small step to speaking of parallel scoring, because it is only in the sense that test scores are found to be “parallel” that we can talk of tests being parallel. Statistics becomes not only sensible but indispensable in this matter: (1) if the “parallel” tests ranked individuals in a group in a similar way, i.e. if there were a high correlation between the tests, and (2) if there were no significant difference between the means of the two tests, this would be good evidence that tests of similar formats and scores were parallel tests. Table 4.2 shows that there was no significant difference between the means within each of the three pairs of integrative tests, because the t statistic was less than the critical value.

TABLE 4.2 Means and Standard Deviations of Parallel Tests (N=86)

Interrater reliability was a factor only in the essay tests, because these were the only tests marked by more than one rater. With regard to essay tests, Henning points out that “because the final mark given to the examinee is a combination of the ratings of all judges, whether an average or a simple sum of ratings, the actual level of reliability will depend on the number of raters or judges.” According to Alderson,

[t]here is considerable evidence to show that any four judges, who may disagree with each other, will agree as a group with any other four judges of a performance. (It was pointed out that it is, however, necessary for markers to agree on their terms of reference, on what their bands, or ranges of scores, are meant to signify: this can be achieved by means of a script or tape library). (Original emphasis).

If interrater reliability is established in this way, complex statistical procedures for calculating it become unnecessary. One would simply compute the average of the four raters’ scores for Essay 1 and Essay 2, respectively, and then compute the parallel reliability coefficient between the average of Essay 1 and the average of Essay 2. This was the procedure used to compute the reliability coefficient of the essay tests.
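A minimal sketch of this averaging procedure (Python; the rater marks are invented, not the thesis data):

import numpy as np

# Invented marks: rows are learners, columns are the four raters, for Essay 1 and Essay 2.
essay1_raters = np.array([[60, 55, 65, 58],
                          [30, 35, 28, 33],
                          [72, 70, 75, 68],
                          [45, 50, 48, 44]], dtype=float)
essay2_raters = np.array([[62, 58, 60, 64],
                          [28, 32, 30, 35],
                          [70, 74, 71, 69],
                          [47, 46, 50, 45]], dtype=float)

# Each learner's essay mark is the average of the four raters' ratings.
essay1 = essay1_raters.mean(axis=1)
essay2 = essay2_raters.mean(axis=1)

# Parallel reliability of the essay test: Pearson r between the two averaged essays.
print(round(np.corrcoef(essay1, essay2)[0, 1], 2))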

A reliability analysis was also done in a cumulative fashion on the error recognition test (ER) and the mixed grammar test (GRAM). The reason why the reliability coefficients were computed in this cumulative fashion was to find out the minimum number of items required to ensure high reliability.

As shown in Figure 4.1, I used the first 10 items (items 1-10), then the first 20 items (items 1-20), then 30 items (items 1-30), and so on. The KR-20 formula was used to compute the reliability coefficients.
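A minimal sketch in Python of this cumulative KR-20 procedure, using simulated item responses rather than the thesis data:

import numpy as np

def kr20(items):
    # KR-20 for a 0/1 item-response matrix (rows = examinees, columns = items).
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)                 # proportion correct per item
    q = 1.0 - p
    total_var = items.sum(axis=1).var()    # variance of examinees' total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Simulated 0/1 responses: 80 examinees by 50 items of varying difficulty.
rng = np.random.default_rng(0)
ability = rng.normal(size=(80, 1))
difficulty = np.linspace(-1.5, 1.5, 50)
responses = (ability - difficulty + rng.normal(size=(80, 50)) > 0).astype(int)

# Cumulative reliabilities: first 10 items, first 20 items, and so on.
for n in range(10, 51, 10):
    print(n, round(kr20(responses[:, :n]), 2))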

FIGURE 4.1  Comparison between the reliability coefficients of ER and GRAM (N=80)

The reliability coefficients of ER and GRAM follow an almost identical pattern. An important statistical truth is illustrated by these reliability data, namely that fewer than 40 items are not likely to produce satisfactory reliability coefficients, i.e. of .90 or higher, for discrete, objective items. In multiple-choice grammar tests a reliability coefficient between .90 and .99 is usually required for the test to be considered reliable, whereas in tests such as an essay test, a reliability coefficient of .90 is considered high. There is also a tapering off of the reliability coefficient after 40 items until it reaches an asymptote, where any increase in items does not result in a significant increase in reliability.
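The tapering-off pattern is what the standard Spearman-Brown relationship between test length and reliability would lead one to expect (a textbook psychometric result, not a computation made in the thesis). A small illustrative sketch in Python, assuming, purely for illustration, that a 10-item block has a reliability of .60:

# Spearman-Brown: r_k = k*r / (1 + (k - 1)*r), where r is the reliability of one
# block of items and k is the factor by which the test is lengthened.
r_block = 0.60                      # assumed reliability of a 10-item block
for k in range(1, 9):               # 10, 20, ..., 80 items
    r_k = (k * r_block) / (1 + (k - 1) * r_block)
    print(k * 10, "items:", round(r_k, 2))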

Some of the reliability calculations may appear odd, because if it is true, as I have shown, that fewer than 40 items produce low reliability coefficients, then (1) why use 10 items for CLOZE, and (2) why use the parallel method of reliability for CLOZE and the KR-20 (split-half) method of reliability for GRAM and ER? The answer to these questions requires an answer to another question: (3) why is the parallel reliability of CLOZE with only 10 items as high as .80, while the KR-20 reliability of GRAM and ER with 10 items is a low .60? Answers to these questions lie in the relationship between grammatical/linguistic competence (sentence meaning) and discourse competence (pragmatic meaning) and the continuum of “integrativeness” (see section 2.5).

One doesn’t merely look at the format of a test to decide whether it is a “discrete-point” test. One looks at what the test is testing. As pointed out earlier, it is possible to write few words, i.e. a “discrete-point” format, as in a cloze test (or as in “natural” settings) and still be testing “communicative competence”, or “pragmatic” language. In the case of GRAM and ER, each of these tests consists of unrelated “objective” items; that is, there is no “pragmatic” connection between them. The KR-20 formula is used to measure the reliability of objective items. The Pearson r formula is used to measure parallel reliability of tests at the pragmatic end of the integrative continuum: the cloze, dictation and essay tests. It is true that sometimes the KR-20 formula is used to measure the reliability of cloze tests, because some authors, e.g. Alderson, maintain that many cloze tests test “low order” skills that “in general [relate] more to tests of grammar and vocabulary… than to tests of reading comprehension”. This is not a point of view shared by many other authors (see section 3.3.1.1).

A split-half reliability method (of which the KR-20 formula is a sophisticated version) on an “integrative” test such as a cloze test, dictation test, or an essay test may not be a good idea for the very important reason that the two halves of such tests do not consist of clusters of comparable items: owing to their “pragmatic” nature, the items are not completely independent; they all hang together. If items hang together, as in integrative tests, one may not have to worry about searching for an “empirical basis for the equal weighting of all types of errors”, as Cziko believes it necessary to do for all tests.

If there is only one test form, e.g. as in Cziko’s dictation test, one cannot use the parallel reliability method, but one can measure reliability using other methods such as the test-retest method. With regard to the cloze tests, the parallel reliability coefficient of .80 is not only quite acceptable for a “pragmatic” test, but also very good for only ten deletions.

I would like to add a few remarks on rater consistency, or rater reliability. I argued that in the dictation test, presenters and groups were not confounded (i.e. each group had its respective presenter). Therefore, it was legitimate to subsequently do an ANOVA of the four groups/presenters/presentations. One may accept this rationale but still be concerned about the rater reliability of the dictation test (and the cloze test for that matter) because only one rater was involved, and not four as in the essay test. The question, therefore, is whether the scoring procedures in these tests lack evidence of consistency of application owing to the fact that there was only one rater (myself). This should not be a problem in the dictation test, because I didn’t have to worry about distinguishing between spelling and grammatical errors (which can be a serious problem), owing to the fact that only wrong forms of words, intrusions and omissions were considered in my marking procedure. In the cloze tests special care was taken to allow for all acceptable answers. The error recognition test and mixed grammar test had only one possible answer per item. The answers to the latter two tests were provided by the test compilers.

4.3 Analysis of variance of the dictation tests

Recall that a separate presenter was used for each of four groups of subjects. There were four presentations on a rotational basis (Table 3.3). An analysis of variance (ANOVA) was conducted on the – and I must stress this point – first presentation to test for any significant difference between the four presenters’ procedures of presentation. As I explained in section 3.3.5.4, no scores for the statistical analysis were used from dictations that had been heard more than once by any group in the rotation of presenters. Accordingly, presenters and groups were not confounded. In other words, Presenter 1 coincided with Group 1, Presenter 2 with Group 2, and so on.
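A minimal sketch (Python, invented scores, not the thesis data) of a one-way ANOVA of this kind:

from scipy import stats

# Invented first-presentation dictation scores (out of 20) for the four presenter/group pairs.
group1 = [14, 9, 17, 6, 12, 15, 8, 11]
group2 = [13, 10, 16, 7, 12, 14, 9, 10]
group3 = [15, 8, 18, 5, 11, 16, 7, 12]
group4 = [12, 11, 15, 6, 13, 14, 8, 10]

# One-way ANOVA: is there a significant difference between the four groups/presenters?
f_stat, p_value = stats.f_oneway(group1, group2, group3, group4)
print(round(f_stat, 2), round(p_value, 3))
# A p value above .05 means the null hypothesis (no difference between presenters)
# is not rejected, which is the result reported in Table 4.3.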

The ANOVA showed (Table 4.3) that there was no significant difference between the four groups, i.e. the null hypothesis was not rejected. If the null hypothesis had been rejected, this would have demonstrated that there was a significant difference between the four presenters’ procedures of presentation. Under these circumstances the use of the dictation in a correlational analysis with other tests would be invalid, because it would have been illegitimate to combine the four dictation groups into a composite group. The results of the ANOVA are reported below.

TABLE 4.3   Analysis of Variance of the Dictation Tests with First Presentation

Statistical results, in this case the ANOVA, cannot tell us why there was no significant difference between the different presenters. To make qualitative comparisons between the results obtained from different presenters, one would have to examine the results of the four different presenters on the first presentation given to the four groups of subjects. I stress the first presentation because the possibility exists that the dictation passages would become progressively easier with each subsequent presentation.

I examined a random selection of protocols (from both the L1 and L2 groups) to find out whether there was any difference in the quality of output using different presenters. This analysis of protocols was a lengthy enterprise and would thus take up too much space if reported in this study. It is fully reported elsewhere (Gamaroff, forthcoming). The intention is not at all to treat qualitative data in a cavalier fashion. The point is that this study’s main emphasis is on quantitative data. Qualitative data are not at all ignored, however (see section 4.8ff). What I shall do here is summarise the conclusions of the qualitative analysis of the dictation tests:

The dictation passages (Pienaar’s restored [or “unmutilated”] cloze passages) for the Grade 7 subjects were intended for the Grades 5 to 7 L2 levels and for the Grades 5 and 6 L1 levels. Consequently, the L1 group would be expected to do well, even if the presenter’s prosody were unfamiliar. As the statistics will show (section 4.5, Table 4.5), the L1 group did well and the L2 group did badly.

Recall (section 3.3.5.4) that I used a variation of the traditional procedure, where errors were subtracted from a possible score of 20 points. One point was deducted for any kind of error, including spelling, and the actual score was deducted from a possible score of 20. This was done because I believed that this procedure would yield a valid indication of the level of proficiency of individual subjects. If one were only interested in norm-referenced tests, it wouldn’t matter what the possible score was, because in norm-referenced tests one is only interested in the relative position of individuals in a group, not in their actual scores. One could then measure the correlation between this procedure and Oller’s procedure. If the correlation is found to be high, one could use the shorter procedure. A correlational analysis was done on the dictation tests between Oller’s procedure and my variation of the traditional procedure (a possible 20 points). High correlations were found: .98 for the first dictation passage, and .89 for the second dictation passage. The reason for the high correlations is probably the following:

The word forms of the L2 group were so deviant that I regarded them as grammatical errors. In the L1 group, in contrast, the scores were very high, which meant that no scores were subtracted for spelling or for grammatical errors. As a result, in both groups, spelling had no significant effect, which meant that very few marks were subtracted for spelling. This means that whatever possible score I chose, the correlations between my procedure and Oller’s procedure would have been high; hence the high correlations reported in the previous paragraph. (Correlation is not concerned with whether scores are equivalent between two variables, but only with the common variance between two variables, i.e. whether the scores “go together”; see section 5.3). So, if Oller’s dictation procedure yielded relatively higher scores than my procedure, this doesn’t affect the correlation.
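A small numerical illustration (Python, invented marks, not the thesis data) of the narrower point that the choice of possible score does not affect a correlation, because Pearson r is unchanged by a positive linear rescaling of one variable:

import numpy as np

# Invented dictation marks for eight learners, out of a possible 20 points.
marks_out_of_20 = np.array([18, 4, 15, 2, 12, 17, 6, 10], dtype=float)

# The same performances re-expressed out of a possible 87 points (one per word).
marks_out_of_87 = marks_out_of_20 * (87 / 20)

# Some other set of scores to correlate against.
other_test = np.array([70, 25, 66, 15, 55, 72, 30, 48], dtype=float)

print(round(np.corrcoef(marks_out_of_20, other_test)[0, 1], 2))
print(round(np.corrcoef(marks_out_of_87, other_test)[0, 1], 2))
# Both correlations are identical, which is why the possible score chosen for the
# dictation procedure has no bearing on the size of the correlation.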

One can explain the difference in performance between the L1 and L2 groups in terms of the difference between the information-processing strategies used by low-proficiency and high-proficiency learners. When we process language, we process in two directions: bottom-up from sound input and top-down from the application of the cognitive faculties. With regard to the dictation test in the study, the words were highly predictable for the L1 group, and therefore this group did not have to rely totally on the sound input. The opposite was the case for the L2 group, where there was almost a total reliance on the bottom-up process of sound recognition. In other words, native listeners or listeners with high proficiency “can predict the main stresses and can use that fact to ‘cycle’ their attention, saving it, as it were, for the more important words.” It should be kept in mind, however, that bottom-up processes from sound input play a major role at all levels of proficiency, not only at the low levels.

The difficulties experienced by the L2 group did not only have to do with lexical lacunae: there is much more to knowing a word than knowing the various meanings it may have. To master a word one also needs to know its form, its frequency of use, its context, its relationship to other words. Problems can occur in any of these areas. This applies to all the tests of the test battery.

4.4 Validity coefficients

The singular term test will be used to refer to the means of the two cloze tests (CLOZE), of the two essay tests (ESSAY) and of the two dictation tests (DICT). With the single mixed grammar test (GRAM) and the single error recognition test (ER), there are five “tests” altogether. Table 4.4 shows the validity coefficients of the English proficiency tests. The numbers in the top row refer to the tests that appear next to the corresponding numbers in the leftmost column.

TABLE 4.4

Validity Coefficients of the English Proficiency Tests (p < .01)

* Corrected for part-whole overlap. Part-whole overlap occurs when an individual test score is correlated with the total score of all the tests of which its score is a part. In such a situation, one would not be measuring two variables that are separate from one another, which results in part-whole overlap between the individual test and the total score. This part-whole overlap would increase the correlation, thus giving an inaccurate picture.
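One common way of making such a correction (a sketch of the general idea, not necessarily the exact computation used in the thesis) is to correlate each test with the total of the remaining tests rather than with the grand total that includes it:

import numpy as np

# Invented score matrix: rows are learners, columns are the five tests
# (CLOZE, DICT, ESSAY, GRAM, ER), all expressed as percentages.
scores = np.array([[66, 71, 56, 60, 55],
                   [26, 16, 29, 40, 20],
                   [50, 45, 48, 52, 44],
                   [70, 75, 60, 65, 58],
                   [33, 25, 35, 45, 30]], dtype=float)

total = scores.sum(axis=1)
for j in range(scores.shape[1]):
    inflated = np.corrcoef(scores[:, j], total)[0, 1]                   # part-whole overlap
    corrected = np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]   # test against the rest
    print(j + 1, round(inflated, 2), round(corrected, 2))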

The high validity coefficients are impressive and perhaps unusually high. For this reason the raw data and computations (using the statistical programme “Statgraphics”) were rechecked twice. High validity coefficients, however, are not unusual between these tests (see Note 80, Chapter 1). Validity coefficients, unfortunately, do not give a close-up picture and thus often need to be supplemented by other descriptive data such as frequency distributions, means and standard deviations. The next section shows these other descriptive data, where a comparison is made between the L1 and the L2 groups.

 4.5 Descriptive results of the L1 and L2 groups

The differences between the performance of the L1 and L2 groups are shown below. The following data are provided:

1. Means and standard deviations (Table 4.5).

2. A frequency distribution (Table 4.6).

The following measures appear in the tables:

1. CLOZE – Average of Cloze tests 1 and 2 (N=86).

2. DICT – Average of Dictation tests 1 and 2 (N=86).

3. ESSAY – Average of Essay tests 1 and 2 (N=86).

4. GRAM – Mixed grammar test (N=80).

5. ER – Error recognition test (N=80).

A statistically significant as well as a substantial difference was found between the means of the two groups as shown in Table 4.5.

TABLE 4.5   Means and standard deviations for the L1 and L2 groups

When the t statistic is greater than the critical t value, this shows that there is a significant difference between the two groups. (According to Nunan, when two sets of scores have substantially different means or standard deviations, it is not necessary to use a t-test to test for a significant difference between means). The frequency distributions are shown in Table 4.6.
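A minimal sketch (Python, invented scores, not the thesis data) of the kind of t-test comparison just described:

from scipy import stats

# Invented cloze percentages: the L1 learners cluster high, the L2 learners low.
l1_scores = [66, 72, 58, 75, 63, 70, 68, 61]
l2_scores = [26, 30, 18, 35, 22, 28, 24, 31]

# Independent-samples t-test of the difference between the two group means.
t_stat, p_value = stats.ttest_ind(l1_scores, l2_scores)
print(round(t_stat, 2), round(p_value, 4))
# If the absolute t statistic exceeds the critical t value (equivalently, if p < .05),
# the difference between the L1 and L2 means is significant.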

 TABLE 4.6 Frequency Distribution of all the Tests

The L2 group did very poorly on the dictation and the error recognition tests, less poorly on the cloze and essay tests, and best of all on the grammar test. The L1 group did best on the dictation and the grammar tests, while on the other tests the order of increasing difficulty was the cloze, the essay and the error recognition tests. In Chapter 5 the frequency distributions are analysed in more detail in relation to the prediction of academic achievement.

Does the significant difference between the L1 and L2 scores above mean that these two groups come from different populations and therefore should not be treated as a composite group in a correlational analysis? I examine this question in section 4.6.

The multiple-choice format is vulnerable to guessing. Sometimes it is recommended that scores be adjusted for guessing, as in Bloor et al.’s GRAM and ER that were used in this study. Guessing was taken into account in the study, which meant that in the mixed grammar (GRAM) test, a score of 88% was reduced to 85%; a score of 64% to 55%; and a score of 40% to 25%. Thus the person who scores more loses proportionately less. The score of 40% in GRAM is used to show how to calculate the adjustment for guessing:

100 minus the actual score (40%), divided by the number of options in the item (4 options), equals 15%.

40% (actual score) minus 15% = 25% (adjusted score).

As shown in the last line of the calculation, the result of the first line (15%) is subtracted from the actual score of 40% to give an adjusted score of 25%. The greater the number of options, the smaller the adjustment, because the test would be more difficult. ER has five options, and so the adjustment is less than for GRAM. Suppose the actual ER score was also 40%, as in the GRAM example above. The adjusted score of ER would be 28%, which is 3% higher than the adjusted score of GRAM:

100 minus the actual score (40%), divided by the number of options in the item (5 options), equals 12%.

40% (actual score) minus 12% = 28% (adjusted score).
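The whole adjustment can be expressed in a few lines of Python (a sketch reproducing the worked examples above):

def adjust_for_guessing(actual_score, n_options):
    # Adjusted score = actual score minus (100 - actual score) / number of options.
    # Scores are percentages.
    return actual_score - (100 - actual_score) / n_options

print(adjust_for_guessing(40, 4))   # GRAM, 4 options: 40% -> 25.0
print(adjust_for_guessing(40, 5))   # ER, 5 options:   40% -> 28.0
print(adjust_for_guessing(88, 4))   # 88% -> 85.0
print(adjust_for_guessing(64, 4))   # 64% -> 55.0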

One cannot prove that someone is guessing, and without proof, it might be argued that one would be penalising non-guessers as well as guessers: “in multiple-choice formats, guessing affects scores and, though statistical procedures are available to correct for it, they necessarily apply indiscriminately whether or not a learner actually has guessed” (Ingram 1985). However, the logic behind correction for guessing is not indiscriminate even though it affects everybody. As shown in the examples above, the less one knows, the greater the likelihood of guessing. Although one cannot be sure who is guessing, the rationale of the adjustment for guessing is based on what we know about learning and test performance. The key point of logic in the adjustment for guessing is that the lower the original score, the greater the possibility that one is guessing. If the scores are not adjusted for guessing, this would of course affect the ranges of scores. But in this study, I am not interested so much in the absolute values of these ranges as in the relative values: the L1 group relative to the L2 group.

I now focus on the cloze results because these cloze tests have been used elsewhere and have produced a solid body of results with which one can compare the results in this study. I shall also introduce cloze data from another school (to be described shortly). Recall that Pienaar tested a variety of learners from different schools, including Bantu speakers living in sub-economic settlements in the environs of Mmabatho (category 4b; see section 3.3.1.2). (Pienaar used the label “III” for the group that I called 4b.) Many of the parents of category 4b were illiterate or semi-literate and were either unemployed or semi-employed. The sample at MHS did not contain learners who belonged to this category, as shown by the occupations of the parents of the L2 group in Table 4.7. [I have not included this table from the original thesis.]

The L2 group contains a good number of parents who work in the education field (see highlighted occupations). One cannot infer that the children of educators are usually advantaged, because in South Africa, education was one of the few professions open to blacks.

To make the statistical data more comparable I included in the investigation a middle school (Grades 7 to 9) situated in the environs of Mmabatho that accommodated learners similar to Pienaar’s category 4b. The sample from this school, referred to as MID, consisted of 40 Grade 7 learners. Learners at MID come from many primary schools in the area, owing to the fact that there are far more primary schools than middle schools in the area. Figure 4.2 compares the cloze test frequency distributions of the MHS L2 group with the Middle School (MID).

FIGURE 4.2 A Comparison of the MHS L2 Group with the Middle School (MID) on the Cloze Tests

The MID school results are very similar to Pienaar’s category of sub-economic learners, namely, his category III (which I have called category 4b). By comparing MID with the MHS sample we see that the L2 group at MHS – poorly as it has done – is better than the MID group. The MID group is comparable with Pienaar’s “at risk” group: indeed at high risk. The MHS L2 group is also at high risk, but the MID group is much worse.

4.6 The L1 and L2 groups: Do these two levels represent separate populations?

Section 4.4 showed that there were high correlations between the discrete-point and integrative tests of the test battery. This study is not only concerned with statistical concepts such as correlation but also with the problem of assigning levels of language proficiency to learners. The discussion to follow is relevant to both these issues:

It is only after the test has been performed on the test-bench that it is possible to decide whether the test is too easy or too difficult. Furthermore, if there are L1 and L2 subjects in the same sample, as is the case with the sample in this study, one needs to consider not only whether the norms of the L1 and the L2 groups should be separated or interlinked but also how to ensure the precise classification of the L1 and L2 subjects used for the creation of norms.

As far as the correlational analysis was concerned, I interlinked the L1 and L2 groups and treated them as a composite group. But I also separated the L1 and L2 groups in order to find out whether there was a significant difference between the means of the two groups. If a significant difference were not to be found between the L1 and L2 groups, this would militate against the construct validity of the tests, because it would mean that the L2 group, which should be weaker than the L1 group, was just as proficient as the L1 group. Under such conditions, we would have no idea what we were testing.

The question is whether it is legitimate to treat the L1 and L2 subjects as a composite group (for a correlational analysis) as well as two separate groups (for comparing the means between the L1 and L2 groups). One may object that one cannot do both; that one cannot interlink groups and also separate them. I shall argue that one can.

As shown in Table 4.4 the correlations between the different tests were high. The means and standard deviations (Table 4.5), however, show that there was a significant difference between the L1 and L2 groups. Can one, accordingly, maintain that, because there was a significant difference between the L1 and L2 groups, these two groups belong to separate populations, and thus argue that the correlations were artificially inflated by combining samples that represent two separate populations? A discussion of this question raises the question of the logicality of dividing the subjects into L1 and L2 groups. Is this division arbitrary or does it have a cogent theoretical rationale on which one can make the inference that the L1 and L2 groups represent different populations?

There are two distinct, though related, issues, namely levels of proficiency and correlations. The logic of correlation, which is based on a bell-curve distribution, is that tests that do not have a reasonably wide spread of scores (high achievers and low achievers) could give a false picture: tests that have a large spread of scores around the mean are more likely to be replicable, because in a representative sample of human beings there is likely to be a wide range of ability.

This does not mean that it is not possible to have a high correlation with a narrow spread, or a low correlation with a wide spread, but it is more likely that a correlation would be higher with a wide spread of scores, say 0% to 80%, than a narrow spread, say 40% to 80%. The sample in this study represented the Grade 7 population at the school throughout the years.

In the assessment of levels of proficiency I separated the high achievers from the low achievers in the sample because they could be distinguished – unsurprisingly – as those who took English as a First Language and English as a Second Language, respectively.

I discuss briefly the theory of investigating the difference between groups. Consider the following example:

In South Africa there are many immigrants from different countries for whom English is a foreign language. If one tested and compared the English proficiency of a group of Polish immigrants and Chinese immigrants and found no significant difference between these two groups, one wouldn’t be surprised, because one would probably conclude that there was a wide spread of scores in both groups. If a significant difference were to be found, one may be curious to know why the one national/ethnic group did worse or better than the other.

Replace the Polish and Chinese immigrants with two other groups, the L1 and L2 groups in this study. A significant difference was found between these two groups, but this is not surprising at all, because it is to be expected that the group taking English as a First Language subject (the L1 group) would be better at English than the group taking English as a Second Language subject (the L2 group), assuming that the subjects (test takers) in the sample made a reasonable choice of which group to belong to. (Recall that learners at MHS initially decided themselves whether they belonged to L1 or L2. In most cases they had a good idea where they belonged.) Accordingly, it is quite logical that there would be a significant difference between the L1 and L2 groups, or levels. To get a clearer grasp of the issue of the respective levels of proficiency of the two groups and the “separate populations” question one has to examine whether:

(1) The reliability aspects were the same for the L1 and L2 subjects, e.g. the same tests, same testing facets and same testing conditions, etc. (see section 2.9). This was so.

(2) The composition of the sample, i.e. the proportion of L1 and L2 learners, was similar from year to year at MHS. This was so. In other words, the 1987 Grade 7 sample represented the population of Grade 7 learners at MHS from year to year, specifically from 1980 to 1993.

One would also look at whether there were differences in:

(3) Admission criteria for the L1 and L2 groups.

(4) The background, or former treatment of L1 and L2 learners before they entered the school.

(5) What one expected from the L1 and L2 learners.

(6) The treatment they were given in the same education situation.

All the above points except for (4) apply to both the L1 and L2 subjects. I discuss (4):

MHS endeavours to provide disadvantaged learners with the opportunity to learn in an advantaged school situation. In the validation of the sample, the notion of disadvantage is important. In South Africa the term disadvantage often bears the connotation of “consciously manipulated treatment” meted out by apartheid. Treatment can have the following two connotations: (i) consciously manipulated treatment in an empirical investigation and (ii) the long-term treatment – be it educational, social, economic, cultural or political – of human beings in a non-experimental life situation.

What is relevant to the statistical rationale of this investigation is not the fact that the entrants to MHS had received different treatment prior to entering MHS, where some may have been victims of apartheid and others not, but only the fact that all entrants received the same treatment after admission to MHS. I am not implying that their background experience is inconsequential as far as the teaching situation – past (at former schools) or future (at MHS) – is concerned, but only that all entrants were expected to fulfil the same academic demands. I discuss later the role of language background, specifically the role of English input.

The vast majority of the 1987 Grade 7 intake had high Grade 6 scores from their former schools. This was the main reason why many of them were admitted to MHS. The disadvantaged group and the advantaged group both consisted of high-scoring entrants, as revealed by the Grade 6 school reports. Accordingly, it appeared that all the entrants were extremely able, whether they came from an advantaged or a disadvantaged background. Now, suppose one found that (i) high Grade 6 scores (from former schools) were obtained by both the L1 and L2 groups but that (ii) while high English proficiency test scores were obtained by the L1 group, low English proficiency test scores were obtained by the L2 group. The findings showed that both these facts were so. This does not mean, however, that the L1 and L2 subjects belong to different populations. What it shows – on condition that the English proficiency tests were valid and reliable, which the findings show was the case – is the true nature of the population, namely, a wide spread of scores.

So, although it seemed, from the good Grade 6 reports of all entrants at MHS (L1 and L2 entrants) from year to year, that MHS only admitted high achievers, the reality was that MHS admitted a mixture of academically weak learners (who were generally disadvantaged) and academically strong learners (who were generally advantaged), as was the case with the 1987 Grade 7 sample. Further, learners at MHS received the same treatment.

The statistical analysis should be kept distinct from educational, social, economic, cultural, political and other deprivations that pre-existed admission to MHS. The principal issue in this study is what learners are expected to do after admission, where all learners are called upon to fulfil the same academic demands, except for the language syllabuses, namely English, Tswana, Afrikaans and French, and where all are required to use English as the medium of instruction.

If it is true that former academic achievement (Grade 6 in this case) is the best predictor of subsequent achievement, it would follow that many of these entrants should have had at least a reasonable standard of academic ability. What happened in fact was that although almost all of the 1987 Grade 7 entrants (L1 and L2) obtained high Grade 6 scores on English achievement and on their aggregates, many of the L2 entrants (who were mostly disadvantaged learners) obtained low scores on the English proficiency tests. For this reason, the sample turned out to be, as far as the English proficiency tests were concerned, a representative mixture of weak and strong learners, i.e. a random sample. This fact is crucial to the validation of any sample, whose essential ingredient is randomness.

I am arguing, therefore, that the L1 and L2 groups do not represent separate populations: they are merely a mixture of weak and strong performers, where it is only logical that weak subjects would prefer to belong to the L2 group than to the L1 group and that the L2 group would also do relatively worse than the L1 group on the English proficiency tests. It turned out that there was a clear distinction between the L1 and L2 groups. Most of the L2 group did poorly and most of the L1 group did relatively much better on the tests, hence the significant difference in the means between the two groups.

In the traditional distinction between L1 and L2 learners, these two kinds of learners differ only in so far as L2 learners aspire to reach the L1 level. The difference, therefore, between L1 and L2 learners lies in the different levels of mastery. And that is what tests measure within a sample that represents a population: they measure which members are strong and which are weak. If the tests are too difficult for the L2 group or too easy for the L1 group this does not mean that the tests are invalid, i.e. that they have been used for the wrong purpose, if the purpose is to distinguish between weak and strong learners. One does not look at the actual scores as far as construct validity is concerned but at whether the tests distinguish between weak and strong learners. Oller elucidates (he is talking about one learner and one task, while I am talking about many learners and several tasks: I make the necessary adjustments in Oller to suit the context):

It is probably true that the [tasks were] too difficult and therefore [were] frustrating and to that extent pedagogically inappropriate for [these students] and others like [them], but it does not follow from this that the [tasks were] invalid for [these learners]. Quite the contrary, the [tasks were] valid inasmuch as [they] revealed the difference between the ability of the beginner, the intermediate, and the advanced [learners] to perform the [tasks].

(Section 4.7 elaborates on the comparison between test scores and the comparison between groups).

If one is or believes one is weak at English, one would sensibly prefer to take English as a Second Language, if one had a choice: one did have a choice at MHS. This is not to say that if one were good at English one would not take English as a Second Language, owing to the fact that somebody good at English could obtain higher marks taking English as a Second Language than taking English as a First Language.

Accordingly, those Tswana speakers in the L1 group at MHS who later changed to English Second Language could have done so not because they were weak at English but in spite of the fact that they were good at English.

4.7 Comparing groups and comparing tests

It might be argued that measuring the difference in means between groups apportions equivalent scores to each item, and accordingly does not take into account the relative level of difficulty of items. I suggest that the relative difficulty of items is not important in a language proficiency test, but it is indeed important in a diagnostic test, which has remediation as its ultimate purpose. With regard to proficiency tests, one is concerned with a specific level (e.g. elementary, intermediate or advanced) for specific people at a specific time and in a specific situation. Within each level there is a wide range of item difficulty. To attain a specific level of proficiency one has to get most of the items right – the difficult and the easy ones. In sum, the different bits of language have to hang together, which is what we mean by general, or overall, language proficiency. As pointed out earlier (section 2.5), the controversy is about which bits do and which bits don’t hang together.

We now come to a very important issue. What does a score of 60% on a test for an L2 learner in this study mean? An answer to that question requires a distinction between (1) the comparison between tests and (2) the comparison between groups: the L1 and L2 groups.

As a preliminary I refer to the relationship between norm-referenced tests and criterion-referenced tests. The former are only concerned with ranking individuals in a group and not, as in the case of criterion-referenced tests, with individual scores achieved in different tests. So, in norm-referenced tests one is interested in correlation, which is concerned with how individuals are ranked in a group on the tests involved in the correlation, and not with whether individuals achieved equivalent scores within a group. The latter is the concern of criterion-referenced tests. But, of course, one needs both kinds of information to get an empirically based idea of language tests. One has to be careful, however, when one compares tests.

In this study I have been comparing different kinds of tests and contrasting different groups of learners. The main focus is on the difference between groups, and thus there was no explicit and sustained attempt to contrast the scores between the different tests. This was deliberate. If one is going to compare the results between different tests, e.g. the dictation test and the cloze test, extreme caution is required, because such comparisons could lead to false conclusions. This is not to say that such comparisons are not useful; they can be very useful, but when one makes such comparisons, one must be aware of the parameters involved. Scores reveal nothing and surface errors reveal little about why a particular score was awarded. One has to look at the construction of the test, i.e. what, why and who is being tested and doing the testing, and how it is being tested. All these parameters are related to the scales of measurement that one uses. Consider the following measurement scales, especially the ratio scale. The ratio scale could be confused with the other scales:

– Nominal scale (also called categorical scale). This is used when data is categorised into groups, e.g. gender (male/female); mother tongue (English/Tswana).

– Ordinal scale. One could arrange proficiency scores from highest to lowest and then rank them, e.g. first, second, etc.

– Interval scales. One retains the rank order but also considers the distances (intervals) between the points, i.e. the relative distances between the points on the scale.

– Ratio scales are interval scales with the added property of a true zero score, where the “points on the scale are precise multiples, or ratios, of other points on the scale”. Examples would be the number of pages in a book, or the number of learners in a classroom. If there were 200 pages in a book, 100 pages would be half the book.

Taking these scales into account, consider the proficiency tests of the study:

– The cloze test. To be considered proficient enough to cope in a higher grade, one should obtain a score of at least 60% on the cloze tests. (The mean scores of the L1 and L2 groups for the cloze test were 66% and 26%, respectively). (See Table 4.5).

– The essay test. A score over 60% in the essay, in contrast to 60% on the cloze test in this study, would be considered a good score. A score of 40% on an essay test or on any test is not half as good as a score of 80%. As far as essay tests are concerned, 80% would be an excellent score, while 30% would be a poor score. But poor is not half of excellent. (The mean scores of the L1 and L2 groups for the essay test were 56% and 29%, respectively). (See Table 4.5).

– The dictation test. I used a score of a possible 20 points; one point for every correct word. But if I had made the score out of 86 or 87 points (the dictation passages consisted of 86 and 87 words, respectively), where every word counted one point, a score of 60% would mean that 40% of the words in the dictation passage would be wrong. It is hardly likely that a dictation protocol with a score of 60% marked in this way would be comprehensible. Accordingly, an individual’s score of 60% on a dictation test would not mean the same thing at all as 60% on the cloze and essay tests. (The mean scores of the L1 and L2 groups for the dictation test were 71% and 16%, respectively). (See Table 4.5).

– The error recognition and the mixed grammar tests. These test scores were adjusted for guessing. If one adjusts for guessing, one must take this into account.

To sum up, it is the “relative difference in proficiencies” between learners of high ability (in this case the L1 group) and low ability (in this case the L2 group) and not the equivalence in scores between the tests that determines the reliability and construct validity of the tests.

4.8 Error analysis and rater reliability

Although the average of four or even three raters may be a reliable assessment of a “subjective” test such as an essay test, it is usual in the teaching situation to have only one rater available, who is the teacher involved in setting the test. If there are two raters available, it is generally only the teacher who sets the test and is in overall charge of it who has the time or inclination to do a thorough job. Rater consistency is an extremely serious problem in assessment. The nub of the problem is one of interpretation, an issue that fills innumerable tomes in the human sciences, especially during this “postmodern” era. This is what is involved:

Logically prior to any question of the reliability and validity of an assessment instrument is the question of the human and social process of assessing…This is a radically interpersonal series of events, in which there is an enormous, unavoidable scope of subjectivity – especially when the competences being assessed are relatively intangible ones to do with social and personal skills, or ones in which the individual’s performance is intimately connected with the context.

It is the interpretation, or judgment, of errors that is the main problem in language testing. Ashworth and Saxton (in their quotation above) are concerned with the lack of equivalence in judgements and scores between raters. The subjectivity question in the battery of tests of this study remains a problem in the essay test. I tried to solve the problem by using four raters. But, in most testing situations only one rater and at most two raters are available. I would like to expand on the issue of rater reliability, because this seems to be the major problem in the assessment of “subjective” tests such as essay tests. Error analysis is brought into the picture.

The qualitative analysis of errors and quantitative measurement are closely related in issues of interrater reliability. In this section I discuss more theory, in this case the uses and limitations of error analysis, which serves as a background to the examination of a detailed practical example of the uses and limitations of error analysis and quantitative measurement. I begin the discussion by assessing the value of the quantitative procedures used in this study in relation to the lack of qualitative procedures used so far:

One may feel that the linguistic substance of individual errors obtained in an error analysis has more bite than reductionist “number-crunching” and that consequently this study has overreached itself by limiting itself to something as insubstantial as a statistical investigation. One might want to see additional analyses of a qualitative nature of the proficiency tests, especially of the integrative tests, where writing output is involved. Such a desire is understandable because scores by themselves don’t illuminate the linguistic substance behind the numbers owing to the fact that similar scores between raters do not necessarily mean similar judgements, and different scores between raters do not necessarily mean different judgements.

Error analysis can be useful because it provides information on the progress made towards the goal of mastery and provides insights into how languages are learnt and the strategies learners employ. Concerning learning strategies, the making of errors is part of the learning process. (An error analysis need not involve a “linguistic” analysis. For example, in an error analysis of writing one could look for cohesion errors, but if one were to examine the noun-to-verb ratio in individual protocols, this would not be an error analysis but a linguistic analysis).

This study is mainly concerned with norm-referenced testing. So, to include a linguistic/error analysis of the tests would, besides being far too long and ambitious a project, go beyond the objectives of this study. The problem would be which tests to use in such an analysis, and how long such an analysis should be. Naturally, qualitative analysis is very important, but in the empirical part of the study I focus on quantitative data. As far as qualitative data are concerned, what is relevant to this study is examining the problems of error analysis as they relate to rater reliability. As mentioned a few paragraphs earlier, I shall be using a detailed concrete example later on in this section to examine this problem. But first some theory.

Often mother-tongue proficiency is advocated as an absolute yardstick of language proficiency, but, as Bachman and Clark point out, “native speakers show considerable variation in proficiency, particularly with regard to abilities such as cohesion, discourse organisation, and sociolinguistic appropriateness.” As a result, theoretical differences between testers can affect the reliability of the test. Raters who know the language well, and even mother-tongue speakers, can differ radically in their assessments of such pragmatic tasks as essay tasks. That is why different raters’ scores on a particular protocol are often incommensurate with their judgements. Owing to these problems it is virtually impossible to define criterion levels of language proficiency in terms of actual individuals or actual performance. Bachman and Clark suggest that such levels must be defined abstractly in terms of the relative presence or absence of the abilities that constitute the domain. But again this doesn’t solve the problem, because the difficulty is how to apply the definition to concrete situations of language behaviour.

Another problem is the representativeness of specific errors. In previous research I did an error analysis of Tswana speakers’ English but did not establish statistically whether the errors I was dealing with were common errors, e.g. *cattles (a plural count noun in Tswana, “dikgomo”) and *advices (a plural count noun in Tswana, “dikgakoloko”). Under such circumstances one can be duped into believing that errors are common if one comes across them a few times, which may only create the feeling that they are common. Error analysis under such circumstances could indeed become merely an idiosyncratic – and mildly interesting – “stamp collection”.

Another example: Bonheim, coordinator of the Association of Language Testing in Europe, gives an example of a test taker who had done very well on a multiple-choice test, but in one of his/her few incorrect items of the test had circled an option that was an unlikely answer. Bonheim suggested that one should try and find out why this highly proficient learner had circled this option. This idiosyncratic example surely cannot contribute anything to the general principles of error analysis, i.e. tell us whether the error is common enough to warrant a time-consuming investigation. In proficiency testing, one is not looking for idiosyncratic errors but for general errors. In diagnostic testing, of course, the situation is quite different because one focuses on both individual and general errors, because the main aim of a diagnostic test is therapy, not finding out the level of a person’s present ability, which is what proficiency tests are about.

Obviously, the different types of tests, e.g. proficiency, diagnostic, aptitude and achievement, are related, but it is important to keep their main purposes distinct: otherwise there would be no point in creating these distinctive categories. For example, an itemised analysis can reveal the relative strengths and weaknesses of the different groups in different parts of the language. Such an analysis can be used as a diagnostic tool at the beginning or end of a specific course of instruction, or, in this case, as a measurement of specific points of proficiency. However, without quantitative procedures, the data one gathers remain unconvincing. For example, consider the percentage error on individual items of the error recognition test for the Non-Tswana L1 sub-group (NTL1) and the L2 group in Table 4.8:

TABLE 4.8  Error Recognition (Identification) Test: Percentage Error

Care must be taken in the interpretation of Table 4.8. The higher the scores, i.e. the higher the percentage error, the more difficult the item. The information in Table 4.8 reveals the similarities and differences between groups on each item. For example, in item 8, the difference between the NTL1 and the L2 group is not substantial. Item 8 is given below.

Item 8. Both Samuel and I (A) / are much more richer (B) / than we (C) / used to be (D).

Correct answer: B

In item 19 the NTL1 group does substantially better than the L2 group.

Item 19. Some believe that (A) / a country should be ruled (B) / by men who are (C) / too clever than ordinary people (D).

Correct answer: D

ESL learners often confuse intensifier forms such as “too clever”, “very clever” and “so clever” and comparative forms such as “more beautiful” and “cleverer.” The error in Item 19 is a double confusion between intensifier and comparative forms probably caused by false generalisation, or false analogy, from the English forms.

A quantitative analysis of errors was also found useful in identifying the “replacement language” subjects, which helps in establishing levels of proficiency between learners. Recall (Note 2, Chapter 3) that a “replacement language” is a language that becomes more dominant than the mother tongue, usually at an early age, but is seldom fully mastered, as in the case of some of the Coloured and Indian subjects in the sample, who belong to the Non-Tswana L1 (NTL1) sub-group. (Bantu speakers, of course, can also have replacement languages).

The “replacement language” subjects could be identified, to a certain extent, by the very low scores they obtained in the tests. An examination of particular errors made by those I suspected of being “replacement language” subjects increased the accuracy of the identification of these subjects. (These were Indians and “coloureds” who had been using English as a medium of instruction from the beginning of primary school).

Mother-tongue speakers do make, or fail to recognise, several kinds of grammatical error. Consider the percentage error of the NTL1 group on items 8 and 19 given above – 75% and 30% respectively. In item 8, it is possible that 12-year-old English-mother-tongue speakers would not be able to recognise that segment B (“are much more richer”) is an error. In item 19, it is more likely that 12-year-old English-mother-tongue speakers would recognise segment D (“too clever than ordinary people”) as an error. But there is still a slight possibility that an English-mother-tongue speaker would not recognise the error. There are, however, certain errors that all English-mother-tongue speakers would recognise. Thus if such a mistake were to be made by subjects that one suspected of being “replacement language” subjects, one would be almost certain that they indeed were. For example, consider the percentage error of item 27 of the error recognition test.

Item 27. As I have now studied (A) / French for over three years (B) / I can be able to (C) / make myself understood when I go to France (D).

Correct answer: C (Recall that by “correct” is meant here identifying/recognising the error.)

Percentage Error of Item 27

NTL1 (N=20): 40%
L2 (N=38): 93%

In item 27 segment C (“can be able”) is a notorious error among South African black ESL users. The L2 group had a percentage error of 93. What is interesting is that 40% of the NTL1 group got this item wrong. It is highly likely that an English-mother-tongue speaker would recognise this error. Such an example is good, if not absolute, evidence that those in the NTL1 group who didn’t recognise this error were “replacement language” subjects.

4.8.1 Rater reliability among educators of teachers of English

This section is published as an article: “Rater reliability in language assessment: the bug of all bears.” System, 28, 1-23. If you request, I shall scan this section from the original thesis and post it in wordpress.

4.9 Summary of Chapter 4

The statistical results were reported. High correlations were found between the tests and there was a substantial difference between the L1 and L2 groups. Reasons were given for not treating the L1 and L2 groups as separate populations in the correlational analysis. The dangers of comparing tests were also discussed.

Singled out for special attention was interrater reliability. The lack of interrater reliability is arguably the greatest problem in assessment because it is often the cause, though indirectly, of student failure – and success! It is on the issue of interrater reliability that matters of validity and reliability come to a head, because it brings together in a poignant, and often humbling, way what is being (mis)measured, and how it is (mis)measured. The next chapter deals with the battery of proficiency tests as predictors of academic achievement.

Chapter 3: Ph.D. – Sampling, and Structure and Administration of the English Proficiency Tests

3.1 Introduction

3.2 Sampling procedures for the selection of subjects

3.2.1 The two main groups of the sample: First Language (L1) and second Language (L2) groups

3.3 Structure and administration of the English proficiency tests

3.3.1 The cloze tests

3.3.1.1 Theoretical overview

3.3.1.2 The cloze tests used in the study

3.3.2 The essay tests

3.3.2.1 Theoretical overview

3.3.2.2 The essay tests used in the study

3.3.3  Error recognition and mixed grammar tests

3.3.3.1 Theoretical overview

3.3.3.2 Error recognition and mixed grammar tests used in the study

3.3.4  The Dictation test

3.3.4.1 Introduction

3.3.4.2 Theoretical overview

3.3.4.3 The dictation tests used in the study

3.3.4.4 Presentation of the dictation tests

3.3.4.5 Method of scoring of the dictation tests

3.4 Summary of Chapter 3

3.1 Introduction

This chapter describes the sample of subjects and sampling procedures (of subjects and tests) and provides a detailed theoretical overview and description of the battery of English proficiency tests.

The sample of subjects consists of two main groups: First Language (L1) and Second Language (L2). A major issue in this study is the distinction between L1 and L2 levels of language proficiency. This distinction has become controversial in South Africa, where more and more applied linguists and educationists argue that the L1-L2 distinction should be jettisoned. (I discuss specific authors on this issue in Chapter 6). I use the labels L1 and L2 slightly differently from what is normally meant by these labels. The way these labels are used is explained shortly.

A literature review is provided for each of the test methods used in which an overview of the relevant theoretical issues is provided. After each literature review follows a detailed description of the structure and administration of the specific tests used.

3.2 Sampling procedures for the selection of subjects

The sampling procedures form a crucial part of the method rationale and are described in detail. The crucial issue is how to classify the subjects into different levels of proficiency.

There were 90 entrants to Grade 7 in January 1987 who also sat the Grade 7 end-of-year examinations. Because the battery of English proficiency tests was administered during the first week of the school year, there was some absenteeism during the three-day testing period. Thus, not all of the learners did all of the tests, and four learners did not do any of the tests. These four learners were not included in the sample. The other 86 learners (44 boys, 42 girls) comprise the sample of subjects.

3.2.1 The two main groups of the sample: First language (L1) and second language (L2) groups

Figures 3.1 and 3.2 provide a clear picture of the details of the L1 and L2 groups. The reader may want to consult these figures in conjunction with the verbal descriptions of the sample below.

At the school there were mother-tongue speakers from diverse linguistic backgrounds, e.g. Tswana, Sotho, English, Afrikaans and some expatriates, e.g. Greek and Filipino. (The exact numbers are provided in Table 3.1, which we shall come to later on). About two thirds were Tswana mother-tongue speakers. All learners had to take English as the medium of instruction at the School.

The Tswana speakers could choose from the following language subject combinations:

Tswana as a First Language and English as a Second Language. (After 1987 Afrikaans was also offered as a first language).

Tswana as a First Language and English as a First Language.

English as a First Language and Afrikaans as a Second Language. Tswana speakers never took this option.

The English and Afrikaans speakers and speakers of other languages (expatriates and those using English as a “replacement” language) could choose from the following language subject combination:

English as a First Language and Afrikaans as a Second Language.

English as a First Language and French as a Second Language. This combination was taken by the expatriates, because they had not studied Afrikaans in primary school as South Africans had done. The “replacement” learners took Afrikaans as a second language.

All the L2 learners were Bantu mother-tongue speakers, most of whom were Tswana speakers. The L1 learners were a mixture of English mother-tongue speakers, Tswana mother-tongue speakers and mother-tongue speakers of other languages. The latter consisted of (i) expatriates from other countries and (ii) South Africans who speak other South African non-Bantu languages such as Afrikaans and Gujarati. It was not always certain who among the L1 group (i.e. those who took English as a First Language at the School) were mother-tongue speakers of English, because some of them identified themselves as mother-tongue speakers of English and/or another language, e.g. Afrikaans, Sotho. However, there was little doubt that many in (ii) were using English as a “replacement” language. (I show why this is so at the end of section 4.8). It is, of course, possible to have more than one “mother” (or “father”, or “native”) tongue.

I say something briefly about the notion of mother tongue and relate it to the notions of native language and replacement language. (These notions are discussed in more detail in section 6.2). The notion of native speaker is not a simple one. According to Paikeday, “the native speaker is dead!” Indeed, it is difficult to identify a native speaker, who is someone who should apparently have a more thorough knowledge of the language than the non-native speaker. When one adds the notions of mother-tongue speaker, first-language speaker and second/additional-language speaker to the pot, it becomes difficult to see the wood for the trees. I discuss this issue in greater depth in section 6.2.

The fact that the Tswana learners had a choice between taking English as a First Language or as a Second Language, and that the “replacement language” learners had to take English First Language, becomes important in decisions about how to categorise the different levels of proficiency. I explain the problem later on. (A detailed analysis of the sample follows shortly).

All language subjects were taught in separate classes. With regard to the other subjects, L1 and L2 learners were taught through the medium of English in mixed groups in different classrooms, where each class contained about an equal proportion of L1 and L2 learners. For example, Grade 7 was divided into four classes. Each class bore the initial of the surname of the relevant class teacher. This four-class arrangement was maintained for the administration of the tests.

It was general practice at MHS that on entrance to the School, learners decided themselves whether they preferred to take English as a First Language subject or as a Second Language subject. As far as the battery of English proficiency tests was concerned, the subjects in the sample were requested to indicate on their protocols whether they had decided to take English First Language as a subject or English Second Language as a subject.

The results of the tests were not subsequently used by the School administration to make any changes to the choices the entrants had made regarding the English group (L1 or L2) they wanted to belong to. There were several possible reasons for this:

– The School did not wish to force the labels of “English second language” (L2) onto learners.

– Limited English proficiency learners might benefit in a class of high English proficiency learners, because the former might benefit from listening to a higher standard of English than their own.

– The School might not have been sure of the actual level of English proficiency of each individual entrant, even though it was aware that the level of English proficiency of disadvantaged entrants was generally low. Once these entrants had become part of the School, it would have been possible to make more accurate judgements of their English proficiency.

– Finally, the School might have been reluctant to use the results until I had produced solid evidence that these tests were valid predictors of academic achievement.

It was not a simple matter to decide how to classify the subjects. The following variables had to be taken into account (the descriptions are specific to the sample):

(1) Some were or said they were English mother-tongue speakers, while others were mother-tongue speakers of Tswana and other languages.

(2) Some had English as the medium of instruction from Grade 1 (Bantu speakers and non-Bantu speakers), while some had English as the medium of instruction from Grade 5 (only Bantu speakers).

(3) All had the freedom to choose at the beginning of Grade 7 whether they wanted to take English First Language or English Second Language.

The problem is whether one can make a clear separation between these subjects that would indicate a difference in levels of English proficiency. An obvious division in theory is mother-tongue speaker/non-mother-tongue speaker, where mother-tongue English proficiency was regarded as the level of English to aspire to. Thus, when the essays for this study were marked they were judged in terms of mother-tongue proficiency, and so non-mother–tongue English speakers’ essays were not marked more leniently than those of mother-tongue English speakers. There were difficulties in deciding on the norms for the other tests, which were all previously standardised published tests, because it is only after the test has been performed on the test-bench that it is possible to decide whether the test is too easy or too difficult. If there are mother-tongue speakers and non-mother-tongue speakers in the same sample, as in this study, one needs to consider whether the norms of the two kinds of speakers should be separated or interlinked. One can only do this if subjects have been precisely classified into mother-tongue/non-mother-tongue groups. This was not a simple matter in the sample for the following reasons:

In the truly multicultural setting of MHS a composite of the following cultural-ethnic groups said that they were English mother-tongue speakers: Ghanaian, Sri Lankan, Indian (South African and expatriate) and Coloured. There was also a Greek, a South Sotho, and a Filipino who said that they had two mother tongues, one of them being English. Although all the above (N=18) obtained an English proficiency test score (in this case a composite of the cloze, dictation and essay test) of 60% and over, there were also quite a number of Bantu mother-tongue speakers (N=10, mostly Tswana) who also obtained a score of 60% and over. Further, there were five subjects who said that they were mother-tongue speakers but obtained scores between 50% and 55%. As I shall show later, such a score is not a good score for somebody claiming to be an English mother-tongue speaker. In the light of this evidence, it was difficult to tell from the results of the proficiency tests who were mother-tongue or native-speakers of English. Although it was difficult in the sample to pinpoint mother-tongue English speakers, this does not mean that the notion of “native-speaker” is a figment. (More about this in section 6.2). All I could specify about the sample in this regard was that it consisted of a wide range of English proficiency. (Recall that the labels L1 and L2 in the sample simply refer to those taking English First or Second Language as a subject, respectively).

When the English proficiency test scores were examined without resorting to summary statistics but merely sorted in ascending order, there was a clear distinction between the subjects who had chosen to do English as a First Language and English as a Second Language, respectively. These are called the L1 and L2 groups, respectively. In terms of the test results this meant that the L1 group on average were substantially better than the L2 group. (Six of the subjects who had decided to change from L1 to L2 in Grade 8 had obtained over 60% on the English proficiency test).

Classifications of human groups, especially in the human sciences, are not expected to be objectively “precise”; what matters is that the parameters of the classifications be delineated and used consistently. In this study I use the labels “L1” and “L2” to refer to “high proficiency” and “low proficiency” indicated by English First Language and English Second Language, respectively.

Also important is English as a medium of teaching and learning. The tests in this study aimed “to select lexical and structural items relevant to the demands of the appropriate syllabuses”, i.e. relevant to English as the medium of instruction. What these tests aim at testing is directly related to the task-demands of academic study.

At MHS L1 learners and L2 learners are not treated as two separate groups, because both groups are taught in the same classroom – where English is used as the medium of instruction – and write the same examinations except for the language subjects, i.e. English, Tswana, Afrikaans and French. This has important implications for the interpretation of the results, where it might be argued that the L1 and L2 groups belong to separate populations and therefore cannot be grouped together in a correlational analysis. (See section 4.6 for further discussion of this issue). Tables 3.1 and 3.2 provide a detailed analysis of the sample.

The subjects originated from 36 different schools. The L1 subjects (N=49) originated from (i) CM Primary School (N=37), (ii) a “white” school (N=1), (iii) a “coloured” school (N=4), and (iv) several DET schools (N=7). (One Sri Lankan came from a DET school where his mother was a teacher).

The 37 L2 subjects originated from 28 DET schools, three church schools, a Coloured school and an Indian school.

Of the total sample of 86 subjects, there were 60 South African blacks, of which 52 were Tswanas and eight were non-Tswanas. These eight non-Tswana South African blacks, like all the Tswana subjects, had to take Tswana as a first language.

TABLE 3.1 Detailed Analysis of the L1 Subjects


There were 10 TL1 subjects who changed from English First Language as a subject to English Second Language as a subject in Grade 8 (January 1988). I did not have any information on why these changes were made. One plausible reason could have been that the School recommended to the learners concerned that it was in their best interest to change, because changing to English Second Language at a later stage might have given them a better chance of passing English. Another plausible reason is that the learners decided themselves to make the change, because it was not necessary to take English First Language when they already had Tswana as a First Language. Whatever the reason for the change to L2, the fact is that other members of the TL1 group who had obtained Grade 7 English achievement scores in the same range as those who changed to English Second Language did not change. For example, the Grade 7 English scores of the 10 L1 subjects who changed to L2 in Grade 8 were (in ascending order, in percentages) 50, 51, 53, 53, 55, 55, 58, 58, 61 and 63. (Most of these also had English proficiency scores in the 55% to 70% range). The Grade 7 English scores of five L1 subjects who did not change to L2 in Grade 8 were 45, 52, 53, 59 and 62. As a matter of interest, in the first set, eight out of 10 obtained a matriculation exemption, while in the second set, two left the school after passing a grade, two obtained a matriculation exemption, and one failed before reaching Grade 12 and left the school.

The “replacement language” learners were required to take English First Language because they had no other first language, while the Tswanas in the TL1 group could take Tswana as a First Language. The initial choice of language group (L1 or L2) at the beginning of Grade 7, as pointed out earlier, was voluntary.

Most of the L2 subjects were mother-tongue speakers of Tswana who came from rural or peri-urban schools where English was used on a limited scale in the classroom, hardly at all in the playground, and not at all at home, which was probably the reason for the low level of English proficiency of most of them. A few L2 subjects, however, did have a high level of English proficiency. There were no L2 subjects from CM Primary School, because all learners at this school had English as the medium of instruction from Grade 1. L2 subjects, even if they did very well in English Second Language, did not change to English First Language, probably because there was no need to complicate their lives unnecessarily. Five L2 subjects obtained over 70% in Grade 7 English achievement. Four of these five passed Grade 12 and three of them obtained a matriculation exemption. A detailed analysis of the L2 group follows.

TABLE 3.2. Detailed Analysis of the L2 Subjects

As shown in Table 3.2, four of the black L2 subjects were non-Tswanas. These took Tswana as a first language at the School.

In sum, there are two groups in the sample: the L1 group (a composite of the Tswana L1 and the Non-Tswana L1 groups) and the L2 group. The L2 group is also referred to as the TL2 (Tswana L2) group, because 33 of the 37 L2 subjects are mother-tongue speakers of Tswana.

3.3 Structure and administration of the English proficiency tests

Subjects were divided into four groups and the tests were administered in four classrooms by four Grade 7 teachers, where each classroom contained a combination of L1 and L2 subjects. The time allocated for each test will be indicated in the description of the administration of the individual tests.

The possibility exists that fatigue resulting from the three-day test period may have affected the results of all the tests, but this seems unlikely because subjects were released from all lessons and from all school activities during this period. Also, the test sessions were interspersed with ample rest periods. The structure and administration of the English proficiency tests follows. A theoretical review precedes the description of each of the tests. Before I describe the tests I would like to point out that no sample of tests can adequately represent the vast variability of language, nor does it have to, “because of the generative nature of language which acts as its own creative source”. The controversy, as far as general language proficiency is concerned, is which sample of tests to use: the “reductionist” kind of tests used in this study or “holistic” (formal and informal) outside-of-the-classroom “cocktail party”, “tea-party” and “cooking club” type tests. In the academic context, what is important is the relationship between general, or overall, proficiency, communicative competence and academic achievement.

3.3.1 The cloze tests

3.3.1.1 Theoretical overview

Cloze tests are deceptively simple devices that have been constructed in so many ways for so many purposes that an overview of the entire scope of the literature on the subject is challenging to the imagination not to mention the memory.

(Oller, 1973:106)

Since 1973 the literature on cloze has more than doubled, adding even more challenges to the imagination if not – thanks to the printed word – to the memory.

The aim of a cloze test is to evaluate (1) readability and (2) reading comprehension. The origin of the cloze procedure is attributed to Taylor (1953), who used it as a tool for testing readability. Of all the formulas of readability that have been devised, cloze tests have been shown, according to Geyer (1968), Weintraub (1968) and Oller (1973:106), to be the best indicators of readability. It is also regarded as a valid test of reading comprehension. Oller (1973:106) cites Bormuth (1969:265) who found a multiple correlation coefficient of .93 between cloze tests and other linguistic variables that Bormuth used to assess the difficulty of several prose passages. Bormuth (1969:265) maintains that cloze tests “measure skills closely related or identical to those measured by conventional multiple choice reading comprehension tests.”

Many standardised reading tests use cloze tests, e.g. the Stanford Proficiency Reading Test. Johnson and Kin-Lin (1981:282) believe that cloze is more efficient and reliable than reading comprehension, because it is easier to evaluate and does not, as in many reading comprehension tests, depend on long written answers for evaluation. (But it is also possible to use multiple choice reading tests; see Bormuth [1969] in the previous paragraph). Johnson and Kin-Lin’s implication is that although cloze and reading comprehension are different methods of testing, they both tap reading processes. Anderson (1976:1), however, maintains that as there is no consensus on what reading tests actually measure, all that can be said about a reading test is that it measures reading ability. On the contrary, far more can be said about reading: notions associated with reading are “redundancy utilization” (Weaver & Kingston, 1963), “expectancies about syntax and semantics” (Goodman, 1969:82) and “grammar of expectancy” (Oller, 1973:113). All these terms connote a similar process. This process involves the “pragmatic mapping” of linguistic structures into extralinguistic context (Oller, 1979:61). This mapping ability subsumes global comprehension of a passage, inferential ability, perception of causal relationships and deducing meaning of words from contexts (Schank, 1982:61). According to Bachman (1982:61),

[t]here is now a considerable body of research providing sound evidence for the predictive validity of cloze test scores. Cloze tests have been found to be highly correlated with virtually every other type of language test, and with tests of nearly every language skill or component.

Clarke (1983), in support of Bachman, is cautiously optimistic that the cloze procedure has a good future in reading research. Alderson (1979:225), who is less optimistic, maintains that

individual cloze tests vary greatly as measures of EFL proficiency. Insofar as it is possible to generalise, however, the results show that cloze in general relates more to tests of grammar and vocabulary than to tests of reading comprehension.

Hughes (1981), Porter (1978) and Alderson (1979) found that individual cloze tests produce different results. Johnson and Kin-Lin (1981) and Oller (1979), contrary to Alderson (1979), found that a great variety of cloze tests correlates highly with tests such as dictation tests, essay tests and reading tests, as well as with “low order” grammar tests.

The concept of closure is important in cloze theory. Alderson (1979:225) asks

whether the cloze is capable of measuring higher-order skills. The finding in Alderson (1978) that cloze seems to be based on a small amount of context, on average, suggests that the cloze is sentence – or indeed clause – bound, in which case one would expect a cloze test to be capable, of measuring, not higher-order skills, but rather much low-order skills…as a test, the cloze is largely confined to the immediate environment of a blank.

This means that there is no evidence that increases in context make it easier to complete items successfully. Oller (1976:354)  maintains, contrary to Alderson, that subjects “scored higher on cloze items embedded in longer contexts than on the same items embedded in shorter segments of prose”. Oller used five different cloze passages and obtained similar results on all of them.

Closure does not merely mean filling in items in a cloze test, but filling them in a way that reveals sensitivity to intersentential context, which measures “higher-order skills” (Alderson above). A cloze test that lacks sufficient closure would not be regarded as a good cloze test.

There are two basic methods of deletion: fixed deletion and rational (or “selective”) deletion. In the former, every nth word is deleted, where n may range from every fifth word (considered to be the smallest gap permissible without making the recognition of context too difficult) to every ninth word. Pienaar’s (1984) tests, which are used in this study, are based on a rational deletion procedure.
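
The fixed-deletion procedure is mechanical enough to be expressed in a few lines of code. The sketch below (Python, illustrative only; the sample sentence is taken from the cloze passage reproduced later in this section, and the deletion rate of n = 7 is arbitrary) shows how an every-nth-word cloze can be constructed; it is not the rational-deletion procedure that Pienaar’s tests use.

```python
# Minimal sketch of the fixed-deletion ("every nth word") cloze procedure.
# Illustrative only: Pienaar's tests use rational (selective) deletion.

def fixed_deletion_cloze(text, n=7):
    """Replace every nth word with a numbered blank; return the mutilated text
    and the answer key."""
    words = text.split()
    answers = {}
    for blank_no, i in enumerate(range(n - 1, len(words), n), start=1):
        answers[blank_no] = words[i]
        words[i] = f"__{blank_no}__"
    return " ".join(words), answers

passage = ("Tabitha was a well-bred Siamese lady who lived with a good family "
           "in a shiny white house on a hill overlooking the rest of the town.")
mutilated, key = fixed_deletion_cloze(passage, n=7)
print(mutilated)
print(key)   # {1: 'who', 2: 'a', 3: 'overlooking'}
```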

Alderson (1980:59-60) proposes that the rational deletion procedure should not be referred to as a “cloze” but as a “gap-filling” procedure. Such a proposal has been accepted by some researchers, e.g. Weir (1993:81), but not accepted by others, e.g. Bachman’s (1985) “Performance on cloze tests with fixed-ratio and rational deletions”, Maclean’s (1984) “Using rational cloze for diagnostic testing in L1 and L2 reading” and Markham’s (1985) “The rational deletion cloze and global comprehension in German.” There is nothing wrong with the proposal that the rational-deletion procedure be called a gap-filling test, if it remains nothing more than that – a proposal.

Alderson (1979:226) suggests that what he calls cloze tests (namely, every nth word deletion) should be abandoned in favour of “the rational selection of deletions, based upon a theory of the nature of language and language processing”. Thus, although Alderson proposes that the rational selection of items, his “gap-filling”, should not be called a cloze procedure, he still favours “gap-filling” tests over “cloze” tests. This superiority of the rational deletion method is supported by a substantial body of research, e.g. Bachman (1982) and Clarke (1979). However, it should be kept in mind that language hangs together, and thus the every-nth-word cloze test is also, in my view, a good test of global proficiency. In other words, whether one uses “fixed” deletions or “rational” deletions, both methods test global proficiency.

Having considered the arguments for the validity of the cloze test as a test of reading, it seems that cloze tests are valid tests of reading strategies, i.e. they can test long-range contextual constraints. One must keep in mind, however, that deletion rates, ways of scoring (e.g. acceptable words or exact words) and the types of passages chosen, in terms of background knowledge and of discourse devices, may influence the way reading strategies are manifested. But it is debatable whether one should make too much of these differences.

3.3.1.2 The cloze tests used in the study

In a review of Pienaar’s pilot survey called “Reading for meaning”, Johanson refers to the “shocking” low reading levels in many schools in the North West Province revealed by Pienaar’s survey.

Pienaar tested a variety of learners from different schools: learners whose (1) first language (i.e. mother tongue or the language the learner knows best) was English, (2) replacement language was English, (3) first language was Afrikaans, (4) first language was a Bantu language, mostly Tswana. Categories (1), (2) and (3) came from upper middle class families, while category (4) was split into two sub-categories: (4a) Bantu speakers who generally came from working class families and (4b) Bantu speakers who lived in sub-economic settlements in the environs of Mmabatho. Many of the parents of (4b) were illiterate or semi-literate and were either unemployed or semi-employed. (The category labels used by Pienaar are different from mine: I have changed them for clarity’s sake). Pienaar’s major finding was that 95% of learners in the North West Province (Grade 3 to Grade 12), most of whom belonged to category (4b), were “at risk”, i.e. they couldn’t cope with the academic reading demands made on them.

There are four reasons why Pienaar’s cloze tests are used in this study:

(1) they have already been used in many schools in the North West Province and have produced a solid body of results, (2) their purpose is “to select lexical and structural items relevant to the demands of the appropriate syllabuses”, (3) Pienaar’s (1984) cloze tests are based on a rational deletion method, where it is possible to select gaps in such a way that closure, i.e. long-range constraints, is ensured, and (4) Pienaar’s data will be compared with the data in this study.

Pienaar’s (1984) tests comprise five graded levels – “Steps” 1 to 5, where each level consists of four short cloze passages (Form A to Form D) with 10 blanks in each passage (Pienaar 1984:41):

Step 1 corresponds to Grades 3 and 4 for English first language and to Grades 5 to 7 for English second language.

Step 2 corresponds to Grades 5 and 6 for first language and Grades 7 to 9 for second language.

Step 3 corresponds to Grades 7 and 8 for first language and to Grades 9 to 11 for second language.

Step 4 corresponds to Grades 9 and 10 for first language and to Grades 11 and 12 for second language.

Step 5 corresponds to Grades 11 and 12 for first language and to Grades 12 + for second language.

If one Step proves too easy or too difficult for a specific pupil, a higher or a lower Step could be administered. For example, if Step 2 is too difficult, the pupil can be tested on Step 1. In this way it is possible to establish the level of English proficiency for each individual pupil.

Because (1) many of the L1 group were not mother-tongue speakers of English and (2) I had to give the same test to both the L1 and L2 groups in order to make reliable comparisons, I used Step 2 for both the L1 group and the L2 group.

I did not use the other Steps because they were irrelevant to the purpose of this study, which was not to place learners in the level they belong to, i.e. for teaching purposes (which was the purpose of Pienaar’s tests), but to test for proficiency at the Grade 7 level, and in the process to test the tests themselves, i.e. to examine whether the passages chosen for the Step 2 level were valid for that level. Although annual predictions between English proficiency and academic achievement might yield higher correlations than long-term predictions, the aim is to investigate what chance Grade 7 learners who entered the School in 1987 would have of passing Grade 12.

According to Pienaar, a perfect score on a cloze test indicates that the pupil has fully mastered that particular level. A score below 50% would indicate that the learner is at risk.
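
Pienaar’s placement logic (administer a lower Step if the current one proves too difficult, a higher Step if it is fully mastered) can be summarised in a short sketch. This is my reading of the procedure, using the thresholds just mentioned (a perfect score = mastery; below 50% = at risk); the function and its return values are illustrative, not Pienaar’s.

```python
# Minimal sketch of the placement logic described above; thresholds follow the
# text (100% = mastery of the Step, below 50% = "at risk"). Illustrative only.

def next_step(step, score_pct):
    """Suggest which Step (1-5) to administer next, given a score on this Step."""
    if score_pct == 100:
        return min(step + 1, 5), "Step fully mastered"
    if score_pct < 50:
        return max(step - 1, 1), "at risk at this Step"
    return step, "partial mastery of this Step"

print(next_step(2, 40))    # (1, 'at risk at this Step')
print(next_step(2, 100))   # (3, 'Step fully mastered')
print(next_step(2, 70))    # (2, 'partial mastery of this Step')
```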

Pienaar maintains that English second language learners are generally two to three years behind English first language learners in the acquisition of English proficiency, and there is often also a greater age range in the English second language classes, especially in the rural areas. Pienaar’s implication is that to be behind in English language proficiency is also to be behind in academic performance.

Pienaar standardised his tests in 1982 on 1068 final year JSTC (Junior Secondary Teacher’s Certificate) and PTC (Primary Teacher’s Certificate) students from nine colleges affiliated to the University of the Transkei. These standardised results became the table of norms for Pienaar’s tests (Pienaar 1984:9). Below are the weighted mean scores achieved by the students of the nine colleges (Pienaar 1984:10):

Weighted means: Step 1: 67%; Step 2: 53%; Step 3: 37%; Step 4: 31%; Step 5: 24%.

Most of the colleges performed similarly on all five Steps. These results confirmed the gradient of the difficulty of the various steps.

During 1983 Pienaar administered a major part of the test instrument to a smaller group of college students selected from the original large group. No significant difference between the scores of the two administrations was found, which confirmed the test-retest reliability of the instrument.

The tests underwent ongoing item analysis and refinement. By the time the final version was submitted to school learners in the Mmabatho/Mafikeng area in 1984, 30% of the items had been revised. As a result of continuous item analysis, a further 18% of the items were revised.

An important point is that these results claim to represent the reading ability of college students, who are supposed to be more proficient in English than school learners. However, final year student teachers only obtained scores of between 40% and 60% on Step 2 – see Pienaar’s mean scores above. (Step 2 has been used in this study for Grade 7 learners). These low scores indicate that the reading level of the student teachers, who were to start teaching the following year, was probably no higher than the level of many of the learners they would eventually teach. This alarming state of affairs would probably have had a detrimental effect on the academic performance of these learners.

Pienaar’s original tests had four passages for each level, and he tried to establish the equivalence in difficulty between the passages for each level. In this study I used two cloze passages for Step 2 instead of four. This was done for two reasons: (1) to see whether only two passages were sufficient, and (2) the other two passages, in their “unmutilated” form, were used for the dictation tests, because they belonged to the same level, namely Step 2. The question is whether two passages – in the cloze and dictation tests – were enough to ensure reliability and validity. The results of the tests (Chapter 4) deal with this question.

In the test battery I used Pienaar’s Form B and Form D passages of Step 2, as shown below:

Pienaar’s Practice exercise

(Pienaar does not provide the answers for this practice exercise. Possible answers are provided in brackets).

The 1 (rain) started falling from the sagging black 2 (clouds) towards evening. Soon it was falling in torrents. People driving home from work had to switch their 3 (headlights) on. Even then the 4 (cars, traffic) had to crawl through the lashing rain, while the lightning flashed and the 5 (thunder) roared.

Cloze passage 1: Form B Step 2 (Pienaar 1984:59):

A cat called Tabitha

Tabitha was a well-bred Siamese lady who lived with a good family in a shiny white house on a hill overlooking the rest of the town. There were three children in the family, and they all loved Tabitha as much 1 she loved them. Each night she curled up contentedly on the eldest girl’s eiderdown, where she stayed until morning. She had the best food a cat could possibly have: fish, raw red mince, and steak. Then, when she was thirsty, and because she was a proper Siamese and did 2 like milk, she lapped water from a blue china saucer.

Sometimes her mistress put her on a Cat show, and there she would sit in her cage on 3 black padded paws like a queen, her face and tail neat and smooth, her black ears pointed forward and her blue 4 aglow.

It was on one of these cat shows that she showed her mettle. The Judge had taken her 5 of her cage to judge her when a large black puppy ran into the hall. All the cats were furious and snarled 6 spat from their cages. But Tabitha leapt out of the judge’s arms and, with arched 7 and fur erect, ran towards the enemy.

The puppy 8 his tail and prepared to play. Tabitha growled, then, with blue eyes flashing, she sprang onto the puppy’s nose. Her 9 were razor-sharp, and the puppy yelped, shook her off, and dashed for the door. Tabitha then stalked back down the row of cages to where she had 10 the judge. She sat down in front of him and started to preen her whiskers as if to say, “Wait a minute while I fix myself up again before you judge me.” She was quite a cat, was Tabitha!

Answers. (The words in round brackets are Pienaar’s suggested alternative answers. The words in square brackets are suggested alternative answers):

1. as; 2. not; 3. her [four, soft]; 4. eyes (eye); 5. out; 6. and; 7. back (body); 8. wagged, twitched (waved, lifted); 9. claws (nails); 10. left (seen, met).

Item 9 could also be “teeth”. (There are cats who jump on to faces and bite, rather than scratch). Even in easy cloze passages, “acceptable” answers can be a problem.
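
A minimal sketch of “acceptable word” scoring, assuming an answer key in the format above (the exact answer plus acceptable alternatives), is given below. The shortened key and the learner’s responses are illustrative only.

```python
# Minimal sketch of "acceptable word" cloze scoring. The key is a shortened,
# illustrative extract of the key above, with "teeth" added per the note.

acceptable_key = {
    1: {"as"},
    4: {"eyes", "eye"},
    9: {"claws", "nails", "teeth"},
}

def score_cloze(responses, key):
    """Count a response as correct if it matches any acceptable answer for the item."""
    return sum(1 for item, answer in responses.items()
               if answer.strip().lower() in key.get(item, set()))

learner_responses = {1: "as", 4: "eyes", 9: "teeth"}
print(score_cloze(learner_responses, acceptable_key))   # 3
```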

Cloze passage 2: Form D Step 2 (Pienaar 1984:61):

A dog of my own

When I was ten all 1 wanted was a dog of my own. I yearned for a fluffy, fat, brown and white collie puppy. We already had two old dogs, but my best friend’s pet collie had 2 had seven fluffy, fat, brown and white puppies, and I longed for one with all my heart. However, my mother said no, so the seven puppies were all sold. I had horses, mice, chickens and guinea-pigs, and as my 3 said, I loved them all, but I wasn’t so keen on finding them food. Since she had five children to look after, it made her angry to 4 hungry animals calling, so she said crossly, “No more dogs.”

This didn’t stop me wanting one though, and I drew pictures of collie dogs, giving 5 all names, and left them lying around where she would find them. As it was 6 Christmas, I was sure that she would relent and give me a puppy for Christmas.

On Christmas morning I woke up very excited, 7 the soft little sleepy bundle that I wanted at the bottom of the bed wasn’t there. My mother had given me a book instead. I was so disappointed that I cried to myself, yet I tried not to 8 her how sad I was. But of course she noticed.

Soon after that my father went off to visit his brother and when he came back he brought me a puppy. Although it 9 a collie it was podgy and fluffy, and I loved him at once. My mother saw that I looked after him properly and he grew up into a beautiful grey Alsatian. We were good friends for eleven happy 10 before he went to join his friends in the Animals’ Happy Hunting Ground.

Answers.

1. I; 2. just, recently; 3. mother (mummy, mum, mom); 4. hear; 5. them; 6. near (nearly, nearer, close to); 7. but, however (though); 8. show (tell); 9. wasn’t (was not); 10. years.

Pienaar allotted six minutes for each passage. I allotted 12 minutes. Ability is dependent on speed of processing, and so if a test-taker does badly with an allotted time of six minutes, perhaps the results would not be significantly better with more time. However, the effect of time on performance is not easy to gauge.

3.3.2 The essay tests

3.3.2.1 Theoretical overview

Language processing involves various components such as linguistic knowledge, content knowledge, organisation of ideas and cultural background. All these factors mesh together into a proficiency network of vast complexity, which makes objective evaluation of essay performance very difficult. It is this vast complexity that makes the written discourse, or essay writing, the most pragmatic of writing tasks and the main goal of formal education.

Essay writing is arguably the most complex of human tasks. If getting the better of words in writing is usually a very hard struggle for mother-tongue speakers, the difficulties are multiplied for second language learners, especially for disadvantaged learners such as those described in this study. Many of the disadvantaged Grade 7 subjects are similar to young mother-tongue speakers of English first learning to write, in that much mental energy is expended on attention to linguistic features rather than on content.

What makes essay writing a pragmatic task, in contradistinction to tasks at the sentence level, is that essay writing involves writing beyond the sentence level. This does not mean that non-pragmatic tasks are not integrative. As discussed in section 2.5, all language resides along a continuum of integrativeness, where pragmatic tasks are the most integrative.

Because the production of linguistic sequences in essay writing is not highly constrained, problems of reliability arise in essay scoring. (In this respect, essay tests have much in common with oral tests). One problem is that inferential judgements have to be converted into a score: “[h]ow can essays or other writing tasks be converted to numbers that will yield meaningful variance between learners?” (Oller, 1979). Oller argues that these inferential judgements should be based on intended meaning and not merely on correct structural forms. That is why in essay rating, raters should rewrite (in their minds, but preferably on paper) the intended meaning. Perhaps one can only have an absolutely objective scoring system with lower-order skills (Allen in Yeld). Oller, however, is not claiming that his scoring system is absolutely objective, but only that, as far as psychometric measurement goes, it is a sensible method for assessing an individual’s level within a group.

Whatever one’s paradigm, structural (“old”) or communicative (“new”), when one marks an essay one can only do so through its structure. The paradox of language is that the structure must “die” so that the meaning may live; yet, if structure is not preserved, language would not be able to mean.

In the normal teaching situation, marking is done by one rater, namely the teacher who teaches the subject. Sometimes if a test is a crucial one, for example an entrance test or an end-of-year examination, more than one rater, usually two, are used. In a research situation, the number of raters depends on the nature of the research and the availability and proficiency of raters. The raters used for the essay tests in this study were all mother-tongue speakers of English, and recognised as such by their colleagues. (In the dictation tests there were three English mother-tongue presenters and one non-mother-tongue presenter).

With regard to the level of English proficiency of raters, it does not follow that because a rater (or anybody else) is not a mother-tongue speaker (of English in this case) his or her English proficiency is necessarily lower than that of a mother-tongue speaker of English. In the academic context, there are many non-mother-tongue speakers of English who have a higher level of academic English proficiency (CALP) than mother-tongue speakers of English. A major reason for this is not linguistic, but that these non-mother-tongue speakers are more academically able, i.e. they have better problem-solving abilities and abilities for learning content.

Kaczmarek (1980) reports high correlations between essay raters, while Hartog and Rhodes (1936) and Pilliner (1968), contrary to Kaczmarek, found essay tests to have low interrater and low intrarater reliability. With regard to scoring procedures in essay testing, Mullen (1980:161) recommends the use of four “scales” (criteria) of writing proficiency: structure, organisation, quantity and vocabulary. According to Mullen (1980), a combination of all four scales is required to predict proficiency validly. A major issue in scoring is whether marks should be allocated separately to each of the criteria, i.e. whether one should use an analytic scoring procedure, or whether marks should be allocated globally. Global scoring usually refers to two ways of scoring: (1) “overall impressions” and (2) “major constituents of meaning”, which takes into account global errors, e.g. cohesion and coherence, but not local errors, e.g. grammar and spelling.

The following terms are used interchangeably in the literature: global rating, overall impressions, holistic scoring and global impressions (specific authors will be mentioned). The term holistic scoring is used by Perkins to refer to overall impressions, which takes into account global as well as local errors. With regard to the two ways of global scoring mentioned in the previous paragraph, it is possible that a rater’s “overall impressions” may include quick, yet thorough attention to major constituents of meaning as well as to local errors. In such a case the distinction within global scoring between “overall impressions” and “major constituents of meaning” would no longer be very useful.

With regard to the relative reliability of analytic and global scoring procedures, Kaczmarek and Ingram have shown that analytic scoring procedures are not more reliable than global scoring. According to Perkins, “[w]here there is commitment and time to do the work required to achieve reliability of judgement, holistic evaluation of writing remains the most valid and direct means of rank-ordering students by writing ability.”

Zughoul and Kambal (1983:100) also report “no significant difference between the two methods”, and Omaggio (1986:263) maintains that “holistic scoring has the highest validity” (reliability?).

According to Oller (1979:392), “judges always seem to be evaluating communicative effectiveness regardless whether they are trying to gauge ‘fluency’, ‘accentedness’, ‘nativeness’, ‘grammar’, ‘vocabulary’, ‘content’, ‘comprehension’, or whatever”. Even if one does this quickly, say a minute per page, the brain of the rater still has to consider the trees to get an overall idea of the wood. The greater the experience and competence of the rater, the more unconscious and quicker, but no less rational, is the judgement.

It is arguable whether judges always seem to be evaluating communicative effectiveness, as Oller believes. Although it seems reasonable that in essays one should be looking at the overall impact of a piece of writing (the whole), and that the only way to do this is to look at the various aspects of the writing such as those mentioned by Oller, I would question whether raters do regard communicative effectiveness as the overarching criterion. Unfortunately, I did not manage to obtain the judgements of the raters of the MHS essay protocols and so could not investigate this important issue. I did, however, at a later stage, use some of the MHS protocols with a different set of raters to verify whether Oller’s observation was reasonable. (More about this in section 4.8.1).

If global or holistic scoring is more effective or even as effective as analytic scoring, then for reasons of economy global scoring should be used. However, global scoring ideally requires at least three raters (Ingram 1968:96), who would presumably, and hopefully, balance one another out. The effectiveness of global scoring depends on factors such as availability, willingness and qualifications of raters.

The unavailability of raters is often a problem. In special circumstances such as proficiency tests used for purposes of admission or placement at the beginning of an academic year, it may be possible to obtain the help of three or four raters. However, in the normal testing situation during the school year, only one rater may be available, who is usually the teacher involved in teaching the subject. Hughes (1989:87) recommends four raters because this has been shown to be the best number. Four raters were used in this study.

To ensure high interrater reliability there should be only a narrow range of scores and judgements between raters. If three or four raters are considered necessary for reliability, a serious problem is what to do in the normal education situation, where at most two and usually only one rater is available. I shall discuss this issue in the final chapter, where “improving testing in South Africa” is dealt with.
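
One common way of checking whether raters fall within a narrow range of one another is to correlate their scores pairwise. The sketch below does this with the Pearson r; the rater names and scores are invented for illustration and are not the study’s data.

```python
# Minimal sketch: pairwise Pearson correlations between raters as a rough
# index of interrater agreement. Scores are invented for illustration.

from statistics import correlation  # Python 3.10+

raters = {
    "Rater A": [65, 50, 72, 40, 58, 80],
    "Rater B": [60, 55, 70, 45, 55, 78],
    "Rater C": [70, 48, 75, 38, 60, 82],
    "Rater D": [55, 52, 68, 42, 50, 75],
}

names = list(raters)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = correlation(raters[names[i]], raters[names[j]])
        print(f"{names[i]} vs {names[j]}: r = {r:.2f}")
```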

3.3.2.2 The essay tests used in the study

There were two essays:

Essay 1: Everybody in this world has been frightened at one time or another. Describe a time when you were frightened. Write between 80 and 100 words.

Essay 2: Do (a) or (b) or (c). Do only one of the following topics. Don’t forget to write the letter (a) or (b) or (c) next to the topic you choose. The topic you choose must not be shorter than 80 words and not longer than 100 words.

(a) Describe how a cup of tea is made.

(b) Describe how shoes are cleaned.

(c) Describe how a school book is covered.

Both the L1 and L2 group’s essays were judged in terms of English-mother- tongue proficiency.

The TOEFL Test of Written English (Hughes 1989:86) recommends spending a rapid one and a half minutes per page using a holistic scoring method. I would imagine that when working at such a speed, the scoring criteria are assumed to be known to the point of automaticity. Raters in this study were recommended to spend about one and a half minutes on each protocol, where protocols were much shorter than a page in length.

As I discussed earlier, clarity and consistency of judgements are difficult to ensure. The TOEFL scoring method seems to be the same as the “overall impressions” approach of Perkins (1983), which takes into account global as well as local errors.

Four raters, including myself, rated the 86 protocols. The other three raters were Grade 7 teachers at the School, the same teachers who had also participated in the administration of the test battery. As I mentioned earlier (section 3.3.2.1), these three raters were all mother-tongue speakers of English, and were also recognised as such by their colleagues, of whom I was one. Each rater, in turn, was given the original 86 protocols and was requested to give an impressionistic score based on such considerations as topic relevance, content and grammatical accuracy.

Raters did not provide any judgements. The reason was that these raters, who were also the Grade 7 teachers at the School, were fully involved in the three-day administration of the test battery; accordingly, I did not want to overload them with too much extra work after the three days were over, because they had to return to a full teaching load. Thus, they merely gave a score based on a global impression. It would have been useful to compare raters’ scores and judgements, because this would have provided insights into the knotty problem of interrater reliability. I did manage at a later stage to obtain data on the same essay test (given to the Grade 7 subjects in this study) from a workshop on language testing that I conducted (Gamaroff 1996c, 1998c). (More about this in section 4.8.1).

Bridgeman (1991:9) recommends that each rater assign a holistic score on a six-point scale, where zero is given if the essay is totally off the topic or unreadable. A nine-point scale was used in this study, ranging from Scale 1 (0 to 1 point = totally incomprehensible) to Scale 9 (9 to 10 points = outstanding). The points were converted to percentages.
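
The conversion of points to percentages is straightforward; a minimal sketch (assuming a maximum of 10 points, as implied by the scale above) is given below for completeness. The sample values are illustrative.

```python
# Minimal sketch: converting a rater's points (out of an assumed maximum of 10)
# to a percentage. Sample values are illustrative.

def points_to_percentage(points, maximum=10):
    return 100.0 * points / maximum

for pts in (1, 5.5, 9):
    print(f"{pts} points -> {points_to_percentage(pts):.0f}%")   # 10%, 55%, 90%
```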

Raters did not record their scores on the protocols but were each provided with a copy of the list of the names of subjects on which they had to record their scores. Raters were requested not to consult one another on the procedures they used to evaluate the protocols. The results, as with all the tests in the study, are presented in the next chapter.

3.3.3  Error recognition and mixed grammar tests

3.3.3.1 Theoretical overview

Grammar is an important component in most standardised test batteries, e.g. English as a Foreign Language (EFL), English Language Testing Service (ELTS) and the Test of English as a Foreign Language (TOEFL) (O’Dell 1986). Error recognition has been used in various studies on language proficiency testing. Olshtain, Shohamy, Kemp and Chatow (1990:31) use it as part of a battery of first language proficiency tests to predict second language proficiency. The emphasis in Olshtain et al. is on appropriacy, i.e. language use, and not on acceptability, i.e. grammatical correctness. In contrast to Olshtain et al.’s aim of trying to find a connection between error recognition and language use, Irvine, Atai and Oller (1974:247) use the multiple-choice error recognition test from the TOEFL battery of tests to find out whether integrative tests such as cloze tests and dictation tests correlate more highly with each other than they do with the multiple-choice tests of TOEFL.

Henning, Ghawaby, Saadalla, El-Rifai, Hannallah and Mattar’s (1981) revised GSCE (Egyptian General Secondary Certificate Examination) test battery contains an error identification test and a grammar accuracy test. Henning et al. (1981) found that the highest correlation with their “composition” subtest was with Error identification (.76). They accordingly maintain that “Error Identification may serve as an indirect measure of composition writing ability.” (Henning et al. 1981:462).

A grammar component has always featured prominently in all the standardised English first and second language proficiency and achievement tests of the Human Sciences Research Council (HSRC). The recent tests of the HSRC range across various school levels from junior secondary school to senior secondary school (Barry, Cahill, Chamberlain, Lowe, Reinecke and Roux, undated).

3.3.3.2 Error recognition and mixed grammar tests used in the study

The error recognition test and the mixed grammar test in this study are both multiple choice tests that have been designed for learners who have completed the “elementary” stage (Bloor et al. 1970) of second/foreign language learning. These two tests each comprise 50 items. Bloor et al. (1970:ix – Teacher’s book) state that the tests were analysed by the authors over an extended period, subjected to an item analysis and validated under test conditions. No data on this item analysis or validation under test conditions were provided by the authors, probably because that kind of data would not appear in a Teacher’s book. I shall show (Chapter 4) that their tests have high (split-half) reliability and high correlations with other test methods.
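
For readers unfamiliar with split-half reliability, the sketch below shows one standard way of computing it (an odd/even split of the items, the Pearson r between the half-test totals, then the Spearman-Brown correction) for a dichotomously scored multiple-choice test. The response matrix is invented for illustration and is far smaller than the 50-item tests used in the study; this is not the computation reported in Chapter 4.

```python
# Minimal sketch of split-half reliability: odd/even split of the items, the
# Pearson r between the half-test totals, then the Spearman-Brown correction.
# The 0/1 response matrix below is invented for illustration.

from statistics import correlation  # Python 3.10+

# Each row: one test-taker's item scores (1 = correct, 0 = wrong).
responses = [
    [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
]

odd_totals = [sum(row[0::2]) for row in responses]    # items 1, 3, 5, ...
even_totals = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, ...

r_half = correlation(odd_totals, even_totals)
r_full = 2 * r_half / (1 + r_half)                    # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, corrected reliability = {r_full:.2f}")
```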

Bloor et al. (1970) divide their tests into three levels: First Stage/Elementary Stage, Second Stage/Intermediate Stage, and Third Stage/Advanced Stage. Although the levels have been designated relative to one another, these are merely guidelines, and therefore the tester must use discretion in fitting the level to the relevant group of students. I have used the First Stage level for the Grade 7 subjects.

The tests were administered at the same sitting over a one-and-a-half-hour period, the emphasis being on completing the task and not on speed. It is true, however, that ability is dependent on speed of processing. Sample items from the error recognition and mixed grammar tests are now provided.

Error recognition test: Sample items

The test used was Test 1 from Bloor et al. (1970:70-77; Book 2).

Instructions: In some of the following sentences there are mistakes. (There are no mistakes in spelling and punctuation). Indicate in which section of the sentence the mistake occurs by writing its letter on your answer sheet. If there is no mistake, write E.

Example: (A) Although he has lived in England / (B) since he was fifteen, / (C) he still speaks English / (D) much badly. Correct – E.

Answer: D.

Item 8. (A) Both Samuel and I / (B) are much more richer / (C) than we / (D) used to be. Correct – E.

Answer: B.

Item 19. (A) Some believe that / (B) a country should be ruled / (C) by men who are / (D) too clever than ordinary people. Correct – E.

Answer: D.

Item 25. (A) His uncle is owning / (B) no fewer than ten houses, / (C) and all of them / (D) are let at very high rents. Correct – E.

Answer: A.

Item 27. (A) As I have now studied / (B) French for over three years / (C) I can be able to / (D) make myself understood when I go to France. Correct – E.

Answer: C.

Mixed Grammar test: Sample items

The test used is Test 2 from the First Stage in Bloor et al. (1970:35-40; Book 1). (By “mixed” is meant that a variety of grammatical structures is tested).

The test consists of choosing the correct alternative that fits into the gap within a sentence. The following are the instructions and an example from the test, followed by five selected items from the test:

Instructions: Choose the correct alternative and write its letter on your answer sheet.

Example: His sister is….than his wife. A) more prettier B) prettier C) very pretty D) most pretty. Answer: B.

Item 1. They often discuss…. A) with me B) about whether there is a problem C) the problem D) about the problem with me. Answer: C.

Item 28. This dress was made… A) by hands B) by hand C) with hands D) with hand. Answer: B.

Item 30. When the door-bell…., I was having a bath. A) rang B) rings C) rung D) ringed. Answer: A.

Item 38. My friend always goes home….foot. A) by B) with C) a D) on. Answer: D.

Item 50. We….our meat from that shop nowadays. A) were never buying B) do never buy C) never buy D) never bought. Answer: C.

The mixed grammar tests and error recognition tests of this study (Bloor et al. 1970) are commonly used tests. Compare these test items with equivalent test items from the Egyptian study of Henning et al. (1981). Consider the following two items from their test battery:

Grammar Accuracy

Ahmed enjoys….us. A. helping B. to help C. help to D. helping to

The item requires the selection of one of the four options. This test has the same format as Bloor et al.’s (1970) mixed grammar test.

Error Identification

(A) In my way to school, / (B) I met a man / (C) who told me that / (D) the school was on fire.

One has to choose the incorrect segment in the sentence, in this case segment A. This format is almost identical to Bloor et al.’s (1970) error recognition test. The difference is that Bloor et al.’s test has five options, which makes Bloor’s test more difficult.

I judged Bloor et al.’s “elementary stage” to be appropriate for the Grade 7 subjects. If this is correct, it suggests that the standard of grammatical knowledge required of Egyptian university entrants (Henning et al. 1981) is very similar to the standard required of Grade 7 non-mother-tongue speakers of English. Quite alarming. Further, as mentioned, Henning et al.’s Error Identification test is easier than Bloor’s, because it has only four options whereas Bloor’s error recognition test has five. The greater the number of options, the more difficult it is to guess the answer.
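To make the effect of the number of options concrete, the sketch below (in Python) computes the score obtainable by blind guessing on a 50-item test. The calculation is ordinary probability and is not taken from Bloor et al. or Henning et al.; the item and option counts simply mirror the tests described above.

# Expected score from blind guessing on a multiple-choice test.
# Illustrative only: 50 items, as in the Bloor et al. tests; the option
# counts contrast Henning et al.'s four options with Bloor et al.'s five.

def expected_chance_score(n_items: int, n_options: int) -> float:
    """Expected number of correct answers from random guessing."""
    return n_items * (1.0 / n_options)

for options in (4, 5):
    score = expected_chance_score(50, options)
    print(f"{options} options: {score:.1f} of 50 items correct by chance")

# 4 options yield 12.5 items by chance, 5 options only 10.0, which is why
# the five-option format is slightly harder to guess.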

3.3.4 The dictation tests

3.3.4.1 Introduction

Listening comprehension “poses one of the greatest difficulties for learners of English” (Suenobu et al. 1986:239). This section examines language proficiency through the dictation test, which is the most demanding type of listening comprehension test, because it forces the test-taker to focus on structure as well as meaning.

3.3.4.2 Theoretical overview

Language tests involve one or various combinations of the four language modes, namely listening, speaking, reading and writing. Although the most important combination in education, except for the early school years, is usually reading-writing (Lado 1961:28), the listening-writing combination, which is what a dictation test measures, also plays an essential role. The dictation test is a variation of a listening comprehension test where subjects write down verbatim what they listen to.

Some authors regard the dictation test merely as a test of spelling or of grammar (Lado 1961; Froome 1970:30-31). For Lado (1961) dictation was useful only as a test of spelling because, he argued, it did not test word order or vocabulary, both of which were already given; nor did it test aural comprehension, because the learner could often guess the context. For protagonists of the audio-lingual method the dictation test was considered to be a “hybrid test measuring too many different language features, mixing listening and writing, and giving only imprecise information” (Tönnes-Schnier and Scheibner-Herzig 1988:35). Savignon (1983:264) maintains that dictation does not test communicative proficiency. The reasons for its popularity, she suggests, are that it has high concurrent validity, is easy to develop and score, and has high reliability.

Contrary to these negative views, other authors regard dictation as a valid test of “pragmatic” language (Oller 1979:42, 1983; Bacheller 1980:67; Cziko 1982; Larsen-Freeman 1987; Bott and Satithyudhakarn 1986:40; Tönnes-Schnier and Scheibner-Herzig 1988). Dictation for these authors is a robust test of the ability to reconstruct surface forms to express meaning at the sentence level and beyond. For Oller (1976:61) and Tönnes-Schnier and Scheibner-Herzig (1988:37-38), dictation tests are valid measures of communicative proficiency. Spelling, which for Lado was the dictation test’s only justification, is disregarded by some of these authors (e.g. Oller 1979:278; Bacheller 1980:69).

There is substantial evidence in the interdisciplinary co-operation between research in linguistic pragmatics and reading comprehension to show that reading and listening employ the same underlying processing strategies (Horowitz and Samuels 1987; Hoover and Gough 1990; Vellutino et al. 1991). Vellutino et al. (1991:107, 124) found significant correlations between reading and listening comprehension in young children and in adults. Of the four language skills, reading performance was found to be the best predictor of listening performance, and vice versa. (For different views on the relationship between listening and reading see Atkin, Bray, Davidson, Herzberger, Humphreys and Selzer 1977a, 1977b; Jackson and McClelland 1979; Carroll 1993:179-80).

Cloze and dictation have been found to reveal similar production errors in writing (Oller 1976:287ff; Oller 1979:57), and a combination of cloze tests and dictation tests has been used effectively in determining general language proficiency (Stump 1978; Hinofotis 1980). Oller (1979:61) maintains that all pragmatic tasks such as dictation tests probe the same underlying skill. The reason why a dictation test and a cloze test (which are apparently such different tasks) intercorrelate so strongly is that both are effective devices for assessing the efficiency of the learner’s developing grammatical system, or language ability, or pragmatic expectancy grammar. This underlying skill is overall or general language proficiency. Spolsky (1989:72) describes the overall proficiency claim in the following “necessary” condition: “As a result of its systematicity, the existence of redundancy, and the overlap in the usefulness of structural items, knowledge of a language may be characterized as a general proficiency and measured.”

Tönnes-Schnier and Scheibner-Herzig (1988) compared Oller’s method and Bacheller’s (1980:71) “scale of communicative effectiveness” – both of which distinguish between spelling errors and “real” errors – with the relatively much simpler traditional method, where different kinds of errors are not distinguished. Tönnes-Schnier and Scheibner-Herzig (1988:38) maintain that “this simplified way of marking errors in dictations chosen in accord with the learners’ level of structures and vocabulary proves an effective way to rank a class of learners according to their communicative capacities.” The fact that a superficial and reductionist method such as the traditional method of counting surface errors could rank learners according to their communicative capacities shows that reductionist methods of testing can predict “pragmatic” language, i.e. language that straddles sentences. (Analogously, an eye involved in an eye test is no less alive looking at letters on an optician’s screen than reading a book or looking at the sunset). Recall (section 2.5) Spolsky’s (1989:61) suggestion that the “microlevel” is “in essence” the “working level of language, for items are added one at a time”, keeping in mind that “any new item added may lead to a reorganisation of the existing system, and that items learnt contribute in crucial, but difficult to define ways to the development of functional and general proficiency.” The micro-level is no less alive than the macro-level, just as an eye in a socket is no less alive than the brain that turns sensation into perception. Accordingly we can talk about a symbiotic relationship between “reductive”, or “discrete”, micro-levels and “holistic”, or “pragmatic”, macro-levels.

For Ur (1996:40), dictation “mainly tests spelling, perhaps punctuation, and perhaps surprisingly, on the face of it, listening comprehension”. Tönnes-Schnier and Scheibner-Herzig’s (1988) “surface” (discrete?) findings expose our “abyss of ignorance” (Alderson 1983:90). Constructs may not be lurking beneath the surface after all, but staring us in the face; or, more accurately, lurking beneath the surface and staring us in the face. The German term aufheben (sublation) illustrates the paradox. This term means “to clear away” as well as “to preserve”: the simultaneous preservation and transcendence of the structure/meaning antithesis. Language (i.e. language structure) has to be cleared away and preserved in order to convey its meaning. Coe (1987:13) uses sublation to explain the paradox of structural preservation and transcendence. According to Scholes (1988), however, sublation remains inconsequential from a practical point of view. It is certainly not inconsequential from the practical point of view of language testing.

3.3.4.3 The dictation tests used in the study

Excerpts from two restored cloze tests of Step 2 of Pienaar’s (1984) “Reading for meaning” were used. Step 2 corresponds to Grades 5 and 6 for English mother-tongue speakers, and to Grades 7 to 9 for English non-mother-tongue speakers. These dictation passages were different from the passages that were used for the cloze test (which were Forms B and D). For the dictation tests, I used the restored texts of Forms A and C of Pienaar’s Step 2. Thus, all four passages – two for the cloze test and two for the dictation test – belong to the same level.

I judged the conceptual difficulty of the word sequences to be within the range of academic abilities required of Grade 7 learners who have to use English as the medium of instruction. I decided to pitch the dictation test at the L2 level and not at the L1 level, because I suspected that Step 3, which was meant for the Grade 7 L1 level, would be too difficult for the Grade 7 L2 subjects. Accordingly, I used the passages of Step 2, which were aimed at the Grade 7 L2 level.

Test 1

The fire

We were returning from a picnic up the river when the fire-engine raced past us. Of course we followed it. We hadn’t gone far when we saw black smoke pouring from an old double-storey house in the high street. When we drew nearer we saw angry tongues of flame leaping from the downstairs windows. There was already a curious crowd watching the fire, and we heard people say that there was a sick child in one of the upstairs bedrooms. A black cat was also mentioned.

(86 words)

Test 2

A close call

It was early evening and we were driving at a steady ninety when a small buck leapt into the road about a hundred metres ahead of us. At the last moment it swerved and ran directly towards us. I flicked on the headlights and swerved at the same time. The car slithered to a halt in a cloud of dust, and it was only then that we saw why the buck had changed direction. A number of sinister shapes were hard on the Duiker’s* heels. Wild dogs!

(87 words)

*Duiker is a South African species of small buck.

The reason for the choice of these dictation passages was the same as Pienaar’s for the choice of his cloze passages (see section 3.3.1.2), which was “to select lexical and structural items relevant to the demands of the appropriate syllabuses” (Pienaar 1984:3), i.e. relevant to English as the medium of instruction. However, the relevant demands of the appropriate syllabuses cannot be separated from general language proficiency, which is often the hardest part of learning English for ESL learners (Hutchinson and Waters 1987).

3.3.4.4 Presentation of the dictation tests

The following procedures were used:

1. The degree of difficulty of the texts was regulated by controlling factors such as speed of delivery and the length of the segments between pauses. The text was read at a speed that preserved the integrative nature of the sequences, while catering for subjects who might not have been able to write at the required speed. The length of the segments between pauses was likewise regulated so as to satisfy the requirements of both mechanical writing speed and speed of information-processing.

2. The background noise level was kept to a minimum.

3. The text was presented three times: once straight through, which involved listening only; a second time with pauses; and a third time without pauses, but at a speed that allowed for quick corrections.

4. And very importantly for this study, more than one presenter was used. This procedure is explained in the next section.

It is normal procedure in a dictation test to use one presenter for all subjects – in this case all four groups. It has been argued that a “dictation can only be fair to students if it is presented in the same way to them all” (Alderson, Clapham and Wall 1995:57), i.e. using only one presenter; an old procedure for an “old” method/test.

In this study, I used “old” tests, but the procedure of presentation was new. If one is using indirect tests such as dictation, this does not mean that one has to stick to “old” procedures. One can still try to be exploratory.

The normal procedure in a dictation test is to use one presenter even when subjects are split up into different venues/classrooms. Owing to the exploratory nature of the dictation tests, four presenters (including myself) were used, one per group. The presenters then repeated the process on a rotational basis so that each of them presented the two dictation tests to all four groups. The dictation scores used in this study were the scores of the first presentation of each presenter. Thus, I did not use any scores for the statistical analysis from dictations that had been heard more than once by the subjects. Table 3.3 shows the procedure of the first rotational presentation.

TABLE 3.3

Presentation of Dictation with Four Presenters: First Presentation of Each Presenter
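Since the body of Table 3.3 could not be reproduced in this copy, the following Python sketch shows one hypothetical rotation that is consistent with the description above: each presenter gives the dictation to every group over four rounds, and only each presenter’s first presentation is scored. The presenter and group labels, and the order of rotation, are illustrative assumptions, not the actual assignment recorded in Table 3.3.

# Hypothetical Latin-square rotation of four presenters over four groups.
# The labels and the rotation order are placeholders; the actual first-round
# assignment is the one given in Table 3.3 of the thesis.

presenters = ["Presenter 1", "Presenter 2", "Presenter 3", "Presenter 4"]
groups = ["Group A", "Group B", "Group C", "Group D"]

for r in range(len(groups)):
    print(f"Round {r + 1}:")
    for i, presenter in enumerate(presenters):
        group = groups[(i + r) % len(groups)]
        note = " (scored: presenter's first presentation)" if r == 0 else ""
        print(f"  {presenter} -> {group}{note}")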

The reason I chose such a design, instead of doing the usual and simple thing of using one presenter, was that I wanted to investigate whether different presenters, i.e. different methods of presentation, would have any significant effect on the results. Three of the presenters were English mother-tongue speakers, and one was a Tswana mother-tongue speaker. In order to test for any significant difference between the means of the results of the four different presenters, I did an analysis of variance (one-way ANOVA). The ANOVA results are presented and discussed in section 4.3. The reason an ANOVA was used is that I had to test whether there was any significant difference between the results of the four procedures of administration: each presenter’s presentation represents a different procedure. An ANOVA deals with the four sets of data simultaneously, whereas a t-test can only deal with two at a time.
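To illustrate the analysis referred to above, the following sketch runs a one-way ANOVA on four sets of dictation scores, one set per presenter, using scipy. The scores below are invented purely for the example; the actual ANOVA results are those reported in Chapter 4.

# One-way ANOVA across the four presenters' dictation scores.
# The score lists are invented for illustration; the real analysis used the
# subjects' actual dictation scores (out of a possible 20).

from scipy.stats import f_oneway

presenter_1 = [14, 16, 12, 15, 13, 17, 11, 15]
presenter_2 = [13, 15, 14, 16, 12, 14, 15, 13]
presenter_3 = [15, 12, 16, 14, 13, 15, 14, 16]
presenter_4 = [12, 14, 13, 15, 16, 13, 14, 12]

f_stat, p_value = f_oneway(presenter_1, presenter_2, presenter_3, presenter_4)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# A p-value of .05 or more would indicate no significant difference between
# the four procedures of administration, i.e. no presenter effect.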

The results of an ANOVA are more reliable than the results of six t-tests (AB, AC, AD, BC, BD, CD), which would be required if the four sets of data (represented by the results of the four different presentations) had to be submitted to t-tests. The more t-tests that are run, the greater the chance of a spuriously significant result: “If means are to be cross-compared, you cannot use a t-test” (Hatch and Farhady 1982:114). The cross-comparison, or, as Brown (1988:170) calls it, “multiple t-tests”, may be between more than two groups, or may involve more than one test on the same two groups. Brown (1988:170) gives the following example of this pitfall:

[A] researcher might want to compare the placement means, including four sets of subtest scores, of the males and females at a particular institution. In such a situation, it is tempting to make the comparisons between the means for males and females on each of the four subtests with four different t tests. Unfortunately, doing so makes the results difficult to interpret. Yet I have seen studies in our literature that make 2, 3, 4, 5, and more comparisons of means on the same groups of test scores.

(See Hatch and Farhady [1982:114], who provide the statistics that show why multiple t-tests increase the likelihood of being able to reject the null hypothesis by chance alone).
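The point about multiple t-tests can be made concrete with a small calculation. It uses the standard family-wise error formula rather than the figures given by Hatch and Farhady: with six pairwise t-tests each run at the 5% level, the chance of at least one spuriously “significant” result is roughly 26%.

# Family-wise Type I error for k independent comparisons at alpha = .05.
# Standard probability, shown to illustrate why six pairwise t-tests
# (AB, AC, AD, BC, BD, CD) are riskier than a single one-way ANOVA.

alpha = 0.05
k = 6  # pairwise comparisons among four presenters
family_wise_error = 1 - (1 - alpha) ** k
print(f"Probability of at least one false rejection: {family_wise_error:.3f}")
# Prints approximately 0.265, i.e. about a one-in-four chance of a spurious
# "significant" difference somewhere among the six tests.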

3.3.4.5 Method of scoring the dictation tests

Various methods of scoring dictation are examined and reasons are given for the scoring method used in this study:

Cziko (1982) uses the method of scoring by segment with an exact-spelling criterion, where a point is awarded for a correct segment on condition that there is no mistake in the segment. Bacheller (1980) created a scale of communicative effectiveness, where spelling, unlike in Cziko’s method, was disregarded. In Bacheller’s (1980) method each segment is rated on a scale of 0 to 5 according to how much meaning is understood. A score of zero indicates that none of the intended meaning of the segment has been captured; a score of 3 indicates that the subject apparently understands the meaning of the segment; a score of 5 indicates that the meaning is understood (see also Tönnes-Schnier and Scheibner-Herzig 1988). Because the emphasis in Bacheller’s method is on the top-down process of coherence/meaning, this method tends to be subjective, especially if only one rater is involved.

Tönnes-Schnier and Scheibner-Herzig (1988) compared Bacheller’s method with the “traditional German method” of scoring dictation (henceforth referred to as the traditional method), where all words are counted, including spelling – one point for each word. Thus, the total score is the number of possible correct words minus the number of errors.

In the method used by Oller (1979:276, 282) and Stump (1978:48) each correct word is worth one point. One point is deducted for each deletion, intrusion or phonological or morphological distortion. Spelling errors, punctuation and capitalisation are not counted. As in the case of the traditional method, the total score is the number of possible correct words minus the number of errors. In the traditional method of counting errors, the “different kinds of errors” are not distinguished. Thus spelling errors – unlike in Oller’s method – are lumped together with omissions, intrusions, lexical and grammatical errors.

Cziko (1982:378) found that his “exact-spelling segment scoring procedure…was three to four times faster” than an “appropriate-spelling word-by-word scoring system”. The reason is that one only has to look for one mistake in each segment, whereas in Oller’s procedure one has to take into account each and every error. Cziko (1982:375) found a correlation of .89 between his method and Oller’s method. In my method I did not have to count every word or mistake, but decided (before the test was given) on a maximum possible score of 20, which was determined by the difficulty of the test. In this study I therefore used a variation of the traditional method, in which one point was deducted for any kind of error, including spelling, so that the actual score was 20 minus the number of errors. This was done because in my opinion this method yielded a valid indication of the level of proficiency of individual subjects. If one is only interested in norm-referenced tests, it would not matter what the possible score was, because in norm-referenced tests one is only interested in the relative position of individuals in a group, not in their actual scores. One could then measure the correlation between this procedure and Oller’s procedure; if the correlation is found to be high, one could use the shorter procedure. I did a correlational analysis on the dictation tests between Oller’s method and my variation of the traditional method (a possible 20 points). High correlations were found. These are reported and discussed after the ANOVA results in section 4.3.
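The sketch below illustrates, under simplifying assumptions, the two scoring procedures that were correlated: an Oller-style word-level count (errors deducted from the number of words, spelling ignored) and the 20-point variation used here (one point off for any error, spelling included), followed by a Pearson correlation between the two sets of scores. The per-subject error counts are invented; they are not the actual protocol data.

# Two dictation scoring procedures and the correlation between them.
# The error counts below are hypothetical and serve only to show the
# mechanics of the comparison reported in section 4.3.

from scipy.stats import pearsonr

N_WORDS = 86        # length of dictation Test 1 ("The fire")
MAX_POINTS = 20     # possible score in the 20-point variation used here

# (non-spelling errors, spelling errors) per subject: invented data
subjects = [(3, 2), (10, 5), (1, 0), (7, 3), (15, 6), (5, 1)]

oller_scores = []    # spelling errors not counted
variant_scores = []  # one point off for any error, including spelling
for non_spelling, spelling in subjects:
    oller_scores.append(N_WORDS - non_spelling)
    variant_scores.append(max(0, MAX_POINTS - (non_spelling + spelling)))

r, _ = pearsonr(oller_scores, variant_scores)
print(f"Pearson r between the two scoring methods: {r:.2f}")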

All the dictation protocols (N=86) were marked by me. The main reason for this was that the teachers/presenters did not have time for a lot of marking, because the administration of the tests had to be done within the limited time allotted in the School’s programme. Recall that the dictation tests were only a part of a large battery of tests, and the teachers/presenters of the dictation tests were involved in the administration of the whole battery. At the end of section 4.2 I discuss the issues of rater reliability when using a single rater.

3.4 Summary of Chapter 3

The sampling procedures, and the structure, administration and scoring procedures of the tests, were described. Although the subjects are divided into an L1 sub-group and an L2 sub-group, the two sub-groups should be treated as a composite group in the correlational analyses that follow in Chapters 4 and 5. The reason for this is that the following conditions were the same for all the subjects:

(1) The admission criteria to the School.

(2) The English proficiency tests and their administration (this investigation).

(3) The academic demands of the School.

(4) The treatment they were given at the School. What is relevant to the statistical rationale of this investigation is not the fact that the entrants to the School had received different treatment prior to entering the School, where some may have been disadvantaged, but only the fact that all entrants received the same treatment after admission to the School.

(5) The proportion of L1 and L2 learners (as I have defined these labels) was similar from year to year at the School.

All five conditions show that the 1987 Grade 7 sample represented subjects who came from the same population of Grade 7 learners at the School from year to year, specifically from 1980 to 1993, irrespective of their origin and of whether they are divided into “L1” and “L2” groups.

Various methods of administration and scoring were appraised in order to justify the administration and scoring procedures chosen for this study. The next chapter presents the results and statistical analysis of the battery of proficiency tests.


Ph.D. – Chapter 2: Theoretical Issues in the Testing of Language Proficiency and Academic Achievement

(The notes appear at the bottom of the post. The clicking function on the notes does not work).
2.1 Introduction

2.2 Ability, cognitive skills and language ability

2.3 Competence and performance

2.4 Proficiency

2.5 The discrete-point/integrative controversy

2.6 Cognitive and Academic Language Proficiency (CALP) and “test language”

2.7 Language proficiency and academic achievement

2.8 Validity

2.8.1 Face validity

2.8.2 Content validity

2.8.3 Construct validity

2.8.4 Criterion validity: concurrent and predictive validity

2.9 Reliability

2.9.1 Approaches to the measurement of reliability

2.10 Ethics of measurement

2.11 Summary of Chapter 2

2.1 Introduction

Whatever we say about language – or about anything – originates from a theory, i.e. a combination of knowledge, beliefs, wants and needs. For some authors[1] the testing of language proficiency has been a circular enterprise. Vollmer[2] maintains that “language proficiency is what language proficiency tests measure” and that this circular statement is all that can be firmly said when asked for a definition of language proficiency. Perhaps this is all that can be firmly established about language proficiency, but we swim on in the hope of reaching terra firma.

2.2 Ability, cognitive skills and language ability

I examine the notion of ability first and then discuss language ability. In the next sections I move on to competence and performance, which I shall relate to ability. I then discuss proficiency and the discrete-point/integrative controversy in language testing. As I mentioned in the first paragraph of the study, language testing draws on three areas: the nature of language, assessment, and language ability. The latter is closely related to language proficiency.

The precise definition of ability is not only seldom explicated but, unlike competence, often not even considered, in spite of the fact that the term is used widely in everyday language as well as in scientific circles. Important issues in the study of abilities, of which language is only one, are:

– The fixity of abilities. If abilities were highly variable over time they would reflect a state rather than a trait or an attribute. The two latter terms imply a fixed structure, or construct, rather than a variable process. The constructs that this study is concerned with belong to the domain of language acquisition. (Construct validity is discussed in section 2.8.3).

– Consistency. For example, if an athlete in a one-off freak “accident” breaks a world record, but is never able to repeat the performance, or even get near the record again, we still say that he or she has the ability to break a world record. We cannot apply the same logic to cognitive abilities, where consistency of output, not records, is the name of the game. Consistency does not only apply to the ability of learners but also to that of teachers, who are usually also testers. The consistency, or reliability, of judgements and scoring is a major issue in language testing. This issue is dealt with in various parts of the study.

– When we say people have the ability to perform academically we mean that they are able to achieve a certain liminal level, i.e. minimum or threshold level. In trying to set a minimum level, one is concerned with what the individual can do in terms of established criteria. What the individual can do cannot be separated from what others can do. Hence the importance of norm-referenced tests .

– The variability in ability between individuals obeys a “bell-curve” distribution, as in the case of nature as a whole. The “bell-curve” or “normal” distribution  is the foundational principle of psychometrics.[3]

Taking  the four points above into account, Carroll[4] suggests the following definition of ability:

As used to describe an attribute of individuals, ability refers to the possible variations over individuals in the liminal levels of task difficulty (or in derived measurements based on such liminal levels) at which, on any given occasion in which all conditions appear favorable, individuals perform successfully on a defined class of tasks.

Several modern theories of education and psychology reject the notion of traits, i.e. the fixity of psychological constructs[5]. In traditional trait theories, e.g. Carroll (above), psychological constructs are like any other human, animal or plant trait: biological differences between living things are distributed according to a bell curve, and differences in human abilities are likewise distributed according to the bell curve. This does not mean that people cannot improve, but only that the degree of improvement depends on fixed psychobiological constraints.[6]

A few comments on Carroll’s idea that ability is a fixed psychological trait are in order. Ability is closely connected to the notion of proficiency (to be discussed shortly), and proficiency is certainly something that can improve through the use of the correct strategies to enhance learning. The notions of “transferable” and “transferring skills” are used to explain the idea of “fixed” ability.

A major problem with learners with limited academic ability is the underdevelopment of “transfer skills”[7]. There are two kinds of transfer skills: (1) lower order “transferable skills” and (2) higher order “transferring skills”[8]. Transferable skills are skills learnt in one situation or one kind of subject-matter that are transferable to another. Examples are: (i) a reading skill such as scanning that is learnt in the English class and can be transferred to the geography class, (ii) using a dictionary, (iii) making charts and diagrams, (iv) completing assignments, (v) reviewing course material, (vi) learning formulas and dates, and (vii) memorising material. “Transferring skills” are “metacompetences” of a far higher order. These metacompetences are: (i) a sensitive and intelligent discernment of similarities and differences, (ii) the cognitive equipment that one uses to modify, adapt and extend, and (iii) attitudes and dispositions that support both of the above.[9]

These three “metacompetences” are interrelated. For example, without the “cognitive equipment” that enables one to modify, adapt and extend, it would not be possible to sensitively and intelligently discern similarities and differences. With regard to Bridges’ third “metacompetence” of “attitudes and dispositions”, which has to do with intention, motivation and the resulting approach to a task, I suggest that its successful development is to a large degree dependent on the successful development of the other two “metacompetences”. If one has healthy cognitive equipment, as well as the desire and opportunity to develop it, one will understand more; consequently, one will be more motivated to learn. Of course, socialisation into a community of learners and the correct mediation/intervention procedures between learner and task also play an important role in cognitive development, e.g. the development of critical awareness and learning strategies.

Bridges’[10] distinction between lower order “transferable skills” and higher order “transferring skills” is useful in understanding the nature of the problem of transfer. The problem of transfer refers mostly to the higher order “transferring skills”. The question is whether higher order cognitive skills (i.e. Bridges’ “transferring skills”) can be acquired at all (whether independently or through teaching). Millar[11] maintains that courses in skills development (e.g. the development of executive processes) pursue the “impossible” because processes such as classifying and hypothesising cannot be taught, but can only develop (i.e. they are part of inborn potential, or ability). For Millar the challenge is to find ways of “motivating pupils to feel that it is personally valuable and worthwhile to pursue the cognitive skills (or processes) they [children] already possess to gain understanding of the scientific concepts which can help them make sense of their world”[12] (square brackets and italics added).

According to Millar, these cognitive skills, especially the higher order transferring skills (e.g. a sensitive and intelligent discernment of similarities and differences), can only be developed if they are based on something that learners already possess, namely, academic potential, or ability. I have raised some highly controversial issues, but they needed to be raised in order to explain what I meant by “fixed” ability. I cannot pursue the matter further here.[13]

I now discuss how the term ability is used in relation to language. In section 1.1, I mentioned the four major test uses: achievement, proficiency, aptitude and diagnosis. These are all manifestations of what Davies calls “language ability”.[14]

I mentioned above that “fixed” ability in Carroll’s sense does not mean that people cannot develop and become better. If this were not so, it would be nonsensical to talk about things such as transitional competence and interlanguage, which feature, justifiably, so strongly in the applied linguistic literature. In the next section I discuss the notions of competence and its sibling, performance.

2.3  Competence and performance

The notions of competence and performance are essential to understanding language assessment.

For Chomsky[15] competence is the capacity to generate an infinite number of sentences from a limited set of grammatical rules. This view posits that competence is logically prior to performance and is therefore the generative basis for further learning.[16] Competence, on this view, is equivalent to “linguistic” (or “grammatical”) competence. Chomsky distinguishes between “performance”, which is “the actual use of language in concrete situations”, and “competence” or “linguistic competence” or “grammatical competence”, which is “the speaker-hearer’s knowledge of his language”.[17] Chomsky’s description of language involves no “explicit reference to the way in which this instrument is put to use…this formal study of language as an instrument may be expected to provide insight into the actual use of language, i.e. into the process of understanding sentences.”[18] Chomsky’s great contribution was to focus on linguistic introspection, without giving introspection (linguistic intuitions) the final word.[19]

Canale and Swain[20] make a distinction between knowledge of use and a demonstration of this knowledge. Knowledge of use is often referred to in the literature as “communicative competence”[21], and the demonstration of this knowledge as performance. Communicative competence has come to subsume four sub-competences: grammatical competence, sociolinguistic competence, discourse competence and strategic competence[22]:

(1) Grammatical competence is concerned with components of the language code at the sentence level, e.g. vocabulary and word formation.

(2) Sociolinguistic competence is concerned with contextual components such as topic, status of interlocutors, purposes of communication, and appropriateness of meaning and form.

(3) Discourse competence is concerned with: (i) a knowledge of text forms, semantic relations and an organised knowledge of the world; (ii) cohesion – structural links to create meaning, and (iii) coherence – links between different meanings in a text; literal and social meanings, and communicative functions.

(4) Strategic competence is concerned with (i) improving the effectiveness of communication, and (ii) compensating for breakdowns in communication. Strategic competence means something very different in Bachman and Palmer, namely metacognitive strategies, which are central to communication. For these authors “language ability” consists of “language knowledge” and “metacognitive strategies”.[23] (See Skehan[24]).

According to Widdowson communicative competence should subsume the notion of performance:

[T]he idea of communicative competence arises from a dissatisfaction with the Chomskyan distinction  between competence and performance and essentially seeks to establish competence status for aspects of language behaviour which were indiscriminately collected into the performance category.[25]

How does ability fit into the competence-performance distinction? Chomsky equates “ability” with “performance” (“actual use”), which he regards as a completely different notion from “competence” or “knowledge”.

Characteristically, two people who share the same knowledge will be inclined to say quite different things on different occasions. Hence it is hard to see how knowledge can be identified with ability…Furthermore, ability can improve with no change in knowledge.[26]

Thus, as Haussmann points out, “it should be noted that Chomsky’s original definition of the term [i.e. competence] always excluded this idea [i.e. ability].”[27] There doesn’t seem to be any reason, however, why “ability” cannot refer to (linguistic/grammatical) competence (which is Chomsky’s interest), as well as to the knowledge one has of how to use the language in appropriate situations. We can retain performance to mean the actual use of this knowledge. For example, Bachman and Clark use the term “ability” in the following way:

We will use the term “ability” to refer both to the knowledge, or competence, involved in language use and to the skill in implementing that knowledge, and the term “language use” to refer to both productive and receptive performance.[28]

Weir equates “ability” with “competence” as well:

There is a potential problem with terminology in some recent communicative approaches to language testing. References are often made in the literature to testing communicative ‘performance’ [e.g. B.J. Carroll 1980[29]]. It seems reasonable to talk of testing performance if the reference is to an individual’s performance in one isolated situation, but as soon as we wish to generalise about ability  to handle other situations, ‘competence’  would seem to be involved.[30] (Square brackets added)

This is Skehan’s position as well: “it is defensible to speak of competence-orientated abilities.”[31]

In other words, different performances point back to the underlying  competence or ability.

“Competency-based education and training” (CBET) has a different set of concepts for the labels of competence, performance and ability to those discussed above. CBET is discussed in section 6.4, where the future of assessment in South Africa is dealt with.

2.4  Proficiency

Proficiency is closely related to ability, competence and performance, discussed above. Proficiency is used in at least two different ways: it can refer to (1) the “construct or competence level”[32], which at a given point in time is independent of a specific textbook or pedagogical method[33], or to (2) the “performance level”[34], which is a reflection of achievement in the test situation. The construct or competence level is the knowledge of the language, and the performance level is the use of the language.

Proficiency, like the notions of competence and performance, is very much a “chameleon” notion[35], because it can be defined not only in terms of knowledge (the construct or competence level) and in terms of specific tasks or functions (the performance level), but also in terms of degrees of behaviour observed at different stages (minimum to native-like[36]), in terms of language development (e.g. interlanguage studies), in terms of situations that require some skills but not others, or in terms of general proficiency, where no specific skill is specified.

Porter[37] uses the term  “communicative proficiency”, which seems to subsume the notions of  “communicative competence” and “performance” discussed above. According to Child[38], “proficiency” is a “general `across-the-board’ potential”, while “performance” is the “actualised skill”, the “mission performance” involved in “communicative” tasks, i.e. the output. Child has much in common with Alderson and  Clapham[39], who distinguish between “language proficiency” and “language use”, where proficiency, not use, is part of output.

I would like to spend some time on Lantolf and Frawley’s[40] views on language proficiency because they epitomise the opposition to the view that I am arguing for in this study. These authors will be referred to as L and F. In their abstract they state that they “argue against a definitional approach to oral proficiency and in favor of a principled approach based on sound theoretical considerations.” (The L and F reference throughout is their 1988 article.) The authors use oral proficiency as a backdrop to their views on language proficiency in general. L and F, in their criticism of “reductionism” in the assessment of language proficiency, leave few authors unscathed; authors that many would consider to be in the vanguard of the real-life/communicative movement, e.g. Hymes, Omaggio and Widdowson.

To adumbrate: in the second section of their article, “The tail wagging the dog”, L and F use the section of Omaggio’s[41] manual entitled “Defining language proficiency” to lament that the “construct of proficiency, reified in the form of the [American Council on the Teaching of Foreign Languages – ACTFL] Guidelines, has begun to determine how the linguistic performance of real people must be perceived”:

In her discussion, she considers various models of communicative competence, including those of Hymes, Munby, Widdowson, and Canale and Swain, all of which are reductionist approaches to communicative competence, because they define communicative competence by reference to a set of constitutional criteria. She then proceeds to a subsection entitled “From Communicative competence to Proficiency.” However, nowhere in her analysis is there any in-depth consideration of proficiency that is independent of the proficiency test itself.[42]

It is strange that L and F consider Widdowson[43] a reductionist, since he, I would think, fully appreciates the distinction between language structure and language in use, in which grammar plays a vital role. By “grammar”, Widdowson does not mean merely morphology, phonology and syntax but lexico-grammar, where semantics is included under “grammar”. (The inclusion of semantics under “grammar”, or “linguistic knowledge”, is what modern linguistics understands by these terms.) The papers of the Georgetown University Round Table Conference[44] were concerned with the reality and authenticity of communicative language proficiency, where Widdowson argued that grammar is not dead, but the lifeblood of language, communication and social meaning. Such a view is not reductionist!

L and F reject the ACTFL’s adoption of a uniform yardstick for the measurement of foreign language ability based on real-life behaviour[45]. The ACTFL’s tail (the series of real-life descriptors) that is wagging the real dog is not, according to L and F, a real tail. The unreal tail for L and F is the unreal “construct”; the real dog being wagged is real people. The metaphor is clear: it is researchers who have fabricated the “construct”, and fabrications have no psychological reality. In other words, the construct constricts the reality of “the nontest world of human interaction”[46]. The test world, which represents the “construct” for these authors, “has come to determine the world, the reverse of proper scientific methodology”.[47]

Recall that L and F are arguing in “favor of a principled approach based on sound theoretical considerations” (italics added), which L and F seem to think authors such as Widdowson do not use. Yet Widdowson (who was probably not unaware of L and F’s criticism) ends his “Aspects of language teaching” with the following: “There needs to be a continuing process of principled pragmatic enquiry. I offer this book as a contribution to this process – and as such, it can have no conclusion”[48] (italics added). (See Gamaroff 1996a[49]).

Widdowson perceives the content of both the structural and the notional syllabus to be, in Nunan’s words, “synthetic” and “product-orientated”[50], i.e. the content of both syllabuses is static and lacks the power to consistently generate communicative behaviour. Widdowson’s argument against structuralist and notional syllabuses is that “[i]t has been generally assumed…that performance is a projection of competence…that once the rules are specified we automatically account for how people use language.”[51] His argument is that structural and functional-notional syllabuses do not link past experiences with new experiences, because they lack proper learner involvement[52]. Widdowson also believes that “the most effective means towards this achievement [i.e. “complete native-speaker mastery”] is through an experience of authentic language in the classroom.”[53]

L and F, and Widdowson are backing the same communicative horse. The main difference between them appears to lie in the value they place on school learning. All three believe in teaching language as communication, with the difference that much of Widdowson’s work is concerned with academic achievement and school learning rather than with real-life “natural” contexts.

I now examine more closely the cogency of the distinction between “natural” contexts in “real life” and “unnatural” contexts in the classroom. According to L and F “tasks cannot be authentic by definition”[54], which implies that very little in school is authentic, i.e. natural. The nub of L and F’s criticism is that the exchange between tester and test-taker is not a natural one, and therefore no kind of test can be a natural kind of communication. Communicative testing, it seems, would be for L and F a contradiction in terms. What is more, communicative school “tasks” would also be a contradiction in terms. In that case, school, which may be defined as an institution whose role it is to guide learners by defining and dispensing tasks, is another tail wagging the world (of “reality”). The ACTFL Guidelines, according to L and F, draw a line between the world and the individual. L and F regard such a situation as scientifically unprincipled and morally untenable. There is very little in “tasks” such as instructional activities, and nothing in tasks such as tests, that L and F find authentic in the proficiency literature. L and F want language tasks to be contextualised in natural settings such as cooking clubs.

Byrnes and Canale caution that the danger of “the proficiency movement” as espoused by L and F and others is, as with any movement, “that a rhetoric of fear and enthusiasm will develop which is more likely to misrepresent and confuse than to clarify the crucial issues.”[55] One confusing issue is that of the “natural environment”.

“Just what is a ‘natural environment’ as far as learning or acquiring a second language under any circumstances is concerned?” asks Morrissey[56]:

There is no environment, natural or unnatural, that is comparable with the environment in which one learns one’s mother tongue. Furthermore, it seems to me that there is a teaching (i.e. unnatural?) element in any L2-L1 contact situation, not just in cases of formal instruction. This element, even if it only consists in the awareness of the communicants that the [teaching or testing] situation exists, may be a more significant factor in L2 learning  and L2 acquisition [and L2 testing] than any other factor that is common to [the natural setting of]  L1 acquisition and L2 acquisition.[57]

L and F are seeking a testing situation analogous to the L2 “acquisition” situation (which by Krashen’s[58] definition is “natural”). But, as Morrissey above suggests, much of language and learning, like culture, consists of extrapsychological elements, and in this sense is an “imposition” upon nature. However, although the test situation, i.e. school, may be less “authentic” in the sense that the test is more concerned with learning language than with using it, with regard to the laws of learning the dichotomy between “natural” and “unnatural” is a spurious one. It is incorrect to assume that “natural” approaches (e.g. Krashen and Terrell[59]) and immersion programmes mirror natural language acquisition and that the ordinary classroom doesn’t.[60] The swimming club, cooking club, tea party or cocktail party is in a sense neither more nor less natural than the traditional classroom. That is why one doesn’t have to go outside the classroom in search of “real reality”[61]. The learning brain needs stimulation, and it can get it in the classroom or at an informal (or formal) cocktail party. In other words, there is much informal learning in classrooms and much formal learning outside the classroom. But both kinds of learning are completely “natural” to the brain that is doing the learning.

2.5  The discrete-point/integrative controversy

L and F point out that in real life one uses far fewer words than one would use in school “tasks”, and this is one of the reasons why they maintain that tests are inauthentic by definition.[62] However, as Politzer and McGroarty[63] show, it is possible to say or write few words (as one often does in natural settings) in a “communicative competence” test by using a “discrete-point” format. When one uses far fewer words in natural settings than one would use in many “artificial” school tasks, one is in fact using a “discrete-point” approach to communication. One doesn’t merely look at the format of a test to decide whether it is a “discrete-point” test; one looks at what the test is testing.

It is now opportune to examine what tests are testing. The way I have chosen to do so is by means of an examination of  the discrete-point/integrative controversy.

This controversy can only be fully understood within the context of a parallel controversy: the structuralism/functionalism controversy. The discussion of the structuralism/functionalism controversy will serve as a background to the discrete-point/integrative controversy.

It is impossible to test the structures and functions of language without understanding how it is learnt. Language learning is language processing, and the central issue in testing is assessing this language processing skill. Language processing, as with all knowledge, exists within a hierarchical organisation: from the lower level atomistic “bits” to the higher level discoursal “bytes”. The lower level bits traditionally belong to the “structuralist” levels, while the higher level bytes belong to the “functionalist” levels. It is difficult to know where structure ends and function begins.[64]

The following continuum, adapted from Rea[65], includes the concepts of competence and performance discussed earlier.

TABLE 2.1

Functionalist and Structuralist Levels of language

This mutually exclusive classification of functionalism and structuralism is a highly controversial one. The structuralist/functionalist controversy is about whether the semantic meaning of words and sentences (structuralism) can be distinguished from the  pragmatic (encyclopaedic, i.e. world knowledge) meaning of  discourse (functionalism).

Halliday[66] proposes two meanings of the term function, namely,  “functions in structure” and “functions of language”. “Functions in structure” is concerned with the relationship between different words of a sentence. Structuralism is traditionally associated with the study of language at the sentence level and below. “Functions of language”, on the other hand, goes beyond individual linguistic elements or words (Saussure’s “signs”) to discourse. Functionalism is traditionally associated with the study of discourse.

I understand the terms linguistic knowledge, lexico-grammar (which is what recent modern linguistics understands by grammar) and Halliday’s “functions in structure” to be synonymous. Therefore, lexico-grammar deals only with linguistic knowledge at the sentence level and below. Halliday’s “functions of language”, what I call functionalism, deals with discourse, i.e. the intersentential domain. (The division of language into a sentence level and an intersentential level is itself problematic).

Functionalism rejects the Chomskyan idea that grammar is logically and psychologically the origin of “functions in language” (Halliday above). For functionalists like Halliday[67], the grammar of a specific language is merely “the linguistic device for hooking up the selections in meaning which are derived from the various functions of language”.

In functionalism it is communication that is claimed to be logically and psychologically prior to grammar. Givon[68], for whom the supreme function of language is communication, criticises Chomsky for trying to describe language without referring to its communicative function. Givon argues: “If language is an instrument of communication, then it is bizarre to try and understand its structure without reference to communicative setting and communicative function.”[69] Rutherford, whose view is similar to Givon’s communicative view, rejects the “mechanistic” view that grammatical structure (Givon’s “syntax”) is logically or psychologically prior to communication.[70] Rutherford sees language as a dynamic process and not a static “accumulation of entities”.[71]

In this regard Spolsky[72] suggests that the “microlevel” is “in essence” the “working level of language, for items are added one at a time”, keeping in mind that “any new item added may lead to a reorganisation of the existing system, and that items learnt contribute in crucial, but difficult to define ways to the development of functional and general proficiency.” Thus, according to Spolsky, building up the language from the microlevel to the macrolevel need not be a static “accumulation of entities” (Rutherford above), but may lead to a dynamic “reorganisation of the existing system” (Spolsky above). Alderson’s view is similar to Spolsky’s:

Another charge levelled against (unidentified) traditional testing is that it views learning as a ‘process of accretion’. Now, if this were true, one would probably wish to condemn such an aberration, but is it? Does it follow from an atomistic approach to language that one views the process of language as an accretion? This does not necessarily follow from the notion that the product of  language learning is a series of items (among other things). (Original emphasis).[73]

The process and product methodologies “are too often perceived as generally separate”, i.e. they suffer from an “oppositional fallacy”.[74] The product is considered to be discrete and static and, accordingly, not party to language processing, while the process is considered to be integrative and dynamic, and accordingly is seen as belonging to language processing. (More about this in section 6.4). It is this oppositional fallacy that is the battleground of the discrete-point/integrative controversy.

It is widely believed that tests such as essay tests test the “use” of language, i.e. authentic communicative language, while tests such as error recognition tests and grammar accuracy tests test the “usage” of language[75], i.e. the elements of language. Such a distinction between the two kinds of tests, which Farhady describes as the “disjunctive fallacy”[76], is an oversimplification.

Many studies report high correlations between “discrete-point tests” and “integrative tests”.[77] It may be asked how the construct is able to account for this: “Shouldn’t supposedly similar types of tests relate more to each other than to supposedly different types of  tests?”  An adequate response presupposes three further questions: (1) “What are similar/different types of tests?” (2) Wouldn’t it be more correct to  speak of so-called discrete-point tests and so-called integrative tests?  (3) Isn’t the  discrete/integrative dichotomy irrelevant to what any test is measuring?

I consider some of the issues in the discrete-point/integrative controversy that are related to the questions posed above. The notion of “real-life” tests is also critically examined.

The terms “integrative” and “discrete-point” have fallen out of favour with some applied linguists, while for others these terms are still in vogue. For example, Fotos equates “integrative” skills with “advanced skills and global proficiency”[78], which he contrasts with Alderson’s “basic skills”[79]. These “basic skills” are Alderson’s “low order” skills[80]. Alderson prefers to distinguish between “low order” and “higher order” tests rather than between “discrete-point” and “integrative” tests.[81] Alderson in 1979[82] refrained from talking about discrete-point and integrative tests, preferring to talk of “low order” and “higher order” tests. Yet, in his later collaborative textbook on testing, one of the book’s test specifications is that “tasks” should be “discrete point, integrative, simulated ‘authentic’, objectively assessable”.[83] These test specifications would dovetail with the notion that although such tests do not mirror life, they are nevertheless “good dirty methods [of testing] overall proficiency”.[84]

Whatever one’s classification, all tests, except for the most atomistic of tests, reside along a continuum of “integrativeness”.[85] For example,  consider  two items from Rea[86]:

1. How —-  milk have you got?

(a) a lot (b) much of (c) much (d) many

2. —- to Tanzania in April, but I’m not sure.

(a) I’ll come (b) I’m coming (c) I’m going to come (d) I may come.

Item 1 is testing a discrete element of grammar. All that is required is an understanding of the “collocational constraints of well-formedness”[87], i.e. to answer the question it is sufficient to know that “milk” is a mass noun (see also Canale and Swain[88]). Item 2 relates form to global meaning. Therefore, all parts of the sentence must be taken into account, which makes it an integrative task. To use Rea’s terminology, her item 1 is testing “non-communicative performance”, while her item 2 (above) is testing what she calls “communicative performance”.[89] Other discrete-point, or “low order”, items could be shown to be more integrative, or “higher order”, than the items described above. (The terms in inverted commas are Alderson’s [1979[90]]).

Above I described an “objective” type of test. Consider now a test that lies toward the “pragmatic” extreme of the integrative continuum: the cloze test. Although cloze answers are short, usually a single word, the cloze test can still be regarded as an “integrative” test. A distinction needs to be made between integrative and discrete formats on the one hand and integrative and discrete processing strategies on the other. The salient issue in a cloze test, or in any test, is not the length of the answer or the length of the question, but whether the test measures what it is supposed to measure, in this case integrative processing strategies. One should distinguish between the structure of the test – long answer, short answer, multiple choice – and what one is measuring. One is measuring the natural ability to process language, and one component of this ability is the behaviour of supplying missing linguistic data in a discourse.

According to the “pop”[91] view, it is only in language use that natural language processing can take place. Although the “pop” view may not conflict with the idea of a continuum of integrativeness, such a view would nevertheless hold that language tests should only test language “use”, i.e. direct, or authentic, language. For language “naturalists” the only authentic tests are those presented in a direct real-life situation. Spolsky[92] maintains that “authenticity of task is generally submerged by the greater attention given to psychometric criteria of validity and reliability”, where face validity receives no more than “lip service”. For Spolsky and others[93], authenticity is closely related to communicative language, i.e. to direct language. Authentic tests for Spolsky would be “direct” tests in contradistinction to “indirect” tests. Owing to the lack of clarity on the relationship between a global skill like composition writing and the components of composition writing, e.g. vocabulary, punctuation and grammar, Hughes recommends that it is best, in terms of one’s present knowledge, to try to be as comprehensive as possible, and that the best way to do this would be to use direct tests.

“Direct” testers argue that in language use we do not process language in a multiple choice way as in the case of discrete-point tests. Yet many multiple choice tests do test processing strategies, e.g. the making of predictions. Furthermore, multiple choice tests are neutral in themselves, i.e. they can serve any purpose, communicative or non-communicative. Rea[94] gives the following reasons why indirect tests should be used:

1. There is no such thing as a pure direct test.

2. Direct tests are too expensive and involve too much administration.

3. Direct tests only sample a restricted portion of the language, which makes valid inferences difficult. (Of course, no battery of tests can sample the whole language. Rea’s point seems to be that indirect tests can be much more representative than direct tests).

If it could be shown that indirect test performance is a valid predictor of direct performance, this would be the best reason for using indirect tests. Even if indirect performance is accepted as a valid predictor of direct “natural” performance, one may object, as mentioned earlier, that indirect tests are unnatural and consequently lack face validity.

The laws of learning and testing apply to all contexts, “naturalistic”[95] and otherwise. One can have authentic indirect tests, because tests are authentic activity types in their own right[96]. The quality of learning outcomes depends, of course, on the quality of input – and more importantly on the quality of intake.

There is a sense, though, in which “real-life” “authentic” tasks in the classroom, if not a contradiction in terms, are not possible: in the sense that learners are aware that life in the classroom is a preparation for, and simulation of, life outside the classroom – an understanding of life which comprises not only life skills but content knowledge in specific disciplines and an understanding of their relationship. But this “preparation for life” view of the classroom does not justify, I suggest, the radical rupture between “real life” and the classroom described by Lantolf and Frawley (see end of section 2.4). Tritely, life is one big classroom; and less tritely, the classroom is one small part of life. This does not mean that the classroom has to be turned into a cooking club or a cocktail lounge to get learners to respond authentically to a recipe or to something “stronger” – for instance, a test.

If by some good fortune we come of age in our understanding of what an “authentic” task is (and, accordingly, isn’t), it still doesn’t follow that it is necessary to do “authentic” tasks in order to prove that we are proficient enough to do them, because communicative tasks can be tested successfully through indirect tests[97]. For example, an eye test doesn’t directly, or “holistically”, measure whether someone can see the illuminated road clearly, but it’s a jolly good predictor of whether one will be able to see, if not avoid, that oncoming road-hog on that same illuminated road.

In sum, both direct and indirect tests – as in all direct and indirect classroom activities – have communicative, or real-life, language as their aim. The difference lies in this: direct tests, or outcomes, or activities are based on the view that communicative language should be directly taught and tested, while indirect tests are based on the view that indirect teaching materials and tests are a prerequisite and solid basis for ultimate real-life language. But I would go even further and agree with Widdowson that “semantic meaning is primary” (Chomsky’s dated! “linguistic competence”), where semantic meaning should (naturally, i.e. obviously) be internalised to provide for “communicative capacity”[98]; which is the same idea as Spolsky’s[99] (mentioned above), where the building up of language from the microlevel to the macrolevel may be a dynamic and not necessarily a static “accumulation of entities” (Rutherford[100] above), which in turn leads to a dynamic “reorganisation of the existing system”.[101] [102]

2.6 Cognitive and Academic Language Proficiency and “test language”

Cognitive and Academic Language Proficiency (CALP) is closely related to the ability to do tests. Its features are better understood when compared and contrasted with Basic Interpersonal and Communicative Skills (BICS).[103] BICS refers to salient basic features such as fluency (speed of delivery) and accent, and not to advanced social and communicative skills, which are cognitively demanding. For example, the skills of persuading or negotiating in face-to-face communication require far more cognitive involvement than a BICS task, and are therefore cognitively demanding CALP tasks. Thus, it would be incorrect to equate BICS with all face-to-face communication, because face-to-face communication may involve informal as well as formal speaking. Formal speech acts such as persuading and negotiating belong to advanced communicative skills, and are consequently part of CALP. Spoken language can be just as complex as written language. They differ in that speech is dynamic while writing is synoptic, and writing is lexically denser than speech: “written language does not have to be immediately apprehended in flight and does not need to be designed to counter the limitations of processing capacity”.[104]

Cummins’s BICS and CALP have affinities with Bernstein’s “restricted code” and “elaborated code”, respectively[105]. The “elaborated code” has the following features: precise verbalisations, a large vocabulary, complex syntax, unpredictability, low redundancy, and individual differences between speakers. In contrast, the “restricted code” has the following features: loose verbalisations, a limited vocabulary, simple syntax, and high redundancy, where assumptions are based on shared social experience.

Wald makes a distinction between “test language” (spoken and written), which he equates with CALP, and “spontaneous language”/“face-to-face” communication.[106] For Wald, test skills are CALP skills, which can involve all four language skills: listening, speaking, reading and writing. For example, an oral cloze test would be a CALP task. In terms of these distinctions, it would be possible to have tests of basic language (grammar tests). Basic language tests involve CALP because, according to Wald, they are tests. All tests are formal, no matter how “natural” one tries to make them. In terms of Wald’s definition of CALP as test language, the tests in this study are CALP tasks because they are tests. Accordingly, if Wald is correct, and I think he is, one could not have a BICS test.

Ur uses the term “informal”[107] not in the sense of natural, but in the sense that test takers are not told in advance what they need to know for a test. One could, for example, spontaneously test learners on their homework. On such an interpretation of “informal”, it follows that there can be “informal” tests (“informal” CALP tasks).

2.7  Language proficiency and academic achievement

Much of the research in second language acquisition involves finding factors which affect language proficiency. In such a research scheme, factors such as intelligence, motivation, mother-tongue interference and socio-economic standing are defined as the independent variables and language proficiency is defined as the dependent variable. Language proficiency in such a research context does not look beyond itself to its effect on academic achievement.

In the investigation of academic achievement the focus changes from considering language proficiency as the dependent variable (the criterion) to considering academic achievement as the dependent variable, as in Saville-Troike[108].

Consider the following schema. (The schema is highly simplified and is merely meant to present some of the predictor variables that could be involved in academic achievement and is therefore not a comprehensive “model”):

FIGURE 2.1

Second Language Proficiency as a Criterion Variable and as a Predictor Variable

FIRST FOCUS – Second language proficiency as the criterion

Predictor variables: intelligence; motivation (active participation); mother-tongue interference; socio-economic standing; personality (e.g. emotional maturity).

Criterion variable: second language proficiency.

SECOND FOCUS – Academic achievement as the criterion

Predictor variables: intelligence; motivation; mother-tongue interference; socio-economic standing; second language proficiency; subject learning.

Criterion variable: academic achievement.

One needs to know how language proficiency, which is embedded in other factors, promotes or hinders academic achievement. These other factors comprise a complex network of variables such as intelligence, learning processes and styles, organisational skills and content knowledge, teaching methods, motivation and cultural factors. Owing to the complexity of the interaction between these variables, it is often difficult, perhaps impossible, to isolate them from language proficiency, which means that any one or any combination of these above-mentioned variables might be the cause of academic failure. Therefore, care must be taken not to make spurious causal links between any of these variables and academic failure. Although prediction does not necessarily imply causation, this does not mean that prediction should be ignored; on the contrary, prediction plays a very important role in the selection and placement of candidates. What is important is that these predictions be valid. (I stress that this study focuses mainly on those causes of academic failure that are related to the testing situation, e.g. rater unreliability).

Although the distinction between the two kinds of focus (Figure 2.1) can be useful, the change from one focus to the other does not merely involve rejuggling the variables that were previously used to predict language proficiency (the first focus) and then assigning them to the new game of predicting academic achievement (the second focus), where language proficiency, previously a criterion variable, would then become another predictor variable among those that were previously used to predict it. The reason is that when academic achievement is brought into the foreground, the predictive mechanism becomes far more complex. One cannot merely shift variables around, because in the second focus, learning in or through a second language is added to the demands of learning the second language itself.
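To make the shift between the two foci concrete, the sketch below fits an ordinary least-squares regression in which academic achievement is the criterion and second language proficiency is one predictor among several. It is a minimal illustration only: the data are fabricated, the variable names are hypothetical, and numpy’s least-squares routine merely stands in for whatever statistical package one prefers; it is not the analysis actually performed in this study.

```python
# Minimal sketch of the "second focus": academic achievement as the criterion,
# with second language proficiency as one predictor among several.
# All data are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 60

# Hypothetical standardised predictor scores
intelligence = rng.normal(0, 1, n)
motivation = rng.normal(0, 1, n)
l2_proficiency = rng.normal(0, 1, n)

# Hypothetical criterion, partly driven by the predictors plus noise
achievement = (0.4 * intelligence + 0.2 * motivation
               + 0.5 * l2_proficiency + rng.normal(0, 0.6, n))

# Design matrix with an intercept column, solved by least squares
X = np.column_stack([np.ones(n), intelligence, motivation, l2_proficiency])
coefs, *_ = np.linalg.lstsq(X, achievement, rcond=None)

# Proportion of criterion variance accounted for (R squared)
predicted = X @ coefs
r_squared = 1 - np.var(achievement - predicted) / np.var(achievement)

print("intercept and regression weights:", np.round(coefs, 2))
print("R squared:", round(r_squared, 2))
```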

Upshur[109] distinguishes between two general questions: “Does somebody have proficiency?” and “Is somebody proficient?” The first question considers such issues as grammatical competence and the use of language in discourse. The second question is concerned with the ability of language proficiency tests to predict future performance in tasks that require language skills, i.e. with the “prerequisites for a particular job or course of study”[110]. It is this second question, namely, “Is somebody proficient?” (to do a particular task), that is the main concern of educationists.

2.8  Validity

Validity is concerned with “the purposes of a test”[111], that is, with the meaning of scores and the ways they are used to make decisions. A major difficulty in this regard is ensuring that one’s descriptions of validity are validly constituted, which involves reconciling “objective” reality with one’s own interpretation of “objective” reality – a daunting and probably circular task.

For some researchers, validity comprises face validity, content validity, construct validity and criterion validities (concurrent and predictive validity), whereas for others, especially those belonging to the American Psychological Association[112] (APA), construct validity itself is validity. Face validity does not feature in the APA’s definitions of validity. The reason for this is explained in the next section.

2.8.1  Face validity

Face validity is concerned with what people (which includes test analysts and lay people) believe must be done in a test, i.e. what the test looks like it is supposed to be doing.

For Clark[113], face validity, oddly, covers the “whole business” of tests, i.e. looking “at what it’s got in it, at the way it is administered, at the way it’s scored.” Clark’s definition is unusual, because it covers everything to do with testing. Clark’s meaning of face validity is not what a test looks like to the non-tester but what it is to the tester, who should know from what it looks like what it is, i.e. “what it’s got in it”.

Spolsky’s meaning of face validity has affinities with Clark’s. Spolsky equates face validity with “authenticity”: “authenticity of task is generally submerged by the greater attention given to psychometric criteria of validity and reliability”, where “face validity receives no more than lip service”.[114]

For Davies “face validity is desirable but not necessary, cosmetic but useful because it helps convince the public that the test is valid.”[115] The reason why face validity is desirable, according to Davies, is that, in spite of its “cosmetic” nature, it can still have a “major and creative influence for change and development”[116]. Yeld maintains that face validity should be capitalised on as a point of entry into testing for those “who have not been trained in the use of techniques of statistical analysis and are suspicious of what they perceive as ‘number-crunching’.”[117]

Thus, face validity (what Stevenson calls “pop” validity) is so popular today because many language teachers have a poor knowledge of language testing and educational measurement, i.e. they are “metrically naive”[118]. Accordingly, they could remain satisfied with superficial impressions.

There are others who reject face validity altogether, because it relies too much on the subjective judgement of the observer[119]:

Adopting a test just because it appears reasonable is bad practice; many a ‘good-looking’ test has failed as a predictor… If one must choose between a test with ‘face validity’ and no technically verified validity and one with technical validity and no appeal to the layman, he had better choose the latter.[120]

Gardner and Tremblay[121] consider face validity to be the lowest form of validity, one that should, accordingly, not generally be recommended as a research strategy. The difficulty with face validity in its usual connotation of what a test appears to be is that the prettier the package, the worse may be the inherent quality of the tests. Whatever one’s opinion of face validity, it does have the following useful features: it increases a learner’s motivation to study for the test; it keeps sponsors happy; and it sustains the parents’ resolve to pay the ever-escalating school fees.

2.8.2 Content validity

Face validity and content validity can overlap, because what must be done in a test involves content. The latter subsumes subject matter as well as skills. Content validity “implies a rational strategy whereby a particular behavioural domain of interest is identified, usually by reference to curriculum objectives or task requirements or job characteristics”[122]. Content validity is concerned with how test items represent the content of a syllabus or the content of real-life situations. Content validity is not only a match between (the situation, topic and style of) tests and real-life situations but also a match between tests and school life, both of which are part of “real” life.

2.8.3  Construct validity

The constructs, or human abilities, that this study is interested in belong to the domain of language acquisition. As I mentioned earlier (section 2.2), abilities are fixed attributes, or constructs (in the sense of consistent, not immutable). If behaviour is inconsistent it would be impossible to find out what lies behind the behaviour, i.e. discover the construct. The problem for scientists, whether physical scientists or linguistic scientists, is figuring out the nature and sequence of the contribution of (abstract) theory and (concrete) experience to construct validity.

Consider how evidence for construct validity is assembled. There are two main stages: (1) hypothesise a construct and (2) construct a method that involves collecting empirical data to test the hypothesis, i.e. develop a test to measure the construct. Hypothesising is concerned with theory, while the construction of a method, although inseparable from theory, is largely an empirical issue. The problem is that it is not clear whether theory should be the cart and experience the horse, or vice versa, or some other permutation. Consider some of the problems in assessing the relative contribution of theory and experience in construct validation. For Messick[123] construct validity is a unitary concept that subsumes other kinds of (sub-)validities, e.g. content validity and criterion validity. Messick[124] defines validity, which for him is construct validity, as a

unitary concept that describes an integrated evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment. (Original emphasis)

“Test scores” above refers to “quantitative summaries” (Messick 1987:3), which is the commonly understood meaning of the term. Messick’s longer and much denser definition of construct validity

implies a joint convergent and discriminant strategy entailing both substantive coverage and response consistency in tandem. The boundaries and facets of a behavioural domain are specified, but in this case as delineated by a theory of the construct in question. Items are written to cover that domain in some representative fashion, as with content validity, but in this approach the initial item pool is deliberately expanded to include items relevant to competing theories of the construct, if possible, as well as items theoretically irrelevant to the construct… Item responses are then obtained, and items are selected that exhibit response homogeneity consistent with the focal theory of the construct but are theoretically distinct from theoretically irrelevant items or exemplars of competing theories. [125]

What Messick’s definition loses in brevity and simplicity it gains in scientific rigour. In Messick’s definition one should have a theory and only then a method. And the theory must be able to specify the problem without prejudging a solution: something very difficult to do. We try to ensure a “substantive coverage” (Messick above) of entities or qualities that are similar (a convergent strategy) and of those that are different (a discriminant strategy), which is “delineated by a theory of the construct in question” (Messick above). The problem is knowing what items to include or exclude in a test, because owing to the infinity of the corpus and the fact that elements and skills hang together[126], it is difficult to distinguish between “items relevant to competing theories of the construct, if possible” and “items theoretically irrelevant to the construct” (Messick above; italics added). The Unitary Competence controversy is concerned with the nature and degree of interdependence between elements and skills, i.e. how, how much and which elements and skills hang together. Difficulties also exist in discriminating between items. One of these difficulties is distinguishing between low order (so-called “discrete”) items and higher order (so-called “integrative”) items, as was shown in the discussion of the discrete-point/integrative controversy (section 2.5).
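The sketch below illustrates, in a deliberately simplified way, the convergent and discriminant item-selection strategy described in Messick’s definition: items are retained if they correlate well with the total score on the focal construct but only weakly with a theoretically irrelevant trait. The data, item loadings and cut-off thresholds are all invented for illustration and are not drawn from this study.

```python
# A minimal sketch of a convergent-and-discriminant item-selection check:
# keep items that correlate with the focal total (convergence) but not with
# a theoretically irrelevant trait (discrimination). Data are invented.
import numpy as np

rng = np.random.default_rng(1)
n_persons = 200

focal_ability = rng.normal(0, 1, n_persons)      # hypothesised construct
irrelevant_trait = rng.normal(0, 1, n_persons)   # theoretically irrelevant

def item(loading_focal, loading_irrelevant):
    """Simulate one item score as a noisy mixture of the two traits."""
    noise = rng.normal(0, 1, n_persons)
    return loading_focal * focal_ability + loading_irrelevant * irrelevant_trait + noise

# Candidate item pool: some focal, some contaminated, some irrelevant
pool = {
    "item_A": item(0.9, 0.0),
    "item_B": item(0.8, 0.1),
    "item_C": item(0.4, 0.7),   # contaminated by the irrelevant trait
    "item_D": item(0.0, 0.9),   # theoretically irrelevant item
}

pool_total = sum(pool.values())  # crude total score over the whole pool

for name, scores in pool.items():
    rest = pool_total - scores   # corrected item-total (item itself removed)
    r_convergent = np.corrcoef(scores, rest)[0, 1]
    r_discriminant = np.corrcoef(scores, irrelevant_trait)[0, 1]
    keep = r_convergent > 0.3 and abs(r_discriminant) < 0.2   # illustrative thresholds
    print(f"{name}: item-total r = {r_convergent:.2f}, "
          f"r with irrelevant trait = {r_discriminant:.2f}, keep = {keep}")
```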

The group-differences approach to construct validity that is used in this study is now explained: The aim of testing is to discern levels of ability. If one uses academic writing ability as an example of a construct, one would hypothesise that people with a high level of this ability would have a good command of sentence structure, cohesion and coherence, while people with a low level of this ability would have a poor command of these. Tests are then administered, and if it is found that there is a significant difference between a group of high achievers and a group of low achievers, this would be valid evidence for the existence of the construct (a minimal computational sketch of this check follows the list below). Second language learners are often relatively less competent than first language or mother-tongue users.[127]

Important for the arguments presented and the validation of the sample of subjects is that those who take English First Language as a subject are generally more competent than those who take English Second Language as a subject. If a test fails to discriminate between low-ability and high-ability learners there are three possible reasons for this:

– The construction of the test is faulty, e.g. the test may be too easy or too difficult for all or most of the test takers participating in the test.

– The theory undergirding the construct is faulty.

– The test has been inaccurately administered and/or scored, which would decrease the reliability, and hence also the validity of the test. (Reliability is discussed shortly).
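The following is a minimal computational sketch of the group-differences check described above: a group expected to possess more of the hypothesised ability should score significantly higher than a group expected to possess less of it. The scores are invented, and the independent-samples t-test is used simply as one conventional way of testing the difference between two group means; it is not a report of this study’s actual data or analysis.

```python
# A minimal sketch of the group-differences approach to construct validity.
# Hypothetical scores for a higher-ability group and a lower-ability group.
from scipy import stats

high_group_scores = [72, 68, 75, 80, 64, 77, 70, 73, 69, 76]  # invented
low_group_scores  = [55, 61, 49, 58, 52, 60, 47, 56, 53, 50]  # invented

t_statistic, p_value = stats.ttest_ind(high_group_scores, low_group_scores)
print(f"t = {t_statistic:.2f}, p = {p_value:.4f}")

# A significant difference in the expected direction counts as one piece of
# evidence for the construct; a non-significant result points back to the
# three possible faults listed above (test construction, theory, or
# administration and scoring).
```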

2.8.4  Criterion validity: concurrent and predictive validity

Criterion validity is concerned with correlating one test against an external criterion such as another test or non-test behaviour. Ebel[128] maintains that “unless a measure is related to other measures, it is scientifically and operationally sterile.” Criterion validity should not be confused with criterion-referenced tests. Criterion-referenced tests deal with profiles, i.e. with setting a predetermined cut-off score for an individual.

Criterion validity, which relies mainly on empirical methods, ignores the theoretical contribution of construct and content validity. For this reason, some researchers, particularly those of the American Psychological Association, prefer to dissociate validity from descriptions of the criterion, e.g. Loevinger[129] and Bachman[130]. Bachman[131] prefers the term “concurrent relatedness” to “concurrent validity”, and “predictive utility” to “predictive validity”.

A term used by Messick is “criterion-related validity”, where the latter “implies an empirical strategy whereby items are selected that significantly discriminate between relevant criterion groups or that maximally predict relevant criterion behaviours”.[132] Messick’s “criterion-related validity” is the same notion as the simpler term “criterion validity”, which was defined in the first sentence of this section.

Criterion validity consists of concurrent validity and predictive validity. Concurrent validity is also concerned with prediction, because there is only a chronological difference between concurrent and predictive validity.[133] So we could distinguish between concurrent prediction and prediction proper, the latter being concerned with the ability of one test to predict another test where the predictor and the criterion are not given concurrently but are separated from each other by a reasonable period of time.

The reason why predictive validity is easier to measure than other kinds of validity is that predictive validity does not depend on the nature of the test items, but on the consistency of the predictions of performance.[134] It would be possible to ignore all the other kinds of validity and still have a high degree of predictive validity. The question is whether one should be satisfied with predictive validity alone. No. That is why this study is also concerned with construct validity.
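As a minimal illustration of how a predictive validity coefficient is obtained, the sketch below correlates hypothetical scores on a predictor test with a criterion collected some time later. The figures are invented and the Pearson r is computed with numpy; the sketch does not reproduce the actual validity coefficients reported in this study.

```python
# A minimal sketch of a predictive validity check: correlate scores on a
# predictor test with a criterion obtained some time later. Data are invented.
import numpy as np

proficiency_scores = np.array([45, 52, 60, 38, 71, 55, 64, 49, 58, 67])  # earlier predictor
year_end_marks     = np.array([50, 55, 62, 41, 75, 53, 68, 47, 60, 70])  # later criterion

predictive_validity = np.corrcoef(proficiency_scores, year_end_marks)[0, 1]
print(f"predictive validity coefficient r = {predictive_validity:.2f}")
```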

If the construct validity of one test always depends on the validity of another test, there cannot exist any one test that stands by itself, such as an equivalent of a “Prime Mover”. Lado’s solution is to compare all tests in terms of “some other criterion whose validity is self-evident, e.g. the actual use of the language.”[135] The question is: What is self-evident? Is there a self-evident test that pre-exists all other tests? There isn’t, because “the buttressing validity of an external criterion is often neither definable nor, when found, reliable”.[136] This does not mean, of course, that any test or battery of tests, direct or indirect, will do. The problem, however, remains: what tests will do?[137]

Having said that, we don’t need to worry about the difficulty of establishing construct validity if we are merely interested in predictive validity. If Test A is a good predictor of Test B, then it seems we don’t need Test C as a second predictor, because Test A is doing a good job already. However, recall the discussion of the “One Best Test” question: we can never be sure, and furthermore, it doesn’t look (face validity) fair to use only one test as a predictor. To do so would be regarded by some researchers as highly unethical. Spolsky[138] (1995:358) is a case in point:

Only the most elaborate test batteries, with multiple administrations of multiple methods of testing the multiple traits or abilities that make up language proficiency, are capable of producing rich and accurate enough profiles to be used for making critical or fateful decisions about individuals.

Such Herculean conditions, however, would probably “paralyze”[139] most testing endeavours, because they are, in practice, unrealisable. We shouldn’t wait to start measuring until we are completely clear about what we are measuring; rather, we should do the best we can, always taking into account generally accepted theories, but not necessarily following them slavishly if we have cogent reasons why we shouldn’t.

2.9  Reliability

If the validity of a test depends on its close approximation to real life, then validity would relate to subjectivity. We try to be as objective as possible in test compilation, administration and assessment. This search for objectivity is the domain of reliability. Reliability in testing is concerned with the accuracy and consistency of scoring and of administration procedures. The less the accuracy and consistency, the greater the measurement error.

A major difficulty in testing is how to make the “leap from scores to profiles”[140], i.e. how to define the cut-off points. In norm-referenced testing, one defines cut-off points by computing the measurement error. In criterion-referenced tests, one makes a value judgement of what is progress enough for a specific individual.

To the extent that one can decrease the measurement error, one increases the reliability of the test. Measurement error has important ethical implications. It would be unjust to fail students because they get 49% – perhaps even 47%; where does one draw the line? – instead of 50%. In subjective tests such as essay tests the problem is more serious, because even the best essay test, owing to its subjective scoring procedures, is often not more than 80%-90%[141] reliable, and therefore measurement error should be calculated in order to make more equitable judgements.
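To show how measurement error can be brought to bear on a borderline pass/fail decision, the sketch below computes the standard error of measurement (SEM) from a test’s reliability and the standard deviation of its scores, using the standard formula SEM = SD * sqrt(1 - reliability). The reliability and score figures are invented for illustration and are not those of the tests in this study.

```python
# A minimal sketch of how measurement error bears on a pass/fail decision
# at a cut-off of 50%. Reliability and score figures are invented.
import math

reliability = 0.85     # hypothetical reliability of a well-scored essay test
sd_of_scores = 12.0    # hypothetical standard deviation of the score distribution
observed_score = 49.0  # the borderline candidate
cut_off = 50.0

sem = sd_of_scores * math.sqrt(1 - reliability)

# An approximate 68% band: one SEM either side of the observed score
lower, upper = observed_score - sem, observed_score + sem
print(f"SEM = {sem:.1f}; 68% band = {lower:.1f} to {upper:.1f}")

# If the cut-off falls inside the band, the fail decision is not supported
# by the measurement alone.
print("cut-off inside error band:", lower <= cut_off <= upper)
```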

The following “aspects” are germane to reliability:

– Facets: These refer to such factors as (1) the testing environment, e.g. the time of testing and the test setting, (2) test organisation, e.g. the sequence in which different questions are presented, and (3) the relative importance of different questions and topic content. Facets also include cultural and sexual differences between test takers, the attitude of the test taker, and whether the tester does such things as point out the importance of the test for a test-taker’s future.[142]

– Features or conditions: These refer to such factors as clear instructions, unambiguous questions and items that do or do not permit guessing.

– The manner in which the test is scored. A central factor in this regard is rater consistency. Rater consistency becomes a problem mainly in the kinds of tests that involve subjective judgements, such as essay tests. (I discuss interrater and intrarater consistency in the next section). According to Ebel and Frisbie[143], consistency is not only concerned with the correlations between raters, but also with the actual scores, more specifically, the equivalence in scores between tests and between raters. (I discuss this issue in section 4.8.1.2).

I clarify a possible confusion between rater reliability and concurrent validity. Rater reliability has to do with the consistency between raters’ judgements on one test, e.g. an essay test. Concurrent validity, in contrast, has to do with the correlation between two or more different tests, e.g. a dictation test and an essay test. In the next section more details are provided on the approaches to reliability, which may help clarify the concepts discussed.

2.9.1  Approaches to the measurement of reliability

There are five approaches to measuring reliability. Owing to the structure of this study, only approaches 2, 4 and 5 are used:

1. Stability, i.e. consistency over time. The method used to measure stability is the test-retest method, which involves giving the same test a second time and comparing the scores of the two test trials. If the scores are equivalent, the test is considered to be stable. A disadvantage of the test-retest method is that students may not be motivated to do the test a second time, which might affect performance on the retest.

2. Internal consistency. This approach, also called the “split-half” method, divides the test into two halves. The two halves of the test are regarded as two parallel tests. For each student there is a separate score for each half of the test. It is possible to correlate the two sets of scores as if they were parallel tests. (A brief computational sketch follows this list.)

3. Rater reliability. Rater reliability is particularly important in non-objective tests such as essay tests, where there are liable to be fluctuations in scores (1) between different raters, which is the concern of interrater reliability, and (2) within the same rater, which is the concern of intrarater reliability. In this study I use essay assessment to examine interrater reliability (section 4.8.1).

4. Equivalence (in the form of the test). There are two meanings of equivalence: firstly, the equivalence between test scores, and secondly, the equivalence between the facets of the tests. The method used to measure equivalence is the parallel test method. In parallel tests it is difficult to ensure equivalent conditions within the many facets of a test, especially whether the content of the two parallel tests is equivalent. The problem doesn’t exist for multiple-choice type tests because the split-half method is used. In the case of “pragmatic” tests, however, such as the cloze, dictation and essay tests in this study, there is a problem. This problem is examined in the discussion of the parallel reliability of the pragmatic tests in the study.

5. A combination of stability and equivalence (in forms). The method used is a parallel test which is administered a period of time after the first test. The difficulties are compounded here, because they include the problems of both equivalence and of stability mentioned above.
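The sketch below illustrates two of the approaches above on invented data: internal consistency via the split-half method with the Spearman-Brown correction (approach 2), and equivalence via a parallel-forms correlation (approach 4). The simulated item responses, and the way the “parallel” form is generated, are purely illustrative assumptions and do not reproduce the reliability analyses of this study.

```python
# A minimal sketch of split-half reliability (with the Spearman-Brown
# correction) and parallel-forms reliability on simulated item data.
import numpy as np

rng = np.random.default_rng(2)
n_takers, n_items = 50, 20

ability = rng.normal(0, 1, n_takers)
# Dichotomous item responses whose probability of success rises with ability
prob_correct = 1 / (1 + np.exp(-ability[:, None]))
items_a = (rng.random((n_takers, n_items)) < prob_correct).astype(int)

# Approach 2: split-half (odd versus even items), correlated and stepped up
half_1 = items_a[:, 0::2].sum(axis=1)
half_2 = items_a[:, 1::2].sum(axis=1)
r_halves = np.corrcoef(half_1, half_2)[0, 1]
split_half_reliability = 2 * r_halves / (1 + r_halves)   # Spearman-Brown

# Approach 4: parallel forms, simulated here as an equivalent second item set
# administered to the same test takers
items_b = (rng.random((n_takers, n_items)) < prob_correct).astype(int)
parallel_reliability = np.corrcoef(items_a.sum(axis=1), items_b.sum(axis=1))[0, 1]

print(f"split-half (Spearman-Brown) reliability: {split_half_reliability:.2f}")
print(f"parallel-forms reliability: {parallel_reliability:.2f}")
```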

The degree of reliability required depends on the relative importance of the decisions to be made. For example, an admission test would require more reliability than a placement test, because decisions based on a placement test can be more easily adjusted than decisions based on an admission test. A final evaluation for promotion purposes would require the most reliability of all.

2.10  Ethics of measurement

Validity should not be separated from what Sammonds[144] refers to as the following “ethical” questions, most of which are scientific questions as well. (The kinds of validity corresponding to each question are the appellations given by Sammonds.)

1. Are the measures that are chosen to represent the underlying concepts appropriate? (“Construct validity”).

2. Has measurement error been taken into account, because in all measurement there is always a degree of error? (“Statistical conclusion validity”; in other words, reliability).

3. Are there other variables that need to be taken into account? (“Internal validity”).

4. Are the statistical procedures explained in such a way that a non-statistical person – or sometimes even a statistical person – can understand them? One reader may find an explanation superfluous or too detailed, while another may find the same information patchy. Much depends on the background knowledge of readers and/or what they are looking for. Pinker[145] maintains that expository writing requires writers to overcome their “natural egocentricism”, where “trying to anticipate the knowledge state of the generic reader at every stage of the exposition is one of the important tasks of writing well.” True, but there is much more, namely the basic expository problem of negotiating a path between under-information and overkill. Getting experts to read and provide comments before one submits one’s work to public scrutiny is one way of reducing the expository problem. It may also compound the problem, however, owing to the diversity of beliefs in the world: of interpretations of interpretations (see section 6.2).

5. Has the description of the sample and the data analysis been properly done so that generalisations can be made from it? (“External validity”). This important issue is dealt with in the last chapter (section 7.1).

2.11  Summary of Chapter 2

The first part of the chapter dealt with theoretical issues in language proficiency, language learning, language testing and academic achievement. Key concepts such as authenticity, competence, performance, ability, proficiency, test language, integrative continuum and achievement were explained.

The second part of the chapter was concerned with explaining the key concepts in summative assessment. The two principal concepts in summative assessment are validity and reliability. Different kinds of validity were discussed, namely, content validity, face validity, construct validity and criterion validity, where the latter comprises concurrent and predictive validity. Other kinds of validity were also referred to in the context of the ethics of measurement. The group-differences approach to construct validity, to be used in the study, was described. Different approaches to the examination of reliability were discussed and those chosen for the study were specified.


[1](1) Ingram, E. Assessing proficiency: An overview on some aspects of testing, 1985, p.218. (2) Vollmer, H.J. Why are we interested in general language proficiency, 1981, p.152.

[2]Ibid.

[3]“Norm” is used in the sense of an idealisation against which comparisons are made of what scientists call the “real” world. Although this “normal” curve is a mathematical abstraction it is based on the reasoning that if there were an infinitely large population then human abilities (and the milk yield of cows) would be represented by a perfect bell curve.

[4]Carroll, J.B.Human cognitive abilities: A survey of factor analytic studies, 1993, p.10.

[5]Minick, N.J.L.S. Vygotsky and Soviet activity theory: New perspectives on the relationship between mind and society, 1985, pp.13-14.

[6]The concept of fixity in the social sciences carries the stigma of colonialism and racism, “an ideological constriction of otherness” (Bhabha 1994:66; see also Leung, Harris and Rampton 1997). This came out clearly in an article by Phatekile Holomisa, president of the Congress of Traditional Leaders of South Africa (Financial Mail, October 16, 1998, p.22), who maintained that one of the reasons why Transkeians who have made good outside the Transkei do not return to help the rural poor is because they believe that they “might be seen as promoting ubuXhosa (Xhosa-ness), in contradiction to the ideal of nonracialism.” There is also the danger that the concept of fixity could translate into the constriction of the historical individual, which the generalising mode of science considers to have significance only insofar as it reveals a universal rule (Cassirer 1946:27).

[7]Botha, H.L. and Cilliers, C.D. ‘Programme for educationally disadvantaged pupils in South Africa: A multi-disciplinary approach.’ South African Journal of education, 13 (2),  55-60 (1993).

[8]Bridges, P.’Transferable skills: A philosophical perspective.’ Studies in Higher   Education, 18 (1), 43-52 (1993), p.50.

[9]Ibid.

[10]Ibid.

[11]Millar, R.‘The pursuit of the impossible.’ Physics Education, 23, 156-159 (1988), p.157.

[12]Ibid.

[13]Gamaroff, R.  ‘Solutions to academic failure: The cognitive and cultural realities of English as the medium of instruction among black ESL learners.’ Per Linguam, 11 (2), 15-33 (1995c).

__________ ‘Abilities, access and that bell curve.’ Grewar, A. (ed.). Proceedings of the South African Association of Academic Development “Towards meaningful access to tertiary education”. (Alice: Academic Development Centre, Fort Hare, 1996b).

___________  ‘Language as a deep semiotic system and fluid intelligence in language proficiency.’ South African Journal of Linguistics, 15 (1), 11-17 (1997b).

[14]Davies, A. Principles of language testing, 1990, p.6.

[15]Chomsky, N. Aspects of the theory of syntax, 1965, p.6.

[16](1) Brown, K.Linguistics today, 1984, p.144.

(2) Leech, G. Semantics, 1981, p.69.

(3) Hutchinson, T. and Waters, A. English for special purposes: A learner-centred approach, 1987, p.28.

[17]Chomsky, N. Aspects of the theory of syntax, 1965, pp. 3-4.

[18]Chomsky, N. Syntactic structures, 1957, p.103.

[19]Atkinson, M., Kilby, D. and Roca, I. Foundations of general linguistics, 1982, 369.

[20]Canale, M. and Swain, M. ‘Theoretical bases of communicative approaches to second language teaching and testing.’ Applied Linguistics, 1 (1), 1-47 (1980), p.34.

[21]Hymes, D. ‘On communicative competence’, in Pride, J.B. and Holmes, J. (eds.). Sociolinguistics. (Harmondsworth, Penguin, 1972).

[22](1) Canale, M. and Swain, M. ‘Theoretical bases of communicative approaches to second language teaching and testing.’ Applied Linguistics, 1 (1), 1-47 (1980), p.34.

(2) Swain, S. ‘Large-scale communicative language testing: A case study’, in Lee, Y., Fok, A., Lord, R. and Low, G. (eds.). New directions in language testing. (Oxford,  Institute of English, 1985).

(3) Savignon, S.J.  Communicative competence: Theory and classroom practice.  (Reading, Mass. Addison-Wesley Publishing Company, 1983).

[23]Bachman, L.F. and Palmer, A.S.Language testing in practice, 1996. (See their Chapter 4).

[24]Skehan, P. A cognitive approach to language learning, 1998, p.16.

[25]Widdowson, H.G.Aspects of language teaching, 1990, p.40.

[26]Chomsky, N. Language and the problem of knowledge, 1988, p.9

[27]Haussmann, N.C. The testing of  English mother-tongue competence by means of a multiple-choice test: An applied linguistics perspective, 1992, p.16.

[28]Bachman, L.F. and Clark, J.L.D. ‘The measurement of foreign/second   language proficiency.’ American Academy of the Political and Social Science Annals, 490, 20-33 (1987), p.21.

[29]Carroll, B.J. 1980. Testing communicative performance. Oxford. Pergamon.

[30]Weir, C.J. Communicative language testing, 1988, p.10.

[31]Skehan, P. A cognitive approach to language learning, 1998, p.154

[32]Vollmer, H.J.The structure of foreign language competence, 1983, p.5.

[33]Brière, E. ‘Are we really measuring proficiency with our foreign language tests?’ Foreign Language Annals, 4, 385-91 (1971), p.322.

[34]Vollmer, H.J. Ibid, p.5.

[35]Hyltenstam, K. and Pienemann, M. Modelling and Assessing second language   acquisition, 1985, p.15.

[36]The term native is problematic. I discuss this problem in sections 3.2.1 and 6.1.1.

[37]Porter, D.Assessing communicative proficiency: The search for validity, 1983.

[38]Child, J. ‘Proficiency and performance in language testing.’ Applied Linguistic Theory, 4 (1/2), 19-54 (1993).

[39]Alderson, J.C. and Clapham, C. ‘Applied linguistics and language testing: A case study of the ELTS test.’ Applied Linguistics, 13 (2), 149-167 (1992), p.149.

[40]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988).

[41]Omaggio, A.C.Teaching language in context: Proficiency-orientated instruction, 1986.

[42]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.182..

[43]Widdowson, H.G. Explorations in applied linguistics. (Oxford,  Oxford University Press, 1979).

______________   ‘Knowledge of language and ability of use.’ Applied Linguistics, 10 (2), 128-137 (1989).

______________ Aspects of language teaching. (Oxford,  Oxford University Press, 1990).

______________ ‘Communication, community and the problem of appropriate use’, in Alatis, J.E. Georgetown University Round Table on Languages and Linguistics. (Washington, D.C. Georgetown University Press, 1992).

[44]Alatis, J.E. (ed.). Georgetown University Round Table on Languages and Linguistics,  1992.

[45]Byrnes, H. and Canale, M. (eds.). Defining and developing proficiency: Guidelines, implementations and concepts, 1987.

[46]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.182.

[47]Ibid.

[48]Widdowson, H.G.Aspects of language teaching. (Oxford,  Oxford University Press, 1990).

[49]Gamaroff, R. ‘Is the (unreal) tail wagging the (real) dog?: Understanding the construct of language proficiency.’ Per Linguam, 12 (1), 48-58 (1996a).

[50]Nunan, D. Syllabus design, 1988, p.28.

[51]Widdowson, H.G. Explorations in applied linguistics,1979, p.141.

[52]Ibid, p.246.

[53]Widdowson, H.G. ‘Communication, community and the problem of appropriate use’, in Alatis, J.E. Georgetown University Round Table on Languages andLinguistics, 1992, p.306.

[54]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.183.

[55]Byrnes, H. and Canale, M.(eds.). Defining and developing proficiency: Guidelines, implementations and concepts, 1987, p.1.

[56]His specific context is the second language “acquisition”/second language “learning” controversy  of Krashen (1981); see Note 56 below..

[57]Morrissey, M.D. ‘Toward a grammar of learner’s errors.’ International Review of Applied Linguistics, 21 (3), 193-207 (1983), p.200.

[58]Krashen, S. Second language acquisition and second language learning. (Oxford,  Pergamon Press, 1981).

[59]Krashen, S. and Terrell, T. The natural approach:Language acquisition in the   classroom, 1983.

[60]Butzkamm, W. ‘Review of H. Hammerly, “Fluency and accuracy: Toward balance in language teaching and learning”.’ System, 20 (4), 545-548 (1992).

[61]Taylor, B.P. ‘In search of real reality.’ TESOL Quarterly, 16 (1), 29-43 (1982).

[62]Lantolf and Frawley, 1988, p.183.

[63]Politzer, R.L. and McGroarty, M. ‘A discrete-point test of communicative competence.’ International Review of Applied Linguistics, 21 (3), 179-191 (1983).

[64]Entwhistle, W.J. Aspects of language, 1953, p.157.

[65]Rea, P. Language testing and the communicative language teaching curriculum, 1985.

[66]Halliday, M.A.K. Learning how to mean, 1975, p.5.

[67]Ibid, p.2.

[68]Givon, T.Understanding grammar, 1979, pp.5 and 22

[69]Ibid, p.31.

[70]Rutherford, W.E. Second language grammar: Learning and teaching,1987, pp.1-5

[71]Ibid, pp.4 and 36-37.

[72]Spolsky, B.Conditions for second language learning, 1989, p.61.

[73]Alderson, J.C. Reaction to the Morrow paper, 1981c, p.47.

[74]Besner, N. ‘Process against product: A real opposition?’ English Quarterly, 18 (3), 9-16 (1985), p.9.

[75]Widdowson, H.G. Explorations in applied linguistics, 1979.

[76]Farhady, H. The disjunctive fallacy between discrete-point tests and integrative tests,  1983.

[77](1)Hale, G.A., Stansfield, C.W. and Duran, R.P. TESOL Research Report 16.(Princeton, New Jersey: Educational Testing Service, 1984).

(2) Henning, G.A., Ghawaby, S.M., Saadalla, W.Z., El-Rifai, M.A., Hannallah, R.K. and Mattar, M. S. ‘Comprehensive assessment of language proficiency and achievement among learners of English as a foreign language.’ TESOL Quarterly, 15 (4), 457-466 (1981).

(3) Oller, J.W., Jr. Language tests at school. London,  Longman, 1979).

(4) Oller, J.W. (Jr.) and Perkins, K. (eds.). Language in education: testing the tests. (Rowley, Massachusetts,  Newbury House, 1978).

[78]Fotos, S. ‘The cloze test as an integrative measure of EFL proficiency: A substitute for essays on college entrance examinations.’ Language Learning, 41 (3), 313-336 (1991), p.318.

[79]Alderson, J.C. ‘The cloze procedure and proficiency in English as a foreign language.’ TESOL Quarterly, 13, 219-227 (1979).

[80]Ibid.

[81]Ibid.

[82]Ibid.

[83]Alderson, J.C., Clapham, C. and Wall, D. Language test construction and evaluation. (Cambridge, CUP, 1995).

[84]Bonheim, H. Roundtable on language testing. European Society of the Study of English (ESSE) conference, Debrecen, Hungary, September 1997.

[85]Oller, J.W., Jr. A consensus for the 80s, 1983, p.137.

[86]Rea, P. ‘Language testing and the communicative language teaching curriculum’, in Lee, Y.P. et al. New directions in language testing. (Oxford. Institute of English, 1985), p.22.

[87]Ibid.

[88]Canale, M. and Swain, M. ‘Theoretical bases of communicative approaches to second language teaching and testing.’ Applied Linguistics, 1 (1), 1-47 (1980), p.35.

[89]Ibid.

[90]Alderson, J.C. ‘The cloze procedure and proficiency in English as a foreign language.’ TESOL Quarterly, 13, 219-227 (1979).

[91]Stevenson, D.K.Pop validity and performance testing, 1985.

[92]Spolsky, ibid, p.33-34.

[93]For example, Hughes, A. Testing for language teachers. (Cambridge, Cambridge University Press, 1989), p.15.

[94]Rea, P. ‘Language testing and the communicative language teaching curriculum’, in Lee, Y.P. et al. New directions in language testing. (Oxford. Institute of English, 1985).

[95]Omaggio, A.C. Teaching language in context: Proficiency-orientated instruction. (Boston, Massachusetts,  Henle and Henle, 1986), p.312-313.

[96]Alderson, J.C. ‘Who needs jam?’, in Hughes, A. and Porter, D. Current   developments in language testing. (London, Academic Press, 1983), p.89.

[97]Politzer, R.L. and McGroarty, M. ‘A discrete-point test of communicative competence.’ International Review of Applied Linguistics, 21 (3), 179-191 (1983).

[98]Widdowson, H.G. ‘Skills, abilities, and contexts of reality.’ Annual Review of Applied Linguistics, 18, 323-333 (1998), p.329.

[99]Spolsky, B. Conditions for second language learning. (Oxford,  Oxford University Press , 1989), p.61.

[100]Rutherford, W.E. Second language grammar: Learning and teaching, 1987.

[101]Spolsky, ibid.

[102]This raises the contentious issue of separating “semantics” from “pragmatics” (see Hudson 1984). From the point of view of the ideational (or conceptualising) function of language, which is what most of language processing is concerned with, or should be concerned with, far more demands are made on semantic and syntactic encoding than on the communicative act itself, which, after all, is only the last stage of language in action – unless one speaks before one thinks (Widdowson 1998:330).

[103]Cummins, J. ‘The cross-lingual dimensions of language proficiency: Implications for bilingual education and the optimal age issue.’ TESOL Quarterly, 14 (2), 175-87 (1980).

_________Language proficiency and academic achievement, 1983.

_________  Wanted: A theoretical framework for relating language proficiency to academic achievement among bilingual students, 1984.

[104]Widdowson, H.G. ‘Skills, abilities, and contexts of reality.’ Annual Review of Applied Linguistics, 18, 323-333 (1998), p.326.

[105]Bernstein, B. Class, codes and control, 1971.

[106]Wald, B. A sociolinguistic perspective on Cummins’ current framework for relating language proficiency to academic achievement, 1984, p.57

[107]Ur, P. A course in language teaching: practice and theory, 1996.

[108]Saville-Troike, M. ‘What really matters in second language learning for academic achievement.’ TESOL Quarterly, 18 (2), 199-219 (1984), p.199.

[109]Upshur, J.A. ‘English language tests and predictions of academic success’, in Wigglesworth, D.C. (ed.). Selected conference papers of the Association of Teachers of English as a Second Language. Los Altos, California,  National Association for foreign Student Affairs (NAFSA) Studies and Papers, English Language Series 13, 85-93 (1967), p.85.

[110]Valette, R.L. Modern language testing: A handbook, 1969, p.5.

[111]Carmines, G. and Zeller, A. Reliability and validity assessment, 1979, 15.

[112]American Psychological Association.Standards of educational and   psychological measurement. 1974.

[113]Clark, J.L.D. Theoretical and technical considerations in oral proficiency testing, 1975, p.28.

[114]Spolsky, B. ‘The limits of authenticity in language testing.’ Language Testing, 2, 31-40 (1985), p.33-34.

[115]Davies, A. Principles of language testing, 1990), p.44.

[116]Ibid, p.7.

[117]Yeld, N. ‘Communicative language testing and validity.’ Journal of Language Teaching, 21 (3), 69-82 (1987), p.78.

[118]Stevenson, D.K.Pop validity and performance, 1985, p.112.

[119](1)American Psychological Association. 1974. Standards of educational and   psychological measurement. (Washington D.C,, American Psychological Association, 1974).

(2)Cronbach, L.J. Essentials of psychological testing. (New York: Harper and Row, 1970).

(3) Gardner, R.C.  and Tremblay, P.F. ‘On motivation: measurement and   conceptual considerations.’ The Modern Language Journal, 78 (4), 524-527 (1994).

(4) Stevenson, D.K.’Pop validity and performance testing’, in Lee, Y., Fok, A., Lord, R. and Low, G. (eds.). New directions in language testing. (Oxford, Pergamon, 1985).

[120]Cronbach, ibid, p.183.

[121]Gardner and Tremblay, ibid, p.525.

[122]Messick, S.Constructs and their vicissitudes in educational and psychological measurement, 1989a, p.1.

[123]Messick, S. Validity, 1989b, p.1-2

[124]Messick, S. Meaning and values in test validation: The science and ethics of measurement, 1988, p.2.

[125]Messick, S. Constructs and their vicissitudes in educational and psychological measurement, 1989a, p.1.

[126]In Item Response Theory (IRT) a major issue is the unidimensionality assumption that all items in a test measure a single ability. If, however, language “competence” is multidimensional (see Bachman 1990a), the assumption that all items hang together might not be correct. From the psychometric point of view, though, the unidimensionality assumption does not, as Henning (1992) argues, preclude the psychological basis of multidimensionality (see Douglas 1995).

[127]Of course, there are many second language users who have a far better command of academic discourse than mother-tongue  users. This is so because the ability to understand and produce academic discourse depends on much more than “linguistic ability”: it also depends on CALP and academic intelligence (Gamaroff 1995c, 1996b, 1997b)

[128]Ebel, R.L. ‘Must all tests be valid?’ American Psychologist, 16, 640-647 (1961), p.645.

[129]Loevinger, J. Objective tests as instruments of psychological theory, 1967, p.93.

[130]Bachman, L.F. Fundamental considerations in language testing, 1990b, p.253.

[131]Ibid.

[132]Messick, S. Constructs and their vicissitudes in educational and psychological measurement. (Princeton, New Jersey, Educational Testing Service, 1989a).

[133]Cronbach, L.J. Essentials of psychological testing, 1970, p.122.

[134]Weir, C.J. Communicative language testing, 1988, p.30

[135]Lado, R. Language testing, 1961, p.324.

[136]Davies, A. Principles of language testing, 1990, p.3.

[137]This problem is indicative of the much larger problem of the indeterminacy of language (and hence also of epistemology) itself. “There is no perfect hypothetical language, to which the languages we have are clumsy approximations” (Harris 1981:175). And this must inevitably lead the applied linguist to grapple with the slippery notions of native speaker and mother-tongue speaker. I discuss this issue further in section 6.1.1.

[138]Spolsky, B. Measured words, 1995, p.358.

[139]Ibid.

[140]Yeld, N. Communicative language testing. Report on British Council Course 559 offered at Lancaster University from 8 September to 20 September 1985. (Cape Town, University of Cape Town, 1986), p.31.

[141]According to Perkins (1983:655),“raters, guided by…holistic scoring guides…, can achieve a scoring reliability as high as .90 for individual writers.” Indirect objective tests such as multiple choice grammar and vocabulary tests, on the other hand, can have reliability coefficients as high as .99, because there is no problem of rater reliability involved, i.e. subjective judgements will not affect the scores (Hughes 1989:29).

[142]Bachman, L.F. Fundamental considerations in language testing. (Oxford: Oxford University Press, 1990b), pp.116ff, 168-172, 244.

[143]Ebel, R.L. and Frisbie, D.A. Essentials of educational measurement, 1991, p.76.

[144]Sammonds, P. Ethical issues and statistical work, 1989, p.53.

[145]Pinker, S. The language instinct, 1995, p.401.


Chapter 1: Scope of the Study

1.1  Introduction: The  problem and purpose of the study

1.2  Psychometrics and norm-referenced testing

1.3  Summative assessment

1.4  The One Best Test

1.5  Hypotheses of the study

1.6  Historical and educational context

1.7  Measures used in the study

1.8  Method overview

1.9  Preview of Chapters 2 to 6

1.10 Summary of Chapter 1

1.1 Introduction: The  problem and purpose of the study

Language testing draws on three areas: (1) the nature of language, (2) assessment and (3) language ability.[1] Language ability is closely related to language proficiency, which is a key term in this study. (The relationship between ability and proficiency is explained in section 2.2 ff).

    Central concepts in the measurement of language ability are: (1) validity (what one is measuring), (2) reliability (how one is measuring), (3) practicability (economics of time and expense) and (4) accountability (why one is testing). If a test is not practicable, even if  judged to be valid and reliable, it would be uneconomical and accordingly of little use.

Owing to our ignorance of the processes of language learning and of learning processes in general, much of what we know about language testing, and therefore also about teaching, remains tentative. (“Language testing is rightly central to language teaching”[2]). A major obstacle in test development has been the lack of agreement on what it means to know a language, on what aspects of language knowledge should be tested – and taught – and how they should be tested and assessed. Accordingly, it is not always easy to know why one is testing. As far as the how and the what are concerned, the more explicit, i.e. the more accurate, or reliable, one tries to make a test, the more uncertain one becomes about (the validity of) what one is testing.

The study consists of  three parts:

Part I. A large part of this study is devoted to statistical measurement in assessment, specifically the relationship between communicative competence and traditional testing theories insofar as these shed light on (1) the construct of language proficiency, and (2) the use of proficiency tests as predictors of academic achievement. Statistical measurement in scoring procedures is also closely related to the structure and administration of tests. This study re-examines and defends this interdependence in terms of the three key notions in testing: validity, reliability and practicality.

Part II. The prediction of academic achievement where English proficiency tests are used as the predictors. A longitudinal study is undertaken of the prediction of academic achievement from Grade 7 to Grade 12.

Part III. Implications of the study for procedures of assessment of language proficiency.

The educational context of this study is a High School in the North West Province, which will be referred to as MHS, where I taught and did language research for over seven years (January 1980 to April 1987).

The primary focus of the study is not on what “really matters” in first or second language proficiency and academic achievement[3], or, put another way, on “develop[ing] an adequate theoretical framework for relating language proficiency to academic achievement”.[4] Neither does it deal with individual differences in cognitive styles of learning.[5] Nor does it dwell on the many causes of academic failure. The longitudinal part of the study is much more about predictions than about causes of academic failure. I do, however, refer to causes of academic failure where relevant. For example, one of the causes of academic failure (and success!) that this study is particularly interested in is the lack of scoring consistency among raters, which is arguably the greatest problem in assessment.

There are two urgent needs in minority education:

(1) to pursue fundamental research on the nature of language proficiency and how it can be measured, and (2) to provide teachers with up-to-date knowledge of language proficiency assessment so they can improve their classroom assessment practices.[6]

The term minority has much more than a numerical meaning. In South Africa the majority of learners use English as an additional/second language, but the tradition is to refer to such learners as originating from minority language backgrounds. The term has an obvious discriminatory ring that implies that some acceptable level has not yet been reached. Yet, tests have to distinguish between levels of proficiency for them to have construct validity. Chapters 3, 4 and 5 deal with the statistical issues of assigning people to the same or different groups, and Chapter 6 deals with the educational and political implications of doing so. Two main levels of proficiency are examined, which are given the well-known labels of L1 and L2. The L1 and L2 labels are central to the study and are used differently from the normal connotation of users of a first language and learners of a second language, which I shall explain shortly. These well-known terms together with the terms mother tongue and native language have been the occasion of much controversy. I discuss this controversy in section 6.2. In the empirical investigation the labels L1 and L2 will refer to the sample of subjects (i.e. informants) who take the subject English as a First Language and English as a Second Language, respectively, at MHS. This definition of L1 and L2 needs to be kept in mind throughout the study. In Chapter 6 (section 6.2), various other definitions of “L1” and “L2” are examined.

The sample of subjects is described in detail in Chapter 3. Tables 3.1 and 3.2 give a clear explanation of the sample of subjects and so it might be useful to continually refer to these two figures, which serve as guideposts to the description of the sample.

Psychometric questions, and discrete-point and integrative testing have been discussed at length in the literature on language testing and assessment (the distinction between testing and assessment is explained shortly). So one may ask why the need to, and where is the merit and originality of, devoting a PhD to such old and outdated issues, which have not been a research issue in language assessment for over 15 years? There is indeed a pressing need because although communicative methods have become the prevalent form of testing with a “richer conceptual base for characterizing the language abilities to be measured, it has presented language testers with a major challenge in defining these abilities and the interactions among them with sufficient precision to permit their measurement.”[7] It is this problem of reconciling authentic subjectivity and objective precision that is the major problem in testing, indeed the major problem of cognition and language.[8] The authenticity issue has wider ramifications. It is not only central to testing but also to syllabus design and materials development. This study focuses on testing only.

In spite of decades of attempts to define it, the how[9] and the why[10] of language proficiency remains a conundrum. Although we may no longer stand before an “abyss of ignorance”[11] and may be able to agree with Alderson (in Douglas[12]) that language testing has “come of age”, there are still many problems in language testing, the greatest one being, I suggest, the problem of reliability[13] and specifically rater reliability (see Alderson and Clapham’s case studies of this problem[14]). There are two kinds of rater reliability: interrater reliability and intrarater reliability. These are dealt with in section 2.9.1.

Owing to our ignorance of the processes of language learning and of learning processes in general, much of what we know about language testing, and therefore also about teaching, remains tentative. (“Language testing is rightly central to language teaching”[15]). A major obstacle in test development has been the lack of agreement on what it means to know a language, on what aspects of language knowledge should be tested – and taught – and how they should be tested and assessed.

This problem is not a surprising one because language is closely connected to human rationalities, imaginations, motivations and desires, which comprise an extremely complex network of biological, cognitive, cultural and educational factors. As a result, all language testing theories are inadequate owing to the difficulties involved in devising tests that test authentic language reception and production. This does not mean that we should stop measuring until we’ve decided what we are measuring. We do the best we can by taking account of generally accepted views of the nature of language proficiency, both modern views and dated ones. In the modern literature on testing there seems to be an overemphasis on up-to-date theories, which gives the impression that “what is dated is outdated”.[16] Widdowson’s up-to-date admonition that we should take more seriously dated views is taken to heart in this study.

What is a test? It is “the most explicit form of description, on the basis of which the tester comes clean about his/her ideas”.[17] What all testers are looking for are systematic elicitation techniques on which one can base useful decisions. The three underlying issues in testing are: to infer abilities, to predict performance and to generalise from context to context[18].  This means that tests should be valid, reliable and practicable. Communicative testers would add the notions of “impact” (i.e. face validity) and “interactionist”.[19] Opponents of discrete-point tests (such as grammar tests) and integrative tests (such as cloze tests and dictation tests) would probably concede that such tests are reliable and practical, but they would argue that they are not valid, i.e. they tell us little or nothing about the learner’s knowledge of authentic language. I shall argue that, on the contrary, they indicate a great deal about authentic language and that these old issues are not outdated and are still worthy of attention.

I use an analysis of a battery of English proficiency tests to substantiate my theoretical position. I then examine the predictive validity of the battery of English proficiency tests. Part of the predictive investigation involves a comparison between the predictive validity of the school reports from former schools of entrants to the School with the predictive validity of the English proficiency tests. These reports were the main criterion for admission to the School. Few entrants with School report aggregates under 60% were admitted.

A curriculum framework consists of the following components[20]:

–  Needs analysis

–  Objectives

–  Materials

–  Teaching

–  Testing

Accordingly, the curriculum is concerned with the syllabus as well as everything to do with pedagogical matters, i.e. teaching what to whom, when and how.[21] Syllabus is defined as the content and sequence of content of the programme selected in order to make learning and teaching effective.[22] Although testing is the last component in the curriculum framework, this is only so chronologically, and not logically, because testing permeates the whole of the curriculum. This is the reason why there is the possibility – and the temptation; perhaps justifiably so – of teaching to the test.

A major part of testing is concerned with assessment. In this study I use the term tests to refer to “elicitation techniques”[23] and the term assessment to refer to the methods used to measure or analyse test results. In assessment one is concerned with the control of rater judgements and scoring techniques. (Assessment is discussed in detail in section 1.3). Thus assessment, quantitative or qualitative, is not an intrinsic property of a test but a method one uses to measure or analyse the test results.

There are a variety of language test uses (Pollitt in Yeld[24]). The basic four uses are mentioned[25]:

–  Proficiency tests, which evaluate present knowledge in order to predict future achievement, usually at the beginning of a course of study. Proficiency tests are based on knowledge that has been gained independent of any specific syllabus but not independent of typical syllabuses because the knowledge to be tested must have been gained from some syllabus or other.

– Achievement tests, which evaluate how much has been learnt of a particular syllabus, where the focus is on success, usually at the end of a teaching programme.

– Diagnostic tests, which evaluate points not yet mastered, where the focus is on failure and therapy. Diagnostic tests, therefore, may be considered to be the reverse of achievement tests.[26] Proficiency tests often involve diagnosing items that have not been mastered, and therefore diagnostic testing may be part of proficiency testing.

– Aptitude tests, which evaluate abilities for language mastery, and are thus, like proficiency tests, of predictive value. Unlike the three other kinds of test uses, aptitude tests have no specific or general content, and are thus difficult tests to compile. They require, arguably, the most knowledge and care in their construction and application, for it is far worse to be told that one has no aptitude than to be told that one has low proficiency or has failed an achievement test. No aptitude means no hope at all, unless it is possible to have potential without aptitude.

Thus, both proficiency and achievement are concerned with present knowledge. It may be that a proficiency test contains material previously contained in an achievement test, but this difference is irrelevant to the validity of the proficiency test because, unlike an achievement test, a proficiency test is not concerned with whether the content of a test was previously taught. The American Council of the Teaching of Foreign Languages (ACTFL) Proficiency Guidelines are a case in point:

Because these guidelines identify stages of proficiency, as opposed to achievement, they are not intended to measure what an individual has achieved through specific classroom instruction but rather to allow assessment of what an individual can and cannot do, regardless of where, when, or how the language  has been learned or acquired: thus the words learned and acquired are used in the broadest sense.[27]

Proficiency is concerned with what somebody knows and can do here and now. Achievement should be ultimately concerned  with proficiency as well. That is why Spolsky[28] omits the term achievement in the following definition of language tests: “Language tests involve measuring a subject’s knowledge of, and proficiency in, the use of language.”

There are four important considerations in language testing[29]:

1.  How valid is the test?

2. How easy is it to compose?

3. How easy is it to administer?

4. How easy is it to mark?

The first consideration, which is concerned with the purpose of a test, is the most important theoretical issue in testing. Ur feels so strongly about practicability that her next three considerations for choosing a test have to do with practicability. The fourth is also related to rater reliability. A test may be everything communicative testers require, but it would still be no good if it took too long or was too difficult to administer and assess. The more objective the test, the less the danger of rater unreliability. An essay test is a supreme example of a subjective test, because it is vulnerable to fluctuations in judgements between raters. The problem is finding the appropriate balance between the different testing considerations, where it is difficult, indeed impossible, to give all of them equal prominence.

The reason why the value of discrete-point tests and many integrative tests such as cloze tests and dictation tests must be reassessed is mainly their practicality; if only one could solve the problem of the authenticity of these tests. The basic problem is whether indirect tests such as grammar tests, cloze tests and dictation tests can predict real-life performance, which many authors (these authors are discussed at length in the study) equate with authentic language, and thus reject the notion that an indirect elicitation procedure of real-life language can be authentic. Of course, this problem of authenticity of indirect tests is not limited to language tests but applies to all kinds of indirect tests, e.g. intelligence tests. A major part of this study is concerned with the meaning of authenticity in testing.

Every test is an operationalisation of certain beliefs and values about language, whether the test is called authentic or not. These beliefs and values determine to a certain extent our mental and emotional reactions to language and to knowledge in general. What is required in this study is to justify the beliefs I hold about discrete-point tests, integrative tests and the necessary psychometric methods of assessment that they imply.

1.2 Psychometrics and norm-referenced testing

In language testing the opposition to psychometrics is closely connected to the “suspicion of quantitative methods”[30] and the opposition to “reductionist approaches to communicative competence”.[31]

The history of the quantitative/qualitative controversy can be viewed from two diametrically opposite angles: (1) qualitative research has been dominated by quantitative research for many decades and is only in recent years becoming accepted as a legitimate scientific approach[32] or (2) quantitative research has been for more than two decades challenging qualitative methods and also setting itself up as the only legitimate form of research.[33]

Galton’s view is that it is the scientists’ job to “devise tests by which the value of beliefs may be ascertained, and to feel sufficiently masters of themselves to discard contemptuously whatever may be found untrue”[34] (Rushton 1995; his frontispiece). For Galton tests must be statistically validated. Although I do not share Galton’s sweeping faith in statistics, statistical measurement in language testing has been given an undeserved bad press, e.g. Spolsky[35], Lantolf and Frawley[36] and Macdonald[37].

The increasing number of studies in  purely ethnographical/sociolinguistic approaches to language proficiency assessment[38] is witness to the opposition  to the objectivist, or positivistic, or reductionist methods of psychometric research. For Harrison, psychometric measurement is inappropriate due to its subjective nature:

Testing is traditionally associated with exactitude, but it is not an exact science…The quantities resulting from test-taking look like exact figures – 69 per cent looks different from 68 per cent but cannot be so for practical purposes, though test writers may imply that they are distinguishable by working out tables of precise equivalences of test and level, and teachers may believe them. These interpretations of scores are inappropriate even for traditional testing but for communicative testing they are completely irrelevant. The outcome of a communicative test is a series of achievements, not a score denoting an abstract `level’.[39]

Thus, “the quantities resulting from test-taking [which] look like exact figures” (in the quotation above) appear to measure objectively, but in fact they measure subjectively. (See Morrow[40] for a similar view). Lantolf and Frawley[41] maintain that

[w]hat must be done is to set aside the test-based approach to proficiency and to begin to develop a theory of proficiency that is independent of the psychometrics. Only after such a theory has been developed and is proven to be consistent and exhaustive by empirical research should we reintroduce the psychometric factor into the picture, with the full realization that such a reintroduction may not be possible, given our earlier remarks on the scalability of human behavior.

Spolsky was contemptuous of “psychometrists”:

In the approach of scientific modern tests, the criterion of authenticity of task is generally submerged by the greater attention given to psychometric criteria of validity and reliability. The psychometrists[42] are ‘hocus-pocus’ scientists in the fullest sense; in their arguments, they sometimes even claim not to care what they measure provided that their measurement predicts the criterion variable: face validity receives no more than lip service.[43]

Spolsky’s recent “postmodern” approach to psychometrics is that it should be used in conjunction with “humanist” approaches.[44] The view in this study, which, for some researchers is taken for granted but for others is highly contested, is that “language testing cannot be done without adequate statistics”.[45]

Psychometric assessment in language assessment traditionally means norm-referenced assessment, which is not concerned with individual scores but with the dispersion of scores within a group, where the concern is with maximising individual differences between test takers on the variable that is being measured.[46] For many, psychometrics has become synonymous with quantitative measurement and statistical measurement[47], and this is how I use psychometrics in this study.

The perennial problem in language testing, indeed of all testing, is finding a balance between reliability and validity. Spolsky’s coupling of validity with psychometrics may create the impression that authenticity and validity are separate issues. Of course, there is much more to validity than validity coefficients. The major issue in validity has to do with specifying what authentic tests are. Thus, psychometric data only become digestible  – and palatable – when theory gives meat to the number-crunching.

The term psychometric tests has another meaning. For example, at a conference on academic development, where I presented a paper on this topic[48], a member of the audience said that she was “boiling” because psychometrics, she insisted, was far more than norm-referenced measurement. Her view was that psychometric tests measured the psyche (the literal meaning of psychometrics) and were embedded, she insisted (correctly), in the flesh-and-blood context of individuals, and that therefore psychometric tests involve far more than a comparison between individuals and groups. I was taken aback by her outburst, not because she did not have a valid definition of psychometrics, but because this definition, owing to the different context, had never entered my mind, no doubt because of my maximum attention to what interested me.

The method used in this study is mainly quantitatively based, where the emphasis is on norm-referenced testing. In language testing, as in second language acquisition research in general, quantitative measurement has been challenged for more than two decades by qualitative methods of research: indeed, qualitative measurement has been setting itself up as the only legitimate form of research.[49] Terre Blanche distinguishes between “two different constituencies” of qualitative researchers:

those who would use qualitative methods as a humanist, emancipatory tool to access authentic subjective experiences so easily censored out by more hard-nosed quantitative methods, and those who want to use qualitative methods such as discourse analysis to critique the semantic practices of both ‘scientific’ and ‘humanist’ psychologies.[50]

Both constituencies reject the domination of the norm over the individual.

“Norm” has two meanings, which are sometimes not distinguished. For example, at the conference of the National Association of Educators of Teachers of English (NAETE) at Potchefstroom (September 17-18, 1998) I was discussing the concept of “norm” with Johan van der Walt of Potchefstroom University of Christian Education. We were in one accord that the individual without the norm is an abstraction. It was only while Van der Walt was presenting his paper that I realised that we were using the same term for two distinct concepts.

In this regard, consider the following extract from a  repartee between Johan van der Walt and Colyn Davey at the 1998 National Association of Educators of Teachers of English (NAETE) conference[51] that was concerned with the topic of establishing norms of English.

Van der Walt.  Then you agree that there should be a norm.

Davey. Yes but learners should be able to choose the norm they prefer.

The question arises whether it is possible to use the term “norm” in the sense of both (1) Van der Walt’s imperative of conforming to a standard, or norm, by which he means “Standard” English, and (2) Davey’s imperative of freedom to choose the norm that one prefers, which could be “Standard” English, or, say, institutionalised black South African English (IBSAE), which would comprise ubiquitous constructions such as “I am having a problem”, “He write English perfectly” and “When I was in Town I see my English teacher”.[52] It is indeed possible to use the term “norm” in both these senses, but if this is done, the two meanings of “norm” should be clearly distinguished, which was not the case in the discussion between Van der Walt and Davey.

To clarify the distinction between these two meanings it is necessary to introduce the notions of criterion-referenced and norm-referenced tests. Criterion-referenced tests are concerned with how well an individual performs relative to a fixed criterion, e.g. how to ask questions. Norm-referenced tests are concerned with how well an individual performs compared to a group. This is traditional psychometric testing. Now, both Van der Walt and Davey believe in norms; the former a standardised norm, the latter an unstandardised norm. So, in fact, both kinds of norm are concerned with how well an individual performs relative to a fixed criterion, which is what criterion-referenced tests are concerned with. The difference is that Van der Walt’s norm is imposed from above (the standardised norm), whereas Davey’s norm is bottom up, where the group chooses the norm it wishes to aspire to. Neither of these kinds of norm has anything to do with norm-referenced tests, which are not concerned with whether one is able to choose one’s norm (Davey) or not (Van der Walt), but, as defined above, with how well an individual performs in a group, i.e. with the difference in ability revealed in some form of “quantitative summary”.[53]

Norm-referenced tests, which are a key notion in this study, can be distinguished from criterion-referenced and individual-referenced tests:

1. Norm-referenced tests are concerned with how well an individual performs compared to a group which he or she is a member of. This is traditional psychometric testing.

2. Criterion-referenced tests are concerned with how well an individual performs relative to a fixed criterion, e.g. how to ask questions. This is what Cziko calls “edumetric” testing.[54]

3. Individual-referenced tests are concerned with how individuals perform relative to their previous performance or to an estimate of their ability.

Strictly speaking it is not the test that is norm-referenced or criterion- referenced or individual-referenced, but the purpose for which it is used. Similarly, tests in themselves are not valid, but rather it is the purpose that they are used for that makes them valid. (Validity is discussed in section 2.8).
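
To make the three reference frames concrete, here is a minimal sketch in Python with entirely hypothetical scores; it illustrates the distinctions only and does not reproduce any procedure used in this study. The same raw score is read against a fixed criterion, against the dispersion of scores in a group, and against the learner’s own earlier performance.

```python
# Minimal illustration of the three reference frames; all scores are hypothetical.
from statistics import mean, stdev

group_scores = [42, 55, 61, 48, 70, 66, 53, 59, 64, 58]  # hypothetical class scores (%)
learner_now, learner_before = 61, 50                     # hypothetical individual scores
criterion = 50                                           # hypothetical fixed pass mark

# Criterion-referenced: performance relative to the fixed criterion only.
print("Criterion-referenced:",
      "meets the criterion" if learner_now >= criterion else "below the criterion")

# Norm-referenced: performance relative to the dispersion of scores within the group.
z = (learner_now - mean(group_scores)) / stdev(group_scores)
at_or_below = sum(s <= learner_now for s in group_scores) / len(group_scores)
print(f"Norm-referenced: z = {z:.2f}; scores at or above {at_or_below:.0%} of the group")

# Individual-referenced: performance relative to the learner's previous performance.
print(f"Individual-referenced: gain of {learner_now - learner_before} percentage points")
```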

The idea that norm-referenced tests, on the one hand, and criterion-referenced tests and individual-referenced tests, on the other, are mutually exclusive is based on two contrasting philosophical positions: the former “positivistic”, the latter “humanistic”. The former is interested in what makes people different; the latter finds it morally reprehensible to compare people and thus focuses on what they have in common. The latter view is represented most vocally in South Africa by the protagonists of “outcomes-based education”, where “learner’s progress will be measured against criteria that indicate attainment of learning outcomes, rather than against other learners’ performances”.[55]

I shall argue that norm-referenced tests are indispensable. Norm-referenced tests are important because without data on the variance between individuals within a group, it is not possible to separate what an individual knows (which is the concern of criterion-referenced tests) from what other people know.

Individual-referenced tests also cannot be separated from what other people know. The differences between individuals actually clarify the matter under test. In other words, the construct validity of a test is dependent on some people doing well and others doing less well, for if everybody did equally well, we would have little idea of what we were testing. That is the norm-referenced view, which is based on the notion that differences in nature, of which human abilities are a part, are distributed as a bell curve. In what can be regarded as a key quotation in the defence of psychometrics in this study, Rowntree explains the importance of group statistics, i.e. norm-referenced tests, in assessment:

Consider a test whose results we are to interpret by comparison with criteria. To do so we must already have decided on a standard of performance and we will regard students who attain it as being significantly different from those who do not…The question is: How do we establish the criterion level? What is to count as the standard? Naturally, we can’t wait to see how students actually do and base our criterion on the average performance of the present group: this would be to go over into blatant norm-referencing. So suppose we base our criterion on what seems reasonable  in the light of past experience? Naturally, if the criterion is to be reasonable, this experience must be of similar groups of students in the  past. Knowing what has been achieved in the past will help us avoid setting the criteria inordinately high or low. But isn’t this very close to norm-referencing? It would even be closer if we were to base the criterion not just on that of previous students but on students in general.[56]

What we think is going on in each individual’s invisible mind can be scientifically inferred and described only when one has some idea of what is going on in many individual minds, i.e. what is going on in a group. Emphasising the individual over the group or vice versa is “somewhat metaphysical [because both] types of test sampling (for that is what norm and criterion referencing do: they sample) need one another”.[57] Recall the two meanings of “norm” discussed earlier: “norm” can refer to a criterion (!) such as Standard English or to the comparison between individuals within a group. The latter is the concern of norm-referenced tests.

Ranking individuals and generating scores is only one purpose of norm-referenced tests; another purpose is to gain insight into the nature of the constructs under examination, which cannot be achieved if an individual is not compared with what other individuals do. (I elaborate on the complementary role of norm-referenced and criterion-referenced tests in the discussion of the inadequacies of correlational data in section 4.7).

1.3  Summative assessment

There are different kinds of assessment and a diversity of definitions. I  discuss the descriptions of Rea[58] and Rowntree.[59]

TABLE  1.1

Rea’s Schema of Assessment


                       Formative Assessment               Summative Assessment
Quantitative Methods   Assessment                         Evaluation
Qualitative Methods    Appraisal

Rea[60] uses the term “evaluation” to refer to formal testing activities, which are external to the teaching situation, and which involve “test scores”. She uses the terms “assessment” and “appraisal” to refer to activities which are internal to the teaching program. Grades are given for assessment but not for appraisal. In Rea’s scheme, assessment and evaluation both use  measurement, i.e. quantitative methods, while appraisal does not.

Before I comment further on Rea, it is appropriate to say something about “evaluation”, a term used by both Rea and Rowntree. There are many different definitions of evaluation. Bachman[61] defines evaluation as the “systematic gathering of information for the purpose of making decisions” while Brown defines evaluation as

the systematic collection and analysis of all relevant information necessary to promote the improvement of a curriculum, and assess its effectiveness and efficiency, as well as the participants’ attitudes within the context of the particular institutions involved.[62]

Recently, “evaluation” has been contrasted with “grading”.[63] Dreyer argues that grading, i.e. summative tests, cause people to fail (see the conclusion to the study for comment; section 6.7).

To return to Rea. What may be confusing in her schema is that “assessment” is used generically to cover everything to do with testing as well as specifically to refer to “formative quantitative assessment”. Consider Rowntree’s schema:

TABLE 1.2

Rowntree’s Schema of Assessment and Evaluation

                       Assessment (Focus on learner)        Evaluation (Focus on teaching)
Quantitative Methods   Summative                            Summative
Qualitative Methods    Formative (Diagnostic appraisal)     Formative

Rowntree’s “assessment” – which he also uses in this generic way – is “put[ting] a value on something”, which translates into everything concerned with “obtaining and interpreting information” of any kind about another person in order to “[try] and discover what the student is becoming or has accomplished”.[64] Rowntree’s “formative (pedagogic) assessment” emphasises “potential”, while his “summative (classificatory) assessment” emphasises “actual achievement”.[65]

For Rowntree, “evaluation” is “an attempt to identify and explain the effects (and effectiveness) of the teaching.”[66] Rowntree’s “formative evaluation is intended to develop and improve a piece of teaching until it is as effective as it possibly can be…[s]ummative evaluation on the other hand, is intended to establish the effectiveness of the teaching once it is fully developed.”[67] Thus Rowntree’s “formative evaluation” is concerned with the washback effect of a syllabus and/or teaching programme, while “summative evaluation” is concerned with “terminal tests and examinations coming at the end of the student’s course, or indeed by any attempt to reach an overall description or judgement of the student (e.g. in an end-of-term report or a grade or class-rank)”.[68]

For both Rea and Rowntree, quantitative methods are used in summative assessment; the difference is that in Rea’s schema quantitative methods are also used in formative assessment, while in Rowntree’s schema quantitative methods are used in summative assessment only. In this study, I use Rowntree’s meaning of summative assessment to involve (1) quantitative methods of assessing (2) the learner only (and not the teacher). I am concerned with quantitative methods in summative assessment. An important point: although I am not concerned with assessing the teacher, I am very much concerned with how a teacher (i.e. a rater) assesses a learner; in other words with rater reliability, which consists of two major kinds of judgements: (1) the order of priority for individual raters of performance criteria (criteria such as grammatical accuracy, appropriateness of vocabulary and factual relevance) and (2) the agreement between raters on the scores that should be awarded if or when agreement is reached on how to weight different criteria.[69] Rater reliability is discussed in sections 2.9.1, 4.2, 4.7 and 5.5.
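
Because rater reliability recurs throughout the study, a minimal sketch may help fix the idea. The snippet below, using entirely hypothetical essay scores, computes one common index of interrater reliability: the Pearson correlation between two raters’ scores for the same set of protocols. It is illustrative only and is not the scoring procedure used in this study.

```python
# Interrater reliability as the Pearson correlation between two raters' scores
# for the same essay protocols; the scores are hypothetical.
import numpy as np

rater_a = np.array([62, 55, 70, 48, 66, 59, 73, 51])  # hypothetical scores from rater A
rater_b = np.array([60, 58, 68, 45, 70, 57, 75, 50])  # hypothetical scores from rater B

interrater_r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"Interrater reliability (Pearson r): {interrater_r:.2f}")
```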

Summative assessment is “terminal”, “formal” and “external”, and is only concerned with the beginning and end of a course, i.e. with classifying individuals in terms of numerical products, or scores.[70] Weir[71] believes that there is a more pressing need for research in formative testing as opposed to research in summative testing. There is still a pressing need for research in summative testing, even though mainstream language testing, in South Africa, at least, is taking a different turn, e.g. “outcomes-based education”. This is discussed in section 5.5.

1.4 The One Best Test

A major empirical problem in language testing is establishing  valid and reliable criteria for the assessment of  language proficiency, which is basically concerned with fluency and accuracy. Three important issues in language testing are:

1. The kinds of tests that should be used to assess levels of language proficiency.

2.  The relationship between statistical significance (numerical data) and their meaning (information).

3. Whether language proficiency tests can validly predict academic achievement.[72] Academic achievement in this study is represented by (1) end-of-year aggregate scores and (2) pass rate.

The above three issues are directly related to the search for the “best” test. In the 70s a major issue in language testing was whether it was possible to find the “One Best Test”. The “One Best Test” question is closely related to the question of whether language proficiency consists of a unitary factor analogous to a g  factor in intelligence, or of a number of independent factors. Bachman and Palmer relate the concerns they had 25 years ago:

[W]e shared a common concern: to develop the “best” test for our situations. We believed that there was a model language test and a set of straightforward procedures – a recipe, if you will – that we could follow to create a test that would be the best one for our purposes and situations.[73]

Yet, almost two decades ago, Alderson had already graduated from this kind of thinking and suggested that

regardless of the correlations, and quite apart from any consideration of the lack of face validity of the One Best Test, we must give testees a fair chance by giving them a variety of language tests, simply because one might be wrong: there might be no Best Test, or it might not have the one we chose to give, or there might not be one general proficiency factor, there may be several.[74]

The Unitary Competence Hypothesis (UCH), which is closely associated with the “best test” question, is a very important issue. High correlations between different kinds of tests show that the UCH, in its weak form, remains a force to be dealt with.[75] The weak form of the UCH adopts an interactionist approach between global and discrete components of language. Oller, in his famous paragraph, describes this approach:

[N]ot only is some sort of global factor dependent for its existence on the  differentiated components which comprise it, but in their turn, the components are meaningfully differentiated only in relation to the larger purpose(s) to which all of them in some integrated (integrative?) fashion contribute. (See  also Oller and Khan[76] and Carroll[77] for similar views).[78]

If one is no longer searching for that one Grand Unified Test (GUT), one should be still looking for good tests, indeed for the best tests available. This implies, I suggest, that what one is looking for has an “objective” reality, which, of course, does not mean that we can completely grasp it.[79]

If we have given up on finding or constructing that elusive (and illusory?) one best test, we are nevertheless looking, indeed are compelled to look, for a plurality of the best tests that we can find. The problem remains what tests to choose to test language proficiency, and ultimately, in this study, to predict academic achievement. A useful test is one that “correspond[s] in demonstrable ways to language in non-test situations.”[80] These non-test situations are described in the “new” paradigm of language testing as authentic, direct, real-life, natural(istic) or communicative. An important part of this study consists of a critical analysis of these terms.

1.5 Hypotheses of the study

The following three null hypotheses are investigated:

1. Discrete-point tests and/or integrative tests are not valid measures of levels of language  proficiency.

2. Discrete-point tests and/or integrative tests are not valid predictors of academic achievement.

3. Many of the reports (Grade 6) from former schools that were used as criteria for admission to MHS were not valid predictors of academic achievement. Many of the entrants with high Grade 6 report scores did not get beyond Grade 9 at MHS. I investigate the question of indiscriminate advancement in DET[81] (Department of Education and Training) schools in the light of the consistently poor Grade 12 results (“matric”) from most former DET schools over the years. The predictive validity of the tests is examined in an attempt to shed light on this question. (I shall henceforth refer to “DET schools” and not “former DET schools”, because at the time the investigation in this study was conducted, the DET was still in existence). A major issue in this study is the relationship between the predictive validity of the tests and the reliability (i.e. the consistency and accuracy) of these DET reports (section 5.5).

Although the study is not directly concerned with investigating mother-tongue[82] proficiency, it is referred to when required.

Most of the tests in this study belong to the “old paradigm”. I did not devise new tests because doing so was not germane to the objective of this investigation, which was to examine the validity, reliability and practicality of using traditional tests to predict academic achievement. The fact that most of these tests were already established tests meant that I had more time to devote to this objective.

Although it is possible that annual predictions between English proficiency and academic achievement would yield higher correlations than long-term predictions, the aim in this study is to try and find out what chance Grade 7 learners who entered the School in 1987 would have of  passing Grade 12.

1.6 Historical and educational context

MHS has already had 19 years’ experience of dealing with linguistic, cultural and educational problems, which are only now beginning to surface in many schools in South Africa. English is used as the single medium of instruction at the School. The School offered the Joint Matriculation Board (JMB) syllabus up to 1992, and the Independent Examinations Board (IEB) syllabus after 1992. MHS was the only state school in a wide area containing hundreds of DET secondary schools that offered the JMB syllabus. This study shows how DET pupils coped at such a school.

One problem that the School has been dealing with since its inception in 1980 is how to reconcile affirmative action with academic merit. By affirmative action I mean the endeavour to put right the  imbalances of the past, where the majority of South Africans was discriminated against on the basis of race. The School’s policy is to provide education for advantaged as well as disadvantaged learners, where the latter are given the opportunity to learn in an advantaged school situation. Disadvantaged learners are those who have suffered educational, social and economic deprivation – often caused by political injustice – and this was what the School also meant by the term. It is also, paradoxically, the School’s policy to accept learners only on merit, which was indicated by high scores on former school reports. One problem with affirmative action is that it is often difficult to marry the idea of redress and the idea of academic merit (high achievement, in this case), potential or aptitude.

This difficulty was evidenced by the School’s Prospectus of 1986, which informed parents that their children “are admitted solely on the basis of merit”; by “merit” the School meant high scores on reports from previous schools: “Candidates are considered on the basis of the results of an entrance examination and their previous school achievement.” Thus, the School’s intention was to select only those candidates who could cope with a JMB equivalent syllabus. Unfortunately, many learners dropped out along the way or were pushed out along the way by the system.

MHS’s policy was to use admission criteria. The tests in this study, although conducted after admission – during the first three days of the first school term – were partly concerned with the admission question because I wanted to find out whether those who were admitted on the basis of their former school reports should have been admitted. Owing to the recent abolition of admission tests in South African state schools there would no longer be any point in trying to find the best admission tests. But it would certainly still be useful to find out (1) whether those learners who had been admitted to the School had an adequate level of English proficiency to perform in a school where English was the medium of instruction and (2) whether their former school reports were authentic, i.e. accurate, reflections of this level. As far as I am aware, former school reports, as is the general practice in all schools in South Africa, are still considered by the School as an important indication of an entrant’s ability – if not a criterion for admission.

It would seem that the School’s criteria for admission would generally have pinpointed those candidates who could not cope at the School, but this did not happen. Of concern at the School was the large number of failures in Grades 7, 8 and 9 among the DET learners. At the School there were no automatic internal promotions through the system as is claimed to occur in many DET schools.[83] (This issue is dealt with in Chapter 7). When low achievers at the School failed they often left without repeating a year. Many who failed at the School, whether they repeated a year or eventually were asked to leave owing to failure, did not manage to get beyond Grade 9. Table 1.3 shows the Grade 9 pass rate for three intake years (1982, 1983 and 1987).

TABLE 1.3

Grade 9  Pass Rate

Row     Number of learners in Grade 7   Passed Grade 9   % Passes
1       36 (1982)                       13               36.1
2       67 (1983)                       25               37.3
3       81 (1987)                       49               60.5
Total   184                             87               47.3

Row 3 is the sample used in the prediction of academic achievement in this study. It excludes learners who passed a  Grade and then left the School before reaching Grade 9 (N=5), and includes learners who failed between Grades 7 and 9 but passed Grade 9 at a later stage (N=9). Samples 1 and 2 do not take this fact into account.
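
As a check on the arithmetic, the sketch below recomputes the percentages in Table 1.3 from the raw counts shown there; the counts are taken directly from the table and nothing else is assumed.

```python
# Recompute the pass-rate percentages reported in Table 1.3 from the raw counts.
cohorts = {"1982": (36, 13), "1983": (67, 25), "1987": (81, 49)}  # (entered Grade 7, passed Grade 9)

total_entered = total_passed = 0
for year, (entered, passed) in cohorts.items():
    total_entered += entered
    total_passed += passed
    print(f"{year}: {passed}/{entered} = {100 * passed / entered:.1f}%")  # 36.1, 37.3, 60.5

print(f"Total: {total_passed}/{total_entered} = {100 * total_passed / total_entered:.1f}%")  # 47.3
```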

Many learners who reached Grade 12 managed to achieve a matriculation exemption at MHS but obtained disappointing symbols, e.g. D and E symbols. Mcintyre remarks in this regard:

Observations at [MHS] over the last few years would tend to support Young’s[84] statement [about the low “language competence” of  Grade 12 learners]. However, at [MHS] the problem is slightly different in the sense that Matric students pass successfully but often with symbols that are disappointing. [85]

Mcintyre’s statement that “Matric students pass successfully” requires comment: although it is correct that there was a high Grade 12 pass rate at the School, this does not take into account the high failure rate between Grade 7 and Grade 9, which means that even though most Grade 12 learners passed, many others who started in Grade 7 dropped out along the way. This high failure rate is what had been occurring at the School since its inception in 1980. (I am concerned with the period 1980 to 1993). Table 1.4 shows the number of Grade 12 passes (1992) that originated from the group of Grade 7 learners (1987) used in this study.

TABLE 1.4

Grade 12  Pass Rate

Original number of learners in Grade 7 (1987)   Total Grade 12 passes from original Grade 7
79                                              39 (49.4%)

Table 1.4 takes into account those who failed and passed Grade 12 in the subsequent year (12 learners) and those who left the school during their schooling for reasons other than failure, for example, relocation.

The following criteria of admission to MHS provide important background information: Admission to the School was based on (1) the results of entrance tests administered in October of the previous year (1986) and (2) former school achievement as revealed by Grade 6 reports from former schools. The School’s criteria for admission to Grade 7 consisted of:

– Grade 6 reports from former schools (the aggregate).

–  A Culture Fair Intelligence Test.[86]

– An English proficiency test, which consisted of a short essay of about half a page. I was not involved in the administration or marking of this test and so had no information about this test.

– A mathematics proficiency test. As in the case of MHS’s English proficiency test, I had no information on this test.

The admission tests for the sample were written in October of the previous year (1986). The Grade 6 reports were considered by the School to be the most important criterion for admission. However, a few pupils were admitted with Grade 6 aggregates below 60%. Of the School’s admission criteria only the Grade 6 reports are used in this study. I was not able to obtain the scores of the School’s Grade 6 essay admission test. In any case, the admission essay test was marked by only one rater per protocol and thus there would have been no way of establishing the interrater reliability of the School’s essay test.

With regard to the culture-fair test data, the original intention was to include these in the predictive investigation, but owing to the problematic (scientific and political) nature of intelligence tests and the fact that the use of these tests as predictors would not be directly pertinent to the topic, I decided to exclude these tests from this investigation. Suffice it to say that L2 learners who score above average on intelligence tests tend to be better at formal second language learning – and first language learning.[87]

The School’s policy is that at least half of all admissions should consist of disadvantaged learners. Disadvantaged does not mean low scoring, because the School selects on the basis of good performance as indicated by former (Grade 6) school reports. These disadvantaged entrants come from DET Schools. The investigation will show that many of the Grade 6 reports (of former schools) that belonged to disadvantaged entrants were unreliable in the sense of being inconsistent with their language proficiency scores on the tests, in that the English aggregates on these Grade 6 reports were in many cases radically higher than the scores on the proficiency tests.

The full sample of subjects (N=86) is discussed in detail elsewhere, but for the moment I deal briefly with 70 subjects (Table 1.5). Compare the Aggregate and English scores of the Grade 6 reports of the following two groups of entrants to Grade 7 at the School, who comprise a major part of the sample of subjects used in this study:

(1) CM Primary School (N=33), which provided (at the time this research was conducted) most of the entrants who took English First Language as a subject at MHS. English was the official medium of instruction from Grade 1 at CM Primary School. The learners from this school were generally advantaged. As mentioned, disadvantaged learners are those who have suffered educational, social and economic deprivation.

(2) 28 DET schools (N=37), which provided the vast majority of the entrants who took English Second Language as a subject at the School. English was the medium of instruction from Grade 5 at DET schools. Entrants from DET schools were generally disadvantaged.

TABLE 1.5

Comparison of Grade 6 reports between CM Primary School

and 28 DET Schools (N=70)

                                                                                    Aggregate Grade 6     English Grade 6
                                                                                    Mean      STD         Mean      STD
CM Primary (N=33): mostly advantaged and English used as a First Language          68.9      8.8         72.5      8.4
28 DET Schools (N=34): mostly disadvantaged and English used as a Second Language  68.6      10.8        71.1      12.6
t Stat                                                                              -0.106                -0.550
t Critical two-tail                                                                 1.995                 1.995

The t-test in Table 1.5 shows that there was no significant difference between the two groups on either the Aggregate or the English scores, because the absolute value of the t Stat is smaller than the critical value in both cases.
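
For readers who wish to reproduce the comparison, the sketch below recomputes the two-sample t-test for the Grade 6 Aggregate from the summary statistics reported in Table 1.5 (means, standard deviations and the group sizes as printed there), assuming pooled variances, which is what the single two-tailed critical value suggests. Because the reported figures are rounded, the result only approximates the t Stat shown in the table.

```python
# Two-sample t-test from the summary statistics in Table 1.5 (Grade 6 Aggregate).
# Group sizes and summary figures are those printed in the table; rounding in the
# reported means and SDs means the result will only approximate the table's -0.106.
from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=68.6, std1=10.8, nobs1=34,   # 28 DET Schools
    mean2=68.9, std2=8.8, nobs2=33,    # CM Primary
    equal_var=True,                    # pooled variances, matching the single critical value of about 1.995
)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # |t| is far below the critical value: no significant difference
```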

This equivalence in Grade 6 report scores between these groups plays an important role in the arguments and predictions of subsequent chapters, where I investigate the reliability of the Grade 6 scores of the DET  entrants.

1.7  Measures used in the study

The measures used in the study are now briefly described.  The study only commenced after the intake of learners to the School, and thus these measures differed in purpose from the School’s criteria, which were admission criteria. A detailed description of the measures, e.g. format, instructions, layout,  is given in Chapter 3. For the moment I provide only a brief description of the measures:

I. English proficiency tests. Eight English proficiency tests were administered in January 1987 (Grade 7). I devised the essay tests myself, while all the other tests were obtained from various published sources. The English proficiency test battery consists of:

(i) Two cloze tests from Pienaar’s[88] “Reading for Meaning”.

(ii) Two dictation tests. These were two restored cloze tests from Pienaar.[89] The passages from Pienaar used for the cloze tests are different to the passages used for the dictation tests, but they both belong to the same level. (I explain later what I mean by “level”).

(iii) Two essay tests (devised by myself).

(iv) An “error recognition” test.[90]

(v) A “mixed grammar” test.[91]

The tests from Bloor et al. consist of multiple-choice items.  The “mixed grammar” test consists of  items that test a variety of structures, hence the term “mixed”.

I shall argue that although the tests used in the study, except for the essay test, may be out of fashion with many testers they are nevertheless still very useful for assessing language proficiency and predicting academic achievement.

II. Grade 6 end-of-year school reports from former schools. The Grade 6 aggregate scores are used.

III. End-of-year English scores and Aggregates from Grade 7 to Grade 11. These scores were obtained from the School’s mark schedules.

IV. Grade 12 results (of 1992 and 1993). These results are those of the JMB (1992) and the Independent Examinations Board (IEB; 1993). The reason why the IEB results are also taken into account is that included in the study are those subjects that failed once between Grade 7 and Grade 12, repeated a year and sat for the IEB Grade 12 examination in 1993. (The JMB matriculation examination ceased to exist after 1992).

1.8  Method overview

Some researchers separate statistical research from empirical research. For Lantolf and Frawley[92] empirical research and statistical measurement are distinct. In contrast, when Tremblay and Gardner[93] state that in their opinion “empirical investigation is essential to demonstrate the theoretical and pragmatic value” of research, their “empirical” research is firmly based on statistics, without which they would have very little of what they consider to be “empirical” research (see also Cziko’s[94] “empirically-based models of communicative competence”). Some empirical research is statistically based, while other empirical research (e.g. much of ethnographical research) is not. I adopt Tremblay and Gardner’s view that statistics is an indispensable component of empirical research.

The empirical investigation consists of:

1. An examination of the structure and administration of the English proficiency tests.

2. A predictive investigation, where the English proficiency tests are used to predict academic achievement from Grade 7 to Grade 12. The reliability of the Grade 6 achievement of entrants from former schools is also examined.

Under method  I subsume, as is usual in most studies, the following:

–  Subjects (sampling)

–  Structure of the measures.

–  Procedures of administration and scoring.

A  common design in empirical studies is that data analysis, results and discussion are each reported in separate sections. I depart from this traditional structure and follow Sternberg[95] who recommends that these be treated together. This is a wise arrangement for this study because the data analysis, discussion and results are closely connected.

A clarification of the following terms is in order: type, method, procedure. Sometimes type refers to such things as multiple-choice type questions versus gap-filling type tests; sometimes method refers to such things as cloze methods versus dictation methods, in other words, methods is used to mean tests. Then there is procedure, e.g. the cloze procedure, the dictation procedure, etc., which can also mean methods or tests. I shall use tests to refer to elicitation techniques, and procedure to refer to the way in which the test is presented and scored.

1.9  Preview of Chapters 2 to 6

Chapter 2 deals with theoretical issues in the testing of language proficiency and academic achievement, where the main focus falls on assessment. The chapter comprises a review of the literature on the testing of language proficiency and an overview of key concepts such as assessment, validity and reliability.

Chapter 3 describes the sample of subjects and sampling procedures, and the structure and administration of the tests.

Chapter 4 presents the results of the tests and discussion.

Chapter 5 deals with the prediction of academic achievement, examines the reliability of the Grade 6 reports from previous schools and summarises the findings.

Chapter 6 discusses the implications of the study for language testing and presents the conclusions. The three main implications are: (1) the viability of the distinction between first language and second language, (2) the kind of tests or tasks that should be used in the future, and (3) the problem of rater reliability. Also discussed are a few contemporary initiatives to improve language assessment in South Africa.

1.10  Summary of Chapter 1

The purpose, problem, main topics, method, hypotheses and educational context of the study were specified. The study deals with the measurement of differences between learners in English proficiency and with assessing the reliability, validity and practicality of discrete-point and/or integrative tests as predictors of academic achievement. Central to the study is the argument that the “old paradigm” of discrete-point and integrative tests and the statistical methods required to measure them are very useful in language acquisition research and educational measurement. The next chapter deals with the theory of language testing, where I examine what it means to describe language behaviour and language tests as “authentic”.

Endnotes

[1]Davies, A. Principles of language testing, 1990, p.4.

[2]Ibid., p.2.

[3]Saville-Troike, M. ‘What really matters in second language learning for academic achievement.’ TESOL Quarterly, 18 (2), 199-219 (1984).

[4]Cummins, J. ‘Wanted: A theoretical framework for relating language proficiency to academic achievement among bilingual students’, in Rivera, C. (ed.). Language proficiency and academic achievement, 1984.

[5](1) Diller, K.C. Individual differences and universals in language learning aptitude, 1981. (2) Skehan, P. Individual differences in second language learning, 1989. (3) Skehan, P. A cognitive approach to language learning, 1998.

[6]Rivera, C. The ethnographical/sociolinguistic approach to language proficiency assessment, 1983, p.xii.

[7]Bachman, L.F. ‘Assessment and evaluation.’ Annual Review of Applied Linguistics, 10, 210-226 (1990a), p.210.

[8]Lakoff, G., Women, fire and dangerous things, 1987.

[9]Bachman, L.F. Fundamental considerations in language testing, 1990b.

[10]Davies, A. Principles of language testing, 1990.

[11]Alderson, J.C. ‘Who needs jam?’, in Hughes, A. and Porter, D. Current developments in language testing, 1983, p.90.

[12]Douglas, D. ‘Developments in language testing.’ Annual Review of Applied Linguistics, 15, 167-187 (1995), p.176.

[13]Moss, P. ‘Can there be validity without reliability?’ Educational Researcher, 23 (2), 5-12 (1994).

[14]Alderson, J.C. and Clapham, C. ‘Applied linguistics and language testing: A case study of the ELTS test.’ Applied Linguistics, 13 (2), 149-167 (1992).

[15]Ibid., p.2.

[16]Widdowson, H.G. ‘Skills, abilities, and contexts of reality.’ Annual Review of Applied Linguistics, 18, 323-333 (1998), p.323.

[17]Davies, A. Principles of language testing, 1990, p.2.

[18]Skehan, P. A cognitive approach to language learning, 1998, p.153.

[19]Bachman, L.F. and Palmer, A.S. Language testing in practice, 1996, p.17.

[20]Brown, J.D. ‘Language programme evaluation: A synthesis of existing possibilities’, 1989, p.235.

[21]Stern, H. H. Fundamental concepts of language teaching, 1983.

[22](1) Wilkins, D.A. ‘Notional syllabuses revisited.’ Applied Linguistics, 2 (1), 83-89 (1981), p.83.

(2) Brumfit, C.J. ‘Notional syllabuses revisited: A response.’ Applied Linguistics, 2 (1), 90-92 (1981), p.90.

[23]Ur, P. A course in language teaching: practice and theory, 1996, p.37.

[24]Yeld, N. Communicative language testing, 1986, p.36.

[25](1)  Corder, S.P. Error analysis and interlanguage, 1981, p.20.

(2) Davies, A. Principles of language testing, 1990, pp.20-21.

[26]Davies, ibid., p.21.

[27]Byrnes, H. and Canale, M. Defining and developing proficiency: Guidelines, implementations and concepts, 1987, p.15.

[28]Spolsky, B. Conditions for second language learning, 1989, p.138.

[29]Ur, P. A course in language teaching: practice and theory, 1996, p.37.

[30]Davies, A. Principles of language testing, 1990, p.1.

[31]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.182.

[32]Lazaraton, A. ‘Qualitative research in applied linguistics: A progress report.’ TESOL Quarterly, 29 (3), 455-471 (1995), p.455.

[33]Magnan, S.S. Review of Creswell, J.W. 1994. Research design: qualitative and quantitative approaches, 1997.

[34]Rushton, J.P. Race, evolution and behaviour, 1995; his frontispiece.

[35]Spolsky, B. ‘The limits of authenticity in language testing.’ Language Testing, 2, 31-40 (1985).

[36]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988).

[37](1) Macdonald, C.A. English language skills evaluation (A final report of the Threshold Project), Report Soling-17, 1990a.

(2) Macdonald, C.A. Crossing the threshold into standard three in black education: The consolidated main report of the Threshold Project, 1990b.

[38](1) Bennett, A. and Slaughter, H. ‘A sociolinguistic/discourse approach to the description of the communicative competence of linguistic minority children’, in Rivera, C. (ed.). The ethnographical/sociolinguistic approach to language proficiency assessment, 1983.

(2) Jacob, E. ‘Studying Puerto Rican children’s informal education at home’, in Rivera, C. (ed.). The ethnographical/sociolinguistic approach to language proficiency, 1983.

(3) Phillips, S. ‘An ethnographic approach to bilingual language proficiency assessment’, in Rivera, C. (ed.). The ethnographical/sociolinguistic approach to language proficiency assessment, 1983.

[39]Harrison, A. ‘Communicative testing: Jam tomorrow?’, in Hughes, A. and Porter, D. (eds.). Current developments in language testing, 1983, p.84.

[40]Morrow, K. ‘Communicative language testing: Revolution or evolution’, in Alderson, J.C. (ed.). Issues in language testing, 1981, p.12.

[41]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.185.

[42]Psychometrist has two meanings: 1. a statistician, and 2. somebody with the paranormal power to find lost objects. I guess the double meaning is not lost on Spolsky.

[43]Spolsky, B. ‘The limits of authenticity in language testing.’ Language Testing, 2, 31-40 (1985), pp.33-34.

[44]Spolsky, B. Measured words, 1995, p.357.

[45]Davies, A. Principles of language testing, 1990, p.16.

[46]Cziko, G.A. ‘Improving the psychometric, criterion-referenced, and practical qualities of integrative testing.’ TESOL Quarterly, 16 (3), 367-379 (1982), pp.27-28.

[47](1) Spolsky, B. ‘The limits of authenticity in language testing.’ Language Testing, 2, 31-40 (1985).

(2) Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.185.

[48]Gamaroff, R. Psychometrics and reductionism in language assessment. Paper presented at the SAAAD/SAARDHE conference “Capacity-building for quality teaching and learning in further and higher education”, University of Bloemfontein, 22-24 September, 1998e.

[49]Magnan, S.S. Review of Creswell, J.W. 1994. Research design: qualitative and quantitative approaches, 1997.

[50]Terre Blanche, M. ‘Crash.’ South African Journal of Psychology, 27 (2), 59-63 (1997), p.61.

[51]Van der Walt, J. The implications for language testing of IBSA [Institutionalised Black South African English]. National Association of Educators of Teachers of English (NAETE) conference “Training teachers for the South African context”, Potchefstroom College of Education, September 17-18, 1998.

[52]Van der Walt’s paper followed immediately after the presentation of Makalela’s (1998) paper “Institutionalized Black South African English” (IBSAE) in which Makalela advocates that IBSAE be adopted as the norm among blacks in South Africa. The examples of IBSAE cited above are those given by Makalela.

[53]Messick, S. Validity, 1987, p.3.

[54]Cziko, G.A. ‘Improving the psychometric, criterion-referenced, and practical qualities of integrative testing.’ TESOL Quarterly, 16 (3), 367-379 (1982).

[55]Gultig, J., Lubisi, C., Parker, B. and Wedekind, V. Understanding outcomes-based education: Teaching and assessment in South Africa, 1998, p.12.

[56]Rowntree, D. Assessing students: How shall we know them, 1977, p.185.

[57]Davies, A. Principles of language testing, 1990, p.19.

[58]Rea, P. ‘Language testing and the communicative language teaching curriculum’, in Lee, Y.P. et al. New directions in language testing, 1985.

[59]Rowntree, D. Assessing students: How shall we know them, 1977, p.185.

[60]Rea, P. ‘Language testing and the communicative language teaching curriculum’, in Lee, Y.P. et al. New directions in language testing, 1985, p.29.

[61]Bachman, L.F. Fundamental considerations in language testing, 1990b, p.20.

[62]Brown, J.D. ‘Language programme evaluation: A synthesis of existing possibilities’, in Johnson, R.K. (ed.). The second language curriculum, 1989, p.223.

[63]Dreyer, C. Testing: The reason why pupils fail. National Association of Educators of Teachers of English (NAETE) conference “Training teachers for the South African context”, Potchefstroom College of Education, September 17-18, 1998.

[64]Rowntree, D. Assessing students: How shall we know them, 1997, p.4.

[65]Ibid., p.8.

[66]Ibid., p.7.

[67]Ibid.

[68]Rowntree, D. Assessing students: How shall we know them, 1997, p.7.

[69]Gamaroff, R. ‘Language, content and skills in the testing of English for academic purposes.’ South African Journal of Higher Education, 12 (1), 109-116 (1998b).

[70]Rea, P. ‘Language testing and the communicative language teaching curriculum’, in Lee, Y.P. et al. New directions in language testing, 1985, p.29.

[71]Weir, C.J. Understanding and developing language tests, 1993, p.68.

[72]An applied linguist, who had read some of my research, queried whether I was doing applied linguistics research or educational research. “Applied linguistics” in the restricted meaning of the term is not directly concerned with predicting academic achievement, but “educational linguistics” is, which has a close connection to applied linguistics. The most important reason for the study of  academic language proficiency is its connection to academic achievement.

[73]Bachman, L.F. and Palmer, A.S. Language testing in practice, 1996, p.4.

[74]Alderson, J.C. ‘Report of the discussion on general language proficiency’, in Alderson, J.C. and Hughes, A. Issues in language testing: ELT Documents III. The British Council, 1981a, p.190.

[75](1) Brown, J.D. A closer look at cloze: Validity and reliability, 1983.

(2) Hale, G.A., Stansfield, C.W. and Duran, R.P. TESOL Research Report 16, 1984.

(3) Oller, J.W., Jr. ‘Cloze tests of second language proficiency and what they measure.’ Language Learning, 23 (1), 105-118 (1973).

(4) Oller, J.W., Jr. ‘A consensus for the 80s’, in Issues in language testing research, 1983.

(5) Oller, J.W., Jr. ‘Cloze, discourse, and approximations to English’, in Burt, K. and Dulay, H.C. New directions in second language learning, teaching and bilingual education, 1976.

(6) Oller, J.W., Jr. ‘“g”, what is it?’, in Hughes, A. and Porter, D. (eds.). Current developments in language testing, 1983a.

(7) Oller, J.W., Jr. Issues in language testing research, 1983b.

(8) Stubbs, J. and Tucker, G. ‘The cloze test as a measure of English proficiency.’ Modern Language Journal, 58, 239-241 (1974).

[76]Oller, J.W., Jr. and Kahn, F. ‘Is there a global factor of language proficiency?’, in Read, J.A.S. Directions in language testing, 1981.

[77]Carroll, J.B. Psychometric theory and language testing, 1983, p.82.

[78]Oller, J.W., Jr. ‘A consensus for the 80s’, in Issues in language testing research, 1983, p.36.

[79] Many scientists, in contrast to many applied linguists, have not given up looking for a Grand Unified Theory (GUT). Many applied linguists would probably say that physics deals with non-living matter whereas language testing deals with human beings. But this, in my view, is no justification for rejecting the search for unifying linguistic principles in humans, if one is interested in linguistic science and not just in linguistic thought.

[80]Bachman, L.F. and Palmer, A.S. Language testing in practice, 1996, p.9.

[81]The DET was the education department in charge of black education up to 1994. It is now defunct.

[82]At a translation committee meeting at the University of Fort Hare in April 1998, the secretary of the meeting suggested that the term “mother tongue” was sexist.

[83]Educamus. Editorial: Internal promotions, 36 (9), 3 (1990), p.3.

[84]Young, D. ‘A priority in language education: Language across the curriculum in black education’, in Young, D. and Burns, R. (eds.). Education at the crossroads, 1987.

[85]Mcintyre, S.P. ‘Language learning across the curriculum: A possible solution to poor results.’ Popagano, 9 and 10, June (1992), p.10.

[86]Cattell, R.B. Measuring intelligence with culture-fair tests, 1973.

[87]Mitchell, R. and Myles, F. Second language acquisition, 1998.

[88]Pienaar, P. Reading for meaning: A pilot survey of (silent) reading standards in Bophuthatswana, 1984, pp.59 and 61.

[89]Ibid., pp.58 and 62.

[90]Bloor, M., Bloor, T., Forrest, R., Laird, E. and Relton, H. Objective tests in English as a foreign language, 1970, pp.70-77.

[91]Ibid., pp.35-40.

[92]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.181.

[93]Tremblay, R.F. and Gardner, R.C. ‘Expanding the motivation construct of language learning.’ The Modern Language Journal, 79 (4), 505-518 (1995), p.505.

[94]Cziko, G.A. ‘Some problems with empirically-based models of communicative competence.’ Applied Linguistics, 5 (1), 23-37 (1984).

[95]Sternberg, R.J. The psychologist’s companion: A guide to scientific writing for students and researchers, 1993, p.53.

Table of Contents Ph.D.

The following is a list of chapters and chapter sections of the Ph.D.

CHAPTER 1: Scope of the Study

1.1  Introduction: The  problem and purpose of the study

1.2  Psychometrics and norm-referenced testing

1.3  Summative assessment

1.4  The One Best Test

1.5  Hypotheses of the study

1.6  Historical and educational context

1.7  Measures used in the study

1.8  Method overview

1.9  Preview of Chapters 2 to 6

1.10 Summary of Chapter 1

CHAPTER 2: Theoretical Issues in the Testing of Language Proficiency and Academic Achievement

2.1   Introduction

2.2   Ability, cognitive skills and language ability

2.3   Competence and performance

2.4   Proficiency

2.5   The discrete-point/integrative controversy

2.6   Cognitive and Academic Language Proficiency (CALP) and “test language”

2.7   Language proficiency and academic achievement

2.8   Validity

2.8.1 Face validity

2.8.2 Content validity

2.8.3 Construct validity

2.8.4 Criterion validity: concurrent and predictive validity

2.9    Reliability

2.9.1  Approaches to the measurement of reliability

2.10   Ethics of measurement

2.11   Summary of Chapter 2

CHAPTER 3: Sampling, and Structure and Administration of the English Proficiency Tests

3.1      Introduction

3.2       Sampling procedures for the selection of subjects

3.2.1     The two main groups of the sample: First Language (L1) and Second Language (L2) groups

3.3        Structure and administration of the English proficiency tests

3.3.1    The cloze tests

3.3.1.1  Theoretical overview

3.3.1.2  The cloze tests used in the study

3.3.2      The essay tests

3.3.2.1  Theoretical overview

3.3.2.2  The essay tests used in the study

3.3.3      Error recognition and mixed grammar tests

3.3.3.1  Theoretical overview

3.3.3.2  Error recognition and mixed grammar tests used in the study

3.3.4    The dictation test

3.3.4.1  Introduction

3.3.4.2  Theoretical overview

3.3.4.3  The dictation tests used in the study

3.3.4.4  Presentation of the dictation tests

3.3.4.5  Method of scoring of the dictation tests

3.4       Summary of Chapter 3

CHAPTER 4: Results of the proficiency tests

4.1    Introduction

4.2    Reliability coefficients

4.3    Analysis of variance of the dictation tests

4.4    Validity coefficients

4.5    Descriptive data of the L1 and L2 groups

4.6    The L1 and L2 groups: Do these levels represent separate populations?

4.7    Comparing groups and comparing tests

4.8    Error analysis and rater reliability

4.8.1  Rater reliability among educators of teachers of English

4.8.2  Interrater reliability: scores and judgements

4.9    Summary of Chapter 4

CHAPTER 5: The Prediction of Academic achievement

5.1  Introduction

5.2  Correlational analysis and multiple regressions of the predictions

5.3  Frequency distributions of the predictions and data analysis

5.3.1 What do we mean by correlation?

5.4  General discussion of language proficiency tests as predictors of academic achievement

5.5  The reliability and predictive validity of the Grade 6 reports of previous schools

5.5.1 Introduction

5.5.2 An examination of the Grade 6 reports

5.6    Summary of the findings and their generalisability

5.7  Summary of Chapter 5

CHAPTER 6: Implications and Conclusions

6.1 Introduction

6.2 The L1/L2 and native speaker/non-native speaker distinctions

6.3  Negotiating the task-demands and the “Threshold Project”

6.4  Competence-Based Education and Training (CBET) and “Outcomes-based Education” (OBE)

6.5  Rater consistency, or reliability

6.6  Paradigm lost, paradigm regained (recreated?)

6.7  Conclusion of the study

6.8  Summary of Chapter 6

Tables

Table 1.1 Rowntree’s Schema of Assessment and Evaluation

Table 1.2 Rea’s Schema of Assessment

Table 1.3  Grade 9 Pass Rate

Table 1.4  Grade 12 Pass Rate

Table 1.5  Comparison of Grade 6 reports between Connie Minchin Primary School and 28 DET Schools

Table 1.6   Comparison of Grade 6 reports between mother tongue English speakers and non-mother tongue English speakers

Table 2.1    Functionalist and Structuralist Levels of language

Table 3.1    Detailed Analysis of the L1 Subjects

Table 3.2    Detailed Analysis of the L2 Subjects

Table 3.3    Normal Presentation of Dictation using One Presenter with One Presentation to all Four Groups

Table 3.4    Presentation of Dictation with Four Presenters: First Presentation of Each Presenter

Table 4.1    Reliability Coefficients of All the Tests

Table 4.2    Cumulative Reliability Coefficients: Comparison between Error Recognition Test (ER) and Mixed Grammar Test  (GRAM)

Table 4.3    Analysis of Variance of the Dictation Tests with First Presentation

Table 4.4    Validity Coefficients of English Proficiency Tests

Table 4.5    Means, Standard Deviations and z-tests for the L1 and L2 groups

Table 4.6    Frequency Distribution of all the Tests

Table 4.7    Occupation  of  Parents of the  L2 Subjects

Table 4.8    Error Recognition test: Percentage Error

Table 4.9    Protocol 1 (L2): Summary NAETE Judgements of  Raters on Content, Grammar and Spelling

Table 4.10 Protocol 2 (L1): Summary of NAETE Judgements of  Raters on Content, Grammar and Spelling

Table 5.1  Grade 6 to Grade 11 Correlational Analysis of the Prediction of Academic Achievement with Aggregate as Criterion

Table 5.2    Stepwise Multiple Regression Analysis of Predictions

Table 6.1.   Attitudes to Testing

Figures

Figure 1.1   Rea’s Schema of Assessment

Figure 1.2   Rowntree’s Schema of Assessment & Evaluation

Figure 2.1   Second Language Proficiency as a Criterion Variable and as a Predictor Variable

Figure 4.1   Comparison between the reliabilities of ER and GRAM

Figure 4.2   A Comparison of the Groups at Mmabatho High School with the Middle School (MID) on the Cloze Tests

Figure 4.3   Frequency Distribution of the Scores Awarded by the 24 Raters on Protocol 1

Figure 4.4   Frequency Distribution of the Scores Awarded by the 24 Raters on Protocol 2

Figure 4.5   Histogram of frequency distribution of NAETE scores on Protocol 2 (L1)

Figure 5.1   NTL1 (Non-Tswana L1) Grade 7 to Grade 11 Pass Rate

Figure 5.2   TL1 (Tswana L1) Grade 7 to Grade 11 Pass Rate

Figure 5.3   L1 (NTL1 & TL1) Grade 7 to Grade 11 Pass Rate

Figure 5.4   L2 Grade 7 to Grade 11 Pass Rate


 

Abstract of Ph.D.

“Language proficiency tests and the prediction of academic achievement,” University of Cape Town, 2000.

The study investigates the ability of English proficiency tests (1) to measure levels of English proficiency among learners who have English as the medium of teaching and learning, and (2) to predict long-term academic achievement (Grade 7 to Grade 12). The tests are “discrete-point” tests, namely, error recognition and grammar tests (both multiple-choice tests), and “integrative” tests, namely, cloze tests, essay tests and dictation tests.

The sample of subjects consists of two groups: (1) those taking English as a First Language subject and (2) those taking English as a Second Language subject. These groups are given the familiar labels of L1 and L2. The main interest lies in the L2 group. The main educational context is a high school in the North West Province of South Africa.

The empirical investigation is divided into four parts:

(1) A description of the battery of English proficiency tests. (Chapter 3). These tests were given to Grade 7 school entrants.

(2) An examination of the validity and reliability of the battery of English proficiency tests. (Chapter 4). High correlations were found between all of the tests, and a substantial difference in English proficiency was found between the L1 and L2 groups.

(3) A longitudinal investigation of predictive validity, where the English proficiency tests were used as the predictors, and academic achievement (Grades 7 to 12) as the criterion. (Chapter 5). The main interest of the longitudinal investigation lies in long-term prediction. It is generally believed that low English proficiency is a major cause of academic failure. The longitudinal study corroborates this belief empirically and also shows that very high English proficiency is a good predictor of success. The matriculation exemptions of the L1 group, which scored substantially higher on the English proficiency tests than the L2 group, were three times higher than those of the L2 group. (An illustrative regression sketch follows this list.)

(4) A longitudinal investigation of the predictive validity of the Grade 6 reports. (Chapter 5). These Grade 6 reports served as the main criterion for admission to Grade 7 at the high school. Almost all of the Grade 6 reports of the L2 group emanated from former Department of Education and Training (DET) schools. Most of the Grade 6 reports of the L1 group emanated from a “feeder” school in close proximity to the high school. The L1 Grade 6 reports were found to be good predictors, while the L2 Grade 6 reports were found to be poor predictors. A probable reason for the poor predictions of the L2 Grade 6 reports was that these reports were inflated, and therefore unreliable.
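
As a rough illustration of the regression-based prediction summarised in (3) and (4), the following Python sketch fits an ordinary least-squares line predicting a Grade 12 aggregate from a single entrance proficiency score. The values and variable names are hypothetical; the study itself used correlational analysis and stepwise multiple regression over several predictors (Chapter 5).

# Illustrative only (hypothetical values): predicting an academic criterion
# from one proficiency score with a least-squares regression line.
from statistics import mean

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

entry_score = [34, 52, 61, 45, 70, 58, 40, 66]    # hypothetical proficiency scores at entry
grade12_agg = [41, 55, 63, 50, 72, 60, 46, 68]    # hypothetical Grade 12 aggregates

slope, intercept = fit_line(entry_score, grade12_agg)
print([round(slope * s + intercept, 1) for s in entry_score])

The closer such predicted values lie to the observed criterion scores, the higher the predictive validity of the test; with several predictors, stepwise multiple regression selects the combination that predicts best.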

The outline of the chapters is as follows:

Chapter 1 describes the scope of the study.

Chapter 2 deals with theoretical issues in the testing of language proficiency and academic achievement. The chapter comprises a review of the literature on language testing and a discussion of germane concepts such as ability, competence, proficiency, authenticity, norm-referenced tests, discrete-point tests, integrative tests, assessment, validity and reliability.

Chapter 3 describes the sample of subjects and sampling procedures, and the structure and administration of the tests.

Chapter 4 presents the results of the English proficiency tests and discussion. Included in the chapter is an investigation of rater reliability among a group of educators of teachers of English.

Chapter 5 deals with the prediction of academic achievement, investigates the reliability of the Grade 6 reports from previous schools, summarises the findings and examines the generalisability of the findings.

Chapter 6 discusses the implications of the study for English testing and presents the conclusions. The four main implications dealt with are: (1) the viability of the distinction between English first language and English second language, (2) the kind of English proficiency tests or tasks that should be used, (3) the problem of rater reliability, and (4) the necessity of psychometric measurement. Woven into the discussion of the implications are a few contemporary initiatives to improve language testing in South Africa and elsewhere.