Chapter 4 – Results of the Proficiency Tests

(The author references are missing in this WordPress copy because in the original thesis they appeared as footnotes, which I could not reproduce here. If you would like to know a reference, please contact me.)


4.1 Introduction

4.2 Reliability coefficients

4.3 Analysis of variance of the dictation tests

4.4 Validity coefficients

4.5 Descriptive data of the L1 and L2 groups

4.6 The L1 and L2 groups: Do these levels represent separate populations?

4.7 Comparing groups and comparing tests

4.8 Error analysis and rater reliability

4.8.1 Rater reliability among educators of teachers of English

4.9 Summary of Chapter 4

4.1 Introduction

There are two basic kinds of statistics: descriptive and inferential. Descriptive statistics summarises a whole array of data; examples of such summary data are means, standard deviations, analyses of variance, and reliability and validity coefficients. Inferential statistics indicates the extent to which a sample (of anything) represents the population from which it is claimed to have been drawn.

The population in this study refers to the Grade 7 entrants at MHS from its inception in 1980 up to the present day. This study is particularly interested in the L2 learners at MHS and the wider population of Grade 6 Tswana-mother-tongue speakers at DET schools in the North West Province of South Africa who were admitted to Grade 7 at MHS from 1980 onwards.

This chapter provides the descriptive statistics of the English proficiency tests. In the next chapter, descriptive and inferential statistics are provided for the prediction of academic achievement. This chapter also deals with inferential issues regarding the L1 and L2 groups, which have an important bearing on the notion of "levels" of proficiency, a central notion in this study.

From the outset, I need to point out that there was a significant difference between the means of the L1 and the L2 groups. This has important inferential implications, for, if the L1 and L2 groups belong to separate populations (in the statistical sense of the word), one couldn’t consider the two groups as a uniform group for correlational purposes. I shall argue that the L1 and L2 groups do not belong to separate populations.

This chapter contains the following sets of results:

(1) Reliability coefficients of all the tests.

(2) Analysis of variance (one-way ANOVA) of the dictation tests only.

(3) Validity coefficients of all tests.

(4) Means and standard deviations of the L1 and L2 groups on all the tests.

4.2 Reliability coefficients

Two kinds of reliability measurements were used:

– The Pearson r correlation formula measures the parallel reliability between two separate, but equivalent, i.e. parallel, tests. The tests involved are the two cloze tests, the two dictation tests and the two essay tests. The procedure used for calculating the reliability of parallel (forms of) tests is to administer the tests to the same persons at the same time and to correlate the results as indicated in the following formula:

r_tt = r_A,B (Pearson r formula)

where r_tt is the reliability coefficient and r_A,B is the correlation of test A with test B when both are administered to the same people at the same time.

– The Kuder-Richardson 20 (KR-20) formula splits a single test in half, and treats the two halves of the test as if they are parallel tests. The tests involved are the error recognition test and the mixed grammar test. The following parallel and KR-20 reliability coefficients are reported:

TABLE 4.1

Reliability Coefficients of All the Tests

The problem is ensuring that two tests are parallel. For example, it is very difficult to ensure parity of content, not only in "integrative" tests such as cloze, dictation and essay but also in "discrete-point", or "objective", tests. This is so because all tests, no matter how "objective" they look, are subjective. Accordingly, it is better to speak of integrative and discrete-point formats than of integrative or discrete-point tests. From this position it is not a big step to speak of parallel scoring, because it is only in the sense that test scores are found to be "parallel" that we can talk of tests being parallel. Statistics becomes not only sensible but indispensable in this matter: (1) if the "parallel" tests ranked individuals in a group in a similar way, i.e. if there were a high correlation between the tests, and (2) if there were no significant difference between the means of the two tests, this would be good evidence that tests of similar formats and scores were parallel tests. Table 4.2 shows that there was no significant difference between the means within each of the three pairs of integrative tests, because the t Stat was less than the t Critical value.

TABLE 4.2 Means and Standard Deviations of Parallel Tests (N=86)

Interrater reliability was only a factor in the essay tests, because the essay tests were marked by more than one rater. With regard to essay tests, Henning points out that “because the final mark given to the examinee is a combination of the ratings of all judges, whether an average or a simple sum of ratings, the actual level of reliability will depend on the number of raters or judges.” According to Alderson,

[t]here is considerable evidence to show that any four judges, who may disagree with each other, will agree as a group with any other four judges of a performance. (It was pointed out that it is, however, necessary for markers to agree on their terms of reference, on what their bands, or ranges of scores, are meant to signify: this can be achieved by means of a script or tape library). (Original emphasis).

If interrater reliability is handled in this way, complex statistical procedures for calculating interrater reliability become unnecessary. One would simply compute the average of the four raters' scores for Essay 1 and Essay 2, respectively, and then compute the parallel reliability coefficient between the average of Essay 1 and the average of Essay 2. This was the procedure used to compute the reliability coefficient of the essay tests.
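As an illustration only, the following sketch implements this averaging-and-correlating procedure on hypothetical rater scores (not the study's data), using numpy's corrcoef for the Pearson r:

```python
import numpy as np

# Hypothetical scores (out of 100) from four raters for five examinees.
# Rows = raters, columns = examinees.
essay1_ratings = np.array([
    [55, 70, 42, 81, 63],
    [50, 72, 45, 78, 60],
    [58, 68, 40, 84, 65],
    [52, 74, 44, 80, 61],
])
essay2_ratings = np.array([
    [53, 69, 40, 79, 66],
    [49, 73, 43, 82, 62],
    [56, 71, 38, 85, 64],
    [51, 70, 41, 83, 60],
])

# Each examinee's mark on an essay is the average of the four raters' scores.
essay1_avg = essay1_ratings.mean(axis=0)
essay2_avg = essay2_ratings.mean(axis=0)

# Parallel reliability: Pearson r between the averaged Essay 1 and Essay 2 scores.
r_tt = np.corrcoef(essay1_avg, essay2_avg)[0, 1]
print(f"Parallel reliability of the essay tests: {r_tt:.2f}")
```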

A reliability analysis of the error recognition test (ER) and the mixed grammar test (GRAM) was also done in a cumulative fashion. The reliability coefficients were computed in this cumulative fashion in order to find out the minimum number of items required to ensure high reliability.

As shown in Figure 4.1, I used the first 10 items (items 1-10), then the first 20 items (items 1-20), then 30 items (items 1-30), and so on. The KR-20 formula was used to compute the reliability coefficients.

FIGURE 4.1  Comparison between the reliability coefficients of ER and GRAM (N=80)

The reliability coefficients of ER and GRAM follow an almost identical pattern. An important statistical truth is illustrated by these reliability data, namely that fewer than 40 items are not likely to produce satisfactory reliability coefficients, i.e. of .90 or higher, for discrete, objective items. In multiple-choice grammar tests a reliability coefficient between .90 and .99 is usually required for a test to be considered reliable, whereas in tests such as an essay test, a reliability coefficient of .90 is considered high. There is also a tapering off of the reliability coefficient after 40 items until it reaches a point of asymptote, where any increase in items does not result in a significant increase in reliability.
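For illustration, here is a minimal sketch of the cumulative KR-20 calculation described above. The 0/1 response matrix is simulated, so the coefficients will not match Figure 4.1; the point is only the procedure of computing KR-20 on the first 10, 20, 30 … items:

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 reliability for dichotomous item scores (rows = test takers, columns = items, 0/1)."""
    k = items.shape[1]
    p = items.mean(axis=0)                      # proportion correct per item
    q = 1 - p
    total_var = items.sum(axis=1).var(ddof=0)   # variance of the total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Simulated responses of 80 test takers to 60 items of varying difficulty.
rng = np.random.default_rng(0)
ability = rng.normal(size=(80, 1))
difficulty = rng.normal(size=(1, 60))
responses = (ability - difficulty + rng.normal(size=(80, 60)) > 0).astype(int)

# Cumulative reliability: items 1-10, 1-20, and so on.
for n in range(10, 61, 10):
    print(f"items 1-{n}: KR-20 = {kr20(responses[:, :n]):.2f}")
```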

Some of the reliability calculations may appear odd, because if it is true, as I have shown, that fewer than 40 items produce low reliability coefficients, then (1) why use 10 items for CLOZE, and (2) why use the parallel method of reliability for CLOZE and the KR-20 (split-half) method of reliability for GRAM and ER? The answer to these questions requires an answer to a third question: (3) why is the parallel reliability of CLOZE with only 10 items as high as .80 while the KR-20 reliability of GRAM and ER with ten items is a low .60? Answers to these questions lie in the relationship between grammatical/linguistic competence (sentence meaning) and discourse competence (pragmatic meaning) and the continuum of "integrativeness" (see section 2.5).

One does not merely look at the format of a test to decide whether it is a "discrete-point" test. One looks at what the test is testing. As pointed out earlier, it is possible to write few words, i.e. a "discrete-point" format, as in a cloze test (or as in "natural" settings) and still be testing "communicative competence", or "pragmatic" language. In the case of GRAM and ER, each of these tests consists of unrelated "objective" items; that is, there is no "pragmatic" connection between them. The KR-20 formula is used to measure the reliability of objective items. The Pearson r formula is used to measure parallel reliability of tests at the pragmatic end of the integrative continuum: the cloze, dictation and essay tests. It is true that the KR-20 formula is sometimes used to measure the reliability of cloze tests, because some authors, e.g. Alderson, maintain that many cloze tests test "low order" skills that "in general [relate] more to tests of grammar and vocabulary… than to tests of reading comprehension". This is not a point of view shared by many other authors (see section 3.3.1.1).

A split-half reliability method (of which the KR-20 formula is a sophisticated version) may not be a good idea for an "integrative" test such as a cloze, dictation or essay test, for the very important reason that the two halves of such tests do not consist of clusters of comparable items: owing to their "pragmatic" nature, the items are not completely independent; they all hang together. If items hang together, as in integrative tests, one may not have to worry about searching for an "empirical basis for the equal weighting of all types of errors", as Cziko believes it necessary to do for all tests.

If there is only one test form, e.g. as in Cziko’s dictation test, one cannot use the parallel reliability method, but one can measure reliability using other methods such as the test-retest method. With regard to the cloze tests, the parallel reliability coefficient of .80 is not only quite acceptable for a “pragmatic” test, but also very good for only ten deletions.

I would like to add a few remarks on rater consistency, or rater reliability. I argued that in the dictation test, presenters and groups were not confounded (i.e. each group had its respective presenter). Therefore, it was legitimate to subsequently do an ANOVA of the four groups/presenters/presentations. One may accept this rationale but still be concerned about the rater reliability of the dictation test (and the cloze test for that matter) because only one rater was involved, and not four as in the essay test. The question, therefore, is whether the scoring procedures in these tests lack evidence of consistency of application owing to the fact that there was only one rater (myself). This should not be a problem in the dictation test, because I didn’t have to worry about distinguishing between spelling and grammatical errors (which can be a serious problem), owing to the fact that only wrong forms of words, intrusions and omissions were considered in my marking procedure. In the cloze tests special care was taken that all acceptable answers were taken into account. The error recognition test and mixed grammar test had only one possible answer. The answers to the latter two tests were provided by the test compilers.

4.3 Analysis of variance of the dictation tests

Recall that a separate presenter was used for each of four groups of subjects. There were four presentations on a rotational basis (Table 3.3). An analysis of variance (ANOVA) was conducted on the – and I must stress this point – first presentation to test for any significant difference between the four presenters’ procedure of presentation. As I explained in section 3.3.5.4, no scores for the statistical analysis were used from dictations that had been heard more than once by any group in the rotation of presenters. Accordingly, presenters and groups were not confounded. In other words, Presenter 1 coincided with Group 1, Presenter 2 with Group 2, and so on.

The ANOVA showed (Table 4.3) that there was no significant difference between the four groups, i.e. the null hypothesis was not rejected. If the null hypothesis had been rejected, this would have demonstrated that there was a significant difference between the four presenters' procedures of presentation. Under those circumstances the use of the dictation in a correlational analysis with the other tests would be invalid, because it would have been illegitimate to combine the four dictation groups into a composite group. The results of the ANOVA are reported below.

TABLE 4.3   Analysis of Variance of the Dictation Tests with First Presentation
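For readers who wish to reproduce this kind of analysis, the sketch below runs a one-way ANOVA on four presenter/group score lists. The scores are hypothetical placeholders, not the values behind Table 4.3:

```python
import numpy as np
from scipy import stats

# Hypothetical first-presentation dictation scores (out of 20) for the four groups,
# each heard by its own presenter.
group1 = np.array([14, 9, 17, 6, 12, 15, 8, 11])
group2 = np.array([13, 10, 16, 7, 12, 14, 9, 10])
group3 = np.array([15, 8, 18, 5, 11, 16, 9, 12])
group4 = np.array([12, 11, 15, 7, 13, 14, 8, 10])

# One-way ANOVA: null hypothesis = no difference between the four group means.
f_stat, p_value = stats.f_oneway(group1, group2, group3, group4)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# If p > .05 the null hypothesis is not rejected, i.e. no significant
# difference between the four presenters' presentations.
```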

Statistical results, in this case the ANOVA, cannot tell us why there was no significant difference between the different presenters. To make qualitative comparisons between the results obtained from different presenters, one would have to examine the results of the four different presenters on the first presentation given to the four groups of subjects. I stress the first presentation because the possibility exists that the dictation passages would become progressively easier with each subsequent presentation.

I examined a random selection of protocols (from both the L1 and L2 groups) to find out whether there was any difference in the quality of output using different presenters. This analysis of protocols was a lengthy enterprise and would thus take up too much space if reported in this study. It is fully reported elsewhere (Gamaroff, forthcoming). The intention is not at all to treat qualitative data in a cavalier fashion. The point is that this study’s main emphasis is on quantitative data. Qualitative data are not at all ignored, however (see section 4.8ff). What I shall do here is summarise the conclusions of the qualitative analysis of the dictation tests:

The dictation passages (Pienaar's restored [or "unmutilated"] cloze passages) for the Grade 7 subjects were intended for the Grades 5 to 7 L2 levels and for the Grades 5 and 6 L1 levels. Consequently, the L1 group would be expected to do well, even if the presenter's prosody were unfamiliar. As the statistics will show (section 4.5, Table 4.5), the L1 group did well and the L2 group did badly.

Recall (section 3.3.5.4) that I used a variation of the traditional procedure: one point was deducted from a possible score of 20 points for any kind of error, including spelling errors. This was done because I believed that this procedure would yield a valid indication of the level of proficiency of individual subjects. If one were only interested in norm-referenced tests, it would not matter what the possible score was, because in norm-referenced tests one is only interested in the relative position of individuals in a group, not in their actual scores. One could then measure the correlation between this procedure and Oller's procedure, and if the correlation were found to be high, one could use the shorter procedure. A correlational analysis was accordingly done on the dictation tests between Oller's procedure and my variation of the traditional procedure (a possible 20 points). High correlations were found: .98 for the first dictation passage, and .89 for the second dictation passage. The reason for the high correlations is probably the following:

The word forms of the L2 group were so deviant that I regarded them as grammatical errors. In the L1 group, in contrast, the scores were very high, which meant that no marks were subtracted for spelling or for grammatical errors. As a result, in both groups spelling had no significant effect, i.e. very few marks were subtracted for spelling. This means that whatever possible score I chose, the correlations between my procedure and Oller's procedure would have been high; hence the high correlations reported in the previous paragraph. (Correlation is not concerned with whether scores are equivalent between two variables, but only with the common variance between two variables, i.e. whether the scores "go together"; see section 5.3). So, even if Oller's dictation procedure yielded relatively higher scores than my procedure, this does not affect the correlation.
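The point that the choice of possible score does not affect the correlation can be illustrated with a short sketch on simulated dictation scores (hypothetical data, not the study's):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated dictation scores under a procedure marked out of 86 words.
scores_out_of_86 = rng.integers(10, 87, size=30)

# A 20-point procedure that ranks test takers in much the same way,
# with a little marking "noise".
scores_out_of_20 = scores_out_of_86 / 86 * 20 + rng.normal(0, 0.5, size=30)

r = np.corrcoef(scores_out_of_86, scores_out_of_20)[0, 1]
print(f"r between the two procedures: {r:.2f}")

# Pearson r is unchanged by any linear rescaling of either variable,
# so the size of the possible score is immaterial to the correlation.
r_rescaled = np.corrcoef(scores_out_of_86, scores_out_of_20 * 5 + 100)[0, 1]
print(f"r after rescaling one procedure: {r_rescaled:.2f}")  # identical
```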

One can explain the difference in performance between the L1 and L2 groups in terms of the difference between the information-processing strategies used by low-proficiency and high-proficiency learners. When we process language, we process in two directions: bottom-up from the sound input and top-down from the application of the cognitive faculties. With regard to the dictation test in the study, the words were highly predictable for the L1 group, and therefore this group did not have to rely totally on the sound input. The opposite was the case for the L2 group, where there was an almost total reliance on the bottom-up process of sound recognition. In other words, native listeners or listeners with high proficiency "can predict the main stresses and can use that fact to 'cycle' their attention, saving it as it were, for the more important words". It should be kept in mind, however, that bottom-up processing from the sound input plays a major role at all levels of proficiency, not only at the low levels.

The difficulties experienced by the L2 group did not have to do only with lexical lacunae: there is much more to knowing a word than knowing the various meanings it may have. To master a word one also needs to know its form, its frequency of use, its context, and its relationship to other words. Problems can occur in any of these areas. This applies to all the tests of the test battery.

4.4 Validity coefficients

The singular term test will be used to refer to the means of the two cloze tests (CLOZE), of the two essay tests (ESSAY) and of the two dictation tests (DICT). With the single mixed grammar test (GRAM) and the single error recognition test (ER), there are five “tests” altogether. Table 4.4 shows the validity coefficients of the English proficiency tests. The numbers in the top row refer to the tests that appear next to the corresponding numbers in the extreme left hand column.

TABLE 4.4

Validity Coefficients of the English Proficiency Tests (p < .01)

* Corrected for part-whole overlap. Part-whole overlap occurs when an individual test score is correlated with the total score of all the tests of which it is a part. In such a situation one would not be measuring two variables that are separate from one another, which results in part-whole overlap between the individual test and the total score. This overlap would inflate the correlation, thus giving an inaccurate picture.
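One common way of avoiding the overlap is to correlate each test with the total of the remaining tests only. The sketch below illustrates this principle on simulated scores; it is an illustration, not necessarily the exact correction formula used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated percentage scores on the five tests for 86 test takers,
# all driven by a common underlying ability.
ability = rng.normal(50, 15, size=86)
names = ["CLOZE", "DICT", "ESSAY", "GRAM", "ER"]
scores = np.column_stack(
    [np.clip(ability + rng.normal(0, 10, size=86), 0, 100) for _ in names]
)

total = scores.sum(axis=1)
cloze = scores[:, 0]

# Inflated: CLOZE correlated with a total that contains CLOZE itself.
r_with_overlap = np.corrcoef(cloze, total)[0, 1]

# Corrected: CLOZE correlated with the total of the other four tests only.
r_corrected = np.corrcoef(cloze, total - cloze)[0, 1]

print(f"with overlap: {r_with_overlap:.2f}; corrected: {r_corrected:.2f}")
```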

The high validity coefficients are impressive and perhaps unusually high. For this reason the raw data and computations (using the statistical programme "Statgraphics") were rechecked twice. High validity coefficients, however, are not unusual between these tests (see Note 80, Chapter 1). Validity coefficients, unfortunately, do not give a close-up picture and thus often need to be supplemented by other descriptive data such as frequency distributions, means and standard deviations. The next section presents these other descriptive data, where a comparison is made between the L1 and the L2 groups.

 4.5 Descriptive results of the L1 and L2 groups

The differences between the performance of the L1 and L2 groups are shown below. The following data are provided:

1. Means and standard deviations (Table 4.5).

 2. A Frequency distribution (Table 4.6).

The following measures appear in the tables:

1. CLOZE – Average of Cloze tests 1 and 2 (N=86).

2. DICT – Average of Dictation tests 1 and 2 (N=86).

3. ESSAY – Average of Essay tests 1 and 2 (N=86).

4. GRAM – Mixed grammar test (N=80).

5. ER – Error recognition test (N=80)

A statistically significant as well as a substantial difference was found between the means of the two groups as shown in Table 4.5.

TABLE 4.5   Means and Standard Deviations for the L1 and L2 Groups

When the t Stat is greater than the t Critical value, there is a significant difference between the two groups. (According to Nunan, when two sets of scores have substantially different means or standard deviations, it is not necessary to use a t-test to test for a significant difference between means.) The frequency distributions are shown in Table 4.6.

 TABLE 4.6 Frequency Distribution of all the Tests

The L2 group did very poorly on the dictation and the error recognition tests, less poorly on the cloze and essay tests, and best of all on the grammar test. The L1 group did best on the dictation and the grammar tests, while in the other tests the order of increasing difficulty was the cloze, the essay and the error recognition tests. In Chapter 5 the frequency distributions are analysed in more detail in relation to the prediction of academic achievement.

Does the significant difference between the L1 and L2 scores above mean that these two groups come from different populations and therefore should not be treated as a composite group in a correlational analysis? I examine this question in section 4.6.

The multiple-choice format is vulnerable to guessing. It is sometimes recommended that scores be adjusted for guessing, as in Bloor et al.'s GRAM and ER that were used in this study. Guessing was taken into account in this study, which meant that in the mixed grammar (GRAM) test a score of 88% was reduced to 85%, a score of 64% to 55%, and a score of 40% to 25%. Thus the person who has more loses proportionately less. The score of 40% in GRAM is used to show how to calculate the adjustment for guessing:

(100 minus the actual score of 40%) divided by the number of options in the item (4 options) = 15%

40% (actual score) minus 15% = 25% (adjusted score)

As shown in the last line of the equation, the result of the first line (15%) is subtracted from the actual score of 40% to give an adjusted score of 25%. The greater the number of options, the less the adjustment, because the test would be more difficult. ER has five options, and so the adjustment is less than for GRAM. Suppose the actual ER score was also 40%, as in the GRAM example above. The adjusted score of ER would be 28%, which is 3% higher than the adjusted score of GRAM:

(100 minus the actual score of 40%) divided by the number of options in the item (5 options) = 12%

40% (actual score) minus 12% = 28% (adjusted score)
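The adjustment can be written as a one-line function; the calls below simply reproduce the worked examples given in the text:

```python
def adjust_for_guessing(score_percent: float, n_options: int) -> float:
    """Subtract (100 - actual score) / number of options from the actual score."""
    return score_percent - (100 - score_percent) / n_options

print(adjust_for_guessing(40, 4))  # GRAM example: 25.0
print(adjust_for_guessing(40, 5))  # ER example: 28.0
print(adjust_for_guessing(88, 4))  # 85.0
print(adjust_for_guessing(64, 4))  # 55.0
```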

One cannot prove that someone is guessing, and without proof it might be argued that one would be penalising non-guessers as well as guessers: "in multiple-choice formats, guessing affects scores and, though statistical procedures are available to correct for it, they necessarily apply indiscriminately whether or not a learner actually has guessed" (Ingram 1985). However, the logic behind correction for guessing is not indiscriminate even though it affects everybody. As shown in the examples above, the less one knows, the greater the likelihood of guessing. Although one cannot be sure who is guessing, the rationale of the adjustment for guessing is based on what we know about learning and test performance. The key point of logic in the adjustment for guessing is that the lower the original score, the greater the possibility that one is guessing. If the scores are not adjusted for guessing, this would of course affect the ranges of scores. But in this study I am not interested so much in the absolute values of these ranges as in the relative values: the L1 group relative to the L2 group.

I now focus on the cloze results, because these cloze tests have been used elsewhere and have produced a solid body of results with which one can compare the results in this study. I shall also introduce cloze data from another school (to be described shortly). Recall that Pienaar tested a variety of learners from different schools, including Bantu speakers living in sub-economic settlements in the environs of Mmabatho (category 4b; see section 3.3.1.2). Pienaar used the label "III" for the group that I called 4b. Many of the parents of category 4b were illiterate or semi-literate and were either unemployed or semi-employed. The sample at MHS did not contain learners who belonged to this category, as shown by the occupations of the parents of the L2 group in Table 4.7 of the original thesis. [The table is not included in this copy.]

The L2 group contains a good number of parents who work in the education field (see highlighted occupations). One cannot infer that the children of educators are usually advantaged, because in South Africa, education was one of the few professions open to blacks.

To make the statistical data more comparable I included in the investigation a middle school (Grades 7 to 9) situated in the environs of Mmabatho that accommodated learners similar to Pienaar's category 4b. The sample from this school, referred to as MID, consisted of 40 Grade 7 learners. Learners at MID come from many primary schools in the area, because there are far more primary schools than middle schools in the area. Figure 4.2 compares the cloze test frequency distributions of the MHS L2 group with the Middle School (MID).

FIGURE 4.2 A Comparison of the MHS L2 Group with the Middle School (MID) on the Cloze Tests

The MID school results are very similar to those of Pienaar's category of sub-economic learners, namely his category III (which I have called category 4b). By comparing MID with the MHS sample we see that the L2 group at MHS, poorly as it did, performed better than the MID group. The MID group is comparable with Pienaar's "at risk" group: indeed, at high risk. The MHS L2 group is also at high risk, but the MID group is much worse.

4.6 The L1 and L2 groups: Do these two levels represent separate populations?

Section 4.4 showed that there were high correlations between the discrete-point and integrative tests of the test battery. This study is not only concerned with statistical concepts such as correlation but also with the problem of assigning levels of language proficiency to learners. The discussion to follow is relevant to both these issues:

It is only after the test has been performed on the test-bench that it is possible to decide whether the test is too easy or too difficult. Furthermore, if there are L1 and L2 subjects in the same sample, as is the case with the sample in this study, one needs to consider not only whether the norms of the L1 and the L2 groups should be separated or interlinked but also how to ensure the precise classification of the L1 and L2 subjects used for the creation of norms.

As far as the correlational analysis was concerned, I interlinked the L1 and L2 groups and treated them as a composite group. But I also separated the L1 and L2 groups in order to find out whether there was a significant difference between the means of the two groups. If a significant difference were not found between the L1 and L2 groups, this would militate against the construct validity of the tests, because it would mean that the L2 group, who should be weaker than the L1 group, was just as proficient as the L1 group. Under such conditions, we would have no idea what we were testing.

The question is whether it is legitimate to treat the L1 and L2 subjects as a composite group (for a correlational analysis) as well as two separate groups (for comparing the means between the L1 and L2 groups). One may object that one cannot do both; that one cannot interlink groups and also separate them. I shall argue that one can.

As shown in Table 4.4, the correlations between the different tests were high. The means and standard deviations (Table 4.5), however, show that there was a significant difference between the L1 and L2 groups. Can one, accordingly, maintain that because there was a significant difference between the L1 and L2 groups, these two groups belong to separate populations, and thus argue that the correlations were artificially inflated by combining samples that represent two separate populations? A discussion of this question raises the further question of the logic of dividing the subjects into L1 and L2 groups. Is this division arbitrary, or does it have a cogent theoretical rationale on which one can base the inference that the L1 and L2 groups represent different populations?

There are two distinct, though related, issues: levels of proficiency and correlations. The logic of correlation, which is based on a bell-curve distribution, is that tests that do not have a reasonably wide spread of scores (high achievers and low achievers) could give a false picture: tests that have a large spread of scores around the mean are more likely to be replicable, because in a representative sample of human beings there is likely to be a wide range of ability.

This does not mean that it is not possible to have a high correlation with a narrow spread, or a low correlation with a wide spread, but it is more likely that a correlation would be higher with a wide spread of scores, say 0% to 80%, than a narrow spread, say 40% to 80%. The sample in this study represented the Grade 7 population at the school throughout the years.
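A small simulation (hypothetical data) illustrates how restricting the spread of scores tends to lower the correlation between two tests of the same ability:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two hypothetical tests of the same underlying ability.
ability = rng.uniform(0, 80, size=500)            # wide spread: 0% to 80%
test_a = ability + rng.normal(0, 8, size=500)
test_b = ability + rng.normal(0, 8, size=500)
r_wide = np.corrcoef(test_a, test_b)[0, 1]

# The same tests, but looking only at a narrow band of ability (40% to 80%).
narrow = (ability >= 40) & (ability <= 80)
r_narrow = np.corrcoef(test_a[narrow], test_b[narrow])[0, 1]

print(f"wide spread (0-80):    r = {r_wide:.2f}")
print(f"narrow spread (40-80): r = {r_narrow:.2f}")  # typically noticeably lower
```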

In the assessment of levels of proficiency I separated the high achievers from the low achievers in the sample because they could be distinguished – unsurprisingly – as those who took English as a First Language and English as a Second Language, respectively.

I discuss briefly the theory of investigating the difference between groups. Consider the following example:

In South Africa there are many immigrants from different countries for whom English is a foreign language. If one tested and compared the English proficiency of a group of Polish immigrants and Chinese immigrants and found no significant difference between these two groups, one wouldn’t be surprised, because one would probably conclude that there was a wide spread of scores in both groups. If a significant difference were to be found, one may be curious to know why the one national/ethnic group did worse or better than the other.

Replace the Polish and Chinese immigrants with two other groups, the L1 and L2 groups in this study. A significant difference was found between these two groups, but this is not surprising at all, because it is to be expected that the group taking English as a First Language subject (the L1 group) would be better at English than the group taking English as a Second Language subject (the L2 group), assuming that the subjects (test takers) in the sample made a reasonable choice of which group to belong to. (Recall that learners at MHS initially decided themselves whether they belonged to L1 or L2. In most cases they had a good idea where they belonged.) Accordingly, it is quite logical that there would be a significant difference between the L1 and L2 groups, or levels. To get a clearer grasp of the issue of the respective levels of proficiency of the two groups and the "separate populations" question, one has to examine whether:

(1) The reliability aspects were the same for the L1 and L2 subjects, e.g. the same tests, same testing facets and same testing conditions, etc. (see section 2.9). This was so.

(2) The composition of the sample, i.e. the proportion of L1 and L2 learners, was similar from year to year at MHS. This was so. In other words, the 1987 Grade 7 sample represented the population of Grade 7 learners at MHS from year to year, specifically from 1980 to 1993.

One would also look at whether there were differences in:

(3) Admission criteria for the L1 and L2 groups.

(4) The background, or former treatment of L1 and L2 learners before they entered the school.

(5) What one expected from the L1 and L2 learners.

(6) The treatment they were given in the same education situation.

All the above points except (4) were the same for the L1 and L2 subjects. I now discuss (4):

MHS endeavours to provide disadvantaged learners with the opportunity to learn in an advantaged school situation. In the validation of the sample, the notion of disadvantage is important. In South Africa the term disadvantage often bears the connotation of "consciously manipulated treatment" meted out by apartheid. Treatment can have the following two connotations: (i) consciously manipulated treatment in an empirical investigation and (ii) the long-term treatment – be it educational, social, economic, cultural or political – of human beings in a non-experimental life situation.

What is relevant to the statistical rationale of this investigation is not the fact that the entrants to MHS had received different treatment prior to entering MHS, where some may have been victims of apartheid and others not, but only the fact that all entrants received the same treatment after admission to MHS. I am not implying that their background experience is inconsequential as far as the teaching situation – past (at former schools) or future (at MHS) – is concerned, but only that all entrants were expected to fulfil the same academic demands. I discuss later the role of language background, specifically the role of English input.

The vast majority of the 1987 Grade 7 intake had high Grade 6 scores from their former schools. This was the main reason why many of them were admitted to MHS. The disadvantaged group and the advantaged group both consisted of high-scoring entrants, as revealed by the Grade 6 school reports. Accordingly, it appeared that all the entrants were extremely able, whether they came from an advantaged or a disadvantaged background. Now, suppose one found that (i) high Grade 6 scores (from former schools) were obtained by both the L1 and L2 groups but that (ii) while high English proficiency test scores were obtained by the L1 group, low English proficiency test scores were obtained by the L2 group. The findings showed that both these facts were so. This does not mean, however, that the L1 and L2 subjects belong to different populations. What it shows – on condition that the English proficiency tests were valid and reliable, which the findings show was the case – is the true nature of the population, namely a wide spread of scores.

So, although it seemed, from the good Grade 6 reports of all entrants at MHS (L1 and L2 entrants) from year to year, that MHS only admitted high achievers, the reality was that MHS admitted a mixture of academically weak learners (who were generally disadvantaged) and academically strong learners (who were generally advantaged), as was the case with the 1987 Grade 7 sample. Further, learners at MHS received the same treatment.

The statistical analysis should be kept distinct from the educational, social, economic, cultural, political and other deprivations that pre-existed admission to MHS. The principal issue in this study is what learners are expected to do after admission, where all learners are called upon to fulfil the same academic demands, except for the language syllabuses, namely English, Tswana, Afrikaans and French, and where all are required to use English as the medium of instruction.

If it is true that former academic achievement (Grade 6 in this case) is the best predictor of subsequent achievement, it would follow that many of these entrants should have had at least a reasonable standard of academic ability. What happened in fact was that although almost all of the 1987 Grade 7 entrants (L1 and L2) obtained high Grade 6 scores on English achievement and on their aggregates, many of the L2 entrants (who were mostly disadvantaged learners) obtained low scores on the English proficiency tests. For this reason, the sample turned out to be, as far as the English proficiency tests were concerned, a representative mixture of weak and strong learners, i.e. a random sample. This fact is crucial to the validation of any sample, whose essential ingredient is randomness.

I am arguing, therefore, that the L1 and L2 groups do not represent separate populations: they are merely a mixture of weak and strong performers, where it is only logical that weak subjects would prefer to belong to the L2 group than to the L1 group and that the L2 group would also do relatively worse than the L1 group on the English proficiency tests. It turned out that there was a clear distinction between the L1 and L2 groups. Most of the L2 group did poorly and most of the L1 group did relatively much better on the tests, hence the significant difference in the means between the two groups.

In the traditional distinction between L1 and L2 learners, these two kinds of learners differ only in so far as L2 learners aspire to reach the L1 level. The difference, therefore, between L1 and L2 learners lies in the different levels of mastery. And that is what tests measure within a sample that represents a population: they measure which members are strong and which are weak. If the tests are too difficult for the L2 group or too easy for the L1 group, this does not mean that the tests are invalid, i.e. that they have been used for the wrong purpose, if the purpose is to distinguish between weak and strong learners. As far as construct validity is concerned, one does not look at the actual scores but at whether the tests distinguish between weak and strong learners. Oller elucidates (he is talking about one learner and one task, while I am talking about many learners and several tasks: I make the necessary adjustments in Oller to suit the context):

It is probably true that the [tasks were] too difficult and therefore [were] frustrating and to that extent pedagogically inappropriate for [these students] and others like [them], but it does not follow from this that the [tasks were] invalid for [these learners]. Quite the contrary, the [tasks were] valid inasmuch as [they] revealed the difference between the ability of the beginner, the intermediate, and the advanced [learners] to perform the [tasks].

(Section 4.7 elaborates on the comparison between test scores and the comparison between groups).

If one is or believes one is weak at English, one would sensibly prefer to take English as a Second Language, if one had a choice: one did have a choice at MHS. This is not to say that if one were good at English one would not take English as a Second Language, owing to the fact that somebody good at English could obtain higher marks taking English as a Second Language than taking English as a First Language.

Accordingly, those Tswana speakers in the L1 group at MHS who later changed to English Second Language could have done so not because they were weak at English but in spite of the fact that they were good at English.

4.7 Comparing groups and comparing tests

It might be argued that measuring the difference in means between groups apportions equivalent scores to each item and accordingly does not take into account the relative level of difficulty of items. I suggest that the relative difficulty of items is not important in a language proficiency test, but it is indeed important in a diagnostic test, which has remediation as its ultimate purpose. With regard to proficiency tests, one is concerned with a specific level (e.g. elementary, intermediate or advanced) for specific people at a specific time and in a specific situation. Within each level there is a wide range of item difficulty. To attain a specific level of proficiency one has to get most of the items right – the difficult and the easy ones. In sum, the different bits of language have to hang together, which is what we mean by general, or overall, language proficiency. As pointed out earlier (section 2.5), the controversy is about which bits do and which bits don't hang together.

We now come to a very important issue. What does a score of 60% on a test for an L2 learner in this study mean? An answer to that question requires a distinction between (1) the comparison between tests and (2) the comparison between groups: the L1 and L2 groups.

As a preliminary I refer to the relationship between norm-referenced tests and criterion-referenced tests. The former are concerned only with ranking individuals in a group and not, as in the case of criterion-referenced tests, with the individual scores achieved in different tests. So, in norm-referenced testing one is interested in correlations, which are concerned with how individuals are ranked in a group on the tests involved in the correlation, and not with whether individuals achieved equivalent scores within a group. The latter is the concern of criterion-referenced tests. But, of course, one needs both kinds of information to get an empirically based idea of language tests. One has to be careful, however, when one compares tests.

In this study I have been comparing different kinds of tests and contrasting different groups of learners. The main focus is on the difference between groups, and thus there was no explicit and sustained attempt to contrast the scores between the different tests. This was deliberate. If one is going to compare the results between different tests, e.g. the dictation test and the cloze test, extreme caution is required, because such comparisons could lead to false conclusions. This is not to say that such comparisons are not useful; they can be very useful, but when one makes such comparisons, one must be aware of the parameters involved. Scores reveal nothing and surface errors reveal little about why a particular score was awarded. One has to look at the construction of the test, i.e. what, why and who is being tested and doing the testing, and how it is being tested. All these parameters are related to the scales of measurement that one uses. Consider the following measurement scales, especially the ratio scale. The ratio scale could be confused with the other scales:

– Nominal scale (also called categorical scale). This is used when data is categorised into groups, e.g. gender (male/female); mother tongue (English/Tswana).

– Ordinal scale. One could arrange proficiency scores from highest to lowest and then rank them, e.g. first, second, etc.

– Interval scale. One retains the rank order but also considers the distances (intervals) between the points, i.e. the relative order between the points on the scale.

– Ratio scale. This is an interval scale with the added property of a true zero score, where the "points on the scale are precise multiples, or ratios, of other points on the scale". Examples would be the number of pages in a book, or the number of learners in a classroom. If there were 200 pages in a book, 100 pages would be half the book.

Taking these scales into account, consider the proficiency tests of the study:

– The cloze test. To be considered proficient enough to cope in a higher grade, one should obtain a score of at least 60% on the cloze tests. (The mean scores of the L1 and L2 groups for the cloze test were 66% and 26%, respectively). (See Table 4.5).

– The essay test. A score over 60% on the essay, in contrast to 60% on the cloze test in this study, would be considered a good score. A score of 40% on an essay test, or on any test, is not half as good as a score of 80%. As far as essay tests are concerned, 80% would be an excellent score, while 30% would be a poor score. But poor is not half of excellent. (The mean scores of the L1 and L2 groups for the essay test were 56% and 29%, respectively). (See Table 4.5).

– The dictation test. I used a score of a possible 20 points; one point for every correct word. But if I had made the score out of 86 or 87 points (the dictation passages consisted of 86 and 87 words, respectively), where every word counted one point, a score of 60% would mean that 40% of the words in the dictation passage would be wrong. It is hardly likely that a dictation protocol with a score of 60% marked in this way would be comprehensible. Accordingly, an individual’s score of 60% on a dictation test would not mean the same thing at all as 60% on the cloze and essay tests. (The mean scores of the L1 and L2 groups for the dictation test were 71% and 16%, respectively). (See Table 4.5).

– The error recognition and the mixed grammar tests. These test scores were adjusted for guessing. If one adjusts for guessing, one must take this into account.

To sum up, it is the “relative difference in proficiencies” between learners of high ability (in this case the L1 group) and low ability (in this case the L2 group) and not the equivalence in scores between the tests that determines the reliability and construct validity of the tests.

4.8 Error analysis and rater reliability

Although the average of four or even three raters may be a reliable assessment of a “subjective” test such as an essay test, it is usual in the teaching situation to have only one rater available, who is the teacher involved in setting the test. If there are two raters available it is generally only the teacher who sets the test and is overall in charge of the test who has the time or inclination to do a thorough job. The problem of rater consistency is an extremely serious problem in assessment. The nub of the problem is one of interpretation, an issue that fills innumerable tomes in the human sciences, especially during this “postmodern” era. This is what is involved:

Logically prior to any question of the reliability and validity of an assessment instrument is the question of the human and social process of assessing…This is a radically interpersonal series of events, in which there is an enormous, unavoidable scope of subjectivity – especially when the competences being assessed are relatively intangible ones to do with social and personal skills, or ones in which the individual's performance is intimately connected with the context.

It is the interpretation, or judgment, of errors that is the main problem in language testing. Ashworth and Saxton (in their quotation above) are concerned with the lack of equivalence in judgements and scores between raters. The subjectivity question in the battery of tests of this study remains a problem in the essay test. I tried to solve the problem by using four raters. But, in most testing situations only one rater and at most two raters are available. I would like to expand on the issue of rater reliability, because this seems to be the major problem in the assessment of “subjective” tests such as essay tests. Error analysis is brought into the picture.

The qualitative analysis of errors and quantitative measurement are closely related in issues of interrater reliability. In this section I discuss more theory, in this case the uses and limitations of error analysis, which serves as a background to the examination of a detailed practical example of the uses and limitations of error analysis and quantitative measurement. I begin the discussion by assessing the value of the quantitative procedures used in this study in relation to the lack of qualitative procedures used so far:

One may feel that the linguistic substance of individual errors obtained in an error analysis has more bite than reductionist “number-crunching” and that consequently this study has overreached itself by limiting itself to something as insubstantial as a statistical investigation. One might want to see additional analyses of a qualitative nature of the proficiency tests, especially of the integrative tests, where writing output is involved. Such a desire is understandable because scores by themselves don’t illuminate the linguistic substance behind the numbers owing to the fact that similar scores between raters do not necessarily mean similar judgements, and different scores between raters do not necessarily mean different judgements.

Error analysis can be useful because it provides information on the progress made towards the goal of mastery and provides insights into how languages are learnt and the strategies learners employ. Concerning learning strategies, the making of errors is part of the learning process. (An error analysis need not involve a “linguistic” analysis. For example, in an error analysis of writing one could look for cohesion errors, but if one were to examine the noun-to-verb ratio in individual protocols, this would not be an error analysis but a linguistic analysis).

This study is mainly concerned with norm-referenced testing. To include a linguistic/error analysis of the tests would, besides being far too long and ambitious a project, go beyond the objectives of this study. The problem would be which tests to use in such an analysis, and how long such an analysis should be. Naturally, qualitative analysis is very important, but in the empirical part of the study I focus on quantitative data. As far as qualitative data are concerned, what is relevant to this study is an examination of the problems of error analysis as they relate to rater reliability. As mentioned a few paragraphs earlier, I shall use a detailed concrete example later in this section to examine this problem. But first some theory.

Often mother-tongue proficiency is advocated as an absolute yardstick of language proficiency, but, as Bachman and Clark point out “native speakers show considerable variation in proficiency, particularly with regard to abilities such as cohesion, discourse organisation, and sociolinguistic appropriateness.” As a result, theoretical differences between testers can affect the reliability of the test. Raters who know the language well and even mother-tongue speakers can differ radically in their assessments of such pragmatic tasks as essay tasks. That is why different raters’ scores on a particular protocol are often incommensurate with their judgements. Owing to these problems it is virtually impossible to define criterion levels of language proficiency in terms of actual individuals or actual performance. Bachman and Clark suggest that such levels must be defined abstractly in terms of the relative presence or absence of the abilities that constitute the domain. But again this doesn’t solve the problem because the difficulty is how to apply the definition to concrete situations of language behaviour.

Another problem is the representativeness of specific errors. In previous research I did an error analysis of Tswana speakers' English but did not establish statistically whether the errors I was dealing with were common errors, e.g. *cattles (a plural count noun in Tswana, "dikgomo") and *advices (a plural count noun in Tswana, "dikgakoloko"). Under such circumstances one can be duped into believing that errors are common if one comes across them a few times, which may only create the feeling that they are common. Error analysis under such circumstances could indeed become merely an idiosyncratic – and mildly interesting – "stamp collection".

Another example: Bonheim, coordinator of the Association of Language Testers in Europe, gives an example of a test taker who had done very well on a multiple-choice test but, in one of his/her few incorrect items, had circled an option that was an unlikely answer. Bonheim suggested that one should try to find out why this highly proficient learner had circled this option. Such an idiosyncratic example surely cannot contribute anything to the general principles of error analysis, i.e. it cannot tell us whether the error is common enough to warrant a time-consuming investigation. In proficiency testing one is not looking for idiosyncratic errors but for general errors. In diagnostic testing, of course, the situation is quite different, because one focuses on both individual and general errors: the main aim of a diagnostic test is therapy, not finding out the level of a person's present ability, which is what proficiency tests are about.

Obviously, the different types of tests, e.g. proficiency, diagnostic, aptitude and achievement, are related, but it is important to keep their main purposes distinct: otherwise there would be no point in creating these distinctive categories. For example, an itemised analysis can reveal the relative strengths and weaknesses of the different groups in different parts of the language. Such an analysis can be used as a diagnostic tool at the beginning or end of a specific course of instruction, or, in this case, as a measurement of specific points of proficiency. However, without quantitative procedures, the data one gathers remain unconvincing. For example, consider the percentage error on individual items of the error recognition test for the Non-Tswana L1 sub-group (NTL1) and the L2 group in Table 4.8:

TABLE 4.8  Error Recognition (Identification) Test: Percentage Error

Care must be taken in the interpretation of Table 4.8. The higher the scores, i.e. the higher the percentage error, the more difficult the item. The information in Table 4.8 reveals the similarities and differences between groups on each item. For example, in item 8, the difference between the NTL1 and the L2 group is not substantial. Item 8 is given below.

Item 8. (A) Both Samuel and I / (B) are much more richer / (C) than we / (D) used to be.

Correct answer: B

In item 19 the NTL1 group does substantially better than the L2 group.

Item 19. (A) Some believe that / (B) a country should be ruled / (C) by men who are / (D) too clever than ordinary people.

Correct answer: D

ESL learners often confuse intensifier forms such as “too clever”, “very clever” and “so clever” and comparative forms such as “more beautiful” and “cleverer.” The error in Item 19 is a double confusion between intensifier and comparative forms probably caused by false generalisation, or false analogy, from the English forms.

A quantitative analysis of errors was also found useful in identifying the “replacement language” subjects, which helps in establishing levels of proficiency between learners. Recall (Note 2, Chapter 3) that a “replacement language” is a language that becomes more dominant than the mother tongue, usually at an early age, but is seldom fully mastered, as in the case of some of the Coloured and Indian subjects in the sample, who belong to the Non-Tswana L1 (NTL1) sub-group. (Bantu speakers, of course, can also have replacement languages).

The "replacement language" subjects could be identified, to a certain extent, by the very low scores they obtained in the tests. An examination of particular errors made by those I suspected of being "replacement language" subjects increased the accuracy of the identification of these subjects. (These were Indians and "coloureds" who had been using English as a medium of instruction from the beginning of primary school.)

Mother-tongue speakers do make, or fail to recognise, several grammatical errors. Consider the percentage error of the NTL1 group on items 8 and 19 given above – 75% and 30% respectively. In item 8, it is possible that 12-year-old English-mother-tongue speakers would not be able to recognise that segment B ("are much more richer") is an error. In item 19, it is more likely that 12-year-old English-mother-tongue speakers would recognise segment D ("too clever than ordinary people") as an error, but there is still a slight possibility that an English-mother-tongue speaker would not recognise the error. There are, however, certain errors that all English-mother-tongue speakers would recognise. Thus if such a mistake were made by subjects whom one suspected of being "replacement language" subjects, one would be almost certain that they indeed were. For example, consider the percentage error of item 27 of the error recognition test.

Item 27. (A) As I have now studied / (B) French for over three years / (C) I can be able to / (D) make myself understood when I go to France.

Correct answer: C. (Recall that by "correct" here is meant identifying/recognising the error.)

Percentage error of item 27:

NTL1 (N=20): 40%

L2 (N=38): 93%

In item 27 segment C ("can be able") is a notorious error among South African black ESL users. The L2 group had a percentage error of 93. What is interesting is that 40% of the NTL1 group got this item wrong. It is highly likely that an English-mother-tongue speaker would recognise this error. Such an example is good, if not absolute, evidence that those in the NTL1 group who did not recognise this error were "replacement language" subjects.
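For completeness, here is a minimal sketch of the itemised percentage-error computation that underlies Table 4.8 and the item 27 figures. The response matrices are simulated placeholders, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated 0/1 responses (1 = error recognised correctly) to 40 items
# for the NTL1 sub-group (N=20) and the L2 group (N=38).
ntl1_responses = (rng.random((20, 40)) > 0.4).astype(int)
l2_responses = (rng.random((38, 40)) > 0.75).astype(int)

def percentage_error(responses: np.ndarray) -> np.ndarray:
    """Percentage of the group getting each item wrong."""
    return 100 * (1 - responses.mean(axis=0))

for item in (8, 19, 27):
    idx = item - 1
    print(f"Item {item}: NTL1 {percentage_error(ntl1_responses)[idx]:.0f}%   "
          f"L2 {percentage_error(l2_responses)[idx]:.0f}%")
```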

4.8.1 Rater reliability among educators of teachers of English

This section is published as an article: "Rater reliability in language assessment: the bug of all bears", System, 28, 1-23. If you request, I shall scan this section from the original thesis and post it on WordPress.

4.9 Summary of Chapter 4

The statistical results were reported. High correlations were found between the tests and there was a substantial difference between the L1 and L2 groups. Reasons were given for not treating the L1 and L2 groups as separate populations in the correlational analysis. The dangers of comparing tests were also discussed.

Singled out for special attention was interrater reliability. The lack of interrater reliability is arguably the greatest problem in assessment because it is often the cause, though indirectly, of student failure – and success! It is on the issue of interrater reliability that matters of validity and reliability come to a head, because it brings together in a poignant, and often humbling, way what is being (mis)measured, and how it is (mis)measured. The next chapter deals with the battery of proficiency tests as predictors of academic achievement.
