Chapter 1: Scope of the Study


1.1 Introduction: The problem and purpose of the study

1.2  Psychometrics and norm-referenced testing

1.3  Summative assessment

1.4  The One Best Test

1.5  Hypotheses of the study

1.6  Historical and educational context

1.7  Measures used in the study

1.8  Method overview

1.9  Preview of Chapters 2 to 6

1.10 Summary of Chapter 1

1.1 Introduction: The problem and purpose of the study

Language testing draws on three areas: (1) the nature of language, (2) assessment and (3) language ability.[1] Language ability is closely related to language proficiency, which is a key term in this study. (The relationship between ability and proficiency is explained in section 2.2 ff).

Central concepts in the measurement of language ability are: (1) validity (what one is measuring), (2) reliability (how one is measuring), (3) practicability (the economics of time and expense) and (4) accountability (why one is testing). Even if a test is judged to be valid and reliable, it is of little use if it is not practicable.

Owing to our ignorance of the processes of language learning, and of learning processes in general, much of what we know about language testing, and therefore also about teaching, remains tentative. (“Language testing is rightly central to language teaching”[2]). A major obstacle in test development has been the lack of agreement on what it means to know a language, on what aspects of language knowledge should be tested – and taught – and on how they should be tested and assessed. Accordingly, it is not always easy to know why one is testing. As far as the how and the what are concerned, the more explicit, i.e. the more accurate, or reliable, one tries to make a test, the more uncertain one becomes about (the validity of) what one is testing.

The study consists of  three parts:

Part I. A large part of this study is devoted to statistical measurement in assessment, specifically the relationship between communicative competence and traditional testing theories insofar as these shed light on (1) the construct of language proficiency and (2) the use of proficiency tests as predictors of academic achievement. Statistical measurement in scoring procedures is also closely related to the structure and administration of tests. This study re-examines and defends this interdependence in terms of the three key notions in testing: validity, reliability and practicality.

Part II. The prediction of academic achievement where English proficiency tests are used as the predictors. A longitudinal study is undertaken of the prediction of academic achievement from Grade 7 to Grade 12.

Part III. Implications of the study for procedures of assessment of language proficiency.

The educational context of this study is a High School in the North West Province, which will be referred to as MHS, where I taught and did language research for over seven years (January 1980 to April 1987).

The primary focus of the study is not on what “really matters” in first or second language proficiency and academic achievement[3], or, put another way, on “develop[ing] an adequate theoretical framework for relating language proficiency to academic achievement”.[4] Neither does it deal with individual differences in cognitive styles of learning.[5] Nor does it dwell on the many causes of academic failure. The longitudinal part of the study is much more about prediction than about causes of academic failure. I do, however, refer to causes of academic failure where relevant. For example, one of the causes of academic failure (and success!) that this study is particularly interested in is the lack of scoring consistency among raters, which is arguably the greatest problem in assessment.

There are two urgent needs in minority education:

(1) to pursue fundamental research on the nature of language proficiency and how it can be measured, and (2) to provide teachers with up-to-date knowledge of language proficiency assessment so they can improve their classroom assessment practices.[6]

The term minority has much more than a numerical meaning. In South Africa the majority of learners use English as an additional/second language, but the tradition is to refer to such learners as originating from minority language backgrounds. The term has an obvious discriminatory ring that implies that some acceptable level has not yet been reached. Yet, tests have to distinguish between levels of proficiency for them to have construct validity. Chapters 3, 4 and 5 deal with the statistical issues of assigning people to the same or different groups, and Chapter 6 deals with the educational and political implications of doing so. Two main levels of proficiency are examined, which are given the well-known labels of L1 and L2. The L1 and L2 labels are central to the study and are used differently from their normal connotation of users of a first language and learners of a second language, as I shall explain shortly. These well-known terms, together with the terms mother tongue and native language, have been the occasion of much controversy, which I discuss in section 6.2. In the empirical investigation the labels L1 and L2 refer to the sample of subjects (i.e. informants) who take the subject English as a First Language and English as a Second Language, respectively, at MHS. This definition of L1 and L2 should be kept in mind throughout the study. In Chapter 6 (section 6.2), various other definitions of “L1” and “L2” are examined.

The sample of subjects is described in detail in Chapter 3. Tables 3.1 and 3.2 give a clear explanation of the sample of subjects, and it may be useful to refer repeatedly to these two tables, which serve as guideposts to the description of the sample.

Psychometric questions, and discrete-point and integrative testing, have been discussed at length in the literature on language testing and assessment (the distinction between testing and assessment is explained shortly). So one may ask what the need for, and the merit and originality of, devoting a PhD to such old and outdated issues is, when they have not been a research issue in language assessment for over 15 years. There is indeed a pressing need, because although communicative methods have become the prevalent approach to testing with a “richer conceptual base for characterizing the language abilities to be measured, it has presented language testers with a major challenge in defining these abilities and the interactions among them with sufficient precision to permit their measurement.”[7] It is this problem of reconciling authentic subjectivity and objective precision that is the major problem in testing, indeed the major problem of cognition and language.[8] The authenticity issue has wider ramifications: it is central not only to testing but also to syllabus design and materials development. This study focuses on testing only.

In spite of decades of attempts to define it, the how[9] and the why[10] of language proficiency remain a conundrum. Although we may no longer stand before an “abyss of ignorance”[11] and may be able to agree with Alderson (in Douglas[12]) that language testing has “come of age”, there are still many problems in language testing, the greatest one being, I suggest, the problem of reliability[13] and specifically rater reliability (see Alderson and Clapham’s case studies of this problem[14]). There are two kinds of rater reliability: interrater reliability and intrarater reliability. These are dealt with in section 2.9.1.


This problem is not a surprising one because language is closely connected to human rationalities, imaginations, motivations and desires, which comprise an extremely complex network of biological, cognitive, cultural and educational factors. As a result, all language testing theories are inadequate owing to the difficulties involved in devising tests that test authentic language reception and production. This does not mean that we should stop measuring until we’ve decided what we are measuring. We do the best we can by taking account of generally accepted views of the nature of language proficiency, of modern views and dated ones. In the modern literature on testing there seems to be an overemphasis on up-to-date theories, which gives the impression that “what is dated is outdated”.[16] Widdowson’s up-to-date admonition that we should take dated views more seriously is taken to heart in this study.

What is a test? It is “the most explicit form of description, on the basis of which the tester comes clean about his/her ideas”.[17] What all testers are looking for are systematic elicitation techniques on which one can base useful decisions. The three underlying issues in testing are: to infer abilities, to predict performance and to generalise from context to context[18].  This means that tests should be valid, reliable and practicable. Communicative testers would add the notions of “impact” (i.e. face validity) and “interactionist”.[19] Opponents of discrete-point tests (such as grammar tests) and integrative tests (such as cloze tests and dictation tests) would probably concede that such tests are reliable and practical, but they would argue that they are not valid, i.e. they tell us little or nothing about the learner’s knowledge of authentic language. I shall argue that, on the contrary, they indicate a great deal about authentic language and that these old issues are not outdated and are still worthy of attention.

I use an analysis of a battery of English proficiency tests to substantiate my theoretical position. I then examine the predictive validity of the battery of English proficiency tests. Part of the predictive investigation involves a comparison of the predictive validity of the school reports from the former schools of entrants to the School with the predictive validity of the English proficiency tests. These reports were the main criterion for admission to the School. Few entrants with school report aggregates under 60% were admitted.
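Predictive validity is reported in this study as a correlation between a predictor (here, an entrance proficiency score) and a criterion (a later measure of academic achievement). The sketch below is a minimal illustration of that computation only; the scores are invented for illustration and are not the study's data, and SciPy's pearsonr is simply one convenient way of obtaining the coefficient.

# Minimal sketch: predictive validity as a Pearson correlation between an
# entrance proficiency score (predictor) and a later aggregate (criterion).
# The scores below are hypothetical, not the study's data.
from scipy.stats import pearsonr

proficiency_gr7 = [45, 52, 58, 61, 63, 67, 70, 74, 78, 85]
aggregate_gr12  = [40, 50, 48, 55, 60, 58, 66, 70, 72, 80]

r, p = pearsonr(proficiency_gr7, aggregate_gr12)
print(f"predictive validity coefficient r = {r:.2f} (p = {p:.3f})")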

A curriculum framework consists of the following components[20]:

–  Needs analysis

–  Objectives

–  Materials

–  Teaching

–  Testing

Accordingly, the curriculum is concerned with the syllabus as well as with everything to do with pedagogical matters, i.e. teaching what to whom, when and how.[21] Syllabus is defined as the content, and sequence of content, of the programme selected in order to make learning and teaching effective.[22] Although testing is the last component in the curriculum framework, this is only so chronologically, not logically, because testing permeates the whole of the curriculum. This is why there is the possibility – and the temptation, perhaps justifiably so – of teaching to the test.

A major part of testing is concerned with assessment. In this study I use the term tests to refer to “elicitation techniques”[23] and the term assessment to refer to the methods used to measure or analyse test results. In assessment one is concerned with the control of rater judgements and scoring techniques. (Assessment is discussed in detail in section 1.3). Thus assessment, quantitative or qualitative, is not an intrinsic property of a test but a method one uses to measure or analyse the test results.

There are a variety of language test uses (Pollitt in Yeld[24]). The four basic uses are as follows[25]:

–  Proficiency tests, which evaluate present knowledge in order to predict future achievement, usually at the beginning of a course of study. Proficiency tests are based on knowledge that has been gained independent of any specific syllabus but not independent of typical syllabuses because the knowledge to be tested must have been gained from some syllabus or other.

– Achievement tests, which evaluate how much has been learnt of a particular syllabus, where the focus is on success, usually at the end of a teaching programme.

– Diagnostic tests, which evaluate points not yet mastered, where the focus is on failure and therapy. Diagnostic tests, therefore, may be considered to be the reverse of achievement tests.[26] Proficiency tests often involve diagnosing items that have not been mastered, and therefore diagnostic testing may be part of proficiency testing.

– Aptitude tests, which evaluate abilities for language mastery, and are thus, like proficiency tests, of predictive value. Unlike the other three kinds of test use, aptitude tests have no specific or general content, and are thus difficult tests to compile. They require, arguably, the most knowledge and care in their construction and application, for it is far worse to be told that one has no aptitude than to be told that one has low proficiency or has failed an achievement test. No aptitude means no hope at all, unless it is possible to have potential without aptitude.

Thus, both proficiency and achievement tests are concerned with present knowledge. It may be that a proficiency test contains material previously contained in an achievement test, but this difference is irrelevant to the validity of the proficiency test because, unlike an achievement test, a proficiency test is not concerned with whether the content of a test was previously taught. The American Council on the Teaching of Foreign Languages (ACTFL) Proficiency Guidelines are a case in point:

Because these guidelines identify stages of proficiency, as opposed to achievement, they are not intended to measure what an individual has achieved through specific classroom instruction but rather to allow assessment of what an individual can and cannot do, regardless of where, when, or how the language  has been learned or acquired: thus the words learned and acquired are used in the broadest sense.[27]

Proficiency is concerned with what somebody knows and can do here and now. Achievement should ultimately be concerned with proficiency as well. That is why Spolsky[28] omits the term achievement in the following definition of language tests: “Language tests involve measuring a subject’s knowledge of, and proficiency in, the use of language.”

There are four important considerations in language testing[29]:

1.  How valid is the test?

2. How easy is it to compose?

3. How easy is it to administer?

4. How easy is it to mark?

The first consideration, which is concerned with the purpose of a test, is the most important theoretical issue in testing. Ur feels so strongly about practicability that her other three considerations for choosing a test have to do with practicability. The fourth is also related to rater reliability. A test may be everything communicative testers require, but it would still be no good if it took too long or was too difficult to administer and assess. The more objective the test, the less the danger of rater unreliability. An essay test is a supreme example of a subjective test, because it is vulnerable to fluctuations in judgements between raters. The problem is finding the appropriate balance between the different testing considerations, where it is difficult, indeed impossible, to give all of them equal prominence.

The main reason why the value of discrete-point tests and of many integrative tests, such as cloze tests and dictation tests, must be reassessed is their practicality; if only the problem of the authenticity of these tests could be solved. The basic problem is whether indirect tests such as grammar tests, cloze tests and dictation tests can predict real-life performance. Many authors (discussed at length in the study) equate real-life performance with authentic language, and thus reject the notion that an indirect elicitation procedure of real-life language can be authentic. Of course, this problem of authenticity is not limited to indirect language tests but applies to all kinds of indirect tests, e.g. intelligence tests. A major part of this study is concerned with the meaning of authenticity in testing.

Every test is an operationalisation of certain beliefs and values about language, whether the test is called authentic or not. These beliefs and values determine to a certain extent our mental and emotional reactions to language and to knowledge in general. What is required in this study is to justify the beliefs I hold about discrete-point tests, integrative tests and the necessary psychometric methods of assessment that they imply.

1.2 Psychometrics and norm-referenced testing

In language testing the opposition to psychometrics is closely connected to the “suspicion of quantitative methods”[30] and the opposition to “reductionist approaches to communicative competence”.[31]

The history of the quantitative/qualitative controversy can be viewed from two diametrically opposite angles: (1) qualitative research has been dominated by quantitative research for many decades and is only in recent years becoming accepted as a legitimate scientific approach[32], or (2) qualitative research has for more than two decades been challenging quantitative methods and also setting itself up as the only legitimate form of research.[33]

Galton’s view is that it is the scientists’ job to “devise tests by which the value of beliefs may be ascertained, and to feel sufficiently masters of themselves to discard contemptuously whatever may be found untrue”[34] (Rushton 1995; his frontispiece). For Galton, tests must be statistically validated. Although I do not share Galton’s sweeping faith in statistics, statistical measurement in language testing has been given an undeserved bad press, e.g. by Spolsky[35], Lantolf and Frawley[36] and Macdonald[37].

The increasing number of studies in purely ethnographical/sociolinguistic approaches to language proficiency assessment[38] is witness to the opposition to the objectivist, or positivistic, or reductionist methods of psychometric research. For Harrison, psychometric measurement is inappropriate because its apparent exactitude is illusory:

Testing is traditionally associated with exactitude, but it is not an exact science…The quantities resulting from test-taking look like exact figures – 69 per cent looks different from 68 per cent but cannot be so for practical purposes, though test writers may imply that they are distinguishable by working out tables of precise equivalences of test and level, and teachers may believe them. These interpretations of scores are inappropriate even for traditional testing but for communicative testing they are completely irrelevant. The outcome of a communicative test is a series of achievements, not a score denoting an abstract ‘level’.[39]

Thus, “the quantities resulting from test-taking [which] look like exact figures” (in the quotation above) appear to measure objectively, but in fact they measure subjectively. (See Morrow[40] for a similar view). Lantolf and Frawley[41] maintain that

[w]hat must be done is to set aside the test-based approach to proficiency and to begin to develop a theory of proficiency that is independent of the psychometrics. Only after such a theory has been developed and is proven to be consistent and exhaustive by empirical research should we reintroduce the psychometric factor into the picture, with the full realization that such a reintroduction may not be possible, given our earlier remarks on the scalability of human behavior.

Spolsky was contemptuous of “psychometrists”:

In the approach of scientific modern tests, the criterion of authenticity of task is generally submerged by the greater attention given to psychometric criteria of validity and reliability. The psychometrists[42] are ‘hocus-pocus’ scientists in the fullest sense; in their arguments, they sometimes even claim not to care what they measure provided that their measurement predicts the criterion variable: face validity receives no more than lip service.[43]

Spolsky’s recent “postmodern” approach to psychometrics is that it should be used in conjunction with “humanist” approaches.[44] The view in this study, which for some researchers is taken for granted but for others is highly contested, is that “language testing cannot be done without adequate statistics”.[45]

Psychometric assessment in language assessment traditionally means norm-referenced assessment, which is not concerned with individual scores but with the dispersion of scores within a group, where the concern is with maximising individual differences between test takers on the variable that is being measured.[46] For many, psychometrics has become synonymous with quantitative and statistical measurement[47], and this is how I use psychometrics in this study.

The perennial problem in language testing, indeed of all testing, is finding a balance between reliability and validity. Spolsky’s coupling of validity with psychometrics may create the impression that authenticity and validity are separate issues. Of course, there is much more to validity than validity coefficients. The major issue in validity has to do with specifying what authentic tests are. Thus, psychometric data only become digestible  – and palatable – when theory gives meat to the number-crunching.

The term psychometric tests has another meaning. For example, at a conference on academic development, where I presented a paper on this topic[48], a member of the audience said that she was “boiling” because psychometrics, she insisted, was far more than norm-referenced measurement. Her view was that psychometric tests measured the psyche (the literal meaning of psychometrics) and were embedded, she insisted (correctly), in the flesh-and-blood context of individuals, and that psychometric tests therefore involve far more than a comparison between individuals and groups. I was taken aback by her outburst, not because she did not have a valid definition of psychometrics, but because, in that different context, this definition of psychometrics had never entered my mind, owing, no doubt, to my single-minded attention to what interested me.

The method used in this study is mainly quantitative, with the emphasis on norm-referenced testing. In language testing, as in second language acquisition research in general, quantitative measurement has been challenged for more than two decades by qualitative methods of research; indeed, qualitative research has been setting itself up as the only legitimate form of research.[49] Terre Blanche distinguishes between “two different constituencies” of qualitative researchers:

those who would use qualitative methods as a humanist, emancipatory tool to access authentic subjective experiences so easily censored out by more hard-nosed quantitative methods, and those who want to use qualitative methods such as discourse analysis to critique the semantic practices of both ‘scientific’ and ‘humanist’ psychologies.[50]

Both constituencies reject the domination of the norm over the individual.

“Norm” has two meanings, which are sometimes not distinguished. For example, at the conference of the National Association of Educators of Teachers of English (NAETE) at Potchefstroom (September 17-18, 1998) I was discussing the concept of “norm” with Johan van der Walt of Potchefstroom University of Christian Education. We were in one accord that the individual without the norm is an abstraction. It was only while Van der Walt was presenting his paper that I realised that we were using the same term for two distinct concepts.

In this regard, consider the following extract from a repartee between Johan van der Walt and Colyn Davey at the 1998 NAETE conference[51], which was concerned with the topic of establishing norms of English.

Van der Walt.  Then you agree that there should be a norm.

Davey. Yes but learners should be able to choose the norm they prefer.

The question arises of whether it is possible to use the term “norm” in the sense of both (1) Van der Walt’s imperative of conforming to a standard, or norm, by which he means “Standard” English, and (2) Davey’s imperative of freedom to choose the norm that one prefers, which could be “Standard” English or, say, institutionalised black South African English (IBSAE), which would comprise ubiquitous constructions such as “I am having a problem”, “He write English perfectly” and “When I was in town I see my English teacher”.[52] It is indeed possible to use the term “norm” in these two senses, but if it is, the two meanings of “norm” should be clearly distinguished, which was not the case in the discussion between Van der Walt and Davey.

To clarify the distinction between these two meanings it is necessary to introduce the notions of criterion-referenced and norm-referenced tests. Criterion-referenced tests are concerned with how well an individual performs relative to a fixed criterion, e.g. how to ask questions. Norm-referenced tests are concerned with how well an individual performs compared to a group. This is traditional psychometric testing. Now, both Van der Walt and Davey believe in norms; the former a standardised norm, the latter an unstandardised norm. So, in fact, both kinds of norm are concerned with how well an individual performs relative to a fixed criterion, which is what criterion-referenced tests are concerned with. The difference is that Van der Walt’s norm is imposed from above (the standardised norm), whereas Davey’s norm is bottom-up, where the group chooses the norm it wishes to aspire to. Neither kind of norm has anything to do with norm-referenced tests, which are not concerned with whether one is able to choose one’s norm (Davey) or not (Van der Walt), but, as defined above, with how well an individual performs in a group, i.e. with the differences in ability revealed in some form of “quantitative summary”.[53]

Norm-referenced testing, which is a key notion in this study, can be distinguished from criterion-referenced and individual-referenced testing (a short computational sketch after the list illustrates the three reference frames):

1. Norm-referenced tests are concerned with how well an individual performs compared to the group of which he or she is a member. This is traditional psychometric testing.

2. Criterion-referenced tests are concerned with how well an individual performs relative to a fixed criterion, e.g. how to ask questions. This is what Cziko calls “edumetric” testing.[54]

3. Individual-referenced tests are concerned with how individuals perform relative to their previous performance or to an estimate of their ability.
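To make the three reference frames concrete, the short sketch below scores one hypothetical learner in each of the three ways; all figures, including the criterion cut-off, are invented for illustration only.

# One raw score interpreted three ways (hypothetical figures).
from statistics import mean, stdev

group_scores = [45, 50, 55, 58, 60, 62, 65, 70, 75, 80]   # the learner's group
learner_now, learner_before = 65, 54
criterion_cutoff = 60                                      # a fixed standard

# 1. Norm-referenced: position relative to the group (here a z-score).
z = (learner_now - mean(group_scores)) / stdev(group_scores)

# 2. Criterion-referenced: pass/fail against the fixed criterion.
meets_criterion = learner_now >= criterion_cutoff

# 3. Individual-referenced: gain relative to the learner's own earlier score.
gain = learner_now - learner_before

print(f"z = {z:.2f}; meets criterion: {meets_criterion}; gain: {gain}")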

Strictly speaking it is not the test that is norm-referenced or criterion- referenced or individual-referenced, but the purpose for which it is used. Similarly, tests in themselves are not valid, but rather it is the purpose that they are used for that makes them valid. (Validity is discussed in section 2.8).

The idea that norm-referenced tests, on the one hand, and criterion-referenced and individual-referenced tests, on the other, are mutually exclusive is based on two contrasting philosophical positions: the former “positivistic”, the latter “humanistic”. The former is interested in what makes people different; the latter finds it morally reprehensible to compare people and thus focuses on what they have in common. The latter view is represented most vocally in South Africa by the protagonists of “outcomes-based education”, where “learner’s progress will be measured against criteria that indicate attainment of learning outcomes, rather than against other learners’ performances”.[55]

I shall argue that norm-referenced tests are indispensable. Norm-referenced tests are important because, without data on the variance between individuals within a group, it is not possible to separate what an individual knows (which is the concern of criterion-referenced tests) from what other people know.

Individual-referenced tests, too, cannot be interpreted in isolation from what other people know. The differences between individuals actually clarify the matter under test. In other words, the construct validity of a test is dependent on some people doing well and others doing less well, for if everybody did equally well, we would have little idea of what we were testing. That is the norm-referenced view, which is based on the notion that differences in nature, of which human abilities are a part, are distributed as a bell curve. In what can be regarded as a key quotation in the defence of psychometrics in this study, Rowntree explains the importance of group statistics, i.e. norm-referenced tests, in assessment:

Consider a test whose results we are to interpret by comparison with criteria. To do so we must already have decided on a standard of performance and we will regard students who attain it as being significantly different from those who do not…The question is: How do we establish the criterion level? What is to count as the standard? Naturally, we can’t wait to see how students actually do and base our criterion on the average performance of the present group: this would be to go over into blatant norm-referencing. So suppose we base our criterion on what seems reasonable  in the light of past experience? Naturally, if the criterion is to be reasonable, this experience must be of similar groups of students in the  past. Knowing what has been achieved in the past will help us avoid setting the criteria inordinately high or low. But isn’t this very close to norm-referencing? It would even be closer if we were to base the criterion not just on that of previous students but on students in general.[56]

What we think is going on in each individual’s invisible mind can be scientifically inferred and described only when one has some idea of what is going on in many individual minds, i.e. what is going on in a group. Emphasising the individual over the group or vice versa is “somewhat metaphysical [because both] types of test sampling (for that is what norm and criterion referencing do: they sample) need one another”.[57] Recall the two meanings of “norm” discussed earlier: “norm” can refer to a criterion (!) such as Standard English or to the comparison between individuals within a group. The latter is the concern of norm-referenced tests.

Ranking individuals and generating scores is only one purpose of norm-referenced tests; another purpose is to gain insight into the nature of the constructs under examination, which cannot be achieved if an individual is not compared with what other individuals do. (I elaborate on the complementary role of norm-referenced and criterion-referenced tests in the discussion of the inadequacies of correlational data in section 4.7).

1.3  Summative assessment

There are different kinds of assessment and a diversity of definitions. I  discuss the descriptions of Rea[58] and Rowntree.[59]

TABLE  1.1

Rea’s Schema of Assessment


                       Formative Assessment              Summative Assessment
Quantitative Methods   Assessment                        Evaluation
Qualitative Methods    Appraisal

Rea[60] uses the term “evaluation” to refer to formal testing activities, which are external to the teaching situation, and which involve “test scores”. She uses the terms “assessment” and “appraisal” to refer to activities which are internal to the teaching program. Grades are given for assessment but not for appraisal. In Rea’s scheme, assessment and evaluation both use  measurement, i.e. quantitative methods, while appraisal does not.

Before I comment further on Rea, it is appropriate to say something about “evaluation”, a term used by both Rea and Rowntree. There are many different definitions of evaluation. Bachman[61] defines evaluation as the “systematic gathering of information for the purpose of making decisions”, while Brown defines evaluation as

the systematic collection and analysis of all relevant information necessary to promote the improvement of a curriculum, and assess its effectiveness and efficiency, as well as the participants’ attitudes within the context of the particular institutions involved.[62]

Recently, “evaluation” has been contrasted with “grading”.[63] Dreyer argues that grading, i.e. summative testing, causes people to fail (see the conclusion to the study, section 6.7, for comment).

To return to Rea: what may be confusing in her schema is that “assessment” is used generically to cover everything to do with testing, as well as specifically to refer to “formative quantitative assessment”. Consider Rowntree’s schema:

TABLE 1.2

Rowntree’s Schema of Assessment and Evaluation

                       Assessment (Focus on learner)            Evaluation (Focus on teaching)
Quantitative Methods   Summative                                Summative
Qualitative Methods    Formative (Diagnostic appraisal)         Formative

Rowntree’s “assessment” – which he also uses in this generic way – is “put[ting] a value on something”, which translates into everything concerned with “obtaining and interpreting information” of any kind about another person in order to “[try] and discover what the student is becoming or has accomplished”.[64] Rowntree’s “formative (pedagogic) assessment” emphasises “potential”, while his “summative (classificatory) assessment” emphasises “actual achievement”.[65]

For Rowntree, “evaluation” is “an attempt to identify and explain the effects (and effectiveness) of the teaching.”[66] Rowntree’s “formative evaluation is intended to develop and improve a piece of teaching until it is as effective as it possibly can be…[s]ummative evaluation on the other hand, is intended to establish the effectiveness of the teaching once it is fully developed.”[67] Thus Rowntree’s “formative evaluation” is concerned with the washback effect of a syllabus and/or teaching programme, while “summative evaluation” is concerned with “terminal tests and examinations coming at the end of the student’s course, or indeed by any attempt to reach an overall description or judgement of the student (e.g. in an end-of-term report or a grade or class- rank”.[68]

Rea and Rowntree differ on the scope of quantitative methods: in Rea’s schema quantitative methods are used in both formative and summative assessment, while in Rowntree’s schema quantitative methods are used only in summative assessment. In this study, I use Rowntree’s meaning of summative assessment to involve (1) quantitative methods of assessing (2) the learner only (and not the teacher). I am concerned with quantitative methods in summative assessment. An important point: although I am not concerned with assessing the teacher, I am very much concerned with how a teacher (i.e. a rater) assesses a learner; in other words, with rater reliability, which consists of two major kinds of judgement: (1) the order of priority that individual raters give to performance criteria (criteria such as grammatical accuracy, appropriateness of vocabulary and factual relevance) and (2) the agreement between raters on the scores that should be awarded if or when agreement is reached on how to weight the different criteria.[69] Rater reliability is discussed in sections 2.9.1, 4.2, 4.7 and 5.5.
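As an indication of what interrater reliability looks like in quantitative terms, the sketch below correlates two raters' scores on the same set of essays and reports their exact-agreement rate; the scores are invented, and correlation is only one of several possible indices (agreement percentages and kappa coefficients are common alternatives).

# Minimal sketch: interrater reliability for two raters scoring the same
# ten essays out of 20 (hypothetical scores).
from scipy.stats import pearsonr

rater_a = [12, 14, 10, 16, 18, 11, 15, 13, 17, 9]
rater_b = [11, 15, 12, 15, 17, 10, 16, 12, 18, 10]

r, _ = pearsonr(rater_a, rater_b)
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"interrater correlation r = {r:.2f}; exact agreement = {exact_agreement:.0%}")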

Summative assessment is “terminal”, “formal” and “external”, and is only concerned with the beginning and end of a course, i.e. with classifying individuals in terms of numerical products, or scores.[70] Weir[71] believes that there is a more pressing need for research in formative testing as opposed to research in summative testing. There is still a pressing need for research in summative testing, even though mainstream language testing, in South Africa, at least, is taking a different turn, e.g. “outcomes-based education”. This is discussed in section 5.5.

1.4 The One Best Test

A major empirical problem in language testing is establishing  valid and reliable criteria for the assessment of  language proficiency, which is basically concerned with fluency and accuracy. Three important issues in language testing are:

1. The kinds of tests that should be used to assess levels of language proficiency.

2.  The relationship between statistical significance (numerical data) and their meaning (information).

3. Whether language proficiency tests can validly predict academic achievement.[72] Academic achievement in this study is represented by (1) end-of-year aggregate scores and (2) pass rate.

The above three issues are directly related to the search for the “best” test. In the 1970s a major issue in language testing was whether it was possible to find the “One Best Test”. The “One Best Test” question is closely related to the question of whether language proficiency consists of a unitary factor, analogous to the g factor in intelligence, or of a number of independent factors. Bachman and Palmer relate the concerns they had 25 years ago:

[W]e shared a common concern: to develop the “best” test for our situations. We believed that there was a model language test and a set of straightforward procedures – a recipe, if you will – that we could follow to create a test that would be the best one for our purposes and situations.[73]

Yet, almost two decades ago, Alderson had already graduated from this kind of thinking and suggested that

regardless of the correlations, and quite apart from any consideration of the lack of face validity of the One Best Test, we must give testees a fair chance by giving them a variety of language tests, simply because one might be wrong: there might be no Best Test, or it might not have been the one we chose to give, or there might not be one general proficiency factor, there may be several.[74]

The Unitary Competence Hypothesis (UCH), which is closely associated with the “best test” question, is a very important issue. High correlations between different kinds of tests show that the UCH, in its weak form, remains a force to be dealt with.[75] The weak form of the UCH adopts an interactionist approach between global and discrete components of language. Oller, in his famous paragraph, describes this approach:

[N]ot only is some sort of global factor dependent for its existence on the  differentiated components which comprise it, but in their turn, the components are meaningfully differentiated only in relation to the larger purpose(s) to which all of them in some integrated (integrative?) fashion contribute. (See  also Oller and Khan[76] and Carroll[77] for similar views).[78]

If one is no longer searching for that one Grand Unified Test (GUT), one should still be looking for good tests, indeed for the best tests available. This implies, I suggest, that what one is looking for has an “objective” reality, which, of course, does not mean that we can grasp it completely.[79]

If we have given up on finding or constructing that elusive (and illusory?) one best test, we are nevertheless looking, indeed are compelled to look, for a plurality of the best tests that we can find. The problem remains what tests to choose to test language proficiency and, ultimately, in this study, to predict academic achievement. A useful test is one that “correspond[s] in demonstrable ways to language in non-test situations.”[80] These non-test situations are described in the “new” paradigm of language testing as authentic, direct, real-life, natural(istic) or communicative. An important part of this study consists of a critical analysis of these terms.

1.5 Hypotheses of the study

The following three null hypotheses are investigated:

1. Discrete-point tests and/or integrative tests are not valid measures of levels of language  proficiency.

2. Discrete-point tests and/or integrative tests are not valid predictors of academic achievement.

3. Many of the Grade 6 reports from former schools that were used as criteria for admission to MHS were not valid predictors of academic achievement. Many of the entrants with high Grade 6 report scores did not get beyond Grade 9 at MHS. I investigate the question of indiscriminate advancement in DET[81] (Department of Education and Training) schools in the light of the consistently poor Grade 12 (“matric”) results from most former DET schools over the years. The predictive validity of the tests is examined in an attempt to shed light on this question. (I shall henceforth refer to “DET schools” and not “former DET schools”, because at the time the investigation in this study was conducted the DET was still in existence). A major issue in this study is the relationship between the predictive validity of the tests and the reliability (i.e. the consistency and accuracy) of these DET reports (section 5.5).

Although the study is not directly concerned with investigating mother-tongue[82] proficiency, it is referred to when required.

Most of the tests in this study belong to the “old paradigm”. I did not devise new tests because doing so was not germane to the objective of this investigation, which was to examine the validity, reliability and practicality of using traditional tests to predict academic achievement. The fact that most of these tests were already established tests meant that I had more time to devote to this objective.

Although it is possible that annual predictions between English proficiency and academic achievement would yield higher correlations than long-term predictions, the aim in this study is to try and find out what chance Grade 7 learners who entered the School in 1987 would have of  passing Grade 12.

1.6 Historical and educational context

MHS has already had 19 years’ experience of dealing with linguistic, cultural and educational problems that are only now beginning to surface in many schools in South Africa. English is used as the single medium of instruction at the School. The School offered the Joint Matriculation Board (JMB) syllabus up to 1992, and the Independent Examinations Board (IEB) syllabus after 1992. MHS was the only state school in a wide area containing hundreds of DET secondary schools that offered the JMB syllabus. This study shows how DET pupils coped at such a school.

One problem that the School has been dealing with since its inception in 1980 is how to reconcile affirmative action with academic merit. By affirmative action I mean the endeavour to put right the imbalances of the past, where the majority of South Africans were discriminated against on the basis of race. The School’s policy is to provide education for advantaged as well as disadvantaged learners, where the latter are given the opportunity to learn in an advantaged school situation. Disadvantaged learners are those who have suffered educational, social and economic deprivation – often caused by political injustice – and this is also what the School meant by the term. It is also, paradoxically, the School’s policy to accept learners only on merit, as indicated by high scores on former school reports. One problem with affirmative action is that it is often difficult to marry the idea of redress with the idea of academic merit (high achievement, in this case), potential or aptitude.

This difficulty was evident in the School’s Prospectus of 1986, which informed parents that their children “are admitted solely on the basis of merit” and that “[c]andidates are considered on the basis of the results of an entrance examination and their previous school achievement”; by “merit” the School meant high scores on reports from previous schools. Thus, the School’s intention was to select only those candidates who could cope with a JMB-equivalent syllabus. Unfortunately, many learners dropped out along the way or were pushed out along the way by the system.

MHS’s policy was to use admission criteria. The tests in this study, although conducted after admission – during the first three days of the first school term – were partly concerned with the admission question, because I wanted to find out whether those who were admitted on the basis of their former school reports should have been admitted. Owing to the recent abolition of admission tests in South African state schools there would no longer be any point in trying to find the best admission tests. But it would certainly still be useful to find out (1) whether those learners who had been admitted to the School had an adequate level of English proficiency to perform in a school where English was the medium of instruction and (2) whether their former school reports were authentic, i.e. accurate, reflections of this level. As far as I am aware, former school reports, as is the general practice in all schools in South Africa, are still considered by the School to be an important indication of an entrant’s ability – if not a criterion for admission.

One would expect the School’s criteria for admission to have pinpointed those candidates who could not cope at the School, but this did not happen. Of concern at the School was the large number of failures in Grades 7, 8 and 9 among the DET learners. At the School there were no automatic internal promotions through the system, as is claimed to occur in many DET schools.[83] (This issue is dealt with in Chapter 7). When low achievers at the School failed, they often left without repeating a year. Many who failed at the School, whether they repeated a year or were eventually asked to leave owing to failure, did not manage to get beyond Grade 9. Table 1.3 shows the Grade 9 pass rate for three intake years.

TABLE 1.3

Grade 9  Pass Rate

        Number of learners in Grade 7    Passed Grade 9    % Passes
1       36 (1982)                        13                36.1
2       67 (1983)                        25                37.3
3       81 (1987)                        49                60.5
Total   184                              87                47.3

Row 3 is the sample used in the prediction of academic achievement in this study. It excludes learners who passed a grade and then left the School before reaching Grade 9 (N=5), and includes learners who failed between Grades 7 and 9 but passed Grade 9 at a later stage (N=9). Rows 1 and 2 do not take this into account.
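The percentages in Table 1.3 follow directly from the raw counts; the short check below simply recomputes them from the figures reported in the table.

# Recomputing the pass percentages in Table 1.3 from the raw counts.
cohorts = {"1982": (13, 36), "1983": (25, 67), "1987": (49, 81)}
for year, (passed, entered) in cohorts.items():
    print(f"{year}: {passed}/{entered} = {100 * passed / entered:.1f}%")   # 36.1, 37.3, 60.5
print(f"overall: 87/184 = {100 * 87 / 184:.1f}%")                          # 47.3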

Many learners who reached Grade 12 managed to achieve a matriculation exemption at MHS but obtained disappointing symbols, e.g. D and E symbols. Mcintyre remarks in this regard:

Observations at [MHS] over the last few years would tend to support Young’s[84] statement [about the low “language competence” of  Grade 12 learners]. However, at [MHS] the problem is slightly different in the sense that Matric students pass successfully but often with symbols that are disappointing. [85]

Mcintyre’s statement that “Matric students pass successfully” requires comment. Although it is correct that there was a high Grade 12 pass rate at the School, this does not take into account the high failure rate between Grade 7 and Grade 9: even though most Grade 12 learners passed, many others who started in Grade 7 dropped out along the way. This high failure rate is what had been occurring at the School since its inception in 1980. (I am concerned with the period 1980 to 1993). Table 1.4 shows the number of Grade 12 passes (1992) that originated from the group of Grade 7 learners (1987) used in this study.

TABLE 1.4

Grade 12  Pass Rate

Original number of learners in Grade 7 (1987)    Total Grade 12 passes from original Grade 7
79                                               39 (49.4%)

Table 1.4 takes into account those who failed and passed Grade 12 in the subsequent year (12 learners) and those who left the school during their schooling for reasons other than failure, for example, relocation.

The following criteria for admission to MHS provide important background information. Admission to the School was based on (1) the results of entrance tests administered in October of the previous year (1986) and (2) former school achievement as revealed by the Grade 6 reports from former schools. The School’s criteria for admission to Grade 7 consisted of:

– Grade 6 reports from former schools (the aggregate).

–  A Culture Fair Intelligence Test.[86]

– An English proficiency test, which consisted of a short essay of about half a page. I was not involved in the administration or marking of this test and so had no information about it.

– A mathematics proficiency test. As in the case of MHS’s English proficiency test, I had no information on this test.

The admission tests for the sample were written in October of the previous year (1986). The Grade 6 reports were considered by the School to be the most important criterion for admission. However, a few pupils were admitted with Grade 6 aggregates below 60%. Of the School’s admission criteria only the Grade 6 reports are used in this study. I was not able to obtain the scores of the School’s Grade 6 essay admission test. In any case, the admission essay test was marked by only one rater per protocol, and thus there would have been no way of establishing the interrater reliability of the School’s essay test.

With regard to the culture-fair test data, the original intention was to include these in the predictive investigation, but owing to the problematic (scientific and political) nature of intelligence tests and the fact that the use of these tests as predictors would not be directly pertinent to the topic, I decided to exclude these tests from this investigation. Suffice it to say that L2 learners who score above average on intelligence tests tend to be better at formal second language learning – and first language learning.[87]

The School’s policy is that at least half of all admissions should consist of disadvantaged learners. Disadvantaged does not mean low scoring, because the School selects on the basis of good performance as indicated by former (Grade 6) school reports. These disadvantaged entrants come from DET schools. The investigation will show that many of the Grade 6 reports (of former schools) that belonged to disadvantaged entrants were unreliable in the sense of being inconsistent with their language proficiency scores on the tests: the English aggregates on these Grade 6 reports were in many cases radically higher than the scores on the proficiency tests.

The full sample of subjects (N=86) is discussed in detail elsewhere, but for the moment I deal briefly with 70 subjects (Table 1.5). Compare the Aggregate and English scores of the Grade 6 reports of the following two groups of entrants to Grade 7 at the School, who comprise a major part of the sample of subjects used in this study:

(1) CM Primary School (N=33), which provided (at the time this research was conducted) most of the entrants who took English First Language as a subject at MHS. English was the official medium of instruction from Grade 1 at CM Primary School. The learners from this school were generally advantaged. As mentioned, disadvantaged learners are those who have suffered educational, social and economic deprivation.

(2) 28 DET schools (N=37), which provided the vast majority of the entrants who took English Second Language as a subject at the School. English was the medium of instruction from Grade 5 at DET schools. Entrants from DET schools were generally disadvantaged.

TABLE 1.5

Comparison of Grade 6 reports between CM Primary School

and 28 DET Schools (N=70)

                                               Aggregate Grade 6        English Grade 6
                                               Mean      STD            Mean      STD
CM Primary (N=33): mostly advantaged,
English used as a First Language               68.9      8.8            72.5      8.4
28 DET Schools (N=34): mostly disadvantaged,
English used as a Second Language               68.6      10.8           71.1      12.6
t Stat                                         -0.106                   -0.550
t Critical two-tail                             1.995                    1.995

The t-test in Table 1.5 shows that there was no significant difference between the two groups on either measure, because the absolute value of the t statistic is smaller than the critical value in both cases.
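For readers unfamiliar with the procedure, the sketch below shows how such an independent-samples t-test can be computed from the summary statistics reported in Table 1.5. The original analysis was not done with this code, and because the table’s means and standard deviations are rounded, the recomputed statistics will only approximate the reported values.

# Two-sample t-test from the summary statistics in Table 1.5 (rounded values,
# so the results only approximate the t statistics reported in the table).
from scipy.stats import ttest_ind_from_stats

# Grade 6 aggregate: CM Primary (N=33) vs 28 DET schools (N=34)
agg = ttest_ind_from_stats(mean1=68.9, std1=8.8, nobs1=33,
                           mean2=68.6, std2=10.8, nobs2=34, equal_var=True)
print(f"aggregate: t = {agg.statistic:.3f}, p = {agg.pvalue:.3f}")

# Grade 6 English: same two groups
eng = ttest_ind_from_stats(mean1=72.5, std1=8.4, nobs1=33,
                           mean2=71.1, std2=12.6, nobs2=34, equal_var=True)
print(f"english:   t = {eng.statistic:.3f}, p = {eng.pvalue:.3f}")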

This equivalence in Grade 6 report scores between these groups plays an important role in the arguments and predictions of subsequent chapters, where I investigate the reliability of the Grade 6 scores of the DET  entrants.

1.7  Measures used in the study

The measures used in the study are now briefly described.  The study only commenced after the intake of learners to the School, and thus these measures differed in purpose from the School’s criteria, which were admission criteria. A detailed description of the measures, e.g. format, instructions, layout,  is given in Chapter 3. For the moment I provide only a brief description of the measures:

I. English proficiency tests. Eight English proficiency tests were administered in January 1987 (Grade 7). I devised the essay tests myself, while all the other tests were obtained from various published sources. The English proficiency test battery consists of:

(i) Two cloze tests from Pienaar’s[88] “Reading for Meaning”.

(ii) Two dictation tests. These were two restored cloze tests from Pienaar.[89] The passages from Pienaar used for the cloze tests are different to the passages used for the dictation tests, but they both belong to the same level. (I explain later what I mean by “level”).

(iii) Two essay tests (devised by myself).

(iv) An “error recognition” test.[90]

(v) A “mixed grammar” test.[91]

The tests from Bloor et al. consist of multiple-choice items.  The “mixed grammar” test consists of  items that test a variety of structures, hence the term “mixed”.

I shall argue that although the tests used in the study, except for the essay test, may be out of fashion with many testers they are nevertheless still very useful for assessing language proficiency and predicting academic achievement.

II. Grade 6 end-of-year school reports from former schools. The Grade 6 aggregate scores are used.

III. End-of-year English scores and Aggregates from Grade 7 to Grade 11. These scores were obtained from the School’s mark schedules.

IV. Grade 12 results (of 1992 and 1993). These results are those of the JMB (1992) and the Independent Examinations Board (IEB; 1993). The reason why the IEB results are also taken into account is that included in the study are those subjects that failed once between Grade 7 and Grade 12, repeated a year and sat for the IEB Grade 12 examination in 1993. (The JMB matriculation examination ceased to exist after 1992).

1.8  Method overview

Some researchers separate statistical research from empirical research. For Lantolf and Frawley[92] empirical research and statistical measurement are distinct. In contrast, when Tremblay and Gardner[93] state that in their opinion “empirical investigation is essential to demonstrate the theoretical and pragmatic value” of research, their “empirical” research is firmly based on statistics, without which they would have very little of what they consider to be “empirical” research (see also Cziko’s[94] “empirically-based models of communicative competence”). Some empirical research is statistically based, while other empirical research (e.g. much ethnographical research) is not. I adopt Tremblay and Gardner’s view that statistics is an indispensable component of empirical research.

The empirical investigation consists of:

1. An examination of the structure and administration of the English proficiency tests.

2. A predictive investigation, in which the English proficiency tests are used to predict academic achievement from Grade 7 to Grade 12. The reliability of the Grade 6 achievement of entrants from former schools is also examined.

Under method  I subsume, as is usual in most studies, the following:

–  Subjects (sampling)

–  Structure of the measures.

–  Procedures of administration and scoring.

A  common design in empirical studies is that data analysis, results and discussion are each reported in separate sections. I depart from this traditional structure and follow Sternberg[95] who recommends that these be treated together. This is a wise arrangement for this study because the data analysis, discussion and results are closely connected.

A clarification of the following terms is in order: type, method, procedure. Sometimes type refers to such things as multiple-choice questions versus gap-filling tests; sometimes method refers to such things as cloze methods versus dictation methods, in other words, method is used to mean test. Then there is procedure, e.g. the cloze procedure, the dictation procedure, etc., which can also mean method or test. I shall use test to refer to an elicitation technique, and procedure to refer to the way in which a test is presented and scored.

1.9  Preview of Chapters 2 to 6

Chapter 2 deals with theoretical issues in the testing of language proficiency and academic achievement, where the main focus falls on assessment. The chapter comprises a review of the literature on the testing of language proficiency and an overview of key concepts such as assessment, validity and reliability.

Chapter 3 describes the sample of subjects and sampling procedures, and the structure and administration of the tests.

Chapter 4 presents and discusses the results of the tests.

Chapter 5 deals with the prediction of academic achievement, examines the reliability of the Grade 6 reports from previous schools and summarises the findings.

Chapter 6 discusses the implications of the study for language testing and presents the conclusions. The three main implications are: (1) the viability of the distinction between first language and second language, (2) the kind of tests or tasks that should be used in the future, and (3) the problem of rater reliability. Also discussed are a few contemporary initiatives to improve language assessment in South Africa.

1.10  Summary of Chapter 1

The purpose, problem, main topics, method, hypotheses and educational context of the study were specified. The study deals with the measurement of differences between learners in English proficiency and with assessing the reliability, validity and practicality of discrete-point and/or integrative tests as predictors of academic achievement. Central to the study is the argument that the “old paradigm” of discrete-point and integrative tests and the statistical methods required to measure them are very useful in language acquisition research and educational measurement. The next chapter deals with the theory of language testing, where I examine what it means to describe language behaviour and language tests as “authentic”.

Endnotes

[1]Davies, A. Principles of language testing, 1990, p.4.

[2]Ibid., p.2.

[3]Saville-Troike, M. ‘What really matters in second language learning for academic achievement.’ TESOL Quarterly, 18 (2), 199-219 (1984).

[4]Cummins, J. ‘Wanted: A theoretical framework for relating language proficiency to academic achievement among bilingual students’, in Rivera, C. (ed.). Language proficiency and academic achievement, 1984.

[5](1) Diller, K.C., Individual differences and universals in language learning aptitude, 1981. (2) Skehan, P. Individual differences in second language learning, 1989. (3) Skehan, P. A cognitive approach to language learning, 1998.

[6]Rivera, C. (ed.). The ethnographical/sociolinguistic approach to language proficiency assessment, 1983, p.xii.

[7]Bachman, L.F. ‘Assessment and evaluation.’ Annual Review of Applied Linguistics, 10 (1989), 210-226 (1990a), p.210.

[8]Lakoff, G., Women, fire and dangerous things, 1987.

[9]Bachman, L.F. Fundamental considerations in language testing, 1990b.

[10]Davies, A. Principles of language testing, 1990.

[11]Alderson, J.C. ‘Who needs jam?’, in Hughes, A. and Porter, D. Current developments in language testing, 1983, p.90.

[12]Douglas, D. ‘Developments in language testing.’ Annual Review of Applied Linguistics, 15, 167-187 (1995), p.176.

[13]Moss, P. ‘Can there be validity without reliability?’ Educational Researcher, 23 (2), 5-12 (1994).

[14]Alderson, J.C. and Clapham, C. ‘Applied linguistics and language testing: A case study of the ELTS test.’ Applied Linguistics, 13 (2), 149-167 (1992).

[15]Ibid., p.2.

[16]Widdowson, H.G. ‘Skills, abilities, and contexts of reality.’ Annual Review of Applied Linguistics, 18, 323-333 (1998), p.323.

[17]Davies, A. Principles of language testing, 1990, p.2.

[18]Skehan, P. A cognitive approach to language learning, 1998, p.153.

[19]Bachman, L.F. and Palmer, A.S. Language testing in practice, 1996, p.17.

[20]Brown, J.D. ‘Language programme evaluation: A synthesis of existing possibilities’, in Johnson, R.K. (ed.). The second language curriculum, 1989, p.235.

[21]Stern, H. H. Fundamental concepts of language teaching, 1983.

[22](1) Wilkins, D.A. ‘Notional syllabuses revisited.’ Applied Linguistics, 2 (1), 83-89 (1981), p.83.

(2) Brumfit, C.J. ‘Notional syllabuses revisited: A response.’ Applied Linguistics, 2 (1), 90-92 (1981), p.90.

[23]Ur, P. A course in language teaching: practice and theory, 1996, p.37.

[24]Yeld, N. Communicative language testing, 1986, p.36.

[25](1) Corder, S.P. Error analysis and interlanguage, 1981, p.20.

(2) Davies, A. Principles of language testing, 1990, pp.20-21.

[26]Davies, ibid., p.21.

[27]Byrnes, H. and Canale, M. Defining and developing proficiency: Guidelines, implementations and concepts, 1987, p.15.

[28]Spolsky, B. Conditions for second language learning, 1989, p.138.

[29]Ur, P. A course in language teaching: practice and theory, 1996, p.37.

[30]Davies, A. Principles of language testing, 1990, p.1.

[31]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.182.

[32]Lazaraton, A. ‘Qualitative research in applied linguistics: A progress report.’ TESOL Quarterly, 29 (3), 455-471 (1995), p.455.

[33]Magnan, S.S. Review of Creswell, J.W. 1994. Research design: qualitative and quantitative approaches, 1997.

[34]Rushton, J.P. Race, evolution and behaviour, 1995; his frontispiece.

[35]Spolsky, B. ‘The limits of authenticity in language testing.’ Language Testing, 2, 31-40 (1985).

[36]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988).

[37](1) Macdonald, C.A. English language skills evaluation (A final report of the Threshold Project), Report Soling-17, 1990a.

(2) Macdonald, C.A. Crossing the threshold into standard three in black education: The consolidated main report of the Threshold Project, 1990b.

[38](1) Bennett, A. and Slaughter, H. ‘A sociolinguistic/discourse approach to the description of the communicative competence of linguistic minority children’, in Rivera, C. (ed.). The ethnographical/sociolinguistic approach to language proficiency assessment, 1983.

(2) Jacob, E. ‘Studying Puerto Rican children’s informal education at home’, in Rivera, C. (ed.). The ethnographical/sociolinguistic approach to language proficiency assessment, 1983.

(3) Phillips, S. ‘An ethnographic approach to bilingual language proficiency assessment’, in Rivera, C. (ed.). The ethnographical/sociolinguistic approach to language proficiency assessment, 1983.

[39]Harrison, A. ‘Communicative testing: Jam tomorrow?’, in Hughes, A. and Porter, D. (eds.). Current developments in language testing, 1983. p.84.

[40]Morrow, K. ‘Communicative language testing: Revolution or evolution’, in Alderson, J.C. (ed.). Issues in language testing, 1981, p.12.

[41]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.185.

[42]Psychometrist has two meanings: 1. a statistician, and 2. somebody with the paranormal power to find lost objects. I guess the double meaning is not lost on Spolsky.

[43]Spolsky, B. ‘The limits of authenticity in language testing.’ Language Testing, 2, 31-40 (1985), pp.33-34.

[44]Spolsky, B. Measured words, 1995, p.357.

[45]Davies, A. Principles of language testing, 1990, p.16.

[46]Cziko, G.A. ‘Improving the psychometric, criterion-referenced, and practical qualities of integrative testing.’ TESOL Quarterly, 16 (3), 367-379 (1982), pp.27-28.

[47](1) Spolsky, B. ‘The limits of authenticity in language testing.’ Language Testing, 2, 31-40 (1985).

(2) Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.185.

[48]Gamaroff, R. Psychometrics and reductionism in language assessment. Paper presented at the SAAAD/SAARDHE conference “Capacity-building for quality teaching and learning in further and higher education”, University of Bloemfontein, 22-24 September, 1998e.

[49]Magnan, S.S. Review of Creswell, J.W. 1994. Research design: qualitative and quantitative approaches, 1997.

[50]Terre Blanche, M. ‘Crash.’ South African Journal of Psychology, 27 (2), 59-63 (1997), p.61.

[51]Van der Walt, J. The implications for language testing of IBSA [Institutionalised Black South African English]. National Association of Educators of Teachers of English (NAETE) conference “Training teachers for the South African context”, Potchefstroom College of Education, September 17-18, 1998.

[52]Van der Walt’s paper followed immediately after the presentation of Makalela’s (1998) paper “Institutionalized Black South African English” (IBSAE) in which Makalela advocates that IBSAE be adopted as the norm among blacks in South Africa. The examples of IBSAE cited above are those given by Makalela.

[53]Messick, S.Validity, 1987, p.3.

[54]Cziko, G.A. ‘Improving the psychometric, criterion-referenced, and practical   qualities of integrative testing.’ TESOL Quarterly, 16 (3), 367-379 (1982).

[55]Gultig, J., Lubisi, C., Parker, B. and Wedekind, V. Understanding outcomes-based education: Teaching and assessment in South Africa, 1998, p.12.

[56]Rowntree, D. Assessing students: How shall we know them, 1977, p.185.

[57]Davies, A. Principles of language testing, 1990, p.19.

[58]Rea, P. ‘Language testing and the communicative language teaching curriculum’, in Lee, Y.P. et al. New directions in language testing, 1985.

[59]Rowntree, D. Assessing students: How shall we know them, 1977, p.185.

[60]Rea, P. ‘Language testing and the communicative language teaching curriculum’, in Lee, Y.P. et al. New directions in language testing, 1985, p.29.

[61]Bachman, L.F. Fundamental considerations in language testing, 1990b, p.20.

[62]Brown, J.D. ‘Language programme evaluation: A synthesis of existing possibilities’, in Johnson, R.K. (ed.). The second language curriculum, 1989, p.223.

[63]Dreyer, C. Testing: The reason why pupils fail. National Association of Educators of Teachers of English (NAETE) conference “Training teachers for the South African context”, Potchefstroom College of Education, September 17-18, 1998.

[64]Rowntree, D. Assessing students: How shall we know them, 1997, p.4.

[65]Ibid., p.8.

[66]Ibid., p.7.

[67]Ibid.

[68]Rowntree, D. Assessing students: How shall we know them, 1997, p.7.

[69]Gamaroff, R. ‘Language, content and skills in the testing of English for academic purposes.’ South African Journal of Higher Education, 12 (1), 109-116 (1998b).

[70]Rea, P. ‘Language testing and the communicative language teaching curriculum’, in Lee, Y.P. et al. New directions in language testing, 1985, p.29.

[71]Weir, C.J. Understanding and developing language tests, 1993, p.68.

[72]An applied linguist, who had read some of my research, queried whether I was doing applied linguistics research or educational research. “Applied linguistics” in the restricted sense of the term is not directly concerned with predicting academic achievement, but “educational linguistics”, which is closely related to applied linguistics, is. The most important reason for studying academic language proficiency is its connection to academic achievement.

[73]Bachman, L.F. and Palmer, A.S. Language testing in practice, 1996, p.4.

[74]Alderson, J.C. ‘Report of the discussion on general language proficiency’, in Alderson, J.C. and Hughes, A. Issues in language testing: ELT Documents III (The British Council), 1981a, p.190.

[75](1) Brown, J.D. A closer look at cloze: Validity and reliability, 1983.

(2) Hale, G.A., Stansfield, C.W. and Duran, R.P. TESOL Research Report 16, 1984.

(3) Oller, J.W., Jr. ‘Cloze tests of second language proficiency and what they measure.’ Language Learning, 23 (1), 105-118 (1973).

(4) Oller, J.W., Jr. ‘A consensus for the 80s’, in Issues in language testing research, 1983.

(5) Oller, J.W., Jr. ‘Cloze, discourse, and approximations to English’, in Burt, K. and Dulay, H.C. New directions in second language learning, teaching and bilingual education, 1976.

(6) Oller, J.W., Jr. ‘“g”, what is it?’, in Hughes, A. and Porter, D. (eds.). Current developments in language testing, 1983a.

(7) Oller, J.W., Jr. Issues in language testing research, 1983b.

(8) Stubbs, J. and Tucker, G. ‘The cloze test as a measure of English proficiency.’ Modern Language Journal, 58, 239-241 (1974).

[76]Oller, J.W., Jr. and Kahn, F. ‘Is there a global factor of language proficiency?’, in Read, J.A.S. (ed.). Directions in language testing, 1981.

[77]Carroll, J.B. Psychometric theory and language testing, 1983, p.82.

[78]Oller, J.W., Jr. ‘A consensus for the 80s’, in Issues in language testing research, 1983, p.36.

[79] Many scientists, in contrast to many applied linguists, have not given up looking for a Grand Unified Theory (GUT). Many applied linguists would probably say that physics deals with non-living matter whereas language testing deals with human beings. But this, in my view, is no justification for rejecting the search for unifying linguistic principles in humans, at least not if one is interested in linguistic science and not just in linguistic thought.

[80]Bachman, L.F. and Palmer, A.S. Language testing in practice, 1996, p.9.

[81]The DET was the education department in charge of black education up to 1994. It is now defunct.

[82]At a translation committee meeting at the University of Fort Hare  in April 1998, the secretary of the meeting suggested that the term “mother tongue” was sexist.

[83]Educamus. Editorial: Internal promotions, 36 (9), 3 (1990), p.3.

[84]Young, D. ‘A priority in language education: Language across the curriculum in black education’, in Young, D. and Burns, R. (eds.). Education at the crossroads, 1987.

[85]Mcintyre, S.P. ‘Language learning across the curriculum: A possible solution to poor results.’ Popagano, 9 and 10, June (1992), p.10.

[86]Cattell, R.B. Measuring intelligence with culture-fair tests, 1973.

[87]Mitchell, R. and Myles, F. Second language acquisition, 1998.

[88]Pienaar, P. Reading for meaning: A pilot survey of (silent) reading standards in Bophuthatswana, 1984, pp.59 and 61.

[89]Ibid., pp.58 and 62.

[90]Bloor, M., Bloor, T., Forrest, R., Laird, E. and Relton, H. Objective tests in English as a foreign language, 1970, pp.70-77.

[91]Ibid., pp.35-40.

[92]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.181.

[93]Tremblay, R.F. and Gardner, R.C. ‘Expanding the motivation construct of language learning.’ The Modern Language Journal, 79 (4), 505-518 (1995), p.505.

[94]Cziko, G.A. ‘Some problems with empirically-based models of communicative competence.’ Applied Linguistics, 5 (1), 23-37 (1984).

[95]Sternberg, R.J. The psychologist’s companion: A guide to scientific writing for students and researchers, 1993, p.53.
