Ph.D. – Chapter 2: Theoretical Issues in the Testing of Language Proficiency and Academic Achievement

2.1 Introduction

2.2 Ability, cognitive skills and language ability

2.3 Competence and performance

2.4 Proficiency

2.5 The discrete-point/integrative controversy

2.6 Cognitive and Academic Language Proficiency (CALP) and “test language”

2.7 Language proficiency and academic achievement

2.8 Validity

2.8.1 Face validity

2.8.2 Content validity

2.8.3 Construct validity

2.8.4 Criterion validity: concurrent and predictive validity

2.9 Reliability

2.9.1 Approaches to the measurement of reliability

2.10 Ethics of measurement

2.11 Summary of Chapter 2

2.1 Introduction

Whatever we say about language – or about anything – originates from a theory, i.e. a combination of knowledge, beliefs, wants and needs. For some authors[1] the testing of language proficiency has been a circular enterprise. Vollmer[2] maintains that “language proficiency is what language proficiency tests measure” and that this circular statement is all that can firmly be said when one is asked for a definition of language proficiency. Perhaps this is indeed all that can be firmly established about language proficiency, but we swim on in the hope of reaching terra firma.

2.2 Ability, cognitive skills and language ability

I examine the notion of ability first and then discuss language ability. In the next sections I move on to competence and performance, which I shall relate to ability. I then discuss proficiency and the discrete-point/integrative controversy in language testing. As I mentioned in the first paragraph of the study, language testing draws on three areas: the nature of language, assessment, and language ability. The latter is closely related to language proficiency.

The precise definition of ability is not only seldom explicated but, unlike competence, is often not even considered, in spite of the fact that the term is used widely in everyday language as well as in scientific circles. Important issues in the study of abilities, of which language is only one, are:

- The fixity of abilities. If abilities were highly variable over time they would reflect a state rather than a trait or an attribute. The two latter terms imply a fixed structure, or construct, rather than a variable process. The constructs that this study is concerned with belong to the domain of language acquisition. (Construct validity is discussed in section 2.8.3).

- Consistency. For example, if an athlete breaks a world record in a one-off “accident”, but is never able to repeat the performance, or even get near the record again, we still say that he or she has the ability to break a world record. We cannot apply the same logic to cognitive abilities, where consistency of output, not records, is the name of the game. Consistency applies not only to the ability of learners but also to that of teachers, who are usually also testers. The consistency, or the reliability, of judgements and scoring is a major issue in language testing. This issue is dealt with in various parts of the study.

- When we say people have the ability to perform academically we mean that they are able to achieve a certain liminal level, i.e. minimum or threshold level. In trying to set a minimum level, one is concerned with what the individual can do in terms of established criteria. What the individual can do cannot be separated from what others can do. Hence the importance of norm-referenced tests.

- The variability in ability between individuals obeys a “bell-curve” distribution, as in the case of nature as a whole. The “bell-curve” or “normal” distribution  is the foundational principle of psychometrics.[3]
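To make the psychometric assumption behind the last two points concrete, here is a minimal sketch in Python (not part of the original argument; the norm group parameters and the candidate’s score are invented): scores are sampled from a normal distribution, and a norm-referenced interpretation reads an individual’s standing off the norm group.

```python
# Minimal sketch: normally distributed ability scores and a norm-referenced
# percentile rank. All figures are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical norm group: 1000 test takers, scores ~ N(mean=50, sd=10)
norm_group = rng.normal(loc=50.0, scale=10.0, size=1000)

candidate_score = 63.0

# Percentile rank: the proportion of the norm group scoring below the candidate
percentile = (norm_group < candidate_score).mean() * 100
print(f"Candidate stands at roughly the {percentile:.0f}th percentile")
```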

Taking  the four points above into account, Carroll[4] suggests the following definition of ability:

As used to describe an attribute of individuals, ability refers to the possible variations over individuals in the liminal levels of task difficulty (or in derived measurements based on such liminal levels) at which, on any given occasion in which all conditions appear favorable, individuals perform successfully on a defined class of tasks.

Several modern theories of education and psychology reject the notion of traits, i.e. the fixity of psychological constructs[5]. In traditional trait theories, e.g. Carroll (above), psychological constructs are like any other human, animal or plant trait: biological differences between living things are distributed according to a bell curve, and differences in human abilities are distributed in the same way. This does not mean that people cannot improve, but only that the degree of improvement depends on fixed psychobiological constraints.[6]

A few comments on Carroll’s idea that ability is a fixed psychological trait are in order. Ability is closely connected to the notion of proficiency (to be discussed shortly), and proficiency is certainly something that can be improved by using the correct strategies to enhance learning. The notions of “transferable” and “transferring skills” are used below to explain the idea of “fixed” ability.

A major problem with learners with limited academic ability is the underdevelopment of “transfer skills”[7]. There are two kinds of transfer skills: (1) lower order “transferable skills” and (2) higher order “transferring skills”[8]. Transferable skills are skills that are learnt in one situation or one kind of subject-matter and are transferable to another. Examples are: (i) a reading skill such as scanning that is learnt in the English class and can be transferred to the geography class, (ii) using a dictionary, (iii) making charts and diagrams, (iv) completing assignments, (v) reviewing course material, (vi) learning formulas and dates, (vii) memorising material. “Transferring skills” are “metacompetences” of a far higher order. These metacompetences are: (i) a sensitive and intelligent discernment of similarities and differences, (ii) cognitive equipment that one uses to modify, adapt and extend, and (iii) attitudes and dispositions that support both of the above.[9]

These three “metacompetences” are interrelated. For example, without the “cognitive equipment” that enables one to modify, adapt and extend, it would not be possible to sensitively and intelligently discern similarities and differences. With regard to Bridges’ third “metacompetence” of “attitudes and dispositions”, which has to do with intention, motivation and the resulting approach to a task, I suggest that its successful development is to a large degree dependent on the successful development of the other two “metacompetences”. If one has the right cognitive equipment, as well as the desire and opportunity to develop it, one will understand more; consequently, one will be more motivated to learn. Of course, socialisation into a community of learners and the correct mediation/intervention procedures between learner and task also play an important role in cognitive development, e.g. the development of critical awareness and learning strategies.

Bridges’[10] distinction between lower order “transferable skills” and higher order “transferring skills” is useful in understanding the nature of the problem of transfer. The problem of transfer refers mostly to the higher order “transferring skills”. The question is whether higher order cognitive skills (i.e. Bridges’ “transferring skills”) can be acquired at all (whether independently or through teaching). Millar[11] maintains that courses in skills development (e.g. the development of executive processes) pursue the “impossible” because processes such as classifying and hypothesising cannot be taught, but can only develop (i.e. they are part of inborn potential, or ability). For Millar the challenge is to find ways of “motivating pupils to feel that it is personally valuable and worthwhile to pursue the cognitive skills (or processes) they [children] already possess to gain understanding of the scientific concepts which can help them make sense of their world”[12] (square brackets and italics added).

According to Millar, these cognitive skills, especially the higher order transferring skills (e.g. a sensitive and intelligent discernment of similarities and differences), can only be developed if they are based on something that learners already possess, namely, academic potential, or ability. I have raised some highly controversial issues, but they needed to be raised in order to explain what I meant by “fixed” ability. I cannot pursue the matter further here.[13]

I now discuss how the term ability is used in relation to language. In section 1.1, I mentioned the four major test uses: achievement, proficiency, aptitude and diagnosis. These are all manifestations of what Davies calls “language ability”.[14]

I mentioned above that “fixed” ability in Carroll’s sense does not mean that people cannot develop and become better. If this were not so, it would be nonsensical to talk about things such as transitional competence and interlanguage, which feature, justifiably, so strongly in the applied linguistic literature. In the next section I discuss the notion of competence and its sibling, performance.

2.3  Competence and performance

The notions of competence and performance are essential to understanding language assessment.

For Chomsky[15] competence is the capacity to generate an infinite number of sentences from a limited set of grammatical rules. This view posits that competence is logically prior to performance and is therefore the generative basis for further learning.[16] Competence, on this view, is equivalent to “linguistic” (or “grammatical”) competence. Chomsky distinguishes between “performance”, which is “the actual use of language in concrete situations”, and “competence” or “linguistic competence” or “grammatical competence”, which is “the speaker-hearer’s knowledge of his language”.[17] Chomsky’s description of language involves no “explicit reference to the way in which this instrument is put to use…this formal study of language as an instrument may be expected to provide insight into the actual use of language, i.e. into the process of understanding sentences.”[18] Chomsky’s great contribution was to focus on linguistic introspection, without giving introspection (linguistic intuitions) the final word.[19]

Canale and Swain[20] make a distinction between knowledge of use and a demonstration of this knowledge. Knowledge of use is often referred to in the literature as “communicative competence”[21], and the demonstration of this knowledge as performance. Communicative competence has come to subsume four sub-competences: grammatical competence, sociolinguistic competence, discourse competence and strategic competence[22]:

(1) Grammatical competence is concerned with components of the language code at the sentence level, e.g. vocabulary and word formation.

(2) Sociolinguistic competence is concerned with contextual components such as topic, status of interlocutors, purposes of communication, and appropriateness of meaning and form.

(3) Discourse competence is concerned with: (i) a knowledge of text forms, semantic relations and an organised knowledge of the world; (ii) cohesion – structural links to create meaning, and (iii) coherence – links between different meanings in a text; literal and social meanings, and communicative functions.

(4) Strategic competence is concerned with (i) improving the effectiveness of communication, and (ii) compensating for breakdowns in communication. Strategic competence means something very different in Bachman and Palmer, namely, metacognitive strategies, which are central to communication. For these authors “language ability” consists of “language knowledge” and “metacognitive strategies”.[23] (See Skehan[24]).

According to Widdowson, communicative competence should subsume the notion of performance:

[T]he idea of communicative competence arises from a dissatisfaction with the Chomskyan distinction  between competence and performance and essentially seeks to establish competence status for aspects of language behaviour which were indiscriminately collected into the performance category.[25]

How does ability fit into the competence-performance distinction? Chomsky equates “ability” with “performance” (“actual use”), which he regards as a completely different notion from “competence” or “knowledge”.

Characteristically, two people who share the same knowledge will be inclined to say quite different things on different occasions. Hence it is hard to see how knowledge can be identified with ability…Furthermore, ability can improve with no change in knowledge.[26]

Thus, as Haussmann points out, “it should be noted that Chomsky’s original definition of the term [i.e. competence] always excluded this idea [i.e. ability].”[27] There doesn’t seem to be any reason, however, why “ability” cannot refer to (linguistic/grammatical) competence (which is Chomsky’s interest), as well as to the knowledge one has of how to use the language in appropriate situations. We can retain performance to mean the actual use of this knowledge. For example, Bachman and Clark use the term “ability” in the following way:

We will use the term “ability” to refer both to the knowledge, or competence, involved in language use and to the skill in implementing that knowledge, and the term “language use” to refer to both productive and receptive performance.[28]

Weir equates “ability” with “competence” as well:

There is a potential problem with terminology in some recent communicative approaches to language testing. References are often made in the literature to testing communicative ‘performance’ [e.g. B.J. Carroll 1980[29]]. It seems reasonable to talk of testing performance if the reference is to an individual’s performance in one isolated situation, but as soon as we wish to generalise about ability  to handle other situations, ‘competence’  would seem to be involved.[30] (Square brackets added)

This is Skehan’s position as well: “it is defensible to speak of competence-orientated abilities.”[31]

In other words, different performances point back to the underlying  competence or ability.

“Competency-based education and training” (CBET) has a different set of concepts for the labels of competence, performance and ability to those discussed above. CBET is discussed in section 6.4, where the future of assessment in South Africa is dealt with.

2.4  Proficiency

Proficiency is closely related to ability, competence and performance discussed above. Proficiency is used in at least two different ways: it can refer to  (1) the “construct or competence level”[32], which is at a given point in time independent of a specific textbook or pedagogical method[33] or to (2) the “performance level”[34], which is a reflection of achievement in the test situation. The construct level or competence level is the knowledge of the language, and the performance level is the use of language.

Proficiency, like the notions of competence and performance, is very much a “chameleon” notion[35], because it can be defined not only in terms of knowledge (the construct or competence level) and in terms of specific tasks or functions (the performance level), but also in terms of degrees of behaviour that are observed at different stages (minimum to native-like[36]), in terms of language development (e.g. interlanguage studies), in terms of situations that require some skills but not others, or in terms of general proficiency, where no specific skill is specified.

Porter[37] uses the term  “communicative proficiency”, which seems to subsume the notions of  “communicative competence” and “performance” discussed above. According to Child[38], “proficiency” is a “general `across-the-board’ potential”, while “performance” is the “actualised skill”, the “mission performance” involved in “communicative” tasks, i.e. the output. Child has much in common with Alderson and  Clapham[39], who distinguish between “language proficiency” and “language use”, where proficiency, not use, is part of output.

I would like to spend some time on Lantolf and Frawley’s[40] views on language proficiency because they epitomise the opposition to the view that I am arguing for in this study. These authors will be referred to as L and F. In their abstract they state that they “argue against a definitional approach to oral proficiency and in favor of a principled approach based on sound theoretical considerations.” (The L and F reference throughout is their 1988 article.) The authors use oral proficiency as a backdrop to their views on language proficiency in general. L and F, in their criticism of “reductionism” in the assessment of language proficiency, leave few authors unscathed, including authors that many would consider to be in the vanguard of the real-life/communicative movement, e.g. Hymes, Omaggio and Widdowson.

To adumbrate: in the second section of their article, “The tail wagging the dog”, L and F use the section of Omaggio’s[41] manual entitled “Defining language proficiency” to lament that the “construct of proficiency, reified in the form of the [American Council on the Teaching of Foreign Languages – ACTFL] Guidelines, has begun to determine how the linguistic performance of real people must be perceived”:

In her discussion, she considers various models of communicative competence, including those of Hymes, Munby, Widdowson, and Canale and Swain, all of which are reductionist approaches to communicative competence, because they define communicative competence by reference to a set of constitutional criteria. She then proceeds to a subsection entitled “From Communicative competence to Proficiency.” However, nowhere in her analysis is there any in-depth consideration of proficiency that is independent of the proficiency test itself.[42]

Strange that L and F consider Widdowson[43] a reductionist, who, I would think, fully appreciates the distinction between language structure and language in use, where grammar plays a vital role. By “grammar”, Widdowson does not mean merely morphology, phonology and syntax but lexico-grammar, where semantics is included under “grammar”. (The inclusion of semantics under “grammar”, or “linguistic knowledge”, is what modern linguistics understands by these terms.) The papers of the Georgetown University Round Table Conference[44] were concerned with the reality and authenticity of communicative language proficiency, where Widdowson argued that grammar is not dead, but the lifeblood of language, communication and social meaning. Such a view is not reductionist!

L and F reject the ACTFL’s adoption of a uniform yardstick for the measurement of foreign language ability based on real-life behaviour[45]. The ACTFL’s tail (the series of real-life descriptors) that is wagging the real dog is not, according to L and F, a real tail. The unreal tail for L and F is the unreal “construct”; the real dog being wagged is real people. The metaphor is clear: it is researchers who have fabricated the “construct”, and fabrications have no psychological reality. In other words the construct constricts the reality of “the nontest world of human interaction”[46]. The test world, which represents the “construct” for these authors, “has come to determine the world, the reverse of proper scientific methodology”.[47]

Recall that L and F are arguing in “favor of a principled approach based on sound theoretical considerations” (italics added), an approach which L and F seem to think authors such as Widdowson do not use. Yet Widdowson (who was probably not unaware of L and F’s criticism) ends his “Aspects of language teaching” with the following: “There needs to be a continuing process of principled pragmatic enquiry. I offer this book as a contribution to this process – and as such, it can have no conclusion”[48] (italics added). (See Gamaroff 1996a[49]).

Widdowson perceives the content of both the structural and the notional syllabus to be, in Nunan’s words, “synthetic” and “product-orientated”[50], i.e. the content of both syllabuses is static and lacks the power to consistently generate communicative behaviour. Widdowson’s argument against structuralist and notional syllabuses is that “[i]t has been generally assumed…that performance is a projection of competence…that once the rules are specified we automatically account for how people use language.”[51] His argument is that structural and functional-notional syllabuses do not link past experiences with new experiences, because they lack proper learner involvement[52]. Widdowson also believes that “the most effective means towards this achievement [i.e. "complete native-speaker mastery"] is through an experience of authentic language in the classroom.”[53]

L and F, and Widdowson are backing the same communicative horse. The main difference between them appears to lie in the value they place on school learning. All three believe in teaching language as communication, with the difference that much of Widdowson’s work is concerned with academic achievement and school learning rather than with real-life “natural” contexts.

I now examine more closely the cogency of the distinction between “natural” contexts in “real-life” and “unnatural” contexts in the classroom. According to L and F “tasks cannot be authentic by definition”[54], which implies that very little in school is authentic, i.e. natural. The nub of L and F’s criticism is that the exchange between tester and test-taker is not a natural one, and that therefore no test can be a natural kind of communication. Communicative testing, it seems, would be for L and F a contradiction in terms. What is more, communicative school “tasks” would also be a contradiction in terms. In that case, school, which may be defined as an institution whose role it is to guide learners by defining and dispensing tasks, is another tail wagging the world (of “reality”). The ACTFL Guidelines, according to L and F, draw a line between the world and the individual. L and F regard such a situation as scientifically unprincipled and morally untenable. There is very little in “tasks” such as instructional activities, and nothing in tasks such as tests, that L and F find authentic in the proficiency literature. L and F want language tasks to be contextualised in natural settings such as cooking clubs.

Byrnes and Canale caution that the danger of “the proficiency movement” as espoused by L and F and others is that, as with any movement, “a rhetoric of fear and enthusiasm will develop which is more likely to misrepresent and confuse than to clarify the crucial issues.”[55] One confusing issue is that of the “natural environment”.

“Just what is a ‘natural environment’ as far as learning or acquiring a second language under any circumstances is concerned?” asks Morrissey[56]:

There is no environment, natural or unnatural, that is comparable with the environment in which one learns one’s mother tongue. Furthermore, it seems to me that there is a teaching (i.e. unnatural?) element in any L2-L1 contact situation, not just in cases of formal instruction. This element, even if it only consists in the awareness of the communicants that the [teaching or testing] situation exists, may be a more significant factor in L2 learning  and L2 acquisition [and L2 testing] than any other factor that is common to [the natural setting of]  L1 acquisition and L2 acquisition.[57]

L and F are seeking a testing situation analogous to the L2 “acquisition” situation (which by Krashen’s[58] definition is “natural”). But as Morrissey above suggests, much of language and learning, like culture, consists of extrapsychological elements, and in this sense is an “imposition” upon nature. However, although the test situation, i.e. school, may be less “authentic” in the sense that the test is more concerned with learning language than with using it, with regard to the laws of learning the dichotomy between “natural” and “unnatural” is a spurious one. It is incorrect to assume that “natural” approaches (e.g. Krashen and Terrell[59]) and immersion programmes mirror natural language acquisition and that the ordinary classroom doesn’t.[60] The swimming club, cooking club, tea party or cocktail party are in a sense neither more nor less natural than the traditional classroom. That is why one doesn’t have to go outside the classroom in search of “real reality”[61]. The learning brain needs stimulation, and it can get it in the classroom or at an informal (or formal) cocktail party. In other words, there is much informal learning in classrooms and much formal learning outside the classroom. But both kinds of learning are completely “natural” to the brain that is doing the learning.

2.5  The discrete-point/integrative controversy

L and F point out that in real life one uses far fewer words than one would use in school “tasks”, and this is one of the reasons why they maintain that tests are inauthentic by definition.[62] However, as Politzer and McGroarty[63] show, it is possible to say or write few words (as one often does in natural settings) in a “communicative competence” test by using a “discrete-point” format. When one uses far fewer words in natural settings than one would use in many “artificial” school tasks, one is in fact using a “discrete-point” approach to communication. One doesn’t merely look at the format of a test to decide whether it is a “discrete-point” test; one looks at what the test is testing.

It is now opportune to examine what tests are testing. The way I have chosen to do so is by means of an examination of  the discrete-point/integrative controversy.

This controversy can only be fully understood within the context of a parallel controversy: the structuralism/functionalism controversy. The discussion of the latter will serve as a background to the discrete-point/integrative controversy.

It is impossible to test the structures and functions of language without understanding how language is learnt. Language learning is language processing. The central issue in testing is assessing this language processing skill. Language processing, as with all knowledge, exists within a hierarchical organisation: from the lower level atomistic “bits” to the higher level discoursal “bytes”. The lower level bits traditionally belong to the “structuralist” levels, while the higher level bytes belong to the “functionalist” levels. It is difficult to know where structure ends and function begins.[64]

The following continuum, adapted from Rea[65], includes the concepts of competence and performance discussed earlier.

TABLE 2.1

Functionalist and Structuralist Levels of language

This mutually exclusive classification of functionalism and structuralism is a highly controversial one. The structuralist/functionalist controversy is about whether the semantic meaning of words and sentences (structuralism) can be distinguished from the  pragmatic (encyclopaedic, i.e. world knowledge) meaning of  discourse (functionalism).

Halliday[66] proposes two meanings of the term function, namely,  “functions in structure” and “functions of language”. “Functions in structure” is concerned with the relationship between different words of a sentence. Structuralism is traditionally associated with the study of language at the sentence level and below. “Functions of language”, on the other hand, goes beyond individual linguistic elements or words (Saussure’s “signs”) to discourse. Functionalism is traditionally associated with the study of discourse.

I understand the terms linguistic knowledge, lexico-grammar (which is what recent modern linguistics understands by grammar) and Halliday’s “functions in structure” to be synonymous. Therefore, lexico-grammar only deals with linguistic knowledge at the sentence level and below that level. Halliday’s “functions of language”, what I call functionalism, deals with discourse, i.e. the intersentential domain. (The division of language into a sentence level and an intersentential level is itself problematic.)

Functionalism rejects the Chomskyan idea that grammar is logically and psychologically the origin of “functions in language” (Halliday above). For functionalists like Halliday[67], the grammar of a specific language is merely “the linguistic device for hooking up the selections in meaning which are derived from the various functions of language”.

In functionalism it is communication that is claimed to be logically and psychologically prior to grammar. Givon[68], for whom the supreme function of language is communication, criticises Chomsky for trying to describe language without referring to its communicative function. Givon argues: “If language is an instrument of communication, then it is bizarre to try and understand its structure without reference to communicative setting and communicative function.”[69] Rutherford, whose view is similar to Givon’s communicative view, rejects the “mechanistic” view that grammatical structure (Givon’s “syntax”) is logically or psychologically prior to communication.[70] Rutherford sees language as a dynamic process and not a static “accumulation of entities”.[71]

In this regard Spolsky[72] suggests that the “microlevel” is “in essence” the “working level of language, for items are added one at a time”, keeping in mind that “any new item added may lead to a reorganisation of the existing system, and that items learnt contribute in crucial, but difficult to define ways to the development of functional and general proficiency.” Thus, according to Spolsky, building up the language from the microlevel to the macrolevel need not be a static “accumulation of entities” (Rutherford above), but may lead to a dynamic “reorganisation of the existing system” (Spolsky above). Alderson’s view is similar to Spolsky’s:

Another charge levelled against (unidentified) traditional testing is that it views learning as a ‘process of accretion’. Now, if this were true, one would probably wish to condemn such an aberration, but is it? Does it follow from an atomistic approach to language that one views the process of language as an accretion? This does not necessarily follow from the notion that the product of  language learning is a series of items (among other things). (Original emphasis).[73]

The process and product methodologies “are too often perceived as generally separate”, i.e. they suffer from an “oppositional fallacy”.[74] The product is considered to be discrete and static and, accordingly, not party to language processing, while the process is considered to be integrative and dynamic and, accordingly, as belonging to language processing. (More about this in section 6.4.) It is this oppositional fallacy that is the battleground of the discrete-point/integrative controversy.

It is widely believed that tests such as essay tests test the “use” of language, i.e. authentic communicative language, while tests such as error recognition tests and grammar accuracy tests test the “usage” of language[75], i.e. the elements of language. Such a distinction between the two kinds of tests, which Farhady describes as the “disjunctive fallacy”[76], is an oversimplification.

Many studies report high correlations between “discrete-point tests” and “integrative tests”.[77] It may be asked how the construct is able to account for this: “Shouldn’t supposedly similar types of tests relate more to each other than to supposedly different types of  tests?”  An adequate response presupposes three further questions: (1) “What are similar/different types of tests?” (2) Wouldn’t it be more correct to  speak of so-called discrete-point tests and so-called integrative tests?  (3) Isn’t the  discrete/integrative dichotomy irrelevant to what any test is measuring?
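The correlational evidence at issue can be pictured with a minimal sketch (the scores below are invented, not drawn from the studies cited): Pearson’s r between scores on a so-called discrete-point test and a so-called integrative test is the standard index of how strongly the two test types “relate to each other”.

```python
# Minimal sketch: Pearson correlation between two test types. Scores invented.
import numpy as np

# Hypothetical scores for ten learners on each kind of test
discrete_point = np.array([55, 62, 48, 71, 66, 59, 80, 45, 69, 74], dtype=float)
integrative = np.array([58, 65, 50, 75, 63, 61, 78, 47, 72, 70], dtype=float)

r = np.corrcoef(discrete_point, integrative)[0, 1]
print(f"Pearson r = {r:.2f}")  # a high r suggests the two tests tap shared ability
```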

I consider some of the issues in the discrete-point/integrative controversy that are related to the questions posed above. The notion of “real-life” tests is also critically examined.

The terms “integrative” and “discrete-point” have fallen out of favour with some applied linguists, while for others these terms are still in vogue. For example, Fotos equates “integrative” skills with “advanced skills and global proficiency”[78], which he contrasts with Alderson’s “basic skills”[79]. These “basic skills” are Alderson’s “low order” skills[80]. Alderson prefers to distinguish between “low order” and “higher order” tests rather than between “discrete-point” and “integrative” tests[81]; already in 1979[82] he refrained from talking about discrete-point and integrative tests, preferring to talk of “low order” and “higher order” tests. Yet, in his later collaborative textbook on testing, one of the book’s test specifications is that “tasks” should be “discrete point, integrative, simulated ‘authentic’, objectively assessable”.[83] These test specifications would dovetail with the notion that although these tests do not mirror life, they are nevertheless “good dirty methods [of testing] overall proficiency”.[84]

Whatever one’s classification, all tests, except for the most atomistic of tests, reside along a continuum of “integrativeness”.[85] For example,  consider  two items from Rea[86]:

1. How —— milk have you got?

(a) a lot (b) much of (c) much (d) many

2. —— to Tanzania in April, but I’m not sure.

(a) I’ll come (b) I’m coming (c) I’m going to come (d) I may come.

Item 1 is testing a discrete element of grammar. All that is required is an understanding of the “collocational constraints of well-formedness”[87], i.e. to answer the question it is sufficient to know that “milk” is a mass noun (see also Canale and Swain[88]). Item 2 relates form to global meaning. Therefore, all parts of the sentence must be taken into account, which makes it an integrative task. To use Rea’s terminology, her item 1 is testing “non-communicative performance”, while her item 2 is testing what she calls “communicative performance”.[89] Other discrete-point, or “low order”, items could be shown to be more integrative, or “higher order”, than the items described above. (The terms in inverted commas are Alderson’s [1979][90].)

Above I described an “objective” type of test. Consider now a test that lies toward the “pragmatic” extreme of the integrative continuum: the cloze test. Although cloze answers are short, usually a single word, the cloze test can still be regarded as an “integrative” test. A distinction needs to be made between integrative and discrete formats on the one hand, and integrative and discrete processing strategies on the other. The salient issue in a cloze test, or in any test, is not the length of the answer or the length of the question, but whether the test measures what it is supposed to measure, in this case integrative processing strategies. One should distinguish between the structure of the test – long answer, short answer, multiple choice – and what one is measuring. One is measuring the natural ability to process language, and one component of this ability is the behaviour of supplying missing linguistic data in a discourse.
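The distinction between a discrete-looking format and integrative processing can be made concrete with a minimal sketch of the classical fixed-ratio cloze procedure (the passage and the deletion rate of every seventh word are illustrative assumptions): each blank is short, but restoring it requires processing the surrounding discourse.

```python
# Minimal sketch of fixed-ratio cloze construction: delete every nth word.
def make_cloze(text: str, n: int = 7, blank: str = "____"):
    """Replace every nth word with a blank; return the test text and answer key."""
    words = text.split()
    key = {}
    for i in range(n - 1, len(words), n):
        key[i] = words[i]          # remember the deleted word
        words[i] = blank
    return " ".join(words), key

passage = ("The salient issue in a cloze test is not the length of the answer "
           "but whether the test measures integrative processing strategies.")
test_text, answers = make_cloze(passage, n=7)
print(test_text)
print(answers)
```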

According to the “pop”[91] view, it is only in language use that natural language processing can take place. Although the “pop” view may not conflict with the idea of a continuum of integrativeness, such a view would nevertheless hold that language tests should only test language “use”, i.e. direct language, or authentic language. For language “naturalists” the only authentic tests are those presented in a direct real-life situation. Spolsky[92] maintains that “authenticity of task is generally submerged by the greater attention given to psychometric criteria of validity and reliability”, where face validity receives no more than “lip service”. For Spolsky and others[93], authenticity is closely related to communicative language, i.e. to direct language. Authentic tests for Spolsky would be “direct” tests in contradistinction to “indirect” tests. Owing to the lack of clarity on the relationship between a global skill like composition writing and the components of composition writing, e.g. vocabulary, punctuation and grammar, Hughes recommends that it is best, in terms of one’s present knowledge, to try to be as comprehensive as possible, and the best way to do this would be to use direct tests.

“Direct” testers argue that in language use we do not process language in a multiple choice way as in the case of discrete-point tests. Yet, many multiple choice tests do test processing strategies, e.g. the making of predictions. Furthermore, multiple choice tests are neutral in themselves, i.e. they can serve any purpose; communicative or non-communicative. Rea[94] gives the following reasons why indirect tests should be used:

1. There is no such thing as a pure direct test.

2. Direct tests are too expensive and involve too much administration.

3.  Direct tests only sample a restricted portion of the language, which makes valid inferences difficult. (Of course, no battery of tests can sample the whole language. Rea’s point seems to be that indirect tests are able to be much more representative than direct tests).

If it could be shown that indirect test performance is a valid predictor of direct performance, this would be the best reason for using indirect tests. Even if indirect performance is accepted to be a valid predictor of direct “natural” performance, one may object that indirect tests are unnatural and consequently lack face validity, as mentioned earlier.

The laws of learning and testing apply to all contexts, “naturalistic”[95] and otherwise. One can have authentic indirect tests, because tests are authentic activity types in their own right[96]. The quality of learning outcomes depends, of course, on the quality of input – and more importantly on the quality of intake.

There is a sense, though, in which “real-life” “authentic” tasks in the classroom, if not a contradiction in terms, are not possible: learners are aware that life in the classroom is a preparation for, and simulation of, life outside the classroom – an understanding of life which comprises not only life skills but content knowledge in specific disciplines and an understanding of their relationship. But this “preparation for life” view of the classroom does not justify, I suggest, the radical rupture between “real-life” and the classroom described by Lantolf and Frawley (see end of section 2.4). Tritely, life is one big classroom; and less tritely, the classroom is one small part of life. This does not mean that the classroom has to be turned into a cooking club or a cocktail lounge to get learners to respond authentically to a recipe or to something “stronger” – for instance, a test.

If by some good fortune we come of age in our understanding of what an “authentic” task is (and, accordingly, isn’t), it still doesn’t follow that it is necessary to do “authentic” tasks in order to prove that we are proficient to do them, because communicative tasks can be tested successfully through indirect tests[97]. For example, an eye test doesn’t directly, or “holistically”, measure whether someone can see the illuminated road clearly, but it’s a jolly good predictor of whether one will be able to see, if not avoid, that oncoming road-hog on that same illuminated road.

In sum, both direct and indirect tests – as in all direct and indirect classroom activities – have communicative, or real-life, language as their aim. The difference lies in this: direct tests, or outcomes, or activities are based on the view that communicative language should be directly taught and tested, while indirect tests are based on the view that indirect teaching materials and tests are a prerequisite and solid basis for ultimate real-life language. But I would go even further and agree with Widdowson that “semantic meaning is primary” (Chomsky’s dated! “linguistic competence”), where semantic meaning should (naturally, i.e. obviously) be internalised to provide for “communicative capacity”[98]; which is the same idea as Spolsky’s[99] (mentioned above), where the building up of language from the microlevel to the macrolevel may be a dynamic and not necessarily a static “accumulation of entities” (Rutherford[100] above), which in turn leads to a dynamic “reorganisation of the existing system”.[101] [102]

2.6 Cognitive and Academic Language Proficiency and “test language”

Cognitive and Academic Language Proficiency (CALP) is closely related to the ability to do tests. Its features are better understood when compared and contrasted with Basic Interpersonal and Communicative Skills (BICS).[103] BICS refers to salient basic features such as fluency (speed of delivery) and accent, and not to advanced social and communicative skills, which are cognitively demanding. For example, the skills of persuading or negotiating in face-to-face communication require much more cognitive involvement than a BICS task, and are therefore cognitively demanding CALP tasks. Thus, it would be incorrect to equate BICS with all face-to-face communication, because face-to-face communication may involve informal as well as formal speaking. Formal speech acts such as persuading and negotiating belong to advanced communicative skills, and are consequently part of CALP. Spoken language can be just as complex as written language. They differ in that speech is dynamic while writing is synoptic, and writing is lexically denser than speech: “written language does not have to be immediately apprehended in flight and does not need to be designed to counter the limitations of processing capacity”.[104]

Cummins’s BICS and CALP have affinities with Bernstein’s “restricted code” and “elaborated code”, respectively[105]. The “elaborated code” has the following features: precise verbalisations, large vocabulary, complex syntax, unpredictability, low redundancy, and individual differences between speakers; in contrast, the “restricted code” has the following features: loose verbalisations, limited vocabulary, simple syntax, and high redundancy, where assumptions are based on shared social experience.

Wald makes a distinction between “test language” (spoken and written), which he equates with CALP, and “spontaneous language”/”face-to-face” communication.[106] For Wald, test skills are CALP skills, which can involve all four language skills: listening, speaking, reading and writing. For example, an oral cloze test would be a CALP task.  In terms of these distinctions, it would be possible to have tests of basic language (grammar tests). Basic language tests  involve CALP because, according to Wald, they are tests. All tests are formal, no matter how “natural” one tries to make them. In terms of Wald’s definition of CALP as test language, the tests in this study are CALP tasks because they are tests. Accordingly, if  Wald is correct, and I think he is,  one could not have a BICS test.

Ur uses the term “informal”[107] not in the sense of natural, but in the sense that test takers are not told in advance what they need to know for a test. One could, for example, spontaneously test learners on their homework. On such an interpretation of “informal”, it follows that there can be “informal” tests (“informal” CALP tasks).

2.7  Language proficiency and academic achievement

Much of the research in second language acquisition involves finding factors which affect language proficiency. In such a research scheme, factors such as intelligence, motivation, mother-tongue interference and socio-economic standing are defined as the independent variables and language proficiency is defined as the dependent variable. Language proficiency in such a research context does not look beyond itself to its effect on academic achievement.

In the investigation of academic achievement the focus changes from considering language proficiency as the dependent variable (the criterion) to considering academic achievement as the dependent variable, as in Saville-Troike[108].

Consider the following schema. (The schema is highly simplified and is merely meant to present some of the predictor variables that could be involved in academic achievement and is therefore not a comprehensive “model”):

FIGURE 2.1

Second Language Proficiency as a Criterion Variable and as a Predictor Variable

FIRST FOCUS – Second language proficiency as the criterion

Predictor variables:

- Intelligence
- Motivation (active participation)
- Mother-tongue interference
- Socio-economic standing
- Personality (e.g. emotional maturity)

Criterion variable: Second language proficiency

SECOND FOCUS – Academic achievement as the criterion

Predictor variables:

- Intelligence
- Motivation
- Mother-tongue interference
- Socio-economic standing
- Second language proficiency
- Subject learning

Criterion variable: Academic achievement

One needs to know how language proficiency, which is embedded in other factors, promotes or hinders academic achievement. These other factors comprise a complex network of variables such as intelligence, learning processes and styles, organisational skills and content knowledge, teaching methods, motivation and cultural factors. Owing to the complexity of the interaction between these variables, it is often difficult, perhaps impossible, to isolate them from language proficiency, which means that any one or any combination of these above-mentioned variables might be the cause of academic failure. Therefore, care must be taken not to make spurious causal links between any of these variables and academic failure. Although prediction does not necessarily imply causation, this does not mean that prediction should be ignored; on the contrary, prediction plays a very important role in the selection and placement of candidates. What is important is that these predictions be valid. (I stress that this study focuses mainly on those causes of academic failure that are related to the testing situation, e.g. rater unreliability.)
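Statistically, the “second focus” of Figure 2.1 amounts to something like the following minimal sketch (all data and coefficients are fabricated and the variable names are illustrative): academic achievement is regressed on several predictors at once, with second language proficiency only one predictor among others.

```python
# Minimal sketch: ordinary least squares with achievement as the criterion.
import numpy as np

rng = np.random.default_rng(seed=3)
n = 100
proficiency = rng.normal(50, 10, n)        # second language proficiency
motivation = rng.normal(0, 1, n)
ses = rng.normal(0, 1, n)                  # socio-economic standing
achievement = 0.6 * proficiency + 4 * motivation + 2 * ses + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), proficiency, motivation, ses])
coef, *_ = np.linalg.lstsq(X, achievement, rcond=None)
print("intercept, proficiency, motivation, SES weights:", np.round(coef, 2))
# In real data these predictors are correlated, which is why isolating the
# contribution of proficiency is so difficult - the point made in the text.
```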

Although the distinction between the two kinds of focus (Figure 2.1) can be useful, the change from one focus to the other does not merely involve rejuggling the variables that were previously used to predict language proficiency (the first focus) and then assigning them to the new game of predicting academic achievement (the second focus), where language proficiency, previously a criterion variable, would then become another predictor variable among those that were previously used to predict it. The reason is that when academic achievement is brought into the foreground, the predictive mechanism becomes far more complex. One cannot merely shift variables around, because in the second focus, learning in or through a second language is added to the demands of learning the second language itself.

Upshur[109] distinguishes between two distinct general questions: “Does somebody have proficiency?” and “Is somebody proficient?”  The first question considers such issues as grammatical competence and the use of language in discourse. The second question is concerned with the ability of language proficiency tests to predict future performance in tasks that require language skills, i.e. with the “prerequisites for a particular job or course of study”[110]. It is this second question,  namely, “Is somebody proficient?” (to do a particular task) that is the main concern of educationists.

2.8  Validity

Validity is concerned with “the purposes of a test”[111], i.e. basically with the meaning of scores and the ways they are used to make decisions. A major difficulty in this regard is ensuring that one’s descriptions of validity are validly constituted, which involves reconciling “objective” reality with one’s own interpretation of “objective” reality – a daunting and probably circular task.

For some researchers, validity comprises face validity, content validity, construct validity and criterion validities (concurrent and predictive validity), whereas for others, especially those belonging to the American Psychological Association[112] (APA), construct validity itself is validity. Face validity does not feature in the APA’s definitions of validity. The reason for this is explained in the next section.

2.8.1  Face validity

Face validity is concerned with what people (which includes test analysts and lay people) believe must be done in a test, i.e. what the test looks like it is supposed to be doing.

For Clark[113], face validity, oddly, covers the “whole business” of tests, i.e. looking “at what it’s got in it, at the way it is administered, at the way it’s scored.” Clark’s definition is unusual, because it covers everything to do with testing. Clark’s meaning of face validity is not what a test looks like to the non-tester but what it is to the tester, who should know from what it looks like what it is, i.e. “what it’s got in it”.

Spolsky’s meaning of face validity has affinities with Clark’s. Spolsky equates face validity with “authenticity”: “authenticity of task is generally submerged by the greater attention given to psychometric criteria of validity and reliability”, where “face validity receives no more than lip service”.[114]

For Davies “face validity is desirable but not necessary, cosmetic but useful because it helps convince the public that the test is valid.”[115] The reason why face validity is desirable, according to Davies, is that, in spite of its “cosmetic” nature, it can still have a “major and creative influence for change and development”[116]. Yeld maintains that face validity should be capitalised on as a point of entry into testing for those “who have not been trained in the use of techniques of statistical analysis and are suspicious of what they perceive as ‘number-crunching’.”[117]

Thus, face validity (what Stevenson calls “pop” validity) is so popular today because many language teachers have a poor knowledge of language testing and educational measurement, i.e. they are “metrically naive”[118]. Accordingly, they could remain satisfied with superficial impressions.

There are others who reject face validity altogether, because it relies too much on the subjective judgement of the observer[119]:

Adopting a test just because it appears reasonable is bad practice; many a `good-looking’ test has failed as a predictor… If one must choose between a test with `face validity’ and no technically verified validity and one with technical validity and no appeal to the layman, he had better choose the latter.[120]

Gardner and Tremblay[121] consider face validity to be the lowest form of validity, one that should accordingly not generally be recommended as a research strategy. The difficulty with face validity in its usual connotation of what a test appears to be is that the prettier the package, the worse the inherent quality of the tests may be. No matter what one’s opinion of face validity, it does have the following useful features: it increases a learner’s motivation to study for the test; it keeps sponsors happy; and it sustains the parents’ resolve to pay the ever-escalating school fees.

2.8.2 Content validity

Face validity and content validity can overlap, because what must be done in a test involves content. The latter subsumes subject matter as well as skills. Content validity “implies a rational strategy whereby a particular behavioural domain of interest is identified, usually by reference to curriculum objectives or task requirements or job characteristics”[122]. Content validity is concerned with how test items represent the content of a syllabus or the content of real-life situations. Content validity is not only a match between (the situation, topic and style of) tests and real-life situations but also a match between tests and school life, both of which are part of “real” life.

2.8.3  Construct validity

The constructs, or human abilities, that this study is interested in belong to the domain of language acquisition. As I mentioned earlier (section 2.2), abilities are fixed attributes, or constructs (in the sense of consistent, not immutable). If behaviour were inconsistent it would be impossible to find out what lies behind the behaviour, i.e. discover the construct. The problem for scientists, whether physical scientists or linguistic scientists, is figuring out the nature and sequence of the contribution of (abstract) theory and (concrete) experience to construct validity.

Consider how evidence for construct validity is assembled. There are two main stages: (1) hypothesise a construct and (2) construct a method that involves collecting empirical data to test the hypothesis, i.e. develop a test to measure the construct. Hypothesising is concerned with theory, while the construction of a method, although inseparable from theory, is largely an empirical issue. The problem is that it is not clear whether theory should be the cart and experience the horse, or vice versa, or some other permutation. Consider some of the problems in assessing the relative contribution of theory and experience in construct validation. For Messick[123], construct validity is a unitary concept that subsumes other kinds of (sub-)validities, e.g. content validity and criterion validity. Messick[124] defines validity, which for him is construct validity, as a

unitary concept that describes an integrated evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment. (Original emphasis)

“Test scores” above refers to “quantitative summaries” (Messick 1987:3), which is the commonly understood meaning of the term. Messick’s longer and much denser definition of construct validity

implies a joint convergent and discriminant strategy entailing both substantive coverage and response consistency in tandem. The boundaries and facets of a behavioural domain are specified, but in this case as delineated by a theory of the construct in question. Items are written to cover that domain in some representative fashion, as with content validity, but in this approach the initial item pool is deliberately expanded to include items relevant to competing theories of the construct, if possible, as well as items theoretically irrelevant to the construct… Item responses are then obtained, and items are selected that exhibit response homogeneity consistent with the focal theory of the construct but are theoretically distinct from theoretically irrelevant items or exemplars of competing theories. [125]

What Messick’s definition loses in brevity and simplicity it gains in scientific rigour. In Messick’s definition one should have a theory and only then a method, and the theory must be able to specify the problem without prejudging a solution: something very difficult to do. We try to ensure a “substantive coverage” (Messick above) of entities or qualities that are similar (a convergent strategy) and of those that are different (a discriminant strategy), which is “delineated by a theory of the construct in question” (Messick above). The problem is knowing what items to include or exclude in a test, because owing to the infinity of the corpus and the fact that elements and skills hang together[126], it is difficult to distinguish between “items relevant to competing theories of the construct, if possible” and “items theoretically irrelevant to the construct” (Messick above; italics added). The Unitary Competence controversy is concerned with the nature and degree of interdependence between elements and skills, i.e. how, how much and which elements and skills hang together. Difficulties exist in the discrimination between items. One of these difficulties is distinguishing between low order (so-called “discrete”) items and higher order (so-called “integrative”) items, as was shown in the discussion of the discrete-point/integrative controversy (section 2.5).
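A minimal sketch of the item-selection step Messick describes may help (the responses, loadings and cutoff are all invented for illustration): items written to the focal construct should show response homogeneity with the rest of the pool, operationalised here as an item-rest correlation, while the deliberately included irrelevant items should not.

```python
# Minimal sketch: selecting items that are homogeneous with the focal construct.
import numpy as np

rng = np.random.default_rng(seed=1)
n_takers, n_items = 300, 10
ability = rng.normal(size=n_takers)        # latent focal construct

# Items 0-6 are written to the focal construct; items 7-9 play the role of the
# "theoretically irrelevant" items deliberately added to the pool.
loadings = np.array([1.0] * 7 + [0.0] * 3)
noise = rng.normal(size=(n_takers, n_items))
responses = (ability[:, None] * loadings + noise > 0).astype(float)

def item_rest_correlation(resp: np.ndarray, j: int) -> float:
    """Correlate item j with the summed score of all the other items."""
    rest = resp.sum(axis=1) - resp[:, j]
    return float(np.corrcoef(resp[:, j], rest)[0, 1])

CUTOFF = 0.2  # illustrative threshold; real cutoffs depend on the focal theory
kept = [j for j in range(n_items) if item_rest_correlation(responses, j) >= CUTOFF]
print("Items retained as homogeneous with the focal construct:", kept)
```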

The group-differences approach to construct validity that is used in this study is now explained. The aim of testing is to discern levels of ability. If one uses academic writing ability as an example of a construct, one would hypothesise that people with a high level of this ability would have a good command of sentence structure, cohesion and coherence, while people with a low level of this ability would have a poor command of these. Tests are then administered, and if it is found that there is a significant difference between a group of high achievers and a group of low achievers, this would be valid evidence for the existence of the construct. Second language learners are often relatively less competent than first language or mother-tongue users.[127]

Important for the arguments presented, and for the validation of the sample of subjects, is that those who take English First Language as a subject are generally more competent than those who take English Second Language as a subject. If a test fails to discriminate between low-ability and high-ability learners, there are three possible reasons for this (a minimal sketch of the group comparison follows the list):

- The construction of the test is faulty, e.g. the test may be too easy or too difficult for all or most of the test takers.

- The theory undergirding the construct is faulty.

- The test has been inaccurately administered and/or scored, which would decrease the reliability, and hence also the validity of the test. (Reliability is discussed shortly).
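As a minimal sketch of the group-differences logic (all scores invented for illustration; any real analysis would rest on the design decisions discussed in Chapter 4), one can compare a high-achieving and a low-achieving group with an independent-samples t-test:

```python
from scipy import stats

# Hypothetical essay-test scores (out of 100) for two criterion groups.
high_achievers = [72, 68, 75, 81, 70, 77, 74, 79, 73, 76]
low_achievers = [51, 48, 55, 60, 47, 52, 58, 50, 54, 49]

# Independent-samples t-test: a significant difference in the predicted
# direction counts as group-differences evidence for the construct.
t, p = stats.ttest_ind(high_achievers, low_achievers)
print(f"t = {t:.2f}, p = {p:.4f}")
```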

2.8.4  Criterion validity: concurrent and predictive validity

Criterion validity is concerned with correlating one test against an external criterion such as another test or non-test behaviour. Ebel[128] maintains that “unless a measure is related to other measures, it is scientifically and operationally sterile.” Criterion validity should not be confused with criterion-referenced tests. Criterion-referenced tests deal with profiles, i.e. with setting a predetermined cut-off score for an individual.

Criterion validity, which relies mainly on empirical methods, ignores the theoretical contribution of construct and content validity. For this reason, some researchers, particularly those of the American Psychological Association, prefer to dissociate validity from descriptions of the criterion, e.g. Loevinger[129] and Bachman[130]. Bachman[131] prefers the term “concurrent relatedness” to “concurrent validity”, and “predictive utility” to “predictive validity”.

A term used by Messick is “criterion-related validity”, where the latter “implies an empirical strategy whereby items are selected that significantly discriminate between relevant criterion groups or that maximally predict relevant criterion behaviours”.[132] Messick’s “criterion-related validity” is the same notion as the simpler term “criterion validity”, which was defined in the first sentence of this section.

Criterion validity consists of concurrent validity and predictive validity. Concurrent validity is also concerned with prediction, because there is only a chronological difference between concurrent and predictive validity.[133] We could thus distinguish between concurrent prediction and prediction proper, the latter being concerned with the ability of one test to predict another test where the predictor and the criterion are not given concurrently but are separated by a reasonable period of time.
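Computationally, both kinds of criterion validity come down to the same operation, a correlation between predictor and criterion scores; only the timing of the criterion differs. A sketch with invented marks:

```python
from scipy import stats

# Hypothetical predictor scores (e.g. a cloze test) and two criteria.
cloze = [34, 41, 29, 45, 38, 50, 33, 47, 40, 36]
# Concurrent criterion: an essay test written in the same week.
essay_now = [55, 62, 48, 70, 60, 74, 52, 69, 63, 58]
# Predictive criterion: end-of-year examination marks, months later.
exam_later = [50, 65, 45, 72, 58, 78, 49, 70, 60, 54]

r_conc, _ = stats.pearsonr(cloze, essay_now)   # concurrent validity
r_pred, _ = stats.pearsonr(cloze, exam_later)  # predictive validity
print(f"concurrent r = {r_conc:.2f}, predictive r = {r_pred:.2f}")
```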

The reason why predictive validity is easier to measure than other kinds of validity is that it does not depend on the nature of the test items but on the consistency of the predictions of performance.[134] It would be possible to ignore all the other kinds of validity and still have a high degree of predictive validity. The question is whether one should be satisfied with predictive validity alone; this study holds that one should not, which is why it is also concerned with construct validity.

If the construct validity of one test always depends on the validity of another test, there cannot exist any one test that stands by itself, an equivalent of a “Prime Mover”. Lado’s solution is to compare all tests in terms of “some other criterion whose validity is self-evident, e.g. the actual use of the language.”[135] The question is: what is self-evident? Is there a self-evident test that pre-exists all other tests? There is not, because “the buttressing validity of an external criterion is often neither definable nor, when found, reliable”.[136] This does not mean, of course, that any test or battery of tests, direct or indirect, will do. The problem, however, remains: which tests will do?[137]

Having said that, we need not worry about the difficulty of establishing construct validity if we are merely interested in predictive validity. If Test A is a good predictor of Test B, then it seems we do not need Test C as a second predictor, because Test A is already doing a good job. However, recall the discussion of the “One Best Test” question: we can never be sure, and furthermore, it does not look fair (a matter of face validity) to use only one test as a predictor. To do so would be regarded by some researchers as highly unethical. Spolsky[138] is a case in point:

Only the most elaborate test batteries, with multiple administrations of multiple methods of testing the multiple traits or abilities that make up language proficiency, are capable of producing rich and accurate enough profiles to be used for making critical or fateful decisions about individuals.

Such Herculean conditions, however, would probably “paralyze”[139] most testing endeavours, because they are unrealisable in practice. We should not wait to measure until we are entirely clear about what we are measuring; rather, we should do the best we can, taking generally accepted theories into account but not following them slavishly if we have cogent reasons not to.

2.9  Reliability

If the validity of a test depends on its close approximation to real life, then validity relates to subjectivity. In test compilation, administration and assessment, however, we try to be as objective as possible, and this search for objectivity is the domain of reliability. Reliability in testing is concerned with the accuracy and consistency of scoring and of administration procedures: the less the accuracy and consistency, the greater the measurement error.

A major difficulty in testing is how to make the “leap from scores to profiles”[140], i.e. how to define the cut-off points. In norm-referenced testing, one defines cut-off points by computing the measurement error. In criterion-referenced testing, one makes a value judgement of what counts as sufficient progress for a specific individual.

To the extent that one can decrease the measurement error, one increases the reliability of the test. Measurement error has important ethical implications. It would be unjust to fail students because they get 49% – perhaps even 47%; where does one draw the line? – instead of 50%. In subjective tests such as essay tests the problem is more serious: even the best essay test, owing to its subjective scoring procedures, is often not more than 80%-90%[141] reliable, and therefore measurement error should be calculated in order to make more equitable judgements.
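The borderline-score problem can be quantified with the standard error of measurement, SEM = SD × √(1 − r), where SD is the standard deviation of the scores and r the reliability coefficient. The sketch below uses assumed values (SD = 10, r = 0.85, in line with the 80%-90% figure cited in note 141) to show why a candidate on 49% cannot confidently be placed below a 50% pass mark:

```python
import math

sd = 10.0           # assumed standard deviation of the test scores
reliability = 0.85  # assumed reliability of an essay test (cf. note 141)

sem = sd * math.sqrt(1 - reliability)  # standard error of measurement

score = 49.0        # the borderline candidate
# An approximate 68% band: one SEM either side of the observed score.
low, high = score - sem, score + sem
print(f"SEM = {sem:.2f}; true score plausibly between {low:.1f} and {high:.1f}")
# The pass mark of 50 falls inside this band, so failing the candidate
# outright ignores the measurement error discussed above.
```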

The following “aspects” are germane to reliability:

- Facets: These refer to such factors as the (1) testing environment, e.g. the time of testing and the test setting, (2) test organisation, e.g. the sequence in which different questions are presented, and (3) the relative importance of different questions and topic content. Facets also include cultural and sexual differences between test takers, the attitude of the test taker, and whether the tester does such things as point out the importance of the test for a test-taker’s future.[142]

- Features or conditions: These refer to such factors as clear instructions, unambiguous questions and items that do or do not permit guessing.

- The manner in which the test is scored. A central factor in this regard is rater consistency. Rater consistency becomes a problem mainly in the kind of tests that involve subjective judgements such as essay tests. (I discuss interrater and intrarater consistency in the next section). According to Ebel and Frisbie[143],  consistency is not only concerned with the correlations between raters, but also with the actual scores, more specifically, the equivalence in scores between tests and between raters. (I discuss this issue in section 4.8.1.2).

I now clarify a possible confusion between rater reliability and concurrent validity. Rater reliability has to do with the consistency between raters’ judgements on one test, e.g. an essay test. Concurrent validity, in contrast, has to do with the correlation between two or more different tests, e.g. a dictation test and an essay test. The next section provides more detail on the approaches to reliability, which may help clarify the concepts discussed.

2.9.1  Approaches to the measurement of reliability

There are five approaches to measuring reliability. Owing to the structure of this study, only approaches 2, 3, 4 and 5 are used:

1. Stability, i.e. consistency over time. The method used to measure stability is the test-retest method, which involves giving the same test a second time and comparing the scores of the two test trials. If the scores are equivalent, the test is considered to be stable. A disadvantage of the test-retest method is that students may not be motivated to do the test a second time, which might affect performance on the retest.

2. Internal consistency. This approach, also called the “split-half” method, divides the test into two halves, which are regarded as two parallel tests. For each student there is a separate score for each half of the test, and the two sets of scores can be correlated as if they came from parallel tests. (A worked sketch of this computation follows the list.)

3. Rater reliability. Rater reliability is particularly important in non-objective tests such as essay tests, where there are liable to be fluctuations in scores (1) between different raters, which is the concern of interrater reliability, and (2) within the same rater, which is the concern of intrarater reliability. In this study I use essay assessment to examine interrater reliability (section 4.8.1; see also the sketch after this list).

4. Equivalence (in the form of the test). There are two meanings of equivalence: firstly, equivalence between test scores, and secondly, equivalence between the facets of the tests. The method used to measure equivalence is the parallel-test method. In parallel tests it is difficult to ensure equivalent conditions within the many facets of a test, especially to ensure that the content of the two parallel tests is equivalent. The problem does not arise for multiple-choice-type tests, because the split-half method can be used. In the case of “pragmatic” tests, however, such as the cloze, dictation and essay tests in this study, there is a problem. This problem is examined in the discussion of the parallel reliability of the pragmatic tests in the study.

5.  A combination of stability and equivalence (in forms). The method used is a parallel test which is administered a period of time after the first test. The difficulties are compounded here, because they include the problems of both equivalence and of stability mentioned above.
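As flagged in approaches 2 and 3 above, both internal consistency and interrater reliability reduce to correlating two sets of scores; the split-half coefficient is then stepped up with the Spearman-Brown formula, because each half is only half the length of the full test. The data below are invented for illustration:

```python
import numpy as np

# --- Approach 2: split-half internal consistency ---
# Hypothetical item scores (persons x items); odd and even items form the halves.
scores = np.array([
    [3, 4, 2, 5, 3, 4, 2, 4],
    [1, 2, 1, 2, 2, 1, 1, 2],
    [4, 5, 4, 4, 5, 4, 3, 5],
    [2, 3, 2, 3, 2, 3, 2, 2],
    [5, 4, 5, 5, 4, 5, 4, 4],
    [2, 2, 3, 2, 2, 2, 3, 3],
], dtype=float)
half_a = scores[:, 0::2].sum(axis=1)  # odd-numbered items
half_b = scores[:, 1::2].sum(axis=1)  # even-numbered items
r_half = np.corrcoef(half_a, half_b)[0, 1]
# Spearman-Brown correction for a test twice the length of each half.
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")

# --- Approach 3: interrater reliability as a correlation between raters ---
rater_1 = [65, 50, 78, 55, 80, 58]  # hypothetical essay marks, rater 1
rater_2 = [62, 55, 75, 52, 84, 60]  # the same scripts marked by rater 2
r_raters = np.corrcoef(rater_1, rater_2)[0, 1]
print(f"interrater r = {r_raters:.2f}")
```

Note that a high correlation between raters does not by itself guarantee equivalent scores, which is Ebel and Frisbie’s point above: two raters can rank scripts identically while one marks systematically higher than the other.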

The degree of reliability required depends on the relative importance of the decisions to be made. For example, an admission test would require more reliability than a placement test, because decisions based on a placement test can be more easily adjusted than decisions based on an admission test. A final evaluation for promotion purposes would require the most reliability of all.

2.10  Ethics of measurement

Validity should not be separated from what Sammonds[144] refers to as the following “ethical” questions, most of which are scientific questions as well. (The kinds of validity corresponding to each question are the appellations given by Sammonds.)

1. Are the measures that are chosen to represent the underlying concepts appropriate? (“Construct validity”).

2. Has measurement error been taken into account, because in all measurement there is always a degree of error? (“Statistical conclusion validity”; in other words, reliability).

3. Are there other variables that need to be taken into account? (“Internal validity”).

4. Are the statistical procedures explained in such a way that a non-statistical person – or sometimes even a statistical person – can understand them? One reader may find an explanation superfluous or too detailed, while another may find the same information patchy. Much depends on the background knowledge of readers and/or what they are looking for. Pinker[145] maintains that expository writing requires writers to overcome their “natural egocentricism”, where “trying to anticipate the knowledge state of the generic reader at every stage of the exposition is one of the important tasks of writing well.” True, but there is much more, namely the basic expository problem of negotiating a path between under-information and overkill. Getting experts to read and provide comments before one submits one’s work to public scrutiny is one way of reducing the expository problem. It may also compound the problem, however, owing to the diversity of beliefs in the world: of interpretations of interpretations (see section 6.2).

5. Has the description of the sample and the data analysis been properly done so that generalisations can be made from it? (“External validity”). This important issue is dealt with in the last chapter (section 7.1).

2.11  Summary of Chapter 2

The first part of the chapter dealt with theoretical issues in language proficiency, language learning, language testing and academic achievement. Key concepts such as authenticity, competence, performance, ability, proficiency, test language, integrative continuum and achievement were explained.

The second part of the chapter was concerned with explaining the key concepts in summative assessment. The two principal concepts in summative assessment are validity and reliability. Different kinds of validity were discussed, namely, content validity, face validity, construct validity and criterion validity, where the latter comprises concurrent and predictive validity. Other kinds of validity were also referred to in the context of the ethics of measurement. The group-differences approach to construct validity, to be used in the study, was described. Different approaches to the examination of reliability were discussed and those chosen for the study were specified.


[1](1) Ingram, E. Assessing proficiency: An overview on some aspects of testing, 1985, p.218. (2) Vollmer, H.J. Why are we interested in general language proficiency, 1981, p.152.

[2]Ibid.

[3]“Norm” is used in the sense of an idealisation against which comparisons are made of what scientists call the “real” world. Although this “normal” curve is a mathematical abstraction, it is based on the reasoning that if there were an infinitely large population then human abilities (and the milk yield of cows) would be represented by a perfect bell curve.

[4]Carroll, J.B. Human cognitive abilities: A survey of factor analytic studies, 1993, p.10.

[5]Minick, N.J. L.S. Vygotsky and Soviet activity theory: New perspectives on the relationship between mind and society, 1985, pp.13-14.

[6]The concept of fixety in the social sciences carries the stigma of colonialism and racism, “an ideological constriction of otherness” (Bhabha 1994:66; see also Leung, Harris and Rampton 1997). This came out clearly in an article by Phatekile Holomisa, president of the Congress of Traditional Leaders of South Africa (Financial Mail, October 16, 1998, p.22), who maintained that one of the reasons why Transkeians who have made good outside the Transkei do not return to help the rural poor is because they believe that they “might be seen as promoting ubuXhosa (Xhosa-ness), in contradiction to the ideal of nonracialism.” There is also the danger that the concept of fixety could translate into the constriction of the historical individual, which the generalising mode of science considers to have significance only insofar as it reveals a universal rule (Cassirer 1946:27).

[7]Botha, H.L. and Cilliers, C.D. ‘Programme for educationally disadvantaged pupils in South Africa: A multi-disciplinary approach.’ South African Journal of Education, 13 (2), 55-60 (1993).

[8]Bridges, P. ‘Transferable skills: A philosophical perspective.’ Studies in Higher Education, 18 (1), 43-52 (1993), p.50.

[9]Ibid.

[10]Ibid.

[11]Millar, R. ‘The pursuit of the impossible.’ Physics Education, 23, 156-159 (1988), p.157.

[12]Ibid.

[13]Gamaroff, R.  ‘Solutions to academic failure: The cognitive and cultural realities of English as the medium of instruction among black ESL learners.’ Per Linguam, 11 (2), 15-33 (1995c).

__________ ‘Abilities, access and that bell curve.’ Grewar, A. (ed.). Proceedings of the South African Association of Academic Development “Towards meaningful access to tertiary education”. (Alice: Academic Development Centre, Fort Hare, 1996b).

___________  ‘Language as a deep semiotic system and fluid intelligence in language proficiency.’ South African Journal of Linguistics, 15 (1), 11-17 (1997b).

[14]Davies, A. Principles of language testing, 1990, p.6.

[15]Chomsky, N. Aspects of the theory of syntax, 1965, p.6.

[16](1) Brown, K. Linguistics today, 1984, p.144.

(2) Leech, G. Semantics, 1981, p.69.

(3) Hutchinson, T. and Waters, A. English for special purposes: A learner-centred approach, 1987, p.28.

[17]Chomsky, N. Aspects of the theory of syntax, 1965, pp. 3-4.

[18]Chomsky, N. Syntactic structures, 1957, p.103.

[19]Atkinson, M., Kilby, D. and Roca, I. Foundations of general linguistics, 1982, p.369.

[20]Canale, M. and Swain, M. ‘Theoretical bases of communicative approaches to second language teaching and testing.’ Applied Linguistics, 1 (1), 1-47 (1980), p.34.

[21]Hymes, D. ‘On communicative competence’, in Pride, J.B. and Holmes, J. (eds.). Sociolinguistics. (Harmondsworth, Penguin, 1972).

[22](1) Canale, M. and Swain, M. ‘Theoretical bases of communicative approaches to second language teaching and testing.’ Applied Linguistics, 1 (1), 1-47 (1980), p.34.

(2) Swain, S. ‘Large-scale communicative language testing: A case study’, in Lee, Y., Fok, A., Lord, R. and Low, G. (eds.). New directions in language testing. (Oxford,  Institute of English, 1985).

(3) Savignon, S.J.  Communicative competence: Theory and classroom practice.  (Reading, Mass. Addison-Wesley Publishing Company, 1983).

[23]Bachman, L.F. and Palmer, A.S. Language testing in practice, 1996. (See their Chapter 4).

[24]Skehan, P. A cognitive approach to language learning, 1998, p.16.

[25]Widdowson, H.G. Aspects of language teaching, 1990, p.40.

[26]Chomsky, N. Language and the problem of knowledge, 1988, p.9.

[27]Haussmann, N.C. The testing of English mother-tongue competence by means of a multiple-choice test: An applied linguistics perspective, 1992, p.16.

[28]Bachman, L.F. and Clark, J.L.D. ‘The measurement of foreign/second   language proficiency.’ American Academy of the Political and Social Science Annals, 490, 20-33 (1987), p.21.

[29]Carroll, B.J. Testing communicative performance. (Oxford, Pergamon, 1980).

[30]Weir, C.J. Communicative language testing, 1988, p.10.

[31]Skehan, P. A cognitive approach to language learning, 1998, p.154.

[32]Vollmer, H.J. The structure of foreign language competence, 1983, p.5.

[33]Brière, E. ‘Are we really measuring proficiency with our foreign language tests?’ Foreign Language Annals, 4, 385-91 (1971), p.322.

[34]Vollmer, H.J. Ibid, p.5.

[35]Hyltenstam, K. and Pienemann, M. Modelling and assessing second language acquisition, 1985, p.15.

[36]The term native is problematic. I discuss this problem in sections 3.2.1 and 6.1.1.

[37]Porter, D. Assessing communicative proficiency: The search for validity, 1983.

[38]Child, J. ‘Proficiency and performance in language testing.’ Applied Linguistic Theory, 4 (1/2), 19-54 (1993).

[39]Alderson, J.C. and Clapham, C. ‘Applied linguistics and language testing: A case study of the ELTS test.’ Applied Linguistics, 13 (2), 149-167 (1992), p.149.

[40]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988).

[41]Omaggio, A.C. Teaching language in context: Proficiency-orientated instruction, 1986.

[42]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.182.

[43]Widdowson, H.G. Explorations in applied linguistics. (Oxford,  Oxford University Press, 1979).

______________   ‘Knowledge of language and ability of use.’ Applied Linguistics, 10 (2), 128-137 (1989).

______________ Aspects of language teaching. (Oxford,  Oxford University Press, 1990).

______________ ‘Communication, community and the problem of appropriate use’, in Alatis, J.E. Georgetown University Round Table on Languages and Linguistics. (Washington, D.C. Georgetown University Press, 1992).

[44]Alatis, J.E. (ed.). Georgetown University Round Table on Languages and Linguistics,  1992.

[45]Byrnes, H. and Canale, M. (eds.). Defining and developing proficiency: Guidelines, implementations and concepts, 1987.

[46]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.182.

[47]Ibid.

[48]Widdowson, H.G. Aspects of language teaching. (Oxford, Oxford University Press, 1990).

[49]Gamaroff, R. ‘Is the (unreal) tail wagging the (real) dog?: Understanding the construct of language proficiency.’ Per Linguam, 12 (1), 48-58 (1996a).

[50]Nunan, D. Syllabus design, 1988, p.28.

[51]Widdowson, H.G. Explorations in applied linguistics, 1979, p.141.

[52]Ibid, p.246.

[53]Widdowson, H.G. ‘Communication, community and the problem of appropriate use’, in Alatis, J.E. Georgetown University Round Table on Languages and Linguistics, 1992, p.306.

[54]Lantolf, J.P. and Frawley, W. ‘Proficiency: Understanding the construct.’ Studies in Second Language Acquisition (SLLA), 10 (2), 181-195 (1988), p.183.

[55]Byrnes, H. and Canale, M. (eds.). Defining and developing proficiency: Guidelines, implementations and concepts, 1987, p.1.

[56]His specific context is the second language “acquisition”/second language “learning” controversy of Krashen (1981); see Note 58 below.

[57]Morrissey, M.D. ‘Toward a grammar of learner’s errors.’ International Review of Applied Linguistics, 21 (3), 193-207 (1983), p.200.

[58]Krashen, S. Second language acquisition and second language learning. (Oxford,  Pergamon Press, 1981).

[59]Krashen, S. and Terrell, T. The natural approach: Language acquisition in the classroom, 1983.

[60]Butzkamm, W. ‘Review of H. Hammerly, “Fluency and accuracy: Toward balance in language teaching and learning”.’ System, 20 (4), 545-548 (1992).

[61]Taylor, B.P. ‘In search of real reality.’ TESOL Quarterly, 16 (1), 29-43 (1982).

[62]Lantolf and Frawley, 1988, p.183.

[63]Politzer, R.L. and McGroarty, M. ‘A discrete-point test of communicative competence.’ International Review of Applied Linguistics, 21 (3), 179-191 (1983).

[64]Entwhistle, W.J. Aspects of language, 1953, p.157.

[65]Rea, P. Language testing and the communicative language teaching curriculum, 1985.

[66]Halliday, M.A.K. Learning how to mean, 1975, p.5.

[67]Ibid, p.2.

[68]Givon, T. Understanding grammar, 1979, pp.5 and 22.

[69]Ibid, p.31.

[70]Rutherford, W.E. Second language grammar: Learning and teaching, 1987, pp.1-5.

[71]Ibid, pp.4 and 36-37.

[72]Spolsky, B. Conditions for second language learning, 1989, p.61.

[73]Alderson, J.C. Reaction to the Morrow paper, 1981c, p.47.

[74]Besner, N. ‘Process against product: A real opposition?’ English Quarterly, 18 (3), 9-16 (1985), p.9.

[75]Widdowson, H.G. Explorations in applied linguistics, 1979.

[76]Farhady, H. The disjunctive fallacy between discrete-point tests and integrative tests,  1983.

[77](1) Hale, G.A., Stansfield, C.W. and Duran, R.P. TESOL Research Report 16. (Princeton, New Jersey: Educational Testing Service, 1984).

(2) Henning, G.A., Ghawaby, S.M., Saadalla, W.Z., El-Rifai, M.A., Hannallah, R.K. and Mattar, M. S. ‘Comprehensive assessment of language proficiency and achievement among learners of English as a foreign language.’ TESOL Quarterly, 15 (4), 457-466 (1981).

(3) Oller, J.W., Jr. Language tests at school. (London, Longman, 1979).

(4) Oller, J.W. (Jr.) and Perkins, K. (eds.). Language in education: testing the tests. (Rowley, Massachusetts,  Newbury House, 1978).

[78]Fotos, S. ‘The cloze test as an integrative measure of EFL proficiency: A substitute for essays on college entrance examinations.’ Language Learning, 41 (3), 313-336 (1991), p.318.

[79]Alderson, J.C. ‘The cloze procedure and proficiency in English as a foreign language.’ TESOL Quarterly, 13, 219-227 (1979).

[80]Ibid.

[81]Ibid.

[82]Ibid.

[83]Alderson, J.C., Clapham, C. and Wall, D. Language test construction and evaluation. (Cambridge, CUP, 1995).

[84]Bonheim, H. Roundtable on language testing. European Society for the Study of English (ESSE) conference, Debrecen, Hungary, September 1997.

[85]Oller, J.W., Jr. A consensus for the 80s, 1983, p.137.

[86]Rea, P. ‘Language testing and the communicative language teaching curriculum’, in Lee, Y.P. et al. New directions in language testing. (Oxford. Institute of English, 1985), p.22.

[87]Ibid.

[88]Canale, M. and Swain, M. ‘Theoretical bases of communicative approaches to second language teaching and testing.’ Applied Linguistics, 1 (1), 1-47 (1980), p.35.

[89]Ibid.

[90]Alderson, J.C. ‘The cloze procedure and proficiency in English as a foreign language.’ TESOL Quarterly, 13, 219-227 (1979).

[91]Stevenson, D.K. Pop validity and performance testing, 1985.

[92]Spolsky, ibid, pp.33-34.

[93]For example, Hughes, A. Testing for language teachers. (Cambridge, Cambridge University Press, 1989), p.15.

[94]Rea, P. ‘Language testing and the communicative language teaching curriculum’, in Lee, Y.P. et al. New directions in language testing. (Oxford. Institute of English, 1985).

[95]Omaggio, A.C. Teaching language in context: Proficiency-orientated instruction. (Boston, Massachusetts, Heinle and Heinle, 1986), pp.312-313.

[96]Alderson, J.C. ‘Who needs jam?’, in Hughes, A. and Porter, D. Current   developments in language testing. (London, Academic Press, 1983), p.89.

[97]Politzer, R.L. and McGroarty, M. ‘A discrete-point test of communicative competence.’ International Review of Applied Linguistics, 21 (3), 179-191 (1983).

[98]Widdowson, H.G. ‘Skills, abilities, and contexts of reality.’ Annual Review of Applied Linguistics, 18, 323-333 (1998), p.329.

[99]Spolsky, B. Conditions for second language learning. (Oxford, Oxford University Press, 1989), p.61.

[100]Rutherford, W.E. Second language grammar: Learning and teaching, 1987.

[101]Spolsky, ibid.

[102]This raises the contentious issue of separating “semantics” from “pragmatics” (see Hudson 1984). From the point of view of the ideational (or conceptualising) function of language, which is what most of language processing is concerned with, or should be concerned with, far more demands are made on semantic and syntactic encoding than on the communicative act itself, which, after all, is only the last stage of language in action – unless one speaks before one thinks (Widdowson 1998:330).

[103]Cummins, J. ‘The cross-lingual dimensions of language proficiency: Implications for bilingual education and the optimal age issue.’ TESOL Quarterly, 14 (2), 175-87 (1980).

_________Language proficiency and academic achievement, 1983.

_________  Wanted: A theoretical framework for relating language proficiency to academic achievement among bilingual students, 1984.

[104]Widdowson, H.G. ‘Skills, abilities, and contexts of reality.’ Annual Review of Applied Linguistics, 18, 323-333 (1998), p.326.

[105]Bernstein, B. Class, codes and control, 1971.

[106]Wald, B. A sociolinguistic perspective on Cummins’ current framework for relating language proficiency to academic achievement, 1984, p.57.

[107]Ur, P. A course in language teaching: practice and theory, 1996.

[108]Saville-Troike, M. ‘What really matters in second language learning for academic achievement.’ TESOL Quarterly, 18 (2), 199-219 (1984), p.199.

[109]Upshur, J.A. ‘English language tests and predictions of academic success’, in Wigglesworth, D.C. (ed.). Selected conference papers of the Association of Teachers of English as a Second Language. Los Altos, California, National Association for Foreign Student Affairs (NAFSA) Studies and Papers, English Language Series 13, 85-93 (1967), p.85.

[110]Valette, R.L. Modern language testing: A handbook, 1969, p.5.

[111]Carmines, G. and Zeller, A. Reliability and validity assessment, 1979, p.15.

[112]American Psychological Association. Standards of educational and psychological measurement, 1974.

[113]Clark, J.L.D. Theoretical and technical considerations in oral proficiency testing, 1975, p.28.

[114]Spolsky, B. ‘The limits of authenticity in language testing.’ Language Testing, 2, 31-40 (1985), pp.33-34.

[115]Davies, A. Principles of language testing, 1990, p.44.

[116]Ibid, p.7.

[117]Yeld, N. ‘Communicative language testing and validity.’ Journal of Language Teaching, 21 (3), 69-82 (1987), p.78.

[118]Stevenson, D.K. Pop validity and performance testing, 1985, p.112.

[119](1) American Psychological Association. Standards of educational and psychological measurement. (Washington, D.C., American Psychological Association, 1974).

(2)Cronbach, L.J. Essentials of psychological testing. (New York: Harper and Row, 1970).

(3) Gardner, R.C.  and Tremblay, P.F. ‘On motivation: measurement and   conceptual considerations.’ The Modern Language Journal, 78 (4), 524-527 (1994).

(4) Stevenson, D.K.’Pop validity and performance testing’, in Lee, Y., Fok, A., Lord, R. and Low, G. (eds.). New directions in language testing. (Oxford, Pergamon, 1985).

[120]Cronbach, ibid, p.183.

[121]Gardner and Tremblay, ibid, p.525.

[122]Messick, S. Constructs and their vicissitudes in educational and psychological measurement, 1989a, p.1.

[123]Messick, S. Validity, 1989b, pp.1-2.

[124]Messick, S. Meaning and values in test validation: The science and ethics of measurement, 1988, p.2.

[125]Messick, S. Constructs and their vicissitudes in educational and psychological measurement, 1989a, p.1.

[126]In Item Response Theory (IRT) a major issue is the unidimensionality assumption that all items in a test measure a single ability. If, however, language “competence” is multidimensional (see Bachman 1990a), the assumption that all items hang together might not be correct. From the psychometric point of view, though, the unidimensionality assumption does not, as Henning (1992) argues, preclude the psychological basis of multidimensionality (see Douglas 1995).

[127]Of course, there are many second language users who have a far better command of academic discourse than mother-tongue users. This is so because the ability to understand and produce academic discourse depends on much more than “linguistic ability”: it also depends on CALP and academic intelligence (Gamaroff 1995c, 1996b, 1997b).

[128]Ebel, R.L. ‘Must all tests be valid?’ American Psychologist, 16, 640-647 (1961), p.645.

[129]Loevinger, J. Objective tests as instruments of psychological theory, 1967, p.93.

[130]Bachman, L.F. Fundamental considerations in language testing, 1990b, p.253.

[131]Ibid.

[132]Messick, S. Constructs and their vicissitudes in educational and psychological measurement. (Princeton, New Jersey, Educational Testing Service, 1989a).

[133]Cronbach, L.J. Essentials of psychological testing, 1970, p.122.

[134]Weir, C.J. Communicative language testing, 1988, p.30.

[135]Lado, R. Language testing, 1961, p.324.

[136]Davies, A. Principles of language testing, 1990, p.3.

[137]This problem is indicative of the much larger problem of the indeterminacy of language (and hence also of epistemology) itself. “There is no perfect hypothetical language, to which the languages we have are clumsy approximations” (Harris 1981:175). And this must inevitably lead the applied linguist to grapple with the slippery notions of native speaker and mother-tongue speaker. I discuss this issue further in section 6.1.1.

[138]Spolsky, B. Measured words, 1995, p.358.

[139]Ibid.

[140]Yeld, N. Communicative language testing. Report on British Council Course 559 offered at Lancaster University from 8 September to 20 September 1985. (Cape Town, University of Cape Town, 1986), p.31.

[141]According to Perkins (1983:655),“raters, guided by…holistic scoring guides…, can achieve a scoring reliability as high as .90 for individual writers.” Indirect objective tests such as multiple choice grammar and vocabulary tests, on the other hand, can have reliability coefficients as high as .99, because there is no problem of rater reliability involved, i.e. subjective judgements will not affect the scores (Hughes 1989:29).

[142]Bachman, L.F. Fundamental considerations in language testing. (Oxford: Oxford University Press, 1990b), pp.116ff, 168-172, 244.

[143]Ebel, R.L. and Frisbie, D.A. Essentials of educational measurement, 1991, p.76.

[144]Sammonds, P. Ethical issues and statistical work, 1989, p.53.

[145]Pinker, S. The language instinct, 1995, p.401.
