Journal of Language Teaching, 32(2):94-104, 1998.
Author: Raphael Gamaroff
For “real-life” testers the discrete-point/integrative controversy is out of fashion, long dead and buried. It is argued here that there is still a lot of life in the “old beast”, and that the questions the controversy raises are as pertinent as ever. The overarching problem in language testing is how to assess authentic language reception and production validly and reliably. It is widely believed that “integrative” tests such as cloze tests, dictation tests and essay tests test authentic communicative language, while “discrete-point” tests such as error recognition tests and grammar tests merely test the elements of language. Such a distinction between the two kinds of tests is an oversimplification, which is largely due to characterising “integrative” tests as authentic, real-life and naturalistic, while characterising “discrete-point” tests as unauthentic, unreal and artificial. Some even argue that “integrative” tests such as cloze and dictation are also unauthentic. It is argued that tests do not have to be “naturalistic” or “direct” to be authentic tests.
The basic question in second language acquisition is: “What does it mean to know and use a second language?” The basic question in testing is: “How do we test this knowledge and use?” The first question is concerned with the nature of knowledge, the second with the selection of methods for testing this knowledge. Although these are distinct questions, the validity of tests depends on understanding how languages are learnt and used.
Do researchers know what a real-life task is, and not merely what it looks like? And if they did know, would it be necessary for learners to perform real-life tasks in order to prove that they are proficient at them? This question lies at the heart of the problem of what an authentic language activity or test is meant to be (Gamaroff, 1996). Although language proficiency ultimately has to do with language use, with authentic, or communicative, or direct, language, it does not follow that language proficiency can only be validly tested “on the wing” (Harrison, 1983:82), i.e. naturalistically. The implication of the arguments presented is that until we know more about testing, it is legitimate to follow the practical route of using “discrete-point” tests or “integrative” tests to predict “real-life” language proficiency.
2. DIRECT AND INDIRECT TESTS
For language “naturalists”, the only authentic tests are those presented in a direct real-life situation, because they are based on “naturalistic contexts” (Omaggio, 1986:312-313; see also Ingram, 1985:239ff). For “direct” testers, tests such as grammar tests, cloze tests and dictation tests are regarded as indirect tests, while essay tests and interviews would be regarded as direct tests (Hughes, 1989; Omaggio, 1986).
Many studies have found high correlations between “direct” and “indirect” tests (e.g. Oller, 1979; Henning et al, 1981; Hale et al, 1984; Fotos, 1991; Haussmann, 1992). Henning et al (1981) found a high correlation between composition tests and error identification tests (.76). Several studies in Hale et al (1984:120, 152) report high correlations between cloze tests and grammar tests (.82 to .93), cloze tests and essay tests (.78 to .94), and error recognition tests and essay tests (.75 to .93). Darnell (1968), Oller (1973:114; 1976:282) and Oller and Conrad (1971) found high correlations between written cloze and listening comprehension, and between listening comprehension and oral cloze.
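The strength of such relationships can be pictured with a short calculation. The sketch below computes a Pearson product-moment correlation, the statistic behind the coefficients cited above, for two invented sets of scores; the data are illustrative only and come from no actual study.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented scores for ten learners on a cloze test (out of 25)
# and an essay test (out of 100); illustrative data only.
cloze = [12, 18, 9, 22, 15, 7, 20, 14, 17, 11]
essay = [55, 60, 48, 82, 70, 45, 68, 52, 75, 58]

print(round(pearson_r(cloze, essay), 2))  # → 0.86
```

A coefficient of this size falls within the .78 to .94 range that Hale et al (1984) report for cloze and essay tests.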
Cloze and dictation reveal similar production errors in writing (Oller, 1976:287ff, 1979:57), and a combination of cloze tests and dictation tests has been used effectively in determining general language proficiency (Stump, 1978; Hinofotis, 1980). Oller (1979:61) maintains that all “pragmatic” tasks such as cloze tests or dictation tests probe the same underlying skill. In contrast to Oller, Savignon (1983:264) does not believe that cloze and dictation tests test pragmatic language, that is, language use.
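The cloze procedure itself is mechanical: every nth word of a passage is deleted, and the testee must restore it. A minimal sketch follows; the passage, the deletion ratio and the function name are mine, chosen only for illustration.

```python
def make_cloze(text, n=7, blank="____"):
    """Build a fixed-ratio cloze test: replace every nth word with a
    blank and return the mutilated text together with the answer key."""
    words = text.split()
    key = []
    for i in range(n - 1, len(words), n):
        key.append(words[i])
        words[i] = blank
    return " ".join(words), key

passage = ("The basic question in second language acquisition is what "
           "it means to know and use a second language and the basic "
           "question in testing is how we test this knowledge")
test_text, answers = make_cloze(passage, n=7)
# answers == ['acquisition', 'and', 'basic', 'test']
```

Scoring may then be by exact word or by any acceptable word, and that choice itself bears on what the test is taken to measure.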
How is the construct able to account for the differences in views described in the previous paragraphs? Shouldn’t supposedly similar types of tests relate more to each other than to supposedly different types of tests? An adequate response presupposes four further questions: (1) What are similar/different types of tests? (2) Wouldn’t it be more correct to speak of so-called discrete-point tests and so-called integrative tests? (3) Isn’t the discrete/integrative dichotomy irrelevant to what the cloze test (or any test) is measuring? And most importantly: (4) Is it necessary to use direct tests to predict direct language proficiency? In the next section I suggest some answers to these questions.
3. THE UNITARY COMPETENCE HYPOTHESIS AND THE “ONE BEST TEST”
The debate over whether language proficiency consists of a unitary or general factor (analogous to the g factor in intelligence), or of a number of independent factors, has straddled three decades, receiving prominence in the work of authors such as Carroll (1961, 1983) and Oller (1979, 1983a, 1983b). The old beast has still not been put to rest (Oller, 1983a) but is very much alive (Davies, 1990).
Protagonists of the “unitary competence hypothesis” (UCH) – spearheaded by such writers as Oller (1979) and Oller and Kahn (1981) – believed that each of the four language skills manifested a holistic language ability, and that accordingly it was possible to predict overall proficiency from any one of these skills. For example, a high proficiency in writing would indicate proficiency in all the other language skills.
The UCH has a strong form and a weak form (Oller & Kahn, 1981). In the strong form, a single proficiency test could validly measure overall proficiency. In the weak form, a unitary factor accounts for a large proportion of the variance in language tests, but differentiated components also need to be taken into account. Oller (1983a) has since opted for the weak form of the UCH, which adopts an interactionist approach between “global” and discrete components of language. Oller (1983a:36; see also Oller & Kahn, 1981) describes this approach:
…not only is some sort of global factor dependent for its existence on the differentiated components which comprise it, but in their turn, the components are meaningfully differentiated only in relation to the larger purpose(s) to which all of them in some integrated (integrative? – original brackets) fashion contribute.
(See Carroll [1983:82] and Bachman and Palmer [1981:54] for similar views.)
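The weak form can be given a simple psychometric reading: if a battery’s inter-correlations are uniformly high, the first factor extracted from the correlation matrix will account for most, but not all, of the variance, leaving the remainder to differentiated components. The matrix below is invented for illustration, with coefficients in the range reported in the studies cited earlier.

```python
import numpy as np

# Invented inter-correlations among four subtests
# (listening, reading, cloze, essay); illustrative only.
R = np.array([
    [1.00, 0.80, 0.75, 0.78],
    [0.80, 1.00, 0.82, 0.76],
    [0.75, 0.82, 1.00, 0.79],
    [0.78, 0.76, 0.79, 1.00],
])

eigvals = np.linalg.eigvalsh(R)[::-1]   # eigenvalues, largest first
share = eigvals[0] / eigvals.sum()      # variance carried by the first factor
print(f"first factor: {share:.0%} of total variance")
```

On this invented matrix the first factor carries a little over 80% of the variance; what is left over is precisely what the weak form assigns to the differentiated components.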
Physicists are ever searching for a grand unified theory (GUT), or theory of everything (TOE). Many applied linguists, in contrast, especially “real-life” testers, have given up the search for unitary theories and have happily buried the UCH. But the UCH is far from dead, because it is closely related to the problem of “whether skills (production and reception) are equal and how many separate tests are needed to assess proficiency, [and] that is the ‘one best test question’” (Davies, 1990:76; my brackets and italics).
Alderson (1981b:190) suggests that
regardless of the correlations, and quite apart from any consideration of the lack of face validity of the One Best Test, we must give testees a fair chance by giving them a variety of language tests, simply because one might be wrong: there might be no Best Test, or it might not be the one we chose to give, or there might not be one general proficiency factor, there may be several.
It would be very difficult to find the “One Best”, or “perfect”, test. The problem has to do not only with construct validity but also with face validity, because even if one were convinced that one had found the “One Best” test, it would not find general acceptance, since it would probably lack face validity. For example, Henning et al (1981) in their Egyptian study found that the highest correlation with Composition was with Error Identification (.76). They accordingly maintained that “Error Identification may serve as an indirect measure of composition writing ability” (Henning et al, 1981:462). Even though Henning et al (1981:464) established that Reading Comprehension “like Listening Comprehension was of little psychometric value in predicting general proficiency”, they conceded that they had to include reading in order for their battery to “find acceptance” (Henning et al, 1981:464). Accordingly, Henning et al (1981) replaced their Error Identification test with a Reading Comprehension test. Thus, Henning et al had to choose between the psychometric evidence and “acceptance”. They capitulated to the latter, because decisions based on testing need to look right, i.e. they need to have face validity.
4. DISCRETE-POINT AND INTEGRATIVE TESTS
The terms “integrative” and “discrete-point” are rejected by some applied linguists. Alderson (1979) prefers to distinguish between “low order” and “higher order” tests rather than between “discrete-point” and “integrative” tests. Fotos (1991:318) equates “integrative” skills with “advanced skills and global proficiency”, which she contrasts with Alderson’s (1979) “basic skills”, i.e. his “low order” skills.
There are very few tests that do not involve some kind of integrative meaning. Consider the following examples from Rea (1985), Canale and Swain (1980) and Bloor et al (1970:35-40). The first two examples are from Rea (1985:22):
1. How….milk have you got?
(a) a lot (b) much of (c) much (d) many
2. …. to Tanzania in April, but I’m not sure.
(a) I’ll come (b) I’m coming (c) I’m going to come (d) I may come.
Item 1 tests a discrete element of grammar. All that is required is an understanding of the “collocational constraints of well-formedness” (Rea, 1985:22; see also Canale & Swain, 1980:35); i.e. to answer the question it is sufficient to know that “milk” is a mass noun. Item 2 relates form to global meaning, so that all parts of the sentence must be taken into account, which makes it an integrative task. To use Rea’s (1985:22) terminology, item 1 tests “non-communicative performance”, while item 2 tests “communicative performance” (also called “communicative competence” [Canale & Swain, 1980:34]).
Consider Canale and Swain’s (1980:35) examples, which are similar to Rea’s examples above. The first example they regard as a discrete-point item, and the second as an integrative item:
1. Instruction – Select the correct preposition to complete the following sentence:
We went….the store by car. (a) at; (b) on; (c) for; (d) to
2. Instruction – The sentence underlined below may be either grammatically correct or incorrect. If you think it is correct go on to the next item; if you think it is incorrect, correct it by changing, adding or deleting only one element.
We went at the store by car.
The complex instructions of the second item, and the fact that one has to produce the correct answer and not merely select it, make such items more complex (more integrative) than the first item.
Consider the following three items from the mixed grammar test of Bloor et al (1970:35-40):
Item 38. My friend always goes home….foot.
C) on a
In item 38 knowledge of the correct preposition does not depend on global understanding.
Item 50. We….our meat from that shop nowadays.
A) were never buying
B) do never buy
C) never buy
D) never bought
In contrast to item 38, item 50 requires the understanding of more elements in the sentence, but does not require the understanding of all the elements.
Item 30. When the door-bell…., I was having a bath.
Item 30 is more difficult than the others, because it requires not only knowledge of an idiosyncratic past tense formation, but also an understanding of more elements of the sentence (e.g. “when”, “was having”) than in the previous two items. However, one is not required to know the meaning of all the elements in item 30, e.g. “bath”. These examples show that there are few tests that do not involve some degree of integrative meaning (Rea, 1985).
I would like to elaborate on the discrete-point/integrative controversy. It is widely believed that “pragmatic” tests (often mistakenly called “integrative” tests) such as dictation tests and essay tests test the “use” of language (Widdowson, 1979), i.e. authentic communicative language, while “discrete-point” tests such as error recognition tests and grammar accuracy tests test the “usage” of language (Widdowson, 1979), i.e. the elements of language. Such a distinction between the two kinds of tests, which Farhady (1983) describes as the “disjunctive fallacy”, is an oversimplification. Two opposing positions exist with regard to language proficiency tests. The one position maintains that the only authentic tests are “real-life”, or “communicative”, tests, because only such tests are able to measure individual performance (e.g. Morrow, 1979; Harrison, 1983; Lantolf & Frawley, 1988); finding directions or interpreting maps are examples of “real-life” tasks. The contrary position maintains that non-communicative tests, i.e. “discrete-point” tests, can successfully predict performance on “real-life” tasks (e.g. Rea, 1985; Politzer & McGroarty, 1983).
For language “naturalists” the only authentic tests – whether communicative tests or grammar tests – are those presented in a direct real-life situation. Spolsky (1985:33-34) maintains that “authenticity of task is generally submerged by the greater attention given to psychometric criteria of validity and reliability”, where face validity receives no more than “lip service” (see Gamaroff, 1997). For Spolsky and others, e.g. Hughes (1989:15), authenticity is closely related to communicative language, i.e. to direct language. Authentic tests for Spolsky would be “direct” tests in contradistinction to “indirect” tests.
Indirect testing for Hughes (1989:15) includes both “discrete testing”, i.e. “testing item by item”, and “integrative testing”, which combines “a variety of language elements at the same time” (Hughes, 1989:15), e.g. dictation and cloze tests. Hughes (1989:150) believes that the relationship between performance on indirect tests and performance on direct tests is both weak and uncertain. Owing to the lack of clarity on the relationship between a global skill such as composition writing and its components, e.g. vocabulary, punctuation and grammar, Hughes (1989:15) believes that it is best, in terms of one’s present knowledge, to be as comprehensive as possible in the choice of tests, with direct tests favoured. Alderson (1983a) maintains, however, that there is no clarity on what communicative tests measure, and therefore no cogent reason why one should use only direct, i.e. communicative, tests. Alderson (1983a:88) maintains that “‘communicative testers’ only talk about face validity, at the expense of other validities.” Rea (1985) gives the following reasons why indirect tests (e.g. cloze tests) should be used:
- There is no such thing as a pure direct test.
- Direct tests are too expensive and involve too much administration.
- Direct tests only sample a restricted portion of the language, which makes valid inferences difficult.
Even if indirect performance is accepted to be a valid predictor of direct performance, one may still not be comfortable with the idea that direct performance, which one may regard as natural, can be predicted by indirect performance, which one may regard as unnatural, or artificial. There is a misunderstanding here, in that the dichotomy between “natural” and “unnatural” is a spurious one with regard to the laws of learning. As Butzkamm (1992) points out with regard to language teaching approaches, it is incorrect to assume that “natural” approaches (Krashen & Terrell, 1983) and immersion programmes mirror natural language acquisition while the ordinary classroom does not. The playground, the cocktail party and the cooking club are neither more nor less natural than the traditional classroom (Gamaroff, 1996). The laws of learning, and of testing, apply to all contexts, “naturalistic” (Omaggio, 1986:312-313) and otherwise. Granted, the quality of learning depends on the quality of input; but this is as trite as the fact that one would not be able to learn a language without verbal input, or live without food.
All language testing has a certain arbitrary character. To establish consistency, testers need to decide how to control and develop this arbitrariness. The basic problem of communicative/direct tests is that it is difficult to make them real-life and authentic, for the simple reason that they are tests. One can nevertheless have authentic tests, because tests are authentic activity types in their own right (Alderson, 1983a:89). A key issue in language testing should be whether the test is an authentic test, not whether it is “natural”, if by the latter we mean “spontaneous”. Much of education in a tutored context, i.e. much of one’s young life, is “unspontaneous”, but not less natural for being so.
Even when there exists a strong psychometric justification for using indirect tests as predictors of communicative tasks, “communicative” testers will argue that indirect tests are not authentic, because they do not test real life. One might as well argue that an eye test – say, for the driver of a vehicle – is not authentic because it is done at the optometrist instead of on the street. If it could be shown that “discrete-point” tests are valid predictors of direct performance, this would be a good reason for using them. The practical implication of the arguments presented is that, until we know more about testing, it is legitimate to follow the practical route of using “discrete-point” tests to predict “real-life” language.
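The predictive route can be made concrete with ordinary least-squares regression: fit a line to paired indirect and direct scores, then read off a predicted direct score for a new testee. The scores below are invented for illustration and come from no actual study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for predicting
    ys (direct-test scores) from xs (indirect-test scores)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Invented scores for eight learners: a grammar test (indirect)
# and an oral interview (direct), both out of 100.
grammar   = [40, 55, 62, 70, 48, 80, 66, 58]
interview = [45, 58, 66, 72, 50, 85, 70, 60]

a, b = fit_line(grammar, interview)
predicted = a * 75 + b   # predicted interview score for a grammar score of 75
```

The defensibility of such a prediction rests on the size and stability of the correlation between the two measures, which is the empirical question addressed by the studies cited in section 2.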
What makes the interpretation of tests such as essay tests and other “real-life/communicative tasks” (Weir, 1993) so difficult is that their “evidential basis” (Messick, 1988:19) is very subjective. Owing to the subjective nature of “real-life” tests, each protocol is the product of a unique web of meanings – the test-taker’s – entangled in another web of meanings – the rater’s.
All language testing theories are inadequate, owing to the difficulties involved in devising tests that test authentic language reception and production (Oller, 1983a:269). This does not mean that you should stop measuring until you have decided what you are measuring (Spolsky, 1981:46). You do the best you can by taking account of generally accepted views of the nature of language proficiency (Alderson & Clapham, 1992:149), and by disagreeing if you feel sensibly compelled to do so.
Alderson, J.C. (1979). The cloze procedure and proficiency in English as a foreign language. TESOL Quarterly, 13, 219-227.
Alderson, J.C. (1981a). Reaction to the Morrow paper. In J.C. Alderson & A. Hughes (Eds.). Issues in language testing: ELT Documents III. The British Council.
Alderson, J.C. (1981b). Report of the discussion on general language proficiency. In J.C. Alderson & A. Hughes (Eds.). Issues in language testing: ELT Documents III. The British Council.
Alderson, J.C. (1983a). Who needs jam? In A. Hughes & D. Porter. Current developments in language testing. London: Academic Press.
Alderson, J.C. (1983b). The cloze procedure and proficiency in English as a foreign language. In J.W. Oller, Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers. (A republication of Alderson, 1979)
Alderson, J.C. & Clapham, C. (1992). Applied linguistics and language testing: A case study of the ELTS test . Applied Linguistics, 13(2), 149-167.
Bachman, L.F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bailey, C.J. (1976). The state of no-state linguistics. Annual Review of Anthropology, 5, 93-106.
Bloor, M., Bloor, T., Forrest, R., Laird, E. & Relton, H. (1970). Objective tests in English as a foreign language. London: Macmillan.
Butzkamm, W. (1992). Review of H. Hammerly, “Fluency and accuracy: Toward balance in language teaching and learning.” System, 20(4), 545-548.
Canale, M. & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1):1-47.
Carroll, J.B. (1961). Fundamental considerations in testing for English language proficiency of foreign language students. Washington, D.C: Center for Applied Linguistics.
Carroll, J.B. (1983). Psychometric theory and language testing. In J.W. Oller, Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.
Carroll, J.B. (1993). Human cognitive abilities: A survey of factor analytic studies. Cambridge: Cambridge University Press.
Cummins, J. (1979). Linguistic interdependence and the educational development of bilingual children. Review of Educational Research, 49:222-51.
Cummins, J. (1980). The cross-lingual dimensions of language proficiency: Implications for bilingual education and the optimal age issue. TESOL Quarterly, 14(2):175-87.
Cummins, J. (1983). Language proficiency and academic achievement. In J.W. Oller, Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.
Darnell, D.K. (1968). The development of an English language proficiency test of foreign students using a clozentropy procedure: Final Report. Boulder: University of Colorado.
Davies, A. (1990). Principles of language testing. Oxford: Basil Blackwell.
Duran, R.P. (1984). Some implications of communicative competence research for integrative proficiency testing. In C. Rivera, (Ed.). Communicative competence approaches to language proficiency assessment: Research and application. Clevedon, England: Multilingual Matters.
Farhady, H. (1983). The disjunctive fallacy between discrete-point tests and integrative tests. In J.W. Oller, Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.
Fotos, S. (1991). The cloze test as an integrative measure of EFL proficiency: A substitute for essays on college entrance examinations. Language Learning, 41(3), 313-336.
Gamaroff, R. (1996). Is the (unreal) tail wagging the (real) dog? Understanding the construct of language proficiency. Per Linguam, 12(1), 48-58.
Gamaroff, R. (1997). Paradigm lost, paradigm regained: Statistics in language testing. Journal of the South African Association of Language Teaching (SAALT), 31(2), 131-139.
Hale, G.A., Stansfield, C.W. & Duran, R.P. (1984). TESOL Research Report, 16. Princeton, New Jersey: Educational Testing Service.
Halliday, M.A.K. (1975). Learning how to mean. London: Arnold.
Harrison, A. (1983). Communicative testing: Jam tomorrow? In A. Hughes & D. Porter, D. (Eds.). Current developments in language testing. London: Academic Press.
Haussmann, N.C.(1992). The testing of English mother-tongue competence by means of a multiple-choice test: An applied linguistics perspective. Doctoral thesis, Rand Afrikaans University, Johannesburg.
Henning, G.A., Ghawaby, S.M., Saadalla, W.Z., El-Rifai, M.A., Hannallah, R.K. & Mattar, M. S. (1981). Comprehensive assessment of language proficiency and achievement among learners of English as a foreign language. TESOL Quarterly, 15(4), 457-466.
Hinofotis, F.B. (1980). Cloze as an alternative method of ESL placement and proficiency testing. In J.W. Oller, Jr. & K. Perkins. Research in language testing. Rowley, Massachusetts: Newbury House.
Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.
Krashen, S. & Terrell, T. (1983). The natural approach: Language acquisition in the classroom. Hayward, California: Alemany Press.
Lantolf, J.P. & Frawley, W. (1988). Proficiency: Understanding the construct. Studies in Second Language Acquisition (SLLA), 10(2), 181-195.
Messick, S. (1988). Meaning and values in test validation: The science and ethics of measurement. Princeton, New Jersey: Educational Testing Service
Morrow, K. (1979). Communicative language testing: Revolution or evolution. In C.J. Brumfit & K. Johnson (Eds.). The communicative approach to language teaching. London: Oxford University Press.
Oller, J.W., Jr. (1973). Cloze tests of second language proficiency and what they measure. Language Learning, 23(1), 105-118.
Oller, J.W., Jr. (1976). Cloze, discourse, and approximations to English. In K. Burt & H.C. Dulay (Eds.). New directions in second language learning, teaching and bilingual education. Washington, D.C.: TESOL.
Oller, J. W., Jr. (1979). Language tests at school. London: Longman.
Oller, J. W., Jr. (1983a). A consensus for the 80s. In J.W. Oller, Jr. (Ed.). Issues in language testing research. Rowley, Massachusetts: Newbury Publishers.
Oller, J.W., Jr. (1983b). “g”, what is it? In A. Hughes & D. Porter (Eds.). Current developments in language testing. London: Academic Press.
Oller, J.W., Jr. (1995). Adding abstract to formal and content schemata: Results of recent work in Peircean semiotics. Applied Linguistics, 16(3), 274-306.
Oller, J.W., Jr. & Conrad, C. (1971). The cloze technique and ESL proficiency. Language Learning, 21, 183-196.
Oller, J.W., Jr. & Kahn, F. (1981). Is there a global factor of language proficiency? In J.A.S. Read. Directions in language testing. Singapore: Singapore University Press.
Oller, J.W., Jr. & Perkins, K. (1980). Research in language testing. Rowley, Massachusetts: Newbury House.
Omaggio, A.C. (1980). Priorities for classroom testing for the 1980s. In D.L. Lange (Ed.). Proceedings of the national conference on professional priorities. Hastings-on-Hudson, New York: ACTFL.
Omaggio, A.C. (1986). Teaching language in context: Proficiency-orientated instruction. Boston, Massachusetts: Heinle & Heinle.
Politzer, R.L. & McGroarty, M. (1983). A discrete-point test of communicative competence. International Review of Applied Linguistics, 21(3), 179-191.
Popham, W.J. (1981). Modern educational measurement. Englewood Cliffs, New Jersey: Prentice-Hall.
Rea, P. (1985). Language testing and the communicative language teaching curriculum. In Y.P. Lee et al. (Eds.). New directions in language testing. Oxford: Pergamon.
Savignon, S.J. (1983). Communicative competence: Theory and classroom practice. Reading, Mass: Addison-Wesley Publishing Company.
Spolsky, B. (1978). Approaches to language testing. In B. Spolsky (Ed.). Advances in Language Testing Series, 2. Arlington, Virginia: Center for Applied Linguistics.
Spolsky, B. (1981). Some ethical questions about language testing. In Klein-Braley & Stevenson.
Spolsky B. (1985). The limits of authenticity in language testing. Language Testing, 2:31-40.
Stevenson, D.K. (1985). Pop validity and performance testing. In: Y. Lee, A. Fok , R. Lord & G. Low (Eds.). New directions in language testing. Oxford: Pergamon.
Stump, T.A. (1978). Cloze and dictation tasks as predictors of intelligence and achievement scores. In J.W. Oller, Jr. & K. Perkins (Eds.). Language in education: testing the tests. Rowley, Massachusetts: Newbury House.
Weir, C.J. (1993). Understanding and developing language tests. London: Prentice Hall.
Widdowson, H.G. (1979). Explorations in applied linguistics. Oxford: Oxford University Press.