JALT Testing & Evaluation SIG Newsletter
Vol. 2 No. 1 Oct. 1998. (p. 3 - 9) [ISSN 1881-5537]


Do different C-tests discriminate proficiency levels of EL2 learners?

Cecilia B. Ikeguchi, Ph.D. Tsukuba Women's University



Since its introduction by Wilson Taylor (1953) as a measure of readability, the cloze procedure has been employed as a measure of the reading ability of native speakers (Bormuth, J., 1967; Crawford, 1970). Other researchers later investigated the effectiveness of cloze testing as a measure of ESL/EFL proficiency (Darnell, D., 1968; Brown, 1983, 1988, 1993; Irvine, Atai and Oller, 1974; Oller, J., 1972, 1983, to name a few). The results have varied widely across studies, and a number of defects have been found with the procedure. In the light of these criticisms, Klein-Braley and Raatz proposed a modification, C-testing. The procedure, developed to answer the psychometric problems of cloze testing, has been claimed to be an empirically and theoretically valid measure of language proficiency (Raatz and Klein-Braley, 1981; Klein-Braley, 1985; Klein-Braley and Raatz, 1984, 1985; Raatz, 1985), and was later proposed by other researchers as a substitute for cloze tests (Mc Beath, 1990; Cohen, Segal and Weiss, 1984).

In its original form, the C-testing procedure involves making a test from four or five thematically distinct segments of connected discourse in which the second half of every second word is deleted (usually 100 words in all), and the examinee gets credit for exact word restoration. The use of several different short texts minimizes the effect of text topic and difficulty. Nevertheless, research had not dealt with the issue of what kind of text produces higher reliability and validity until Mochizuki (1995) experimented with different kinds of texts (narration, explanation, argumentation and description) for the construction of C-tests for classroom use. That study revealed that long passages, especially Narration texts, were the most appropriate for making the C-test effective in terms of reliability and concurrent validity.
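
The deletion rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the construction procedure used in the studies cited: the function names (make_c_test, score_exact), the tokenisation, the choice to keep the shorter half of odd-length words, and the decision to leave one-letter words intact are all hypothetical.

```python
import re

# A "word" is a run of ASCII letters, optionally with one internal apostrophe
# (assumption made for this sketch only).
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
TOKEN = re.compile(WORD.pattern + r"|[^A-Za-z]+")


def make_c_test(text, max_items=None):
    """Delete the second half of every second word in `text`.

    Returns (mutilated_text, answer_key). For a word with an odd number
    of letters, the shorter half is kept (assumption). One-letter words
    are counted but left intact (assumption).
    """
    out, answers = [], []
    word_count = 0
    for tok in TOKEN.findall(text):
        if not WORD.fullmatch(tok):
            out.append(tok)  # spaces and punctuation pass through unchanged
            continue
        word_count += 1
        room_left = max_items is None or len(answers) < max_items
        if word_count % 2 == 0 and len(tok) > 1 and room_left:
            keep = len(tok) // 2
            out.append(tok[:keep] + "_" * (len(tok) - keep))
            answers.append(tok)
        else:
            out.append(tok)
    return "".join(out), answers


def score_exact(answers, responses):
    """Count responses that restore the deleted word exactly (case-sensitive)."""
    return sum(1 for a, r in zip(answers, responses) if a == r)


passage = "The quick brown fox jumps over the lazy dog."
mutilated, key = make_c_test(passage)
print(mutilated)  # The qu___ brown f__ jumps ov__ the la__ dog.
print(score_exact(key, ["quick", "fox", "other", "lazy"]))  # 3
```

A full test in the original format would apply such a function to four or five short passages and stop once roughly 100 items had been produced; the max_items parameter is one hypothetical way to cap the number of items per passage.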

Klein-Braley and Raatz basically utilized teacher judgement or school grades as a criterion for validating C-tests, while other researchers have supplied evidence grounded on other kinds of criteria. For example, Nigishi (1987) reports correlation coefficients of .80 and .76 between C-tests and the reading subtest of ELBA and total ELBA, respectively, while Ikeguchi (1994) found that C-test responses correlated most highly with the grammar results of the TOEFL. Still other studies in support of this test procedure include validations of C-tests among ESL/EFL learners. For instance, Feldman and Stemmer (1987) validated the C-test through verbal reports, while Dörnyei and Katona's (1992) study of a C-test against different language tests, including an oral interview, found further support for C-testing, reporting that this test procedure gives a random and representative sample of an original text. This supports an earlier assertion that the every-other-word deletion in the C-test produces a large number of 'random samples of the word classes of the text involved' (Klein-Braley, 1984, 1985). Other recent SLA researchers have suggested that C-tests may also be useful for L2 vocabulary research. For instance, Singleton and Little (1991) treated the responses of L2 learners to C-tests as a source of evidence about second language lexical development.

[ p. 3 ]


On the other hand, criticisms have been levelled against the C-test procedure, and the critics propose further investigation of the use of C-tests for second or foreign language learners on the grounds that this test type (1) cannot fully evaluate the students' ability to process discourse for general proficiency (Cleary, 1988), (2) encourages only microlevel instead of macrolevel processing (Cohen, Segal and Weiss, 1984), (3) has low face validity (Weir, 1988) and (4) as a test using messages with reduced redundancy, is not necessarily a test of language competence (Carroll, 1987). Although analysis of test items and tasks suggests that the C-test may be a measure of grammatical competence (Klein-Braley, 1985), validity research does not provide evidence for the specific traits it may measure (Chapelle, C. & Abraham, R., 1990), and 'assumptions of random sampling of the basic elements of a text are doubtful' (Jafarpur, A., 1995).

Since its introduction (Klein-Braley and Raatz, 1984) as a means of constructing norm-referenced measures for proficiency and placement testing, and of solving problems with the cloze procedure, the C-test has been extended to loosely defined uses for which it was not originally intended, such as a 'measure of language creativity' (Carroll, 1987), and has yielded results contrary to researchers' expectations. Furthermore, the empirical evidence in support of C-tests is scanty (Weir, 1988) and warrants further investigation in the context of second language instruction, particularly in Japan.

Methodology

Purpose of the study

The objectives of this study are to investigate whether C-tests, using two procedures of construction, can discriminate levels of language proficiency among ESL learners in Japan, and to determine whether a C-test using several passages (C-test 1) is superior to a C-test constructed from only one long passage of the Narration type (C-test 2) in terms of reliability and correlation with an external criterion.

Subjects

Two groups of freshman university students in Japan were chosen for the investigation: one group was composed of 60 undergraduate students enrolled in the general Freshman English course, and the second group was made up of 30 students in a Freshman English course for returnees. Ss in the first group were picked at random from an intact class, while the Ss in the second group belonged to one English class for returnees. To qualify for the returnees class, students must have stayed abroad for at least one year and have passed the qualifying exam administered by the university. In terms of proficiency level, the returnees belong to the Advanced group in Listening and Oral skills, but can be grouped into the Post-Intermediate level in their Writing and Reading ability (Tschirner, 1996).

Materials

Two kinds of C-tests were used in this study: one type was constructed using four short passages from different texts, while the other type was constructed using only one long Narration text. The use of several short segments of different texts has been shown, in the research cited above, to be satisfactory in terms of reliability (above .80) and empirical validity (at least .50) (Klein-Braley & Raatz, 1984). For this study, the four short passages were chosen from different texts within similar readability and interest levels using the Fry (1985) and Flesch (as described in Klare, 1984) indices. The readability estimates of the texts from which the segments were chosen give a level of 6 to 8 by the Fry index and 6.7 to 9.6 by the Flesch index. These numbers, which appear to be on quite different scales, are notable only in that they indicate variation in the readability levels of the passages used (Brown, 1993). C-test 1 was constructed using 25 items from each of the four passages, making a total of 100 items. The first and last sentences of each passage were left intact to provide a complete context.

[ p. 4 ]


The second type of C-test was adopted for use in this study for the following reason. An investigation was conducted by Mochizuki (1994) using different types of discourse (description, exposition, narration and argumentation) to construct C-tests for classroom use. Among these four types of texts, the Narration type was found to be the most reliable (.92). Specifically, the Narration text "The Lock Keeper", consisting of 120 items, was found to be the most reliable and to have the highest concurrent validity (Mochizuki, 1995).

This study is an attempt to investigate which of these two types of C-test construction will yield higher reliability and concurrent validity. The external criterion used was the STEP exam, which consists of 66 written test questions on vocabulary, grammar and reading comprehension. Previous investigations have established that STEP is highly reliable and yields high coefficients as an external validating criterion with Japanese university students (Kimura, 1995). In a previous study using the STEP and CELT results to investigate the external validity of C-tests constructed from different types of discourse, STEP was found to have a higher reliability (.778) than the CELT (.638), and other C-tests (Mochizuki, 1994).

Procedures

Each student in the two groups of Ss took both versions of the C-test and the STEP. To control for a potential order effect, the order of administering the C-tests and the STEP was counterbalanced: half the subjects in the non-returnees group and in the returnees group took the two C-tests first, and the STEP during the English class the following week. The other half of each group took the STEP first, and then the C-tests.

Analyses

The Ss' responses for both C-test 1 and C-test 2 were scored for exact replacements. Descriptive statistics for the scores of the C-tests were obtained. Reliability coefficients were obtained by the KR 20 method. The use of KR 20 has been questioned in the past. For instance, Farhady (1983) and Bachman (1990) claim that internal consistency reliability coefficients are inappropriate for cloze and C-tests because of the interdependence of items. On the other hand, Woods (1984), Henning (1987) and Jafarpur (1995) claimed that the KR 20 method yields the same results as Cronbach's alpha. In addition, Brown (1983) provided evidence that the differences between reliability coefficients from KR 20 and Cronbach's alpha are negligible.
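
As a rough illustration of the reliability estimate mentioned above, the following Python sketch computes KR 20 and Cronbach's alpha from a matrix of dichotomously scored items (1 = exact replacement, 0 = otherwise). The function names and the small response matrix are hypothetical and are not data from this study. The sketch assumes population variances (dividing by N) for both coefficients; under that assumption the two coefficients coincide for 0/1 items, since p(1 - p) is the population variance of a dichotomous item.

```python
from statistics import pvariance


def kr20(matrix):
    """KR 20 for a list of examinee rows, each a list of 0/1 item scores."""
    n = len(matrix)     # number of examinees
    k = len(matrix[0])  # number of items
    var_total = pvariance([sum(row) for row in matrix])
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n  # proportion correct on item j
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / var_total)


def cronbach_alpha(matrix):
    """Cronbach's alpha using population variances of items and totals."""
    k = len(matrix[0])
    var_total = pvariance([sum(row) for row in matrix])
    sum_item_var = sum(pvariance([row[j] for row in matrix]) for j in range(k))
    return k / (k - 1) * (1 - sum_item_var / var_total)


# Hypothetical responses: 5 examinees x 4 items (not data from this study).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 3))
print(round(cronbach_alpha(responses), 3))
```

The sketch treats items as independent observations, which is exactly the assumption that Farhady (1983) and Bachman (1990) question for cloze and C-tests, so it shows the arithmetic only and does not settle that debate.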

To address the first research question, that of determining the discriminative power of the C-tests, the subjects' scores were compared across groups. The subjects' mean scores within each group for each test were obtained and subjected to an analysis of variance and t-tests.
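
A group comparison of this kind can be sketched as below. The text does not specify which form of t-test was used, so this example assumes an independent-samples t-test with pooled variance; the function name pooled_t and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Independent-samples t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes equal population variances
    in the two groups (assumption of this sketch).
    """
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    # Pooled estimate of the common variance (sample variances, n - 1).
    sp2 = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    se = sqrt(sp2 * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / se, df


# Hypothetical C-test scores for two small groups (not data from this study).
returnees = [2, 4, 6]
non_returnees = [1, 2, 3]
t, df = pooled_t(returnees, non_returnees)
print(round(t, 3), df)
```

The resulting t value would then be compared against the critical value for the given degrees of freedom; with groups of unequal size and possibly unequal variance, a Welch-type test might be a reasonable alternative.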

For the second research objective, which is to determine the reliability of the C-tests and their correlation with the STEP, Pearson product-moment correlation coefficients were computed.
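
For reference, the Pearson product-moment coefficient between two sets of paired scores can be computed as in the following sketch. The function name pearson_r and the paired scores are hypothetical and serve only to show the calculation.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between paired score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation sum
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired scores (not data from this study).
c_test_scores = [1, 2, 3]
criterion_scores = [1, 3, 2]
print(pearson_r(c_test_scores, criterion_scores))  # 0.5
```

In a concurrent validity check of this kind, one list would hold each examinee's C-test score and the other the same examinee's score on the external criterion.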

- continued -

www.jalt.org/test/ike_1.htm

[ p. 5 ]