Since its introduction as a measure of readability by Wilson Taylor
(1953), the cloze procedure has been employed as a measure of the
reading ability of native speakers (Bormuth, J., 1967; Crawford, 1970).
Other researchers later investigated the effectiveness of cloze testing
as a measure of ESL/EFL proficiency (Darnell, D., 1968; Brown, 1983,
1988, 1993; Irvine, Atai and Oller, 1974; Oller, J. 1972, 1983, to name
a few). The results have varied widely across studies, and a number of
defects have been found in the procedure. In the light of these
criticisms, Klein-Braley and Raatz proposed a modification, C-testing.
The procedure, developed to answer the psychometric problems of cloze
testing, has been claimed to be an empirically and theoretically valid
measure of language proficiency (Raatz and Klein-Braley, 1981;
Klein-Braley, 1985; Klein-Braley and Raatz, 1984, 1985; Raatz, 1985),
and was later proposed by other researchers as a substitute for cloze
tests (Mc Beath, 1990; Cohen, Segal and Weiss, 1984).
Originally, the C-testing procedure involved making a test from four
or five thematically distinct segments of connected discourse in which
the second half of every second word is deleted (usually 100 words in
all), and the examinee gets credit for exact word restoration. The
use of several different short texts minimizes the effect of text topic
and difficulty. Nevertheless, research had not dealt with the issue of
what kind of text produces higher reliability and validity until
Mochizuki (1995) experimented with four different kinds of texts
(narration, explanation, argumentation and description) for the
construction of C-tests for classroom use. The study revealed that long
passages, especially the Narration texts, were the most appropriate for
making the C-test effective in terms of reliability and concurrent
validity.
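The deletion rule described above can be illustrated with a minimal
Python sketch. It is not the original authors' instrument. It assumes
that the larger part of an odd-length word is deleted, that one-letter
words are counted but left intact, and that words are plain runs of
ASCII letters. The function names, the max_items parameter and the
sample sentence are hypothetical. Sentences that are to remain intact
should simply not be passed to the function.

```python
import re


def mutilate_word(word):
    """Keep the first part of a word and blank out the second half.

    Assumption: for odd-length words the larger part is deleted.
    """
    keep = len(word) // 2
    return word[:keep] + "_" * (len(word) - keep)


def make_c_test(text, max_items=25):
    """Blank the second half of every second word; return (text, key)."""
    pieces, answer_key = [], []
    word_count = 0
    # Split the text into alternating runs of letters and non-letters.
    for match in re.finditer(r"([A-Za-z]+)|([^A-Za-z]+)", text):
        word = match.group(1)
        if word is None:
            pieces.append(match.group(2))  # spaces and punctuation
            continue
        word_count += 1
        # Assumption: one-letter words are counted but never mutilated.
        is_target = word_count % 2 == 0 and len(word) > 1
        if is_target and len(answer_key) < max_items:
            pieces.append(mutilate_word(word))
            answer_key.append(word)
        else:
            pieces.append(word)
    return "".join(pieces), answer_key


sample = "The old keeper opened the gate slowly."  # hypothetical text
mutilated, answer_key = make_c_test(sample)
print(mutilated)   # The o__ keeper ope___ the ga__ slowly.
print(answer_key)  # ['old', 'opened', 'gate']
```

The answer key returned alongside the mutilated text is what exact
word scoring would later be checked against.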
Klein-Braley and Raatz basically utilized teacher judgement or school
grades as a criterion for validating C-tests, while other researchers
have supplied evidence grounded on other kinds of criteria. For
example, Nigishi (1987) reports correlation coefficients of .80 and .76
between C-tests and the reading subtest of ELBA and total ELBA,
respectively, while the studies of Ikeguchi (1994) indicate that C-test
responses correlate most highly with the grammar results of the TOEFL
exams. Still other studies in support of this test procedure include
the validation of C-tests among ESL/EFL learners. For instance, Feldman
and Stemmer (1987) found support for the validity of the C-test through
verbal reports, while Dörnyei and Katona's (1992) study of a C-test
against different language tests, including an oral interview, found
further support for C-testing, reporting that this test procedure gives
a random and representative sample of an original text. This supports
an earlier assertion that the every-other-word deletion in the C-test
produces a large number of 'random samples of the word classes of the
text involved' (Klein-Braley, 1985, 1984). Other recent SLA researchers
have suggested that C-tests may also be useful for L2 vocabulary
research. For instance, Singleton and Little (1991) treated the
responses of L2 learners to C-tests as a source of evidence about
second language lexical development.
On the other hand, criticisms have been levelled against the C-test
procedure, and its critics propose further investigation of the
use of C-tests for second or foreign language learners on the grounds
that this test type (1) cannot fully evaluate the students' ability to
process discourse for general proficiency (Cleary, 1988), (2)
encourages only microlevel instead of macrolevel processing (Cohen,
Segal and Weiss, 1984), (3) has low face validity (Weir, 1988) and (4)
as a test using messages with reduced redundancy, is not necessarily a
test of language competence (Carroll, 1987). Although test items and
task analysis suggest that the C-test may be a measure of grammatical
competence (Klein-Braley, 1985), validity research does not provide
evidence for the specific traits it may measure (Chapelle, C. &
Abraham, R., 1990), and 'assumptions of random sampling of the
basic elements of a text are doubtful' (Jafarpur, A. 1995).
The use of the C-test since its introduction (Klein-Braley and Raatz,
1984), as a means of constructing norm-referenced measures for
proficiency and placement testing and of solving problems with the
cloze procedure, has been extended to loosely defined purposes such as
a 'measure of language creativity' (Carroll, 1987). It has yielded
results that run contrary to researchers' expectations and that lie
outside the purpose for which the test was originally intended.
Furthermore, the empirical evidence in support of C-tests is scanty
(Weir, 1988) and warrants further investigation in the context of
second language instruction, particularly in Japan.
Methodology
Purpose of the study
The objectives of this study are to investigate whether C-tests, using
two procedures of construction, can discriminate levels of language
proficiency among ESL learners in Japan, and to determine whether a
C-test using several passages (C-test 1) is superior to a C-test
constructed from only one long passage, the Narration type (C-test 2),
in terms of reliability and correlation with an external criterion.
Subjects
Two groups of freshman university students in Japan were chosen for
the investigation: one group was composed of 60 undergraduate students
enrolled in the general Freshman English course; the second group was
made up of 30 students in a Freshman English class for returnees. Ss
from the first group were picked at random from an intact class, while
the Ss from the latter group belonged to one English class for
returnees. To qualify to join the returnees class, students must have
stayed abroad for at least one year and have passed the qualifying exam
administered by the university. In terms of proficiency level, the
returnees belong to the Advanced group in Listening and Oral skills,
but can be grouped into the Post-Intermediate level in their Writing
and Reading ability (Tschirner, 1996).
Materials
Two kinds of C-tests were used in this study: one type was constructed
using four short passages from different texts, while the other type
was constructed using only one long Narration text. The use of several
short segments of different texts has been shown, as mentioned in the
research reviewed above, to be satisfactory in terms of reliability
(above .80) and to be empirically valid (at least .50) (Klein-Braley &
Raatz, 1984). For this study, the four short passages were chosen from
different texts within similar readability and interest levels using
the Fry (1985) and Flesch (as described in Klare, 1984) indices. The
readability estimates of the texts from which the segments were chosen
give a level of 6 to 8 by the Fry index and 6.7 to 9.6 by the Flesch
index. These numbers, which appear to be on quite different scales, are
remarkable only in that they indicate variations in the readability
levels of the passages used (Brown, 1993). C-test 1 was constructed
using 25 items from each of the four passages, making a total of 100
items. The first and last sentences of each passage were left intact to
provide a complete context.
The second type of C-test was adopted for use in this study for the
following reason. An investigation was conducted by Mochizuki (1994)
using different types of discourse (description, exposition, narration
and argumentation) to construct C-tests for classroom use. Among these
four types of texts, the Narration type was found to be the most
reliable (.92). The Narration text "The Lock Keeper", consisting of 120
items, was found to be the most reliable and to have the highest
concurrent validity (Mochizuki, 1995).
This study is an attempt to investigate which of these two types of
C-test construction yields higher reliability and concurrent
validity. The external criterion used was the STEP exam, which
consists of 66 written test questions on vocabulary, grammar and
reading comprehension. Previous investigations have established that
the STEP yields high reliability as well as high coefficients as an
external validating criterion with Japanese
university students (Kimura, 1995). In a previous study using the STEP
and CELT results to investigate the external validity of C-tests
constructed from different types of discourse, STEP was found to have
a higher reliability (.778) than the CELT (.638), and other C-tests
(Mochizuki, 1994).
Procedures
Each student from the two groups of Ss took the two versions of the
C-test and the STEP. To control for a potential order effect, the
order of administering the C-tests and the STEP was counterbalanced:
half the subjects in the non-returnees group and in the returnees group
took the two C-tests first, and the STEP during the English class the
following week. The other half of each group took the STEP first,
and then the two C-tests.
Analyses
The Ss' responses for both C-test 1 and C-test 2 were scored
for exact replacements. Descriptive statistics for the scores of the
C-tests were obtained. Reliability coefficients were obtained by the KR
20 method. The use of KR 20 has been questioned in the past. For
instance, Farhady (1983) and Bachman (1990) claim that internal
consistency reliability coefficients are inappropriate for cloze and
C-tests because of the interdependence of items. On the other hand,
Woods (1984), Henning (1987) and Jafarpur (1995) claimed that the KR
20 method yields the same results as Cronbach's alpha.
In addition, Brown (1983) provided evidence that the differences between
reliability coefficients from KR 20 and Cronbach's alpha are negligible.
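As an illustration of exact-replacement scoring and of the KR 20
computation, a minimal Python sketch follows. It is not the scoring
program used in the study. It assumes that a response counts as exact
when it matches the key after trimming spaces and ignoring case, and
it uses population variances throughout. The function names and the
small score matrix are hypothetical.

```python
from statistics import pvariance


def score_exact(responses, key):
    """Score 1 for an exact restoration of the deleted word, else 0.

    Assumption: surrounding spaces and letter case are ignored.
    """
    return [int(r.strip().lower() == k.lower())
            for r, k in zip(responses, key)]


def kr20(matrix):
    """KR 20 for a matrix of 0/1 scores (rows = examinees)."""
    k = len(matrix[0])
    n = len(matrix)
    total_var = pvariance([sum(row) for row in matrix])
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n  # proportion correct
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_var)


def cronbach_alpha(matrix):
    """Cronbach's alpha computed from item variances."""
    k = len(matrix[0])
    total_var = pvariance([sum(row) for row in matrix])
    item_vars = [pvariance([row[j] for row in matrix]) for j in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical scores: four examinees, three items.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(scores))            # 0.75
print(cronbach_alpha(scores))  # 0.75
```

For items scored 0 or 1, the population variance of an item equals
p(1 - p), so the two coefficients coincide in this sketch. Both
functions fail with a division by zero if every examinee has the same
total score.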
To address the first research question, that of determining the
discriminative power of the C-tests, the subjects' scores were
compared across groups using group t-tests. The subjects' mean scores
within each group for each test were obtained and subjected to an
analysis of variance and t-tests.
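A group comparison of this kind might be sketched as follows. The
source does not state which form of the t-test was used. The sketch
assumes Welch's unequal-variance version, chosen here only because the
two groups differ in size, and it returns the t statistic and degrees
of freedom without a p-value. The function name and the scores are
hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and degrees of freedom for two samples."""
    # Squared standard error of each group mean (sample variance / n).
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical C-test scores for two small groups.
returnees = [2, 4, 6, 8]
non_returnees = [1, 2, 3, 4]
t, df = welch_t(returnees, non_returnees)
print(round(t, 3), round(df, 2))  # 1.732 4.41
```

A positive t here means the first group has the higher mean. The
significance of t would be read from a t distribution with the
returned degrees of freedom.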
For the second research objective, which is to determine the
reliability of the C-tests and their correlation with the STEP, Pearson
product-moment correlation coefficients were computed.
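The Pearson product-moment coefficient can be computed as in the
following minimal Python sketch, which is offered only as an
illustration of the formula. The function name and the paired scores
are hypothetical and are not data from this study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between paired score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired scores for five examinees.
c_test_scores = [1, 2, 3, 4, 5]
criterion_scores = [2, 1, 4, 3, 5]
print(round(pearson_r(c_test_scores, criterion_scores), 3))  # 0.8
```

Each position in the two lists must refer to the same examinee, and
the function is undefined when either list has no variation.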