Testing listening: might there be a “third way”?
Jayanti Banerjee
Natalie Chen
Spiros Papageorgiou
English Language Institute, University of Michigan, Ann Arbor
A recurring debate in the design of listening tests is whether the
listening input should be played once or twice. Persuasive construct
arguments can be made on both sides of the question (Buck, 2009) and
research to date has not helped to break the theoretical impasse
(e.g., Fortune, 2004; Pranculiene, 2004). Professional language
testing organizations have come down on one side or the other,
choosing to play recordings once or twice. The latter approach is
particularly popular with stakeholders.
However, playing a listening text twice does not resolve the matter.
In particular, it fails to address a key point of contention, which
is that when asked to repeat what they have said, speakers virtually
always reformulate. Even when the same words are repeated, the
intonation is invariably different. In fact, listeners are unlikely
to hear exactly the same text twice.
In this paper we will explore ways in which listening tests can
capture this real-world feature of aural input while also playing
the recording twice (as preferred by test-takers). We will
demonstrate our attempts to develop input texts that come in two
complementary parts so that when the input is played for the second
time, it is a reformulation of the first text. We will talk about
the issues arising for the development of such test items and
initiate a discussion about the construct validation of such an
approach to testing listening.
The Impact of the ISTEP+ on a School Corporation
with a Large ELL Population
April Burke
Luciana C. de Oliveira
Purdue University, West Lafayette, IN
English language learners (ELLs) consistently receive lower scores
than non-ELLs on the Indiana Statewide Testing for Educational
Progress Plus (ISTEP+). For example, only 52% of ELLs passed the
Language Arts section of the 2007-2008 ISTEP+, compared with 78% of
non-ELLs. The performance of a school’s ELL
population can determine whether or not the school is deemed
“failing” and placed on “improvement status.” Despite the fact that
many schools face these possible consequences, few studies have
focused on the ramifications of using the ISTEP+ in schools with
large ELL populations. In response to this deficit, this study
investigates the impact of the ISTEP+ on a school corporation in
which ELLs make up over 26% of the student population. Through
interviews with three administrators, who were formerly teachers in
the corporation, this study provides their perspective on the impact
of the ISTEP+ on their corporation’s programming, funding, classroom
instruction, staff, and ELL students. The study also includes a
quantitative component which begins with a discussion of the
corporation’s overall subgroup performance on the ISTEP+, followed
by an analysis of individual free and reduced lunch ELL and non-ELL
ISTEP+ test scores.
Language Profiles of the Cutoff Borderline Cases for
OEPT Rating
Haiying Cao
Purdue University, West Lafayette, IN
The OEPT is a test used at Purdue University to certify the English
proficiency of international graduate teaching assistants. An
examinee who receives a score of 5 is certified as proficient in
oral English, while a score of 4 means the examinee must take at
least one semester of coursework in oral English proficiency.
Disagreement at the cutoff (the borderline cases between scores 4
and 5) has remained a consistent challenge for rater training, and
raters’ decisions seem quite random.
To address this issue, in addition to improving the level
descriptors, individual thick-description language profiles were
created for five examinees on whose cutoff placement raters
disagreed in the rescaling project. This was an attempt to better
understand the nature of cutoff borderline cases. Two items
selected from each of the five tests were transcribed, coded, and
analyzed in detail. A list of common features derived from the
analysis placed four examinees among the lower borderline cases and
one in Level 4, one category below the borderline cases. These
examinees showed mastery of pronunciation, idiomaticity, and
grammar, whereas their performance in syntax, content, and
organization was insufficient for the tasks on the OEPT.
Possible Difficulty Factors Affecting Vocabulary
Sentence Completion Items
Yen-Tzu Chang
Georgetown University, Washington, DC
Previous examinations of the relationship between test item
characteristics and item difficulty have generally focused on cloze
items. As an early attempt to help address the lack of sentence
completion research, this exploratory study presents preliminary
findings from a work-in-progress regarding possible factors that
affect vocabulary sentence completion item difficulty.
This study examined the effects of multiple-choice sentence
completion item characteristics on item difficulty. Forty sentence
completion items were selected from national college entrance exams
in Taiwan. Item difficulty was calculated by applying the item
response theory (IRT) model to 10,000 randomly selected samples of
the exam items in question. Each item was analyzed for several
characteristics: (a) sentence length, (b) word frequency, (c)
polysemy (operationalized by WordNet sense ranking), (d) collocation
(measured by mutual information value), (e) type of sentence
(definitional, contrasting, causal, no relationship), and (f) parts
of speech. Correlation, regression, and ANOVA were employed to
analyze the data. The results indicated non-significant
relationships between these factors and item difficulty with an
observed power of .25. One explanation is that vocabulary sentence
completion item difficulty is affected by factors not included in
this study, but it is also possible that the factors considered were
not measured appropriately.
There is no doubt that the study’s small sample size weakens the
statistical power. Further research will be conducted with more
items and will include different ways of operationalizing these
factors in the hope that the information can eventually be used to
inform development of vocabulary sentence completion item writing
guidelines.
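The collocation measure named in this abstract, mutual information, can be illustrated with a minimal sketch. The abstract does not say how the value was computed, so the version below assumes the common pointwise formulation, log2 of P(w1, w2) / (P(w1) P(w2)), estimated from adjacent word pairs in a tokenized corpus. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def mutual_information(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2).

    Assumption: probabilities are simple relative frequencies, and a
    collocation means w2 immediately follows w1. Returns None when the
    pair never occurs, because log2(0) is undefined.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(w1, w2)]
    if pair_count == 0:
        return None
    p_pair = pair_count / (len(tokens) - 1)  # number of adjacent pairs
    p_w1 = unigrams[w1] / len(tokens)
    p_w2 = unigrams[w2] / len(tokens)
    return math.log2(p_pair / (p_w1 * p_w2))


# Toy corpus, invented purely for illustration.
corpus = "strong tea and strong coffee and weak tea".split()
mi_strong_tea = mutual_information(corpus, "strong", "tea")
mi_weak_coffee = mutual_information(corpus, "weak", "coffee")
```

A higher value indicates that the two words co-occur more often than their individual frequencies would predict. In practice the estimate is only meaningful on a large reference corpus.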
Examining Task
Variability on a Computer-based Oral English Proficiency Test
Lixia Cheng, Purdue University, West Lafayette, IN
The current popularity of performance testing and task-based
language assessment has triggered a great interest in researching
task difficulty and variability in language performance assessment.
The present study will determine whether the same examinees’
responses would have significantly different fluency measures across
two tasks (Newspaper Headline vs. Compare and Contrast) on a
computer-based semi-direct oral English proficiency test.
Transcriptions of task responses by two low-intermediate proficiency
groups of 25 Chinese ESL learners each are analyzed in terms of
temporal measures and lexical variables. Results of statistical
tests suggest that task has no main effect on temporal measures of
fluency such as speech rate and mean syllables per run, and that
the interaction of task with examinee proficiency is not
significant for these temporal variables either. T-tests comparing
the means of lexical variables (number of tokens, number of word
types, type-token ratio, and lexical density) across the two task
types indicate no significant difference between the two tasks in
type-token ratio or lexical density. However, the total number of
tokens and the total number of types do differ significantly
between the two groups. An in-depth discourse analysis is needed
before any claims can be made about the possible transfer of
lexical items from the text prompt of one of the tasks to
examinees’ responses.
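The four lexical variables listed above can be sketched as follows. This is an illustration only: the abstract does not describe its tokenization or how content words were identified, so the regular expression and the tiny function-word list here are assumptions, and a real analysis would more likely use a part-of-speech tagger.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on",
    "is", "are", "was", "were", "it", "this", "that", "i", "you",
}


def lexical_variables(transcript):
    """Return tokens, types, type-token ratio, and lexical density."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    types = set(tokens)
    content_words = [t for t in tokens if t not in FUNCTION_WORDS]
    n_tokens = len(tokens)
    return {
        "tokens": n_tokens,
        "types": len(types),
        # Type-token ratio: distinct word forms per running word.
        "type_token_ratio": len(types) / n_tokens if n_tokens else 0.0,
        # Lexical density: share of running words that are content words.
        "lexical_density": len(content_words) / n_tokens if n_tokens else 0.0,
    }


result = lexical_variables("The cat saw the dog")
```

Because type-token ratio tends to fall as responses get longer, comparisons across tasks that elicit different amounts of speech should be read with that in mind.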
The Impact of World Englishes on Language Assessment
Huei-Lien Hsu, University of Illinois at Urbana-Champaign
Although interest in World Englishes has generated a significant
amount of in-depth research on issues and concerns pertaining to
the transformation of global English use, the extent to which
the rich discussions of World Englishes affect theory and
practice in language testing remains to be examined. This
presentation will first review three major issues within World
Englishes that may influence test construction and the theory of
language testing: the controversy over the definition of native
speakers, the intelligibility of varietal speakers, and a call for
a reconceptualization of EFL teaching and learning. These issues
indicate that the emerging linguistic richness poses an immediate
challenge to raters of speaking and writing tests, which is a
potential threat to test validity. Raters are challenged to
differentiate a test-taker who makes linguistic errors from a
proficient learner who presents local linguistic variations.
Guided by the third methodological paradigm, the Mixed Methods
approach, this study explores raters’ perceptions of World
Englishes, the factors that affect the extent to which they
tolerate varieties of English, and how their perceptions of these
varieties affect their scoring judgments. Raters in this study are
currently rating an oral test developed by a large Midwestern state
university and a language test provider. Insights gained from this
study will also help determine what components may be needed to
facilitate rater training.
The results of this study are of particular importance for three
main reasons. First, this study aims to increase language testers’
awareness of how rich linguistic variations impact language testing.
Secondly, this study indicates how language testing and the
framework of World Englishes may further benefit from each other in
the 21st Century. Lastly, the Mixed Methods approach
adopted in this study illustrates ways of mixing different methods
at different stages of the study in an effort to generate a broader
and deeper understanding of issues investigated as well as to honor
the complexity and contingency of human phenomena.
A Simplified Scoring Rubric for English-Chinese
Translation in Large-scale Tests
Jinlin Jiang, University of Illinois at Urbana-Champaign
Translation items are widely adopted in China’s large-scale English
tests, and translation scoring is notoriously time-consuming due to
the large number of examinees. This study investigates the
development of a simplified scoring rubric for English-Chinese
translation in large-scale tests. Two analytic scoring rubrics are
constructed and compared: a thorough one and a simplified one.
Thorough scoring demands detailed scoring of “form” and “meaning”:
the “form” of a translation is scored according to the
grammaticality, idiomaticity, and stylistic closeness of each
sentence, and the “meaning” is scored phrase by phrase according to
its faithfulness to the source text. Simplified scoring requires
scoring only the testing points with a high degree of difference.
Translations of a 15-sentence English text by 310 Chinese
university students are scored by three raters using the two
rubrics. The two types of scoring are compared for the whole-text
translation and for the 15 single-sentence translations.
Correlation analyses and reliability statistics indicate that
simplified scoring of both the text translation and the sentence
translations is highly correlated and consistent with the
corresponding thorough scoring.
Therefore, an argument can be made that simplified scoring performs
the same role as the more thorough scoring, and perhaps certain
efficiencies can be introduced into the current scoring process.
Evidence-Based Development of an Assessment of
Interactive Conversation Skills
Nancy Kauper, Purdue University, West Lafayette, IN
Purdue University’s Oral English Proficiency Program (OEPP) trains
and certifies international graduate students for English proficiency.
We have traditionally assessed oral presentation skills in formal
classroom contexts, but found that assessment of face-to-face
interaction and conversation skills was lacking. Developing an
assessment of conversation skills has allowed us to focus on
students’ abilities to interact and communicate informally
one-on-one.
In 2007-2008, OEPP students were assessed using a working model of the
Interactive Conversation (IC) assessment, which included evaluation
criteria in three general areas: interactive understanding, active
listening, and conversation management. Prior to assessment,
students were provided with evaluation criteria, guidelines and
general strategies for successful conversations, and opportunities
to practice conversing with classmates and instructors.
Digital video recordings of those IC assessments were made. A set of
recordings showing a wide range of variability was selected and the
conversations transcribed. Transcriptions were analyzed to see
whether and how students met the evaluation criteria, what specific
strategies and language they used to meet the criteria, and whether
there were gaps or superfluities in the criteria.
Information gained from this analysis was then used to revise the evaluation
criteria, to provide students and instructors with examples of
successful conversation strategies and language, and to create new
scales and rubrics in order to better reflect the range of
variability in student assessment performances.
Excerpts from transcripts will be shown and discussed, along with comparisons
of the working and present models of the IC assessment criteria,
scale and rubrics, and evaluation guidelines for raters.
A new approach to predict ESL learner’s communicative
competence: mimicry and free speech
Young-Mi Kim, University of Illinois at Urbana-Champaign
The objective of this research is to compare English as a Second
Language (ESL) learners’ ability to mimic native English speakers’
utterances with an independent measure of their communicative
competence. Recent research in Second Language Acquisition (SLA)
has investigated the possibility of measuring ESL learners’ overall
communicative competence accurately, but researchers have rarely
focused on ESL learners’ mimicking ability in relation to their
language competence. The validity and reliability of language tests
themselves remain controversial issues, especially for performance
tests such as speaking and writing tests. More specifically,
raters’ subjective judgment in scoring spoken and written samples
and their interpretation of complicated holistic and analytic
rating criteria have been considered unavoidable problems in
implementing speaking and writing tests. By comparing scores for
the mimicked utterances and the free speech of the same test
takers, my research attempts to challenge such complacency by
searching for a meaningful relationship between ESL learners’
ability to mimic native English speakers’ utterances and their
communicative competence. In doing so, it explores the possibility
of adopting this measure of mimicry as an aptitude test in order to
predict ESL learners’ overall language competence more objectively.
Assessing EFL Integrated Writing and Related Factors:
Test Task Development
Yu-Chen Tina Lin, Indiana University – Bloomington
In order to assess EFL learners’ integrated writing skills and
explore potential factors influencing the quality of their
integrated writing performance, this preliminary study designs two
reading tasks, one independent writing task, and one integrated
writing task. An on-line testing interface is developed for
conducting procedures of all tasks. In the main integrated writing
task, participants need to write English summaries for two reading
passages with different difficulty levels. Cloze tests and
multiple-choice questions are designed for these two passages to
measure reading comprehension. One English independent essay writing
task measures participants’ L2 writing ability. The on-line testing
interface, which includes test designer, test taker, and monitoring
pages, allows individuals to take the tasks in different orders so
that measurement error resulting from practice effects can be
controlled. Multiple regression analysis will be used to estimate
the weight of each related independent variable in predicting the
dependent variable, the summary score. The reliability of scoring
the writing tasks is examined by using rubrics from previous
research and estimating the degree of rater agreement.
After the on-line test was administered to four participants, a group
interview was conducted immediately. Participants also recalled and
wrote down how they finished the integrated writing task within one
to three days after the test. From participants’ test performances,
their oral and written feedback, and raters’ scoring experiences,
this study proposes suggestions/modifications for designing valid
test tasks for assessing integrated writing skills and for
increasing the reliability of scoring rubrics and the degree of
rater agreement.
Exploring the Test Developers’ Role in Test
Score Equating Procedures
Spiros Papageorgiou, Jayanti Banerjee, English Language Institute,
University of Michigan, Ann Arbor
In order to provide comparable scores across different test forms,
examination providers need to apply a process known as 'equating'
(Kolen & Brennan, 2004). Among the types of linking proposed by
Mislevy (1992) and Linn (1993), equating is the
strongest one because it is symmetric, that is, equated results can
be used interchangeably. Even though test developers usually rely on
the expertise of psychometricians when it comes to application of
equating procedures, there are a number of issues that have to be
addressed as part of on-going test development operations. This
presentation aims to address these issues by presenting the
experience from the development of an international ESL examination
offered by a Midwestern University.
In particular, the presentation will address the following:
- Selection of type of equating (e.g. common-item equating v.
  common-person equating)
- Selection of common items for equating
- Position of common items in test forms
- Analysis of item information for establishing equated results.
It will be argued that, despite the psychometric expertise that
equating procedures require, a number of equating issues rest with
test developers, as they involve content and test design criteria as
much as psychometric criteria. The discussion of these criteria will
be of use to researchers involved in test development projects and
developers of language tests.
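To make the idea of equating concrete, here is a minimal sketch of linear equating. It is not the procedure used for the examination described in this abstract, which is not specified. It assumes the simplest case, in which two forms are taken by equivalent groups, so a Form X score is mapped to the Form Y scale by matching means and standard deviations. The common-item designs named in the abstract require additional steps that are not shown. The function name and score lists are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, form_x_scores, form_y_scores):
    """Map score x from Form X onto the Form Y scale.

    Assumption: both forms were taken by equivalent groups, so the two
    score distributions differ only because of form difficulty.
    Formula: y = mu_Y + (sigma_Y / sigma_X) * (x - mu_X)
    """
    mu_x, mu_y = mean(form_x_scores), mean(form_y_scores)
    sd_x, sd_y = pstdev(form_x_scores), pstdev(form_y_scores)
    return mu_y + (sd_y / sd_x) * (x - mu_x)


# Invented scores, for illustration only.
form_x = [10, 20, 30]
form_y = [20, 40, 60]
equated = linear_equate(25, form_x, form_y)
```

The content decisions listed above, such as which common items to select and where to place them, determine whether statistics like these can be trusted in the first place.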
Temporal measures of fluency: automatic and manual
extraction of temporal variables
Soohwan Park, Purdue University, West Lafayette, IN
Among several components of oral proficiency of foreign language
learners, fluency can be defined as ‘speed and smoothness of oral
delivery’ (Lennon 1990). We can measure fluency by calculating
temporal variables such as speech rate, articulation rate, mean
silent pause time and mean syllables per run (Riggenbach 1991,
Kormos and Denes 2004). Studying the relationship between temporal
measures of fluency and holistic scores on the Oral English
Proficiency Test (OEPT) showed the potential of using the fluency
measures (Ginther, Dimova and Yang 2009). The software PRAAT can
extract basic information like total response time, number of pauses
and number of syllables using scripts (de Jong and Wempe 2007). This
study investigates the possibility of automatic measurement of
temporal variables in speech samples by using acoustic analysis
functions of PRAAT. The data set consists of 300 OEPT speech
samples: responses by 150 speakers to two items, drawn from three
language groups (Chinese, Hindi, and native English) with different
proficiency levels. The speech samples are analyzed with two
methods: extracting temporal variables manually and with PRAAT.
Basic temporal information such as speech time, pausing time,
number of pauses, and number of syllables is extracted with both
methods, and temporal measures of fluency are calculated to compare
the results. The results show that automatic measurement of
temporal variables using PRAAT performs well in measuring fluency
compared with manual extraction, especially for temporal measures
that use syllable information.
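The step from basic temporal information to fluency measures can be sketched as follows. The abstract does not give its formulas, so the definitions here are commonly used ones and should be read as assumptions: rates are expressed in syllables per minute, and the number of runs is taken to be the number of pauses plus one. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResponseTimings:
    total_time_s: float  # total response time, including silent pauses
    pause_time_s: float  # summed duration of silent pauses
    n_pauses: int        # number of silent pauses
    n_syllables: int     # number of syllables produced


def fluency_measures(r):
    """Derive temporal fluency measures from basic counts."""
    speaking_time_s = r.total_time_s - r.pause_time_s
    n_runs = r.n_pauses + 1  # assumption: pauses separate the runs
    return {
        # Syllables per minute over the whole response.
        "speech_rate": r.n_syllables / r.total_time_s * 60,
        # Syllables per minute over speaking time only.
        "articulation_rate": r.n_syllables / speaking_time_s * 60,
        "mean_silent_pause_time": (
            r.pause_time_s / r.n_pauses if r.n_pauses else 0.0
        ),
        "mean_syllables_per_run": r.n_syllables / n_runs,
    }


# Invented values, for illustration only.
example = ResponseTimings(
    total_time_s=60.0, pause_time_s=15.0, n_pauses=9, n_syllables=150
)
measures = fluency_measures(example)
```

Because every measure except mean silent pause time depends on the syllable count, the accuracy of automatic syllable detection largely decides how close the automatic and manual results can be.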
Toward a Syntactic
Analysis of Oral English Proficiency
Sunny K. Park, Purdue University, West Lafayette, IN
Test developers struggle in their attempts to design items that are
able to elicit level differences in a meaningful way. Particularly
for oral proficiency, it is challenging to identify syntactic
qualities that characterize the levels. This study investigated the
differences in syntactic qualities between levels of oral English
proficiency on a semi-direct test. Formal syntactic theory reveals
structural differences between types of embedded clauses, such that
those functioning as complements (e.g. object clauses) are
structurally more complex and restricted than those acting as
adjuncts (e.g. adverbial clauses) within the main clause, predicting
that embedded complement clauses would cause more difficulty for
learners than embedded adjunct clauses. This study aimed to address
the following research questions: (a) Are the variables of clause
type and oral proficiency related in spoken L2 English? (b) Does the
proportion of complement clauses to all well-formed clauses predict
and correlate with oral proficiency level? Ninety-six transcripts
of oral test data from Chinese learners of English were coded for
the following clause types: main clause, embedded adjunct clause,
embedded finite complement clause, and embedded nonfinite complement
clause. Four levels of oral proficiency, including a native speaker
control group, were examined. Results showed a significant effect
of clause type on proficiency score (χ² = 19.66, p = 0.02).
Findings also reveal that the number of complement clauses used in
a test response significantly distinguishes learners from native
speakers. Implications for speaking
test development, particularly through incorporating formal
syntactic theory, are discussed.
The impact of the
First Certificate in English (FCE) examination upon the EFL
classroom: A washback study
Michael Perrone, Teachers College, Columbia University, New York,
NY
This study investigated the notion of washback (i.e., the impact of
testing on teaching and learning) within the context of the
high-stakes First Certificate in English (FCE) examination and the
teachers and learners at the British Institute of Florence (BIF). A
review of the
literature focusing on washback revealed limitations in the previous
research. Thus, this pilot study attempted to address these
limitations by looking at how the high-stakes FCE examination
impacted the classroom activities of two BIF courses (i.e., FCE
Preparation and General EFL) and how instructors’ methodology
changed over the course of the academic year (i.e., November, March,
or May) as a result of the FCE administration date.
The research findings suggested that there were changes in
the degree to which the language skills were incorporated into the
two courses. These changes were dependent, in part, on two factors:
1) the actual FCE course (i.e., FCE Preparation versus General EFL)
and, 2) the proximity of the FCE administration date (i.e.,
November, March, or May). In addition, the data suggested that the
extent to which exam-related activities were incorporated into the
two courses varied and this variability was due, in part, to the
actual FCE course and the FCE administration date. Thus, it may be
argued that the test-related activities and/or language skills in
the two courses varied, in part, because of the focus of the
particular course (i.e., FCE Preparation or General EFL). That is,
these findings suggested evidence for washback in this particular
context. However, a series of paired t-tests for independent means
indicated that the students’ mean scores on the FCE in the two BIF
courses were not significantly different, suggesting that classroom
methodology may have a limited impact on the test performance of FCE
examinees.
Developing measures for instructional comparison
research: Exploring competing user needs
Carol A. Chapelle, Erik Voss, Iowa State University
Researchers tackle conceptual issues in designing instructional
comparison studies by adhering to guidelines which specify how to
identify a sample, form groups, analyze test scores, and generalize
results. Beneath these research methodology issues is an assumption
that the tests used actually measure the relevant abilities. In
fact, the design of tests whose scores are valid for making
comparisons between groups in a specific context is a challenge
second language researchers meet in comparison studies such as those
comparing effects of classes using computer-assisted language
learning (CALL) and teacher-led classes.
This paper reports on the challenge of test development for such a
research project, which investigated outcomes of two instructional
conditions provided to over 200 ESL learners at three different
levels in an intensive English program. We describe the research
context to explain the assessment challenge we faced: the need to
test what was taught in the two classes (i.e., using principles of
criterion referenced test development) and to obtain test results
that would communicate to an external audience (i.e., in a manner
that norm-referenced tests do). We explain our approach to
reconciling these needs by drawing on test methods from the
literature (dictation and vocabulary levels tests) while sampling
some items from the curriculum. We examine the results of the test
development process through the item statistics of the various types
of items. Based on the item data, reliabilities, and the utility of
the tests for comparing the groups, we evaluate the success of the
hybrid approach to test development.
An Investigation of the Function of a Rating Scale
Rui Yang, Purdue University, West Lafayette, IN
Scaling is one of the critical issues in test development. The
purpose of this study was to investigate the function of a
four-point ordinal scale used in an oral English proficiency test
administered to certify the oral English proficiency of prospective
international teaching assistants at a large North American
university. Data in this study included ratings of 434 examinees by
10 raters from two operational administrations one year apart. A
Many-Facet Rasch analysis was conducted to examine how raters’
use of the rating scale affected the scores examinees received
on the test, and how well the scale distinguished the oral
proficiency of the examinees. The results show gaps between
adjacent rating categories that are slightly larger than the
recommended maximum (5 logits). The findings suggest an opportunity
for scale revision that would allow raters to better distinguish
the oral proficiency levels of the examinees.
Pre-Conference Workshop
Collaboration in Test Development and Assessment: A Swedish and
European Perspective
Gudrun Erickson, Senior Lecturer, University of Gothenburg,
Gothenburg, Sweden
The workshop will focus on different forms of
collaboration between stakeholders, both in test development and in
the design and use of continuous assessment procedures in
classrooms. The issue of collaboration is addressed from a
theoretical as well as from an empirical point of view and will be
further developed through discussions among participants.
Principles and practice characterizing the
development of large-scale, national language tests and assessment
materials in Sweden will be introduced. The test development
approach adopted, involving systematic collaboration with teachers,
teacher educators, researchers, and large numbers of students of
different ages, has two major aims: to optimize validity and
reliability, and to enable positive impact on learning and teaching.
In addition, a European survey of different views on language
testing and assessment will be briefly introduced. A total of 1,400
adolescent students and their teachers in ten European countries answered a
questionnaire focusing on students’ perceptions of ‘good’ vs. ‘bad’
language testing and assessment. Some results are outlined, and the
reflections of the students are considered in relation both to
theory and practice. Furthermore, the two empirical examples given,
as well as examples provided by the participants, will be discussed
in relation to the concept of Good Practice in language testing and
assessment.
Besides collaboration, a recurrent theme will be
the positive pedagogical potential of good testing and assessment.
Good assessment and good teaching share certain essential properties
which will be further explored and discussed in the course of the
workshop.