Papers accepted for the 2009 MwALT Conference
 

Testing Listening: might there be a "third way"? (Jayanti Banerjee, Natalie Chen, Spiros Papageorgiou; University of Michigan, Ann Arbor)
The Impact of the ISTEP+ on a School Corporation with a Large ELL Population (April Burke, Luciana C. de Oliveira; Purdue University, W. Lafayette, IN)
Language profiles of the Cutoff Borderline Cases for OEPT Rating (Haiying Cao; Purdue University, W. Lafayette, IN)
Possible Difficulty Factors Affecting Vocabulary Sentence Completion Items (Yen-Tzu Chang; Georgetown University, Washington, DC)
Developing measures for instructional comparison research: Exploring competing user needs (Carol A. Chapelle, Erik Voss; Iowa State University)
Examining Task Variability on a Computer-based Oral English Proficiency Test (Lixia Cheng; Purdue University, W. Lafayette, IN)
The Impact of World Englishes on Language Assessment (Huei-Lien Hsu; University of Illinois at Urbana-Champaign)
A Simplified Scoring Rubric for English-Chinese Translation in Large-scale Tests (Jinlin Jiang; University of Illinois at Urbana-Champaign)
Evidence-Based Development of an Assessment of Interactive Conversation Skills (Nancy Kauper; Purdue University, W. Lafayette, IN)
A new approach to predict ESL learner's communicative competence: mimicry and free speech (Young-Mi Kim; University of Illinois at Urbana-Champaign)
Assessing EFL Integrated Writing and Related Factors: Test Task Development (Yu-Chen Tina Lin; Indiana University, Bloomington)
Exploring the Test Developers' Role in Test Score Equating Procedures (Spiros Papageorgiou, Jayanti Banerjee; University of Michigan, Ann Arbor)
Temporal measures of fluency: automatic and manual extraction of temporal variables (Soohwan Park; Purdue University, W. Lafayette, IN)
Toward a Syntactic Analysis of Oral English Proficiency (Sunny K. Park; Purdue University, W. Lafayette, IN)
The impact of the First Certificate in English (FCE) examination upon the EFL classroom: A washback study (Michael Perrone; Columbia University, New York, NY)
An Investigation of the Function of a Rating Scale (Rui Yang; Purdue University, W. Lafayette, IN)

Pre-Conference and Plenary Sessions

Pre-Conference Workshop, "Collaboration in Test Development and Assessment: A Swedish and European Perspective" (Gudrun Erickson, Senior Lecturer, University of Gothenburg, Gothenburg, Sweden)
Plenary Session, "Language Testing Without Blueprints" (Prof. Fred Davidson, University of Illinois at Urbana-Champaign)


 

ABSTRACTS

Testing listening: might there be a “third way”?
Jayanti Banerjee, Natalie Chen, Spiros Papageorgiou
English Language Institute, University of Michigan, Ann Arbor

A recurring debate in the design of listening tests is whether the listening input should be played once or twice. Persuasive construct arguments can be made on both sides of the question (Buck, 2009), and research to date has not helped to break the theoretical impasse (e.g., Fortune, 2004; Pranculiene, 2004). Professional language testing organizations have come down on one side or the other, choosing to play recordings either once or twice. The latter approach is particularly popular with stakeholders.

However, playing a listening text twice does not resolve the matter. In particular, it fails to address a key point of contention, which is that when asked to repeat what they have said, speakers virtually always reformulate. Even when the same words are repeated, the intonation is invariably different. In fact, listeners are unlikely to hear exactly the same text twice. 

In this paper we will explore ways in which listening tests can capture this real-world feature of aural input while also playing the recording twice (as preferred by test-takers). We will demonstrate our attempts to develop input texts that come in two complementary parts so that when the input is played for the second time, it is a reformulation of the first text. We will talk about the issues arising for the development of such test items and initiate a discussion about the construct validation of such an approach to testing listening. 


 

The Impact of the ISTEP+ on a School Corporation with a Large ELL Population
April Burke, Luciana C. de Oliveira
Purdue University, West Lafayette, IN

English language learners (ELLs) consistently receive lower scores than non-ELLs on the Indiana Statewide Testing for Educational Progress Plus (ISTEP+). For example, only 52% of ELLs passed the Language Arts section of the 2007-2008 ISTEP+, compared with 78% of non-ELLs. The performance of a school’s ELL population can determine whether the school is deemed “failing” and placed on “improvement status.” Although many schools face these possible consequences, few studies have focused on the ramifications of using the ISTEP+ in schools with large ELL populations. In response to this gap, this study investigates the impact of the ISTEP+ on a school corporation in which ELLs make up over 26% of the student population. Through interviews with three administrators who were formerly teachers in the corporation, the study presents their perspectives on the impact of the ISTEP+ on the corporation’s programming, funding, classroom instruction, staff, and ELL students. The study also includes a quantitative component, which begins with a discussion of the corporation’s overall subgroup performance on the ISTEP+ and continues with an analysis of the individual ISTEP+ scores of ELL and non-ELL students who receive free and reduced lunch.



 

Language Profiles of the Cutoff Borderline Cases for OEPT Rating
Haiying Cao
Purdue University, West Lafayette, IN 

The Oral English Proficiency Test (OEPT) is used at Purdue University to certify the oral English proficiency of international graduate teaching assistants. An examinee who receives a score of 5 is certified as proficient in oral English, whereas a score of 4 means that the examinee must take at least one semester of coursework in oral English. Rater disagreement at the cutoff (the borderline cases between scores 4 and 5) has remained a consistent challenge for rater training, and raters’ decisions on these cases seem quite random.

To address this issue, in addition to improving the level descriptors, individual language profiles based on thick description were created for five examinees on whose cutoff classification raters disagreed during the rescaling project. The profiles were an attempt to better understand the nature of cutoff borderline cases. Two items selected from each of the five tests were transcribed, coded and analyzed in detail. A list of common features derived from the analysis placed four examinees among the lower borderline cases and one at Level 4, one category below the borderline cases. These examinees showed mastery of pronunciation, idiomaticity and grammar, whereas their performance in syntax, content and organization was insufficient for the tasks on the OEPT.

 


 

Possible Difficulty Factors Affecting Vocabulary Sentence Completion Items
Yen-Tzu Chang
Georgetown University, Washington, DC 

Previous examinations of the relationship between test item characteristics and item difficulty have generally focused on cloze items. As an early attempt to help address the lack of sentence completion research, this exploratory study presents preliminary findings from a work-in-progress regarding possible factors that affect vocabulary sentence completion item difficulty. 

This study examined the effects of multiple-choice sentence completion item characteristics on item difficulty. Forty sentence completion items were selected from national college entrance exams in Taiwan. Item difficulty was calculated by applying the item response theory (IRT) model to 10,000 randomly selected samples of the exam items in question. Each item was analyzed for several characteristics: (a) sentence length, (b) word frequency, (c) polysemy (operationalized by WordNet sense ranking), (d) collocation (measured by mutual information value), (e) type of sentence (definitional, contrasting, causal, no relationship), and (f) parts of speech. Correlation, regression, and ANOVA were employed to analyze the data. The results indicated non-significant relationships between these factors and item difficulty with an observed power of .25. One explanation is that vocabulary sentence completion item difficulty is affected by factors not included in this study, but it is also possible that the factors considered were not measured appropriately.  
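As an illustration of the collocation measure mentioned above, a mutual information value for a word pair is commonly computed from corpus counts as the base-2 logarithm of the observed co-occurrence frequency over the frequency expected by chance. The sketch below assumes that common definition (pointwise mutual information); the abstract does not state which formula or corpus was used, and the function name and the counts are hypothetical.

```python
import math


def mutual_information(pair_count: int, count_x: int, count_y: int, corpus_size: int) -> float:
    """Pointwise mutual information (base 2) of a word pair from raw corpus counts.

    Assumed formula: log2( (pair_count * corpus_size) / (count_x * count_y) ),
    i.e. observed co-occurrence relative to what independence would predict.
    """
    if min(pair_count, count_x, count_y, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2((pair_count * corpus_size) / (count_x * count_y))


# Hypothetical counts: the two words occur 20 and 25 times in a
# 1,000-token corpus and co-occur 4 times.
mi_example = mutual_information(pair_count=4, count_x=20, count_y=25, corpus_size=1000)
print(mi_example)
```

A higher value indicates that the two words co-occur more often than chance would predict, which is why such a score can serve as a rough indicator of collocational strength.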

There is no doubt that the study’s small sample size weakens the statistical power. Further research will be conducted with more items and will include different ways of operationalizing these factors in the hope that the information can eventually be used to inform development of vocabulary sentence completion item writing guidelines.



Examining Task Variability on a Computer-based Oral English Proficiency Test 
Lixia Cheng, Purdue University, West Lafayette, IN

The current popularity of performance testing and task-based language assessment has triggered great interest in researching task difficulty and variability in language performance assessment. The present study examines whether the same examinees’ responses yield significantly different fluency measures across two tasks (Newspaper Headline vs. Compare and Contrast) on a computer-based semi-direct oral English proficiency test. Transcriptions of task responses by two low-intermediate proficiency groups of 25 Chinese ESL learners each are analyzed in terms of temporal measures and lexical variables. Results of statistical tests suggest that task has no main effect on temporal measures of fluency such as speech rate and mean syllables per run, and that the interaction of task with examinee proficiency is not significant for these temporal variables either. T-tests comparing the means of lexical variables (number of tokens, number of types, type-token ratio, and lexical density) across the two task types indicate no significant difference between the two tasks in type-token ratio or lexical density. However, the total number of tokens and the total number of types do display significant differences between the two groups. An in-depth discourse analysis is needed before any claims can be made about the possible transfer of lexical items from the text prompt of one of the tasks to examinees’ responses.
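Three of the lexical variables named above (number of tokens, number of types, and type-token ratio) can be computed directly from a transcript. The following is a minimal sketch, not the study's procedure: the tokenizer (lowercased runs of letters and apostrophes), the function name, and the sample sentence are all assumptions made for illustration. Lexical density is omitted because it requires part-of-speech information.

```python
import re


def lexical_counts(transcript: str) -> dict:
    """Return token count, type count and type-token ratio for a transcript."""
    # Assumed tokenization: lowercase the text and treat each run of
    # letters or apostrophes as one word token.
    tokens = re.findall(r"[a-z']+", transcript.lower())
    types = set(tokens)  # distinct word forms
    ttr = len(types) / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "types": len(types), "ttr": ttr}


# Hypothetical response fragment, for illustration only.
sample = "the headline says the economy is growing and the market is growing too"
print(lexical_counts(sample))
```

Because type-token ratio tends to fall as responses get longer, comparisons across tasks are easier to interpret when token counts are reported alongside it, as the abstract does.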

 


 

The Impact of World Englishes on Language Assessment
Huei-Lien Hsu, University of Illinois at Urbana-Champaign 

Although interest in World Englishes has generated a significant amount of in-depth research on issues and concerns pertaining to the transformation of global English use, the extent to which these rich discussions affect theory and practice in language testing remains to be examined. This presentation will first review three major issues within World Englishes that may influence test construction and language testing theory: the controversy over the definition of the native speaker, the intelligibility of varietal speakers, and the call for a reconceptualization of EFL teaching and learning. These issues indicate that emerging linguistic richness poses an immediate challenge to raters of speaking and writing tests and thus a potential threat to test validity. Raters are challenged to differentiate a test-taker who makes linguistic errors from a proficient learner who uses local linguistic variations.

Guided by the third methodological paradigm, the mixed methods approach, this study explores raters’ perceptions of World Englishes, the factors that affect the extent to which they tolerate varieties of English, and how their perceptions of these varieties affect their scoring judgments. The raters in this study are currently rating an oral test developed by a large Midwestern state university and a language test provider. Insights gained from this study will also help determine what components may be needed to facilitate rater training.

The results of this study are of particular importance for three main reasons. First, the study aims to increase language testers’ awareness of how rich linguistic variations affect language testing. Second, it indicates how language testing and the framework of World Englishes may further benefit from each other in the 21st century. Finally, the mixed methods approach adopted in this study illustrates ways of mixing different methods at different stages of a study in an effort to generate a broader and deeper understanding of the issues investigated and to honor the complexity and contingency of human phenomena.



 

A Simplified Scoring Rubric for English-Chinese Translation in Large-scale Tests
Jinlin Jiang, University of Illinois at Urbana-Champaign 

Translation items are widely adopted in China’s large-scale English tests, and translation scoring is notoriously time-consuming due to the large number of examinees. This study investigates the development of a simplified scoring rubric for English-Chinese translation in large-scale tests. Two analytic scoring rubrics are constructed and compared: a thorough one and a simplified one. Thorough scoring demands detailed scoring of “form” and “meaning”: the “form” of a translation is scored according to the grammaticality, idiomaticity and closeness of style of each sentence, and the “meaning” is scored phrase by phrase according to its faithfulness to the source text. Simplified scoring requires scoring only the testing points that show a high degree of difference. Translations of a 15-sentence English text by 310 Chinese university students are scored by three raters using both rubrics. The two types of scoring are compared for the whole-text translation and for the 15 single-sentence translations. Correlation analyses and reliability statistics indicate that simplified scoring of both the text translation and the sentence translations is highly correlated and consistent with thorough scoring. Therefore, an argument can be made that simplified scoring performs the same role as the more thorough scoring, and perhaps certain efficiencies can be introduced into the current scoring process.

 


 

Evidence-Based Development of an Assessment of Interactive Conversation Skills
Nancy Kauper, Purdue University, West Lafayette, IN 

Purdue University’s Oral English Proficiency Program (OEPP) trains and certifies international graduate students for English proficiency. We have traditionally assessed oral presentation skills in formal classroom contexts, but found that assessment of face-to-face interaction and conversation skills was lacking. Developing an assessment of conversation skills has allowed us to focus on students’ abilities to interact and communicate informally one-on-one. 

In 2007-2008, OEPP students were assessed using a working model of the Interactive Conversation (IC) assessment, which included evaluation criteria in three general areas: interactive understanding, active listening, and conversation management. Prior to assessment, students were provided with evaluation criteria, guidelines and general strategies for successful conversations, and opportunities to practice conversing with classmates and instructors.  

Digital video recordings of those IC assessments were made. A set of recordings showing a wide range of variability was selected and the conversations transcribed. Transcriptions were analyzed to see whether and how students met the evaluation criteria, what specific strategies and language they used to meet the criteria, and whether there were gaps or superfluities in the criteria.  

Information gained from this analysis was then used to revise the evaluation criteria, to provide students and instructors with examples of successful conversation strategies and language, and to create new scales and rubrics in order to better reflect the range of variability in student assessment performances. 

Excerpts from transcripts will be shown and discussed, along with comparisons of the working and present models of the IC assessment criteria, scale and rubrics, and evaluation guidelines for raters.


 

A new approach to predict ESL learner’s communicative competence:  mimicry and free speech
Young-Mi Kim, University of Illinois at Urbana-Champaign 

The objective of this research is to compare English as a Second Language (ESL) learners’ ability to mimic native English speakers’ utterances with an independent measure of their communicative competence. Recent research in Second Language Acquisition (SLA) has investigated the possibility of measuring ESL learners’ overall communicative competence accurately, but researchers have rarely focused on ESL learners’ mimicking ability in relation to their language competence. The validity and reliability of language tests themselves remain controversial issues, especially in performance tests such as speaking and writing tests. More specifically, raters’ subjective judgment in scoring spoken and written samples and their interpretation of complicated holistic and analytic rating criteria have been considered unavoidable problems in implementing speaking and writing tests. By comparing scores on the mimicked utterances and the free speech of the same test takers, my research attempts to challenge such complacency by searching for a meaningful relationship between ESL learners’ ability to mimic native English speakers’ utterances and their communicative competence. In doing so, it explores the possibility of adopting this measure of mimicry as an aptitude test in order to predict ESL learners’ overall language competence more objectively.

   


Assessing EFL Integrated Writing and Related Factors: Test Task Development
Yu-Chen Tina Lin, Indiana University – Bloomington 

In order to assess EFL learners’ integrated writing skills and explore potential factors influencing the quality of their integrated writing performance, this preliminary study designs two reading tasks, one independent writing task, and one integrated writing task. An on-line testing interface is developed for administering all tasks. In the main integrated writing task, participants write English summaries of two reading passages of different difficulty levels. Cloze tests and multiple-choice questions are designed for these two passages to measure reading comprehension. One independent English essay writing task measures participants’ L2 writing ability. The on-line testing interface, which includes test designer’s, test takers’, and monitoring pages, allows individuals to take the tasks in different orders so that measurement error resulting from practice effects can be controlled. Multiple regression analysis will be used to estimate the weight of each related independent variable in predicting the dependent variable, the summary score. The reliability of scoring the writing tasks is examined by using rubrics from previous research and estimating the degree of rater agreement.

After the on-line test was administered to four participants, a group interview was conducted immediately. Participants also recalled and wrote down how they had completed the integrated writing task within one to three days after the test. Drawing on participants’ test performances, their oral and written feedback, and raters’ scoring experiences, this study proposes suggestions and modifications for designing valid test tasks for assessing integrated writing skills and for increasing the reliability of scoring rubrics and the degree of rater agreement.

  


 

Exploring the Test Developers’ Role in Test Score Equating Procedures
Spiros Papageorgiou, Jayanti Banerjee, English Language Institute, University of Michigan, Ann Arbor

In order to provide comparable scores across different test forms, examination providers need to apply a process known as 'equating' (Kolen & Brennan, 2004). Among the types of linking proposed by Mislevy (1992) and Linn (1993), equating is the strongest one because it is symmetric, that is, equated results can be used interchangeably. Even though test developers usually rely on the expertise of psychometricians when it comes to applying equating procedures, a number of issues have to be addressed as part of ongoing test development operations. This presentation aims to address these issues by presenting experience from the development of an international ESL examination offered by a Midwestern university.

In particular the presentation will address the following:

  • Selection of type of equating (e.g. common-item equating v. common-person equating)

  • Selection of common items for equating

  • Position of common items in test forms

  • Analysis of item information for establishing equated results.

It will be argued that, despite the psychometric expertise that equating procedures require, a number of equating issues rest with test developers, as they involve content and test design criteria as much as psychometric criteria. The discussion of these criteria will be of use to researchers involved in test development projects and developers of language tests.


 

Temporal measures of fluency: automatic and manual extraction of temporal variables
 
Soohwan Park, Purdue University, West Lafayette, IN

Among the several components of foreign language learners’ oral proficiency, fluency can be defined as ‘speed and smoothness of oral delivery’ (Lennon 1990). Fluency can be measured by calculating temporal variables such as speech rate, articulation rate, mean silent pause time and mean syllables per run (Riggenbach 1991, Kormos and Denes 2004). A study of the relationship between temporal measures of fluency and holistic scores on the Oral English Proficiency Test (OEPT) showed the potential of such fluency measures (Ginther, Dimova and Yang 2009). The software PRAAT can extract basic information such as total response time, number of pauses and number of syllables using scripts (de Jong and Wempe 2007). This study investigates the possibility of automatically measuring temporal variables in speech samples by using the acoustic analysis functions of PRAAT. The data set consists of 300 OEPT speech samples produced by 150 speakers on two items, across three language groups with different proficiency levels (Chinese, Hindi and native English). The speech samples are analyzed with two methods: manual extraction of temporal variables and extraction by PRAAT. With each method, the study extracts basic temporal information such as speech time, pausing time, number of pauses and number of syllables, then calculates temporal measures of fluency and compares the results. The results show that automatic measurement with PRAAT performs well in measuring fluency when compared with manual extraction, especially for temporal measures that rely on syllable information.
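The four temporal measures named above can be derived from the basic quantities the abstract lists (speech time, pausing time, number of pauses, number of syllables). The sketch below is illustrative only and makes several assumptions not stated in the abstract: rates are expressed in syllables per minute, a "run" is a stretch of speech between silent pauses, and the input has already been segmented into runs and pauses (by hand or by a PRAAT script). The `Run` class, the function name, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One stretch of speech between silent pauses (assumed definition)."""
    duration_s: float  # speaking time of the run, in seconds
    syllables: int     # number of syllables produced in the run


def temporal_measures(runs: list, pauses_s: list) -> dict:
    """Compute four temporal fluency measures from segmented speech."""
    if not runs:
        raise ValueError("at least one run of speech is required")
    total_syllables = sum(r.syllables for r in runs)
    speech_time = sum(r.duration_s for r in runs)  # excludes silent pauses
    pause_time = sum(pauses_s)
    response_time = speech_time + pause_time       # includes silent pauses
    return {
        # syllables per minute over the whole response, pauses included
        "speech_rate": 60 * total_syllables / response_time,
        # syllables per minute over speaking time only
        "articulation_rate": 60 * total_syllables / speech_time,
        # average length of a silent pause, in seconds
        "mean_silent_pause_time": pause_time / len(pauses_s) if pauses_s else 0.0,
        # average number of syllables between pauses
        "mean_syllables_per_run": total_syllables / len(runs),
    }


# Hypothetical response: three runs separated by two silent pauses.
measures = temporal_measures(
    runs=[Run(2.0, 8), Run(3.0, 12), Run(1.0, 4)],
    pauses_s=[0.5, 1.5],
)
print(measures)
```

Running the same function on manually extracted and automatically extracted segmentations of one response would give two sets of measures that can be compared directly, which is the kind of comparison the abstract describes.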


 

Toward a Syntactic Analysis of Oral English Proficiency
Sunny K. Park, Purdue University, West Lafayette, IN

Test developers struggle in their attempts to design items that are able to elicit level differences in a meaningful way. Particularly for oral proficiency, it is challenging to identify syntactic qualities that characterize the levels. This study investigated the differences in syntactic qualities between levels of oral English proficiency on a semi-direct test. Formal syntactic theory reveals structural differences between types of embedded clauses, such that those functioning as complements (e.g. object clauses) are structurally more complex and restricted than those acting as adjuncts (e.g. adverbial clauses) within the main clause, predicting that embedded complement clauses would cause more difficulty for learners than embedded adjunct clauses. This study aimed to address the following research questions: (a) Are the variables of clause type and oral proficiency related in spoken L2 English? (b) Does the proportion of complement clauses to all well-formed clauses predict and correlate with oral proficiency level? Ninety-six transcripts of oral test data from Chinese learners of English were coded for the following clause types: main clause, embedded adjunct clause, embedded finite complement clause, and embedded nonfinite complement clause. Four levels of oral proficiency, including a native speaker control group, were examined. Results showed that there is indeed a significant effect of clause type on proficiency score (χ² = 19.66, p = 0.02). Findings also reveal that the number of complement clauses used in a test response is a significant factor distinguishing learners from native speakers. Implications for speaking test development, particularly through incorporating formal syntactic theory, are discussed.


 

The impact of the First Certificate in English (FCE) examination upon the EFL classroom:  A washback study
Michael Perrone, Teachers College, Columbia University, New York, NY 

This study investigated the notion of washback (i.e., the impact of testing on teaching and learning) within the context of the high-stakes First Certificate in English (FCE) examination and the teachers and learners at the British Institute of Florence (BIF). A review of the literature focusing on washback revealed limitations in the previous research. Thus, this pilot study attempted to address these limitations by looking at how the high-stakes FCE examination impacted the classroom activities of two BIF courses (i.e., FCE Preparation and General EFL) and how instructors’ methodology changed over the course of the academic year (i.e., November, March, or May) as a result of the FCE administration date.

The research findings suggested that there were changes in the degree to which the language skills were incorporated into the two courses. These changes depended, in part, on two factors: (1) the actual course (i.e., FCE Preparation versus General EFL) and (2) the proximity of the FCE administration date (i.e., November, March, or May). In addition, the data suggested that the extent to which exam-related activities were incorporated into the two courses varied, and that this variability was due, in part, to the actual course and the FCE administration date. Thus, it may be argued that the test-related activities and/or language skills in the two courses varied, in part, because of the focus of the particular course (i.e., FCE Preparation or General EFL). That is, these findings suggested evidence for washback in this particular context. However, a series of paired t-tests for independent means indicated that the students’ mean scores on the FCE in the two BIF courses were not significantly different, suggesting that classroom methodology may have a limited impact on the test performance of FCE examinees.


 

Developing measures for instructional comparison research: Exploring competing user needs
Carol A. Chapelle, Erik Voss, Iowa State University

Researchers tackle conceptual issues in designing instructional comparison studies by adhering to guidelines which specify how to identify a sample, form groups, analyze test scores, and generalize results.  Beneath these research methodology issues is an assumption that the tests used actually measure the relevant abilities.  In fact, the design of tests whose scores are valid for making comparisons between groups in a specific context is a challenge second language researchers meet in comparison studies such as those comparing effects of classes using computer-assisted language learning (CALL) and teacher-led classes.    

This paper reports on the challenge of test development for such a research project, which investigated outcomes of two instructional conditions provided to over 200 ESL learners at three different levels in an intensive English program. We describe the research context to explain the assessment challenge we faced: the need to test what was taught in the two classes (i.e., using principles of criterion-referenced test development) and to obtain test results that would communicate to an external audience (i.e., in the manner of norm-referenced tests). We explain our approach to reconciling these needs by drawing on test methods from the literature (dictation and vocabulary levels tests) while sampling some items from the curriculum. We examine the results of the test development process through the item statistics of the various types of items. Based on the item data, reliabilities, and the utility of the tests for comparing the groups, we evaluate the success of the hybrid approach to test development.


 

An Investigation of the Function of a Rating Scale
Rui Yang, Purdue University, West Lafayette, IN 

Scaling is one of the critical issues in test development. The purpose of this study was to investigate the function of a four-point ordinal scale used in an oral English proficiency test administered to certify the oral English proficiency of prospective international teaching assistants at a large North American university. Data in this study included ratings of 434 examinees by 10 raters from two operational administrations one year apart. A Many-Facet Rasch analysis was conducted to examine how the raters’ use of the rating scale affected the scores examinees received on the test, and how well the scale distinguished the oral proficiency of the examinees. The results show gaps between adjacent rating categories that are slightly larger than the recommended maximum (5 logits). The findings suggest an opportunity for scale revision that would allow raters to better distinguish the oral proficiency levels of the examinees.


 

Pre-Conference Workshop
Collaboration in Test Development and Assessment: A Swedish and European Perspective
Gudrun Erickson, Senior Lecturer, University of Gothenburg, Gothenburg, Sweden

The workshop will focus on different forms of collaboration between stakeholders, both in test development and in the design and use of continuous assessment procedures in classrooms. The issue of collaboration will be addressed from both a theoretical and an empirical point of view and further developed through discussions among participants.

Principles and practice characterizing the development of large-scale, national language tests and assessment materials in Sweden will be introduced. The test development approach adopted, involving systematic collaboration with teachers, teacher educators, researchers, and large numbers of students of different ages, has two major aims: to optimize validity and reliability, and to enable positive impact on learning and teaching. In addition, a European survey of different views on language testing and assessment will be briefly introduced. In this survey, 1,400 adolescent students and their teachers in ten European countries answered a questionnaire focusing on students’ perceptions of ‘good’ vs. ‘bad’ language testing and assessment. Some results are outlined, and the reflections of the students are considered in relation to both theory and practice. Furthermore, the two empirical examples given, as well as examples provided by the participants, will be discussed in relation to the concept of Good Practice in language testing and assessment.

Besides collaboration, a recurrent theme will be the positive pedagogical potential of good testing and assessment. Good assessment and good teaching share certain essential properties which will be further explored and discussed in the course of the workshop.

 
