Assessing World Languages Conference 2019

Poster

Conference Schedule

Dates: November 6 – 9, 2019

Faculty of Arts and Humanities, University of Macau

Conference Chairs: Antony John Kunnan, Cecilia Zhao and Matthew Wallace

Venue: FBA E22-G010 & FAH E21-G049 (Simultaneous Interpreting Lab)

Day 1, November 6, 2019 [Wednesday]

   Venue: FBA E22-G010

Time Events Titles and Speakers
9:00 – 9:20 Registration & Opening

Antony John Kunnan, University of Macau

9:20 – 10:20 Talk 1 Validation Research on Tests of Writing for Academic Purposes

Alister Cumming, University of Toronto

10:20 – 10:30 Break
10:30 – 11:30 Talk 2 Overall English Language Proficiency (Whatever That Is)

James D. Brown, University of Hawaiʻi at Mānoa

11:30 – 11:50 Group Photo
11:50 – 12:50 Talk 3 Alternative Approaches to Writing Assessment in Higher Education

Cecilia Guanfang Zhao, University of Macau

12:50 – 14:30 Lunch Break
14:30 – 15:30 Talk 4 Development of an English Proficiency Test for Intermediate-level EFL Learners

Yong-Won Lee & Heesung Jun, Seoul National University

15:30 – 15:40 Break
15:40 – 16:40 Talk 5

China’s Standards of English Language Ability: Theoretical Underpinnings and Potential Impact on English Teaching, Learning, and Testing

Jianda Liu, Guangdong University of Foreign Studies

Day 2, November 7, 2019 [Thursday]

Venue: FBA E22-G010 

Time Events Titles and Speakers
9:00 – 9:20 Registration

9:20 – 10:20 Talk 6

The Role of Source-text Comprehension in Integrated Writing Task Performance: What L2 Assessment Research So Far Tells Us

Yasuyo Sawaki, Waseda University

10:20 – 10:30 Break
10:30 – 11:30 Talk 7 A Meta-analysis of L2 Speaking Proficiency and Its Features

Yo In’nami, Chuo University

11:30 – 11:40 Break
11:40 – 12:40 Talk 8 How to Tell if You Grew up Speaking Spanish: Using Corpora and DIF to Distinguish Heritage and L2 Speakers’ Language Knowledge

Melissa Bowles, University of Illinois at Urbana-Champaign

12:40 – 14:30 Lunch Break
14:30 – 15:30 Talk 9 Understanding and Assessing the Repertoire of an Indian Multilingual Through a Speaking Test

Rama Mathew, Delhi University

15:30 – 15:40 Break
15:40 – 16:40 Talk 10

Sources of Validity Evidence for the Interpretation of CU-TEP Cut Scores

Jirada Wudthayagorn, Chulalongkorn University

Day 3, November 8, 2019 [Friday]

Venue: FBA E22-G010 

Time Events Titles and Speakers
9:00 – 9:20 Registration
9:20 – 10:20 Talk 11

Source-based Writing: Some Reflections on Task Design and Scoring Decisions

Atta Gebril, The American University in Cairo

10:20 – 10:30 Break
10:30 – 11:30 Talk 12 Prisoner’s Dilemma: An Analysis of Parents’ Perceptions of the Hong Kong Territory-wide Systematic Assessment

Qin Xie, Education University of Hong Kong

11:30 – 11:40 Break
11:40 – 12:40 Talk 13

The Application of Corpora in Minoritized Language Contexts: Supporting and Informing the Pedagogic Landscape

Dawn Knight, Cardiff University

12:40 – 14:30 Lunch Break
14:30 – 15:30 Talk 14

Revisiting the Vertices: Curriculum Design-Language Teaching-Language Assessment

José Pascoal, University of Macau

15:30 – 15:40 Break
15:40 – 16:40 Talk 15 Who Determines Proficiency? Communicative Effectiveness as a Real-world Criterion for Speaking and Writing Ability

Jonathan Schmidgall, Educational Testing Service, Princeton

16:40 – 16:50 Break
16:50 – 17:20 Talk 16

A Process-oriented Investigation of the Efficacy of Cambridge English Write & Improve® as a Diagnostic Assessment Tool in a Chinese EFL Context

Sha Liu, University of Bristol

Jing Xu, Cambridge Assessment English

Guoxing Yu, University of Bristol

Day 4, November 9, 2019 [Saturday]

Venue: FAH E21-G049 (Simultaneous Interpreting Lab) 

Time Events Titles and Speakers
9:30 – 10:00 Talk 17

Classroom Assessment of L2 English Presentation Skills Using a Textbook-based Task and Rubric

Rie Koizumi, Juntendo University

Ken Yano, Taga Senior High School

10:00 – 10:30 Talk 18

Exploring the Construct Validity of Paraphrasing

Emily Zhang Di, University of Macau

10:30 – 10:40 Break
10:40 – 11:10 Talk 19

Developing and Validating a Rating Scale of Speaking Prosody Ability for Speakers of Chinese as a Second Language

Sichang Gao & Mingwei Pan, Shanghai International Studies University

11:10 – 11:30 Closing

Cecilia Guanfang Zhao & Matthew Wallace, University of Macau

Abstracts

Day 1

Talk 1: Validation Research on Tests of Writing for Academic Purposes

Alister Cumming, University of Toronto

What kinds of research, inferences, and evidence have recently been used to validate aspects of tests of writing for academic purposes?  I review examples of the range of research that has formulated arguments and gathered, analyzed, and published evidence to assert that inferences about scores from particular tests of academic writing are appropriate and justified. The research has involved: domain definition to relate test scores to a target domain; evaluation to link test performance to observed scores; generalization to link observed scores reliably to the universe of scores; explanation to link the universe of scores to construct interpretations in test contexts; extrapolation to link the universe of scores to interpretations to real-world situations; and utilization to link interpretations to uses of score reporting, decisions made based on test results, and long-term consequences for institutions or stakeholders.

Talk 2: Overall English Language Proficiency (Whatever That Is)

James D. Brown, University of Hawaiʻi at Mānoa

From the beginning of my 40+ year career in language testing, I have always been uncomfortable with the notion of overall English language proficiency (whatever that is). I suspect that I am not alone given the determined focus in language testing on validity issues over the past four decades. Certainly, much discussion has been devoted to the classical three Cs view of validity (content, construct, and criterion-related), the unified theory of validity, Messick’s expansion of the very idea of validity, Kane’s argument-based approach, and Toulmin’s schema for the structure of arguments (grounds, claim, warrant, backing, and rebuttal). Importantly, during those four decades, a number of things have changed in applied linguistics: (a) our conception of the nature of language has expanded enormously, (b) our pedagogical options in language teaching have multiplied greatly, (c) our available tools in language testing have expanded from basically three or four to a dozen or more, and (d) our ideas about who owns English have proliferated. In this talk, I will explore what all of this means for how we think about the ELP construct; what we can do in testing ELP to maximize the representativeness of the construct while minimizing the practical costs; and how language testing research may need to adapt in the future to deal with these issues.

Talk 3: Alternative Approaches to Writing Assessment in Higher Education

Cecilia Guanfang Zhao, University of Macau

An examination of current writing assessment practices seems to suggest that, unlike measurement theory, “writing theory has had a minimal influence on writing assessment” (Behizadeh & Engelhard, 2011, p. 189; Crusan, 2014). Consequently, the most often employed writing task, other than earlier discrete-point test items, is still prompt-based single-essay writing. However, assessment specialists have long called into question the usefulness of impromptu essay writing in response to a single prompt (Cho, 2003; Crusan, 2014). In response to this observation about the lack of theoretical support and of generalizability of results in current writing assessment practices, this paper seeks to outline and argue for a cognitive-process-based writing assessment design that is informed by, and operationalizes, theoretical conceptions of (academic) writing ability. Such an alternative approach to assessment design can be of pedagogical and practical value to writing instructors and test developers alike.

Talk 4: Development of an English Proficiency Test for Intermediate-Level EFL Learners

Yong-Won Lee & Heesung Jun, Seoul National University

The TEPS (Test of English Proficiency developed by Seoul National University) is a general-purpose English proficiency test intended to assess English language learners of a wide range of proficiency levels, from beginner to advanced learner. Since the introduction of the test in 1999 (and even after its recent revision in 2018), one important strength of TEPS has been its discriminability among test takers with high English proficiency. For this reason, TEPS scores have been used for the selection of candidates for undergraduate and graduate admissions at prestigious Korean universities and professional graduate schools of medicine, dentistry, pharmacy, and law.

While the usefulness of TEPS scores in such contexts is widely acknowledged, needs analysis and test-taker feedback over the years have consistently pointed to a need for a localized test with good discriminating power among intermediate- and beginner-level English language learners. Thus, with the goal of developing a new intermediate-level English proficiency test targeting secondary school students, civil service exam candidates, and job seekers, a test development project is currently underway at the TEPS Center, LEI, SNU. Basic research conducted to date includes domain analysis, needs analysis, review of language proficiency scales, reverse engineering and review of existing English language tests and their score reports, and prototyping of new item types.

The main goal of our talk is to present the findings of our preliminary research studies and discuss a tentative design framework for the new test. In the talk, we will also outline plans for further research, including finalization of the test framework and specification, multiple pilot tests, and a field test, which are expected to help us to establish procedures for the scaling and equating of scores before the operationalization of the new test.

Talk 5: China’s Standards of English Language Ability: Theoretical Underpinnings and Potential Impact on English Teaching, Learning, and Testing

Jianda Liu, Guangdong University of Foreign Studies

China’s Standards of English Language Ability (CSE), issued by the Ministry of Education of China in 2018, has been attracting more and more attention inside and outside China. CSE sets out to define and describe the English competences that learners at different educational phases are supposed to achieve, to provide references and guidelines for English learning, teaching, and assessment, and to enrich the existing body of the language competence frameworks on a global basis. The influence of language frameworks and standards on curricula and examination reforms has been acknowledged in different areas around the world. This talk first introduces the theoretical underpinnings of the CSE, followed by an exploration of the potential impact of the CSE on English teaching, learning, and assessment in China. It concludes with suggestions on how the CSE can be applied in English teaching, learning, and testing in the Chinese EFL context.

Day 2

Talk 6: The Role of Source-text Comprehension in Integrated Writing Task Performance: What L2 Assessment Research So Far Tells Us

Yasuyo Sawaki, Waseda University

For the last two decades the field of language assessment has witnessed a rapid increase in investigations into L2 learners’ performance on assessment task types that require the integration of multiple modalities (reading, listening, speaking, and writing). Among them are integrated writing tasks, where the learner writes, for example, a summary of a source text or an argumentative essay in response to the information provided in a source text. A small number of previous empirical studies have suggested the important role that the adequacy of source-text comprehension might play in source-based writing (e.g., Asención-Delaney, 2014; Plakans & Gebril, 2011; Sawaki, Quinlan, & Lee, 2013; Trites & McGroarty, 2005). Despite this, relatively few studies have examined learners’ integrated writing performance from the perspective of the coverage and accuracy of source-text information in learner responses, presumably due in part to the typical conceptualization of integrated writing task scores as measures of writing ability.

In this session, the presenter will review relevant previous studies in text processing, L1 reading/writing, and L2 assessment with a specific focus on how source-text content representation has been operationalized and assessed in source-based writing. The presenter will then provide an overview of key results of her three-year research project, where rating scales for assessing source-text content representation in learner-produced summaries were developed and validated for use in EFL academic writing courses at a university in Japan. Implications of the study and future research directions will be discussed for enhancing construct representation in integrated writing assessment.

Talk 7: A Meta-analysis of L2 Speaking Proficiency and Its Features

Yo In’nami, Chuo University

The complex, multifaceted construct of L2 speaking has been examined in relation to overall oral proficiency and four of its subcomponents (i.e., vocabulary knowledge, grammar knowledge, working memory, and metacognition; Jeon, In’nami, & Koizumi, 2016). Another traditional and yet equally important approach to understanding L2 speaking is to investigate the features of oral production itself, such as fluency, accuracy, grammatical complexity, appropriateness, and pronunciation (Fulcher, 2003; Luoma, 2003). Despite attempts to examine how these features are related to overall L2 speaking proficiency (e.g., Iwashita, Brown, McNamara, & O’Hagan, 2008; Sato, 2012), results have not always been consistent across studies, making it difficult to understand the relative contribution of each oral feature to overall L2 speaking. This suggests the need to closely inspect previous studies and examine potential variables that moderate the relationship between oral features and overall L2 speaking. In order to investigate the relative importance of various oral features in determining overall L2 proficiency, and to examine moderating variables that may explain systematic variation across study findings, we conducted a meta-analysis of correlation coefficients. A search was carried out on 10 databases and 24 journals for related studies. For each study, correlations between the aforementioned oral features and overall L2 speaking proficiency were coded to synthesize the relationship across studies. To examine whether the synthesized mean correlations differ according to study features, moderator variables (e.g., learners’ ages, elicitation task types, presence/absence of rater training, publication types) were coded. Areas in need of further research and improvement in L2 speaking studies are discussed.
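Since the synthesis pools correlation coefficients across studies, the following minimal sketch (not the authors' code; the correlations and sample sizes are invented placeholders) shows a standard fixed-effect Fisher-z aggregation of the kind such a meta-analysis typically relies on.

```python
# A minimal sketch of fixed-effect pooling of study-level correlations via
# Fisher's z transformation. All input values are illustrative, not study data.
import numpy as np

def pooled_correlation(rs, ns):
    """Pool correlations, weighting each study by n - 3 (inverse variance of z)."""
    rs, ns = np.asarray(rs, dtype=float), np.asarray(ns, dtype=float)
    zs = np.arctanh(rs)                     # r -> Fisher z
    weights = ns - 3.0                      # 1 / Var(z), since Var(z) = 1 / (n - 3)
    z_bar = np.sum(weights * zs) / np.sum(weights)
    se = np.sqrt(1.0 / np.sum(weights))
    ci = np.tanh([z_bar - 1.96 * se, z_bar + 1.96 * se])  # back-transform the CI
    return np.tanh(z_bar), ci

# Hypothetical correlations between a fluency measure and overall speaking scores
# reported by three studies, with their sample sizes.
r_mean, (lo, hi) = pooled_correlation([0.45, 0.60, 0.52], [40, 85, 120])
print(f"pooled r = {r_mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```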

Talk 8: How to Tell if You Grew up Speaking Spanish: Using Corpora and DIF to Distinguish Heritage and L2 Speakers’ Language Knowledge

Melissa Bowles, University of Illinois at Urbana-Champaign

Language assessment is dominated by tests of English, both large- and small-scale. The vast majority of journal articles in assessment are on English, with a survey of articles from the 1980s to the present showing that just 3.52% of papers in Assessing Writing, 10.29% of papers in Language Assessment Quarterly, and 12.18% of papers in Language Testing focused on languages other than English (Yan & Bowles, 2019). Yet, there are unique assessment issues to be addressed in such contexts, including the topic of my talk, assessing heritage language speakers. I will define the term heritage speaker, and then focus on the research to date on the assessment of the most widely spoken heritage language in the US, Spanish. Specifically, I examine how heritage speakers of Spanish are identified and placed into Spanish language courses at the university level and report on the development of a Spanish placement test at my institution, which sought to distinguish heritage speakers from non-heritage (second-language) learners based on test performance. Specifically, I focus on a DIF analysis of items targeting early-acquired vocabulary, which addresses some of the shortcomings of prior heritage placement tests, such as being appropriate for just one specific variety of Spanish.
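For readers unfamiliar with DIF screening, the sketch below shows one common approach, a Mantel-Haenszel common odds ratio computed over total-score strata; it is a hypothetical illustration with invented counts and is not drawn from the presenter's placement test.

```python
# A hedged sketch of Mantel-Haenszel DIF screening for one dichotomous item.
# Examinees are stratified by total score; within each stratum we compare the
# odds of a correct response for the reference group (heritage speakers) and
# the focal group (L2 learners). All counts below are hypothetical.
def mantel_haenszel_odds_ratio(strata):
    """strata: list of 2x2 tables [(ref_correct, ref_wrong), (focal_correct, focal_wrong)]."""
    num = den = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

# Hypothetical counts for one early-acquired vocabulary item, three score strata.
tables = [
    [(18, 2), (10, 10)],   # high-scoring stratum
    [(12, 8), (6, 14)],    # middle stratum
    [(7, 13), (3, 17)],    # low-scoring stratum
]
or_mh = mantel_haenszel_odds_ratio(tables)
print(f"MH common odds ratio = {or_mh:.2f}")  # values far from 1 flag potential DIF
```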

 

Talk 9: Understanding and Assessing the Repertoire of an Indian Multilingual Through a Speaking Test

Rama Mathew, (formerly) Delhi University

Educated multilinguals (MLs) in India are known to have competence not only in their own first and/or local language(s) but also in English as their lingua franca. They use these languages in various combinations through code-switching, code-mixing and code-meshing, which results in translanguaging in different contexts for different purposes. They might do this for various reasons, for example due to lexical/syntactic gaps, as in the case of a limited or not-so-proficient user of the language(s), or to use words/phrases/idiomatic expressions for effect, especially when the addressee shares one or more of these languages. The average ML user, regardless of the level of proficiency in any of the languages being used, uses one or more language-switch strategies to establish or negotiate his/her ML identity, as a strategy of neutrality, or as a means to explore which language is best suited in which context and for what purpose. Whatever the reason behind such borrowings or switches, translanguaging, i.e. moving freely within, between and among languages even in formal contexts, is the norm rather than the exception.

This is perhaps not-so-complete a description of how MLs function. Nevertheless, the question of how to assess their English proficiency through English tests for communicative purposes is quite a tricky proposition. Given that we would like to assess proficiency of learners in authentic and natural situations, we will need to accommodate the ML nature of their language use as far as possible. This raises several questions especially in the case of speaking tests.

We introduced in Delhi (India) a speaking test as part of an English Proficiency course which students found very useful and fun to engage with. We also developed an assessment scale that took into consideration students’ needs in terms of their first language(s), cultural contexts and nuances, and English language levels. Thus we are on our way to creating a framework that is essentially Indian and usable locally. This also helps us to keep the international benchmarks as our goal posts and the process of localizing the international benchmarks might be of value to other assessors in the region.

However, while it is easy to see the need to allow ML use in English speaking tests at lower levels such as A1 and A2, we are not sure whether translanguaging will automatically stop, or should be forcibly stopped, as learners progress to higher levels of English proficiency.

I will demonstrate through examples how we assess our learners on speaking using a multilingual scale. While further work is needed in the area, we are hopeful that such efforts will help policy makers realise the need for a Common Indian Framework of Reference vis-à-vis performance assessment in India.

 

Talk 10: Sources of Validity Evidence for the Interpretation of CU-TEP Cut Scores

Jirada Wudthayagorn, Chulalongkorn University Language Institute

The Chulalongkorn University Test of English Proficiency (CU-TEP) is one of the best-known locally developed tests in Thailand. It has been used by many institutions and organizations for decades to assess adult Thai EFL learners. The test contains 120 items in a four-option multiple-choice format, assessing the receptive skills of listening and reading. Because the CEFR has been adopted as an English language policy in Thailand, the CU-TEP was mapped to the Common European Framework of Reference (CEFR), published by the Council of Europe (2001) (Wudthayagorn, 2018). Standard setting using the Yes/No Angoff technique was carried out. The CEFR levels A2, B1, B2, and C1 correspond to cut scores of 14, 35, 70, and 99, respectively. Based on university records, the majority of the university students who took the test in the academic year 2015 were at B1 (67.99%), followed by B2 (15.84%), A2 (15.06%), and C1 (1.11%). This is crucial because the policy suggests that upon graduation students should perform at least at B2.
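As a concrete illustration of how the cut scores above are applied, the following sketch maps a raw CU-TEP score to a CEFR level; the function is a hypothetical transcription of the stated cut scores, not the operational scoring procedure.

```python
# Illustrative mapping of a raw CU-TEP score (0-120) to a CEFR level, using the
# cut scores reported in the abstract: A2 >= 14, B1 >= 35, B2 >= 70, C1 >= 99.
# This is a sketch of the stated table, not the operational procedure.
def cefr_level(raw_score: int) -> str:
    cuts = [(99, "C1"), (70, "B2"), (35, "B1"), (14, "A2")]
    for cut, level in cuts:
        if raw_score >= cut:
            return level
    return "below A2"

print(cefr_level(82))   # -> "B2", the level the policy expects of graduates
print(cefr_level(40))   # -> "B1"
```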

Questions arise. Is the cut score of 70 valid for B2? Can we infer that a student who is placed at B2 has a high degree of speaking and writing ability? In this study, twenty fifth-year students of the Faculty of Dentistry were selected based on their CU-TEP scores, which indicate proficiency at B2. They performed a speaking task (i.e., a dentist-patient role play) and a writing task (i.e., writing a referral letter). Initial results demonstrate that some seem good enough to be classified as B2 students; others do not. Evidence of their speaking and writing skills is presented. Validity evidence is the focus of discussion. Adjustment of the cut scores is also proposed.

Day 3

Talk 11: Source-based Writing: Some Reflections on Task Design and Scoring Decisions

Atta Gebril, The American University in Cairo

Source-based writing has received substantial interest in language assessment with the adoption of integrated tasks by several testing programs worldwide. Writing from sources is a critical skill in academic settings, where students are required to synthesize information from external materials. For this reason, academic language tests attempt to tap into sub-skills related to this construct. Testing practices following an integrated approach are supported by research evidence citing a wide range of advantages: for example, authenticity, since such tasks replicate language use in academic settings, and fairness, as external sources provide background knowledge to test takers who are not familiar with the assigned topics. However, source-based writing is not without problems and usually comes with a host of challenges for language professionals. The current presentation addresses concerns raised about the design and scoring of source-based writing tasks, drawing on research results from a number of studies conducted by the presenter and his colleagues (e.g., Gebril & Plakans, 2009, 2013, 2014, 2016; Ohta, Plakans, & Gebril, 2018; Plakans & Gebril, 2012, 2013, 2017). More specifically, the presenter will share results on writers’ processes while working on integrated tasks and on how test takers employ source materials, linking these issues to the design of integrated tasks. In addition, results of a series of studies targeting the scoring of integrated tasks will be discussed, focusing on raters’ decision-making processes, the challenges they encounter, the features they attend to, and the use of different scoring rubrics in this context. The presenter will conclude by providing a number of practical implications for L2 writing instructors, curriculum designers, and assessment specialists.

Talk 12: Prisoner’s Dilemma: An Analysis of Parents’ Perceptions of the Hong Kong Territory-Wide Systematic Assessment

Qin Xie, Education University of Hong Kong

This research originated from a recent public revolt against the Territory-wide Systematic Assessment (TSA) in Hong Kong. The TSA was introduced by the Hong Kong government as an accountability measure of school effectiveness; it assesses student achievement in Chinese, English, and Mathematics at the end of Key Stages 1-3. The TSA, however, has long been perceived negatively by teachers as bringing about extra workload and pressure and promoting teaching to the test. This tension reached a climax in late 2015 when anti-TSA campaigns organized by parents attracted massive media exposure.

Within language assessment, little research has focused on parents, who are key stakeholders in school assessment; parents’ views on school assessment have seldom been heard or documented. This research investigated parents’ perceptions of school assessment and of the TSA. The prisoner’s dilemma was adopted as an analytical lens to understand how individuals’ rational choices made in self-interest may end up as a bad decision for the group, and to examine the conflicts between individual and group rationality.

Talk 13: The Application of Corpora in Minoritized Language Contexts: Supporting and Informing the Pedagogic Landscape

Dawn Knight, Cardiff University

Corpora and their associated concordancing software help to generate empirically based, objective analyses of how language is actually used. Such corpus-based methods have revolutionised the investigation of language, and the significance and reach of their applications continue to grow. In language learning and pedagogy contexts specifically, corpora have an established role in the creation of teaching materials and are also used to support reference and assessment resources. This paper provides an overview of some of the key concepts and applications of corpus-based research and teaching to date, then details a specific case study of the use of corpora for pedagogic purposes in the Welsh language context: the use of the National Corpus of Contemporary Welsh (Corpws Cenedlaethol Cymraeg Cyfoes – CorCenCC, www.corcencc.org).

CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes) is a principled collection of ten million words of Welsh-language data. A key innovation of CorCenCC is that it is the first corpus integrated with an online Welsh-based pedagogical toolkit that works directly with the corpus data to support language learning and teaching. Inspired by the online Compleat Lexical Tutor data-driven language learning toolkit (https://www.lextutor.ca/), the CorCenCC toolkit includes a Vocabulary Profiler, a frequency-based cloze (gap-fill) exercise builder, a reading comprehension assistant, and a user-built lexicon with quiz builder. The paper will provide a detailed overview of these functionalities in use and demonstrate how they may be practically implemented in educational contexts to facilitate the teaching, learning, and testing of Welsh for students of different ages and levels of ability.
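To make the idea of a frequency-based cloze builder concrete, here is a toy sketch (in Python, not the CorCenCC toolkit itself) that blanks out words ranking among the most frequent items in a reference word list; the word list and passage are invented examples.

```python
# A toy sketch of a frequency-based cloze (gap-fill) exercise builder: replace
# words that appear in a high-frequency reference list with blanks and keep an
# answer key. Not the CorCenCC implementation; all inputs are invented.
import re

def build_cloze(passage, frequent_words, max_gaps=3):
    """Replace up to max_gaps occurrences of high-frequency words with blanks."""
    answers, gaps = [], 0
    tokens = re.findall(r"\w+|\W+", passage)   # split into words and separators
    for i, tok in enumerate(tokens):
        if gaps >= max_gaps:
            break
        if tok.lower() in frequent_words:
            answers.append(tok)
            tokens[i] = "_____"
            gaps += 1
    return "".join(tokens), answers

frequent = {"language", "corpus", "learning"}
text = "A corpus supports language learning by showing how language is used."
cloze, key = build_cloze(text, frequent)
print(cloze)  # A _____ supports _____ _____ by showing how language is used.
print(key)    # ['corpus', 'language', 'learning']
```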

CorCenCC’s pedagogic toolkit works directly with the data to support language teaching and learning and, specifically, to help improve vocabulary knowledge and reading/writing skills in Welsh. However, this pedagogic toolkit also has impact beyond the Welsh language context. Its transformative methods for corpus creation represent a template for use in any language, in particular lesser-used, minoritised, or under-resourced languages, and constitute a blueprint for the development of corpus toolkits for other user groups (e.g. translators, publishers, new researchers) beyond the life of this project.

Talk 14: Revisiting the Vertices: Curriculum Design-Language Teaching-Language Assessment

José Pascoal, University of Macau

The CAPLE-ULisbon exams and the CELPE-Bras exam are the best-known exams of Portuguese. In this presentation I will concentrate on two CAPLE-ULisbon exams: TEJO and the two versions of CIPLE, P and oral. I will use them to revisit the triangle curriculum design-language teaching-language assessment (CV-LT-LA). I will use localization (O’Sullivan, 2013) to rethink the validity argument for a test and claim that a plurilingual/multilingual assessment may be the best response to language use in work contexts.

TEJO is a test administered in Portugal, in schools where French is the language of schooling, to pupils aged 9-11. Their home language is usually French, but Arabic or an African language is also possible. In many cases they have more than one home language, which means that they may have one or two first languages. For professional reasons their parents work in different countries, and the children attend schools with a French curriculum. They learn English and Portuguese as modern foreign languages. The curriculum for these two languages is CEFR-based, and they are expected to reach A1 when they leave primary school at age 11. But Portuguese is not available in education systems, and Portuguese is the language used in the country where they live. How do these elements affect CV-LT-LA?

CIPLE is the A2 test of the CAPLE-ULisbon A1-C2 suite. In 2013, CAPLE-UL implemented the CIPLE-P version and, in 2016, the CIPLE oral version. P stands for Portugal and is a synonym for integration; oral means that the test has only the speaking component.

CIPLE was designed for (young) adults who learn Portuguese for work or study reasons. In 2012, the government decided to change the law regulating access to Portuguese citizenship, and CIPLE became the most sought-after exam. It is possible to provide evidence of this minimum competence in Portuguese using other tools, but they are not available. How was the exam affected by this new population? What means were given to implement a research-driven new test for integration purposes? What measures were put in place to prevent a negative impact and to avoid exclusion due to candidates’ lack of preparation to sit the exam?

I will present two research studies conducted in 2014 to claim a plurilingual/multilingual approach to language assessment as a means towards equity and fairness in language education and language assessment.

Talk 15: Who Determines Proficiency? Communicative Effectiveness as a Real-world Criterion for Speaking and Writing Ability

Jonathan Schmidgall, Educational Testing Service (Princeton, NJ)

One of the most important tasks for testing researchers is evaluating the extent to which test-based inferences about abilities accurately predict real-world outcomes. This aspect of validation is critical for establishing the usefulness of a test and has been traditionally conceptualized as predictive validity (Cureton, 1951) and more recently as extrapolation (Kane, 1992) or generalizability (Bachman & Palmer, 2010). In language testing, obtaining an appropriate real-world criterion or outcome measure is often challenging. Some outcome measures are relatively easy to obtain but are influenced by many factors unrelated to language tests (e.g., GPA for tests of academic English; course evaluations for tests of oral proficiency for ITAs). Consequently, the ability of language tests to predict these outcomes is typically limited (e.g., Bridgeman, Cho, & DiPietro, 2015). This challenge is complicated in contexts where English is often used as a lingua franca (Kankaanranta & Louhiala-Salminen, 2010), and some researchers have argued that differences between the background characteristics of raters and real-world interlocutors undermine validity (e.g., Jenkins & Leung, 2014).

A more practical approach to this type of validation research for assessments of speaking and writing ability may be through the use of impressionistic judgments of communicative effectiveness by potential real-world interlocutors, or linguistic laypersons. In this presentation, I will discuss the benefits and limitations of this approach, focusing on several research studies which explore the relationship between impressions of communicative effectiveness by linguistic laypersons and assessments of English speaking ability for international teaching assistants (i.e., English for academic purposes) and English speaking and writing ability in the international workplace (i.e., English for occupational purposes).

Talk 16: A Process-oriented Investigation of the Efficacy of Cambridge English Write & Improve® as a Diagnostic Assessment Tool in a Chinese EFL Context

Sha Liu, University of Bristol

Jing Xu, Cambridge Assessment English

Guoxing Yu, University of Bristol

The literature on the use of Automated Writing Evaluation (AWE) has shown that the way AWE is implemented may affect how ESL/EFL students engage with AWE-generated feedback. Although a growing number of researchers suggest using AWE for diagnostic assessment of ESL/EFL writing, there has yet to be any research that explores the use of AWE as a diagnostic assessment tool. As part of a larger research project, this paper reviews the design and theoretical basis of Cambridge English Write & Improve® (hereafter Write & Improve®) and reports a small-scale study that investigated its efficacy as a diagnostic assessment tool in a Chinese EFL context. Write & Improve® is a free online platform for learners of English to practise English writing and make improvements based on the instant diagnostic feedback the system generates. Two Chinese EFL students enrolled in a Master’s programme at a British university were invited to write essays on the Write & Improve® website and revise their drafts based on the automated feedback received. Data from multiple sources were collected. The two participants’ eye movements when viewing and responding to the feedback were captured by an eye-tracking device, based on which subsequent stimulated-recall interviews were conducted. During the interviews, participants discussed their general impression of the accuracy, adequacy, clarity, and usefulness of the feedback and reflected on their process of utilising the feedback to revise their essays. Upon completion of the interviews, their essay data were exported from Write & Improve® for analysis. The study found that Write & Improve® prompted active revision behaviours and generally provided accurate feedback on word- and sentence-level errors that led to successful revisions in most cases. However, such feedback was relatively sparse compared to that generated by other AWE systems. In addition, the indirect feedback in the form of sentence highlighting was not found to be effective, in that the participants had difficulty interpreting the information. The implications of the findings for improving Write & Improve® and for the appropriate use of the platform in writing instruction are discussed.

Day 4

Talk 17: Classroom Assessment of L2 English Presentation Skills Using a Textbook-based Task and Rubric

Rie Koizumi, Juntendo University

Ken Yano, Taga Senior High School

Assessing, as well as teaching, speaking in English as a second language (L2) is encouraged in the classroom because there are increasingly more opportunities outside the classroom for native and nonnative speakers of English to interact in English. However, speaking assessment is not conducted regularly in Japanese senior high schools (SHSs). To increase and improve speaking assessment practices at Japanese schools, various measures have been planned and implemented. At the national level, knowledge and skills of English assessment will be incorporated as essential components in the Core Curriculum in pre-service and in-service teacher training programs. Books on the theory and practice of speaking assessment are available for a worldwide audience (e.g., Fulcher, 2003; Luoma, 2004; Taylor, 2011), including those aimed at English instructors in Japan (e.g., Koizumi, In’nami, & Fukazawa, 2017; Talandis, 2017). Furthermore, previous studies provide useful hints that can help SHS teachers learn about speaking assessment (e.g., Nakatsuhara, 2013; Ockey, Koyama, Setoguchi, & Sun, 2015). However, these resources are not clearly linked to the textbooks authorized by the Ministry of Education in Japan and used in daily lessons at schools. An explicit association between instruction and assessment is needed for formative and summative speaking assessment in the classroom. Therefore, a study of the development and examination of a speaking assessment task and a detailed rubric based on an authorized textbook would help fill this void. The current study addresses this issue by introducing an instance of classroom speaking assessment and showing detailed procedures and outcomes based on the analysis of the test data. Presentations by 64 students were evaluated by two raters using two rating criteria, Task Achievement and Fluency. Analysis of the scores using many-facet Rasch measurement showed that the test functioned well in general, and the results of a posttest questionnaire suggested that students generally perceived the test positively. The favorable reactions to the speaking test and the test scores in the current study, in combination with more studies using different types of speaking test formats and rubrics, would help English teachers better understand the feasibility of conducting tests and assessing speaking effectively.

 

Talk 18: Exploring the Construct Validity of Paraphrasing

Emily Zhang Di, University of Macau

Paraphrasing is of critical importance in source use practice to avoid plagiarism, yet its construct remains slippery, making the relevant learning, teaching, and testing practices difficult to operationalize. The present study sets out to explore the construct validity of paraphrasing. A total of 226 first-year non-English-major college students were recruited to respond to a seven-item paraphrasing task. Two types of data were collected: product and process data. The product data were the test-takers’ written paraphrase responses, which were coded according to a proposed coding scheme of four hypothesized variables of paraphrasing competence and rated on four holistic scales. The process data were test-takers’ think-aloud reports of their thinking processes while responding to the task, as well as strategy use elicited by a paraphrasing strategy use inventory. The findings are as follows. From the perspective of product, correlation and stepwise regression analyses showed that conceptual transformation has the highest predictive power for the construct of paraphrasing, followed by lexical transformation and syntactic transformation, indicating that comprehension occupies a central place in paraphrasing competence and that vocabulary and grammar are important components of the construct. However, transformation extent has no predictive power, because L2 writers tend to rely heavily on verbatim source use. From the perspective of process, participants indeed employed many paraphrasing strategies. Cognitive strategies, including analyzing, summarizing, and translating strategies, were most frequently used, followed by metacognitive strategies, including evaluating and monitoring strategies. Overall, the cognitive processes of comprehension, application, analysis, synthesis, and evaluation in Bloom’s Taxonomy of Educational Objectives were reflected in paraphrasing. However, structural equation modeling showed that these strategies exert no significant effect on test-takers’ paraphrasing performance, which may be attributed to factors such as the task-specificity of strategies and task familiarity.

Talk 19: Developing and Validating a Rating Scale of Speaking Prosody Ability for Speakers of Chinese as a Second Language

Sichang Gao & Mingwei Pan, Shanghai International Studies University

Prosody helps speakers express meaning, attitude, and mood. Suprasegmental features, as measures of prosody, are attracting growing attention in the fields of speaking assessment and second language acquisition. This study therefore sets out to develop a scale that measures speaking prosody ability for learners of Chinese as a Second Language (CSL). A “descriptor pool” was first generated from two sources. One source was interview data from ten assessment experts, who gave subjective descriptions of CSL learners’ prosody performance. The other was an eclectic analysis of the available Chinese speaking proficiency scales from five universities in Mainland China. Altogether 47 descriptors, considered to be clear, useful, and unambiguous, were then selected to form a questionnaire. Ninety-four CSL teachers were asked to rate their perceptions of the selected descriptors on Likert-scale questionnaires. Item analysis showed that 23 of the 47 descriptors survived, given average scores of 3.5 or above, meaning that they were perceived as crucial indicators of prosody ability. Exploratory factor analysis was then conducted, which extracted four dimensions with 13 descriptors after stepwise elimination. The four dimensions of Chinese language learners’ speaking prosody ability were: 1. prosodic strategic competence (use of intonation to express mood; use of stress to express emotion and attitude; mood naturalness); 2. fluency (appropriate speech rate; accurate sentence segmentation; accurate rhythm; accurate prosodic boundaries; no long pauses); 3. stress naturalness (rhythm naturalness; accent naturalness); 4. tone accuracy (accurate pronunciation; tone accuracy). The weightings of the four dimensions on a free-talk task were further explored using multiple linear regression analysis. The fitted model was: prosodic ability score = 14.49 + 0.16 × prosodic strategic competence + 1.29 × fluency + 1.43 × stress naturalness + 0.30 × tone accuracy + error. As such, fluency and stress naturalness appear to be the two significant predictors of prosody ability. The findings are of importance to the teaching of CSL speaking, in the sense that stress naturalness should receive due attention in CSL classrooms.
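As a worked illustration of how the fitted equation above would be applied, the sketch below plugs invented dimension scores into the reported coefficients; it is an illustration of the arithmetic only, not the authors' scoring tool.

```python
# Illustrative application of the regression model reported in the abstract:
# prosodic ability = 14.49 + 0.16*S + 1.29*F + 1.43*N + 0.30*T (+ error),
# where S = prosodic strategic competence, F = fluency, N = stress naturalness,
# T = tone accuracy. The input ratings below are invented example values.
def predict_prosody(strategic, fluency, stress_naturalness, tone_accuracy):
    return (14.49
            + 0.16 * strategic
            + 1.29 * fluency
            + 1.43 * stress_naturalness
            + 0.30 * tone_accuracy)

print(predict_prosody(3.0, 4.0, 3.5, 4.0))  # -> about 26.3 for these made-up ratings
```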