This proceedings volume covers issues of learner corpus design, collection and annotation and contains reports on various aspects of (written and spoken) learner interlanguage as well as design of learner-corpus-informed tools.
A comparison of learner and native speaker writing in online self-presentations:
This paper investigates the language used by both learners and native speakers of English when introducing themselves to peers in an online community, and then goes on to discuss the pedagogical potential of the findings. A small corpus of self-presentations written by 220 first-year students majoring in English at an Italian university was compiled during the 2009-2010 academic year. The learner corpus was compared with a reference corpus consisting of self-presentations produced by native speaker students in higher education in English-speaking countries and posted on online forums. The paper first considers why it is important that language majors aim to write in a way that is appropriate to a given genre, rather than merely focusing on morpho-syntactic accuracy. It then focuses on aspects of divergence between learner and native speaker production, presenting some of the linguistic choices made by learners when presenting themselves to peers. It goes on to discuss how the creation of awareness-raising materials based on the analysis can enhance learning by directing students' attention towards the differences between their texts and those of native speaker students.
Theodora ALEXOPOULOU, Helen YANNAKOUDAKIS & Angeliki SALAMOURA
Classifying intermediate learner English: A data-driven approach to learner
We demonstrate how data-driven approaches to learner corpora can support Second Language Acquisition research when integrated with visualisation tools. We employ a visual user interface supporting the investigation of a set of automatically determined features discriminating between pass and fail First Certificate in English (FCE) exam scripts. We illustrate how the interface can support the investigation of individual features. The analysis of the most discriminative features indicates that the development of grammatical categories allowing reference to complex events, referents and discourse relations is a crucial property of the upper-intermediate level.
Margit BRECKLE & Heike ZINSMEISTER
L1 transfer versus fixed chunks: A learner corpus-based study of L2 German
This study deals with the question of what strategies Chinese L2 learners of German follow when starting a declarative sentence in German. The investigation is based on the ALeSKo corpus, a linguistically annotated learner corpus of written German. In previous studies, we observed that the L2 texts show a significant overuse of sentences that start with an information-structural function in comparison to comparable L1 texts. In this paper, we pursue an alternative line of explanation that explores whether the observed difference is due to an overuse of chunks in the L2 texts. We perform a chunk classification and also automatically detect all material copied from the title and the task description – a particular type of chunk. Our findings indicate that although L2 learners use chunks to a substantial degree, an overuse with respect to the beginnings of the sentences could not be confirmed.
Julian BROOKE & Graeme HIRST
Native language detection with ‘cheap’ learner corpora
We begin by showing that the best publicly available, multiple-L1 learner corpus, the International Corpus of Learner English (Granger et al. 2009), has issues when used directly for the task of native language detection (NLD). The topic biases in the corpus are a confounding factor that results in cross-validated performance that appears misleadingly high, for all the feature types which are traditionally used. Our approach here is to look for other, cheap ways to get training data for NLD. To that end, we present the web-scraped Lang-8 learner corpus, and show that it is useful for the task, particularly if large quantities of data are used. This also seems to facilitate the use of lexical features, which have been previously avoided. We also investigate ways to do NLD that do not involve having learner corpora at all, including double-translation and extracting information from L1 corpora directly. All of these avenues are shown to be promising.
Marcus CALLIES & Ekaterina ZAYTSEVA
The Corpus of Academic Learner English (CALE) – A new resource for the study and assessment of advanced language proficiency
This paper introduces the Corpus of Academic Learner English (CALE), a Language for Specific Purposes learner corpus that is currently being compiled for the quantitative and qualitative study of advanced learners' written academic English. CALE is designed to comprise seven academic genres produced by learners of English as a foreign language in a university setting and thus contains discipline- and genre-specific texts. The corpus will serve as an empirical basis to produce detailed case studies that examine linguistic determinants of lexico-grammatical variation, i.e. semantic, structural, discourse-motivated and processing-related factors that influence constituent order and the choice of structural variants, but also those that are potentially more specific to the acquisition of L2 academic writing such as task setting, genre and writing proficiency. Another major goal is to develop a set of linguistic criteria for the assessment of advanced proficiency conceived of as "sophisticated language use in context".
Integrating learner corpus data into the assessment of spoken interaction in English in an Italian university context
This paper reports on ongoing research conducted at the University of Padua on the teaching and assessment of spoken interaction in English at level B2 of the Common European Framework of Reference for Languages (CEFR, Council of Europe 2001). The study is mainly based on a small learner corpus (about 18,000 words) composed of transcripts of interactions between second-year English as a Foreign Language (EFL) students recorded during assessment sessions. It presents the context of the interactions, the corpora used and the results of a series of investigations carried out into some pragmatic aspects of the interactions. The paper then explores how these findings can help us to flesh out the construct for ‘Discourse Management’ and, ultimately, to set more reliable scoring criteria.
Intonational phrasing as a potential indicator for establishing prosodic learner profiles
Prosodic profiles have been extensively used in forensics and language pathology. However, they have rarely been used in second language acquisition so far. The aim of this paper is to show how prosody can be used to define learner profiles, possibly their learning styles and their different cognitive abilities. It is our claim that different segmentation modes of utterances define different prosodic learner profiles and we aim to characterise these. We will show that prosodic profiles of French learners of English can be drawn on the basis of phrasing and that a cluster of prosodic properties corroborates this typology. Our analysis is first based on read speech and the subsequent classifications on recorded interviews of the same speakers. It reveals the limitations of the phonological assessment criteria that the Common European Framework of Reference for Languages (CEFRL) (Council of Europe 2001) advocates and makes a good case for reconsidering them.
Phrasal verbs in a longitudinal learner corpus: Quantitative findings
This study analyses Chinese learners’ use of phrasal verbs from a longitudinal perspective. Through a comparison of the learners’ output of phrasal verbs with that of two groups of native English speakers (American university students and British secondary school leavers), Chinese learners were found to be capable of producing an adequate number of phrasal verbs. Yet, they did not demonstrate appropriate choice of phrasal verbs. The longitudinal data reveal that the learners’ acquisition of phrasal verbs during their three years of study was not always linear. A considerable decrease in the number of phrasal verbs used in the students’ writing in their second year was noticed. No considerable increase in the use of phrasal verbs was observed at the end of their third year. Another important finding of this study is that the American students tend to use far more phrasal verbs than their British and Chinese counterparts.
Pieter DE HAAN & Monique VAN DER HAAGEN
The search for sophisticated language in advanced EFL writing: A longitudinal study
Even very advanced EFL writing tends to be less sophisticated than native writing. One of the problems seems to be finding the right collocations and the correct register. The aim of this article is to pinpoint what characterizes the development in very advanced Dutch EFL students’ written language production, more specifically the use of appropriate intensifiers. Compared to their native English speaking contemporaries, the Dutch students initially tend to use intensifiers that are found typically in spoken English, such as really and a bit, but these gradually disappear. Alternatively, as students progress, the use of the intensifiers so, quite, and rather, becomes more native-like. A qualitative analysis of a selection of essays written by four individual students shows that some students get more out of academic input than others.
Deise P. DUTRA & Tony Berber SARDINHA
Referential expressions in English learner argumentative writing
The aim of this paper is to report the findings of our investigation of lexical bundle types in learner argumentative writing. Our data consisted of the International Corpus of Learner English (ICLE), the Louvain Corpus of Native English Essays (LOCNESS), and Br-ICLE, the Brazilian sub-corpus of ICLE. Our classification followed the functional taxonomy proposed by Biber et al. (2004) and expanded by Simpson-Vlach & Ellis (2010). First, 3-, 4- and 5-word bundles were extracted and then categorized, manually and automatically, into broad categories (referential expressions, stance expressions and discourse organizing functions) as well as 18 specific subcategories (e.g. intangible and tangible framing attributes and quantity specification). Second, the most frequent categories in each corpus were identified. Third, we focused on the most frequent one: referential expressions. Fourth, the chi-square test, cluster analysis and ANOVA were used to detect significant differences across corpora. The subcategories that contributed the most to statistically significant differences across corpora were: specification of intangible framing attributes, identification and focus, and contrast and comparison. The results also show that there is more internal lexical variation of nouns in the intangible framing attribute bundles produced by native than non-native speakers. The conclusions are that referential expressions might need to receive more attention in pedagogical contexts so their discourse functions become more salient to learners.
Investigating lexical difficulties of learners in the error-annotated UPF learner translation corpus
The aim of this article is two-fold. First, it describes the learner translation corpus developed at the Universitat Pompeu Fabra School of Translation and Interpreting (UPF-LTC). A learner translation corpus is a corpus of translations written by students; the UPF-LTC has two search configurations: as a bilingual, sentence-aligned, English-Catalan translation corpus and as a monolingual Catalan translation corpus. It has been annotated both with linguistic information and with error tags according to a set taxonomy of translation errors. The second aim is to illustrate the applications of the corpus for research into the types of translation errors involving lexical use such as false friends and deficient or imprecise lexical choices. The results are relevant not only for the didactics of translation but also for translation-oriented bilingual lexicography.
Michael FLOR & Yoko FUTAGI
Producing an annotated corpus with automatic spelling correction
This paper describes ConSpel, a software system for automatic detection and correction of non-word misspellings. We also present an ongoing research project for constructing an ETS (Educational Testing Service) Spelling Corpus. The corpus consists of essays written by native and non-native speakers of English to the writing prompts of TOEFL® and GRE® tests. Essays are annotated for misspellings by trained annotators, using a semi-automated methodology. An evaluation of the ConSpel system was conducted, using the data from the completed phase of the annotation project. The ConSpel system achieves above 95% accuracy in error detection. The evaluation also indicates that an advanced correction algorithm, which takes into account the local context of misspellings, achieves correction accuracy of 77% and consistently outperforms a baseline context-blind approach.
If-conditionals in ICLE and the BNC: A success story for teaching or learning?
This paper aims to contribute to the methodological toolbox of "pedagogy-driven corpus-based research" (Gabrielatos 2006), that is, research which is situated at the intersection of language description, pedagogical lexicogrammar, and pedagogical materials evaluation (e.g. Harwood 2005; Hunston & Francis 1998; Kennedy 1992; Owen 1993). The contribution of the present paper mainly lies in proposing a method of triangulating the corpus-based evaluation of lexicogrammatical information in English as a Foreign Language coursebooks, by way of examining a relevant corpus sample of learner written output.
This and that in native and learner English: From typology of use to tagset characterisation
Learner corpus research is now faced with a multiplicity of tagsets. It is therefore difficult to carry out cross-corpus analysis due to the variety of tags used for each part-of-speech (POS). In this paper, we approach this issue through a specific linguistic case, this and that. We propose a typology of their uses in both native and non-native corpora. Various tagsets are analysed so as to measure the relevance of the linguistic information provided for this and that. Overall, a comparative analysis of this and that in tagsets is proposed and the benefits and flaws of manual fine-grained annotation versus automatic annotation are assessed. This study comes as a first step towards automated annotation of this and that in various corpora as this process would pave the way to corpus interoperability at POS level.
The Lexicon of Spoken Italian by Foreigners: A study on the acquisition of vocabulary by L2 Italian learners between measures of lexical richness and lexical fields
The aim of this paper is to present a corpus-based study of the acquisition of the vocabulary by learners of L2 Italian. The goal of the research is to study the lexical uses of non-native speakers and the processes of lexical acquisition underlying these uses, applying some measures of lexical richness and analysing the lexical fields of the corpus. The informants of the corpus were non-native speakers with different proficiency levels, learning Italian both in Italy and outside of it. The main results show how lexical competence develops above all quantitatively at the beginning and intermediate levels, as well as how it develops qualitatively at more advanced levels in particular. Different learning inputs greatly affect the development of lexical competence: learners acquiring Italian in Italy have a deeper knowledge of the Italian vocabulary compared to learners learning Italian outside of Italy. Regardless of the learning context or proficiency level, the most relevant categories among the lexical fields are those linked to everyday life, whereas those categories linked to more abstract domains are less relevant, but show a higher level of lexical richness compared to categories linked to daily life.
Learners of English and conversational proficiency
This study focuses on the inter-relatedness of fluency and complexity as explanatory factors and criteria for the assessment of conversational proficiency within the framework of two current cognitive models. It has been carried out on a cross-sectional corpus of 28 one-to-one conversations between native English teaching assistants and French English as a Foreign Language (EFL) university students from the DIDEROT-LONGDALE project.
Jonė GRIGALIŪNIENĖ & Rita JUKNEVIČIENĖ
Recurrent formulaic sequences in the speech and writing of the Lithuanian learners of English
The present article reports an investigation of recurrent formulaic sequences (FSs) in the speech and writing of Lithuanian learners of English as a foreign language (EFL). Evidence from corpus research has shown that language makes extensive use of recurrent multi-word units whose successful acquisition contributes to the naturalness of expression and is thus very important in language teaching and learning. The aim of this study is to identify and describe the recurrent FSs in the spoken and written English of Lithuanian EFL learners both quantitatively and qualitatively, and to check whether the current hypothesis that FSs are more frequent in speech than in writing is applicable to the Lithuanian EFL learner language as well. The data for the research come from the Lithuanian component of the International Corpus of Learner English (ICLE), viz. LICLE, and a pilot version of LINDSEI-LITH, the Lithuanian component of the Louvain International Database of Spoken English Interlanguage. The findings of the study show that although the speech of Lithuanian EFL learners is more formulaic than their written language, there is a considerable overlap between spoken and written language in terms of formulaicity. The learners have built a core set of FSs which recur both in speech and writing. The most frequent FSs in writing are expressions of discourse organization while high-frequency FSs in spoken language, which often appear in clusters of several FSs, usually indicate the speaker’s hesitation and uncertainty.
Hagen HIRSCHMANN, Anke LÜDELING, Ines REHBEIN, Marc REZNICEK & Amir ZELDES
Underuse of syntactic categories in Falko: A case study on modification
This paper shows how the automatic syntactic analysis of a corpus of advanced learners of German as a foreign language helps in understanding the acquisition of modification. In former corpus research modification has been studied only by comparing the distributions of single words (or groups of words) in learner and native speaker data. We argue that in order to study modification as a syntactic category it is necessary to work with syntactically analyzed corpora. In this vein, we sketch out our approach to parsing learner language and conduct two contrastive interlanguage studies on modification in the syntactically annotated corpus, showing that not only lexical modifiers can be underused (as shown in many other studies), but that modification as a whole category (including multi-word modifiers such as prepositional phrases, and clausal modifiers such as relative clauses) is underused in our learner corpus data.
Jarmo Harri JANTUNEN & Sisko BRUNNI
Morphology, lexical priming and second language acquisition: A corpus-study on learner Finnish
The present article discusses morphological priming in the context of second language acquisition. Morphological priming is a characteristic of both the core and cotextual items in a phraseological unit. It occurs when a word is repeatedly encountered in certain inflectional forms. Similarly to lexical priming on the whole (e.g. collocations and other cotextual qualities), it poses challenges for language learners. The paper focuses on atypicalities in morphophonological forms and, in addition, describes errors in inflection. It is hypothesized that learners of Finnish have problems in morphological priming, and that learners whose mother tongue is closely related to the target language and has inflection produce more target-language-like phraseological units.
A beginner French learner corpus
This paper introduces the beginner French learner corpus built at the Centre for Language Learning at the University of the West Indies in Trinidad and Tobago. The primary objective of this project is to improve the way French is taught in this particular Caribbean context. It is original in the sense that it targets learners with a low or intermediate proficiency in French. Since it was collected during a period of two and a half years, the corpus allows for both longitudinal and same-level studies. The interlanguage associated with this specific population of students shows the influence played by the L1 (English) that is sometimes reinforced by that of another prevalent L2 (Spanish). The learners’ productions also point to the strong impact that the textbooks and pedagogical approach to language teaching have on the students’ written production. This research project calls for adapting the teachers’ pedagogy and textbooks in order to help these beginner learners write more accurately and originally right from the beginning of instruction.
Concessive adverbial clauses in L2 academic writing
In a recent study, Wulff & Gries (2011) put forward the constructionist definition of accuracy in L2 production as the selection of a construction in its preferred context within a particular target variety and genre. By focusing on the use of concessive adverbial clauses in L2 academic writing, the current study takes up this definition of accuracy in L2 production and sets out to explore whether, and to what extent, the ‘genre-specific constructicon’ (i.e. genre-specific repository of symbolic form-function alignments) of advanced German learners of academic English is similar to or different from that of native expert academic writers of English. To this end, all instances of concessive adverbial clauses were extracted from a 216,418 word-token learner corpus and coded for the various factors proposed in the literature. For comparison purposes, a data set of all relevant data points was distilled from a native expert corpus of the same size and annotated in terms of the same factors. The two annotated data sets were then submitted to a Hierarchical Configural Frequency Analysis (Gries 2009). A comparison of the findings revealed a slightly different set of ‘entrenched’ adverbial concessive clauses in the learner corpus, suggesting that the learners’ genre-specific panoply of certain constructional types is still not fully established. In accordance with Wulff & Gries (2011), the findings presented here give support to a usage-based constructionist approach as a promising and viable way of measuring accuracy in L2 production.
A comparison of spoken and written learner corpora: Analyzing developmental patterns of vocabulary used by Japanese EFL learners
The purpose of this study is to compare the spoken and written language of Japanese learners of English. The main focus is on the developmental patterns of vocabulary in the different production modes. Two types of learner data were compared in this study. The spoken data were extracted from the National Institute of Information and Communications Technology Japanese Learner English Corpus (NICT JLE Corpus), and the written data were extracted from the Japanese EFL Learner Corpus (JEFLL Corpus). The approach adopted in this research has three characteristics. First of all, it is corpus-based. Second, it focuses on very common word-types. Third, it is based on multivariate analysis. Using these 100 common word-types, I will conduct a correspondence analysis in order to explore complex interrelationships between the word-types and subcorpora in the spoken and written data. The result of this study shows a contrast between spoken and written data as well as a contrast between novice and advanced learners.
Sun-Hee LEE, Markus DICKINSON & Ross ISRAEL
Corpus-based error analysis of Korean particles
We discuss the development of a corpus of learner Korean, performing an error analysis of particle usage with it. Although the corpus was largely developed for the evaluation of natural language processing (NLP) systems – as discussed in Lee et al. (2012) – there are two major design decisions which affect the use of the corpus and its annotation for qualitatively and quantitatively studying learner behavior and which have not been fully discussed before. First is the composition of the corpus, specifically what learner data to include. Second is how we define grammaticality, a particularly thorny problem for error annotation of Korean particles, which are, to some extent, optional. After explaining the nuances of particles in Korean in general, we turn to these two issues and then provide an error analysis, showing the differential error patterns between heritage and non-heritage learners. In particular, particle omission rates differ, illustrating the importance of clearly defining grammaticality for (sometimes) optional elements, both for annotation and for pedagogy.
Stéphanie LOPEZ, Anne CONDAMINES & Amélie JOSSELIN-LERAY
An LSP learner corpus to help with English radiotelephony teaching
The French Civil Aviation University (ENAC) is in charge of the French controllers’ initial training in English and therefore has specific needs in terms of English radiotelephony teaching. Consequently, an observation of the usage of English made by French controllers with international pilots, that is to say ongoing foreign language learners, was initiated. The aim of this project is to describe and categorise the different uses of English within pilot-controller communications by means of a comparative study between two corpora. The ultimate purpose of this comparative analysis is foreign language (English for Specific Purposes) teaching.
Cristóbal LOZANO & Amaya MENDIKOETXEA
Corpus and experimental data: Subjects in second language research
This paper shows how corpus and experimental data can be combined to gain an insight into the processes that shape and constrain second language (L2) acquisition, by focusing on the L1 Spanish – L2 English acquisition of preverbal vs. post-verbal subject position: S-V vs. (XP-)V-S. The initial corpus study (Lozano & Mendikoetxea 2010) revealed that subject position in L1 Spanish – L2 English is constrained by the same principles as in native English (verb type, information structure and phonological weight), but learners show difficulties with the preverbal XP constituent: even advanced learners overuse it as the generic expletive (It occurred many important events) or omit XP (i.e., they use Ø as in Exist other means of obtaining money), while the use of there with verbs other than be is highly limited (There exist about two hundred organizations). To (dis)confirm these corpus findings, a follow-up online experiment was designed to test learners’ (N=250) knowledge of the preverbal XP element in XP-V-S structures whose design was structurally similar to those produced in the corpora (Ø/it/there/PP-V-S). The experimental results show a very robust pattern, which mostly confirms the corpus results. In the conclusion we advocate for the combined use of naturalistic and experimental data in a cyclic fashion.
Cross-linguistic influence on the accuracy order of L2 English grammatical morphemes
Contrary to the accepted notion of the ‘natural order’, which posits a fixed L2 acquisition order of English grammatical morphemes, Luk & Shirai (2009) reviewed the literature and argued that the order may differ depending on learners’ L1. The present study empirically investigates whether the accuracy order of L2 English grammatical morphemes varies across L1 groups. By targeting over 3,000 essays across seven L1 groups in the Cambridge Learner Corpus, the study computed the accuracy of six morphemes in each L1 group and clustered them through statistical bootstrapping. The study then compared the accuracy order of the morphemes between L1 groups and demonstrated clear L1 influence. Overall, the groups whose L1s do not obligatorily mark the morpheme tend to have a lower accuracy order with respect to the morpheme compared to those whose L1s mark it. This was particularly the case for articles.
Susana MURCIA-BIELSA & Penny MACDONALD
The TREACLE project: Profiling learner proficiency using error and syntactic analysis
This article describes ongoing research within the TREACLE project. TREACLE aims to profile the specific grammatical skills of Spanish university learners of English at various proficiency levels, and, on the basis of these profiles, develop proposals for re-designing curriculum and teaching materials particularly focused on the real needs of Spanish students at distinct proficiency levels. To this end, we are developing a methodology for grammatical profiling of proficiency levels using learner corpora. Some approaches (e.g. Dagneaux et al. 1998) have explored grammatical competence of learners by looking at the errors they make at each proficiency level. However, we believe that to get a clear picture of learner competence, we need to measure not only what they do wrong (errors), but also what they do right. We thus take a two-pronged approach, involving automatic syntactic tagging of the corpus to see what structures students are attempting, and manual error annotation to see what they do wrong. This paper presents our approach and reports on some preliminary results in profiling provided by our combined approach.
Susan NACEY & Anne-Line GRAEDLER
Communication strategies used by Norwegian students of English
This paper investigates the use of communication strategies by Norwegian learners of English, based on transcribed interviews recorded as part of the Louvain International Database of Spoken English Interlanguage (LINDSEI) (Gilquin et al. 2010). The data consists of 380 instances of communication strategies which have been categorized according to a taxonomy compiled from various pre-existing taxonomies of such strategies. The study reveals that the learners resort to achievement strategies in 96% of the cases. Among the achievement strategies, L2-based strategies are the most common, which makes sense considering the learners’ fairly high competence level in English. A substantial number of instances of L1-based strategies, such as code switching, can be attributed to the fact that the interviewers understand Norwegian perfectly despite being native speakers of English. This strategy type thus contributes positively to fluency, rather than disrupts communication. Other aspects that are analyzed include the tendency for different strategy types to occur in clusters, and the success of different types of cooperation strategies, where the learner implicitly or explicitly appeals to the interviewer for assistance.
The use of articles in Japanese EFL learners’ essays
This paper explores how article use changes according to the development of L2 writing proficiency. Argumentative essays were collected from 61 Japanese EFL learners who were in their first year
Learner corpus research is a young but vibrant new brand of research which stands at a crossroads between corpus linguistics, second language acquisition and foreign language teaching. Its origins go back to the late 1980s when academics and publishers started collecting data from foreign/second language learners with a view to advancing our understanding of the mechanisms of second language acquisition and/or developing pedagogical tools and methods that more accurately target the needs of language learners. At first limited to English as a Foreign Language, learner corpus research has begun to spread to a wide range of languages and as a result, the community group of learner corpus researchers is rapidly growing and diversifying. The First Learner Corpus Research Conference organized by the Centre for English Corpus Linguistics of the Université catholique de Louvain in September 2011 aimed to take stock of the advances made in the field in its over twenty years of existence. The resulting proceedings volume covers issues of learner corpus design, collection and annotation and contains reports on various aspects of (written and spoken) learner interlanguage – pronunciation, prosody, grammar, lexis, phraseology and discourse – as well as design of learner-corpus-informed tools. The volume also explores some of the ways in which learner corpus research could develop in the near future.