Technical Report UCAM-CL-TR-904
ISSN 1476-2986
Number 904

Computer Laboratory

Grammatical error correction in non-native English

Zheng Yuan

March 2017

15 JJ Thomson Avenue
Cambridge CB3 0FD
United Kingdom
phone +44 1223 763500
http://www.cl.cam.ac.uk/
© 2017 Zheng Yuan

This technical report is based on a dissertation submitted September 2016 by the author for the degree of Doctor of Philosophy to the University of Cambridge, St. Edmund's College.

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/

ISSN 1476-2986
Grammatical error correction in non-native English

Zheng Yuan

Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in written text. Previous research has mainly focussed on individual error types and current commercial proofreading tools only target limited error types. As sentences produced by learners may contain multiple errors of different types, a practical error correction system should be able to detect and correct all errors.

In this thesis, we investigate GEC for learners of English as a Second Language (ESL). Specifically, we treat GEC as a translation task from incorrect into correct English, explore new models for developing end-to-end GEC systems for all error types, study system performance for each error type, and examine model generalisation to different corpora. First, we apply Statistical Machine Translation (SMT) to GEC and prove that it can form the basis of a competitive all-errors GEC system. We implement an SMT-based GEC system which contributes to our winning system submitted to a shared task in 2014. Next, we propose a ranking model to re-rank correction candidates generated by an SMT-based GEC system. This model introduces new linguistic information and we show that it improves correction quality. Finally, we present the first study using Neural Machine Translation (NMT) for GEC. We demonstrate that NMT can be successfully applied to GEC and help capture new errors missed by an SMT-based GEC system.

While we focus on GEC for English, our methods presented in this thesis can be easily applied to any language.
First and foremost, I owe a huge debt of gratitude to my supervisor, Ted Briscoe, who has patiently guided me through my PhD and always been very helpful, understanding and supportive. I cannot thank him enough for providing me with opportunities that helped me grow as a researcher and a critical thinker.

I am immensely grateful to my examiners, Paula Buttery and Stephen Pulman, for their thorough reading of my thesis, their valuable comments and an enjoyable viva.

My appreciation extends to my fellow members of the Natural Language and Information Processing research group, with whom I have always enjoyed discussing our work and other random things. My gratitude goes to Stephen Clark and Ann Copestake for giving me early feedback on my work as well as Christopher Bryant for generously reading my thesis draft. I would especially like to thank Mariano Felice for being not just a great colleague but also a dear friend. A special mention has to be given to Matthew Purver, who got me interested in this field in the first place.

I am thankful to Cambridge Trust and China Scholarship Council for funding my research, making it possible for me to pursue a doctoral degree in Cambridge. I am also grateful to the Computer Laboratory and St. Edmund's College for supporting my conference attendance.

Finally, I would like to express my heartfelt thanks to my family and friends. Special thanks go to Hui Xiao and Mo Jia for always being there whenever I need them. Words are powerless to describe my appreciation and gratitude to my parents, Xuewen Yuan and Yun Zeng, for all the sacrifices that they have made on my behalf. Their love and support have sustained me thus far and I know will continue to sustain me.
List of Abbreviations

ALM    adaptive language model
BiRNN  Bidirectional Recurrent Neural Network
BLEU   Bilingual Evaluation Understudy
BNC    British National Corpus
CE     correct edit
CLC    Cambridge Learner Corpus
CNN    Convolutional Neural Network
CoNLL  Conference on Computational Natural Language Learning
EM     Expectation-Maximisation
ESL    English as a Second Language
EVP    English Vocabulary Profile
FCE    First Certificate in English
FN     false negative
FP     false positive
GEC    grammatical error correction
GLEU   Generalized Language Evaluation Understanding
GPU    graphics processing unit
GR     grammatical relation
GRU    Gated Recurrent Unit
HMM    Hidden Markov Model
HOO    Helping Our Own
I      I-measure
IELTS  International English Language Testing System
ILP    Integer Linear Programming
ITG    Inversion Transduction Grammar
L1     first language
L2     second language
LM     language model
LSTM   Long Short-Term Memory
MBR    Minimum Bayes-Risk
ME     missed edit
MEMT   Multi-Engine Machine Translation
MERT   Minimum Error Rate Tuning
MIRA   Margin Infused Relaxed Algorithm
MT     machine translation
NB     Naïve Bayes
NLP    Natural Language Processing
NLTK   Natural Language Toolkit
NMT    Neural Machine Translation
NP     noun phrase
NUCLE  National University of Singapore Corpus of Learner English
NUS    National University of Singapore
OOV    out-of-vocabulary
P      precision
POS    part-of-speech
R      recall
RASP   Robust Accurate Statistical Parsing
RBMT   Rule-Based Machine Translation
RNN    Recurrent Neural Network
SGD    Stochastic Gradient Descent
SMT    Statistical Machine Translation
SVM    Support Vector Machine
TM     translation model
TN     true negative
TP     true positive
UE     unnecessary edit
WAcc   weighted accuracy
WMT    Workshop on Statistical Machine Translation
CHAPTER 1

Today, from Beijing to Brasilia, millions of people are learning English as a Second Language (ESL). According to a report published by the British Council in 2013, English is spoken at a 'useful level' by 1.75 billion people worldwide. In fact, non-native English speakers now outnumber native speakers. Furthermore, the number of ESL learners keeps on growing and it is estimated that 2 billion people will be using English - or learning to use it - by 2020.

Nevertheless, learning a new language is never easy. Difficulties in acquiring a new language can be due to the differences between the new language and the learners' first languages (L1s) (Lado, 1957). These differences may result in various kinds of errors in learner writing. Errors made by learners are different from those made by native speakers. Connors and Lunsford (1988) studied errors made by college students in the United States and compiled an error list ranked by frequency. Their work was later replicated by Donahue (2001) with a focus on ESL learners. Results showed that half of the ten most frequent error types made by native speakers were 'negligible' in ESL writings.

There has been a great deal of commercial and academic interest in automatically correcting these written errors for ESL learners. From a commercial perspective, there is a great potential for many practical applications, such as proofreading tools that help second language (L2) speakers identify and correct their writing errors without human intervention or educational software for automated language learning and assessment. From a research perspective, correcting errors in learner writing is an interesting and challenging task as it involves various aspects of Natural Language Processing (NLP), such as language modelling, syntax and semantics.

Early grammar checkers can be traced back to the 1980s, when hand-coded grammar rules were mostly used. However, due to the productive nature of language and the creativity of learners, it is impractical to define rules for every possible case. With the advent of large-scale annotated corpora in the 1990s, data-driven approaches made it possible to build systems for specific error types. Nevertheless, popular commercial proofreading tools only target a few error types that are easy to correct, such as spelling mistakes (a *baeutiful/beautiful girl) or wrong past participle forms of irregular verbs (Dave has *runned/run 42 marathons), and do not include those aspects of English that are harder to learn. At the same time, most research in the area has focussed on two common error types made by learners, namely articles
(Mary's sister is */a hairdresser) and prepositions (the best places to visit *on/in July), assuming that there is only one error per sentence. However, errors do not always occur in isolation. Sentences produced by learners may contain multiple errors which belong to different error types. What is worse, errors may interact with others so that the correction of one error requires the correction of the other. See the following example sentences written by ESL learners:

Example 1.1. I am plece to tell the information do you need for the group.

The sentence contains three errors: a spelling mistake (plece → pleased), a wrong verb (tell → provide) and an unnecessary verb (do).

Example 1.2. As you know, it is not suitable to wear a jean.

The sentence contains two interacting errors: 'a' should be deleted and 'jean' should be changed to 'jeans' at the same time (a jean → jeans).

An error correction system that can only correct one or a few types of errors will be of limited use to learners. Instead, a good system should be able to correct a variety of error types and corrections should be performed at a global rather than local level, including taking interacting errors into account. Our goal in this thesis is to develop robust error correction systems that can automatically detect and correct all errors present in learner text, trying to overcome the aforementioned limitations.

The error correction task can be thought of as a type of monolingual 'translation', where the source is a sentence written by a learner and the target is a fluent and adequate sentence in the same language. A corrected sentence should be grammatically correct and preserve the original meaning of the source.

Rather than building individual components for each error type, we apply the machine translation (MT) approach of 'translating' a grammatically incorrect sentence into a correct one to address all error types simultaneously. The MT approach takes advantage of large annotated learner data. Systems learn correction mappings from data and use them to generate a corrected version of the original sentence, correcting as many errors as possible. Our work investigates MT methods for correcting grammatical errors in non-native English text and addresses issues arising from applying existing MT techniques to the error correction task. We further identify new techniques for developing robust error correction systems that outperform previous approaches.

While English is by far the most spoken foreign language in the world, there is also a need for grammar checkers for other languages, such as Chinese, Spanish and Arabic. Although we focus only on English in this thesis, the methods described here can be applied to any language given appropriate data.

1.1 What is grammatical error correction?

Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in written text. More specifically, the task is to build a system that takes an input text, analyses the context of the text to identify and correct any grammatical errors, and finally returns a corrected version that retains the original meaning. If there is no error in the input text, the system should output the
text without any modification. In this thesis, we focus on grammatical errors in non-native English text.

It should be noted that not all errors present in learner text are grammatical errors, however. Errors were traditionally identified at five levels: 1) a lexical level, 2) a syntactic level, 3) a semantic level, 4) a discourse structure level, and 5) a pragmatic level (Kukich, 1992). Lexical errors are spelling mistakes that result in non-existent words, such as misspelling 'type' as 'tipe', where 'tipe' is not a legitimate word in English. Errors where the syntactic categories of the words do not fit their contexts are classified as syntactic errors, such as subject-verb agreement errors (she always *know/knows her place) or verb tense errors (the church *is/was rebuilt in 1948). Errors that cause semantic anomalies are semantic errors, which involve contextual spelling mistakes that result in legitimate words (we waited for twenty *minuets/minutes) and collocation/cooccurrence errors (*big conversation) (Kochmar, 2016). Discourse errors violate the inherent coherence relations in a text while pragmatic errors reflect some anomalies related to the goals and plans of the discourse participants. Correcting errors from the last two groups requires further discourse analysis. In this thesis, we use the broad term 'grammatical error' to refer only to lexical, syntactic and semantic errors, but do not tackle discourse and pragmatic errors whose 'span' goes beyond the sentence.

1.2 Thesis aims

The work presented in this thesis aims to:

1. Develop end-to-end error correction systems that are capable of correcting grammatical errors present in text written by learners of English. As sentences produced by ESL learners may contain multiple errors which belong to different error types, we aim to develop robust systems that can automatically detect and correct a variety of error types and perform corrections at a global rather than local level, where interacting errors are covered as well.

2. Explore the use of several statistical NLP approaches for GEC:

(a) Can SMT form the basis of a competitive all-errors GEC system? Statistical Machine Translation (SMT) has been successfully used to correct a limited number of grammatical errors in the past (see Brockett et al., 2006; Yuan and Felice, 2013), so we aim to investigate whether the same approach can be used to correct multiple grammatical errors at the same time.

(b) Can candidate re-ranking improve sentence quality in SMT-based GEC? Since SMT was not originally designed for GEC, many standard features do not perform well on the error correction task. It is therefore necessary to add new local and global features to help the SMT decoder distinguish good from bad corrections. We propose a Support Vector Machine (SVM) ranking model to re-rank candidates generated
by an SMT-based GEC system. We aim to determine whether candidate re-ranking is a viable approach to address the decoding problem in this scenario and thus improve sentence quality.

(c) Can NMT be applied to GEC? Typical SMT-based GEC systems suffer from data sparsity. Some errors are not covered by these systems because the mappings needed for correction have not been seen in the training data. With the recent advances in neural networks, Neural Machine Translation (NMT) seems appealing for GEC as it may be possible to correct erroneous phrases and sentences that have not been seen in the training set more effectively. We investigate NMT systems and how they can be applied to GEC in order to capture new errors without the need for additional training data.

3. Examine and address issues concerning applying existing techniques to GEC. As we approach GEC as a special translation task, where the source and target sentences are both in English but the source may contain grammatical errors, it is inevitable that new problems may arise from adapting existing MT techniques to GEC. We discuss these problems and propose possible solutions.

4. Investigate system performance for each error type. Type-specific performance helps understand the strengths and weaknesses of the system, as well as identify areas for future improvement. However, this is not easy to do for all-errors GEC systems which propose corrections without error types. We apply a type estimation strategy and present detailed error analyses.

5. Examine model generalisation to different learner corpora. It is not the aim of this thesis to beat the state-of-the-art result on one particular dataset (e.g. the CoNLL-2014 shared task test set - see Section 2.5.2). Instead, we are more interested in models that can consistently produce competitive results across different learner corpora without retraining or tuning for new datasets or GEC tasks. For this reason, we test model generalisation and compare the results with those from other models which are trained and tuned specifically for each corpus.

1.3 Thesis structure

The structure of this thesis is as follows.

Chapter 2 discusses several related topics in GEC. It begins with an overview of the automated approaches to detect and correct errors made by learners and goes on to describe the MT approach to error correction. Additionally, it gives a description of the learner corpora and automatic evaluation metrics for GEC, followed by a summary of a series of shared tasks on GEC. It concludes with a discussion of the datasets and evaluation metrics used in this thesis.

Chapter 3 describes our approach to building a preliminary SMT-based error correction system. We address the major issues that arise from applying standard SMT to GEC. We explore different types of translation models (TMs), language
models (LMs) and alignment methods used in an SMT system. To overcome the lack of training data, we propose the use of three different types of data and demonstrate how they can help build robust SMT models. We also investigate phrase table filtering. We present an SMT system that forms one half of our winning system submitted to a shared task on grammatical error correction in 2014. A detailed error analysis of the SMT-based GEC system is also performed.

In Chapter 4, we propose a supervised ranking model to re-rank candidates generated by an SMT-based GEC system. A range of novel features with respect to error correction are investigated and implemented in our re-ranker. An in-depth assessment of the role played by each feature type is carried out, quantifying its contribution from a statistical perspective. We also investigate the performance of different re-ranking techniques and find that our proposed model clearly outperforms the other two, showing its effectiveness in re-ranking candidates for GEC.

Chapter 5 presents the first study on NMT for GEC, in an attempt to ameliorate the lack of training data for SMT-based GEC systems. Problems from adapting standard NMT to GEC are addressed. The performance of different NMT models on the error correction task is investigated. We also propose a two-step approach to address the 'rare word' problem in NMT for GEC and demonstrate how it can help provide a substantial improvement in system performance.

Finally, Chapter 6 concludes this thesis and discusses some avenues for possible future research.
CHAPTER 2

There is a large body of work on grammatical error detection and correction. This chapter puts the present research in context by offering an overview of the latest research in the field. A more comprehensive survey of automated grammatical error detection for language learners can be found in the book by Leacock et al. (2014).

2.1 Early approaches to grammatical error correction

Early attempts at automated error correction employed hand-coded rules. The first widely used grammar checking tools, such as the Writer's Workbench (MacDonald et al., 1982), were based on simple pattern matching and string replacement. Other rule-based systems incorporated syntactic analysis and used manually developed grammar rules. For example, both Grammatik from Aspen Software and GramCheck (Bustamante and León, 1996) relied on basic linguistic analysis, while IBM's Epistle (Heidorn et al., 1982) and Critique (Richardson and Braden-Harder, 1988) performed full syntactic analysis. Rule-based approaches are generally easy to implement for some types of errors and can be very effective. This is why they are still widely used by existing grammar checking systems. However, rules can become impractical for some complex errors and unmanageable with time. The highly productive nature of language makes it impossible to define rules for every potential error. So rule-based approaches are often avoided as a general solution.

With the advent of large-scale annotated resources in the 1990s, researchers moved to data-driven approaches and applied machine learning techniques to build classifiers for specific error types (Knight and Chander, 1994; Han et al., 2004; Chodorow et al., 2007; De Felice and Pulman, 2007; Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2011; Dahlmeier et al., 2012). Most work using machine learning classifiers has focussed on two error types: articles and prepositions. This is due to the fact that these errors are some of the most common and challenging ones for ESL learners, and are also easier to tackle using machine learning approaches than hand-crafted rules (Felice and Yuan, 2014a). For these closed-class errors, a
finite confusion set or candidate set including all the possible correction candidates is defined, such as a list of articles or prepositions in English. Training examples - native and/or learner data - are represented as vectors of linguistic features that are considered useful for the error type. Possible features often include neighbouring words, part-of-speech (POS) tags, grammatical relations (GRs) and dependency trees. Various machine learning algorithms are used to train classifiers based on these features. Once a system has been trained, new errors are detected and corrected by comparing the original word used in the text with the most likely candidate predicted by the classifier. Since the most useful features often depend on the word class, it is necessary to build separate classifiers for each error type. Han et al. (2004) trained a maximum entropy classifier to detect article errors on a large diverse corpus and achieved an accuracy of 88%. Tetreault and Chodorow (2008) used maximum entropy models to correct errors for 34 common English prepositions in learner text.

Errors made by ESL learners often depend on their L1s (Lee and Seneff, 2008). Systems perform much better when information about their L1s is included. Rozovskaya and Roth (2011) compared four linear machine learning classifiers for correcting preposition errors. Results showed that discriminative classifiers perform the best and adaptation to a writer's L1 further improves performance. The authors proposed a way of integrating language-specific priors at decision time using Naïve Bayes (NB) models instead of training separate classifiers for each L1.

The weakness of approaches based on 'classification by error type' is that they only rely on local context and treat errors independently, assuming that there is only one error in the context and all the surrounding information is correct. However, sentences produced by learners may contain a complex combination of several types of errors which may further interact. An error correction system that only corrects one type of error is of limited use to language learners in practical applications. A commonly used solution is to build multiple classifiers and then cascade them into a pipeline system. A combination of classifier-based and rule-based steps is often used to build systems that correct multiple errors (Dahlmeier et al., 2012; Rozovskaya et al., 2013). This kind of solution is complex and laborious: several pre-processing and post-processing steps are required, and the order of classifiers also matters. Additionally, it does not solve the problem of interacting errors and predictions from independent classifiers may be inconsistent. Here is a typical example taken from Rozovskaya and Roth (2013):

Example 2.1. ... electric cars is still regarded as a great trial innovation ...

Predictions made by a system that combines independently-trained classifiers: cars is → car are.

Several approaches have been proposed to address the problem of interacting errors. Rather than making decisions independently, Dahlmeier and Ng (2012a) developed a beam-search decoder to iteratively generate sentence-level candidates and score them using individual classifiers and a general LM. Five proposers were used to generate new candidates by making five types of changes: spelling, articles, prepositions, punctuation insertion, and noun number. Results appeared promising and the decoder outperformed a pipeline system of individual classifiers and rule-based steps.
However, their decoder only provides corrections for five error types and new
proposers need to be added into the system in order to cover more errors, some of which might not be easy to design. Furthermore, the number of candidates grows exponentially with the type of errors being considered (i.e. the number of proposers) and the sentence length. As it is infeasible to enumerate all candidates, building an efficient decoder becomes a problem. Wu and Ng (2013) proposed a joint inference model to resolve inconsistencies produced by individual classifiers. Integer Linear Programming (ILP) was used to incorporate the output of individual classifiers and a list of linguistic constraints. These constraints were manually defined and explicitly encoded into the system. Any new constraints need to be hand-coded for new types of interacting errors. Rozovskaya and Roth (2013) built two joint classifiers to address two linguistic structures: subject-verb and article-NPhead (the interaction between the head of the noun phrase (NP) and the article that refers to the NP). For each of the structures, rather than using two classifiers independently, a joint classifier simultaneously predicts two words that are part of the same structure. Unlike the ILP model proposed by Wu and Ng (2013), the joint classifier does not need human-defined constraints, as it can learn from the training data directly. However, it is more difficult to collect enough pairs of candidates that form the relevant structures to use as training data. As one joint classifier only targets one type of interacting error, new classifiers need to be built for every new type of interaction. These classifier-based approaches still use scores from individual classifiers, so it becomes prohibitively time-consuming to train individual classifiers for all types of (interacting) errors.

A more general approach for correcting multiple errors in ESL text is to use n-gram LMs (Gamon et al., 2008; Gamon, 2011). A single model is trained on a large number of correct sentences and then used to assign probabilities to sequences of words based on counts from the training data. Within this framework, the target word sequence is substituted for alternatives from a precompiled candidate set and the LM scores for the original text as well as the alternatives are computed. The sequence with the highest probability is chosen as the correct one. Ideally, correct word sequences will get high probabilities while incorrect or unseen ones will get low probabilities. Errors are assumed to occur in parts of a sentence where a low score is assigned. However, no matter how large a training corpus is, it is impossible to cover all possible correct word sequences in practice. Another problem lies in how to distinguish low-frequency word combinations from erroneous ones. Therefore, the LM approach is commonly used in addition to other approaches, especially to rank correction suggestions proposed by other models. Gamon et al. (2008) used a LM in addition to machine learning classifiers and combined them using a meta-classifier. Dahlmeier and Ng (2012a) used a LM in combination with classifiers to score correction candidates in a beam-search decoder.

Additionally, some efforts have been made to tackle learner errors that are particularly difficult to detect and correct. Rozovskaya et al. (2014b) proposed a linguistically motivated approach to verb error correction. Their model integrated a machine learning approach with a rule-based system that first identifies verb candidates in noisy learner text and then makes use of verb finiteness information to identify errors and characterise the type of mistake.
Xue and Hwa (2014) developed
a computational model for redundancy detection in ESL writings. They proposed a measure to assign high scores to words and phrases that are likely to be redundant within a given sentence by comparing an ESL sentence with the output from off-the-shelf MT systems. For content word combinations, Kochmar (2016) performed error detection in adjective-noun and verb-object combinations in learner data using compositional distributional semantic models.

2.2 Machine translation and error correction

A practical error correction system should be able to correct various types of errors made by ESL learners. In more recent research, MT techniques have been used to successfully correct a broader set of errors. MT algorithms automatically translate text from a source language into a target language. Error correction thus can be seen as a special translation problem from grammatically incorrect sentences into correct ones. Unlike in standard MT tasks, the source and target sentences are both in the same language but the source may contain grammatical errors. MT-based GEC systems learn correction mappings from parallel examples and use these mappings to generate a corrected version of the original (erroneous) sentence, correcting as many errors as possible.

2.2.1 Statistical machine translation

SMT, as the dominant MT approach in the last two decades, employs statistical models estimated from parallel corpora (i.e. source-target pairs) and monolingual corpora (i.e. target sentences) to transform text from one language to another (SMT algorithms are described in more detail in Chapter 3). Brockett et al. (2006) first proposed the use of an SMT model for correcting mass/count noun errors made by learners of English. A list of 14 mass nouns was compiled using dictionaries and the Chinese Learner English Corpus (Gui and Yang, 2003). An SMT system requires millions of examples of correct and incorrect usage to learn reliable translation mappings. Given that examples of correct usage are plentiful in native data while parallel examples of incorrect usage are much more difficult to collect, the authors transformed well-formed edited English sentences into mostly ungrammatical strings by introducing artificial mass noun errors. Hand-constructed regular expressions were used to make sure the generated strings exhibited characteristics of the learner corpus. A phrase-based SMT system was built using word alignments produced by GIZA++ (Och and Ney, 2003). Their SMT system successfully corrected 61.8% of mass noun errors from a set of 123 examples of incorrect usage. As noted by Leacock et al. (2014), this was only a first exploration of SMT techniques for GEC, but with enough training data, such a system could potentially be powerful enough to detect and correct errors that involve more than just the insertion, deletion or substitution of single words, as well as being able to provide stylistic writing assistance to ESL learners.
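In its textbook form, the SMT formulation sketched at the start of this subsection can be written as a noisy-channel decision rule. The equation below is this generic formulation rather than the exact model configuration of any particular system discussed in this chapter, where f denotes a (possibly erroneous) learner sentence and e a candidate corrected sentence:

\hat{e} = \operatorname*{arg\,max}_{e} P(e \mid f) = \operatorname*{arg\,max}_{e} P(f \mid e) \, P(e)

Under this view, the translation model P(f | e) is estimated from parallel pairs of incorrect and correct sentences, while the language model P(e) is estimated from monolingual corpora of correct English.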
Mizumoto et al. (2011) applied the same SMT techniques for Japanese error correction but improved them by considering a wider set of error types and training on a large-scale real-world dataset. Rather than transforming correct sentences into grammatically incorrect strings, they extracted real examples from the language learning social network website Lang-8 (http://lang-8.com). Moses (Koehn et al., 2007) was used as a decoder and GIZA++ as an alignment tool. Evaluation was based on a character-level version of the Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002), a popular metric for automatic MT evaluation. Mizumoto et al. (2012) extended their work to English and investigated the effect of training corpus size on various types of grammatical errors. Their results showed that a phrase-based SMT system is effective at correcting errors that can be identified by a local context, but less effective at correcting errors that need long-range contextual information.

Yuan and Felice (2013) trained a POS-factored SMT system to correct five types of errors in learner text for the CoNLL-2013 shared task on grammatical error correction (Ng et al., 2013) (see Section 2.5.2). These five error types involve articles, prepositions, noun number, verb form, and subject-verb agreement. Since the limited in-domain training data was insufficient to train an effective SMT system, we explored alternative ways of generating pairs of incorrect and correct sentences automatically from other existing learner corpora. We also proposed several modifications to address issues that affect system performance, like disabling word reordering and removing incorrect alignment mappings from the phrase table used by the SMT decoder. Although our SMT approach did not yield particularly high performance compared to other teams using machine learning classifiers, it revealed the potential of using SMT as a general approach for correcting multiple error types and interacting errors simultaneously. The version of the corpus used for the shared task only includes five error types and discards all the remaining corrections, resulting in some broken or partly-corrected sentences. These ill-formed sentences are particularly harmful for SMT-based systems which, unlike classifiers, work at a global rather than local level. As a result, many corrections proposed by our SMT system were considered incorrect because they did not belong to any of the five target error types. This showed that the SMT approach seems more suitable for an all-errors task rather than a constrained error correction task.

In the CoNLL-2014 shared task (Ng et al., 2014) (see Section 2.5.2), the top performing systems demonstrated that the SMT framework can yield state-of-the-art performance on an all-errors correction task. Our winning system (Felice et al., 2014) is a pipeline of a rule-based system and a phrase-based SMT system (see Chapter 3). The SMT system was trained on parallel sentences and short phrase alignments extracted from fully annotated learner corpora (see Section 2.3). Word alignment was carried out using Pialign (Neubig et al., 2011). As most words translate into themselves and some errors are often similar to their correct forms, we introduced character-level Levenshtein distance (Levenshtein, 1966), which captures the number of edit operations required to change the source phrase into the target phrase.
The 10-best correction candidates produced by the SMT system were then re-ranked using Microsoft's Web N-gram Services, which provide access to large smoothed n-gram LMs built from English web documents containing trillions of tokens (Gao et al., 2010). Corrections were finally filtered by error type. Our work showed that an SMT-based GEC system can produce state-of-the-art performance on the task and candidate re-ranking can further improve it.
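Character-level Levenshtein distance itself can be computed with standard dynamic programming. The sketch below is a generic implementation of the distance, not the exact feature computation used in the system described above; the example strings are taken from the learner errors discussed in Chapter 1.

def levenshtein(source, target):
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `source` into `target` (standard dynamic programming)."""
    prev = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        curr = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            curr.append(min(prev[j] + 1,          # delete s_char
                            curr[j - 1] + 1,      # insert t_char
                            prev[j - 1] + cost))  # substitute s_char with t_char
        prev = curr
    return prev[-1]

# A valid correction typically differs from its source by only a few character edits.
print(levenshtein("a jean", "jeans"))   # 3
print(levenshtein("plece", "pleased"))  # 3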
The SMT framework was also adopted by Junczys-Dowmunt and Grundkiewicz (2014), who ranked third out of the 13 participating teams. Following the work of Mizumoto et al. (2012), they constructed a training corpus of more than 3 million pairs of parallel sentences from Lang-8. Since the Lang-8 data can be quite noisy, they performed error selection by keeping errors that resembled mistakes in a learner corpus and replacing others with their corresponding corrections. Apart from the LM built from the target side of the training data, a 5-gram LM estimated from the entire CommonCrawl data (approximately 440 billion tokens, see Buck et al., 2014) was used during decoding. Similar to our character-level Levenshtein distance feature, they introduced a word-based version. Feature weights were tuned for F-score using the k-best Margin Infused Relaxed Algorithm (MIRA) (Cherry and Foster, 2012) and Minimum Error Rate Tuning (MERT) (Och, 2003). Although they concluded that parameter optimisation was essential, Kunchukuttan et al. (2014) subsequently found that tuning for F-score to increase precision yielded worse performance. Grundkiewicz and Junczys-Dowmunt (2014) later introduced the WikEd Error Corpus, which consists of more than 12 million sentences extracted from Wikipedia revision histories. A similar error selection process was performed to only keep errors that resembled those made by ESL learners. In a subsequent paper, Junczys-Dowmunt and Grundkiewicz (2016) introduced additional features based on edit operation counts, as well as an operation sequence model (Durrani et al., 2013) and a 9-gram LM based on word-classes produced by word2vec (Mikolov et al., 2013). However, the integration of additional models/features seemed to affect the underlying algorithm used in SMT. The authors also observed erratic behaviour when optimising the new features and therefore proposed partial solutions to task-specific parameter tuning. Finally, they reported new state-of-the-art performance on the CoNLL-2014 shared task test set, with an F0.5 score of 49.49%.

The 'translation' approach has also been used to perform automatic post-editing. Simard et al. (2007) discussed the use of an SMT system to translate erroneous texts produced by a Rule-Based Machine Translation (RBMT) system into better texts in the same language. A phrase-based SMT system was used as an automatic post-editing system and results showed that the SMT system was effective at correcting repetitive errors made by the RBMT system.

Instead of translating an erroneous English sentence into a correct one directly, an SMT system could be used as an auxiliary tool for producing 'round-trip' translations (Hermet and Désilets, 2009; Madnani et al., 2012). The idea of round-trip SMT is to first translate an English sentence into a pivot foreign language, and then translate the pivot foreign language sentence back into English. By comparing the original English sentence and the round-trip translation, errors can be detected and corrected. Hermet and Désilets (2009) focussed on sentences containing preposition errors and generated a round-trip translation via French. They simply used the
round-trip translation as the 'correction' for the original sentence and their model was able to correct 66.4% of errors. Madnani et al. (2012) used round-trip translations obtained from the Google Translate API (https://cloud.google.com/translate) via 8 different pivot languages for an all-errors task. Their results showed that it is rarely the case that one pivot language could offer a round-trip translation that corrected all errors in the sentence; but that several pivot languages, if combined properly, could. An alignment algorithm was designed to combine multiple round-trip translations generated from the API into a lattice using TERp, an extension of the Translation Edit Rate evaluation metric (Snover et al., 2009). The lattice was then used to extract whole-sentence corrections. Their experiments yielded fairly reasonable results but left significant room for improvement.

2.2.2 Candidate re-ranking

Despite the success of SMT-based GEC systems, one of the weaknesses is that SMT features used in the framework might not perform well on the error correction task given that SMT was not originally intended for GEC. Since the SMT features were designed to capture translation regularities, they may fail to capture some correction regularities. As a result, the correction produced by an SMT system is not always the best. It thus seems necessary to add new features with respect to GEC for building effective SMT-based GEC systems, although work in this direction is very limited. Felice et al. (2014) and Junczys-Dowmunt and Grundkiewicz (2014) introduced Levenshtein distance to their phrase-based SMT systems. Felice et al. (2014) further used a web-based LM to re-rank the 10-best correction candidates produced by the SMT system.

Re-ranking, in contrast, has been widely used in many NLP tasks such as parsing, tagging and sentence boundary detection (Collins and Duffy, 2002; Collins and Koo, 2005; Roark et al., 2006; Huang et al., 2007). Various machine learning algorithms have been adapted to these re-ranking tasks, including boosting, perceptrons and SVMs. Over the last decade, re-ranking techniques, especially discriminative re-ranking, have shown significant improvement in MT. For each source sentence, rather than outputting the candidate with the highest probability directly, an n-best list of candidate translations is collected from an SMT system and later re-ranked using re-ranking algorithms (re-ranking algorithms are described in more detail in Chapter 4). New global and local features that have not been used during translation can then be easily added to the re-ranker, without worrying about fine-grained smoothing issues in the SMT framework. Shen et al. (2004) successfully applied discriminative re-ranking to MT and observed an improvement in BLEU over the original output of the SMT system. As phrase-based SMT systems make little or no direct use of syntactic information, Och et al. (2004) proposed to use syntactic features to re-rank the n-best list. A wide range of features were systematically evaluated, including word-level features, shallow syntactic features based on POS tags and chunks, and features from Treebank-based syntactic analyses. However, these syntactic features only gave very small gains and
improvements were mostly due to the addition of translation probabilities from IBM Models (Brown et al., 1993), a non-syntactic feature. Goh et al. (2010) employed an online training algorithm for SVM-based structured prediction. Various global features were investigated for SMT re-ranking, such as the decoder's scores, source and target sentences, alignments, POS tags, sentence type probabilities, posterior probabilities and back translation features. Farzi and Faili (2015) proposed a re-ranking system based on swarm algorithms, where a set of non-syntactic features that can be easily computed from LMs, TMs, n-best lists of candidates and POS tags was used.

As candidate re-ranking seems potentially valuable for GEC, we propose an SVM ranking model to improve SMT output, making it the first work to use discriminative re-ranking for SMT-based GEC.

2.2.3 Neural machine translation

In the past few years, neural network techniques have found success in a wide range of NLP tasks, such as language modelling (Mnih and Hinton, 2007; Mikolov and Zweig, 2012), discriminative parsing (Collobert, 2011), sentiment analysis (Socher et al., 2011; Glorot et al., 2011) and summarisation (Kågebäck et al., 2014). Thus, it is not surprising that neural network models have also been applied to error detection and correction. Sun et al. (2015), for example, employed a Convolutional Neural Network (CNN) for article error correction. Instead of building machine learning classifiers using pre-defined syntactic and/or semantic features, a CNN model is trained from surrounding words with pre-trained word embeddings. Lee et al. (2016) used a CNN to predict whether a sentence needs editing. Rei and Yannakoudakis (2016) looked into various neural network sequence labelling models for error detection in learner writing.

The tide of neural models has also spread to the field of MT. Unlike SMT, NMT learns a single large neural network which inputs a source sentence and outputs a translation. The use of NMT models has shown promising results for several MT tasks (see Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015). Specifically, NMT systems ranked on par with phrase-based SMT systems on a couple of language pairs in the 2015 Workshop on Statistical Machine Translation (WMT) shared translation task (Bojar et al., 2015); the NMT system from Jean et al. (2015b) ranked 1st on the English-German translation task and 3rd on the Czech-English, English-Czech and German-English translation tasks (ties were allowed).

NMT applies an encoder-decoder framework. An encoder first reads and encodes an input sentence into a vector representation. A decoder then outputs a translation for the input sentence from the vector representation (NMT algorithms are described in more detail in Chapter 5). Different network architectures have been proposed for NMT. Kalchbrenner and Blunsom (2013) first used a CNN to encode source sentences and a Recurrent Neural Network (RNN) to generate target translations. A similar CNN encoder was then used by Meng et al. (2015). Sutskever et al. (2014) and Cho et al. (2014) used RNNs for both encoding and decoding. Sutskever et al. (2014) used a multilayer Long Short-Term Memory (LSTM) to map a source sentence into a fixed-sized vector, and another LSTM to
decode a target sentence from the vector. Cho et al. (2014) used two Gated Recurrent Unit (GRU) models, one as the encoder and another as the decoder. Bahdanau et al. (2015) introduced an attention mechanism to NMT which helps the decoder focus on the most relevant information in a source sentence when predicting target words. Luong et al. (2015a) experimented with two attention mechanisms and compared various alignment functions. Both Bahdanau et al. (2015) and Luong et al. (2015a) have shown that attention-based models are better than non-attentional ones in handling long sentences.

Towards the end of this thesis, we explore the potential of NMT for GEC, as we believe that the distributed representation of words could help correct previously unseen errors more effectively than SMT. To the best of our knowledge, this is the first work to use the NMT framework to build end-to-end GEC systems.

2.3 Learner corpora

Unlike native corpora, learner corpora are collections of language data produced by non-native speakers. Having such learner resources is advantageous for GEC research: 1) it allows the investigation of real learner errors as well as the contexts in which they occur; 2) it facilitates the development of statistical models for GEC; for example, an SMT system requires millions of examples of correct and incorrect usage to learn reliable correction mappings; and 3) it provides a way of evaluating GEC system performance in a real world scenario. Recently, error-annotated learner corpora have become more readily and publicly available. In this section, we describe the learner corpora used in this thesis.

2.3.1 NUCLE

The National University of Singapore Corpus of Learner English (NUCLE) is an annotated corpus of learner text built by the National University of Singapore (NUS) NLP Group in collaboration with the NUS Centre for English Language Communication (Dahlmeier et al., 2013). It consists of more than 1,400 essays written by undergraduate students at NUS who are non-native English speakers. These essays were written in response to some prompts that cover a wide range of topics, such as environmental pollution, healthcare, and technology innovation. Two of the prompts used for data collection are shown below:

Prompt 1: "Public spending on the aged should be limited so that money can be diverted to other areas of the country's development." Do you agree?

Prompt 2: Surveillance technology such as RFID (radio-frequency identification) should not be used to track people (e.g. human implants and RFID tags on people or products). Do you agree? Support your argument with concrete examples.
The corpus contains over one million words which were manually annotated by professional English instructors at NUS using a tag set of 27 error categories (see Appendix A), resulting in a total number of 46,597 error annotations. The statistics of NUCLE show that 57.6% of all sentences have no errors, 20.5% have exactly one error, 10.7% have exactly two errors, and 11.2% of all sentences have more than two errors. The highest observed number of error annotations in a single sentence is 28. The top nine error types in the NUCLE corpus are presented in Table 2.1.

Error type | Prop. (%) | Example
Wrong collocation/idiom/preposition (Wcip) | 15.7 | Singapore has invested heavily *on/in the establishment of Biopolis.
Local redundancies (Rloc) | 13.7 | Abortion is available to end a life only *because of/because the fetus or embryo has the wrong sex.
Article or determiner (ArtOrDet) | 12.9 | Sex selection technology should not be used in *non-medical/a non-medical situation.
Noun number (Nn) | 8.5 | Sex selection should therefore be used for medical *reason/reasons.
Mechanics (Mec) | 7.1 | The *affect/effect of that policy has yet to be felt.
Verb tense (Vt) | 7.1 | A university *had conducted/conducted the survey last year.
Word form (Wform) | 4.8 | Sex-selection may also result in *addition/additional stress for the family.
Subject-verb agreement (SVA) | 3.4 | The boy *play/plays soccer.
Verb form (Vform) | 3.0 | Will the child blame the parents after he *growing/grows up?

Table 2.1: Proportion of the most common error types in the NUCLE corpus. Grammatical errors in the examples are printed in italics in the form *incorrect word/corrected word.

Although wrong word choice (Wcip) and redundancy errors (Rloc) are ranked at the top, Dahlmeier et al. (2013) reported that most Wcip errors are preposition errors, and a large percentage of Rloc errors involve articles that should be deleted. This confirms that articles and prepositions are two of the most common errors in ESL text. It also shows inconsistency in NUCLE labelling (the NUCLE corpus was later revised for the CoNLL shared tasks to separate prepositions from Wcip and articles from Rloc, amongst other things - see Section 2.5.2).

2.3.2 CLC

The Cambridge Learner Corpus (CLC) is the world's largest learner corpus, developed by Cambridge University Press and Cambridge English Language Assessment since 1993. It is a 52.5 million word collection of exam scripts written by learners of English who took Cambridge English examinations around the world. Currently, it comprises over 200,000 exam scripts produced by learners at various levels speaking
148 different L1s living in 217 different countries or territories. A subset of the corpus (a 25.5 million word collection) has been manually error coded by linguists using an error-coding system with a taxonomy of approximately 80 error types devised specifically for the CLC (Nicholls, 2003). The majority of the error codes used in the CLC are two-letter codes with the first letter representing the general type of error (e.g. spelling, word form) and the second one representing the word class (i.e. POS) of the required word (see Appendix B). The coding format is explained with the following examples:

Example 2.2. I am so <NS type="RJ"><i>exciting</i><c>excited</c></NS> that I have won the first prize.

Example 2.3. I like playing in <NS type="MD"><c>a</c></NS> team and deciding quickly what to do next.

Error information is provided inside the <NS> tag, where the error type is also specified. Inside the <NS> tag, the original erroneous part is marked by the <i> tag and its corrected version is marked by the <c> tag. In Example 2.2, "RJ" stands for Replace adJective, where 'exciting' should be corrected to 'excited'. In Example 2.3, "MD" stands for Missing Determiner, where the word 'a' should be added. Other error codes include Form, Unnecessary, Spelling and Derivation for the first letter; and Noun, Verb, preposiTion, Punctuation for the second letter. More detailed information about the error-coding scheme can be found in Nicholls (2003); a short illustrative sketch of how such annotations can be parsed is given below.

The top nine error types in the error-coded CLC are presented in Table 2.2, with spelling errors excluded. The most frequent error type in the CLC is choosing an inappropriate open class word (noun, verb, adjective or adverb), followed by prepositions and determiners (the determiner errors include both determiner and pre-determiner errors, not just the articles a/an and the). A similar error distribution was observed in the NUCLE corpus - see Table 2.1.

Error type | Prop. (%) | Example
Content word choice error | 19.9 | We need to deliver the merchandise on a daily *base/basis.
Preposition error | 13.4 | Our society is developing *in/at high speed.
Determiner error | 11.7 | We must try our best to avoid *the/a shortage of fresh water.
Comma error | 9.3 | However */, I'll meet you later.
Inflectional morphology | 7.4 | The women *weared/wore long dresses.
Wrong verb tense | 6.7 | I look forward to *see/seeing you.
Derivational morphology | 4.9 | It has already been *arrangement/arranged.
Pronoun | 4.2 | I want to make *me/myself fit.
Agreement error | 4.0 | I *were/was in my house.

Table 2.2: Proportion of the most common error types in the CLC. Grammatical errors in the examples are printed in italics in the form *incorrect word/corrected word.

Each examination script in the CLC contains meta-data about the learner, such as L1, nationality, age, sex and level of English, as well as the examination. There are three examination suites in the CLC (Williams, 2008):

• main suite (general purpose qualification): Certificate of Proficiency in English, Certificate of Advanced English, First Certificate in English (FCE), Preliminary English Test, and Key English Test;
• Business English Certificates (focuses on the language of business);
• International English Language Testing System (IELTS) (general and academic modules).

Two subsets of the CLC used in this thesis are described in detail below: FCE and IELTS examination scripts.
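To make the coding format above concrete, the following sketch extracts (error type, erroneous text, correction) triples from CLC-style annotations. The annotated sentence and the regular expression are simplified illustrations based on Examples 2.2 and 2.3; the full corpus also contains nested annotations, which this sketch ignores.

import re

# A hypothetical sentence annotated in the CLC-style format shown above.
SENTENCE = ('I am so <NS type="RJ"><i>exciting</i><c>excited</c></NS> '
            'that I have won the first prize.')

# One <NS> annotation: error type, optional erroneous span, optional correction.
NS_PATTERN = re.compile(r'<NS type="(?P<type>[^"]+)">'
                        r'(?:<i>(?P<incorrect>.*?)</i>)?'
                        r'(?:<c>(?P<correction>.*?)</c>)?'
                        r'</NS>')

def extract_edits(text):
    # Return (error type, erroneous text, correction) triples; missing parts are ''.
    return [(m.group('type'), m.group('incorrect') or '', m.group('correction') or '')
            for m in NS_PATTERN.finditer(text)]

print(extract_edits(SENTENCE))  # [('RJ', 'exciting', 'excited')]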
2.3.2.1 FCE examination scripts

The FCE dataset was released into the public domain in 2011 by Yannakoudakis et al. (2011). It is a set of 1,244 scripts written by learners of English who took the FCE examination between 2000 and 2001, which assesses English at an upper-intermediate level. The FCE dataset contains about half a million words and more than 50k errors. Each exam script contains two essays whose length varies between 120 and 180 words. Essays were written in response to tasks requiring a learner to write a letter, a report, an article, a composition or a short story. A typical prompt is shown below:

Your teacher has asked you to write a story for the school's English language magazine. The story must begin with the following words: "Unfortunately, Pat wasn't very good at keeping secrets".

The anonymised scripts are annotated using XML and linked to meta-data including the question prompts and information about candidates.

2.3.2.2 IELTS examination scripts

The IELTS dataset is another subcorpus of the CLC that comprises exam scripts written by ESL learners taking the IELTS examination. It consists of 851 scripts from 2008 and 100 scripts from 2010. Like in the FCE dataset, each exam script in the IELTS dataset consists of two essays in response to two tasks. The first task asks a learner to write a descriptive report on the information provided in a diagram, table or short piece of text, or write a short letter in response to a situation or problem with a minimum of 150 words. The second task asks a learner to use at least 250 words to present an argument or discuss a problem.
2.4 Evaluation metrics

For system development, it is necessary to have internal system evaluation. Automatic evaluation metrics allow fast and inexpensive feedback. When evaluating a GEC system, the system's output is compared to gold-standard references provided by human experts. There is an on-going discussion on how to best evaluate GEC systems and several metrics have been proposed and used (Dale and Kilgarriff, 2011; Dale et al., 2012; Papineni et al., 2002; Dahlmeier and Ng, 2012b; Felice and Briscoe, 2015; Bryant and Ng, 2015; Napoles et al., 2015; Grundkiewicz et al., 2015; Sakaguchi et al., 2016). In this section, we present four evaluation metrics used in this thesis.

2.4.1 BLEU

BLEU was first proposed by Papineni et al. (2002) and is now used as the dominant method for automatic MT evaluation. It estimates the quality of the text produced by MT systems so that the closer it is to human translations, the better. BLEU has been shown to correlate well with human judgments at the corpus level. It uses a modified n-gram precision (p_n) to compare a candidate against multiple references:

p_n = \frac{\sum_{\text{n-gram} \in C} \text{count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in C} \text{count}(\text{n-gram})}   (2.1)

where C is a candidate sentence. The count of each n-gram in C is clipped by its maximum reference count observed in any single reference for that n-gram:

\text{count}_{\text{clip}} = \min(\text{count}, \text{max\_ref\_count})   (2.2)

BLEU is then defined as:

\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)   (2.3)

where N is the order of the highest n-gram to be considered (usually N = 4) and w_n stands for uniformly distributed weights: w_n = 1/N. BP is a brevity penalty which is used to prevent very short candidates from receiving very high scores:

\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}   (2.4)

where c and r are the lengths of the system's candidate and gold-standard reference respectively.

BLEU was used by Mizumoto et al. (2011, 2012) to evaluate SMT-based GEC systems. Unlike metrics which rely on references with explicitly labelled error annotations, BLEU only requires corrected references. On the one hand, it can be used as a generic evaluation method independent of the annotation scheme, but on the other hand, it fails to provide detailed error type feedback for GEC. Since both the original and corrected sentences are in the same language (i.e. English) and most words in the sentence do not need changing, BLEU scores for GEC systems are relatively high compared with standard MT tasks. However, it is not enough to just compare BLEU scores from different GEC systems; it is also necessary to compare them with that of the original input. If the system's output yields a higher BLEU score than the original input, it is assumed that the system improves the quality of the original input by making some corrections. In addition, BLEU allows multiple references, which is useful for errors with multiple alternative corrections.
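As a rough illustration of Equations 2.1-2.4, the following minimal sketch computes sentence-level modified n-gram precisions and the brevity penalty for a single candidate, assuming whitespace tokenisation and taking the first reference's length for the brevity penalty; it is not the reference BLEU implementation used for the experiments reported later.

from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Candidate n-gram counts are clipped by the maximum count observed
    # in any single reference for that n-gram (Equations 2.1 and 2.2).
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def bleu(candidate, references, N=4):
    # Geometric mean of modified precisions with uniform weights w_n = 1/N,
    # multiplied by the brevity penalty (Equations 2.3 and 2.4).
    precisions = [modified_precision(candidate, references, n) for n in range(1, N + 1)]
    if min(precisions) == 0:
        return 0.0
    c, r = len(candidate), len(references[0])
    bp = 1.0 if c > r else exp(1 - r / c)
    return bp * exp(sum(log(p) for p in precisions) / N)

source = 'we waited for twenty minuets'.split()
reference = 'we waited for twenty minutes'.split()
print(bleu(reference, [reference]), bleu(source, [reference]))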
2.4.2 M2 scorer

The M2 scorer, proposed by Dahlmeier and Ng (2012b), is used to evaluate system performance by how well its proposed corrections or edits match the gold-standard edits. It computes the sequence of phrase-level edits between a source sentence and a system's candidate that achieves the highest overlap with the gold-standard annotation. A parameter µ (max_unchanged_words) is used to limit the number of unchanged words, so that edits including too many words are avoided. Evaluation is performed by computing precision (P), recall (R) and F-score (van Rijsbergen, 1979):

$$P = \frac{\sum_{i=1}^{n} |e_i \cap g_i|}{\sum_{i=1}^{n} |e_i|} \tag{2.5}$$

$$R = \frac{\sum_{i=1}^{n} |e_i \cap g_i|}{\sum_{i=1}^{n} |g_i|} \tag{2.6}$$

$$F_\beta = (1 + \beta^2) \times \frac{P \times R}{(\beta^2 \times P) + R} \tag{2.7}$$

where e_i = {e_1, e_2, ..., e_n} is the system's candidate edit set and g_i = {g_1, g_2, ..., g_n} is the gold-standard edit set. The intersection between e_i and g_i is defined as:

$$e_i \cap g_i = \{\, e \in e_i \mid \exists\, g \in g_i\ (\mathrm{match}(e, g)) \,\} \tag{2.8}$$

Two of the commonly used F-scores are F1, which weights P and R evenly, and F0.5, which emphasises P twice as much as R:

$$F_1 = 2 \times \frac{P \times R}{P + R} \tag{2.9}$$

$$F_{0.5} = (1 + 0.5^2) \times \frac{P \times R}{(0.5^2 \times P) + R} \tag{2.10}$$

The M2 scorer was the official scorer in the CoNLL 2013 and 2014 shared tasks on grammatical error correction: F1 was used in CoNLL-2013 and F0.5 in CoNLL-2014 (see Section 2.5.2). When building GEC systems, minimising the number of unnecessary corrections is often regarded as more important than covering a large number of errors; coverage is something users are willing to sacrifice as long as the system provides accurate corrections. In other words, high P is often preferred over high R. There is also a strong educational motivation, as flagging correct text as incorrect would cause confusion among learners. This is why F0.5 has recently been preferred when reporting system performance.

However, evaluation methods based on P, R and F-score (e.g. the M2 scorer) do not provide an indicator of improvement on the original text, so there is no way to compare GEC systems against a 'do-nothing' baseline that keeps the input text unchanged: such a baseline will always yield an F-score of 0 by definition. Furthermore, an increase in P, R or F-score does not necessarily mean a reduction in the actual error rate.
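To make Equations 2.5-2.10 concrete, the sketch below computes P, R and F0.5 from per-sentence edit sets, where each edit is represented as a (start, end, correction) tuple. It is a simplified stand-in for the actual M2 scorer, which additionally searches for the system edit sequence that maximises overlap with the gold annotation; matching here is exact.

```python
def f_score(system_edits, gold_edits, beta=0.5):
    """P, R and F_beta over per-sentence edit sets (Equations 2.5-2.7).

    Each element of system_edits / gold_edits is the edit set for one sentence;
    an edit is a (start, end, correction) tuple. Unlike the real M2 scorer,
    matching here is exact rather than optimised over edit sequences.
    """
    tp = sys_total = gold_total = 0
    for e_i, g_i in zip(system_edits, gold_edits):
        tp += len(set(e_i) & set(g_i))      # |e_i ∩ g_i| (Eq. 2.8, exact match)
        sys_total += len(set(e_i))          # |e_i|
        gold_total += len(set(g_i))         # |g_i|
    p = tp / sys_total if sys_total else 0.0
    r = tp / gold_total if gold_total else 0.0
    f = ((1 + beta**2) * p * r / (beta**2 * p + r)) if (p + r) else 0.0
    return p, r, f

# One sentence: the system proposes two edits, one of which matches the gold edit.
system = [[(3, 4, "at"), (6, 7, "a")]]
gold   = [[(3, 4, "at")]]
print(f_score(system, gold))   # P = 0.5, R = 1.0, F0.5 ≈ 0.56
```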
2.4.3 I-measure

The I-measure was designed by Felice and Briscoe (2015) to address problems with previous evaluation methods and to evaluate real improvement on the original sentence after corrections. System performance is first evaluated in terms of weighted accuracy (WAcc), based on a token-level alignment between a source sentence, a system's candidate, and a gold-standard reference. An extended version of the writer-annotator-system evaluation scheme (Chodorow et al., 2012) was adopted, where each token alignment is classified as a true positive (TP), true negative (TN), false positive (FP), false negative (FN), or both an FP and FN (FPN) - see Table 2.3.

Source | Candidate | Reference | Classification
a      | a         | a         | TN
a      | a         | b         | FN
a      | a         | -         | FN
a      | b         | a         | FP
a      | b         | b         | TP
a      | b         | c         | FP, FN, FPN
a      | b         | -         | FP, FN, FPN
a      | -         | a         | FP
a      | -         | b         | FP, FN, FPN
a      | -         | -         | TP
-      | a         | a         | TP
-      | a         | b         | FP, FN, FPN
-      | a         | -         | FP
-      | -         | a         | FN

Table 2.3: The extended writer-annotator-system evaluation scheme proposed by Felice and Briscoe (2015).

WAcc is defined as:

$$\mathrm{WAcc} = \frac{w \cdot \mathrm{TP} + \mathrm{TN}}{w \cdot (\mathrm{TP} + \mathrm{FP}) + \mathrm{TN} + \mathrm{FN} - (w + 1) \cdot \frac{\mathrm{FPN}}{2}} \tag{2.11}$$

where w > 1 is a weight factor.

An Improvement or I-measure (I) score is computed by comparing system performance (WAcc_sys) with that of a baseline that leaves the original text unchanged (WAcc_base):

$$I = \begin{cases} \lfloor \mathrm{WAcc_{sys}} \rfloor & \text{if } \mathrm{WAcc_{sys}} = \mathrm{WAcc_{base}} \\[4pt] \dfrac{\mathrm{WAcc_{sys}} - \mathrm{WAcc_{base}}}{1 - \mathrm{WAcc_{base}}} & \text{if } \mathrm{WAcc_{sys}} > \mathrm{WAcc_{base}} \\[4pt] \dfrac{\mathrm{WAcc_{sys}}}{\mathrm{WAcc_{base}}} - 1 & \text{otherwise} \end{cases} \tag{2.12}$$

Values of I lie in the [-1, 1] interval.[10] Positive values indicate improvement, while negative values indicate degradation. A score of 0 indicates no improvement (i.e. baseline performance), 1 indicates 100% correct text and -1 indicates 100% incorrect text.

As multiple annotations are taken into account, the I-measure is computed after maximising WAcc_sys at the sentence level, so as to ensure all the evaluated hypotheses are paired with their highest scoring references. Trying to maximise the I score directly can yield suboptimal results, as different combinations of WAcc_base and WAcc_sys can produce the same final result (but the one with higher WAcc_sys is clearly preferred).

[10] The I score is often expressed as a percentage.
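To make Equations 2.11 and 2.12 concrete, the sketch below computes WAcc from the classification counts of Table 2.3 and then the I score from the system and baseline accuracies. The token-level alignment that produces these counts is assumed to have been done already, and the weight w = 2 is only an illustrative choice.

```python
import math

def wacc(tp, tn, fp, fn, fpn, w=2.0):
    """Weighted accuracy (Equation 2.11); counts follow the scheme in Table 2.3.

    The counts are assumed to come from an existing token-level alignment of
    source, candidate and reference; w > 1 (2.0 here is an illustrative choice).
    """
    return (w * tp + tn) / (w * (tp + fp) + tn + fn - (w + 1) * fpn / 2.0)

def i_score(wacc_sys, wacc_base):
    """Improvement score (Equation 2.12)."""
    if wacc_sys == wacc_base:
        return float(math.floor(wacc_sys))                    # 0 unless both are already perfect
    if wacc_sys > wacc_base:
        return (wacc_sys - wacc_base) / (1.0 - wacc_base)     # improvement, in (0, 1]
    return wacc_sys / wacc_base - 1.0                         # degradation, in [-1, 0)

# Toy example: ten tokens, two errors; the system corrects one error, misses the
# other and leaves all correct tokens unchanged.
sys_acc = wacc(tp=1, tn=8, fp=0, fn=1, fpn=0)    # ≈ 0.909
base_acc = wacc(tp=0, tn=8, fp=0, fn=2, fpn=0)   # 0.8 (do-nothing baseline)
print(round(i_score(sys_acc, base_acc), 3))       # ≈ 0.545: a genuine improvement
```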
2.4.4 GLEU

Generalized Language Evaluation Understanding (GLEU), proposed by Napoles et al. (2015), is a simple variant of BLEU for GEC which takes the original source into account. GLEU modifies the n-gram precision in BLEU to assign extra weight to n-grams present in the candidate that overlap with the reference but not the source (R\S), and to penalise those in the candidate that are in the source but not the reference (S\R). For a correction candidate C with a corresponding source S and reference R, the modified n-gram precision (p'_n) for GLEU(C, R, S) is defined as:

$$p'_n = \frac{\sum_{\text{n-gram} \in C} \Big( \mathrm{count}_{R \setminus S}(\text{n-gram}) - \lambda\,\mathrm{count}_{S \setminus R}(\text{n-gram}) + \mathrm{count}_{R}(\text{n-gram}) \Big)}{\sum_{\text{n-gram} \in C} \mathrm{count}_{S}(\text{n-gram}) + \sum_{\text{n-gram} \in R \setminus S} \mathrm{count}_{R \setminus S}(\text{n-gram})} \tag{2.13}$$

where the weight λ determines by how much incorrectly changed n-grams are penalised. Given a bag of n-grams B, the counts in Equation 2.13 are collected as:

$$\mathrm{count}_{B}(\text{n-gram}) = \sum_{\text{n-gram}' \in B} d(\text{n-gram}, \text{n-gram}') \tag{2.14}$$

$$d(\text{n-gram}, \text{n-gram}') = \begin{cases} 1 & \text{if n-gram} = \text{n-gram}' \\ 0 & \text{otherwise} \end{cases} \tag{2.15}$$

By updating the n-gram precision in Equation 2.3, GLEU is defined as:[11]

$$\mathrm{GLEU}(C, R, S) = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p'_n\right) \tag{2.16}$$

Similar to BLEU, GLEU works with multiple references at the corpus level. It can be used as a generic evaluation method independent of the annotation scheme, but fails to provide detailed system performance for each error type.

[11] We note that a new version of GLEU appeared this year (Napoles et al., 2016b). However, the scores reported in this thesis were computed using the original GLEU (Napoles et al., 2015) described in this section.
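As an illustration of Equations 2.13-2.16, the sketch below computes the modified n-gram precision p'_n for a single reference; the full GLEU score then combines these precisions with BLEU's geometric mean and brevity penalty as in Equation 2.16. This is a simplified reading of the bag-of-n-grams counts, not the reference implementation released with the metric, and note that, unlike ordinary precision, the reward term in the original 2015 formulation can push p'_n above 1.

```python
from collections import Counter

def ngram_bag(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def gleu_precision(candidate, source, reference, n=2, lam=1.0):
    """Modified n-gram precision p'_n (Equations 2.13-2.15), single reference.

    lam is the penalty weight λ; the value used here is arbitrary.
    """
    c = ngram_bag(candidate.split(), n)
    s = ngram_bag(source.split(), n)
    r = ngram_bag(reference.split(), n)
    r_minus_s = r - s   # rewarded: n-grams in the reference but not the source
    s_minus_r = s - r   # penalised: n-grams in the source but not the reference
    num = sum((r_minus_s[g] - lam * s_minus_r[g] + r[g]) * c[g] for g in c)
    den = sum(s[g] * c[g] for g in c) + sum(r_minus_s.values())
    return num / den if den else 0.0

src = "I look forward to see you"
ref = "I look forward to seeing you"
print(gleu_precision(ref, src, ref))   # corrected candidate: high p'_n
print(gleu_precision(src, src, ref))   # unchanged candidate: much lower p'_n
```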
2.5 Shared tasks on grammatical error correction

In the last few years, four GEC shared tasks have provided a forum for participating teams to compare results on common training and test data. Participants are provided with a fully annotated training set and encouraged to use any publicly available data and tools to build their GEC systems in a few months' time. After that, new blind test data is used to evaluate system performance for the participating teams. Systems are expected to detect grammatical errors in text written by non-native speakers and return corrected versions within a few days of the release of the test data. The organisers then evaluate each system's output and release the final rankings.

2.5.1 HOO 2011 and 2012

The first two shared tasks - Helping Our Own (HOO) 2011 and 2012 - aimed to promote the use of NLP tools and techniques for the development of automated systems that could provide writing assistance to non-native authors in the NLP community (Dale and Kilgarriff, 2011; Dale et al., 2012). In the HOO-2011 shared task, participants were provided with a set of documents extracted from the ACL Anthology[12] written by non-native authors. The task was to automatically detect and correct all errors present in the text. Errors were classified into 13 error types based on the CLC coding system (Nicholls, 2003). Six teams participated in the task, with some achieving top performance by focussing only on a limited number of error types.

Given the difficulty of HOO-2011, the HOO-2012 shared task focussed only on article and preposition errors. The FCE dataset was provided as the official training set. The number of participating teams increased to 14, and most participants built machine learning classifiers. Evaluation in both HOO shared tasks was performed by computing P, R and F-score between a system's edit set and a manually created gold-standard edit set.

[12] A digital archive of research papers in computational linguistics: https://aclweb.org/
2.5.2 CoNLL 2013 and 2014

The next two shared tasks took place in conjunction with the Conference on Computational Natural Language Learning (CoNLL). The CoNLL-2013 shared task (Ng et al., 2013) expanded the scope of HOO-2012 to include three new error types: noun number (Nn), verb form (Vform) and subject-verb agreement (SVA). Together with article (ArtOrDet) and preposition (Prep) errors, this new error list is more comprehensive and also introduces interacting errors. NUCLE v2.3[13] was used as in-domain training data. The test data consists of 50 new essays, which were written in response to two prompts. One prompt was also used in the training set, while the other was new.

Systems were evaluated using the M2 scorer with the max_unchanged_words parameter µ set to 3, as suggested by the organisers (limiting the maximum number of unchanged words to three per edit). Rankings were based on F1, weighting P and R equally. Initially, there was only one set of gold annotations, but since there are often multiple valid corrections for some errors, participating teams were subsequently allowed to propose alternative answers (gold-standard edits). This practice was adopted from the HOO 2011 and 2012 shared tasks. Therefore, there were two rounds of evaluation, the second of which allowed alternative answers. As noted by Ng et al. (2013), these new scores tended to be biased towards the teams which submitted alternative answers. Consequently, to reduce bias, they suggested future evaluation be carried out without alternative answers. In the end, 17 teams participated in CoNLL-2013. Among these teams, a common approach was to build classifiers for each error type. Other approaches included LM, MT and heuristic rules.

The CoNLL-2014 shared task (Ng et al., 2014) tried to once again push the boundaries of GEC by returning to an all-errors correction task. In particular, there were three major changes compared with CoNLL-2013: 1) participating systems were expected to correct grammatical errors of all types; 2) two human annotators annotated the test essays independently; and 3) the evaluation metric was changed from F1 to F0.5, to prioritise P over R. A newer version of the NUCLE corpus - NUCLE v3.0 - was used as official training data. Additionally, a new set of 50 essays written by non-native English speakers was used as blind test data. The CoNLL-2013 test set could be freely used for training and/or development. The M2 scorer was again used as the official scorer.

In total, 13 teams submitted output to CoNLL-2014. Most of them built hybrid systems that combined different approaches. For non-specific error type correction, LM and MT approaches were used, whereas for single error types, rule-based approaches and machine learning classifiers were preferred. We built a phrase-based SMT system (see Chapter 3) which contributed to our final hybrid system submitted to the shared task. Our system achieved the best F0.5 and R on the original annotations.

[13] In NUCLE v2.3, 17 essays were removed from the first release of NUCLE, and the error types Wcip and Rloc were mapped to Prep, Wci, ArtOrDet and Rloc- using POS tags.
2.6 Use of datasets and evaluation metrics in this thesis

In the rest of this thesis, different learner corpora and evaluation metrics are used in different chapters. This is because, according to Chodorow et al. (2012), there is no single best evaluation metric and the usefulness of a metric depends on the application and research goals.

The work in Chapter 3 is presented in the context of the CoNLL-2014 shared task. The NUCLE corpus, as provided by the shared task organisers, is used as in-domain training data, and results are reported on the CoNLL-2014 development set (i.e. the CoNLL-2013 test set) and test set. F0.5 as calculated by the M2 scorer is used as the evaluation measure. Parallel sentences extracted from the publicly available FCE dataset and the IELTS dataset are used as additional training data.

In Chapters 4 and 5, the publicly available FCE dataset is used and results are reported on the FCE test set. The reasons for using the FCE dataset rather than NUCLE are manifold. Firstly, the FCE dataset, as a subcorpus of the CLC, covers a wide variety of L1s and was used in the HOO-2012 error correction shared task. Compared with the NUCLE corpus used in the CoNLL 2013 and 2014 shared tasks, which only contains essays written by undergraduate students at NUS, the FCE dataset is a more representative test set of learner writing. Secondly, the error annotations in the NUCLE corpus are sometimes unreliable and inconsistent. This may introduce noise into final GEC systems and result in underestimations of performance. Thirdly, as described in Section 2.3.2, the CLC is the world's largest learner corpus, and the FCE dataset was annotated using the same annotation scheme as the CLC. In order to make better use of the CLC and avoid any problems caused by inconsistency between different annotation schemes, we use the FCE dataset in our experiments. Results reported on the publicly available FCE dataset can be used for cross-system comparisons.

As discussed in Section 2.4, evaluation methods based on P, R and F-score (e.g. the M2 scorer) do not provide an indicator of improvement on the original text. Given this, and the fact that an increase in P, R or F-score does not necessarily mean a reduction in the actual error rate, we opt to use the I-measure and thus gain a better insight into system performance in terms of improvement on the original text.

We also apply our CLC-trained systems to the CoNLL-2014 shared task development and test sets directly, without using or optimising for NUCLE. Results are reported using BLEU, GLEU, F0.5 (calculated by the M2 scorer) and the I-measure for broader cross-system comparisons.
CHAPTER 3

Building a preliminary SMT-based GEC system

In this chapter, we investigate whether SMT can form the basis of a competitive all-errors GEC system and describe how to build an end-to-end system using the SMT framework. Our SMT method is evaluated in the context of the CoNLL-2014 all-errors correction shared task (see Section 2.5.2), and the results presented in Section 3.5 were previously published in Felice et al. (2014). Additionally, the work on artificial data described in Section 3.3.4.2 was further developed in Felice and Yuan (2014b).

3.1 Statistical machine translation

As one of the first applications envisaged for computers, MT is the process of automatically translating a sentence from one language into another. Several approaches have been proposed, such as word-for-word translation, interlingual approaches (Mitamura et al., 1991), example-based MT (Nagao, 1984; Sato and Nagao, 1990) and SMT. In the last two decades, SMT has become the dominant approach both in the research community and in the commercial sector. The underlying idea is that language is so complex and productive that it could never be fully analysed and distilled into a set of rules. Instead, a machine should try to learn translation mappings automatically from large parallel corpora, by pairing the input and output of the translation process and learning from the statistics over the data, thus removing the need for linguists or language experts. SMT stands out for its low cost and rapid prototyping, while producing state-of-the-art results for many MT tasks.

GEC can be considered a special case of MT where the task is to translate 'bad' English sentences into 'good' ones. The SMT approach to GEC has several advantages. First, an SMT system can deal with multiple error types at the same time without having to (a) classify errors into different types, (b) decide on their boundaries, or (c) combine the results of multiple classifiers, which is often the case in traditional machine learning approaches. Interacting errors are expected to be corrected as well, since SMT systems work at a global rather than a local level (compared with classifiers) and corrections are performed jointly for the entire sentence. As a result, the SMT approach seems more suitable for an all-errors task.
Second, it does not need expert knowledge but only requires a parallel corpus of grammatically incorrect sentences as the source and their corrected versions as the target. Following the principle that 'more data is better data' (Church and Mercer, 1993), previous studies in SMT have shown that the larger the training parallel corpus, the more reliable the translation mappings that can be learnt and the better the translation performance that can be achieved (Koehn et al., 2003; Suresh, 2010; Axelrod et al., 2011). Last but not least, there are well-developed tools for SMT which can be easily adapted for GEC, so we can benefit from state-of-the-art SMT techniques.

An SMT-based GEC system can be modelled as a machine translator. Here, the input (source) is an erroneous English sentence E = e_1 e_2 ... e_m, and the output (target) is a corrected sentence C = c_1 c_2 ... c_l. The erroneous sentence E produced by ESL learners can be regarded as having passed through a noisy channel (Shannon, 1948) and is hence corrupted by noise, i.e. interference from learners' L1s - see Figure 3.1.

[Figure 3.1: A noisy channel model. The source sentence C, with probability P(C), passes through a noisy channel characterised by P(E|C); the receiver observes the corrupted sentence E and recovers an estimate Ĉ.]

The goal is to recover the sentence C based on the corrupt sentence E:

$$\hat{C} = \operatorname*{arg\,max}_{C} P(C|E) = \operatorname*{arg\,max}_{C} \frac{P(E|C)\,P(C)}{P(E)} = \operatorname*{arg\,max}_{C} P(E|C)\,P(C) \tag{3.1}$$

where P(E) in the denominator is ignored since it is a constant across all Cs. The three remaining components that need to be considered are:

• LM: estimates the corrected sentence probability P(C) from a target language corpus;
• TM: estimates the translation (i.e. correction) probability P(E|C) from a parallel corpus;
• decoder: searches for the target sentence C that maximises the product of P(C) and P(E|C).

3.1.1 The language model

One essential component of any SMT system is the LM, which makes sure that the final system outputs fluent English sentences. An LM is a function that takes an English sentence as input and returns the probability that it is a valid English sentence.
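Before going further, a toy illustration of the noisy-channel formulation in Equation 3.1 may help: the sketch below ranks candidate corrections by P(E|C)·P(C), using invented probability tables that stand in for the LM described in this section and the TM described later. It is purely illustrative and is not the phrase-based system built in the rest of this chapter.

```python
import math

# Toy LM: P(C), the probability that C is a fluent English sentence (invented values).
LM = {
    "I look forward to seeing you": 0.05,
    "I look forward to see you":    0.002,
}

# Toy TM: P(E|C), the probability of observing the learner sentence E given the
# intended correct sentence C, i.e. the "noisy channel" (invented values).
TM = {
    ("I look forward to see you", "I look forward to seeing you"): 0.3,
    ("I look forward to see you", "I look forward to see you"):    0.6,
}

def correct(e, candidates):
    """Return the candidate C maximising log P(E|C) + log P(C) (Equation 3.1)."""
    def score(c):
        return math.log(TM.get((e, c), 1e-12)) + math.log(LM.get(c, 1e-12))
    return max(candidates, key=score)

e = "I look forward to see you"
print(correct(e, ["I look forward to see you", "I look forward to seeing you"]))
# -> "I look forward to seeing you": the LM prefers the fluent candidate strongly
#    enough to outweigh the channel model's preference for leaving E unchanged.
```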