Technical Report UCAM-CL-TR-904
ISSN 1476-2986
Number 904
Computer Laboratory
Grammatical error correction
in non-native English
Zheng Yuan
March 2017
15 JJ Thomson Avenue
Cambridge CB3 0FD
United Kingdom
phone +44 1223 763500
http://www.cl.cam.ac.uk/
© 2017 Zheng Yuan
This technical report is based on a dissertation submitted
September 2016 by the author for the degree of Doctor of
Philosophy to the University of Cambridge, St. Edmund’s College.
Technical reports published by the University of Cambridge
Computer Laboratory are freely available via the Internet:
http://www.cl.cam.ac.uk/techreports/
ISSN 1476-2986
Grammatical error correction in non-native English
Zheng Yuan
Grammatical error correction (GEC) is the task of automatically correcting
grammatical errors in written text. Previous research has mainly focussed on in-
dividual error types and current commercial proofreading tools only target limited
error types. As sentences produced by learners may contain multiple errors of dif-
ferent types, a practical error correction system should be able to detect and correct
all errors.
In this thesis, we investigate GEC for learners of English as a Second Language
(ESL). Specifically, we treat GEC as a translation task from incorrect into correct
English, explore new models for developing end-to-end GEC systems for all error
types, study system performance for each error type, and examine model generali-
sation to different corpora. First, we apply Statistical Machine Translation (SMT)
to GEC and prove that it can form the basis of a competitive all-errors GEC sys-
tem. We implement an SMT-based GEC system which contributes to our winning
system submitted to the CoNLL-2014 shared task. Next, we propose a ranking model to
re-rank correction candidates generated by an SMT-based GEC system. This model
introduces new linguistic information and we show that it improves correction qual-
ity. Finally, we present the first study using Neural Machine Translation (NMT)
for GEC. We demonstrate that NMT can be successfully applied to GEC and help
capture new errors missed by an SMT-based GEC system.
While we focus on GEC for English, the methods presented in this thesis can be
easily applied to any language.
Acknowledgements

First and foremost, I owe a huge debt of gratitude to my supervisor, Ted Briscoe,
who has patiently guided me through my PhD and always been very helpful, un-
derstanding and supportive. I cannot thank him enough for providing me with
opportunities that helped me grow as a researcher and a critical thinker.
I am immensely grateful to my examiners, Paula Buttery and Stephen Pulman,
for their thorough reading of my thesis, their valuable comments and an enjoyable
viva. My appreciation extends to my fellow members of the Natural Language and
Information Processing research group, with whom I have always enjoyed discussing
our work and other random things. My gratitude goes to Stephen Clark and Ann
Copestake for giving me early feedback on my work as well as Christopher Bryant for
generously reading my thesis draft. I would especially like to thank Mariano Felice
for being not just a great colleague but also a dear friend. A special mention has to
be given to Matthew Purver, who got me interested in this field in the first place.
I am thankful to Cambridge Trust and China Scholarship Council for funding my
research, making it possible for me to pursue a doctoral degree in Cambridge. I am
also grateful to the Computer Laboratory and St. Edmund’s College for supporting
my conference attendance.
Finally, I would like to express my heartfelt thanks to my family and friends.
Special thanks go to Hui Xiao and Mo Jia for always being there whenever I need
them. Words are powerless to describe my appreciation and gratitude to my par-
ents, Xuewen Yuan and Yun Zeng, for all the sacrifices that they have made on my
behalf. Their love and support have sustained me thus far and I know will continue
to sustain me.
Contents

1 Introduction 13
1.1 What is grammatical error correction? . . . . . . . . . . . . . . . . . 14
1.2 Thesis aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Background 19
2.1 Early approaches to grammatical error correction . . . . . . . . . . . 19
2.2 Machine translation and error correction . . . . . . . . . . . . . . . . 22
2.2.1 Statistical machine translation . . . . . . . . . . . . . . . . . . 22
2.2.2 Candidate re-ranking . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.3 Neural machine translation . . . . . . . . . . . . . . . . . . . . 26
2.3 Learner corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 NUCLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.2 CLC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.2.1 FCE examination scripts . . . . . . . . . . . . . . . . 30
2.3.2.2 IELTS examination scripts . . . . . . . . . . . . . . . 30
2.4 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.1 BLEU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.2 M2 scorer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.3 I-measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.4 GLEU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 Shared tasks on grammatical error correction . . . . . . . . . . . . . . 35
2.5.1 HOO 2011 and 2012 . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.2 CoNLL 2013 and 2014 . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Use of datasets and evaluation metrics in this thesis . . . . . . . . . . 37
3 Building a preliminary SMT-based GEC system 39
3.1 Statistical machine translation . . . . . . . . . . . . . . . . . . . . . . 39
3.1.1 The language model . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.1.1 N-gram language model . . . . . . . . . . . . . . . . 41
3.1.1.2 Kneser-Ney smoothing . . . . . . . . . . . . . . . . . 42
3.1.1.3 Modified Kneser-Ney smoothing . . . . . . . . . . . . 43
3.1.2 The translation model . . . . . . . . . . . . . . . . . . . . . . 44
3.1.2.1 IBM Models 1-5 . . . . . . . . . . . . . . . . . . . . 45
3.1.2.2 Phrase-based models . . . . . . . . . . . . . . . . . . 47
3.1.3 The reordering model . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.4 The decoder . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Challenges in applying SMT to GEC . . . . . . . . . . . . . . . . . . 50
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.1 Experimental set-up . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.2 Translation models . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.3 Language models . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.4 Increasing the size of the training set . . . . . . . . . . . . . . 55
3.3.4.1 Adding learner data . . . . . . . . . . . . . . . . . . 56
3.3.4.2 Adding artificial data . . . . . . . . . . . . . . . . . 56
3.3.4.3 Adding short parallel phrases . . . . . . . . . . . . . 57
3.3.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.5 A new method for building a phrase table . . . . . . . . . . . 60
3.3.6 Forced decoding for phrase table filtering . . . . . . . . . . . . 63
3.4 An end-to-end SMT-based GEC system . . . . . . . . . . . . . . . . . 64
3.4.1 System performance . . . . . . . . . . . . . . . . . . . . . . . 64
3.4.2 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4.2.1 Type performance . . . . . . . . . . . . . . . . . . . 67
3.4.2.2 Sequential errors . . . . . . . . . . . . . . . . . . . . 70
3.4.2.3 Missed errors . . . . . . . . . . . . . . . . . . . . . . 71
3.5 Results in the CoNLL-2014 shared task . . . . . . . . . . . . . . . . . 73
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4 Candidate re-ranking 77
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Feature space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.1 SMT feature set . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.1.1 Decoder’s scores . . . . . . . . . . . . . . . . . . . . 81
4.3.1.2 N-best list ranking information . . . . . . . . . . . . 81
4.3.2 Language model feature set . . . . . . . . . . . . . . . . . . . 81
4.3.2.1 LM features . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.2.2 ALM features . . . . . . . . . . . . . . . . . . . . . . 82
4.3.3 Statistical word lexicon feature set . . . . . . . . . . . . . . . 82
4.3.4 Levenshtein distance feature set . . . . . . . . . . . . . . . . . 83
4.3.5 Length feature set . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3.6 Syntactic vs. non-syntactic . . . . . . . . . . . . . . . . . . . . 84
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Experimental set-up . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.2 SMT system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.3 SVM re-ranker . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4.3.1 Assigning gold labels . . . . . . . . . . . . . . . . . . 86
4.4.3.2 The feature set impact . . . . . . . . . . . . . . . . . 87
4.4.4 Oracle score . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.5 Benchmark results . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.5.1 MBR re-ranking . . . . . . . . . . . . . . . . . . . . 89
4.4.5.2 MEMT candidate combination . . . . . . . . . . . . 90
4.4.5.3 Results . . . . . . . . . . . . . . . . . . . . . 92
4.5 Analysis and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.5.1 Results on the CoNLL-2014 shared task development set . . . 93
4.5.2 Results on the CoNLL-2014 shared task test set . . . . . . . . 95
4.6 Recent work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5 Neural machine translation for GEC 99
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Neural machine translation . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2.1 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . 101
5.2.2 Encoder-decoder . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2.3 Training an NMT system . . . . . . . . . . . . . . . . . . . . . 108
5.3 Handling rare words . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4.1 Experimental set-up . . . . . . . . . . . . . . . . . . . . . . . 111
5.4.2 Training details . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4.3 NMT models . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.4.4 Sentence length . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.4.5 Beam size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.6 Vocabulary size . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.7 UNK replacement . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5 Analysis and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.5.1 Results on the CoNLL-2014 shared task development set . . . 118
5.5.2 Results on the CoNLL-2014 shared task test set . . . . . . . . 119
5.6 Recent work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6 Conclusion 123
A NUCLE error codes 127
B CLC error taxonomy 129
Bibliography 145
List of Abbreviations
ALM adaptive language model
BiRNN Bidirectional Recurrent Neural Network
BLEU Bilingual Evaluation Understudy
BNC British National Corpus
CE correct edit
CLC Cambridge Learner Corpus
CNN Convolutional Neural Network
CoNLL Conference on Computational Natural Language Learning
EM Expectation-Maximisation
ESL English as a Second Language
EVP English Vocabulary Profile
FCE First Certificate in English
FN false negative
FP false positive
GEC grammatical error correction
GLEU Generalized Language Evaluation Understanding
GPU graphics processing unit
GR grammatical relation
GRU Gated Recurrent Unit
HMM Hidden Markov Model
HOO Helping Our Own
I I-measure
IELTS International English Language Testing System
ILP Integer Linear Programming
ITG Inversion Transduction Grammar
L1 first language
L2 second language
LM language model
LSTM Long Short-Term Memory
MBR Minimum Bayes-Risk
ME missed edit
MEMT Multi-Engine Machine Translation
MERT Minimum Error Rate Training
MIRA Margin Infused Relaxed Algorithm
MT machine translation
NB Naïve Bayes
NLP Natural Language Processing
NLTK Natural Language Toolkit
NMT Neural Machine Translation
NP noun phrase
NUCLE National University of Singapore Corpus of Learner English
NUS National University of Singapore
OOV out-of-vocabulary
P precision
POS part-of-speech
R recall
RASP Robust Accurate Statistical Parsing
RBMT Rule-Based Machine Translation
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
SMT Statistical Machine Translation
SVM Support Vector Machine
TM translation model
TN true negative
TP true positive
UE unnecessary edit
WAcc weighted accuracy
WMT Workshop on Statistical Machine Translation
CHAPTER 1

Introduction
Today, from Beijing to Brasilia, millions of people are learning English as a Second
Language (ESL). According to a report published by the British Council in 2013,
English is spoken at a ‘useful level’ by 1.75 billion people worldwide. In fact, non-
native English speakers now outnumber native speakers. Furthermore, the number
of ESL learners keeps on growing and it is estimated that 2 billion people will
be using English - or learning to use it - by 2020. Nevertheless, learning a new
language is never easy. Difficulties in acquiring a new language can be due to the
differences between the new language and the learners’ first languages (L1s) (Lado,
1957). These differences may result in various kinds of errors in learner writing.
Errors made by learners are different from those made by native speakers. Connors
and Lunsford (1988) studied errors made by college students in the United States
and compiled an error list ranked by frequency. Their work was later replicated
by Donahue (2001) with a focus on ESL learners. Results showed that half of the ten
most frequent error types made by native speakers were ‘negligible’ in ESL writings.
There has been a great deal of commercial and academic interest in automatically
correcting these written errors for ESL learners. From a commercial perspective,
there is a great potential for many practical applications, such as proofreading tools
that help second language (L2) speakers identify and correct their writing errors
without human intervention or educational software for automated language learning
and assessment. From a research perspective, correcting errors in learner writing is
an interesting and challenging task as it involves various aspects of Natural Language
Processing (NLP), such as language modelling, syntax and semantics.
Early grammar checkers can be traced back to the 1980s, when hand-coded
grammar rules were mostly used. However, due to the productive nature of language
and the creativity of learners, it is impractical to define rules for every possible
case. With the advent of large-scale annotated corpora in the 1990s, data-driven
approaches made it possible to build systems for specific error types. Nevertheless,
popular commercial proofreading tools only target a few error types that are easy to
correct, such as spelling mistakes (a *baeutiful/beautiful girl) or wrong past participle
forms of irregular verbs (Dave has *runned/run 42 marathons), and do not include
those aspects of English that are harder to learn. At the same time, most research in
the area has focussed on two common error types made by learners, namely articles
(Mary’s sister is */a hairdresser) and prepositions (the best places to visit *on/in
July), assuming that there is only one error per sentence.
However, errors do not always occur in isolation. Sentences produced by learners
may contain multiple errors which belong to different error types. What is worse,
errors may interact with others so that the correction of one error requires the
correction of the other. See the following example sentences written by ESL learners:
Example 1.1. I am plece to tell the information do you need for the group.
The sentence contains three errors: a spelling mistake (plece → pleased), a wrong verb
(tell → provide) and an unnecessary verb (do).
Example 1.2. As you know, it is not suitable to wear a jean.
The sentence contains two interacting errors: ‘a’ should be deleted and ‘jean’ should
be changed to ‘jeans’ at the same time (a jean → jeans).
An error correction system that can only correct one or a few types of errors will
be of limited use to learners. Instead, a good system should be able to correct a
variety of error types and corrections should be performed at a global rather than
local level, including taking interacting errors into account. Our goal in this thesis is
to develop robust error correction systems that can automatically detect and correct
all errors present in learner text, trying to overcome the aforementioned limitations.
The error correction task can be thought of as a type of monolingual ‘translation’,
where the source is a sentence written by a learner and the target is a fluent and ade-
quate sentence in the same language. A corrected sentence should be grammatically
correct and preserve the original meaning of the source.
Rather than building individual components for each error type, we apply the
machine translation (MT) approach of ‘translating’ a grammatically incorrect sen-
tence into a correct one to address all error types simultaneously. The MT approach
takes advantage of large annotated learner data. Systems learn correction mappings
from data and use them to generate a corrected version of the original sentence,
correcting as many errors as possible. Our work investigates MT methods for cor-
recting grammatical errors in non-native English text and addresses issues arising
from applying existing MT techniques to the error correction task. We further iden-
tify new techniques for developing robust error correction systems that outperform
previous approaches.
While English is by far the most spoken foreign language in the world, there is
also a need for grammar checkers for other languages, such as Chinese, Spanish and
Arabic. Although we focus only on English in this thesis, the methods described
here can be applied to any language given appropriate data.
1.1 What is grammatical error correction?
Grammatical error correction (GEC) is the task of automatically correcting gram-
matical errors in written text. More specifically, the task is to build a system that
takes an input text, analyses the context of the text to identify and correct any
grammatical errors, and finally returns a corrected version that retains the origi-
nal meaning. If there is no error in the input text, the system should output the
text without any modification. In this thesis, we focus on grammatical errors in
non-native English text.
It should be noted that not all errors present in learner text are grammatical
errors, however. Errors were traditionally identified at five levels: 1) a lexical level,
2) a syntactic level, 3) a semantic level, 4) a discourse structure level, and 5) a
pragmatic level (Kukich, 1992). Lexical errors are spelling mistakes that result
in non-existent words, such as misspelling ‘type’ as ‘tipe’, where ‘tipe’ is not a
legitimate word in English. Errors where the syntactic categories of the words do
not fit their contexts are classified as syntactic errors, such as subject-verb agreement
errors (she always *know/knows her place) or verb tense errors (the church *is/was
rebuilt in 1948 ). Errors that cause semantic anomalies are semantic errors, which
involve contextual spelling mistakes that result in legitimate words (we waited for
twenty *minuets/minutes) and collocation/cooccurrence errors (*big conversation)
(Kochmar, 2016). Discourse errors violate the inherent coherence relations in a text
while pragmatic errors reflect some anomalies related to the goals and plans of the
discourse participants. Correcting errors from the last two groups requires further
discourse analysis. In this thesis, we use the broad term ‘grammatical error’ to
refer only to lexical, syntactic and semantic errors, but do not tackle discourse and
pragmatic errors whose ‘span’ goes beyond the sentence.
1.2 Thesis aims
The work presented in this thesis aims to:
1. Develop end-to-end error correction systems that are capable of correcting
grammatical errors present in text written by learners of English. As sentences
produced by ESL learners may contain multiple errors which belong to different
error types, we aim to develop robust systems that can automatically detect
and correct a variety of error types and perform corrections at a global rather
than local level, where interacting errors are covered as well.
2. Explore the use of several statistical NLP approaches for GEC:
(a) Can SMT form the basis of a competitive all-errors GEC sys-
tem? Statistical Machine Translation (SMT) has been successfully used
to correct a limited number of grammatical errors in the past (see Brock-
ett et al., 2006; Yuan and Felice, 2013), so we aim to investigate whether
the same approach can be used to correct multiple grammatical errors at
the same time.
(b) Can candidate re-ranking improve sentence quality in SMT-
based GEC? Since SMT was not originally designed for GEC, many
standard features do not perform well on the error correction task. It is
therefore necessary to add new local and global features to help the SMT
decoder distinguish good from bad corrections. We propose a Support
Vector Machine (SVM) ranking model to re-rank candidates generated
by an SMT-based GEC system. We aim to determine whether candidate
re-ranking is a viable approach to address the decoding problem in this
scenario and thus improve sentence quality.
(c) Can NMT be applied to GEC? Typical SMT-based GEC systems
suffer from data sparsity. Some errors are not covered by these systems
because the mappings needed for correction have not been seen in the
training data. With the recent advances in neural networks, Neural Ma-
chine Translation (NMT) seems appealing for GEC as it may be possible
to correct erroneous phrases and sentences that have not been seen in the
training set more effectively. We investigate NMT systems and how they
can be applied to GEC in order to capture new errors without the need
for additional training data.
3. Examine and address issues concerning applying existing techniques to GEC.
As we approach GEC as a special translation task, where the source and
target sentences are both in English but the source may contain grammatical
errors, it is inevitable that new problems may arise from adapting existing MT
techniques to GEC. We discuss these problems and propose possible solutions.
4. Investigate system performance for each error type. Type-specific performance
helps understand the strengths and weaknesses of the system, as well as iden-
tify areas for future improvement. However, this is not easy to do for all-errors
GEC systems which propose corrections without error types. We apply a type
estimation strategy and present detailed error analyses.
5. Examine model generalisation to different learner corpora. It is not the aim of
this thesis to beat the state-of-the-art result on one particular dataset (e.g. the
CoNLL-2014 shared task test set - see Section 2.5.2). Instead, we are more
interested in models that can consistently produce competitive results across
different learner corpora without retraining or tuning for new datasets or GEC
tasks. For this reason, we test model generalisation and compare the results
with those from other models which are trained and tuned specifically for each
corpus.
1.3 Thesis structure
The structure of this thesis is as follows. Chapter 2 discusses several related topics in
GEC. It begins with an overview of the automated approaches to detect and correct
errors made by learners and goes on to describe the MT approach to error correction.
Additionally, it gives a description of the learner corpora and automatic evaluation
metrics for GEC, followed by a summary of a series of shared tasks on GEC. It
concludes with a discussion of the datasets and evaluation metrics used in this thesis.
Chapter 3 describes our approach to building a preliminary SMT-based error
correction system. We address the major issues that arise from applying standard
SMT to GEC. We explore different types of translation models (TMs), language
models (LMs) and alignment methods used in an SMT system. To overcome the lack
of training data, we propose the use of three different types of data and demonstrate
how they can help build robust SMT models. We also investigate phrase table
filtering. We present an SMT system that forms one half of our winning system
submitted to a shared task on grammatical error correction in 2014. A detailed
error analysis of the SMT-based GEC system is also performed.
In Chapter 4, we propose a supervised ranking model to re-rank candidates
generated by an SMT-based GEC system. A range of novel features with respect
to error correction are investigated and implemented in our re-ranker. An in-depth
assessment of the role played by each feature type is carried out, quantifying its
contribution from a statistical perspective. We also investigate the performance of
different re-ranking techniques and find that our proposed model clearly outperforms
the other two, showing its effectiveness in re-ranking candidates for GEC.
Chapter 5 presents the first study on NMT for GEC, in an attempt to ameliorate
the lack of training data for SMT-based GEC systems. Problems from adapting
standard NMT to GEC are addressed. The performance of different NMT models
on the error correction task is investigated. We also propose a two-step approach to
address the ‘rare word’ problem in NMT for GEC and demonstrate how it can help
provide a substantial improvement in system performance.
Finally, Chapter 6 concludes this thesis and discusses some avenues for possible
future research.
CHAPTER 2

Background
There is a large body of work on grammatical error detection and correction. This
chapter puts the present research in context by offering an overview of the latest research
in the field. A more comprehensive survey of automated grammatical error detection
for language learners can be found in the book by Leacock et al. (2014).
2.1 Early approaches to grammatical error correction
Early attempts at automated error correction employed hand-coded rules. The first
widely used grammar checking tools, such as the Writer’s Workbench (MacDon-
ald et al., 1982), were based on simple pattern matching and string replacement.
Other rule-based systems incorporated syntactic analysis and used manually de-
veloped grammar rules. For example, both Grammatik from Aspen Software and
GramCheck (Bustamante and León, 1996) relied on basic linguistic analysis, while
IBM’s Epistle (Heidorn et al., 1982) and Critique (Richardson and Braden-Harder,
1988) performed full syntactic analysis. Rule-based approaches are generally easy
to implement for some types of errors and can be very effective. This is why they
are still widely used by existing grammar checking systems. However, rules can be-
come impractical for some complex errors and unmanageable with time. The highly
productive nature of language makes it impossible to define rules for every potential
error. So rule-based approaches are often avoided as a general solution.
With the advent of large-scale annotated resources in the 1990s, researchers
moved to data-driven approaches and applied machine learning techniques to build
classifiers for specific error types (Knight and Chander, 1994; Han et al., 2004;
Chodorow et al., 2007; De Felice and Pulman, 2007; Tetreault and Chodorow, 2008;
Rozovskaya and Roth, 2011; Dahlmeier et al., 2012). Most work using machine
learning classifiers has focussed on two error types: articles and prepositions. This
is due to the fact that these errors are some of the most common and challenging ones
for ESL learners, and are also easier to tackle using machine learning approaches
than hand-crafted rules (Felice and Yuan, 2014a). For these closed-class errors, a
finite confusion set or candidate set including all the possible correction candidates
is defined, such as a list of articles or prepositions in English. Training examples -
native and/or learner data - are represented as vectors of linguistic features that are
considered useful for the error type. Possible features often include neighbouring
words, part-of-speech (POS) tags, grammatical relations (GRs) and dependency
trees. Various machine learning algorithms are used to train classifiers based on these
features. Once a system has been trained, new errors are detected and corrected
by comparing the original word used in the text with the most likely candidate
predicted by the classifier. Since the most useful features often depend on the word
class, it is necessary to build separate classifiers for each error type. Han et al. (2004)
trained a maximum entropy classifier to detect article errors on a large diverse corpus
and achieved an accuracy of 88%. Tetreault and Chodorow (2008) used maximum
entropy models to correct errors for 34 common English prepositions in learner text.
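To make the confusion-set formulation concrete, the following sketch trains a toy article classifier. Logistic regression stands in for the maximum entropy models cited above, and the feature templates, confusion set and training sentences are invented for illustration rather than taken from any of the systems discussed.

# Toy confusion-set classifier for article errors (illustrative only).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

CONFUSION_SET = ["a", "an", "the", ""]   # "" means no article

def context_features(tokens, i):
    # Features drawn from the words surrounding the article slot at position i.
    return {
        "prev_word": tokens[i - 1] if i > 0 else "<s>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
        "next_next_word": tokens[i + 2] if i + 2 < len(tokens) else "</s>",
    }

# (tokenised sentence, article slot index, correct article)
train = [
    (["she", "is", "a", "hairdresser"], 2, "a"),
    (["I", "saw", "the", "film", "yesterday"], 2, "the"),
    (["he", "plays", "", "soccer", "daily"], 2, ""),
]
vec = DictVectorizer()
X = vec.fit_transform([context_features(tokens, i) for tokens, i, _ in train])
y = [label for _, _, label in train]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# At correction time, the writer's choice is compared with the predicted article.
tokens, slot = ["she", "is", "an", "hairdresser"], 2
predicted = clf.predict(vec.transform([context_features(tokens, slot)]))[0]
if predicted != tokens[slot]:
    print("suggest replacing '%s' with '%s'" % (tokens[slot], predicted))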
Errors made by ESL learners often depend on their L1s (Lee and Seneff, 2008).
Systems perform much better when information about their L1s is included. Ro-
zovskaya and Roth (2011) compared four linear machine learning classifiers for cor-
recting preposition errors. Results showed that discriminative classifiers perform the
best and adaptation to a writer’s L1 further improves performance. The authors
proposed a way of integrating language-specific priors at decision time using Naïve
Bayes (NB) models instead of training separate classifiers for each L1.
The weakness of approaches based on ‘classification by error type’ is that they
only rely on local context and treat errors independently, assuming that there is only
one error in the context and all the surrounding information is correct. However,
sentences produced by learners may contain a complex combination of several types
of errors which may further interact. An error correction system that only corrects
one type of error is of limited use to language learners in practical applications.
A commonly used solution is to build multiple classifiers and then cascade them
into a pipeline system. A combination of classifier-based and rule-based steps is
often used to build systems that correct multiple errors (Dahlmeier et al., 2012;
Rozovskaya et al., 2013). This kind of solution is complex and laborious: several
pre-processing and post-processing steps are required, and the order of classifiers also
matters. Additionally, it does not solve the problem of interacting errors and pre-
dictions from independent classifiers may be inconsistent. Here is a typical example
taken from Rozovskaya and Roth (2013):
Example 2.1. ... electric cars is still regarded as a great trial innovation ...
Predictions made by a system that combines independently-trained classifiers: cars is
→ car are.
Several approaches have been proposed to address the problem of interacting
errors. Rather than making decisions independently, Dahlmeier and Ng (2012a) de-
veloped a beam-search decoder to iteratively generate sentence-level candidates and
score them using individual classifiers and a general LM. Five proposers were used to
generate new candidates by making five types of changes: spelling, articles, prepo-
sitions, punctuation insertion, and noun number. Results appeared promising and
the decoder outperformed a pipeline system of individual classifiers and rule-based
steps. However, their decoder only provides corrections for five error types and new
proposers need to be added into the system in order to cover more errors, some of
which might not be easy to design. Furthermore, the number of candidates grows
exponentially with the number of error types being considered (i.e. the number of proposers)
and the sentence length. As it is infeasible to enumerate all candidates, building an
efficient decoder becomes a problem. Wu and Ng (2013) proposed a joint inference
model to resolve inconsistencies produced by individual classifiers. Integer Linear
Programming (ILP) was used to incorporate the output of individual classifiers and
a list of linguistic constraints. These constraints were manually defined and explic-
itly encoded into the system. Any new constraints need to be hand-coded for new
types of interacting errors. Rozovskaya and Roth (2013) built two joint classifiers to
address two linguistic structures: subject-verb and article-NPhead.1 For each of the
structures, rather than using two classifiers independently, a joint classifier simul-
taneously predicts two words that are part of the same structure. Unlike the ILP
model proposed by Wu and Ng (2013), the joint classifier does not need human de-
fined constraints, as it can learn from the training data directly. However, it is more
difficult to collect enough pairs of candidates that form the relevant structures to
use as training data. As one joint classifier only targets one type of interacting error,
new classifiers need to be built for every new type of interaction. These classifier-
based approaches still use scores from individual classifiers, so it becomes prohibitively
time-consuming to train individual classifiers for all types of (interacting) errors.
A more general approach for correcting multiple errors in ESL text is to use
n-gram LMs (Gamon et al., 2008; Gamon, 2011). A single model is trained on a
large number of correct sentences and then used to assign probabilities to sequences
of words based on counts from the training data. Within this framework, the target
word sequence is substituted for alternatives from a precompiled candidate set and
the LM scores for the original text as well as the alternatives are computed. The
sequence with the highest probability is chosen as the correct one. Ideally, correct
word sequences will get high probabilities while incorrect or unseen ones will get low
probabilities. Errors are assumed to occur in parts of a sentence where a low score
is assigned. However, no matter how large a training corpus is, it is impossible to
cover all possible correct word sequences in practice. Another problem lies in how
to distinguish low-frequency word combinations from erroneous ones. Therefore,
the LM approach is commonly used in addition to other approaches, especially to
rank correction suggestions proposed by other models. Gamon et al. (2008) used
a LM in addition to machine learning classifiers and combined them using a meta-
classifier. Dahlmeier and Ng (2012a) used a LM in combination with classifiers to
score correction candidates in a beam-search decoder.
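The substitution-and-rescoring idea can be sketched in a few lines. The bigram model below is a toy stand-in for the large LMs used in the work cited above: the corpus, add-alpha smoothing and confusion set are invented for illustration, and only the selection logic mirrors the approach just described.

# Pick the candidate from a confusion set that gives the highest LM score.
from collections import Counter
import math

corpus = [
    "we waited for twenty minutes".split(),
    "the meeting lasted twenty minutes".split(),
    "he waited for the bus".split(),
]
unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter(pair for s in corpus for pair in zip(s, s[1:]))

def log_prob(sentence, alpha=1.0):
    # Add-alpha smoothed bigram log-probability of a token sequence.
    v = len(unigrams)
    return sum(math.log((bigrams[(prev, word)] + alpha) /
                        (unigrams[prev] + alpha * v))
               for prev, word in zip(sentence, sentence[1:]))

def correct_word(sentence, position, confusion_set):
    # Substitute each candidate into the slot and keep the best-scoring sequence.
    scored = [(log_prob(sentence[:position] + [cand] + sentence[position + 1:]),
               cand) for cand in confusion_set]
    best_cand = max(scored)[1]
    return sentence[:position] + [best_cand] + sentence[position + 1:]

print(correct_word("we waited for twenty minuets".split(), 4,
                   ["minuets", "minutes"]))
# -> ['we', 'waited', 'for', 'twenty', 'minutes']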
Additionally, some efforts have been made to tackle learner errors that are par-
ticularly difficult to detect and correct. Rozovskaya et al. (2014b) proposed a lin-
guistically motivated approach to verb error correction. Their model integrated a
machine learning approach with a rule-based system that first identifies verb can-
didates in noisy learner text and then makes use of verb finiteness information to
identify errors and characterise the type of mistake. Xue and Hwa (2014) developed
1 article-NPhead: the interaction between the head of the noun phrase (NP) and the article that refers to the NP
a computational model for redundancy detection in ESL writings. They proposed a
measure to assign high scores to words and phrases that are likely to be redundant
within a given sentence by comparing an ESL sentence with the output from off-
the-shelf MT systems. For content word combinations, Kochmar (2016) performed
error detection in adjective-noun and verb-object combinations in learner data using
compositional distributional semantic models.
2.2 Machine translation and error correction
A practical error correction system should be able to correct various types of errors
made by ESL learners. In more recent research, MT techniques have been used to
successfully correct a broader set of errors.
MT algorithms automatically translate text from a source language into a target
language. Error correction thus can be seen as a special translation problem from
grammatically incorrect sentences into correct ones. Unlike in standard MT tasks,
the source and target sentences are both in the same language but the source may
contain grammatical errors. MT-based GEC systems learn correction mappings
from parallel examples and use these mappings to generate a corrected version of
the original (erroneous) sentence, correcting as many errors as possible.
2.2.1 Statistical machine translation
SMT, as the dominant MT approach in the last two decades, employs statistical
models estimated from parallel corpora (i.e. source-target pairs) and monolingual
corpora (i.e. target sentences) to transform text from one language to another.2
Brockett et al. (2006) first proposed the use of an SMT model for correcting mass/count
noun errors made by learners of English. A list of 14 mass nouns was compiled
using dictionaries and the Chinese Learner English Corpus (Gui and Yang, 2003).
An SMT system requires millions of examples of correct and incorrect usage to learn
reliable translation mappings. Given that examples of correct usage are plentiful in
native data while parallel examples of incorrect usage are much more difficult to
collect, the authors transformed well-formed edited English sentences into mostly
ungrammatical strings by introducing artificial mass noun errors. Hand-constructed
regular expressions were used to make sure the generated strings exhibited charac-
teristics of the learner corpus. A phrase-based SMT system was built using word
alignments produced by GIZA++ (Och and Ney, 2003). Their SMT system suc-
cessfully corrected 61.8% of mass noun errors from a set of 123 examples of incorrect
usage. As noted by Leacock et al. (2014), this was only a first exploration of SMT
techniques for GEC, but with enough training data, such a system could potentially
be powerful enough to detect and correct errors that involve more than just the
insertion, deletion or substitution of single words, as well as being able to provide
stylistic writing assistance to ESL learners.
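The following sketch illustrates this kind of artificial error injection: well-formed sentences are corrupted with hand-written regular expressions so that they resemble learner errors. The noun list and rewrite rules are invented for the example and are not the patterns used by Brockett et al. (2006).

# Inject artificial mass-noun errors into well-formed text (illustrative rules).
import re

MASS_NOUNS = ["information", "advice", "equipment"]

def corrupt(sentence):
    for noun in MASS_NOUNS:
        # "some information" -> "some informations" (spurious plural)
        sentence = re.sub(r"\b(some|much) %s\b" % noun, r"\1 %ss" % noun, sentence)
        # bare "information" -> "an information" (spurious indefinite article)
        sentence = re.sub(r"\b(?<!an )(?<!the )%s\b" % noun, "an %s" % noun, sentence)
    return sentence

print(corrupt("He gave me some information about the course."))
# -> "He gave me some informations about the course."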
2 SMT algorithms are described in more detail in Chapter 3.
Mizumoto et al. (2011) applied the same SMT techniques to Japanese error
correction but improved them by considering a wider set of error types and training
on a large-scale real-world dataset. Rather than transforming correct sentences into
grammatically incorrect strings, they extracted real examples from the language
learning social network website Lang-8.3 Moses (Koehn et al., 2007) was used as a
decoder and GIZA++ as an alignment tool. Evaluation was based on a character-
level version of the Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002),
a popular metric for automatic MT evaluation. Mizumoto et al. (2012) extended
their work to English and investigated the effect of training corpus size on various
types of grammatical errors. Their results showed that a phrase-based SMT system
is effective at correcting errors that can be identified by a local context, but less
effective at correcting errors that need long-range contextual information.
Yuan and Felice (2013) trained a POS-factored SMT system to correct five types
of errors in learner text for the CoNLL-2013 shared task on grammatical error cor-
rection (Ng et al., 2013) (see Section 2.5.2). These five error types involve articles,
prepositions, noun number, verb form, and subject-verb agreement. Since the lim-
ited in-domain training data was insufficient to train an effective SMT system, we
explored alternative ways of generating pairs of incorrect and correct sentences au-
tomatically from other existing learner corpora. We also proposed several modifica-
tions to address issues that affect system performance, like disabling word reordering
and removing incorrect alignment mappings from the phrase table used by the SMT
decoder. Although our SMT approach did not yield particularly high performance
compared to other teams using machine learning classifiers, it nevertheless revealed
the potential of using SMT as a general approach for correcting multiple
error types and interacting errors simultaneously. The version of the corpus used
for the shared task only includes five error types and discards all the remaining cor-
rections, resulting in some broken or partly-corrected sentences. These ill-formed
sentences are particularly harmful for SMT-based systems which, unlike classifiers,
work at a global rather than local level. As a result, many corrections proposed by
our SMT system were considered incorrect because they did not belong to any of
the five target error types. This showed that the SMT approach seems more suitable
for an all-errors task rather than a constrained error correction task.
In the CoNLL-2014 shared task (Ng et al., 2014) (see Section 2.5.2), the top
performing systems demonstrated that the SMT framework can yield state-of-the-art
performance on an all-errors correction task. Our winning system (Felice et al., 2014)
is a pipeline of a rule-based system and a phrase-based SMT system (see Chapter 3).
The SMT system was trained on parallel sentences and short phrase alignments
extracted from fully annotated learner corpora (see Section 2.3). Word alignment
was carried out using Pialign (Neubig et al., 2011). As most words translate into
themselves and some errors are often similar to their correct forms, we introduced
character-level Levenshtein distance (Levenshtein, 1966), which captures the number
of edit operations required to change the source phrase into the target phrase. The
10-best correction candidates produced by the SMT system were then re-ranked
using Microsoft’s Web N-gram Services, which provide access to large smoothed n-
3 http://lang-8.com
gram LMs built from English web documents containing trillions of tokens (Gao
et al., 2010). Corrections were finally filtered by error type. Our work showed that
an SMT-based GEC system can produce state-of-the-art performance on the task
and candidate re-ranking can further improve it.
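The character-level Levenshtein distance feature itself is the standard dynamic-programming edit distance between the source and target phrase; a minimal implementation is sketched below. How the value is plugged into the phrase table as a feature is not shown here.

# Character-level Levenshtein distance between a source and a target phrase.
def levenshtein(source, target):
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # delete all remaining source characters
    for j in range(n + 1):
        dist[0][j] = j          # insert all remaining target characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(levenshtein("a jean", "jeans"))   # -> 3 (delete "a ", insert "s")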
The SMT framework was also adopted by Junczys-Dowmunt and Grundkiewicz
(2014), who ranked third out of the 13 participating teams. Following the work
of Mizumoto et al. (2012), they constructed a training corpus of more than 3 mil-
lion pairs of parallel sentences from Lang-8. Since the Lang-8 data can be quite
noisy, they performed error selection by keeping errors that resembled mistakes in
a learner corpus and replacing others with their corresponding corrections. Apart
from the LM built from the target side of the training data, a 5-gram LM estimated
from the entire CommonCrawl data (approximately 440 billion tokens, see Buck
et al., 2014) was used during decoding. Similar to our character-level Levenshtein
distance feature, they introduced a word-based version. Feature weights were tuned
for F-score using the k-best Margin Infused Relaxed Algorithm (MIRA) (Cherry
and Foster, 2012) and Minimum Error Rate Training (MERT) (Och, 2003). Al-
though they concluded that parameter optimisation was essential, Kunchukuttan
et al. (2014) subsequently found that tuning for F-score to increase precision yielded
worse performance. Grundkiewicz and Junczys-Dowmunt (2014) later introduced
the WikEd Error Corpus, which consists of more than 12 million sentences extracted
from Wikipedia revision histories. A similar error selection process was performed
to only keep errors that resembled those made by ESL learners.
In a subsequent paper, Junczys-Dowmunt and Grundkiewicz (2016) introduced
additional features based on edit operation counts, as well as an operation sequence
model (Durrani et al., 2013) and a 9-gram LM based on word-classes produced
by word2vec (Mikolov et al., 2013). However, the integration of additional
models/features seemed to affect the underlying algorithm used in SMT. The authors also
observed erratic behaviour when optimising the new features and therefore proposed
partial solutions to task-specific parameter tuning. Finally, they reported new state-
of-the-art performance on the CoNLL-2014 shared task test set, with an F0.5 score
of 49.49%.
The ‘translation’ approach has also been used to perform automatic post-editing.
Simard et al. (2007) discussed the use of an SMT system to translate erroneous texts
produced by a Rule-Based Machine Translation (RBMT) system into better texts
in the same language. A phrase-based SMT system was used as an automatic post-
editing system and results showed that the SMT system was effective at correcting
repetitive errors made by the RBMT system.
Instead of translating an erroneous English sentence into a correct one directly,
an SMT system could be used as an auxiliary tool for producing ‘round-trip’ trans-
lations (Hermet and Désilets, 2009; Madnani et al., 2012). The idea of round-trip
SMT is to first translate an English sentence into a pivot foreign language, and then
translate the pivot foreign language sentence back into English. By comparing the
original English sentence and the round-trip translation, errors can be detected and
corrected. Hermet and Désilets (2009) focussed on sentences containing preposition
errors and generated a round-trip translation via French. They simply used the
24
25. round-trip translation as the ‘correction’ for the original sentence and their model
was able to correct 66.4% of errors. Madnani et al. (2012) used round-trip transla-
tions obtained from the Google Translate API4 via 8 different pivot languages for
an all-errors task. Their results showed that it is rarely the case that one pivot lan-
guage could offer a round-trip translation that corrected all errors in the sentence;
but that several pivot languages, if combined properly, could. An alignment algo-
rithm was designed to combine multiple round-trip translations generated from the
API into a lattice using TERp, an extension of the Translation Edit Rate evaluation
metric (Snover et al., 2009). The lattice was then used to extract whole-sentence
corrections. Their experiments yielded fairly reasonable results but left significant
room for improvement.
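A minimal sketch of the round-trip idea follows. The translate function is only a placeholder for whatever MT backend is available (the work cited above used the Google Translate API); its name and signature are assumptions made for this example.

# Round-trip 'correction' of an English sentence via pivot languages.
def translate(text, source_lang, target_lang):
    raise NotImplementedError("plug in an MT system or translation API here")

def round_trip_candidates(sentence, pivots=("fr", "de", "es")):
    # One candidate correction per pivot language.
    candidates = []
    for pivot in pivots:
        pivoted = translate(sentence, "en", pivot)
        back = translate(pivoted, pivot, "en")
        candidates.append((pivot, back))
    return candidates

# Madnani et al. (2012) go further: instead of trusting a single pivot, they merge
# the candidates from several pivots into a lattice and extract a correction from it.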
2.2.2 Candidate re-ranking
Despite the success of SMT-based GEC systems, one of the weaknesses is that SMT
features used in the framework might not perform well on the error correction task
given that SMT was not originally intended for GEC. Since the SMT features were
designed to capture translation regularities, they may fail to capture some correction
regularities. As a result, the correction produced by an SMT system is not always the
best. It thus seems necessary to add new features with respect to GEC for building
effective SMT-based GEC systems, although work in this direction is very limited.
Felice et al. (2014) and Junczys-Dowmunt and Grundkiewicz (2014) introduced
Levenshtein distance to their phrase-based SMT systems. Felice et al. (2014) further
used a web-based LM to re-rank the 10-best correction candidates produced by the
SMT system.
Re-ranking, by contrast, has been widely used in many NLP tasks such as
parsing, tagging and sentence boundary detection (Collins and Duffy, 2002; Collins
and Koo, 2005; Roark et al., 2006; Huang et al., 2007). Various machine learn-
ing algorithms have been adapted to these re-ranking tasks, including boosting,
perceptrons and SVMs. Over the last decade, re-ranking techniques, especially dis-
criminative re-ranking, have shown significant improvement in MT. For each source
sentence, rather than outputting the candidate with the highest probability directly,
an n-best list of candidate translations is collected from an SMT system and later
re-ranked using re-ranking algorithms.5 New global and local features that have
not been used during translation can then be easily added to the re-ranker, without
worrying about fine-grained smoothing issues in the SMT framework. Shen et al.
(2004) successfully applied discriminative re-ranking to MT and observed an im-
provement in BLEU over the original output of the SMT system. As phrase-based
SMT systems make little or no direct use of syntactic information, Och et al. (2004)
proposed to use syntactic features to re-rank the n-best list. A wide range of fea-
tures were systematically evaluated, including word-level features, shallow syntactic
features based on POS tags and chunks, and features from Treebank-based syn-
tactic analyses. However, these syntactic features only gave very small gains and
4 https://cloud.google.com/translate
5 Re-ranking algorithms are described in more detail in Chapter 4.
improvements were mostly due to the addition of translation probabilities from IBM
Models (Brown et al., 1993), a non-syntactic feature. Goh et al. (2010) employed an
online training algorithm for SVM-based structured prediction. Various global fea-
tures were investigated for SMT re-ranking, such as the decoder’s scores, source and
target sentences, alignments, POS tags, sentence type probabilities, posterior prob-
abilities and back translation features. Farzi and Faili (2015) proposed a re-ranking
system based on swarm algorithms, where a set of non-syntactic features that can be
easily computed from LMs, TMs, n-best lists of candidates and POS tags were used.
As candidate re-ranking seems potentially valuable for GEC, we propose an SVM
ranking model to improve SMT output, making it the first work to use discriminative
re-ranking for SMT-based GEC.
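The general shape of n-best re-ranking is sketched below: every candidate in the list is mapped to a feature vector and re-scored with a learned linear function. The features, weights and sentences are illustrative inventions; the SVM ranking model actually proposed in this thesis, and the feature sets it uses, are described in Chapter 4.

# Re-rank an n-best list of corrections with a linear scoring function.
def features(source, candidate, decoder_score):
    src, cand = source.split(), candidate.split()
    return {
        "decoder_score": decoder_score,
        "length_ratio": len(cand) / max(len(src), 1),
        "changed_words": sum(1 for a, b in zip(src, cand) if a != b),
    }

def rerank(source, nbest, weights):
    # nbest is a list of (candidate sentence, decoder score) pairs.
    def score(item):
        cand, dec = item
        return sum(weights.get(name, 0.0) * value
                   for name, value in features(source, cand, dec).items())
    return max(nbest, key=score)[0]

weights = {"decoder_score": 1.0, "length_ratio": 0.2, "changed_words": -0.01}
source = "it is not suitable to wear a jean ."
nbest = [(source, -4.3),
         ("it is not suitable to wear jeans .", -4.1)]
print(rerank(source, nbest, weights))   # -> "it is not suitable to wear jeans ."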
2.2.3 Neural machine translation
In the past few years, neural network techniques have found success in a wide range
of NLP tasks, such as language modelling (Mnih and Hinton, 2007; Mikolov and
Zweig, 2012), discriminative parsing (Collobert, 2011), sentiment analysis (Socher
et al., 2011; Glorot et al., 2011) and summarisation (Kågebäck et al., 2014). Thus,
it is not surprising that neural network models have also been applied to error
detection and correction. Sun et al. (2015), for example, employed a Convolutional
Neural Network (CNN) for article error correction. Instead of building machine
learning classifiers using pre-defined syntactic and/or semantic features, a CNN
model is trained from surrounding words with pre-trained word embeddings. Lee
et al. (2016) used a CNN to predict whether a sentence needs editing. Rei and
Yannakoudakis (2016) looked into various neural network sequence labelling models
for error detection in learner writing.
The tide of neural models has also spread to the field of MT. Unlike SMT, NMT
learns a single large neural network which inputs a source sentence and outputs a
translation. The use of NMT models has shown promising results for several MT
tasks (see Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014;
Bahdanau et al., 2015). Specifically, NMT systems ranked on par with phrase-based
SMT systems on a couple of language pairs in the 2015 Workshop on Statistical
Machine Translation (WMT) shared translation task (Bojar et al., 2015).6
NMT applies an encoder-decoder framework. An encoder first reads and encodes
an input sentence into a vector representation. A decoder then outputs a trans-
lation for the input sentence from the vector representation.7 Different network
architectures have been proposed for NMT. Kalchbrenner and Blunsom (2013) first
used a CNN to encode source sentences and a Recurrent Neural Network (RNN) to
generate target translations. A similar CNN encoder was then used by Meng et al.
(2015). Sutskever et al. (2014) and Cho et al. (2014) used RNNs for both encoding
and decoding. Sutskever et al. (2014) used a multilayer Long Short-Term Memory
(LSTM) to map a source sentence into a fixed-sized vector, and another LSTM to
6 The NMT system from Jean et al. (2015b) ranked 1st on the English-German translation task and 3rd on the Czech-English, English-Czech and German-English translation tasks (ties were allowed).
7 NMT algorithms are described in more detail in Chapter 5.
decode a target sentence from the vector. Cho et al. (2014) used two Gated Recur-
rent Unit (GRU) models, one as the encoder and another as the decoder. Bahdanau
et al. (2015) introduced an attention mechanism to NMT which helps the decoder
focus on the most relevant information in a source sentence when predicting target
words. Luong et al. (2015a) experimented with two attention mechanisms and com-
pared various alignment functions. Both Bahdanau et al. (2015) and Luong et al.
(2015a) have shown that attention-based models are better than non-attentional
ones in handling long sentences.
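As a concrete illustration of this framework, the sketch below wires up a minimal GRU encoder-decoder in PyTorch. The dimensions, vocabulary sizes and teacher-forced training setup are illustrative; the NMT models actually applied to GEC in this thesis are described in Chapter 5.

# Minimal GRU encoder-decoder (sequence-to-sequence) sketch in PyTorch.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the (possibly erroneous) source sentence into a vector.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode the corrected sentence conditioned on that vector
        # (teacher forcing: the gold target prefix is fed in during training).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)   # per-step scores over the target vocabulary

model = EncoderDecoder(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))   # batch of 2 source sentences, 7 tokens each
tgt = torch.randint(0, 1000, (2, 8))   # corresponding target prefixes, 8 tokens each
print(model(src, tgt).shape)           # torch.Size([2, 8, 1000])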
Towards the end of this thesis, we explore the potential of NMT for GEC, as we
believe that the distributed representation of words could help correct previously
unseen errors more effectively than SMT. To the best of our knowledge, this is the
first work to use the NMT framework to build end-to-end GEC systems.
2.3 Learner corpora
Unlike native corpora, learner corpora are collections of language data produced
by non-native speakers. Having such learner resources is advantageous for GEC
research: 1) it allows the investigation of real learner errors as well as the contexts
in which they occur; 2) it facilitates the development of statistical models for GEC;
for example, an SMT system requires millions of examples of correct and incorrect
usage to learn reliable correction mappings; and 3) it provides a way of evaluating
GEC system performance in a real world scenario. Recently, error-annotated learner
corpora have become more readily and publicly available. In this section, we describe
the learner corpora used in this thesis.
2.3.1 NUCLE
The National University of Singapore Corpus of Learner English (NUCLE) is an an-
notated corpus of learner text built by the National University of Singapore (NUS)
NLP Group in collaboration with the NUS Centre for English Language Commu-
nication (Dahlmeier et al., 2013). It consists of more than 1,400 essays written by
undergraduate students at NUS who are non-native English speakers. These es-
says were written in response to some prompts that cover a wide range of topics,
such as environmental pollution, healthcare, and technology innovation. Two of the
prompts used for data collection are shown below:
Prompt 1
“Public spending on the aged should be limited so that money can be diverted
to other areas of the country’s development.” Do you agree?
Prompt 2
Surveillance technology such as RFID (radio-frequency identification) should
not be used to track people (e.g. human implants and RFID tags on people or
products). Do you agree? Support your argument with concrete examples.
Error type | Prop. (%) | Example
Wrong collocation/idiom/preposition (Wcip) | 15.7 | Singapore has invested heavily *on/in the establishment of Biopolis.
Local redundancies (Rloc) | 13.7 | Abortion is available to end a life only *because of/because the fetus or embryo has the wrong sex.
Article or determiner (ArtOrDet) | 12.9 | Sex selection technology should not be used in *non-medical/a non-medical situation.
Noun number (Nn) | 8.5 | Sex selection should therefore be used for medical *reason/reasons.
Mechanics (Mec) | 7.1 | The *affect/effect of that policy has yet to be felt.
Verb tense (Vt) | 7.1 | A university *had conducted/conducted the survey last year.
Word form (Wform) | 4.8 | Sex-selection may also result in *addition/additional stress for the family.
Subject-verb agreement (SVA) | 3.4 | The boy *play/plays soccer.
Verb form (Vform) | 3.0 | Will the child blame the parents after he *growing/grows up?

Table 2.1: Proportion of the most common error types in the NUCLE corpus. Grammatical errors in the examples are shown in the form *incorrect word/corrected word.
The corpus contains over one million words which were manually annotated by
professional English instructors at NUS using a tag set of 27 error categories (see
Appendix A), resulting in a total of 46,597 error annotations. The statistics
of NUCLE show that 57.6% of all sentences have no errors, 20.5% have exactly one
error, 10.7% have exactly two errors, and 11.2% of all sentences have more than
two errors. The highest observed number of error annotations in a single sentence
is 28. The top nine error types in the NUCLE corpus are presented in Table 2.1.
Although wrong word choice (Wcip) and redundancy errors (Rloc) are ranked at the
top, Dahlmeier et al. (2013) reported that most Wcip errors are preposition errors,
and a large percentage of Rloc errors involve articles that should be deleted. This
confirms that articles and prepositions are two of the most common errors in ESL
text. It also shows inconsistency in NUCLE labelling.8
2.3.2 CLC
The Cambridge Learner Corpus (CLC) is the world’s largest learner corpus, devel-
oped by Cambridge University Press and Cambridge English Language Assessment
since 1993. It is a 52.5 million word collection of exam scripts written by learners of
English who took Cambridge English examinations around the world. Currently, it
comprises over 200,000 exam scripts produced by learners at various levels speaking
8 The NUCLE corpus was later revised for the CoNLL shared tasks to separate prepositions from Wcip and articles from Rloc (amongst other things) - see Section 2.5.2.
148 different L1s living in 217 different countries or territories. A subset of the
corpus (a 25.5 million word collection) has been manually error coded by linguists
using an error-coding system with a taxonomy of approximately 80 error types de-
vised specifically for the CLC (Nicholls, 2003). The majority of the error codes used
in the CLC are two-letter codes with the first letter representing the general type
of error (e.g. spelling, word form) and the second one representing the word class
(i.e. POS) of the required word (see Appendix B). The coding format is explained
with the following examples:
Example 2.2. I am so <NS type="RJ"><i>exciting</i><c>excited</c></NS> that I have won the first prize.

Example 2.3. I like playing in <NS type="MD"><c>a</c></NS> team and deciding quickly what to do next.

Error information is provided inside the <NS> tag, where the error type is also
specified. Inside the <NS> tag, the original erroneous part is marked by the <i>
tag and its corrected version is marked by the <c> tag. In Example 2.2, "RJ"
stands for Replace adJective, where 'exciting' should be corrected to 'excited'. In
Example 2.3, "MD" stands for Missing Determiner, where the word 'a' should be
added. Other error codes include Form, Unnecessary, Spelling and Derivation for the
first letter; and Noun, Verb, preposiTion, Punctuation for the second letter. More
detailed information about the error-coding scheme can be found in Nicholls (2003).
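To make the coding format concrete, the sketch below extracts error annotations from a sentence marked up in this way. It is an illustrative snippet rather than any official CLC processing tool, and it assumes the <NS>/<i>/<c> conventions shown above; nested annotations, which do occur in practice, are not handled.

```python
import re

# Minimal sketch: extract (error_type, incorrect, correction) triples from a
# CLC/FCE-style annotated sentence, assuming the <NS type="..."><i>...</i><c>...</c></NS>
# convention illustrated above. Nested annotations are not handled.
NS_PATTERN = re.compile(
    r'<NS type="(?P<type>[^"]+)">'
    r'(?:<i>(?P<incorrect>.*?)</i>)?'
    r'(?:<c>(?P<correct>.*?)</c>)?'
    r'</NS>'
)

def extract_edits(annotated_sentence):
    """Return a list of (error_type, incorrect_text, corrected_text) tuples."""
    edits = []
    for m in NS_PATTERN.finditer(annotated_sentence):
        edits.append((m.group("type"), m.group("incorrect") or "", m.group("correct") or ""))
    return edits

example = ('I am so <NS type="RJ"><i>exciting</i><c>excited</c></NS> '
           'that I have won the first prize.')
print(extract_edits(example))  # [('RJ', 'exciting', 'excited')]
```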
The top nine error types in the error-coded CLC are presented in Table 2.2,
with spelling errors excluded. The most frequent error type in the CLC is choosing
an inappropriate open class word (noun, verb, adjective or adverb), followed by
prepositions and determiners.9 A similar error distribution was observed in the
NUCLE corpus - see Table 2.1.
Each examination script in the CLC contains meta-data about the learner, such
as L1, nationality, age, sex and level of English, as well as the examination. There
are three examination suites in the CLC (Williams, 2008):
• main suite (general purpose qualification):
Certificate of Proficiency in English, Certificate of Advanced English, First
Certificate in English (FCE), Preliminary English Test, and Key English Test;
• Business English Certificates (focuses on the language of business);
• International English Language Testing System (IELTS) (general and aca-
demic modules).
Two subsets of the CLC used in this thesis are described in detail: FCE and
IELTS examination scripts.
9 The determiner errors include both determiner and pre-determiner errors, not just the articles a/an and the.
Error type                   Prop. (%)   Example
Content word choice error    19.9        We need to deliver the merchandise on a daily *base/basis.
Preposition error            13.4        Our society is developing *in/at high speed.
Determiner error             11.7        We must try our best to avoid *the/a shortage of fresh water.
Comma error                   9.3        However */, I'll meet you later.
Inflectional morphology       7.4        The women *weared/wore long dresses.
Wrong verb tense              6.7        I look forward to *see/seeing you.
Derivational morphology       4.9        It has already been *arrangement/arranged.
Pronoun                       4.2        I want to make *me/myself fit.
Agreement error               4.0        I *were/was in my house.

Table 2.2: Proportion of the most common error types in the CLC. Grammatical errors in the examples are printed in italics in the form *incorrect word/corrected word.
2.3.2.1 FCE examination scripts
The FCE dataset was released into the public domain in 2011 by Yannakoudakis
et al. (2011). It is a set of 1,244 scripts written by learners of English who took
the FCE examination between 2000 and 2001, which assesses English at an upper-
intermediate level. The FCE dataset contains about half a million words and more
than 50k errors. Each exam script contains two essays whose length varies between
120 and 180 words. Essays were written in response to tasks requiring a learner to
write a letter, a report, an article, a composition or a short story. A typical prompt
is shown below:
Your teacher has asked you to write a story for the school’s English language
magazine. The story must begin with the following words: “Unfortunately, Pat
wasn’t very good at keeping secrets”.
The anonymised scripts are annotated using XML and linked to meta-data in-
cluding the question prompts and information about candidates.
2.3.2.2 IELTS examination scripts
The IELTS dataset is another subcorpus of the CLC that comprises exam scripts
written by ESL learners taking the IELTS examination. It consists of 851 scripts
from 2008 and 100 scripts from 2010. As in the FCE dataset, each exam script in
the IELTS dataset consists of two essays in response to two tasks. The first task asks
a learner to write a descriptive report on the information provided in a diagram,
table or short piece of text, or write a short letter in response to a situation or
problem with a minimum of 150 words. The second task asks a learner to use at
least 250 words to present an argument or discuss a problem.
2.4 Evaluation metrics
For system development, it is necessary to have internal system evaluation. Auto-
matic evaluation metrics allow fast and inexpensive feedback. When evaluating a
GEC system, the system’s output is compared to gold-standard references provided
by human experts. There is an on-going discussion on how to best evaluate GEC
systems and several metrics have been proposed and used (Dale and Kilgarriff, 2011;
Dale et al., 2012; Papineni et al., 2002; Dahlmeier and Ng, 2012b; Felice and Briscoe,
2015; Bryant and Ng, 2015; Napoles et al., 2015; Grundkiewicz et al., 2015; Sakaguchi
et al., 2016). In this section, we present four evaluation metrics used in this thesis.
2.4.1 BLEU
BLEU was first proposed by Papineni et al. (2002) and is now used as the dominant
method for automatic MT evaluation. It estimates the quality of the text produced
by MT systems: the closer the output is to human translations, the better. BLEU has
been shown to correlate well with human judgments at the corpus level. It uses a
modified n-gram precision (p_n) to compare a candidate against multiple references:

p_n = \frac{\sum_{\text{n-gram} \in C} \mathrm{count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in C} \mathrm{count}(\text{n-gram})}    (2.1)

where C is a candidate sentence. The count of each n-gram in C is clipped by
its maximum reference count observed in any single reference for that n-gram:

\mathrm{count}_{\mathrm{clip}} = \min(\mathrm{count}, \mathrm{max\_ref\_count})    (2.2)

BLEU is then defined as:

\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)    (2.3)

where N is the order of the highest n-gram to be considered (usually N = 4) and w_n
stands for uniformly distributed weights: w_n = 1/N. BP is a brevity penalty which is
used to prevent very short candidates from receiving very high scores:

BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}    (2.4)

where c and r are the lengths of the system's candidate and gold-standard reference respectively.
BLEU was used by Mizumoto et al. (2011, 2012) to evaluate SMT-based GEC
systems. Unlike metrics which rely on references with explicitly labelled error anno-
tations, BLEU only requires corrected references. On the one hand, it can be used
as a generic evaluation method independent of the annotation scheme, but on the
other hand, it fails to provide detailed error type feedback for GEC. Since both the
original and corrected sentences are in the same language (i.e. English) and most
words in the sentence do not need changing, BLEU scores for GEC systems are
relatively high compared with standard MT tasks. However, it is not enough to just
compare BLEU scores from different GEC systems; it is also necessary to compare
them with that of the original input. If the system's output yields a higher BLEU
score than the original input, it is assumed that the system improves the quality of
the original input by making some corrections. In addition, BLEU allows multiple
references, which is useful for errors with multiple alternative corrections.
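As an illustration of Equations 2.1-2.4, the following is a minimal sentence-level BLEU sketch in Python. It is not the standard corpus-level implementation (which sums clipped counts over all sentences before taking the ratio), and the shortest-reference brevity penalty is a simplification, but it shows the sanity check discussed above: a corrected candidate should score higher than the unchanged source.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision (Equations 2.1-2.2) for one candidate sentence."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and brevity penalty (Equations 2.3-2.4)."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:        # log(0) is undefined; return 0 in that case
        return 0.0
    c = len(candidate)
    r = min(len(ref) for ref in references)   # simplification: shortest reference length
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

source    = "He go to school yesterday .".split()
candidate = "He went to school yesterday .".split()
reference = ["He went to school yesterday .".split()]
# A useful sanity check for GEC: the corrected output should score higher than the input.
print(bleu(candidate, reference), bleu(source, reference))
```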
2.4.2 M2 scorer
The M2 scorer, proposed by Dahlmeier and Ng (2012b), is used to evaluate sys-
tem performance by how well its proposed corrections or edits match the gold-
standard edits. It computes the sequence of phrase-level edits between a source
sentence and a system’s candidate that achieves the highest overlap with the gold-
standard annotation. A parameter µ is used to limit the number of unchanged
words (max unchanged words) so that edits including too many words are avoided.
Evaluation is performed by computing precision (P), recall (R) and F-score (van
Rijsbergen, 1979):
P = \frac{\sum_{i=1}^{n} |e_i \cap g_i|}{\sum_{i=1}^{n} |e_i|}    (2.5)

R = \frac{\sum_{i=1}^{n} |e_i \cap g_i|}{\sum_{i=1}^{n} |g_i|}    (2.6)

F_\beta = (1 + \beta^2) \times \frac{P \times R}{(\beta^2 \times P) + R}    (2.7)

where e_i = {e_1, e_2, ..., e_n} is the system's candidate edit set and g_i = {g_1, g_2, ..., g_n}
is the gold-standard edit set for sentence i. The intersection between e_i and g_i is defined as:

e_i \cap g_i = \{e \in e_i \mid \exists g \in g_i : \mathrm{match}(e, g)\}    (2.8)
Two of the commonly used F-scores are F_1, which weights P and R evenly, and
F_{0.5}, which emphasises P twice as much as R:

F_1 = 2 \times \frac{P \times R}{P + R}    (2.9)

F_{0.5} = (1 + 0.5^2) \times \frac{P \times R}{(0.5^2 \times P) + R}    (2.10)
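The following sketch computes P, R and F0.5 from already extracted edit sets (Equations 2.5-2.7 and 2.10). It is not the M2 scorer itself, which additionally searches for the system edit sequence that maximises overlap with the gold-standard annotation; the edit representation used here is purely illustrative.

```python
def prf(system_edits, gold_edits, beta=0.5):
    """P, R and F_beta (Equations 2.5-2.7, 2.10) over per-sentence edit sets.

    Each element of system_edits / gold_edits is the set of edits for one sentence,
    represented here as (start, end, correction) tuples. This only scores already
    extracted edits; the real M2 scorer additionally aligns source and candidate to
    find the edit sequence with maximum overlap with the gold annotation.
    """
    tp = sum(len(e & g) for e, g in zip(system_edits, gold_edits))
    proposed = sum(len(e) for e in system_edits)
    gold = sum(len(g) for g in gold_edits)
    p = tp / proposed if proposed else 1.0
    r = tp / gold if gold else 1.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) else 0.0
    return p, r, f

# One sentence: the system proposes two edits, one of which matches the single gold edit.
system = [{(1, 2, "went"), (5, 6, "a")}]
gold   = [{(1, 2, "went")}]
print(prf(system, gold))  # P = 0.5, R = 1.0, F0.5 ~ 0.56
```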
The M2 scorer was the official scorer in the CoNLL 2013 and 2014 shared tasks
on grammatical error correction, where F1 was used in CoNLL-2013 and F0.5 was
used in CoNLL-2014 (see Section 2.5.2). When building GEC systems, minimising
the number of unnecessary corrections is often regarded as more important than
covering a large number of errors: users are willing to sacrifice coverage as long as
the system provides accurate corrections. In other words, high P is often
preferred over high R. There is also a strong educational motivation: flagging
correct text as incorrect would cause confusion among learners. This is why F0.5
has recently been preferred when reporting system performance.

Source   Candidate   Reference   Classification
a        a           a           TN
a        a           b           FN
a        a           -           FN
a        b           a           FP
a        b           b           TP
a        b           c           FP, FN, FPN
a        b           -           FP, FN, FPN
a        -           a           FP
a        -           b           FP, FN, FPN
a        -           -           TP
-        a           a           TP
-        a           b           FP, FN, FPN
-        a           -           FP
-        -           a           FN

Table 2.3: The extended writer-annotator-system evaluation scheme proposed by Felice and Briscoe (2015).
However, evaluation methods based on P, R and F-score (e.g. the M2 scorer)
do not provide an indicator of improvement on the original text so there is no way
to compare GEC systems against a ‘do-nothing’ baseline that keeps the input text
unchanged. A ‘do-nothing’ baseline will always yield an F-score of 0 by definition,
and an increase in P, R or F-score does not necessarily mean a reduction in the
actual error rate.
2.4.3 I-measure
The I-measure was designed by Felice and Briscoe (2015) to address problems with
previous evaluation methods and to evaluate real improvement on the original sen-
tence after corrections.
System performance is first evaluated in terms of weighted accuracy (WAcc),
based on a token-level alignment between a source sentence, a system’s candidate,
and a gold-standard reference. An extended version of the writer-annotator-system
evaluation scheme (Chodorow et al., 2012) was adopted where each token alignment
is classified as a true positive (TP), true negative (TN), false positive (FP), false
negative (FN), or both an FP and FN (FPN) - see Table 2.3. WAcc is defined as:
\mathrm{WAcc} = \frac{w \cdot TP + TN}{w \cdot (TP + FP) + TN + FN - (w + 1) \cdot \frac{FPN}{2}}    (2.11)
where w > 1 is a weight factor.
An Improvement or I-measure (I) score is computed by comparing system performance
(WAcc_sys) with that of a baseline that leaves the original text unchanged (WAcc_base):

I = \begin{cases} \lfloor \mathrm{WAcc}_{sys} \rfloor & \text{if } \mathrm{WAcc}_{sys} = \mathrm{WAcc}_{base} \\ \dfrac{\mathrm{WAcc}_{sys} - \mathrm{WAcc}_{base}}{1 - \mathrm{WAcc}_{base}} & \text{if } \mathrm{WAcc}_{sys} > \mathrm{WAcc}_{base} \\ \dfrac{\mathrm{WAcc}_{sys}}{\mathrm{WAcc}_{base}} - 1 & \text{otherwise} \end{cases}    (2.12)
Values of I lie in the [-1, 1] interval (the I score is often expressed as a percentage). Positive values indicate improvement,
while negative values indicate degradation. A score of 0 indicates no improvement
(i.e. baseline performance), 1 indicates 100% correct text and -1 indicates 100%
incorrect text.
As multiple annotations are taken into account, the I-measure is computed af-
ter maximising WAccsys at the sentence level, so as to ensure all the evaluated
hypotheses are paired with their highest scoring references. Trying to maximise I
score directly can yield suboptimal results, as different combinations of WAccbase
and WAccsys can produce the same final result (but the one with higher WAccsys is
clearly preferred).
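A small sketch of Equations 2.11 and 2.12 is given below. The classification counts are hypothetical, and FPN cases are assumed to have been counted in both the FP and FN totals, as in the original scheme; the sketch does not perform the token-level alignment or the per-sentence WAcc maximisation described above.

```python
def weighted_accuracy(tp, tn, fp, fn, fpn, w=2.0):
    """WAcc (Equation 2.11); w > 1 gives extra weight to corrections.
    fpn cases are assumed to be already included in both the fp and fn totals."""
    return (w * tp + tn) / (w * (tp + fp) + tn + fn - (w + 1) * fpn / 2)

def i_measure(wacc_sys, wacc_base):
    """Improvement score I (Equation 2.12), in the [-1, 1] interval."""
    if wacc_sys == wacc_base:
        return float(int(wacc_sys))        # floor: 0 unless the text is already perfect
    if wacc_sys > wacc_base:
        return (wacc_sys - wacc_base) / (1 - wacc_base)
    return wacc_sys / wacc_base - 1

# Hypothetical counts for a system and for the 'do-nothing' baseline on the same text.
wacc_sys = weighted_accuracy(tp=30, tn=900, fp=10, fn=20, fpn=5)
wacc_base = weighted_accuracy(tp=0, tn=900, fp=0, fn=55, fpn=0)
print(wacc_sys, wacc_base, i_measure(wacc_sys, wacc_base))
```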
2.4.4 GLEU
Generalized Language Evaluation Understanding (GLEU), proposed by Napoles
et al. (2015), is a simple variant of BLEU for GEC which takes the original source
into account. GLEU modifies the n-gram precision in BLEU to assign extra weight
to n-grams present in the candidate that overlap with the reference but not the
source (R\S), and penalise those in the candidate that are in the source but not the
reference (S\R). For a correction candidate C with a corresponding source S and
reference R, the modified n-gram precision (p'_n) for GLEU(C, R, S) is defined as:

p'_n = \frac{\sum_{\text{n-gram} \in C} \left( \mathrm{count}_{R \setminus S}(\text{n-gram}) - \lambda \cdot \mathrm{count}_{S \setminus R}(\text{n-gram}) + \mathrm{count}_{R}(\text{n-gram}) \right)}{\sum_{\text{n-gram} \in C} \mathrm{count}_{S}(\text{n-gram}) + \sum_{\text{n-gram} \in R \setminus S} \mathrm{count}_{R \setminus S}(\text{n-gram})}    (2.13)

where the weight λ determines by how much incorrectly changed n-grams are
penalised. Given a bag of n-grams B, the counts in Equation 2.13 are collected as:

\mathrm{count}_{B}(\text{n-gram}) = \sum_{\text{n-gram}' \in B} d(\text{n-gram}, \text{n-gram}')    (2.14)

d(\text{n-gram}, \text{n-gram}') = \begin{cases} 1 & \text{if n-gram} = \text{n-gram}' \\ 0 & \text{otherwise} \end{cases}    (2.15)

By updating the n-gram precision in Equation 2.3, GLEU is defined as:11

\mathrm{GLEU}(C, R, S) = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p'_n\right)    (2.16)
Similar to BLEU, GLEU works with multiple references at the corpus level. It
can be used as a generic evaluation method independent of the annotation scheme,
but fails to provide detailed system performance for each error type.
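The sketch below implements one reading of the modified precision in Equation 2.13 for a single sentence and a single n-gram order; it sums over n-gram types in the candidate and treats R\S and S\R as multiset differences, which are interpretation choices rather than details taken from the original paper. Reported GLEU scores should be computed with an existing implementation of the full corpus-level metric.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def gleu_precision(candidate, source, reference, n, lam=1.0):
    """One reading of p'_n (Equation 2.13) for a single sentence and order n."""
    c = ngram_counts(candidate, n)
    s = ngram_counts(source, n)
    r = ngram_counts(reference, n)
    r_minus_s = r - s    # n-grams introduced by the reference (rewarded)
    s_minus_r = s - r    # source n-grams the reference removed (penalised if kept)
    # Summing over n-gram types in the candidate (a simplification of the sum over C).
    numerator = sum(r_minus_s[g] - lam * s_minus_r[g] + r[g] for g in c)
    denominator = sum(s[g] for g in c) + sum(r_minus_s.values())
    return max(numerator, 0.0) / denominator if denominator else 0.0

src = "He go to school .".split()
hyp = "He went to school .".split()
ref = "He went to school .".split()
# Note: this quantity is not bounded by 1 under the 2015 formulation.
print(gleu_precision(hyp, src, ref, n=2))
```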
2.5 Shared tasks on grammatical error correction
In the last few years, four GEC shared tasks have provided a forum for participat-
ing teams to compare results on common training and test data. Participants are
provided with a fully annotated training set and encouraged to use any publicly
available data and tools to build their GEC systems in a few months’ time. After
that, new blind test data is used to evaluate system performance for the participat-
ing teams. Systems are expected to detect grammatical errors in text written by
non-native speakers and return corrected versions within a few days after the release
of the test data. The organisers then evaluate each system’s output and release the
final rankings.
2.5.1 HOO 2011 and 2012
The first two shared tasks - Helping Our Own (HOO) 2011 and 2012 - aimed
to promote the use of NLP tools and techniques for the development of automated
systems that could provide writing assistance to non-native authors in the NLP
community (Dale and Kilgarriff, 2011; Dale et al., 2012). In the HOO-2011 shared
task, participants were provided with a set of documents extracted from the ACL
Anthology12 written by non-native authors. The task was to automatically detect
and correct all errors present in text. Errors were classified into 13 error types based
on the CLC coding system (Nicholls, 2003). Six teams participated in the task, with
some achieving top performance by focussing only on a limited number of error types.
Given the difficulty of HOO-2011, the HOO-2012 shared task focussed only on
article and preposition errors. The FCE dataset was provided as the official training
set. The number of participating teams increased to 14 and most participants built
machine learning classifiers. Evaluation in both HOO shared tasks was performed
by computing P, R and F-score between a system’s edit set and a manually created
gold-standard edit set.
11 We note that a new version of GLEU appeared this year (Napoles et al., 2016b). However, scores reported in this thesis were computed using the original GLEU (Napoles et al., 2015) described in this section.
12 A digital archive of research papers in computational linguistics: https://aclweb.org/
2.5.2 CoNLL 2013 and 2014
The next two shared tasks took place in conjunction with the Conference on Com-
putational Natural Language Learning (CoNLL). The CoNLL-2013 shared task (Ng
et al., 2013) expanded the scope of HOO-2012 to include three new error types: noun
number (Nn), verb form (Vform) and subject-verb agreement (SVA). Together with
article (ArtOrDet) and preposition (Prep) errors, this new error list is more compre-
hensive and also introduces interacting errors. NUCLE v2.3 was used as in-domain
training data. The test data consists of 50 new essays, which were written in re-
sponse to two prompts. One prompt was also used in the training set, while the
other was new.
Systems were evaluated using the M2 scorer with the max unchanged words pa-
rameter µ set to 3 as suggested by the organisers (limiting the maximum unchanged
words to three per edit). Rankings were based on F1 , weighting P and R equally.
Initially, there was only one set of gold annotations, but since there are often multi-
ple valid corrections for some errors, participating teams were subsequently allowed
to propose alternative answers (gold-standard edits). This practice was adopted
from the HOO 2011 and 2012 shared tasks. Therefore, there were two rounds of
evaluation, the second of which allowed alternative answers. As noted by Ng et al.
(2013), these new scores tended to be biased towards the teams which submitted
alternative answers. Consequently, to reduce bias, they suggested future evaluation
be carried out without alternative answers. In the end, 17 teams participated in
CoNLL-2013. Among these teams, a common approach was to build classifiers for
each error type. Other approaches included LM, MT and heuristic rules.
The CoNLL-2014 shared task (Ng et al., 2014) tried to once again push the
boundaries of GEC by returning to an all-errors correction task. In particular, there
were three major changes compared with CoNLL-2013: 1) participating systems
were expected to correct grammatical errors of all types; 2) two human annotators
annotated the test essays independently; and 3) the evaluation metric was changed
from F1 to F0.5 , to prioritise P over R. A newer version of the NUCLE corpus -
NUCLE v3.0 - was used as official training data. Additionally, a new set of 50
essays written by non-native English speakers was used as blind test data. The
CoNLL-2013 test set could be freely used for training and/or development. The M2
scorer was again used as the official scorer.
In total, 13 teams submitted output to CoNLL-2014. Most of them built hybrid
systems that combined different approaches. For non-specific error type correction,
LM and MT approaches were used; whereas for single error types, rule-based ap-
proaches and machine learning classifiers were preferred. We built a phrase-based
SMT system (see Chapter 3) which contributed to our final hybrid system submit-
ted to the shared task. Our system achieved the best F0.5 and R on the original annotations.
13 In NUCLE v2.3, 17 essays were removed from the first release of NUCLE, and the error types Wcip and Rloc were mapped to Prep, Wci, ArtOrDet, and Rloc- using POS tags.
2.6 Use of datasets and evaluation metrics in this thesis
In the rest of this thesis, different learner corpora and evaluation metrics are used
in different chapters. This is because, according to Chodorow et al. (2012), there
is no single best evaluation metric and the usefulness of a metric depends on the
application and research goals.
The work in Chapter 3 is presented in the context of the CoNLL-2014 shared
task. The NUCLE corpus, as provided by the shared task organisers, is used as
in-domain training data, and results are reported on the CoNLL-2014 development
set (i.e. the CoNLL-2013 test set) and test set. F0.5 as calculated by the M2 scorer
is used as the evaluation measure. Parallel sentences extracted from the publicly
available FCE dataset and the IELTS dataset are used as additional training data.
In Chapters 4 and 5, the publicly available FCE dataset is used and results
are reported on the FCE test set. The reasons for using the FCE dataset rather
than NUCLE are manifold. Firstly, the FCE dataset, as a subcorpus of the CLC,
covers a wide variety of L1s and was used in the HOO-2012 error correction shared
task. Compared with the NUCLE corpus used in the CoNLL 2013 and 2014 shared
tasks, which only contains essays written by undergraduate students at NUS, the
FCE dataset is a more representative test set of learner writing. Secondly, the
error annotations in the NUCLE corpus are sometimes unreliable and inconsistent.
This may introduce noise into final GEC systems and result in underestimations of
performance. Thirdly, as described in Section 2.3.2, the CLC is the world’s largest
learner corpus, and the FCE dataset was annotated using the same annotation
scheme as the CLC. In order to make better use of the CLC and avoid any problems
caused by the inconsistency between different annotation schemes, we use the FCE
dataset in our experiments. Results reported on the publicly available FCE dataset
can be used for cross-system comparisons.
As discussed in Section 2.4, evaluation methods based on P, R and F-score
(e.g. the M2 scorer) do not provide an indicator of improvement on the original text.
Given this, and the fact that an increase in P, R or F-score does not necessarily mean
a reduction in the actual error rate, we opt to use the I-measure and thus gain a
better insight into system performance in terms of improvement on the original text.
We also apply our CLC-trained systems to the CoNLL-2014 shared task devel-
opment and test sets directly, without using or optimising for NUCLE. Results are
reported using BLEU, GLEU, F0.5 (calculated by the M2 scorer) and I-measure for
broader cross-system comparisons.
CHAPTER 3
Building a preliminary
SMT-based GEC system
In this chapter, we investigate whether SMT can form the basis of a competitive
all-errors GEC system and describe how to build an end-to-end system using the
SMT framework. Our SMT method is evaluated in the context of the CoNLL-
2014 all-errors correction shared task (see Section 2.5.2) and the results presented
in Section 3.5 were previously published in Felice et al. (2014). Additionally, the
work on artificial data described in Section 3.3.4.2 was further developed in Felice
and Yuan (2014b).
3.1 Statistical machine translation
As one of the first applications envisaged for computers, MT is the process of trans-
lating a sentence from one language into another automatically. Several approaches
have been proposed, such as word-for-word translation, interlingual approaches (Mi-
tamura et al., 1991), example-based MT (Nagao, 1984; Sato and Nagao, 1990) and
SMT. In the last two decades, SMT has become the dominant approach both in
the research community and in the commercial sector. The underlying idea is that
language is so complex and productive that it could never be fully analysed and
distilled into a set of rules. Instead, a machine should try to learn translation map-
pings automatically from large parallel corpora by pairing the input and output of
the translation process and learning from the statistics over the data, thus removing
the need for linguists or language experts. SMT stands out for its low cost and rapid
prototyping, while also producing state-of-the-art results for many MT tasks.
GEC can be considered a special case of MT where the task is to translate ‘bad’
English sentences into ‘good’ ones. The SMT approach to GEC has several advan-
tages. First, an SMT system can deal with multiple error types at the same time
without having to (a) classify errors into different types, (b) decide on their bound-
aries, or (c) combine the results of multiple classifiers, which is often the case in
traditional machine learning approaches. Interacting errors are expected to be cor-
rected as well since SMT systems work at a global rather than local level (compared
with classifiers) and corrections are performed jointly for the entire sentence. As a
result, the SMT approach seems more suitable for an all-errors task. Second, it does
not need expert knowledge but only requires a parallel corpus of grammatically in-
correct sentences as the source and their corrected versions as the target. Following
the principle that ‘more data is better data’ (Church and Mercer, 1993), previous
studies in SMT have shown that the larger the training parallel corpus, the more
reliable the translation mappings that can be learnt and the better the translation
performance that can be achieved (Koehn et al., 2003; Suresh, 2010; Axelrod et al.,
2011). Last but not least, there are well-developed tools for SMT which can be
easily adapted for GEC, so we can benefit from state-of-the-art SMT techniques.

Figure 3.1: A noisy channel model. (A source sentence C, with prior probability P(C), passes through a noisy channel P(E|C) to produce the observed sentence E, from which the receiver recovers Ĉ.)
An SMT-based GEC system can be modelled as a machine translator. Here, the
input (source) is an erroneous English sentence E = e1 e2 ... em , and the output
(target) is a corrected sentence C = c1 c2 ... cl . The erroneous sentence E produced
by ESL learners can be regarded as having passed through a noisy channel (Shannon,
1948) and is hence corrupted by noise, i.e. interference from learners’ L1s - see
Figure 3.1. The goal is to recover the sentence C based on the corrupt sentence E:
\hat{C} = \arg\max_{C} P(C|E) = \arg\max_{C} \frac{P(E|C)P(C)}{P(E)} = \arg\max_{C} P(E|C)P(C)    (3.1)
where P (E) in the denominator is ignored since it is a constant across all Cs.
The other three parameters that need to be considered are:
• LM:
estimates the corrected sentence probability P (C) from a target language cor-
pus;
• TM:
estimates the translation (i.e. correction) probability P (E|C) from a parallel
corpus;
• decoder:
searches for the target sentence C that maximises the product of P (C) and
P (E|C).
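A toy sketch of the decision rule in Equation 3.1 is shown below. The candidate corrections and their probabilities are made-up numbers for illustration only; a real SMT decoder searches a vast space of candidates using the TM and LM scores described above.

```python
import math

# Toy illustration of the noisy-channel decision rule in Equation 3.1:
# choose the correction C maximising P(E|C) * P(C), i.e. the sum of log-probabilities.
def correct(source, candidates, lm_logprob, tm_logprob):
    """Return the candidate C maximising log P(C) + log P(E|C)."""
    return max(candidates, key=lambda c: lm_logprob[c] + tm_logprob[(source, c)])

E = "He go to school yesterday ."
candidates = ["He go to school yesterday .", "He went to school yesterday ."]
lm_logprob = {candidates[0]: math.log(1e-9), candidates[1]: math.log(1e-7)}          # P(C)
tm_logprob = {(E, candidates[0]): math.log(0.6), (E, candidates[1]): math.log(0.3)}  # P(E|C)
print(correct(E, candidates, lm_logprob, tm_logprob))  # "He went ..." wins despite lower P(E|C)
```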
3.1.1 The language model
One essential component of any SMT system is the LM, which makes sure that the
final system outputs fluent English sentences. An LM is a function that takes an English
sentence as input and returns the probability that it is a valid English sentence.
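As a minimal illustration of this interface, the sketch below estimates an unsmoothed bigram LM from a two-sentence toy corpus and scores a sentence with it; real SMT systems use much larger corpora and smoothed n-gram models.

```python
import math
from collections import Counter

# A minimal sketch of such a function: an unsmoothed bigram LM estimated from a toy corpus.
corpus = [["<s>", "he", "went", "to", "school", "</s>"],
          ["<s>", "she", "went", "home", "</s>"]]
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
unigrams = Counter(w for s in corpus for w in s[:-1])

def sentence_logprob(tokens):
    """log P(sentence) under the bigram model; -inf for unseen bigrams (no smoothing)."""
    logp = 0.0
    for prev, word in zip(["<s>"] + tokens, tokens + ["</s>"]):
        if bigrams[(prev, word)] == 0:
            return float("-inf")
        logp += math.log(bigrams[(prev, word)] / unigrams[prev])
    return logp

print(sentence_logprob(["he", "went", "home"]))
```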