
Optimizing the impact of data augmentation for low-resource grammatical error correction

Zappatore, Marco (second author)
2023-01-01

Abstract

Grammatical Error Correction (GEC) refers to the automatic identification and correction of grammatical, spelling, punctuation, and word-order errors in monolingual texts. Neural Machine Translation (NMT) is currently one of the most effective techniques for GEC, but it may suffer from scarcity of training data and domain shift, depending on the language addressed. However, current techniques for tackling the data sparsity problem associated with NMT (e.g., tuning pre-trained language models or developing spelling-confusion methods without accounting for language diversity) create mismatched data distributions. This paper proposes new aggressive transformation approaches that augment data during training and extend the distribution of authentic data. In particular, augmented data are used as auxiliary tasks that provide new contexts when the target prefix is not helpful for next-word prediction. This enhances the encoder and steadily increases its contribution by forcing the GEC model to pay more attention to the encoder's text representations during decoding. The impact of these approaches was investigated with a Transformer-based model on the low-resource GEC task, using Arabic GEC as a case study. GEC models trained with our data attend more to source information, are more robust to domain shift, and hallucinate less when trained on tiny datasets or under domain shift. Experimental results show that the proposed approaches outperform the baseline, the most common data augmentation methods, and classical synthetic data approaches. In addition, a combination of the three best approaches (Misspelling, Swap, and Reverse) achieved the best F_1 score on two benchmarks and outperformed previous Arabic GEC approaches.
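To make the augmentation idea concrete, below is a minimal Python sketch of the three best-performing transformations named in the abstract (Misspelling, Swap, and Reverse), applied to a whitespace-tokenized source sentence during training. The function names, perturbation probabilities, and adjacent-character misspelling model are illustrative assumptions, not the authors' implementation.

    import random

    def misspell(tokens, p=0.2, seed=None):
        # Perturb random words by swapping two adjacent characters
        # (a simple way to simulate misspellings; an assumption,
        # not the paper's exact noise model).
        rng = random.Random(seed)
        out = []
        for tok in tokens:
            if len(tok) > 2 and rng.random() < p:
                i = rng.randrange(len(tok) - 1)
                tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
            out.append(tok)
        return out

    def swap(tokens, p=0.2, seed=None):
        # Randomly swap adjacent tokens to distort word order.
        rng = random.Random(seed)
        out = list(tokens)
        for i in range(len(out) - 1):
            if rng.random() < p:
                out[i], out[i + 1] = out[i + 1], out[i]
        return out

    def reverse(tokens):
        # Reverse the whole source sequence: an aggressive transformation
        # that makes the target prefix uninformative and so pushes the
        # decoder to rely on the encoder's representations.
        return list(reversed(tokens))

    src = "the model corrects grammatical errors".split()  # the paper's case study is Arabic
    print(misspell(src, seed=1))
    print(swap(src, seed=1))
    print(reverse(src))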
Files in this item:

File: 2023 - JKSU-CIS (final).pdf
Access: open access
Description: Article
Type: Publisher's version
License: Creative Commons
Size: 2.3 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11587/489427
Citations
  • Scopus: 10
  • Web of Science: 5