Simple Data Augmentation for Multilingual NLU in Task Oriented Dialogue Systems

Data augmentation has shown potential in alleviating data scarcity for Natural Language Understanding (e.g. slot filling and intent classification) in task-oriented dialogue systems. As prior work has mostly been evaluated on English datasets, we focus on five different languages and consider a setting where limited data are available. We investigate the effectiveness of non-gradient based augmentation methods, involving simple text span substitutions and syntactic manipulations. Our experiments show that (i) augmentation is effective in all cases, particularly for slot filling; and (ii) it is beneficial for a joint intent-slot model based on multilingual BERT, both in limited data settings and when the full training data is used.


Introduction
Natural Language Understanding (NLU) in task-oriented dialogue systems is responsible for parsing user utterances to extract the intent of the user and the arguments of the intent (i.e. slots) into a semantic representation, typically a semantic frame (Tur and De Mori, 2011). For example, the utterance "Play Jeff Pilson on Youtube" has the intent PLAYMUSIC and "Youtube" as value for the slot SERVICE. As more skills are added to the dialogue system, the NLU model frequently needs to be updated to scale to new domains and languages, which typically becomes problematic when labeled data are limited (data scarcity).
One way to combat data scarcity is through data augmentation (DA) techniques, which perform label-preserving operations to produce auxiliary training data. Recently, DA has shown potential in tasks such as machine translation (Fadaee et al., 2017), constituency and dependency parsing (Şahin and Steedman, 2018; Vania et al., 2019), and text classification (Wei and Zou, 2019; Kumar et al., 2020). As for slot filling (SF) and intent classification (IC), a number of DA methods have been proposed to generate synthetic utterances using sequence to sequence models (Hou et al., 2018; Zhao et al., 2019), a Conditional Variational Auto Encoder (Yoo et al., 2019), or pretrained NLG models. To date, most DA methods have been evaluated on English, and it is not clear whether the same findings apply to other languages.
In this paper, we study the effectiveness of DA on several non-English datasets for NLU in task-oriented dialogue systems. We experiment with existing lightweight, non-gradient based DA methods from Louvan and Magnini (2020) that produce varying slot values through substitution and sentence structure manipulation by leveraging syntactic information from a dependency parser. We evaluate the DA methods on NLU datasets from five languages: Italian, Hindi, Turkish, Spanish, and Thai. The contributions of our paper are as follows:
1. We assess the applicability of DA methods for NLU in task-oriented dialogue systems in five languages.
2. We demonstrate that simple DA can improve performance on all languages despite the different characteristics of the languages.
3. We show that a large pre-trained multilingual BERT (M-BERT) (Devlin et al., 2019) can still benefit from DA, in particular for slot filling.

The semantic frame conveys information, namely the user intent and the corresponding arguments of the intent. Extracting such information involves slot filling (SF) and intent classification (IC) tasks.
Given an input utterance of $n$ tokens, $x = (x_1, x_2, \ldots, x_n)$, the system needs to assign an intent $y^{intent}$ to the whole utterance $x$, and the slots that are mentioned in the utterance, $y^{slot} = (y^{slot}_1, y^{slot}_2, \ldots, y^{slot}_n)$. In practice, IC is typically modeled as text classification and SF as a sequence tagging problem. As an example, for the utterance "Play Jeff Pilson on Youtube", $y^{intent}$ is PLAYMUSIC, as the intent of the user is to ask the system to play a song from a musician, and $y^{slot}$ = (O, B-ARTIST, I-ARTIST, O, B-SERVICE), in which the artist is "Jeff Pilson" and the service is "Youtube". Slot labels are in BIO format: B indicates the start of a slot span, I the inside of a span, while O denotes that the word does not belong to any slot. Recent approaches for SF and IC are based on neural network methods that model SF and IC jointly (Goo et al., 2018; Chen et al., 2019) by sharing model parameters between the two tasks.
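To make the BIO encoding concrete, the following minimal Python sketch recovers slot-value pairs from a tagged utterance. The function name `bio_to_spans` is an illustrative assumption and is not taken from any released code.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (slot_label, slot_value) pairs from BIO tags (hypothetical helper)."""
    spans: List[Tuple[str, str]] = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span; close the previous one first.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continue the currently open span.
            words.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close any open span and treat the token as outside.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


tokens = "Play Jeff Pilson on Youtube".split()
tags = ["O", "B-ARTIST", "I-ARTIST", "O", "B-SERVICE"]
print(bio_to_spans(tokens, tags))
```

For the running example this yields the pairs (ARTIST, "Jeff Pilson") and (SERVICE, "Youtube").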

Data Augmentation (DA) Methods
DA aims to perform semantically preserving transformations on the training data D to produce auxiliary data D'. The union of D and D' is then used to train a particular NLU model. For each utterance in D, we produce N augmented utterances by applying a specific augmentation operation. We adopt a subset of the existing augmentation methods from Louvan and Magnini (2020), which have shown promising results on English datasets. We describe the augmentation operations in the following sections.
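The overall procedure can be sketched as a small driver that applies one operation N times per utterance and returns the original data together with the auxiliary set (written D' here). This is only a sketch under stated assumptions: the names `augment_dataset` and `lowercase_tokens`, the `(tokens, tags, intent)` tuple layout, and the toy lowercasing operation are all hypothetical and stand in for the actual operations described in the following sections.

```python
import random
from typing import Callable, List, Optional, Tuple

# Assumed layout of one training instance: (tokens, BIO tags, intent).
Utterance = Tuple[List[str], List[str], str]
Operation = Callable[[Utterance, random.Random], Optional[Utterance]]


def augment_dataset(data: List[Utterance], operation: Operation,
                    n: int, seed: int = 0) -> List[Utterance]:
    """Return the original data followed by up to n augmented copies per utterance."""
    rng = random.Random(seed)
    augmented: List[Utterance] = []
    for utterance in data:
        for _ in range(n):
            new = operation(utterance, rng)
            # Skip cases where the operation is not applicable or changes nothing.
            if new is not None and new != utterance:
                augmented.append(new)
    return data + augmented


def lowercase_tokens(utterance: Utterance, rng: random.Random) -> Utterance:
    """Toy label-preserving operation used only to exercise the driver."""
    tokens, tags, intent = utterance
    return [t.lower() for t in tokens], tags, intent


data = [("Play Jeff Pilson on Youtube".split(),
         ["O", "B-ARTIST", "I-ARTIST", "O", "B-SERVICE"], "PLAYMUSIC")]
train = augment_dataset(data, lowercase_tokens, n=2)
print(len(train))
```

Because an operation may be inapplicable to some utterances, the auxiliary set can contain fewer than N times the number of original utterances.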

Slot Substitution (SLOT-SUB)
SLOT-SUB (Figure 1 left) performs augmentation by substituting a particular text span (slot-value pair) in an utterance with a different text span that is semantically consistent, i.e., has the same slot label. For example, in the utterance "Quali film animati stanno proiettando al cinema più vicino", one of the spans that can be substituted is the slot-value pair (più vicino, SPATIAL RELATION). We then collect other spans in D whose slot values are different but whose slot label is the same. For instance, we find the substitute candidates SP = {("distanza a piedi", SPATIAL RELATION), ("lontano", SPATIAL RELATION), ("nel quartiere", SPATIAL RELATION), . . . }, and then sample one span to replace the original span in the utterance.
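A minimal Python sketch of this substitution is given below. It is an illustration rather than the implementation used in the experiments: the function names are hypothetical, the tags are assumed to be well-formed BIO, the label is written `SPATIAL_RELATION` for convenience, only one slot of the example utterance is annotated, and the second and third utterances are made up solely to supply substitute candidates.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def extract_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) for each slot span, with end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def build_slot_values(dataset) -> Dict[str, Set[Tuple[str, ...]]]:
    """Map each slot label to the set of values observed in the training data."""
    values = defaultdict(set)
    for tokens, tags in dataset:
        for start, end, label in extract_spans(tags):
            values[label].add(tuple(tokens[start:end]))
    return values


def slot_sub(tokens: List[str], tags: List[str], values,
             rng: random.Random) -> Optional[Tuple[List[str], List[str]]]:
    """Replace one randomly chosen span with another value of the same label."""
    spans = extract_spans(tags)
    if not spans:
        return None
    start, end, label = rng.choice(spans)
    # Candidates share the slot label but differ from the original value.
    candidates = sorted(values.get(label, set()) - {tuple(tokens[start:end])})
    if not candidates:
        return None
    new_value = list(rng.choice(candidates))
    new_tags = ["B-" + label] + ["I-" + label] * (len(new_value) - 1)
    return tokens[:start] + new_value + tokens[end:], tags[:start] + new_tags + tags[end:]


# Toy data: simplified annotation, utterances 2 and 3 are invented for illustration.
dataset = [
    ("Quali film animati stanno proiettando al cinema più vicino".split(),
     ["O"] * 7 + ["B-SPATIAL_RELATION", "I-SPATIAL_RELATION"]),
    ("Trova un cinema nel quartiere".split(),
     ["O", "O", "O", "B-SPATIAL_RELATION", "I-SPATIAL_RELATION"]),
    ("Trova un cinema lontano".split(),
     ["O", "O", "O", "B-SPATIAL_RELATION"]),
]
values = build_slot_values(dataset)
print(slot_sub(*dataset[0], values, random.Random(0)))
```

Regenerating the B-/I- tags for the substituted value keeps the tag sequence aligned with the tokens even when the new value has a different length.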

CROP and ROTATE
In order to produce sentence variations, we apply the crop and rotate operations proposed in Şahin and Steedman (2018), which manipulate the sentence structure through its dependency parse tree. The goal of CROP (Figure 1 middle) is to simplify the sentence so that it focuses on a particular fragment (e.g. subject/object) by removing other fragments in the sentence. CROP uses the dependency tree to identify the fragment to remove and then removes it together with its children from the dependency tree.
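The two operations can be illustrated on a toy dependency tree. The sketch below makes several assumptions that should be read as such: the parse of the example utterance is written by hand (in the experiments parses come from a dependency parser), heads are 0-based with -1 marking the root, CROP is read as keeping the root plus one focus fragment, and ROTATE is read, following the description in Şahin and Steedman (2018), as reordering the root's child fragments around the root. Both functions return token indices, so the same order can be applied to the tokens and to their slot tags.

```python
import random
from typing import List, Optional


def children(heads: List[int], i: int) -> List[int]:
    """Indices of the direct dependents of token i."""
    return [j for j, h in enumerate(heads) if h == i]


def subtree(heads: List[int], i: int) -> List[int]:
    """Token i and all its descendants, in surface order (assumes a valid tree)."""
    nodes, stack = [], [i]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(children(heads, node))
    return sorted(nodes)


def crop(heads: List[int], deprels: List[str], focus: str) -> Optional[List[int]]:
    """Keep the root and the fragment attached with relation `focus`; drop the rest."""
    root = heads.index(-1)
    keep = [root]
    for child in children(heads, root):
        if deprels[child] == focus:
            keep.extend(subtree(heads, child))
    return sorted(keep) if len(keep) > 1 else None


def rotate(heads: List[int], rng: random.Random) -> Optional[List[int]]:
    """Randomly reorder the root's child fragments around the root."""
    root = heads.index(-1)
    fragments = [subtree(heads, c) for c in children(heads, root)]
    if len(fragments) < 2:
        return None
    rng.shuffle(fragments)
    split = rng.randint(0, len(fragments))  # how many fragments precede the root
    left = [i for frag in fragments[:split] for i in frag]
    right = [i for frag in fragments[split:] for i in frag]
    return left + [root] + right


def select(seq: List[str], order: List[int]) -> List[str]:
    """Apply an index order to tokens or tags."""
    return [seq[i] for i in order]


# Hand-written, UD-style parse of the running example (an assumption).
tokens = ["Play", "Jeff", "Pilson", "on", "Youtube"]
tags = ["O", "B-ARTIST", "I-ARTIST", "O", "B-SERVICE"]
heads = [-1, 0, 1, 4, 0]
deprels = ["root", "obj", "flat", "case", "obl"]

cropped = crop(heads, deprels, "obj")
print(select(tokens, cropped), select(tags, cropped))
rotated = rotate(heads, random.Random(0))
print(select(tokens, rotated), select(tags, rotated))
```

With the "obj" focus, CROP keeps "Play Jeff Pilson" and drops the "on Youtube" fragment. Since each fragment moves as a unit, slot spans that lie inside a single fragment stay contiguous after ROTATE; spans crossing fragment boundaries would need extra handling that this sketch omits.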

Experiments
Our primary goal is to verify the effectiveness of data augmentation on Italian, Hindi, Turkish, Spanish and Thai NLU datasets with limited labeled data. To this end, we compare the performance of a baseline NLU model trained on the original training data (D) with an NLU model that incorporates the augmented data as additional training instances (D + D'). To simulate the limited labeled data situation, we randomly sample 10% of the training data for each dataset. The joint model based on M-BERT uses [CLS] to predict the intent, and $h_{t_i}$ to predict the slot label. As for DA methods, in addition to the methods described in Section 2, we add one configuration, COMBINE, which combines the results of SLOT-SUB and ROTATE, as ROTATE obtains better results than CROP on the development set.
Settings. The model is trained with the BertAdam optimizer for 30 epochs with early stopping. The learning rate is set to $10^{-5}$ and the batch size is 16. All the hyperparameters are listed in Appendix A. For SLOT-SUB, the number of augmentations per sentence N is tuned on the development set. To produce the dependency tree, we parse the sentence using Stanza (Qi et al., 2020). For both CROP and ROTATE we follow the default hyperparameters from Şahin and Steedman (2018). We did not experiment with Thai for CROP and ROTATE, as Thai is not supported by Stanza. The number of augmented sentences (D') for each method is listed in Table 1. As evaluation metrics, we use the standard CoNLL script to compute the F1 score for slot filling, and accuracy for intent classification.
Datasets. For the Italian language, we use the data from Bellomaria et al. (2019), translated from the English SNIPS dataset (Coucke et al., 2018). SNIPS has been widely used for evaluating NLU models and consists of utterances in multiple domains. As for Hindi and Turkish, we use the ATIS dataset from Upadhyay et al. (2018), derived from Hemphill et al. (1990). ATIS is a well-known NLU dataset in the flight domain. As for Spanish and Thai, we use the FB dataset from Schuster et al. (2019), which contains utterances in the alarm, weather, and reminder domains. The overall statistics of the datasets are shown in Table 1.
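As a reminder of what the two metrics measure, the sketch below computes a span-level, micro-averaged F1 for slots and plain accuracy for intents. It is not the CoNLL script itself; the function names and the representation of each utterance's slots as a set of `(start, end, label)` triples are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label), assumed representation


def slot_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1: a predicted span counts only on an exact boundary and label match."""
    correct = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def intent_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of utterances whose predicted intent equals the gold intent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# One utterance: the ARTIST span is correct, the SERVICE span has a wrong boundary.
gold_spans = [{(1, 3, "ARTIST"), (4, 5, "SERVICE")}]
pred_spans = [{(1, 3, "ARTIST"), (3, 5, "SERVICE")}]
print(slot_f1(gold_spans, pred_spans))
print(intent_accuracy(["PLAYMUSIC"], ["PLAYMUSIC"]))
```

In this example one of two predicted spans matches one of two gold spans, so precision, recall, and F1 are all 0.5, which shows that a span with a wrong boundary receives no partial credit.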

Results
The overall results reported in Table 2 show that applying DA improves performance on slot filling and intent classification across all languages. In particular, for SF, the SLOT-SUB method yields the best result, while for IC, ROTATE obtains better performance than CROP in most cases. These results are consistent with the findings of Louvan and Magnini (2020) on the English datasets, where SLOT-SUB improves SF and CROP or ROTATE improve IC. We think ROTATE is generally better than CROP on IC because CROP may change the intent of the original sentence. Intents typically depend on the occurrence of specific slots, so when the cropped part is a slot value, it may change the sentence's overall semantics.
We can see that languages with different typological features (e.g. subject/verb/object ordering) benefit from the ROTATE operation for IC. This result suggests that augmentation can produce useful noise (regularization) for the model to alleviate overfitting when labeled data is limited. COMBINE still helps the performance of both SF and IC, although the improvements are not as high as when only one of the augmentation methods is applied. Turkish is the only language that benefits most from COMBINE. We hypothesize that, as Turkish has a more flexible word order than the other languages, it benefits the most when ROTATE is performed.
Performance on varying data size. To better understand the effectiveness of SLOT-SUB, we perform further analysis on different training data sizes (see Figure 2). Overall, we observe that as we increase the training size, the benefit of SLOT-SUB decreases for all datasets. For some datasets, namely ATIS-HI and FB-ES, SLOT-SUB can cause a performance drop for larger data sizes, although the drop is reasonably small (less than 1 F1 point). FB-TH consistently benefits from SLOT-SUB even when the full training data is used. The training data size up to which the improvement is significant varies across datasets. For SNIPS-IT, the improvement is clear for all training data sizes and is statistically significant up to a training data size of 80%. For ATIS-HI, improvements are significant up to a data size of 40%. As for the FB datasets, improvements are significant only up to a training data size of 10%. Overall, we can see that SLOT-SUB is effective for cases where data is scarce (5%, 10%), while it is still relatively robust for larger data sizes on all datasets.
Performance on different numbers of augmentations per utterance (N). We examine the effect of a larger number of augmentations per utterance (N) on the model performance, specifically for SF (see Figure 3). For FB-ES, similarly to the results in Table 2, increasing N does not affect the performance. For the other datasets, increasing N brings performance improvement. For ATIS-HI, SNIPS-IT, and FB-TH the trend is that, as we increase N, performance goes up and then plateaus. For ATIS-TR, changing N does not really affect the performance gain, as the performance trend is quite steady across numbers of augmentations. For most values of N in each dataset (except FB-ES), the difference in performance between the model that uses SLOT-SUB and the model that does not is significant.

Related Work
Data augmentation methods that have been proposed in NLP aim to automatically produce additional training data through different kinds of methods, ranging from simple word substitution (Wei and Zou, 2019) to more complex methods that aim at semantically preserving sentence generation (Hou et al., 2018). In the context of slot filling and intent classification, recent augmentation methods typically apply deep learning models to produce augmented utterances. Hou et al. (2018) propose a two-stage method consisting of delexicalized utterance generation and slot value realization. Their method is based on a sequence to sequence model (Sutskever et al., 2014) that produces a paraphrase of an utterance with its slot values replaced by placeholders (delexicalized) for a given intent. For the slot value lexicalization, they use the slot values in the training data that occur in similar contexts. Zhao et al. (2019) train a sequence to sequence model with training instances that consist of pairs of atomic templates of dialogue acts and their sentence realizations. Yoo et al. (2019) propose a solution that extends the Variational Auto Encoder (VAE) (Kingma and Welling, 2014) into a Conditional VAE (CVAE) to generate synthetic utterances. The CVAE controls the utterance generation by conditioning on the intent and slot labels during model training. Recent work makes use of a Transformer (Vaswani et al., 2017) based pre-trained NLG model, namely GPT-2 (Radford et al., 2019), and fine-tunes it on slot filling datasets to produce synthetic utterances. We consider these deep learning based approaches heavyweight, as they often require several stages in the augmentation process, namely generating augmentation candidates, then ranking and filtering the candidates before producing the final augmented data. Consequently, the computation time of these approaches is generally higher, as separate training is required for the augmentation model and the joint SF-IC model.
Recent work from Louvan and Magnini (2020) applies a set of lightweight methods, most of which do not require model training. The augmentation methods focus on varying the slot values through substitution mechanisms and varying sentence structure through dependency tree manipulation. While the methods are relatively simple, they obtain competitive results with deep learning based approaches on the standard English slot filling benchmark datasets, namely ATIS (Hemphill et al., 1990), SNIPS (Coucke et al., 2018), and FB (Schuster et al., 2019).
Existing methods mostly evaluate their approaches on English datasets, and little work has been done on other languages. Our work focuses on investigating the effect of data augmentation on five non-English languages. We apply a subset of lightweight augmentation methods from Louvan and Magnini (2020) that do not require separate model training to produce augmentation data.

Conclusion
We evaluate the effectiveness of data augmentation for slot filling and intent classification tasks in five typologically diverse languages. Our results show that by applying simple augmentation, namely slot value substitutions and dependency tree manipulations, we can obtain substantial improvements in most cases when only a small amount of training data is available. We also show that a large pre-trained multilingual BERT benefits from data augmentation.