Italian Counter Narrative Generation to Fight Online Hate Speech

English. Counter Narratives are textual responses meant to withstand online hatred and prevent its spreading. The use of neural architectures for the generation of Counter Narratives (CNs) is beginning to be investigated by the NLP community. Still, efforts so far have targeted only English. In this paper, we try to fill the gap for Italian, studying how to implement CN generation approaches effectively. We experiment with an existing dataset of CNs and a novel language model, recently released for Italian, under several configurations, including zero- and few-shot learning. Results show that even for under-resourced languages, data augmentation strategies paired with large unsupervised LMs can yield promising results.
Italiano. Le Contro Narrative sono risposte testuali volte a contrastare l’odio online e a prevenirne la diffusione. La comunità di NLP ha iniziato a studiare l’uso di architetture neurali per la generazione di CN. Tuttavia, gli sforzi sono stati rivolti esclusivamente all’inglese. In questo lavoro, cerchiamo di colmare la lacuna per l’italiano, mostrando come implementare efficacemente approcci di generazione di CN. Sperimentiamo con un dataset esistente di CN e un modello del linguaggio per l’italiano recentemente rilasciato, in diverse configurazioni, tra cui zero e few shot learning. I risultati mostrano che anche per lingue con poche risorse, strategie di data augmentation abbinate a potenti modelli del linguaggio possono offrire risultati promettenti.
Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Introduction
The rise of online Hate Speech (HS) brings along the need for combating strategies, as it can trigger harmful psychological effects on the target groups and lead to more crimes against them. While research has widely focused on hate speech detection methodologies for social media platforms (Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018), a recent line of research has taken the problem a step further by addressing the automatic generation of counter responses, also known as counter narratives (Qian et al., 2019; Tekiroglu et al., 2020), in order to assist non-governmental organizations in their real-world efforts to combat online hatred. An example of HS along with a possible CN is shown below:
HS: Gli arabi sono tutti terroristi e vogliono conquistarci con la violenza e le bombe. Bisogna rispondere con il napalm.
[Arabs are all terrorists and they want to conquer us with violence and bombs. We must respond with napalm.]
CN: Essere di origine araba non significa essere terroristi, evitiamo generalizzazioni che portano solo ad altro odio.
[Being of Arab descent does not mean being a terrorist, let's avoid generalizations that only lead to more hatred.]
Despite the encouraging results of the counter narrative generation task, experiments have been limited to English due to the scarcity of hate speech / counter narrative data in other languages. In this paper, we investigate counter narrative generation for Italian as a case study where zero or only a small amount of task-specific in-language data is available. We first explore the portability of generation across languages, considering that recent neural machine translation (NMT) systems have shown outstanding performance. We propose utilizing off-the-shelf NMT models to synthesize silver data from other languages, and fine-tuning GePpeTto (Mattei et al., 2020), a recently developed GPT-2 based language model for Italian, on the silver data. We then examine the effect of combining silver with gold data on CN generation by experimenting with various gold data sizes. Our findings show that a proper combination of silver and gold data while fine-tuning LMs can drastically reduce the need for expert-annotator effort on target languages.

Related Work
In this section we briefly recap relevant works for our counter narrative generation task, including the problem of online hatred recognition, effectiveness of approaches to hatred intervention, methodologies for generating counter-arguments, and text generation for low-resourced languages.
Counter-argument Generation. This task shares the same abstract goal as CN generation, i.e., to produce the opposite or an alternate stance of a statement. Previous works adopted sequence-to-sequence architectures to generate arguments (Rakshit et al.; Hua et al., 2019; Rach et al., 2018; Le et al., 2018), targeting specific domains in which massive discussion is available, such as politics (Hua et al., 2019; Le et al., 2018) and economy (Le et al., 2018; Wachsmuth et al., 2018).
NLG for under-resourced languages. In spite of several studies addressing NLG, only a few have investigated generation for languages other than English. Examples include the porting of the SimpleNLG API (Gatt and Reiter, 2009) to Dutch (de Jong and Theune, 2018) and Italian (Mazzei et al., 2016), and bilingual generation combining NMT and Generative Adversarial Networks (Rashid et al., 2019).

Italian Counter Narrative Generation
Our main goal is to determine a methodology for Italian counter narrative generation considering the scarcity of gold standard data for training. Accordingly, we hypothesize that the availability of a decent amount of silver data can provide a kick-start for the generative models. Therefore, we resort to data augmentation through translation, relying on existing datasets of hate speech / counter narrative pairs in other languages. For translation, we use DeepL, an off-the-shelf and well-performing MT system, to translate data from other languages to Italian. The translated pairs are used for fine-tuning a large Italian pre-trained generative model, i.e. GePpeTto, along with the original Italian gold standard pairs.

Dataset
For our study, we use the CONAN dataset (Chung et al., 2019), a niche-sourced hate-countering dataset that consists of HS/CN pairs focusing on Islamophobia. The dataset provides pairs in English, French, and Italian, collected with the help of operators from three European NGOs specialized in online hate countering. Each pair in CONAN is either an original pair or one of the 2 paraphrases of an original pair. In the experiments, we used the following splits:
1. 2142 pairs (the original IT pairs plus 1 IT paraphrase of each) as a training set made of gold standard data.
2. 5996 pairs as a training set made of silver data obtained by automatically translating FR and EN pairs to IT.
3. 1071 pairs (the rest of the IT paraphrased pairs) kept for testing purposes.
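The gold split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout and the function name `split_italian_pairs` are hypothetical and do not reflect CONAN's actual file format, and the count of 1071 original Italian pairs is inferred from the split sizes (2142 = 2 × 1071) rather than stated explicitly.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (hate speech, counter narrative)


def split_italian_pairs(groups: List[Dict[str, object]]) -> Tuple[List[Pair], List[Pair]]:
    """Split {original, paraphrases} groups into gold training and test sets.

    Assumption: every group holds one original pair and exactly two
    paraphrases. The original and the first paraphrase go to training;
    the second paraphrase is held out for testing.
    """
    train: List[Pair] = []
    test: List[Pair] = []
    for group in groups:
        first, second = group["paraphrases"]
        train.extend([group["original"], first])
        test.append(second)
    return train, test


# Toy records standing in for the original Italian pairs (hypothetical content).
toy_groups = [
    {
        "original": (f"hs{i}", f"cn{i}"),
        "paraphrases": [(f"hs{i}_p1", f"cn{i}_p1"), (f"hs{i}_p2", f"cn{i}_p2")],
    }
    for i in range(1071)
]
gold_train, gold_test = split_italian_pairs(toy_groups)
```

With this layout, no original pair is split across training and test except through its held-out paraphrase, which is what the test split above consists of.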

Models
In order to inspect how Italian CN generation can be accomplished under different resource conditions, we test the effect of using (i) silver data, (ii) gold standard data, and (iii) their combination. In particular, we experiment with the following configurations on which GePpeTto is fine-tuned:
GP-trans. GePpeTto is fine-tuned on the silver data obtained by translating EN and FR pairs to IT using DeepL. This configuration represents the worst-case scenario, where no HS/CN pair is available in the target language, and corresponds to a zero-shot learning setting.
GP-ita. We fine-tune GePpeTto on all the original IT pairs in CONAN. This represents our practical best-case scenario, although more pairs might provide better results.
GP-hybrid. We conjecture that introducing even a small amount of gold standard examples can help LMs adapt to the domain-specific idiosyncrasies. Moreover, we inspect how generation performance varies with the size of the gold standard data provided. To this end, we conduct a second phase of fine-tuning on top of the GP-trans model using 100, 300, 500, 800, and the full set of IT pairs of CONAN. This lets us represent various intermediate few-shot conditions in which few to several pairs are available for the target language, and assess how much the pretraining with the silver data reduces the amount of gold standard data needed to reach a proper generation performance.
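The gold data budgets used for GP-hybrid can be illustrated with a small sampling helper. This is a sketch, not the procedure actually used in the experiments: the text does not say how the subsets were drawn, so the nested, seeded sampling below is an assumption, and `gold_subsets` is a hypothetical name.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]  # (hate speech, counter narrative)

# Gold sizes from the GP-hybrid setup; None stands for "all gold pairs".
GOLD_SIZES: List[Optional[int]] = [100, 300, 500, 800, None]


def gold_subsets(
    gold: Sequence[Pair],
    sizes: Sequence[Optional[int]] = GOLD_SIZES,
    seed: int = 0,
) -> Dict[int, List[Pair]]:
    """Return one training subset per requested gold size.

    Assumption: subsets are nested prefixes of a single seeded shuffle,
    so each larger budget extends the smaller ones.
    """
    shuffled = list(gold)
    random.Random(seed).shuffle(shuffled)
    subsets: Dict[int, List[Pair]] = {}
    for size in sizes:
        n = len(shuffled) if size is None else min(size, len(shuffled))
        subsets[n] = shuffled[:n]
    return subsets


# Toy gold data; the size of 1000 is arbitrary and only for illustration.
toy_gold = [(f"hs{i}", f"cn{i}") for i in range(1000)]
subsets = gold_subsets(toy_gold)
```

Nesting the subsets keeps the points of a learning curve comparable, since each step only adds pairs to the previous one.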

Training Details
For all the experiments, we used GePpeTto as the pretrained Italian language model, as available through HuggingFace's Transformers library, and fine-tuned our models on a single K80 GPU using a batch size of 2048 tokens. The hyperparameter tuning details are provided below. At test time, we employed nucleus sampling with a p value of 0.9 for the generation of CNs. Conditioned on an HS, the generated text enclosed between [CN start token] and [CN end token] is selected as the output.
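To make the decoding step concrete, the following is a minimal, library-free sketch of nucleus (top-p) sampling over a single next-token distribution, followed by the extraction of the CN span. It is illustrative only: the function names are hypothetical, the marker strings `<cn>` and `</cn>` are placeholders for the actual CN start and end tokens (which are not specified here), and in practice the sampling would be delegated to the generation utilities of the modelling library.

```python
import random
from typing import Dict, Optional, Sequence


def nucleus_filter(probs: Sequence[float], p: float = 0.9) -> Dict[int, float]:
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= p:
            break
    # Renormalise the kept probabilities so that they sum to 1.
    return {idx: probs[idx] / cumulative for idx in kept}


def sample_token(probs: Sequence[float], p: float = 0.9,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token id from the nucleus of the distribution."""
    rng = rng or random.Random()
    nucleus = nucleus_filter(probs, p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


# Hypothetical marker strings standing in for the CN start/end tokens.
CN_START, CN_END = "<cn>", "</cn>"


def extract_cn(generated: str) -> Optional[str]:
    """Return the text between the CN markers, or None if either is missing."""
    start = generated.find(CN_START)
    if start == -1:
        return None
    start += len(CN_START)
    end = generated.find(CN_END, start)
    if end == -1:
        return None
    return generated[start:end].strip()
```

With p = 0.9, low-probability tail tokens are never sampled, which limits degenerate continuations while still leaving room for varied wording.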
Training epochs. We empirically chose 5 epochs of training for all the configurations, tuned over {2, 3, 5} on the test set. Preliminary experiments show that while a lower number of epochs grants higher novelty in the output, it also comes at the cost of lower BLEU scores. A further manual evaluation confirmed that generation after 5 epochs provides more suitable responses.
Learning rate. Having fixed the number of epochs, we experimented with learning rates of {1, 2, 5}e-5 and chose 5e-5 as the best-performing setting: preliminary experiments show that, while producing less novel and slightly more repetitive text, a learning rate of 5e-5 consistently yields better BLEU and ROUGE scores.
Fine-tuning steps. In cases where multiple datasets (silver and gold standard) were used, we followed a multi-step fine-tuning procedure, first using the silver and then the gold standard dataset. Gururangan et al. (2020) showed that task-adaptive pretraining on curated data whose distribution is similar to that of the end task provides significant improvements. Our fine-tuning scheme follows this finding by first fine-tuning GePpeTto on the silver data, which serves as task-adaptive pretraining with an augmented dataset. Our preliminary experiments confirmed that adapting fine-tuned models towards the language characteristics of the target corpus is more effective than mixing silver and gold data together in a single fine-tuning procedure.
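The ordering of the multi-step procedure can be sketched independently of any training framework. The code below is a schematic illustration, not the training script used in the experiments: `multi_step_fine_tune` and the `train_one_epoch` callback are hypothetical names, the callback is a stand-in for a real training loop, and applying the same 5 epochs and 5e-5 learning rate to both stages is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (hate speech, counter narrative)
Stage = Tuple[str, Sequence[Pair]]
TrainEpoch = Callable[[Sequence[Pair], float], None]


def multi_step_fine_tune(
    stages: Sequence[Stage],
    train_one_epoch: TrainEpoch,
    epochs: int = 5,
    learning_rate: float = 5e-5,
) -> List[Tuple[str, int]]:
    """Run the stages in order and return a log of (stage name, epoch).

    The model is expected to live inside `train_one_epoch` (for example
    as a closure), so that each stage continues from the previous one.
    """
    log: List[Tuple[str, int]] = []
    for name, data in stages:
        if not data:
            continue  # e.g. the zero-shot case, where no gold pairs exist
        for epoch in range(1, epochs + 1):
            train_one_epoch(data, learning_rate)
            log.append((name, epoch))
    return log


# Usage with a stub that only records how many pairs each epoch saw.
seen_sizes: List[int] = []


def fake_epoch(data: Sequence[Pair], lr: float) -> None:
    seen_sizes.append(len(data))


silver = [("hs_s", "cn_s")] * 3  # toy silver pairs
gold = [("hs_g", "cn_g")] * 2    # toy gold pairs
log = multi_step_fine_tune([("silver", silver), ("gold", gold)], fake_epoch)
```

Passing the stages as an ordered list keeps the silver-then-gold order explicit, in contrast with a single pass over the concatenated data.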

Evaluation
For our experiments, we report the word-overlap metrics BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) to evaluate CN generation on the gold standard test set. As for generation quality, we compute Repetition Rate (RR) (Bertoldi et al., 2013) and Novelty (Wang and Wan, 2018) to assess, respectively, how diverse a response is with reference to the given HS and how novel the generation is with respect to the training data.
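For intuition, the two quality metrics can be approximated as follows. This is a simplified sketch based on our reading of the cited definitions, not the evaluation code used for the reported numbers: Novelty is computed as one minus the highest Jaccard similarity between a generated text and any training text, and Repetition Rate as the geometric mean, over n-gram orders 1 to 4, of the share of n-gram types occurring more than once. Whitespace tokenisation, lowercasing, and the omission of the sliding window used in the original RR formulation are all simplifying assumptions.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def tokens(text: str) -> List[str]:
    """Naive whitespace tokenisation (an assumption of this sketch)."""
    return text.lower().split()


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity between the token sets of two texts."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def novelty(generated: str, training: Iterable[str]) -> float:
    """1 minus the highest Jaccard similarity with any training text."""
    gen = tokens(generated)
    best = max((jaccard(gen, tokens(t)) for t in training), default=0.0)
    return 1.0 - best


def repetition_rate(text: str, max_n: int = 4) -> float:
    """Geometric mean over n = 1..max_n of the share of repeated n-gram types.

    Simplification: computed over the whole text, without a sliding window.
    """
    toks = tokens(text)
    rate = 1.0
    for n in range(1, max_n + 1):
        counts = Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        if not counts:
            return 0.0  # text shorter than n tokens
        repeated = sum(1 for c in counts.values() if c > 1)
        rate *= repeated / len(counts)
    return rate ** (1.0 / max_n)
```

Under this formulation a lower RR indicates more diverse output, while a higher Novelty indicates generations that are further from the training texts.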
We also conduct a human evaluation to compare the generation quality of the configurations based on 3 criteria: (i) Suitableness. How suitable the given CN is as a response for the input HS. (ii) Specificity. How specific the given CN is as a response. This metric is used to discern suitable responses that are nonetheless very generic. (iii) Grammaticality. How grammatically correct the given CN is. All scores were on a scale from 1 to 5.

Results and Discussion
Model comparison. Results in Table 1 show that using the silver data (GP-trans) provides a viable step towards a proper model. When gold standard data is also available (GP-hybrid), we obtain better quantitative performance in terms of BLEU and ROUGE scores in comparison to the best-case scenario (GP-ita). Furthermore, combining the silver translations and the Italian gold standard data (GP-hybrid) also yields better performance in terms of output diversity (RR 11.7 vs 12.8). In contrast, the most novel output is obtained by GP-trans, which can be expected since EN and FR pairs usually have a slightly different focus on the topic of Islamophobia (topics and tropes can vary across nations and cultures). In Table 2 we provide a few examples of generated CNs.
Learning Curve Discussion. As can be seen in Figure 1, even 100 Italian pairs are enough to dramatically improve the performance of GePpeTto on the task of CN generation over the baseline GP-trans. If we continue fine-tuning GP-trans with more and more Italian pairs, we are soon able to outperform GP-ita as well. The number of examples required to reach a new state of the art for CN generation in Italian falls between 200 and 300, which reduces the required amount of gold standard data by around 80%. Therefore, it becomes clear that a good NMT model can be of fundamental help while porting the generation task to new languages, especially if few or no gold standard examples are available in the target language. Considering that counter narrative data collection is an expert-based task requiring costly human effort (Chung et al., 2019), decreasing the required amount of expert data can be of remarkable importance for low-resource languages.
Human Evaluation. As annotators, we employed 2 native Italian speakers who are experts in counter narrative production. The annotators were instructed to assess CN suitableness, specificity, and grammaticality with respect to the paired hate speech. During training, we explained what a common and suitable counter narrative is, and then asked them to evaluate the generations intuitively, without overthinking. We further presented 20 examples of HS/CN pairs to demonstrate the appropriate evaluation. In order to avoid comparison or primacy/recency effects, we presented 20 random pairs from each condition to each annotator as a single randomized file and asked them to evaluate each counter narrative with respect to the 3 criteria. The results presented in Table 3 show that all models reach very high levels of grammaticality; most of the sentences were completely grammatical, and the few ungrammatical ones were due to dangling sentences. Moreover, using silver data alone already provides a performance lower than, but close to, the GP-ita case for Suitableness and Specificity. Finally, further fine-tuning GP-trans on gold standard data (GP-hybrid) provides the most suitable and the least generic responses among the 3 models, in line with their ranking on the automatic metrics.
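The construction of the randomized evaluation file can be sketched as follows. This is an illustrative reconstruction under assumptions, not the script used for the study: `build_evaluation_sheet` is a hypothetical name, and the `condition` field is assumed to be kept only for later aggregation and hidden from the annotators.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (hate speech, generated counter narrative)


def build_evaluation_sheet(
    outputs_by_condition: Dict[str, Sequence[Pair]],
    per_condition: int = 20,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Sample pairs from each condition and shuffle them into one list.

    The 'condition' field is meant for later aggregation only and should
    not be shown to annotators (an assumption of this sketch).
    """
    rng = random.Random(seed)
    items: List[Dict[str, str]] = []
    for condition, pairs in outputs_by_condition.items():
        for hs, cn in rng.sample(list(pairs), per_condition):
            items.append({"condition": condition, "hs": hs, "cn": cn})
    rng.shuffle(items)  # interleave conditions to avoid order effects
    return items


# Toy outputs for the three configurations (hypothetical content).
toy_outputs = {
    name: [(f"hs{i}", f"{name}_cn{i}") for i in range(50)]
    for name in ("GP-trans", "GP-ita", "GP-hybrid")
}
sheet = build_evaluation_sheet(toy_outputs)
```

Shuffling all conditions into a single list means annotators cannot infer which model produced a given CN from its position in the file.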

Conclusion and Future Work
Counter narrative generation using neural architectures is beginning to be studied for hatred intervention. In this paper, we presented the first attempt at CN generation for Italian, investigating several variations of generation when gold data is limited or not available. Our experiments reveal that simple data augmentation strategies paired with powerful LMs can bring promising results for under-resourced languages. As future work, we plan to explore methods for improving specificity via the use of diverse types of knowledge (e.g., facts, events and named entities) from external resources. This would also mitigate the problem of the model possibly generating hallucinated content (i.e. non-existent or wrong content); see for example Maynez et al. (2020) and Nie et al. (2019). Finally, we plan to apply this approach to other hate phenomena such as antisemitism, homophobia, and misogyny.
[It cannot be considered an invasion when the number of Muslims in Italy is less than even 5%.]
Gold: Salve, in realtà il numero di musulmani in Italia è pari a poco meno il 5% della popolazione totale: ritiene davvero che si possa parlare di sostituzione?
[Hello, in reality the number of Muslims in Italy is equal to a little less than 5% of the total population: do you really think that we can talk about substitution?]
Table 2: Sample CN generations along with EN translation. GP-trans generation is grammatically correct but focused on the UK/FR scenario. Instead, GP-ita and GP-hybrid can mimic gold arguments with novel and diverse wording.