How "BERTology" Changed the State-of-the-Art also for Italian NLP

The use of contextualised word embeddings has led to substantial performance gains in almost all Natural Language Processing (NLP) applications. Recently, some new models developed specifically for Italian have become available to scholars. This work evaluates the impact of these models on application performance for Italian, establishing the new state-of-the-art for some fundamental NLP tasks.


Introduction
The introduction of contextualised word embeddings, starting with ELMo (Peters et al., 2018) and in particular with BERT (Devlin et al., 2019) and the subsequent BERT-inspired transformer models (Liu et al., 2019; Martin et al., 2020; Sanh et al., 2019), marked a revolution in Natural Language Processing, boosting the performance of almost all applications, especially those based on statistical analysis and Deep Neural Networks (DNN).
A recent study (He and Choi, 2019) tried to determine new baselines for several NLP tasks for English, fixing the new state-of-the-art for the examined tasks. This work aims to do the same for Italian. We considered a number of relevant tasks, applied state-of-the-art neural models available to the community, and fed them with all the contextualised word embeddings specifically developed for Italian.

Italian "BERTology"
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The availability of powerful computational resources to the community has allowed the development of several BERT-derived models trained specifically on large Italian corpora of various textual types. All these models have been taken into account in our evaluation. In particular, we considered the following models which, at the time of writing, are the only ones available for Italian:
• Multilingual BERT: with the first BERT release, Google also developed a multilingual model ('bert-base-multilingual-cased', hereafter bertMC) that can also be applied to Italian texts.
• AlBERTo: last year, a research group from the University of Bari developed a brand new model for Italian devoted specifically to Twitter texts and social media ('m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0', hereafter alUC), trained on 200 million tweets from 2012 to 2015 (Polignano et al., 2019). Only the uncased model is available to the community. Because of its specific training, alUC requires a particular pre-processing step that replaces hashtags, URLs, etc. This step alters the official tokenisation and makes the model hardly applicable to word-based classification tasks on general texts; thus, it is meant to be used only on Twitter or social media data. In any case, we tested it on all the considered tasks and, whenever the results were reasonable, we reported them.
• GilBERTo: a rather recent Italian model based on CamemBERT ('idb-ita/gilberto-uncased-from-camembert', hereafter giUC), trained on the huge Italian Web corpus section of the OSCAR (Ortiz Suárez et al., 2019) Web corpus project, consisting of more than 11 billion tokens. For GilBERTo, too, only the uncased model is available.
• UmBERTo: the most recent model developed explicitly for Italian, as far as we know, is UmBERTo. Like GilBERTo, it was trained on OSCAR but, unlike GilBERTo, the resulting model is cased.
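The pre-processing issue noted for AlBERTo can be illustrated with a minimal Python sketch. The regular expressions and the placeholder strings `<url>`, `<user>` and `<hashtag>` are assumptions made for illustration and are not the actual AlBERTo pipeline; the point is only that such replacements change the number of whitespace-separated tokens, so word-level gold labels no longer line up with the input.

```python
import re

# Hypothetical patterns and placeholders, chosen only for illustration.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def normalise_tweet(text):
    """Replace URLs, mentions and hashtags with placeholders, then lowercase."""
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)
    text = HASHTAG_RE.sub(r"<hashtag> \1 </hashtag>", text)
    # Collapse repeated whitespace and lowercase, as an uncased model expects.
    return " ".join(text.lower().split())


tweet = "Ciao @Mario guarda https://example.com #BelPaese"
normalised = normalise_tweet(tweet)
print(normalised)
# The token count changes, so one-label-per-word annotations would misalign.
print(len(tweet.split()), len(normalised.split()))
```

For sentence-level tasks such as sentiment classification this mismatch is harmless, because a single label is attached to the whole text.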

Evaluation Tasks
Following the work of He and Choi (2019), we selected some basic tasks for both word and sentence/text classification. We concentrated mainly on tasks for which evaluation procedures were well established in the Italian community and reliable evaluation benchmarks were available. We chose (a) two very basic word-classification tasks, namely part-of-speech (PoS) tagging and Named Entity Recognition (NER), (b) the dependency parsing task and (c) two very important tasks for social-media text classification, namely Sentiment Analysis (Subjectivity/Polarity/Irony classification) and Hate Speech Detection (HSD). We relied mainly on benchmarks proposed in past EVALITA evaluation challenges or in the Universal Dependencies (UD) project.
Since the influential paper by Reimers and Gurevych (2017), it has been clear to the community that a single score for each DNN training session can be heavily affected by the system initialisation point. We should instead report the mean and standard deviation of several runs with the same setting, in order to get a more accurate picture of real system performance and to make more reliable comparisons between systems. Thus, any new result proposed in this paper is presented as the mean and standard deviation of at least 5 runs.
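The reporting convention described above can be sketched in a few lines of Python. The scores below are invented for illustration, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def summarise_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    return mean(scores), stdev(scores)


def format_result(scores, digits=2):
    """Format the scores as 'mean±std', the style used for reporting results."""
    m, s = summarise_runs(scores)
    return f"{m:.{digits}f}±{s:.{digits}f}"


# Hypothetical accuracies from five runs with different random seeds.
runs = [97.0, 97.2, 96.8, 97.1, 96.9]
print(format_result(runs))  # 97.00±0.16
```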
With regard to dataset splitting, if a dataset was already split into training/validation/test sets, we adopted that subdivision. If instead the dataset was split only into development and test sets, we further split the development set into training and validation sets, used them to train the system and to tune the stopping epoch and, once that parameter was fixed, retrained the system on the entire development set, stopping at the same epoch.
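The second case of this protocol can be sketched as follows. This is only an outline under stated assumptions: the 10% validation fraction, the fixed seed and the function names are hypothetical, and the validation scores are invented.

```python
import random


def split_development(dev_examples, valid_fraction=0.1, seed=0):
    """Split a development set into (training, validation) parts.

    The fraction and the seed are assumptions; the text does not give them.
    """
    shuffled = list(dev_examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def select_stopping_epoch(validation_scores):
    """Return the 1-based epoch with the best validation score (earliest on ties)."""
    return validation_scores.index(max(validation_scores)) + 1


development = list(range(100))           # stand-in for annotated examples
train, valid = split_development(development)

# Hypothetical validation scores, one per epoch, from training on `train`.
scores_per_epoch = [0.90, 0.93, 0.95, 0.94]
stop_epoch = select_stopping_epoch(scores_per_epoch)

# Final step: retrain on the whole development set for `stop_epoch` epochs,
# then evaluate once on the held-out test set.
final_training_set = train + valid
print(stop_epoch, len(final_training_set))
```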

Part-of-Speech Tagging
The first task we worked on is part-of-speech tagging. This is a very basic task in NLP, and many applications rely on precise PoS-tag assignments. Various data sets are available for this task, taken from one of the EVALITA 2007 tasks (Tamburini, 2007) and from the UD annotated corpora.

System
EVALITA 2007 (Tamburini, 2016) 98
The best results on the EVALITA 2007 data set were obtained by Tamburini (2016) using a BiLSTM-CRF system based on word2vec word embeddings enriched with morphological information. For the UD corpora we considered the ISDT corpus v2.5 and PoSTWITA: no evaluation results are available in the literature for the ISDT corpus, while for PoSTWITA the best results were obtained by Basile et al. (2017), using a BiLSTM-CRF system, and by the best system at EVALITA 2016 (Cimino and Dell'Orletta, 2016a).
The PoS-tagging system used for our experiments is very simple and consists of a slight modification of the fine-tuning script 'run_ner.py' available with version 2.7.0 of the Huggingface/Transformers package. We did not perform any hyperparameter tuning; the validation set was used only to determine the stopping criterion.
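Fine-tuning a BERT-derived model for word classification requires mapping one label per word onto the subword tokens produced by the model tokenizer. The sketch below shows one common convention, in which the first subword carries the word label and the remaining subwords receive an index that the loss ignores. It is a simplified illustration and not the authors' modified script: `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer, and the tag set is invented.

```python
IGNORE_INDEX = -100  # default index ignored by PyTorch's CrossEntropyLoss


def toy_tokenize(word, max_len=4):
    """Hypothetical subword tokenizer: cuts a word into fixed-size chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def align_labels(words, labels, tokenize, label_to_id):
    """Expand word-level labels to subword level.

    Only the first subword of each word keeps the label id; the others are
    masked so that they do not contribute to the training loss.
    """
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        if not pieces:
            continue
        tokens.extend(pieces)
        label_ids.extend([label_to_id[label]] + [IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, label_ids


label_to_id = {"DET": 0, "NOUN": 1, "VERB": 2}
tokens, label_ids = align_labels(
    ["Il", "gatto", "dorme"], ["DET", "NOUN", "VERB"], toy_tokenize, label_to_id
)
print(tokens)     # ['Il', 'gatt', 'o', 'dorm', 'e']
print(label_ids)  # [0, 1, -100, 2, -100]
```

At prediction time the same mask is used to read off one tag per word, so accuracy is still computed at word level.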
Tables 1, 2 and 3 show the results obtained by fine-tuning the considered BERT-derived models for this task. The results show a substantial increase in performance over the literature, and UmBERTo is consistently the best system.

PoS-tagging on Speech Data
We participated in the EVALITA 2020 KIPOS challenge (Bosco et al., 2020) on evaluating PoS-taggers on speech data, using exactly the same tagger. In this case, we did not perform any parameter tuning: we used the basic parameters and stopped the training phase after 10 epochs. After the challenge, we evaluated all the BERT-derived models in order to provide a complete overview of the available resources.
Table 4 shows the results obtained by fine-tuning all the considered BERT-derived models for the Main Task. The results show a substantial increase in performance over the other participants, and UmBERTo is again the best system. We did not participate in the official challenge for the two subtasks, but we include the results of our best system for these tasks as well. Table 5 shows the results compared with those of the other two participating systems. Again, the simple fine-tuning of a BERT-derived model, namely UmBERTo, exhibits the best performance on Subtask B. The scarcity of data probably affected the results on Subtask A.

Named Entity Recognition
The second task we considered is Named Entity Recognition. For system evaluation we relied on the evaluation benchmark used in the EVALITA 2009 campaign (Bartalesi Lenzi et al., 2009). The best results in the literature are due to (Basile et al., 2017; Zanoli et al., 2009). For this task we used exactly the same script as in the previous task, both being simple word-classification tasks, and did not apply any hyperparameter tuning at all, fixing the number of epochs a priori to 10. Table 6 outlines the results obtained. Again, a simple fine-tuning of BERT-derived models is powerful enough to guarantee substantial performance gains over the previous literature and, again, UmBERTo is the model producing the best performance.

Parsing Universal Dependencies
Parsing is one of the most important tasks in NLP, and the recent advances due to DNN and contextualised distributed representations have allowed for large performance improvements. The Universal Dependencies project is the reference repository for standardised treebanks in various languages, thus it seemed natural to take evaluation benchmarks from that project. As for PoS-tagging, we used two treebanks from UD v2.5, namely ISDT and PoSTWITA.
The recent work of Antonelli and Tamburini (2019) examined all the DNN parsers available at the time, re-training them on some Italian datasets. In particular, they showed that the neural parser of Dozat and Manning (2017) (version 1.0) exhibited the best performance on UD-ISDT v2.1. Given that experience, we included in our new experiments the latest version (v3.0) of this parser 8, considering it a strong baseline for this task. The word embeddings we used for these experiments were the same as those used in (Antonelli and Tamburini, 2019) and were computed using the ItWaC corpus (Baroni et al., 2009) and word2vec (Mikolov et al., 2013a,b).
Very recently, Vacareanu et al. (2020) showed that dependency structures can be computed efficiently by treating parsing as a double fine-tuning task over a BERT-derived model, the first for determining the attachments and the second for the edge labels, obtaining state-of-the-art performance. In this case the fine-tuning DNN is more complex than in the previous tasks, consisting of a bidirectional LSTM followed by some dense layers.
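To make the idea of parsing as two tagging problems concrete, the sketch below encodes a dependency tree as two label sequences, one for attachments and one for edge labels. It assumes that each attachment is encoded as the relative position of the token's head, which is our reading of a tagging formulation and is not spelled out in the text above; the function names, the "root" tag and the example analysis are hypothetical.

```python
def encode_heads(heads):
    """Turn absolute head indices into relative-offset tags.

    heads[i] is the 1-based index of the head of token i + 1; 0 marks the root.
    """
    tags = []
    for position, head in enumerate(heads, start=1):
        tags.append("root" if head == 0 else f"{head - position:+d}")
    return tags


def decode_heads(tags):
    """Invert encode_heads, recovering absolute head indices."""
    heads = []
    for position, tag in enumerate(tags, start=1):
        heads.append(0 if tag == "root" else position + int(tag))
    return heads


words = ["Il", "gatto", "dorme"]
heads = [2, 3, 0]                    # Il <- gatto <- dorme (root)
relations = ["det", "nsubj", "root"]

attachment_tags = encode_heads(heads)   # first tagging task
edge_label_tags = relations             # second tagging task
print(list(zip(words, attachment_tags, edge_label_tags)))
```

Because each token is tagged independently, the decoded heads are not guaranteed to form a tree, which is why a cycle removal step is needed after prediction.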
We applied their method and code (PaT) for our parsing experiments using the greedy cycle removal option. We changed text case depending on the BERT-derived model case used in a specific experiment. Tables 7 and 8 show the results for all the parsing experiments.
Considering the best results obtained by the Dozat and Manning (2017) parser and those presented in (Antonelli and Tamburini, 2019), we observe a relevant increase in performance due mainly to GilBERTo and UmBERTo.

Sentiment Analysis
Three main text-classification tasks fall under the 'Sentiment Analysis' umbrella: Subjectivity, Polarity and Irony detection. Thanks to the EVALITA SENTIPOLC 2016 evaluation, we could rely on a complete dataset annotated with respect to all three tasks. Given the specific nature of the dataset texts, namely tweets, we adopted the particular pre-processing procedure introduced by AlBERTo, and all the other parameters were kept as in (Polignano et al., 2019) for comparability. The only difference concerns the training batch size: it was 512 on TPU in the original paper, whereas on GPU we had to use gradient accumulation (batch size = 32 and accumulation steps = 16) to avoid memory problems. Given the small size of the dataset and the high variability of the results, for these tasks we decided to make 10 runs instead of 5.
8 https://github.com/tdozat/Parser-v3
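The gradient accumulation setting mentioned above (32 × 16 = 512) can be checked on a toy problem. The sketch below uses a one-parameter least-squares model with invented data, not the actual fine-tuning code; it shows that averaging the gradients of 16 micro-batches of 32 examples reproduces the gradient of a single batch of 512 when the loss is a mean over examples.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over a batch of (x, y)."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, examples, micro_batch_size, accumulation_steps):
    """Accumulate micro-batch gradients, each scaled by 1 / accumulation_steps."""
    total = 0.0
    for step in range(accumulation_steps):
        start = step * micro_batch_size
        micro_batch = examples[start:start + micro_batch_size]
        total += mse_gradient(w, micro_batch) / accumulation_steps
    return total


# Invented data: 512 examples of the line y = 3x + 1.
examples = [(float(i % 7), float(3 * (i % 7) + 1)) for i in range(512)]

full_batch = mse_gradient(0.5, examples)
accumulated = accumulated_gradient(0.5, examples, micro_batch_size=32,
                                   accumulation_steps=16)
print(abs(full_batch - accumulated) < 1e-9)  # True
```

The optimiser step is taken only once per 16 micro-batches, so the memory footprint is that of a batch of 32 while the update matches a batch of 512.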

System                        Macro F1
TensorFlow+TPU alUC           72.23*
Fine-Tuning bertMC            72.92±0.86
(Castellucci et al., 2016)    74
We slightly modified the script 'run_glue.py' from version 2.7.0 of the Huggingface/Transformers package, treating the three tasks as fine-tuning of a BERT-derived model for text classification with 2, 4 and 2 classes respectively.
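For reference, the generic definition of macro-averaged F1 can be sketched as below: F1 is computed separately for each class and the per-class values are averaged with equal weight. This is only the textbook definition with invented labels; the official SENTIPOLC scorer may combine per-class scores differently, so the sketch should not be read as the evaluation script used here.

```python
def f1_for_class(gold, predicted, cls):
    """F1 of a single class, treating it as the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, predicted) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, predicted) if g == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all classes seen in gold or predicted."""
    classes = sorted(set(gold) | set(predicted))
    return sum(f1_for_class(gold, predicted, c) for c in classes) / len(classes)


# Invented labels for a binary task (for example subjective = 1, objective = 0).
gold = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1]
print(round(macro_f1(gold, predicted), 4))  # 0.6667
```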
Tables 9, 10 and 11 present the results obtained. We had considerable difficulty reproducing the results of Polignano et al. (2019), both with our script and with the original TPU-based script on Google Colab. The cited tables report the original results together with those we produced using the same script and setting, marked with an asterisk (TensorFlow+TPU alUC).

Hate Speech Detection
Hate Speech on social media has become a serious problem in recent years, and the automatic detection of such messages has gained great importance in NLP.
Thanks to the dataset produced by Bosco et al. (2018), we could test the same text classification procedures used for Sentiment Analysis on this task as well, on both Facebook and Twitter data. Table 12 shows the results we obtained, compared with the best system at the EVALITA 2018 HaSpeDe campaign (Cimino et al., 2018). GilBERTo exhibits the best performance on both subtasks.

Discussion and Conclusions
The starting idea of this work was to establish the new state-of-the-art for some Italian NLP tasks after the 'BERT-revolution', thanks to the recent availability of Italian BERT-derived models.
Looking at the results presented in the previous sections, we can certainly conclude that BERT-derived models trained specifically on Italian texts allow for a large increase in performance on some important Italian NLP tasks. In contrast, the multilingual BERT model developed by Google was not able to produce good results and should not be used when specific models are available for the language under study. A side, and sad, consideration that emerges from this study concerns the complexity of the models. All the DNN models used in this work for the various tasks involved very simple fine-tuning of some BERT-derived model. Machine learning and Deep learning completely changed the approaches to NLP solutions, but never before have we been in a situation in which a single methodological approach can solve different NLP problems, always establishing the state-of-the-art for each problem. And we did not apply any parameter tuning at all! The only optimisation concerns the definition of early stopping on the validation set. By tuning all the hyperparameters, it is reasonable to expect a further increase in overall performance.
For the future, it would be interesting to evaluate end-to-end systems, for example for solving PoS-tagging + Parsing and PoS-tagging + NER, by using the BERT-derived model fine-tuning code and PaT for both end-to-end tasks. Many scholars are studying new transformer-based models or training the most promising ones on different languages. Some brand new Italian models were made available very recently and are not yet included in our evaluations, like the one produced by Stefan Schweter at CIS, LMU Munich; it would be interesting to add them to our tests.