A Machine Learning approach for Sentiment Analysis for Italian Reviews in Healthcare

In this paper, we present our approach to the task of binary sentiment classification for Italian reviews in the healthcare domain. We first collected a new dataset for this domain. Then, we compared the results obtained by two different systems, one based on a Support Vector Machine (SVM) and one based on BERT. For the first, we linguistically pre-processed the dataset to extract the hand-crafted features exploited by the classifier. For the second, we oversampled the dataset to achieve better results. Our results show that the SVM-based system, which does not require oversampling, performs better than the BERT-based one, achieving an F1-score of 91.21%.


Introduction
Nowadays, when people want to buy a product or service, they often rely on the online reviews of other buyers and users (think of online sales giants like Amazon). Likewise, patients increasingly rely on reviews on social media, blogs and forums to choose the hospital where they will be treated. This behaviour is occurring not only abroad (Greaves et al., 2012; Gao et al., 2012), but also in Italy, as demonstrated by the growing number of reviews on QSalute, one of the most popular Italian ranking websites in healthcare. These reviews are often ignored by hospital companies, which do not exploit the potential of such data to understand patients' experiences and consequently improve their services. Given the large amount of data, automatic analysis techniques are needed. To meet this need, we introduce a sentiment analysis system based on machine learning techniques that classifies whether a review has positive or negative sentiment. Since such systems require annotated data, the first step was to build a brand-new dataset, which we present in the next section. We then developed two systems based on two different classifiers, described in Section 3 together with the features extracted from the text. In Sections 4 and 5 we present the experiments conducted during this study, the results obtained and their discussion. Finally, the last section provides concluding remarks and some possible future developments. While several works exist on affective computing in various domains for the Italian language (Cignarella et al., 2018; Barbieri et al., 2016), at the time of writing there are no references in the literature that address this particular domain in Italian. Thus, to the best of our knowledge, this is the first work on sentiment analysis of Italian reviews in healthcare.

Dataset
QSalute is an Italian portal where users share their experiences about hospitals, nursing homes and doctors. We have collected a total of 47,224 documents (i.e. reviews). Each document consists of the free text of the review and other metadata such as the document id, the disease area to which the document belongs and the title. In addition, among the provided metadata there is the average grade, i.e. the mean over the votes in four categories: Competence, Assistance, Cleaning and Services.
In this work, documents with an average grade less than or equal to 2 were assigned to the negative class (-1), while documents with an average grade greater than or equal to 4 were assigned to the positive class (1). The remaining documents were labelled with the neutral class (0). The dataset is strongly unbalanced towards the positive class: 40,641 reviews for the positive class, 3,898 for the neutral class and 2,685 for the negative class. However, in this work, neutral reviews were discarded, resulting in a dataset composed of 43,326 reviews. The following analyses refer to this subset: in Table 1 we report some features of the dataset for each site (i.e. the disease area), while the distribution of tokens over their length is reported in Figure 1. The first column of Table 1 reports the site names, the second the number of positive reviews with respect to the total number of reviews, and the third the lexicon size in terms of the number of unique words. The last column reports the lexicon overlap (as a percentage) of each site with respect to all the others.
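The labelling rule above can be sketched in a few lines of Python. This is only an illustration: the record layout and the field names (`id`, `avg_grade`) are assumptions made for the example, not the actual format of the collected data.

```python
from collections import Counter


def label_from_grade(avg_grade):
    """Map an average grade to a sentiment label.

    Thresholds follow the text: <= 2 is negative (-1), >= 4 is positive (1),
    and anything in between is neutral (0).
    """
    if avg_grade <= 2:
        return -1
    if avg_grade >= 4:
        return 1
    return 0


# Hypothetical review records; the field names are illustrative only.
reviews = [
    {"id": 1, "avg_grade": 5.0},
    {"id": 2, "avg_grade": 1.75},
    {"id": 3, "avg_grade": 3.0},
    {"id": 4, "avg_grade": 4.0},
]

labelled = [(r["id"], label_from_grade(r["avg_grade"])) for r in reviews]

# Neutral reviews are discarded for the binary task.
binary = [(rid, y) for rid, y in labelled if y != 0]
class_counts = Counter(y for _, y in binary)
```

Applied to the full collection, the same rule yields the class counts reported above, and dropping the neutral class leaves 40,641 + 2,685 = 43,326 reviews.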

Methods
We developed two systems based on two state-of-the-art classifiers for sentiment analysis: Support Vector Machine and BERT. In this section, we present the implemented classifiers.

System 1 based on Support Vector Machine (SVM)
In order to build the first system, we followed the approach proposed by Mohammad et al. (2013) for the sentiment analysis of English tweets and adapted it to Italian reviews in healthcare. More precisely, we implemented a Support Vector Machine (SVM) classifier with a linear kernel, based on liblinear (Fan et al., 2008) rather than libsvm in order to scale better to large numbers of samples, as also reported in the documentation of the LinearSVC model. First, all documents pass through a preprocessing pipeline consisting of a sentence splitter, a tokenizer and a Part-Of-Speech (POS) tagger (all of these tools were previously developed by the ItaliaNLP laboratory). Then, documents pass through a feature extraction step, illustrated in the next section.
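As a rough illustration of the classifier setup, the sketch below trains scikit-learn's `LinearSVC` (the liblinear-backed model mentioned above) on binary word n-gram features. It is a simplified stand-in, not the system itself: the toy reviews are invented, the ItaliaNLP preprocessing pipeline is replaced by `CountVectorizer`'s default tokenization, and the regularization settings are library defaults rather than values taken from this work.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy Italian reviews, invented for illustration only.
texts = [
    "ottimo reparto medici gentili",
    "personale competente e gentile",
    "pessima esperienza attesa infinita",
    "reparto sporco personale scortese",
]
labels = [1, 1, -1, -1]

# Presence/absence (binary=True) of lowercased word 1- to 3-grams.
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 3), binary=True)
X = vectorizer.fit_transform(texts)

# Linear SVM backed by liblinear; hyperparameters are library defaults.
clf = LinearSVC(random_state=0)
clf.fit(X, labels)

# Signed distance from the separating hyperplane, and the predicted labels.
scores = clf.decision_function(X)
predictions = clf.predict(X)
```

The decision scores returned by `decision_function` are unbounded real values; they are not probabilities.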

Feature Extraction
All features were chosen for the effectiveness they have shown in several sentiment classification tasks for Italian (Cimino and Dell'Orletta, 2016). We refer to these features as hand-crafted features and embedding features.

Raw and Lexical Text Features
• (Uncased) Word n-grams: presence or absence of contiguous sequences of n tokens in the document text, with n={1, 2, 3}.
• Lemma n-grams: presence or absence of contiguous sequences of n lemmas occurring in the document text, with n={1, 2, 3}.
• Character n-grams: presence or absence of contiguous sequences of n characters occurring in the document text, with n={2, 3, 4, 5}.
• Number of tokens: total number of tokens of the document.
• Number of sentences: total number of sentences of the document.
• Fine-grained Part-Of-Speech n-grams: presence or absence of contiguous sequences of n (fine-grained) grammatical categories, with n={1, 2, 3}.
• Word Embeddings Combination: this feature is composed of three vectors. Each vector was calculated as the mean of the word embeddings belonging to a specific fine-grained grammatical category: adjectives (excluding possessive adjectives), nouns (excluding abbreviations), and verbs (excluding modal and auxiliary verbs). The word embeddings used in this work are vectors of 128 dimensions, extracted from a corpus of more than 46 million tweets. These embeddings were already used in (Cimino et al., 2018) and are available for download from the ItaliaNLP website 4. Furthermore, three features were added to indicate the absence of word embeddings belonging to each of these categories, for a total of 387 (128 * 3 + 3) features.
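The Word Embeddings Combination feature can be sketched as follows. Several details are assumptions made for the example: the tag names (`ADJ`, `NOUN`, `VERB`) stand in for the fine-grained categories already filtered as described above, a missing category is encoded as a zero vector, and the three absence flags are appended after the three mean vectors. The toy embeddings are invented.

```python
EMB_DIM = 128
# Hypothetical tag names standing in for the filtered fine-grained categories.
CATEGORIES = ("ADJ", "NOUN", "VERB")


def mean_vector(vectors, dim):
    """Component-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def embedding_combination(tagged_tokens, embeddings, dim=EMB_DIM):
    """Build the 3 * dim + 3 feature vector for one document.

    tagged_tokens: list of (word, category) pairs.
    embeddings: dict mapping a word to its embedding (list of floats).
    """
    means, flags = [], []
    for category in CATEGORIES:
        vectors = [
            embeddings[word]
            for word, pos in tagged_tokens
            if pos == category and word in embeddings
        ]
        means.extend(mean_vector(vectors, dim))
        # Flag is 1.0 when no embedding was found for this category.
        flags.append(0.0 if vectors else 1.0)
    return means + flags


# Toy embeddings and document, invented for illustration.
toy_embeddings = {
    "buono": [1.0] * EMB_DIM,
    "ottimo": [3.0] * EMB_DIM,
    "medico": [0.5] * EMB_DIM,
}
doc = [("buono", "ADJ"), ("ottimo", "ADJ"), ("medico", "NOUN"), ("curare", "VERB")]
features = embedding_combination(doc, toy_embeddings)
```

In this toy document the verb has no embedding, so the verb block is all zeros and only the third absence flag is set.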

System 2 based on BERT
We also used Bidirectional Encoder Representations from Transformers, better known as BERT, to classify the sentiment of the reviews. BERT is a pre-trained language model developed by Devlin et al. (2018) at Google AI Language. Pre-trained BERT (available at its GitHub page) may be fine-tuned on a specific NLP task in a specific domain, such as sentiment analysis for reviews in the healthcare domain. To do so, the original text must be tokenized with BERT's own tokenizer.
4 www.italianlp.it/resources/italianword-embeddings

Experiments
We conducted two types of experiments. In the first, we wanted to evaluate which of the systems was the best. For each configuration, we trained and tested the system using stratified k-fold cross-validation (with k = 5). In the second, we wanted to evaluate the robustness of the best system in an out-of-domain context, splitting the folds by disease site. The software was developed entirely in Python.
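The two evaluation protocols can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: samples are dealt to folds round-robin without shuffling, and the site names are hypothetical placeholders; the actual fold assignment used in the experiments is not specified here.

```python
from collections import defaultdict


def stratified_k_fold(labels, k=5):
    """Yield (train_idx, test_idx) pairs that preserve class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    # Deal the samples of each class to the folds in round-robin order.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


def leave_one_site_out(sites):
    """Yield (held_out_site, train_idx, test_idx), one split per site."""
    for held_out in sorted(set(sites)):
        test = [i for i, s in enumerate(sites) if s == held_out]
        train = [i for i, s in enumerate(sites) if s != held_out]
        yield held_out, train, test


# Toy data: 20 positive and 5 negative samples over three hypothetical sites.
toy_labels = [1] * 20 + [-1] * 5
site_names = ["cardiologia", "oncologia", "ortopedia"]
toy_sites = [site_names[i % 3] for i in range(len(toy_labels))]

cv_splits = list(stratified_k_fold(toy_labels, k=5))
site_splits = list(leave_one_site_out(toy_sites))
```

In the stratified setting every test fold keeps the class ratio of the whole dataset, while in the leave-one-site-out setting each test fold contains reviews from a single site that the model never saw during training.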

System 1
We tested three different configurations of our SVM-based system, depending on the sets of features used in the experiment: only hand-crafted features (more than 626 thousand features), only embeddings (387 features), and a combination of both. For these experiments, the features that, during a preliminary experimental phase, proved not to improve performance (number of tokens and number of sentences), or even to lower it (fine-grained (fg) POS n-grams and lemma n-grams with n={2, 3}), were excluded from the hand-crafted feature set. As a result, this set is composed only of uncased word and coarse-grained (cg) POS n-grams with n={1, 2, 3}, and lemmas (i.e. lemma unigrams). In order to reduce the dimensionality of the set, and also to improve the performance of our system, the features pass through a filtering step after being extracted from the training set. Each feature that appears fewer times than a certain threshold th within the training set is assumed to be of little relevance and is therefore discarded. This threshold was set to 1 (th=1) after a search for the optimal value during the preliminary experimental phase.
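The frequency-based filtering step can be sketched as follows. Two points are assumptions made for the example: a feature's count is taken to be the number of training documents it appears in, and a feature is kept when that count is at least th. Feature names such as `w:ottimo` are invented.

```python
from collections import Counter


def select_features(train_docs, th=1):
    """Return the set of features seen in at least `th` training documents.

    train_docs: iterable of collections of feature names, one per document.
    """
    counts = Counter(f for doc in train_docs for f in set(doc))
    return {f for f, c in counts.items() if c >= th}


def apply_selection(docs, kept):
    """Drop every feature that is not in the selected vocabulary."""
    return [set(doc) & kept for doc in docs]


# Toy presence/absence features, invented for illustration.
train_docs = [
    {"w:ottimo", "w:reparto"},
    {"w:ottimo", "w:medico"},
    {"w:pessimo"},
]
test_docs = [{"w:ottimo", "w:sconosciuto"}]

kept_th1 = select_features(train_docs, th=1)
kept_th2 = select_features(train_docs, th=2)
filtered_test = apply_selection(test_docs, kept_th2)
```

The vocabulary is selected on the training set only and then applied unchanged to the test documents, so features never seen in training are dropped at test time.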

System 2
The experiments with BERT were conducted using the same partition into the 5 folds used during the experiments with the SVM-based classifier. This division allowed us to compare the results achieved by the two classifiers. The BERT model used in our experiments is the multilingual cased pre-trained one.
We tested two different pipelines. In the first one, the model was fine-tuned with folds from the original dataset described in Section 2. In the second one, each fold was obtained by oversampling the minority class (i.e. the negative one) in the original fold. The oversampling was obtained by multiplying each negative sample in the fold by 4. This results in the ratio of negative to positive samples increasing from about 1:16 to about 1:4. Other experiments were conducted further increasing the ratio to about 1:2, but this did not lead to significant improvements in performance and came at the expense of computational time. For both approaches, the model was fine-tuned for 5 epochs on a 12 GB NVIDIA GPU with CUDA 9.0, with the following hyperparameters:
• maximum sequence length of 128 tokens (this seems reasonable, since this number is very close to the average length of the documents in the dataset, as reported in Figure 1);
• batch size of 24 samples;
• learning rate of 5 × 10^-5.
Table 2: Results of the experiments in the stratified 5-fold cross-validation. Performances are reported in terms of F1-score (%) on each class and the (macro) average between the two. The best results are shown in bold.
Table 2 summarizes the results of the experiments in stratified 5-fold cross-validation. The performances are reported in terms of the macro average of F1-score.
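The oversampling step used in the second BERT pipeline can be sketched as below. It assumes that "multiplying by 4" means each negative sample ends up appearing four times in the fold; the toy fold is invented and sized so that the ratio matches the approximate figures quoted above.

```python
def oversample_minority(samples, labels, minority_label=-1, factor=4):
    """Replicate every minority-class sample so that it appears `factor` times."""
    out_samples, out_labels = [], []
    for sample, label in zip(samples, labels):
        copies = factor if label == minority_label else 1
        out_samples.extend([sample] * copies)
        out_labels.extend([label] * copies)
    return out_samples, out_labels


# Toy fold with a 1:16 negative-to-positive ratio, invented for illustration.
fold_texts = ["review_%d" % i for i in range(17)]
fold_labels = [-1] + [1] * 16

new_texts, new_labels = oversample_minority(fold_texts, fold_labels)
n_negative = new_labels.count(-1)
n_positive = new_labels.count(1)
```

After oversampling, the toy fold contains 4 negative and 16 positive samples, i.e. a 1:4 ratio.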

Results and Discussion
After analyzing these results, we took the best model and used it in a leave-one-site-out cross-validation setting to test the reliability of the system in an out-of-domain (site) problem. These results are summarized in Table 3.
First of all, we can notice that these performances are much higher than those of the baseline system, i.e. the performance achieved by a hypothetical model that classifies all the samples as belonging to the majority class (that is, the positive class).
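To make the majority-class baseline concrete, the sketch below computes its macro-averaged F1-score from the class counts reported in the Dataset section. The F1-score of a class that is never predicted is undefined; setting it to 0 here is an assumption (a common convention), not something stated in this work.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score for one class; returns 0.0 when the score is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred, classes=(1, -1)):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Class counts of the binary dataset: 40,641 positive and 2,685 negative.
y_true = [1] * 40641 + [-1] * 2685
# The baseline assigns every sample to the majority (positive) class.
y_pred = [1] * len(y_true)

f1_positive = f1_for_class(y_true, y_pred, 1)
f1_negative = f1_for_class(y_true, y_pred, -1)
baseline_macro_f1 = macro_f1(y_true, y_pred)
```

On the whole binary dataset this gives an F1-score of about 97% on the positive class and 0% on the negative class, for a macro average of roughly 48%. Under stratified cross-validation the per-fold class proportions are the same, so the per-fold baseline should be close to this value.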
Due to the strong dataset imbalance and the small batch size, training BERT without oversampling the dataset leads the system to classify all samples as belonging to the majority class, i.e. the positive class. As a consequence, the system often achieves very poor performance, equal to that of the baseline. Even when this problem does not occur, the classifier obtains the lowest F1-score. These results clearly show the difficulty BERT has in dealing with unbalanced datasets. Oversampling the minority class partially mitigates these problems, leading to an improvement in terms of both repeatability and performance.
As for the experiments with the SVM-based system, they show that hand-crafted features are more relevant to the task than embedding features. This suggests that the (Italian) healthcare reviews domain may be particularly lexical, so that sets of lexical features perform better than similarity-based features. However, the best resulting model is the one with both sets of features, which outperforms the best configuration of the BERT-based system by about three percentage points. Furthermore, the leave-one-site-out experiments with this model yield very good performance, showing the system to be reliable also in an out-of-domain (site) context. This last result may be due to two factors: 1) the high degree of overlap between the lexicon of one domain and the lexicon of all the other domains; 2) the larger size of the set used for training.
Figure 2: Results in terms of percentage of classified reviews and F1-score over threshold values on the probabilistic score p ∈ [0, 1] returned by the Platt scaling method applied on top of the SVM-based system. All the results refer to the k-fold cross-validation setting (with k = 5). Note that for threshold = 0.5, even though the percentage of classified documents is 100%, the macro average of F1-score is lower than the one reported in Table 2. This is due to the inherent inconsistency between the probabilities p calculated through the Platt scaling method and the decision score of the SVM model (i.e. the distance of the sample from the trained boundary, d ∈ (−∞, +∞)).
In addition to the two main phases of experiments, we further investigated the confidence of the best model in making decisions. The motivation behind this study is that it may have applications in real-world cases, where an automated system is required to handle the documents on which it is highly confident (i.e., above a certain threshold) and to pass the most complex documents to a human operator. To do so, we applied the Platt scaling method (Platt, 1999) on top of the trained SVM model. This step is needed to convert the output of the model from a decision score d ∈ (−∞, +∞), i.e. the distance of the test sample from the trained boundary, to a probabilistic score p ∈ [0, 1], representative of the confidence of the system in making the decision. Figure 2 summarizes the results of this analysis. As expected, the number of documents on which the system makes a decision falls as the confidence threshold required of the system increases. However, the decline is not steep: the system still classifies more than 91% of the documents with 99% confidence. At the same time, the performance advantage is clear, with an increase in F1-score on negative samples of more than ten percentage points.
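The confidence-based filtering described above can be sketched as follows. Platt scaling maps a decision score d to p = 1 / (1 + exp(A·d + B)), where A and B are fitted on held-out data. In this sketch the values of A and B are hypothetical, and the confidence of a decision is taken to be max(p, 1 − p), which is an assumption about how the threshold is applied.

```python
import math


def platt_probability(d, a, b):
    """Platt scaling: P(y = +1 | d) = 1 / (1 + exp(a * d + b))."""
    return 1.0 / (1.0 + math.exp(a * d + b))


def decide_with_threshold(d, a, b, threshold):
    """Return (label, confidence); label is None when confidence is too low."""
    p = platt_probability(d, a, b)
    confidence = max(p, 1.0 - p)
    if confidence < threshold:
        # Not confident enough: the document is left to a human operator.
        return None, confidence
    return (1 if p >= 0.5 else -1), confidence


# Hypothetical sigmoid parameters; in practice they are fitted on held-out data.
A, B = -2.0, 0.0

# Toy SVM decision scores, invented for illustration.
decision_scores = [3.0, 0.1, -2.5]
decisions = [decide_with_threshold(d, A, B, threshold=0.99)[0] for d in decision_scores]
coverage = sum(1 for label in decisions if label is not None) / len(decisions)
```

With a threshold of 0.99, the toy sample whose score lies close to the boundary is left undecided, while the two samples far from it are classified. Whenever B differs from 0, the label implied by p can disagree with the sign of d for samples near the boundary, which is one way the inconsistency noted in the caption of Figure 2 can arise.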

Conclusion
In this paper, we have introduced a novel system for sentiment analysis of Italian reviews in healthcare. To the best of our knowledge, this is the first work of this kind in this domain. To this end, we collected the first dataset for this domain from the web. Then, we implemented and compared two types of state-of-the-art classifiers for this task, SVM and BERT. Despite the strong dataset imbalance, we obtained very good results, especially with the SVM-based system, which outperformed the BERT-based one while maintaining a low computational burden during training. However, it is possible that, with a larger maximum sequence length, BERT may outperform our best system. Also, recent work (Nozza et al., 2020) has analyzed the contribution of language-specific models, showing in general improvements over multilingual BERT for a wide variety of NLP tasks. For this reason, it might be worth including models specific to Italian, such as GilBERTo, UmBERTo, and AlBERTo, in future work. The latter has already been used for a sentiment classification task (Polignano et al., 2019). Future work on this dataset may also tackle the task of sentiment classification including the neutral class, or sentiment regression of the average scores. Moreover, future research may tackle the task of classifying reviews by the disease area they belong to, possibly including other features from metadata such as titles.