#andràtuttobene: Images, Texts, Emojis and Geodata in a Sentiment Analysis Pipeline

This research investigates the sentiment narrated by Instagram users during the lockdown period in Italy caused by the COVID-19 pandemic. The study is based on the analysis of all the posts published on Instagram under the hashtag #andràtuttobene on May 4, May 18 and June 3, 2020. Our research offers a view on a national, regional and provincial scale. We analyzed all the different languages and forms (i.e. captions, hashtags, emojis and images) that constitute the posts. The aim of this research is to provide a set of procedures revealing the different polarity trends for each kind of expression and to propose a single comprehensive measure. Copyright © 2020 for this paper by its authors.


Introduction
This paper investigates the case of the most used Italian hashtag about the COVID-19 lockdown period on Instagram: #andràtuttobene 1.
The research team collected 7,482 posts, the entire amount published on three specific dates: May 4, May 18 and June 3, which correspond to three different steps of the government-led reopening of the country.
Instagram posts are composed of several kinds of languages: captions (texts), hashtags, emojis and images. The aim of this work is to design a set of procedures that reveal the different polarity trends for each of them and to propose a single measure. This measure can show the sentiment expressed by the texts, in the broad semiotic meaning of the term.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The proposed methodology is based on a fully automatic natural language processing pipeline, which includes the image analysis phase. Its output is an interactive dashboard (Figure 1) that lets users explore the sentiment analysis values for every single kind of text, as well as their synthesis. Thanks to a system of interactions and filters, the observation is guided by the features of the images, such as the kind of space (indoor or outdoor) and the kind of subject (human or not human).
The collected geographical data enabled the analysis of several dimensions, starting from an overview on the regional scale and then focusing on the deeper level of the Italian provinces. This choice is motivated by the Italian DPCM (Decreto Presidenza del Consiglio dei Ministri) published on 24 March 2020 2, which states the partial autonomy of the regions.

State of the Art
In Natural Language Processing (NLP) studies, the automatic treatment of opinionated expressions and documents is known as Sentiment Analysis.
For the Italian language, significant contributions on sentiment analysis of social media come from Bosco et al. (2013, 2014), Castellucci (2014, 2016) and Stranisci (2016), among others.
Hashtag processing in Sentiment Analysis is particularly challenging in terms of word segmentation: the absence of white spaces between words poses several ambiguity problems. Among the most relevant contributions in this area, we cite Zangerle (2018), Reuter (2016), Simeon (2016), Bansal (2015), Srinivasan (2012) and Celebi (2018). The solutions proposed in the literature mostly rely on n-grams, syntactic complexity, pattern length, or POS tagging.
1 [...] billion monthly active users, according to a study by Hootsuite and WeAreSocial.
2 https://www.gazzettauffciale.it/eli/id/2020/06/17/20G00071/sg
In recent years, online communication has come to involve many kinds of languages, connected to verbal and non-verbal features. This complexity makes classical textual analysis less adequate to offer a realistic and representative perspective on people's interests and opinions. In particular, the conventional approach does not seem suitable for visual social media, such as Instagram, where all the languages are involved and images seem to be dominant.
The analysis of these social media tends to underline the issues of textocentricity (Singhal & Rattine-Flaherty, 2006) and textocentrism (Balomenu & Garrod, 2019), which calls for a different approach to participant-generated images (PGI) and user-generated content (UGC) in general.
Opinions, emotions and contents are expressed in a mixed way, that is, through the combination of several languages, visual and textual, and the related metadata, such as the geographical position and the hashtags with which the posts are labeled.
The content analysis of images has been addressed from several perspectives and with several techniques.
This work starts from an innovative approach already tested in previous studies (Giordano et al., 2020) and chooses to analyze the images in their textual translation, with a fully automatic analytical pipeline designed from a semiotic point of view. Moreover, the semiotic interest in digital media dates back to the early 2000s and continues to the present day, considering digital media a specific semiotic field (Cosenza 2014, Bianchi e Cosenza 2020). Lastly, as far as design is concerned, the visual representation of social media data is increasingly widespread as a vehicle for knowledge in several fields (Ciuccarelli et al., 2014).
The research team does not provide an algorithm to analyze the images, but adopts the automatic translation produced by the social media algorithm, which is designed for visually impaired users, and obtains it by parsing the HTML code of the Instagram web interface. The metadata field involved is the "accessibility_caption".
These captions are lists of words, hierarchically distributed, that let us define and observe the subjects and attributes of the images, in addition to allowing the analysis of the entities.

Methodology
In this work, we propose the automatic treatment of the sentiment expressed in 7,482 Instagram posts.
All the information composing the dataset (i.e. captions, hashtags, emojis and images) is automatically put into relation and visualized in an interactive dashboard. The phenomenon can be observed through a system of filters, zooms and interdependent interactions. The result captures the topography of feelings, moods and needs expressed on the Instagram platform during the lockdown.
The NLP activities are performed in this research through the software NooJ 5 , which allows both the formalization of linguistic resources and the parsing of corpora. The dictionaries and grammars, which have been built ad hoc for this work, complement the open-source resources of the basic Italian and English modules of NooJ (Vietri 2014).
All the pictures published on May 4, May 18 and June 3, 2020 with the hashtag #andratuttobene have been collected with a custom Python script that simulates human navigation. For each picture, we collected the entire source code of the web page in JSON (JavaScript Object Notation) format.
This format has been parsed into a tabular one, suitable for the adopted tools. The files have been refined by selecting the fields useful for the analysis: captions (including hashtags), image hyperlinks, accessibility captions, geographical coordinates and timestamps.
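The conversion from JSON to a tabular format can be illustrated with a minimal sketch. It is not the script used in this work: only the "accessibility_caption" field is named in the text, while the input structure and all the other key names (caption, image_url, location, lat, lng, timestamp) are hypothetical placeholders chosen for the example.

```python
import csv
import io
import json

# Column names of the tabular output. Only "accessibility_caption" is named in
# the text; every other key below is a hypothetical placeholder.
FIELDS = ["caption", "image_url", "accessibility_caption",
          "latitude", "longitude", "timestamp"]


def flatten_post(post):
    """Reduce one post record (a dict) to the fields kept for the analysis."""
    location = post.get("location") or {}  # posts without geodata yield empty cells
    return {
        "caption": post.get("caption", ""),
        "image_url": post.get("image_url", ""),
        "accessibility_caption": post.get("accessibility_caption", ""),
        "latitude": location.get("lat", ""),
        "longitude": location.get("lng", ""),
        "timestamp": post.get("timestamp", ""),
    }


def posts_to_csv(json_text):
    """Convert a JSON array of post records into CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for post in json.loads(json_text):
        writer.writerow(flatten_post(post))
    return buffer.getvalue()
```

Calling `posts_to_csv` on a JSON array of records returns one CSV row per post, which can then be loaded by any tool that reads tabular data.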
Some data required a refinement phase. For the captions, a cleaning phase was necessary, in which all the texts not written in Italian were detected automatically through the Google Translate API (Application Programming Interface) and removed. Moreover, all the hashtags have been extracted from this field, to allow their standalone analysis.
5 http://www.nooj-association.org/
6 For instance, in the "human" cluster we have grouped all the accessibility captions containing words such as "people, man, woman, person" etc. At the same time, in the "outdoor" cluster we have grouped all the pictures with words such as "sea, skyline, lawn, beach" and so on.
7 This phase has been carried out automatically by adopting the Python library reverse-geocoder (https://github.com/thampiman/reverse-geocoder).
8 Our method produced satisfactory results in the sentence-level analysis of the textual part of the corpus: 0.85 recall, 0.96 precision and 0.9 F-score.
Accessibility captions have been clustered on two dimensions, "human or not human" and "indoor or outdoor", each defined in advance through a list of coherent words and then detected through a pattern matching phase 6. Geographical coordinates place the images on a specific point of the map, so a reverse geocoding procedure was necessary to retrieve the region and province levels 7. Furthermore, timestamps have been converted into a conventional date and time format.
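The refinement steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure actually used: the keyword lists contain only the example words given in the footnote, every caption without an "outdoor" word is labeled "indoor" (the work relies on dedicated word lists for each cluster), and timestamps are assumed to be Unix epoch seconds. Reverse geocoding is left out because it depends on an external library.

```python
import re
from datetime import datetime, timezone

# Example words taken from the footnote; the complete lists are not published,
# so these sets are only illustrative.
HUMAN_WORDS = {"people", "man", "woman", "person"}
OUTDOOR_WORDS = {"sea", "skyline", "lawn", "beach"}


def extract_hashtags(caption):
    """Return the hashtags found in a caption, in order of appearance."""
    return re.findall(r"#\w+", caption)


def cluster_caption(accessibility_caption):
    """Assign the (subject, space) labels to one accessibility caption."""
    words = set(re.findall(r"[a-z]+", accessibility_caption.lower()))
    subject = "human" if words & HUMAN_WORDS else "not human"
    # Simplifying assumption: no outdoor word means "indoor".
    space = "outdoor" if words & OUTDOOR_WORDS else "indoor"
    return subject, space


def format_timestamp(epoch_seconds):
    """Convert Unix epoch seconds (assumed input format) to a UTC date string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")
```

For instance, `format_timestamp(1588550400)` corresponds to midnight UTC of May 4, 2020, the first of the three dates analyzed.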
After these steps, images and texts became ready to be analyzed through NLP procedures and mapped with geographical visualization techniques, observing them on the desired timeframe.
For the analysis of verbal features, we used SentIta, a semi-automatically built lexicon (Pelosi 2015a) containing more than 15,000 lemmas, both simple words and multiword units. Each entry is annotated with polarity and intensity scores, on a scale that ranges from -3 to +3. The lexicon must be applied to texts in conjunction with a network of almost 130 embedded local grammars, formalized as Finite State Automata (Pelosi 2015b), which systematically modulate the prior polarity of words according to their local syntactic context 8. These resources can be directly applied to the Instagram captions, while hashtags need to be segmented first. In this phase, they are analytically decomposed into their constituents through 10 morpho-syntactic grammars applied simultaneously, but with different priorities. In this way, the selection of the most probable sequences is decided upstream 9.
9 Basically, if the system produces more than one interpretation, the preferred one is the one with the longest constituents and the smallest number of constituents. In other words, the system first compares the whole normalized string with the word forms from SentIta, then continues the comparison with English and Italian word forms from the basic module. Hence, the dictionaries receive the highest priority and are applied before the morphological grammars. If the system does not match any word in the lexicon, it starts the structural analysis of the string, which consists of a systematic comparison of substrings with all the words contained in the dictionaries, according to part-of-speech-specific syntactic structures. Such structures, ordered here by priority, can be [...]
For the analysis of the non-verbal features, emojis are treated by using an electronic dictionary, which has been semi-automatically annotated with the same information used to analyze verbal features. We created this database with recognizable decimal codes in UTF-8 encoding from Emojipedia, then we carried out the automatic analysis of the textual descriptions of each emoji. This dictionary has been used to locate and interpret the emojis occurring in the posts 10.
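As an illustration of how such an emoji dictionary can be applied, the sketch below keys a small lexicon by decimal code point and looks up each character of a caption. The three entries and their polarity values are invented for the example and do not come from the dictionary built for this work; averaging the scores of a post is likewise an assumption, and emojis made of several code points are not handled.

```python
# Hypothetical entries: decimal code point -> (description, polarity in [-3, +3]).
EMOJI_LEXICON = {
    128522: ("smiling face with smiling eyes", 2),
    128546: ("crying face", -2),
    127752: ("rainbow", 1),
}


def emoji_scores(text):
    """Return the polarity of every lexicon emoji found in text, in order."""
    return [EMOJI_LEXICON[ord(ch)][1] for ch in text if ord(ch) in EMOJI_LEXICON]


def emoji_sentiment(text):
    """Average the emoji polarities of a post; None if no known emoji occurs."""
    scores = emoji_scores(text)
    return sum(scores) / len(scores) if scores else None
```

A caption containing the rainbow and the smiling face of this toy lexicon would receive the scores 1 and 2, hence an emoji sentiment of 1.5.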
After the clustering phase (human and not human; indoor and outdoor), all the sentiment findings for all the languages can be associated with the features of the pictures, taken alone or in combination. 11
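The preference rule for hashtag segmentation described in this section (whole-string match first, then the interpretation with the fewest and therefore longest constituents) can be approximated with a dictionary-based search. This is a swapped-in simplification, not the NooJ morpho-syntactic grammars used in the work: it ignores parts of speech, and the lexicon passed to the function is a toy word set supplied by the caller.

```python
from functools import lru_cache


def segment_hashtag(hashtag, lexicon):
    """Split a hashtag into lexicon words, preferring the fewest constituents.

    Returns a list of words, or None when no full segmentation exists.
    """
    text = hashtag.lstrip("#").lower()
    if text in lexicon:  # the whole string has the highest priority
        return [text]

    @lru_cache(maxsize=None)
    def best(start):
        """Best segmentation of text[start:], as a tuple, or None."""
        if start == len(text):
            return ()
        candidates = []
        # Longer first constituents are tried first, so they win ties.
        for end in range(len(text), start, -1):
            piece = text[start:end]
            if piece in lexicon:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None
```

With the toy lexicon `{"like", "for", "li", "ke"}`, the hashtag #likeforlike is split into "like", "for", "like" rather than into a longer chain of shorter pieces.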

Visualization and Results
For a complete observation of the analysis process and of its results, we developed a data visualization dashboard. In this dashboard it is possible to observe the sentiment analysis of each language processed, and to investigate the different trends across the days and across the single hours of each day. Adopting the clusters detected in the images, a system of filters lets users focus the results on the basis of the subjects depicted.
On the left side of the dashboard, a map shows the geographical situation, merging the 4 sentiment values into a single one (weighted average) and coloring each regional shape on a chromatic scale from the minimum value (-3) in orange to the maximum value (+3) in blue. The same scale is applied to the line chart on the right, in which each line is related to the vertical axes and colored as described above.
$$\frac{e \cdot P_e + h \cdot P_h + t \cdot P_t}{P_e + P_h + P_t}$$
The scores reached by the three languages, namely texts, hashtags and emojis, are all taken into account and weighted according to the assumption that the euphoric level of the emojis' sentiment is higher than that of hashtags, and that both are higher than that of written texts in general. According to these results, we propose this weighted average formula, in which $e$, $h$ and $t$ are the sentiment scores of emojis, hashtags and texts, and the three languages have different weights ($P$), respectively 33, 50 and 100.
As a matter of fact, Novak (2015) underlined that positive emojis are used more commonly than negative ones. Moreover, Boia et al. (2013) observed a poor correlation between the perceived emotional polarity of emojis and the accompanying linguistic text alone. Although it is challenging to predict the interaction between emojis and texts, there are cases in which the emojis express or reinforce the sentiment of the text with which they occur, and cases in which they modify it or even express an opposite emotional state (Guibon 2018, Shoeb 2019).
Hashtags are conventionally used in two ways: on the one hand, to describe the contents in a list of words, and on the other hand for strategic purposes, in order to place the images in useful thematic spaces. This is also the reason why we have removed from the analysis all the Instagram-specific hashtags, such as #likeforlike, #followoforfollow etc., which are not suitable for our investigation or could even be misleading or biased. At the same time, hashtags are also used as part of the messages, in substitution of words, so they deserve to be included in the final measure, but not with the same relevance as the captions.
9 [...] multiword expressions; free nominal, prepositional, adjectival and adverbial phrases; elementary sentences; and verbless sentences.
10 While the oriented words located in captions and hashtags respectively cover 6% and 9% of the full words contained in the posts, the sentiment-labelled emojis cover 19% of the total number of emojis in the corpus.

Figure 1 The sentiment analysis values
The performance of emojis, hashtags and texts as indicators for sentiment analysis purposes, alone and combined with one another, has been tested on our corpus. We verified a significant improvement in terms of document-level precision when the indicators are considered together (0.98), compared with the precision of texts (0.91), hashtags (0.81) and emojis (0.65) considered alone. The different precision values reached by the three languages considered alone empirically confirm the diversification of weights we proposed in our formula. This weighted measure has then been compared with the arithmetic mean by three different judges 12 on a sample of 100 Instagram posts from our corpus, and it performed better in 92% of the cases.
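The difference between the weighted measure and the arithmetic mean can be made concrete with a short sketch. The weights 33, 50 and 100 are those stated in the text; the function names, the dictionary keys and the choice of skipping a language that is absent from a post are assumptions made for the example.

```python
# Weights stated in the text for emojis, hashtags and texts.
WEIGHTS = {"emoji": 33, "hashtag": 50, "text": 100}


def weighted_sentiment(scores):
    """Weighted average of per-language scores (each in [-3, +3]).

    Languages that are missing or None are skipped (an assumption).
    """
    available = {k: v for k, v in scores.items() if v is not None}
    if not available:
        return None
    total_weight = sum(WEIGHTS[k] for k in available)
    return sum(v * WEIGHTS[k] for k, v in available.items()) / total_weight


def arithmetic_mean(scores):
    """Unweighted mean of the same scores, used here only for comparison."""
    values = [v for v in scores.values() if v is not None]
    return sum(values) / len(values) if values else None
```

For a hypothetical post with scores 3 for emojis, 1 for hashtags and -1 for the text, the weighted measure is (99 + 50 - 100) / 183, about 0.27, while the arithmetic mean is 1.0: the more reliable textual indicator pulls the combined value towards its own polarity.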
Moreover, the geographical dimension is very important to observe the different kinds of languages in the online community (Arnaboldi et al., 2017). Through an overlay function, activated by moving the cursor on the map (Figure 2), we show the geographical data at the deeper level of the single province, focusing on each region. The result reveals possible differences in polarity between provinces. For instance, on May 4 in the provinces of Oristano (Sardegna), Genova (Liguria) and Viterbo (Lazio), the sentiment value is negative, despite the positive average value of the region. However, the average sentiment value over the three days analyzed is always positive, with differences on the regional and provincial scale. Lastly, users can explore the results focusing on one or more regions through a filter function (by clicking or selecting). All the filters are interdependent, so it is possible to combine all the available functions and investigate the phenomenon from all the possible perspectives.

Conclusion
Through the quantitative and qualitative analysis of the different expressive forms used on Instagram, this work proposes a general view of COVID-19 in Italy. The research brought together linguistic analysis and design into a more general semiotic framework. The aim was, in fact, to give shape to the pandemic phenomenon through a selection based on linguistic relevance. The virus caused a series of unpredictable changes, narrated on Instagram through the hashtag #andràtuttobene: a mantra for the Igers and an isotopy for the analysts (Greimas & Courtès 1979). Working on multiple levels, the research has offered a general and a local view of the emotions told during the lockdown period. Starting from a lexical base made up of a list of words, and using electronic dictionaries also for the images, the analysis organized a large amount of data, developing a real map of the emotions and needs expressed during the first wave of the pandemic. The map can be visualized through a dashboard that lets users observe general and local reactions, down to the single province. The emotional effects of sense have been evaluated thanks to a polar and unique measure.
12 For the evaluation of the three judges, we have calculated the intercoder reliability adopting the Krippen- [...]

Figure 2 Overlay function: provincial scale
In the end, did everything really go well for Instagram's Italy? In general, it seems so. The average sentiment value over the three days analyzed is always positive, with variations on the regional and provincial scale. Going down to the single province, we can find differences, as in the cases of Sardinia, Lazio and Liguria.