Exploiting Distributional Semantics Models for Natural Language Context-aware Justifications for Recommender Systems

In this paper we present a methodology to generate context-aware natural language justifications supporting the suggestions produced by a recommendation algorithm. Our approach relies on a natural language processing pipeline that exploits distributional semantics models to identify the most relevant aspects for each context of consumption of the item. Next, these aspects are used to identify the most suitable pieces of information to be combined in a natural language justification. As information source, we used a corpus of reviews. Accordingly, our justifications are based on a combination of review excerpts that discuss the aspects that are particularly relevant for a certain context. In the experimental evaluation, we carried out a user study in the movie domain to investigate the validity of adapting the justifications to the different contexts of usage. The data we collected supported this idea.


INTRODUCTION
Recommender Systems (RSs) [19] are now recognised as a very effective means to support users in decision-making tasks [20]. However, as the importance of such technology in our everyday lives grows, it is fundamental that these algorithms support each suggestion through a justification that allows the user to understand the internal mechanisms of the recommendation process and to more easily discern among the available alternatives. To this end, several recent attempts have investigated how to introduce explanation facilities in RSs [16] and how to identify the most suitable explanation styles [4]. Despite this research effort, none of the methodologies currently presented in the literature diversifies the justifications based on the different contextual situations in which the item will be consumed. This is a clear issue, since context plays a key role in every decision-making task, and RSs are no exception. Indeed, just as the mood or the company (friends, family, children) can direct the choice of the movie to be watched, a justification that aims to convince a user to enjoy a recommendation should contain different concepts depending on whether the user is planning to watch a movie with her friends or with her children.
In this paper we fill in this gap by proposing an approach to generate a context-aware justification that supports a recommendation. Our methodology exploits distributional semantics models [5] to build a term-context matrix that encodes the importance of terms and concepts in each context of consumption. Such a matrix is used to obtain a vector space representation of each context, which is in turn used to identify the most suitable pieces of information to be combined in a justification. As information source, we used a corpus of reviews. Accordingly, our justifications are based on a combination of reviews' excerpts that discuss with a positive sentiment the aspects that are particularly relevant for a certain context. Beyond its context-aware nature, another distinctive trait of our methodology is the fact that we generate post-hoc justifications that are completely independent from the underlying recommendation models and completely separated from the step of generating the recommendations.
The contributions of the article can be summarized as follows: (i) we propose a methodology based on distributional semantics models and natural language processing to automatically learn a vector space representation of the different contexts in which an item can be consumed; (ii) we design a pipeline that exploits distributional semantics models to generate context-aware natural language justifications supporting the suggestions returned by any recommendation algorithm. The rest of the paper is organized as follows: Section 2 provides an overview of related work. Next, Section 3 describes the main components of our workflow and Section 4 discusses the outcomes of the experimental evaluation. Finally, conclusions and future work are provided in Section 5.

RELATED WORK
The current research borrows concepts from review-based explanation strategies and distributional semantics models. In the following, we discuss relevant related work and emphasize the hallmarks of our methodology.
Review-based Explanations. According to the taxonomy discussed in [3], our approach can be classified as a content-based explanation strategy, since the justifications we generate are based on descriptive features of the item. Early attempts in the area rely on the exploitation of tags [24] and features gathered from knowledge graphs [11]. With respect to classic content-based strategies, the novelty of the current work lies in the use of review data to build a natural language justification. In this research line, Chen et al. [2] analyze users' reviews to identify relevant features of the items, which are presented on an explanation interface. Differently from this work, we did not restrict ourselves to a fixed set of static aspects, and we let the explanation algorithm identify the most relevant concepts and aspects for each contextual setting. A similar attempt was also proposed in [1]. Moreover, as previously emphasized, a trait that distinguishes our approach from such literature is the adaptation of the justification to the different settings in which the item is consumed. The only work exploiting context in the justification process has been proposed by Misztal et al. [9]. However, differently from our work, they did not diversify the justifications of the same item across the different contextual settings in which the item is consumed, since they just adopt features inspired by context (e.g., "I suggest you this movie since you like this genre in rainy days") to explain a recommendation.
Distributional Semantics Models. Another distinctive trait of the current work is the adoption of distributional semantics models (DSMs) to build a vector space representation of the different contextual situations in which an item can be consumed. Typically, DSMs rely on a term-context matrix, where rows represent the terms in the corpus and columns represent contexts of usage. For the sake of simplicity, we can imagine a context as a fragment of text in which the term appears, such as a sentence, a paragraph or a document. Every time a particular term is used in a particular context, this information is encoded in the matrix. One of the advantages of adopting DSMs is that they can learn a vector space representation of terms in a totally unsupervised way. These methods recently inspired approaches in the area of word embeddings, such as WORD2VEC [8], and contextual word representations [21]. Even if some attempts at evaluating RSs based on DSMs already exist [13,12,14], in our work we used DSMs to build a vector space representation of the different contextual dimensions. To the best of our knowledge, the usage of DSMs for justification purposes is a completely new research direction in the area of explanation.
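As a minimal illustration of the term-context matrix described above, the following sketch treats each sentence of a toy corpus as one context and counts term occurrences. The corpus, the whitespace tokenization and the raw-count weighting are assumptions made only for this example; they are not taken from the paper.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each sentence plays the role of one context (a column).
contexts = [
    "romantic movie with a romantic ending",
    "funny movie for friends",
    "funny romantic comedy",
]

# Rows are terms, columns are contexts; each cell counts how often the term occurs.
vocabulary = sorted({tok for text in contexts for tok in text.split()})
matrix = {
    term: [Counter(text.split())[term] for text in contexts]
    for term in vocabulary
}


def cosine(u, v):
    """Cosine similarity between two equally sized vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Terms that occur in the same contexts end up with similar row vectors: here `cosine(matrix["romantic"], matrix["ending"])` is larger than `cosine(matrix["romantic"], matrix["friends"])`, because "romantic" and "ending" share a sentence while "romantic" and "friends" never do.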

METHODOLOGY
Our workflow to generate context-aware justifications based on users' reviews is shown in Figure 1. In the following, we will describe all the modules that compose the workflow.
Context Learner. The first step is carried out by the CONTEXT LEARNER module, which exploits DSMs to learn a vector space representation of the contexts. Formally, given a set of reviews $R$ and a set of $k$ contextual settings $C = \{c_1, \dots, c_k\}$, this module generates as output a matrix $C_{n,k}$ that encodes the importance of each term $t_i$ in each contextual setting $c_j$. In order to build such a representation, we first split all the reviews $r \in R$ into sentences. Next, we manually annotated a subset of the resulting sentences to obtain a set $S = \{s_1, \dots, s_m\}$, where each $s_i$ is labeled with one or more contextual settings, based on the concepts mentioned in the sentence. As an example, the sentence 'a very romantic movie' is annotated with the context company=partner, while the sentence 'perfect for a night at home' is annotated with the context day=weekday. After the annotation step, a sentence-context matrix $A_{m,k}$ is built, where each $a_{s_i,c_j}$ is equal to 1 if the sentence $s_i$ is annotated with the context $c_j$ (that is to say, it mentions concepts that are relevant for that context), and 0 otherwise.
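The construction of the sentence-context matrix can be sketched as follows. The first two annotated sentences come from the example above; the third sentence and its labels are hypothetical and only show a sentence annotated with more than one context.

```python
# Contextual settings (columns of A). The ordering is an arbitrary choice.
contexts = ["company=partner", "company=friends", "day=weekday"]

# Manually annotated sentences: (text, set of context labels).
annotated = [
    ("a very romantic movie", {"company=partner"}),
    ("perfect for a night at home", {"day=weekday"}),
    # Hypothetical sentence carrying two labels at once.
    ("great fun for a date or a group night", {"company=partner", "company=friends"}),
]

# A[i][j] is 1 if sentence i is annotated with context j, 0 otherwise.
A = [[1 if c in labels else 0 for c in contexts] for _, labels in annotated]
```

Each row of `A` corresponds to one sentence and each column to one contextual setting, so the third row contains two ones.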
Next, we run tokenization and lemmatization algorithms [7] over the sentences in $S$ to obtain a lemma-sentence matrix $V_{n,m}$. In this case, $v_{t_i,s_j}$ is equal to the TF-IDF of the term $t_i$ in the sentence $s_j$, where IDF is calculated over all the annotated sentences. In order to filter out non-relevant lemmas, we retained in the matrix $V$ only nouns and adjectives. Nouns were chosen due to previous research [15], which showed that descriptive features of an item are usually represented using nouns (e.g., service, meal, location, etc.). Similarly, adjectives were included since they play a key role in the task of catching the characteristics of the different contextual situations (e.g., romantic, quick, etc.). Moreover, we also decided to extract combinations of nouns and adjectives (bigrams) such as romantic location, since they can be very useful to highlight specific characteristics of the item.
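A minimal sketch of this step is given below. It assumes the sentences have already been lemmatized and POS-tagged (the tagged input is a hypothetical stand-in for the output of a real tagger), it builds bigrams only from adjacent adjective-noun pairs, and it uses raw term counts with idf = log(m / df). The paper does not specify the exact TF-IDF variant, so these choices are assumptions.

```python
import math
from collections import Counter

# Hypothetical lemmatized and POS-tagged sentences: lists of (lemma, tag) pairs.
tagged_sentences = [
    [("very", "ADV"), ("romantic", "ADJ"), ("movie", "NOUN")],
    [("simple", "ADJ"), ("plot", "NOUN"), ("be", "VERB"), ("easy", "ADJ")],
    [("romantic", "ADJ"), ("plot", "NOUN")],
]

KEEP = {"NOUN", "ADJ"}  # only nouns and adjectives are retained


def extract_terms(sentence):
    """Return noun/adjective unigrams plus adjacent adjective-noun bigrams."""
    terms = [lemma for lemma, pos in sentence if pos in KEEP]
    for (w1, p1), (w2, p2) in zip(sentence, sentence[1:]):
        if p1 == "ADJ" and p2 == "NOUN":
            terms.append(f"{w1} {w2}")
    return terms


term_lists = [extract_terms(s) for s in tagged_sentences]
vocabulary = sorted({t for terms in term_lists for t in terms})
m = len(term_lists)

# Document frequency: number of sentences in which each term appears.
df = Counter(t for terms in term_lists for t in set(terms))

# V[i][j] = TF-IDF of term i in sentence j (raw count times log(m / df)).
V = [
    [Counter(terms)[t] * math.log(m / df[t]) for terms in term_lists]
    for t in vocabulary
]
```

The resulting `V` has one row per lemma or bigram and one column per annotated sentence, matching the $n \times m$ shape of the lemma-sentence matrix.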
In the last step of the process, the vocabulary matrix $V_{n,m}$ and the annotation matrix $A_{m,k}$ are multiplied to obtain our lemma-context matrix $C_{n,k} = V_{n,m} A_{m,k}$, which represents the final output returned by the CONTEXT LEARNER module. Each $c_{i,j}$ encodes the importance of term $t_i$ in the context $c_j$. The whole process carried out by this component is described in Figure 2.
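The product of the lemma-sentence matrix and the sentence-context matrix can be illustrated with a small numeric sketch. The weights and annotations below are invented for the example, and the plain-Python `matmul` helper is a hypothetical name used only to keep the snippet self-contained.

```python
def matmul(X, Y):
    """Multiply matrix X (n x m) by matrix Y (m x k), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# V: 2 lemmas x 3 sentences (illustrative TF-IDF weights, not real data).
V = [
    [1.0, 0.0, 0.5],  # lemma "romantic"
    [0.0, 2.0, 0.0],  # lemma "easy"
]

# A: 3 sentences x 2 contexts (1 = sentence annotated with that context).
A = [
    [1, 0],
    [0, 1],
    [1, 0],
]

# C: 2 lemmas x 2 contexts; each cell sums a lemma's weights over the
# sentences annotated with that context.
C = matmul(V, A)
```

Here `C` equals `[[1.5, 0.0], [0.0, 2.0]]`: the lemma "romantic" accumulates weight only in the first context, because it occurs only in sentences annotated with it.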
Given such a representation, two different outputs are obtained. First, we can directly extract the column vectors $c_j$ from matrix $C$, each of which is the vector space representation of the context $c_j$ based on DSMs. It should be pointed out that such a representation fits the principles of DSMs, since contexts discussed through the same lemmas will share a very similar vector space representation. Conversely, a poor overlap will result in very different vectors. Moreover, for each column, lemmas may be ranked and those having the highest TF-IDF scores may be extracted. In this way, we obtain a lexicon of lemmas that are relevant for a particular contextual setting, and this can be useful to empirically validate the effectiveness of the approach. In Table 1, we anticipate some details of our experimental session and report the top-3 lemmas for two different contextual settings, starting from a set of movie reviews.
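Both outputs can be sketched in a few lines. The lemmas below are borrowed from Table 1, while the scores are invented solely to make the example runnable; `context_vector` and `top_lemmas` are hypothetical helper names.

```python
lemmas = ["engaging", "simple", "intense", "easy"]
contexts = ["attention=high", "attention=low"]

# Lemma-context matrix C (rows = lemmas, columns = contexts); scores are made up.
C = [
    [0.9, 0.1],
    [0.0, 0.8],
    [0.7, 0.0],
    [0.1, 0.6],
]


def context_vector(C, j):
    """Column j of C: the vector space representation of context j."""
    return [row[j] for row in C]


def top_lemmas(C, lemmas, j, k=3):
    """The k lemmas with the highest score in context j (the context lexicon)."""
    ranked = sorted(zip(lemmas, context_vector(C, j)), key=lambda p: p[1], reverse=True)
    return [lemma for lemma, _ in ranked[:k]]
```

With these illustrative scores, `top_lemmas(C, lemmas, 0, k=2)` returns `["engaging", "intense"]` and `top_lemmas(C, lemmas, 1, k=2)` returns `["simple", "easy"]`.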
Ranker. Given a recommended item (along with its reviews) and given the context in which the item will be consumed (from now on, the 'current context'), this module has to identify the most relevant review excerpts to be included in the justification. To this end, we designed a ranking strategy that exploits DSMs and similarity measures in vector spaces to identify suitable excerpts: given a set of $n$ reviews discussing the item $i$, $R_i = \{r_{i,1}, \dots, r_{i,n}\}$, we first split each review in $R_i$ into sentences. Next, we processed the sentences through a sentiment analysis algorithm [6,17] in order to filter out those expressing a negative or neutral opinion about the item. The choice is justified by our focus on review excerpts discussing positive characteristics of the item. Next, let $c_j$ be the current contextual situation (e.g., company=partner); we calculate the cosine similarity between the context vector $c_j$ returned by the CONTEXT LEARNER and a vector space representation of each sentence $s_i$. The sentences having the highest cosine similarity w.r.t. the context of usage $c_j$ are selected as the most suitable excerpts and are passed to the GENERATOR.
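The ranking strategy can be sketched as below. Several elements are assumptions made for illustration: sentiment labels are supplied by hand instead of by a sentiment analysis tool, sentences are represented as binary bag-of-lemma vectors over the same lemma space as the context vector (the paper does not detail the sentence representation), and the context vector values are invented.

```python
import math

lemmas = ["romantic", "plot", "intense", "funny"]

# Hypothetical context vector for company=partner over the lemma space above.
context_vec = [0.9, 0.2, 0.4, 0.0]

# Candidate review sentences with precomputed (hypothetical) sentiment labels.
candidates = [
    ("it has a very romantic end", "positive"),
    ("a funny movie", "positive"),
    ("the romantic plot is dull", "negative"),
]


def vectorize(sentence):
    """Binary bag-of-lemmas vector; a real pipeline would lemmatize first."""
    tokens = set(sentence.split())
    return [1.0 if lemma in tokens else 0.0 for lemma in lemmas]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(candidates, context_vec, top_n=1):
    """Keep positive sentences and return the top_n most similar to the context."""
    positive = [s for s, label in candidates if label == "positive"]
    scored = sorted(positive, key=lambda s: cosine(vectorize(s), context_vec), reverse=True)
    return scored[:top_n]
```

In this toy setting, `rank(candidates, context_vec)` selects "it has a very romantic end": the negative sentence is filtered out before scoring, and the remaining positive sentence shares no lemma with the context vector.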
Generator. Finally, the goal of GENERATOR is to put together the compliant excerpts in a single natural language justification. In particular, we defined a slot-filling strategy based on the principles of Natural Language Generation [18]. Such a strategy is based on the combination of a fixed part, which is common to all the justifications, and a dynamic part that depends on the outputs returned by the previous steps. In our case, the top-1 sentence for each current contextual dimension is selected, and the different excerpts are merged by exploiting simple connectives, such as adverbs and conjunctions. An example of the resulting justifications is provided in Table 2.
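A slot-filling strategy of this kind can be sketched as follows. The template wording mirrors the example justification reported in Table 2, but the exact templates, the `CONTEXT_PHRASES` mapping and the `generate` function are assumptions introduced for illustration. The sketch also simplifies the input to a single context with a list of already selected excerpts.

```python
# Hypothetical mapping from contextual settings to the phrase used in the fixed part.
CONTEXT_PHRASES = {
    "company=partner": "with your partner",
    "company=friends": "with friends",
}


def generate(title, context, excerpts):
    """Fill a fixed template (static part) with the selected excerpts (dynamic part)."""
    first, *rest = excerpts
    text = (
        f"You should watch '{title}'. It is a good movie to watch "
        f"{CONTEXT_PHRASES[context]} because {first}."
    )
    # Further excerpts are appended through a simple connective.
    for excerpt in rest:
        text += f" Moreover, {excerpt}."
    return text


justification = generate(
    "Stranger than Fiction",
    "company=partner",
    ["it has a very romantic end", "plot is very intense"],
)
```

With these inputs, `justification` reproduces the Company=Partner example: "You should watch 'Stranger than Fiction'. It is a good movie to watch with your partner because it has a very romantic end. Moreover, plot is very intense."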

EXPERIMENTAL EVALUATION
The experimental evaluation was designed to identify the best-performing configuration of our strategy across different combinations of the workflow parameters (Research Question 1), and to assess how our approach performs in comparison to other methods (both context-aware and non-contextual) to generate post-hoc justifications (Research Question 2). To this end, we designed a user study involving 273 subjects (male=50%, degree or PhD=26.04%, age≥35=49.48%, already used a RS=85.4%) in the movie domain. Interest in movies was indicated as medium or high by 62.78% of the sample. Our sample was obtained through the availability sampling strategy, and it includes students, researchers in the area and people not skilled in computer science and recommender systems.
Experimental Design. To run the experiment, we deployed a web application implementing the methodology described in Section 3. As a first step, we identified the relevant contextual dimensions for the domain. Contexts were selected by carrying out an analysis of related work on context-aware recommender systems in the MOVIE domain. In total, we defined 3 contextual dimensions, that is to say, mood (great, normal), company (family, friends, partner) and level of attention (high, low). To collect the data necessary to feed our web application, we selected a subset of 300 popular movies (according to IMDB data) discussed in more than 50 reviews in the Amazon Reviews dataset. This choice is motivated by our need for a large set of sentences discussing the item in each contextual setting. These data were processed by exploiting the lemmatization, POS-tagging and sentiment analysis algorithms available in CoreNLP and the Stanford Sentiment Analysis tool. Some statistics about the final dataset are provided in Table 3.

Table 1: Top-3 lemmas for two different contextual settings, learned from movie reviews.

| | Attention=high | Attention=low |
|---|---|---|
| Unigrams | engaging, attentive, intense | simple, smooth, easy |
| Bigrams | intense plot, slow movie, life metaphor | easy vision, simple movie, simple plot |

Table 2: Examples of context-aware justifications generated for the same movie.

| Context | Justification |
|---|---|
| Company=Partner | You should watch 'Stranger than Fiction'. It is a good movie to watch with your partner because it has a very romantic end. Moreover, plot is very intense. |
| Company=Friends | You should watch 'Stranger than Fiction'. It is a good movie to watch with friends since the film crackles with laughter and pathos and it is a classy sweet and funny movie. |

In order to compare different configurations of the workflow, we designed several variants obtained by varying the vocabulary of lemmas. In particular, we compared the effectiveness of simple unigrams, of bigrams and of their merge. In the first case, we encoded in our matrix just single lemmas (e.g., service, meal, romantic, etc.), while in the second we stored combinations of nouns and adjectives (e.g., romantic location). Due to space reasons, we cannot provide more details about the lexicons we learnt, and we refer again to Table 1 for a qualitative evaluation of some of the resulting representations. Our representations based on DSMs were obtained by starting from a set of 1,905 annotations for the movie domain, produced by three annotators by adopting a majority vote strategy. To conclude, each user involved in the experiment carried out the following steps:
1. Training, Context Selection and Generation of the Recommendation. First, we asked the users to provide some basic demographic data and to indicate their interest in movies. Next, each user indicated the context of consumption of the recommendation, by selecting a context among the different contextual settings we previously indicated (see Figure 3-a). Given the current context, a suitable recommendation was identified and presented to the user. As recommendation algorithm we used a content-based recommendation strategy exploiting users' reviews.
2. Generation of the Justification. Given the recommendation and the current context of consumption, we ran our pipeline to generate a context-aware justification of the item adapted to that context. In this case, we designed a between-subject protocol. In particular, each user was randomly assigned to one of the three configurations of our pipeline and the output was presented to the user along with the recommendation (see Figure 3-b). Clearly, the user was not aware of the specific configuration he was interacting with.
3. Evaluation through Questionnaires. Once the justification was shown, we asked the users to fill in a post-usage questionnaire. Each user was asked to evaluate transparency, persuasiveness, engagement and trust of the recommendation process through a five-point scale (1=strongly disagree, 5=strongly agree). The questions the users had to answer follow those proposed in [23]. Due to space reasons, we cannot report the questions, and we suggest interacting with the web application to fill in the missing details.
4. Comparison to baselines. Finally, we compared our method to two different baselines in a within-subject experiment. In this case, all the users were provided with two different justification styles (i.e., our context-aware justifications and a baseline) and we asked the users to choose the one they preferred. As for the baselines, we focused on other methodologies to generate post-hoc justifications and we selected (i) a context-aware strategy to generate justifications, which is based on a set of manually defined relevant terms for each context; (ii) a method to generate non-contextual review-based justifications that relies on the automatic identification of relevant aspects and on the selection of compliant review excerpts containing such terms. This approach partially replicates the one presented in [10].
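The random assignment used in the between-subject step of the protocol above can be sketched as follows. The configuration labels, the function name and the fixed seed are assumptions made for this example; the paper does not describe how the randomization was implemented.

```python
import random

# The three vocabulary configurations compared in the study (labels are ours).
CONFIGURATIONS = ["unigrams", "bigrams", "unigrams+bigrams"]


def assign_configurations(user_ids, seed=0):
    """Assign each user to one configuration uniformly at random."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return {user: rng.choice(CONFIGURATIONS) for user in user_ids}


# One entry per participant; 273 is the sample size reported above.
assignments = assign_configurations(range(273))
```

Uniform random choice does not guarantee equally sized groups; a design that needs balanced groups could instead shuffle the users and assign them in round-robin order.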

Discussions of the Results
Results of the first experiment, which allow us to answer Research Question 1, are presented in Table 4. The values in the table represent the average scores provided by the users for each of the previously mentioned questions. Results show that the overall best results are obtained by using a vocabulary based on both unigrams and bigrams. This first finding is an interesting outcome, since most of the strategies to generate explanations are currently based on single keywords and aspects. Conversely, our experiment showed that both adjectives and pairs of co-occurring terms are worth encoding, since they catch more fine-grained characteristics of the item that are relevant in a particular contextual setting. Overall, the results we obtained confirmed the validity of the approach. Beyond the increase in TRANSPARENCY, high evaluations were also noted for the PERSUASION and ENGAGEMENT metrics. This outcome confirms how the identification of relevant review excerpts can lead to satisfying justifications. Indeed, differently from feature-based justifications, which typically rely on very popular and well-known characteristics of the movie, such as the actors or the director, more specific aspects of the items emerge from users' reviews.
Next, in order to answer Research Question 2, we compared the best-performing configuration emerging from Experiment 1 to two different baselines. The results of these experiments are reported in Table 5, which shows the percentage of users who preferred our context-aware methodology based on DSMs to each of the baselines. In particular, the first comparison allowed us to assess the effectiveness of a vector space representation of contexts based on DSMs with respect to a simple context-aware justification method based on a fixed lexicon of relevant terms, while the second comparison investigated the validity of diversifying the justifications based on the different contextual settings in which the item is consumed. As shown in the table, our approach was the preferred one in both comparisons. It should be pointed out that the gaps are particularly large when our methodology is compared to a non-contextual baseline. In this case, we noted a statistically significant gap (p ≤ 0.05) for all the metrics, with the exception of trust. This suggests that diversifying the justifications based on the context of consumption is particularly appreciated by the users. This confirms the validity of our intuition, which led to a completely new research direction in the area of justifications for recommender systems.
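The per-configuration averages reported for Experiment 1 correspond to a simple aggregation of the questionnaire answers. The sketch below shows that computation on a handful of responses that are invented solely to illustrate it and are not the study's data; `average_scores` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical questionnaire answers: (configuration, metric, score on a 1-5 scale).
responses = [
    ("unigrams", "transparency", 3),
    ("unigrams", "transparency", 4),
    ("unigrams+bigrams", "transparency", 4),
    ("unigrams+bigrams", "transparency", 5),
    ("unigrams+bigrams", "trust", 4),
]


def average_scores(responses):
    """Average the five-point scores for each (configuration, metric) pair."""
    buckets = defaultdict(list)
    for config, metric, score in responses:
        buckets[(config, metric)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


averages = average_scores(responses)
```

Each key of `averages` identifies one cell of a configuration-by-metric table; with the toy answers above, the transparency average is 3.5 for unigrams and 4.5 for the merged vocabulary.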
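The paper does not state which statistical test was used for the preference comparison. As one possible way to check such a gap, the sketch below computes an exact two-sided binomial (sign) test under the null hypothesis that users are equally likely to prefer either justification style. The function name and the counts used to illustrate it are hypothetical and are not results from the study.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k preferences out of n under H0: p = 0.5.

    Because the null distribution is symmetric, the two-sided p-value is twice
    the probability of the more extreme tail, capped at 1.0.
    """
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative counts only: 70 of 100 users preferring one style.
p_value = binomial_two_sided_p(70, 100)
significant = p_value <= 0.05
```

With these illustrative counts the gap is significant at the 0.05 level, whereas an even 50/50 split yields a p-value of 1.0.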

CONCLUSIONS AND FUTURE WORK
In this paper we presented a methodology that exploits DSMs to build post-hoc context-aware natural language justifications supporting the suggestions generated by a RS. The hallmark of this work is the diversification of the justifications based on the different contextual settings in which the items will be consumed, which is a new research direction in the area. As shown in our experiments, our justifications were largely preferred by users. This confirms the effectiveness of our approach and paves the way to several future research directions, such as the definition of personalized justifications as well as the generation of hybrid justifications that combine elements gathered from user-generated content (such as the reviews) with descriptive characteristics of the items. Finally, we will also evaluate to what extent these justifications can explain the behavior of complex and non-scrutable models such as those based on deep learning techniques [22].

Table 5: Results of Experiment 2, comparing our approach (CA+DSMs) to a context-aware baseline that does not exploit DSMs (CA Static) and to a non-contextual baseline that exploits users' reviews (review-based). The configuration preferred by the higher percentage of users is reported in bold.