Surviving the Legal Jungle: Text Classification of Italian Laws in Extremely Noisy Conditions

In this paper, we present a method based on Linear Discriminant Analysis for legal text classification of extremely noisy data, such as duplicated documents classified in different classes. The results show that Linear Discriminant Analysis obtains very good performances both in clean and noisy conditions, if used as classifier in ensemble learning and in multi-label text classification.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Motivation and Background
We address text categorization of business-oriented legal documents in Italian, with a custom and overlapping hierarchy of product categories. A typical approach to similar tasks is to exploit resources such as EUROVOC (Daudaravicius, 2012), a multilingual thesaurus consisting of over 6700 hierarchically organised class descriptors, used by many organizations of the European Union (EU) for the classification and retrieval of official documents. Our editorial system has a hierarchy of 23 product categories and more than 20600 labels, manually annotated and customized for different clients over more than 15 years; hence it is not possible to exploit resources like EUROVOC to categorize our documents.
In this paper, we propose a fast and efficient method for document classification for noisy data based on Linear Discriminant Analysis, a dimensionality reduction technique that has been employed successfully in many domains, including neuroimaging and medicine. We believe that our contribution will be useful to the NLP community in the context of document categorization as well as automatic ontology population, in particular when dealing with very noisy data.
The paper is structured as follows: in Section 1.1 we present related work in the field of text classification and the potential of Linear Discriminant Analysis, in Section 2 we describe the datasets we used, in Section 3 we report and discuss the results of our classification experiments, and in Section 4 we draw our conclusions.

Related Work
There are many applications of NLP in the legal text domain, such as the creation of ontologies for knowledge extraction (Lenci et al., 2009) or legal reasoning (Palmirani et al., 2018). Other tasks include dependency parsing (Dell'Orletta et al., 2012), deception detection (Fornaciari et al., 2013) and semantic annotation exploiting external resources like FrameNet (Venturi, 2011). In this domain, the most popular way to perform text categorization is to use ontologies: for example, many works used EUROVOC to label documents in several languages (Steinberger et al., 2013) with one label for each document, in order to train SVMs (Boella et al., 2013) or deep learning models (Caled et al., 2019) for the prediction of labels at different levels of granularity in the label hierarchy. Another approach is to use the judgments of the Supreme Court as gold standard labels, thus reducing the complexity of the task, and then train machine learning models, such as SVMs, to perform classification (Sulea et al., 2017). Active learning has been reported not to reach good performance in the legal domain (Cardellino et al., 2015), but it is possible to align different resources to perform ontology population or expansion (Cardellino et al., 2017). State-of-the-art performance in text classification ranges from 40% to 85% or more, depending on the complexity and size of the dataset and on the number of document classes (Adhikari et al., 2019). A noise introduction simulation study revealed that substituting up to 40% of words with random text strings yields a small decrease in text classification performance, while substituting more than 40% of the text yields a dramatic decrease (Agarwal et al., 2007).
A similar task, Extreme Multi-Label Text Classification (XMTC), consists in the classification of documents annotated with multiple tags. Recent XMTC experiments with Convolutional Neural Networks on a dataset of 57k legal documents annotated with multiple concepts from EUROVOC revealed that word embeddings extracted with label-wise attention networks (Mullenbach et al., 2018) lead to the best overall performance, compared with pre-trained word embeddings, hierarchical word embeddings and Max-Pooling Scorers that produce section-based word embeddings (Chalkidis et al., 2019). It has been demonstrated in more than one context that cNNs perform well for text categorization, but also that no single algorithm performs best across all combinations of data sets and training sample sizes (Keeling et al., 2019). The rationale behind the good performance of label-wise attention networks is their ability to maximise the difference between the words/features associated with different labels. A very similar, but faster, approach is Linear Discriminant Analysis (Balakrishnama and Ganapathiraju, 1998), a feature selection and classification technique that has been successfully used for the incremental classification of large streams of data (Pang et al., 2005), to find identity patterns in images before the advent of deep learning (Prince and Elder, 2007), and as a feature selection technique for discriminating fMRI response patterns to visual stimuli (Mandelkow et al., 2016).
Linear Discriminant Analysis (henceforth LDA) is a widely accepted dimensionality reduction and classification method, which aims to find a transformation matrix that maps a feature space to a smaller space by maximising the between-class scatter while minimising the within-class scatter (Boroujeni et al., 2018). Criticism of this technique emphasizes that it suffers from the domination of the largest objectives, in particular when close class pairs tend to overlap in a feature subspace, but this can be mitigated with various optimizations, including eigenvalue decomposition, among others (Li et al., 2017).
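For reference, the criterion described above can be written in its standard textbook form. The notation is ours and is not taken from the cited works: we assume C classes, n_c documents in class c, class means \mu_c, a global mean \mu, and a projection matrix W.

```latex
\begin{align}
S_B &= \sum_{c=1}^{C} n_c \,(\mu_c - \mu)(\mu_c - \mu)^{\top}
  && \text{(between-class scatter)} \\
S_W &= \sum_{c=1}^{C} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^{\top}
  && \text{(within-class scatter)} \\
W^{*} &= \arg\max_{W} \frac{\left| W^{\top} S_B W \right|}{\left| W^{\top} S_W W \right|}
\end{align}
```

When S_W is invertible, the columns of W^{*} are the leading eigenvectors of S_W^{-1} S_B. Since S_B has rank at most C - 1, at most C - 1 discriminant directions carry information.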

Data
Our dataset consists of 2030 Italian legal documents with an average of 800 words each. We have 23 classes representing products, manually annotated over 15 years; every document is categorized in one or more classes. Classes are not balanced, but their distribution is proportional to that of the whole editorial system, which consists of 443.7k documents. We extracted such a small dataset from the editorial system because we plan to update our models very frequently, using a small portion of documents each time in order to save computational power and time. Figure 1 reports the distribution of the classes in our dataset. Since documents can fall under more than one class, 43% of the documents are repeated under different classes. We tested the performance of different classifiers under two conditions: noisy (with repeated documents) and clean (without the repeated documents).
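The two conditions can be illustrated with a minimal sketch. It assumes that the dataset is a list of (text, label) pairs, that a repeated document is one whose identical text appears under more than one label, and that the clean condition drops every copy of such documents. These are our assumptions for illustration; the function names are hypothetical.

```python
from collections import defaultdict


def labels_per_text(records):
    """Map each document text to the set of labels it appears under."""
    mapping = defaultdict(set)
    for text, label in records:
        mapping[text].add(label)
    return mapping


def split_noisy_clean(records):
    """Return (noisy, clean) versions of a list of (text, label) pairs.

    noisy keeps everything; clean drops every copy of a document that
    occurs under more than one label (an assumption of this sketch).
    """
    mapping = labels_per_text(records)
    noisy = list(records)
    clean = [(text, label) for text, label in records if len(mapping[text]) == 1]
    return noisy, clean


def repeated_share(records):
    """Fraction of records whose text is repeated under different labels."""
    mapping = labels_per_text(records)
    repeated = sum(1 for text, _ in records if len(mapping[text]) > 1)
    return repeated / len(records)
```

Under these assumptions, calling repeated_share on the full dataset would give the proportion of repeated documents, which the text above reports as 43%.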

Experiments and Discussion
In both cases (noisy and clean) we preprocessed the text, deleting punctuation and Italian stopwords. We did not use stemming or lemmatization, since they degraded the results. We formalize the task in two ways: a simple multinomial classification, where we train a classifier to predict one class per document, and a multi-label classification, where we produce a score ranking of labels for each document and evaluate whether the gold standard label occurs in the first N positions.
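A minimal sketch of this preprocessing step is given below. The stopword list is a deliberately tiny, hypothetical sample rather than the list used in the experiments, and lowercasing and whitespace tokenization are assumptions of the sketch.

```python
import string

# Hypothetical, deliberately small sample of Italian stopwords;
# a real system would use a complete list.
ITALIAN_STOPWORDS = {
    "il", "lo", "la", "i", "gli", "le", "l", "un", "una",
    "di", "a", "da", "in", "con", "su", "per", "del", "della",
    "e", "è", "che",
}

# Map every punctuation character to a space so that "decreto-legge"
# becomes two tokens instead of one fused token.
_PUNCTUATION_TABLE = str.maketrans(
    {char: " " for char in string.punctuation + "«»“”’"}
)


def preprocess(text):
    """Lowercase, strip punctuation and drop stopwords; no stemming."""
    tokens = text.lower().translate(_PUNCTUATION_TABLE).split()
    return [token for token in tokens if token not in ITALIAN_STOPWORDS]
```

For example, preprocess("Il decreto-legge, del 2020, è in vigore.") keeps only the content tokens and leaves their surface forms untouched, in line with the choice not to stem or lemmatize.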

Multinomial Classification
We tested different feature settings and algorithms with 10-fold cross validation (10f-cv) and a 70%-30% training-testing split in the clean and noisy dataset conditions. Table 1 reports the results in terms of accuracy, that is, the percentage of documents correctly classified. In both conditions the majority baseline is very low, ranging from 4.6% to 8.3%. First we experimented with pre-trained GloVe word vectors as features (vector size 200). The GloVe Project provides word vectors of different dimensions, trained on massive web datasets (Pennington et al., 2014); the vectors we used here were pre-trained on two large corpora, Wikipedia 2014 and Gigaword 5. As shown in Table 1, in the GloVe embeddings setting we used the following classification algorithms: cNN (with 2 convolutional layers with ReLU activation, 1 pooling layer and 1 output layer), rNN (with 1 rNN sequence layer, 1 LSTM layer with tanh activation and 1 rNN output layer), Bayesian networks, naïve Bayes, SVMs, random forest and LDA. In general, Deep Learning algorithms suffer from the small amount of data used for the experiment; surprisingly, however, cNNs performed badly while rNNs worked better, indicating that the sequentiality of text plays an important role. Among the other classification algorithms, random forest and LDA obtained the best performances, suggesting that the ability of the algorithm to generalize is crucial. The generally low accuracies obtained with these features might indicate that the contexts of our documents, as represented by word embeddings, are not very discriminative. The results increased significantly in the classification with the TF-IDF scores of 4700 words, especially with SVMs as the classification algorithm.
This suggests that using more features brings better results without overfitting the data, as shown by the similar accuracies obtained with 10-fold cross validation and with the training-test split. Next we experimented with feature selection, using LDA and Pearson's correlations to select the best 200 words for the prediction. Results show that, in this feature setting, random forests are the best classification algorithm and that LDA outperforms correlations as a feature selection algorithm. Furthermore, as can be seen in the last part of Table 1, we tested an ensemble learning setting in which the initial space of 200 word features, previously selected with LDA, is transformed into a space of 23 binary features corresponding to the final classes. On top of that we applied different classification algorithms, finding that SVM is the best performing one in the noisy dataset, while random forest obtained the best performance in the clean dataset.
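The correlation-based feature selection can be sketched as follows. How the correlation is computed against a 23-class target is not specified above, so this sketch makes an explicit assumption: each word column is correlated with a one-vs-rest indicator for every class, and the word is scored by its largest absolute Pearson coefficient. The function names and the data layout (one dict of word weights per document) are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(rows, labels, vocabulary, k):
    """Return the k words most correlated with any class indicator.

    rows: one dict per document, mapping word -> weight (e.g. TF-IDF).
    labels: one class label per document, aligned with rows.
    """
    # One-vs-rest indicator vector for each class (assumption of this sketch).
    indicators = {
        cls: [1.0 if label == cls else 0.0 for label in labels]
        for cls in set(labels)
    }
    scores = {}
    for word in vocabulary:
        column = [row.get(word, 0.0) for row in rows]
        scores[word] = max(
            abs(pearson(column, indicator)) for indicator in indicators.values()
        )
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(vocabulary, key=lambda w: (-scores[w], w))[:k]
```

With k set to 200, this would produce a reduced vocabulary of the same size as the one used in the feature selection experiments; words that occur uniformly across all classes receive a score of zero and are discarded first.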

Multi-Label Classification
The Multi-Label classification task is structured as follows: for each document label in the training set, we create a Bag-of-Words (BoW) from the words of its associated documents, then we use TF-IDF scores to weight every word within the BoW, obtaining a word ranking that we use for feature selection, since words with higher values better characterize a particular label. Then we apply LDA classification but, unlike in the previous experiment, here the prediction returns a list of all the labels, ordered by the total score achieved; we call this algorithm score ranking. Since the classifier returns a list as its outcome, but the editors (our customers) want to choose one or more labels from this list, we have to evaluate whether the gold standard label occurs in the returned list; thus we can assign multiple labels to a document and test whether the original one is present or not. In this sense, the Score Ranking classifier is evaluated as a Multi-Class classifier (so the metrics in Table 2 are actually Hit@N metrics, where N is the size of the returned list), but the returned list is used by the end users to simulate a Multi-Label functionality, leaving to the editors the choice of the best labels to assign among the ones returned. The results of this experiment, reported in Table 2, show that the performance with 1 label is in line with the ensemble learning setting of the Multinomial classification, and that the score ranking system also works well in the noisy dataset, as the results are very similar in both noisy and clean conditions. Performance increases by 3.9% on average when more than one label is kept. In general, we observe that using 500 or 1000 words per label yields similar results in our small dataset, but using more words can help to capture more nuances in the text, which might be useful in larger sets of documents. We also observe that 1000 words per label improve the results in the clean condition, while 500 words per label are enough in the noisy condition.
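The pipeline described above can be sketched in a few lines. This is not the implementation used in the experiments: the LDA scoring step is replaced by a plain sum of per-label TF-IDF weights, which is only a stand-in to keep the sketch self-contained, and the TF-IDF variant (term frequency within the label BoW, inverse frequency across labels) is an assumption. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_label_profiles(train, words_per_label):
    """Build one weighted Bag-of-Words per label, keeping the top words.

    train: list of (tokens, label) pairs.
    Returns: label -> {word: TF-IDF weight} restricted to the
    words_per_label highest-weighted words of that label.
    """
    bow = defaultdict(Counter)
    for tokens, label in train:
        bow[label].update(tokens)
    n_labels = len(bow)
    # Number of label BoWs in which each word occurs.
    label_freq = Counter(word for counts in bow.values() for word in counts)
    profiles = {}
    for label, counts in bow.items():
        total = sum(counts.values())
        weights = {
            word: (count / total) * math.log(n_labels / label_freq[word])
            for word, count in counts.items()
        }
        top = sorted(weights, key=lambda w: (-weights[w], w))[:words_per_label]
        profiles[label] = {word: weights[word] for word in top}
    return profiles


def rank_labels(tokens, profiles):
    """Return all labels ordered by total score (stand-in for LDA scoring)."""
    scores = {
        label: sum(profile.get(token, 0.0) for token in tokens)
        for label, profile in profiles.items()
    }
    return sorted(scores, key=lambda label: (-scores[label], label))


def hit_at_n(test, profiles, n):
    """Fraction of documents whose gold label is among the first n labels."""
    hits = sum(
        1 for tokens, gold in test if gold in rank_labels(tokens, profiles)[:n]
    )
    return hits / len(test)
```

In this sketch, words_per_label plays the role of the 500 or 1000 words per label discussed above, and hit_at_n with n = 1 corresponds to evaluating the top-ranked label only.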

Conclusion and Future Work
We experimented with various settings, feature selection methods and classification algorithms, and we found a method to extract good models in extremely noisy conditions, even with documents repeated under different labels. LDA proved to be a valuable classification and feature selection technique, but we obtained the best performances when LDA was combined with other algorithms. The results we obtained with the score ranking classification are in line with the state-of-the-art, but our method is more suitable for small and noisy datasets. In the future we plan to apply the score ranking algorithm to a larger dataset and to use it in a real multi-label environment, comparing the results with the state-of-the-art of Extreme Multi-Label Document Classification (Chalkidis et al., 2019). We also plan to make comparisons with more recent state-of-the-art deep learning techniques and to apply semantic indexing to the documents to check for improvements.