A Resource for Detecting Misspellings and Denoising Medical Text Data

English. In this paper we propose a method for collecting a dictionary to deal with noisy medical text documents. The quality of Italian Emergency Room Reports is so poor that in most cases they can hardly be processed automatically; this also holds for other languages (e.g., English), with the notable difference that no Italian dictionary has been proposed to deal with this jargon. In this work we introduce and evaluate a resource designed to fill this gap.
Italiano. In this work we illustrate a method for building a dictionary dedicated to processing medical documents, namely the portion of clinical records annotated in emergency departments. This type of document is so noisy that clinical records can generally hardly be processed directly in an automatic fashion. Although cleaning this kind of document is a relevant and widespread problem, no complete dictionary existed for handling this domain-specific language. In this work we propose and evaluate a resource aimed at carrying out this type of processing on clinical records.


Introduction
Noise in textual data is a very common phenomenon afflicting text documents, especially when dealing with informal texts such as chats, SMS and e-mails. This kind of text inherently contains spelling errors, special characters, nonstandard word forms, grammar mistakes, and so on (Liu et al., 2012). In this work we focus on a type of text which can also be very noisy: emergency room reports. In the broader frame of a project aimed at detecting injuries stemming from violent acts in narrative texts contained in emergency room reports, we recently developed VIDES, so dubbed after 'Violence Detection System' (Mensa et al., 2020). This system is concerned with categorizing textual descriptions as containing violence-related injuries (V) vs. nonviolence-related injuries (NV), a task that is relevant for devising alerting mechanisms to track and prevent episodes of violence. VIDES combines a neural architecture, which performs the categorization step (thus discriminating V and NV records), with a FrameNet-based approach, whereby semantic roles are represented through a synthetic description employing a set of word embeddings. More specifically, a model of the violent event has been devised: records that are recognized as containing violence-related injuries are further processed by an explanation module, which is in charge of identifying the main elements corroborating that categorization (V), such as the involved agent, the type of injury, the involved body part, etc. Explaining the categorization ultimately involves filling the semantic components of the violence frame. All such elements contribute to recognizing a violent event as the source of the injuries reported by ER patients.
During the development of VIDES we realized that, in order to run sophisticated algorithms for the detection and extraction of such violent traits, we needed to cope with the noise contained in the input medical records. Some efforts have been invested in dealing with the different sorts of linguistic phenomena that hinder the comprehension of texts; however, most existing works are focused on the English language, and rely on dictionaries that cannot be directly employed on Italian text documents.
In this preliminary work we start to tackle the issue of noisy words in medical records for Italian texts, by specifically focusing on misspellings. Our contribution is twofold: first, we manually explore the dataset by analyzing a small sample of records in order to determine whether the main traits and issues present in other languages are also shared by Italian reports; second, we collect, merge and evaluate a set of Italian dictionaries, which constitute a fundamental building block for any domain-specific spell-checking algorithm (López-Hernández et al., 2019).

Related Work
The literature shows a limited but significant interest in the issue of detecting and correcting noisy medical text documents; nonetheless, some commonalities underlying this sort of text can be drawn.
Medical texts are often very noisy; among the most common sources of noise we mention mistyping, lack or improper use of punctuation, grammatical errors, domain-specific abbreviations and Latin medical terminology (Siklósi et al., 2013). This is mainly due to the nature of the records themselves, and to the fact that the medical personnel compiling the entries are often under pressure and in a hurry.
Most of the spelling correction approaches have been developed for English, with the exception of research in Swedish (Dziadek et al., 2017) and Hungarian (Siklósi et al., 2013), while no work has been found dealing with the Italian language. Regarding the methodologies, most works focus on non-word errors, while disregarding grammatical and real-word mistakes. Non-word mistakes occur when a misspelling produces a word that does not exist, such as 'patienz' instead of 'patient', while real-word mistakes occur when a word is mistakenly replaced with another existing one, like the substitution of 'abuse' with 'amuse'. The adopted algorithms are diverse, with a prevalence of approaches relying on embeddings (Kilicoglu et al., 2015; Workman et al., 2019) or on regular expressions and rule-based systems (Patrick et al., 2010; Sayle et al., 2012; Lai et al., 2015). However, basically all contributions adopt a preliminary dictionary look-up step (López-Hernández et al., 2019). To this purpose, besides the general dictionaries provided in toolkits such as Aspell and Google Spell Checker, authors often rely on (medical) domain-specific dictionaries, such as the Unified Medical Language System (UMLS) (Aoki et al., 2004), the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT, 2020) and the SPECIALIST Lexicon (Browne et al., 2000). It is thus evident that the development of analogous resources for the Italian language is a crucial step for the design of tools and systems aimed at spell-checking Italian medical text documents.
Besides the treatment of misspellings, there are also works specifically focused on abbreviations. For instance, in (Wu et al., 2011) the authors present a corpus-based method to create a lexical resource of English clinical abbreviations via several machine learning algorithms. The resource has been used to automatically detect and expand abbreviations, and obtained interesting experimental results. More recently, another approach proposed in (Kreuzthaler et al., 2016) focuses on abbreviations ending with a period character; the proposed technique puts together statistical and dictionary-based strategies to detect abbreviations in German clinical narratives.
In the present work we do not propose a specific technique for dealing with abbreviations; rather, we are concerned with misspellings. However, the approaches already proposed for other languages will be considered in future work to also treat Italian abbreviations in our dataset.

Data Analysis
We analyze real data coming from a set of emergency room reports collected in Italian hospitals by the Italian National Institute of Health. The whole dataset amounts to 136,144 non-empty entries, 592 of which were randomly selected for the manual analysis. Table 1 reports some figures describing the dataset. Double spaces and redundant punctuation have been fixed through regular expressions, while tokens have been extracted by splitting the sentences on spaces. Also, tokens containing numbers are presently discarded.
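The preprocessing just described can be sketched as follows. This is a minimal illustration rather than the actual pipeline: the regular expressions, the set of punctuation marks considered, and the function names `normalize` and `tokenize` are assumptions introduced here.

```python
import re


def normalize(text: str) -> str:
    """Collapse repeated whitespace and repeated punctuation marks.

    The exact patterns are an assumption; the text only states that
    regular expressions were used for this purpose.
    """
    text = re.sub(r"\s{2,}", " ", text)           # runs of whitespace -> one space
    text = re.sub(r"([.,;:!?])\1+", r"\1", text)  # e.g. ",," -> ","
    return text.strip()


def tokenize(text: str) -> list[str]:
    """Split on spaces and discard tokens that contain a digit."""
    tokens = normalize(text).split(" ")
    return [tok for tok in tokens if tok and not any(ch.isdigit() for ch in tok)]


# Toy record: punctuation stays attached to tokens, since splitting is space-based.
print(tokenize("cade  dalle scale,,  trauma ginocchio dx h24"))
```

Note that a purely space-based split leaves punctuation attached to the neighbouring token (e.g. `scale,`); whether and how such marks were stripped is not specified in the text.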
Analysis result. We performed a manual analysis on a subset of the original dataset: the 592 randomly selected entries were manually examined, and for each entry we looked for noisy words. Three main types of words were annotated: i) misspellings: a wrongly typed word, e.g., fratura instead of frattura (fracture); ii) abbreviations: a shortened form of a word or phrase, e.g., dx instead of destra (right); iii) acronyms: a word formed from the initial letters of other words, e.g., ps instead of pronto soccorso (emergency room). Interestingly enough, both abbreviations and acronyms can be considered at least partly domain dependent: for example, in different settings, ps may denote post scriptum (something added at a later time, typically to a letter, after the signature), but also 'Polizia di Stato' (Police) or 'previdenza sociale' (social security). Dealing with such phenomena thus involves accessing a context-dependent knowledge base that allows selecting the interpretation appropriate for the context at hand. We presently treat misspellings, acronyms and abbreviations as noise, but only the first category can actually be considered an error: abbreviations and acronyms belong to a domain-specific language, and are too specific to be recognized as legitimate words by a general-purpose dictionary. As seen in the literature, misspellings and abbreviations/acronyms must be treated with different techniques; in this work we mainly focus on the first category, while also obtaining interesting insights regarding the second. Table 2 illustrates the results of the annotation process. We discovered that the dataset contains a lot of noise, amounting to almost 10% of the tokens, on average 2 noisy tokens per record.
By looking separately at the different types of noise we observe that misspellings are more scattered and diverse, while the usage of abbreviations and acronyms seems to be more consistent: we have 670 instances of abbreviations but only 76 unique abbreviations, while 304 out of the 433 instances of misspellings are unique. This phenomenon is also depicted in Figure 1, where we provide the log-log plot of the frequency of each misspelling, abbreviation and acronym ordered by rank. We observe that the distributions of abbreviations and acronyms differ in magnitude, but are very similar in shape; on the other hand, the misspellings are clearly more scattered, with a very long tail of items appearing only once.
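The contrast between scattered misspellings and recurring abbreviations can be quantified with a unique-to-total ratio and a rank-frequency list, which is the data underlying a log-log plot such as the one described. The sketch below uses toy annotations; apart from fratura and dx, the tokens are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def unique_ratio(instances):
    """Fraction of annotated instances that are distinct forms."""
    return len(set(instances)) / len(instances)


def rank_frequency(instances):
    """Return (rank, frequency) pairs, most frequent form first."""
    counts = Counter(instances)
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


# Toy annotations (not taken from the dataset).
abbreviations = ["dx"] * 5 + ["sx"] * 3 + ["rx"] * 2
misspellings = ["fratura", "ginochio", "contusone", "fratura"]

print(unique_ratio(abbreviations), rank_frequency(abbreviations))
print(unique_ratio(misspellings), rank_frequency(misspellings))
```

With the counts reported in the text, the same ratio is 76/670 ≈ 0.11 for abbreviations and 304/433 ≈ 0.70 for misspellings.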

Dictionaries Creation and Evaluation
The manual analysis uncovered characteristics and features that are in line with those found in the literature for English datasets (López-Hernández et al., 2019). However, to allow the development of spell-checkers for Italian medical texts, another key component is still missing: most approaches aimed at error detection rely on dictionaries to determine whether a token is a legitimate word or not. In fact, the simplest implementation of misspelling detection is as follows: if we have at our disposal the set W containing all of the terms of a given language, together with all terms pertaining to the specific domain at hand, any word w ∉ W can likely be considered a misspelling. To the best of our knowledge, no existing dictionary is able to cope with Italian medical text documents, so we built a resource to meet this need.
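The look-up criterion above (flag every w ∉ W) can be sketched in a few lines. The function name and the case-insensitive comparison are assumptions made for illustration; the toy vocabularies are not the actual resource.

```python
def find_candidate_misspellings(tokens, general_terms, domain_terms):
    """Return the tokens that are absent from W = general terms ∪ domain terms.

    Lowercasing before the comparison is an assumption of this sketch.
    """
    vocabulary = {t.lower() for t in general_terms} | {t.lower() for t in domain_terms}
    return [tok for tok in tokens if tok.lower() not in vocabulary]


# Toy vocabularies, for illustration only.
general = {"mano", "destra", "caduta"}
medical = {"frattura", "contusione"}

print(find_candidate_misspellings(
    ["Frattura", "mano", "destra", "fratura"], general, medical))
```

Such a check only catches non-word errors: a real-word mistake is by definition in W, and legitimate abbreviations or acronyms are flagged unless W includes them.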

Source Dictionaries
The automatic development of a dictionary is not a trivial task. We want to reach the highest possible coverage for both general terms and specific medical terminology, but at the same time we cannot rely too much on unverified sources (e.g., crowd-sourced data), at the risk of introducing misspellings and errors into the dictionary. We selected different sources and arranged them into four main classes:
• MED: a collection of medical terms built by putting together five online medical dictionaries (torrinomedica.it, 2020; abcsalute.it, 2020; codifa.it, 2020; my-personaltrainer.it, 2020a; my-personaltrainer.it, 2020b), containing specific medical terms and medication names;
• ITA: a collection of Italian terms built by [...];
• WMED: a collection of terms from Wikipedia pages pertaining to the medical domain. The list of Wikipedia medical pages has been obtained by querying the SPARQL endpoint of Wikidata (Vrandečić and Krötzsch, 2014), while the pages have been taken from the 20 August 2020 Wikipedia dump;
• WMOV: since medical records also contain a brief narrative of the events that led to the (either violent or accidental) injuries, we added terms associated with eventive and narrative genres by collecting Wikipedia pages pertaining to movies, television series and literary works, which are expected to contain narrative terminology.
The set of terms extracted from Wikipedia can potentially contain misspellings and errors, so we also set a minimum frequency threshold that allows pruning the tokens therein. We denote this parameter with a subscript next to the set name, e.g., WMOV_1 indicates that the threshold on term frequency was set to 1.
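A frequency-based pruning of this kind might look like the sketch below. The text does not state whether the threshold is inclusive, so keeping only terms whose frequency strictly exceeds it is an assumption of this example, as is the function name.

```python
from collections import Counter


def prune_by_frequency(corpus_tokens, threshold):
    """Keep the terms occurring more than `threshold` times in the corpus.

    Assumption: the comparison is strict, so threshold=1 drops the terms
    seen only once; the source does not specify this detail.
    """
    counts = Counter(corpus_tokens)
    return {term for term, freq in counts.items() if freq > threshold}


# Toy corpus: the rare form "fratura" is pruned, the frequent ones survive.
tokens = ["frattura"] * 3 + ["trauma"] * 2 + ["fratura"]
print(sorted(prune_by_frequency(tokens, 1)))
```

The rationale is that a misspelling introduced by a single editor tends to be rare, while legitimate terms recur across many pages; raising the threshold trades coverage for correctness.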

Evaluation
Building the dataset. In order to assess the quality of the collected dictionaries we started from the 49,116 unique tokens in the dataset, removed the stop words and randomly selected 5,000 of them to be manually annotated. The annotation was carried out by four of the authors of this paper. The selection algorithm was designed so as to increase the probability of a token being selected in proportion to its frequency in the dataset. These 5,000 tokens were then annotated with one of the following four classes: correct words (regardless of their domain specificity), abbreviations, acronyms and misspellings. The first three classes represent terms that should be found in our resource, while the last category contains words that should not be present in the dictionary. Table 3 reports the statistics of the dataset annotated for evaluation purposes.
Table 4: Results of the evaluation of the considered dictionaries. The first column reports the size of each dictionary, the second to fourth columns provide coverage and correctness along with their harmonic mean, while the last three columns illustrate the coverage of our dictionaries on tokens that were annotated as correct words, abbreviations and acronyms.
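The selection algorithm itself is not detailed in the text. One possible way to draw distinct tokens with a probability that grows with their frequency is weighted sampling without replacement; the sketch below uses the Efraimidis-Spirakis key method, which is a technique chosen here for illustration and not necessarily the one adopted. Function and parameter names are hypothetical.

```python
import random


def weighted_sample(token_freqs, k, stop_words=frozenset(), seed=0):
    """Draw up to k distinct tokens, favouring the more frequent ones.

    Efraimidis-Spirakis: each token gets the key u ** (1 / weight), with u
    uniform in [0, 1); the k tokens with the largest keys are selected.
    """
    rng = random.Random(seed)
    candidates = {tok: freq for tok, freq in token_freqs.items()
                  if tok not in stop_words and freq > 0}
    keyed = [(rng.random() ** (1.0 / freq), tok)
             for tok, freq in candidates.items()]
    keyed.sort(reverse=True)
    return [tok for _, tok in keyed[:k]]


# Toy frequencies; "di" is treated as a stop word and never selected.
freqs = {"trauma": 50, "dx": 30, "incid": 10, "fratura": 1, "di": 200}
print(weighted_sample(freqs, 3, stop_words={"di"}))
```

Sampling in this way yields an annotated set that is biased towards the tokens a spell-checker would meet most often, while still giving rare forms a chance of being included.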
Evaluating the dictionary. In Table 4 we report the results of the evaluation of the dictionaries. Each dictionary has been built by taking into consideration one or more of the previously presented sources. Multiple sources have simply been merged into a single set of terms, without repetitions. We assess the quality of each dictionary via two measures, coverage and correctness. The coverage is the percentage of legitimate words (either correct words, abbreviations or acronyms) that were found in the dictionary, while the correctness is the percentage of misspellings that were not present in the dictionary. We considered different combinations of the sources, the tuning of the frequency-based filtering parameter, and an additional lemmatization step. We observe that both the ITA and the MED sets are fundamentally correct, even though they also include some frequently misspelled forms, such as passeggiero in place of the correct passeggero. On the other hand, the .62 coverage of their combination is unsatisfactory (see the second row of Table 4, MED, ITA); it also shows that the medical jargon is only partially captured by the dictionaries in the MED set. As expected, the introduction of terms from Wikipedia improves the coverage, but with a detrimental effect on the correctness. This also holds for the WMOV set, which is rich but also rather noisy. By fine-tuning the frequency thresholds of both WMED and WMOV we were able to prune most of the noise while preserving the coverage, finally obtaining a good dictionary with the combination MED, ITA, WMED_1, WMOV_5.
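The two measures and their harmonic mean can be computed as in the following sketch, which follows the definitions given above. The label strings, the function name and the toy data are assumptions of the example; scores are returned as fractions in [0, 1].

```python
def evaluate(dictionary, annotated):
    """Return (coverage, correctness, harmonic mean) for a dictionary.

    `annotated` maps each token to one of the labels "correct",
    "abbreviation", "acronym" or "misspelling" (label names are assumed).
    """
    legitimate = [t for t, label in annotated.items() if label != "misspelling"]
    misspelled = [t for t, label in annotated.items() if label == "misspelling"]
    # Coverage: share of legitimate tokens found in the dictionary.
    coverage = sum(t in dictionary for t in legitimate) / len(legitimate)
    # Correctness: share of misspellings NOT found in the dictionary.
    correctness = sum(t not in dictionary for t in misspelled) / len(misspelled)
    total = coverage + correctness
    harmonic = 2 * coverage * correctness / total if total else 0.0
    return coverage, correctness, harmonic


# Toy annotation and toy dictionary, for illustration only.
annotated = {
    "frattura": "correct", "mano": "correct",
    "dx": "abbreviation", "ps": "acronym",
    "fratura": "misspelling", "passeggiero": "misspelling",
}
dictionary = {"frattura", "mano", "ps", "passeggiero"}
print(evaluate(dictionary, annotated))
```

In this toy case the dictionary misses the abbreviation dx (coverage 3/4) and wrongly contains passeggiero (correctness 1/2); the harmonic mean penalizes dictionaries that do well on only one of the two measures.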
This setting was also tested by applying a lemmatization step on both Wikipedia terms and our dataset tokens. Interestingly, the lemmatization introduces more mistakes than it solves: this is due to the fact that the lemmatizer unpredictably converts misspellings into legitimate words that do not necessarily correspond to their correct spelling. This also shows that lemmatization, which is acknowledged as a task almost completely solved from a scientific point of view, still poses relevant issues for the medical jargon and, more generally, for domain-specific languages.
A lot of abbreviations are not yet covered by the dictionary. Once again, these abbreviations are dataset-specific (and perhaps also follow local uses rather than widely accepted practices), and thus they are very hard to find even in specialized public medical resources. For instance, incid (incidente, accident) appears very frequently and is easily understandable by humans, but it is neither a common nor a medical abbreviation. The same phenomenon can also be observed for acronyms, which are less sparse and more adherent to widely accepted practices and standards.

Conclusions and Future Work
In this work we tackled the issue of detecting textual noise in Italian emergency room reports, focusing specifically on misspellings. First, we examined the reports and found that the sorts of issues reported in the literature for other languages can also be found in Italian text documents. Second, we developed and evaluated an Italian dictionary suited for the task of noise detection. In future work we plan to expand the dictionary by including the terms from the Italian ICD-9 and ICD-10 (International Classification of Diseases), which may be useful to interpret acronyms and resolve abbreviations. Moreover, we plan to employ this dictionary in a fully fledged spell-checking system. Finally, the usage of semantic, sense-indexed representations such as (Mensa et al., 2018) and (Colla et al., 2020a; Colla et al., 2020b) will be explored in order to deal with real-word mistakes; more generally, contextual information (Basile et al., 2019) will be considered as a main cue to uncover and correct this sort of error. For example, by leveraging the terminology surrounding noisy tokens we plan to distinguish the more scattered misspellings from the other terms that are not present in our dictionary.