Main Article Content

Authors

The present research study has used two state-of-the-art Spanish taggers with the primary goal of automatically tagging for POS a strictly assembled collection of unstructured text aimed at assisting a number of linguistic tasks, the subregional Mexican Corpus del Habla de Baja California (CHBC). These taggers, a Maximum-Entropy-based one and another one that adds to this statistical construct distributional similarity features, have recently been released but were missing an accuracy rate. Therefore, the second goal of this article is to evaluate and provide attested accuracy figures for the language models behind these taggers. In order to achieve these two goals, this article has proposed a novel, reduced tag set, which has also been proven useful for the goals here pursued. On a sample of almost 11,000 words and more than 12,500 tags for two genres (written text and transcribed oral speech), the Maximum Entropy tagger and the tagger with Maximum Entropy plus distributional similarity features have achieved results of 97.2% and 97.4%, respectively. By comparing these figures to a human ceiling or gold standard of 97.1%, also attested here, it is clear that the results of both taggers are competitive even when applied to an external data collection for which they have not been previously trained or tuned for. This is particularly important because under these kinds of experimental conditions taggers performance has been shown to deteriorate.

1.
Rico-Sulayes A, Saldívar-Arreola R, Rábago-Tánori Álvaro. Part-of-speech tagging with maximum entropy and distributional similarity features in a subregional corpus of Spanish. inycomp [Internet]. 2017 Jul. 1 [cited 2024 Nov. 22];19(2):53-65. Available from: https://revistaingenieria.univalle.edu.co/index.php/ingenieria_y_competitividad/article/view/5293