Automation of the construction of the terminological core of ontology in computer linguistics based on a corpus of texts

Article's language: Russian

Author(s)

Abstract

The paper proposes an approach to automatically constructing the terminological core of an ontology for computational linguistics. It considers the creation of a top-level ontology that defines the possible classes of terms for their subsequent search and systematization. An algorithm for generating and initially populating a subject dictionary is proposed; it consists of two main stages. In the first stage, a system of lexical-semantic classes is built from the ontology classes. In the second stage, the dictionary is filled with terms, which are mapped to dictionary classes using available resources: a universal ontology of scientific knowledge, a thesaurus, and a portal on computational linguistics. For the experiments, a corpus of analytical articles on computational linguistics was collected from the Habr website, and term-annotated datasets comprising 1065 sentences in Russian were created. The experiments addressed two tasks: term detection and classification of terms by ontology class. For the first task, three neural network models were considered: xlm-roberta-base, roberta-base-russian-v0 and ruRoberta-large. The last model gave the best result, an F-measure of 0.91. An analysis of the classifier's errors showed that incomplete selection of a term was a frequent error. For the second task, ruRoberta-large was chosen on the strength of its results on the first task; the average F-measure over the 12 ontology classes used was 0.89. Finally, a general architecture of a system for creating and populating ontologies is proposed, integrating linguistic approaches and machine learning methods.
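The abstract reports F-measure values but does not say how they were computed. As a minimal sketch only, assuming exact-match scoring over term spans (an assumption, not the authors' stated protocol), the following Python example shows how a partially selected term can lower the score. The function name `span_f_measure` and the sample spans are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# A term span as (start_token, end_token_exclusive); hypothetical representation.
Span = Tuple[int, int]


def span_f_measure(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match span scoring.

    Assumption: a predicted term counts as correct only if both of its
    boundaries match a gold term exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: the third gold term covers tokens 8..10,
# but the prediction stops one token early (incomplete selection).
gold = {(0, 2), (5, 6), (8, 11)}
predicted = {(0, 2), (5, 6), (8, 10)}

precision, recall, f1 = span_f_measure(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this assumed scoring, the truncated span counts as both a false positive and a false negative, which is one way incomplete term selection could weigh on the reported metric. A token-level or partial-overlap scheme would penalize the same error less.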

DOI 10.31144/SI.2307-6410.2023.N23.P13-32

UDK 81’33

Issue # 23, 2023

Pages 13-32

File 2023ovchinnikova_ivanov_sidorova.pdf (566.89 KB)

Bibliographic reference

Ovchinnikova, K.; Ivanov, A.; Sidorova, E. Automation of the construction of the terminological core of ontology in computer linguistics based on a corpus of texts. System Informatics 2023, 23, 13-32. https://doi.org/10.31144/SI.2307-6410.2023.N23.P13-32.