Development of information technology of term extraction from documents in natural language
Journal Title: Восточно-Европейский журнал передовых технологий - Year 2018, Vol 6, Issue 2
Abstract
Domain dictionaries are shown to be widely used at various stages of the design and operation of software products. Developing such a dictionary, and term extraction in particular, is very labor-intensive and requires a highly qualified expert. Studies were conducted to identify the most important characteristics of multi-word terms (MWT): the probability that terms containing different numbers of words occur in a document, the arrangement of nouns within an MWT, and the possible number of nouns in an MWT. The context in which terms are used was analyzed, and the possible boundaries of terms in the text were identified. A procedure for preliminary grouping of documents is proposed, which avoids the “loss” of terms that occur in short documents. The dependence of term extraction errors on the size of the analyzed document was determined.
A mathematical model of term representation is proposed, based on defining a set of word chains grouped around a head-word, which is a noun. Chains are filtered according to their frequency of occurrence in the text, based on a comparison of the normalized representations of MWT.
Mechanisms were developed for adding new records to the domain dictionary and adjusting existing ones while the input document is analyzed. A method is proposed for adjusting the frequency of occurrence of terms based on the identification of inter-phrase relations. All processes and models are combined into a single information technology for constructing the domain dictionary. The problem of term interpretation is not considered in this paper, since it requires a separate solution. A software product was developed that substantially automates the extraction of terms from text documents.
Testing of the proposed solutions showed no “lost terms” and, as a result, a reduction of 1.5 hours in the time needed to extract terms from texts of 10,000 words, because the expert no longer has to analyze the original document. The research results can be used at various stages of the design and operation of software products.
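The chain-based idea described in the abstract (word chains grouped around a noun head-word, normalized and then filtered by frequency) can be illustrated with a minimal Python sketch. This is not the authors' model: the input format (tokens already paired with part-of-speech tags), the tag names `ADJ` and `NOUN`, the crude `normalize` rule, the assumption that the head-word is the last noun of a chain, and the parameters `min_freq` and `max_len` are all hypothetical choices made for illustration.

```python
from collections import Counter

# Assumption: only adjectives and nouns may appear inside a candidate chain;
# any other tag ends the current chain.
CHAIN_TAGS = {"ADJ", "NOUN"}


def normalize(word):
    """Crude normalization (an assumption, not a real lemmatizer)."""
    w = word.lower()
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"          # "dictionaries" -> "dictionary"
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                # "terms" -> "term"
    return w


def candidate_chains(tagged, max_len=4):
    """Yield normalized word chains that end in a noun (assumed head-word)."""
    run = []
    for word, tag in tagged:
        if tag not in CHAIN_TAGS:
            run = []                 # chain boundary reached
            continue
        run.append(normalize(word))
        if tag == "NOUN":
            # Every suffix of the current run that ends at this noun
            # is a candidate chain of length 1..max_len.
            for n in range(1, min(max_len, len(run)) + 1):
                yield tuple(run[-n:])


def extract_terms(tagged, min_freq=2, max_len=4):
    """Keep only chains whose normalized form occurs at least min_freq times."""
    counts = Counter(candidate_chains(tagged, max_len))
    return {chain: c for chain, c in counts.items() if c >= min_freq}
```

Calling `extract_terms` on a list of `(word, tag)` pairs returns a dictionary that maps each retained chain (a tuple of normalized words) to its count. In practice the tags would come from a part-of-speech tagger, and the frequency threshold would depend on document size, which the abstract identifies as a factor in extraction errors.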
Authors and Affiliations
Oleksii Kungurtsev, Svetlana Zinovatnaya, Iana Potochniak, Maxim Kutasevych