Development of information technology of term extraction from documents in natural language

Abstract

Domain dictionaries are widely used at various stages of the design and operation of software products. Developing a dictionary, and term extraction in particular, is very labor-intensive and requires a highly qualified expert. The study identifies the most important characteristics of multi-word terms (MWT): the probability that a document contains terms with a given number of words, the arrangement of nouns within an MWT, and the possible number of nouns in an MWT. The context in which terms are used is analyzed, and possible term boundaries in the text are identified. A procedure for preliminary document grouping is proposed, which avoids the “loss” of terms that occur in short documents. The dependence of term extraction errors on the size of the analyzed document is determined.

A mathematical model of term representation is proposed, based on defining the set of word chains grouped around a head word, which is a noun. Chains are filtered according to their frequency of occurrence in the text, based on a comparison of normalized representations of MWT.

Mechanisms are developed for adding new records to the domain dictionary and adjusting existing ones while the input document is analyzed. A solution is proposed for adjusting the frequency of occurrence of terms based on the identification of inter-phrase relations. All processes and models are combined into a single information technology for constructing the domain dictionary. The problem of term interpretation is not considered in this paper, since it requires a separate solution. A software product that substantially automates term extraction from text documents is developed.
Testing of the proposed solutions showed no “lost terms” and, as a result, a reduction of 1.5 hours in the time needed to extract terms from texts of 10,000 words, because the expert no longer has to analyze the original document. The research results can be used at various stages of the design and operation of software products.
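
The abstract describes terms as word chains grouped around a noun head word and filtered by how often their normalized forms occur. The following Python sketch illustrates that general idea only; it is not the authors' model. It assumes the text is already tagged with a simplified, hypothetical tag set ("N" noun, "A" adjective, "O" other), that the head noun closes the chain (as is common in English), and that a crude plural-stripping rule is enough for normalization. The names `normalize`, `candidate_chains`, `extract_terms`, and the parameters `max_len` and `min_freq` are all illustrative.

```python
from collections import Counter


def normalize(word):
    """Crude normalization (assumption): lowercase and strip a plural 's'."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def candidate_chains(sentence, max_len=4):
    """Yield normalized word chains that end in a noun head word.

    `sentence` is a list of (word, tag) pairs using the hypothetical tags
    "N", "A", "O". Any word tagged "O" is treated as a term boundary.
    """
    run = []
    for word, tag in sentence:
        if tag in ("A", "N"):
            run.append(normalize(word))
            if tag == "N":
                # Every suffix of the current run that ends at this noun
                # is a candidate term of 1..max_len words.
                for k in range(1, min(max_len, len(run)) + 1):
                    yield tuple(run[-k:])
        else:
            run = []  # boundary: start a new chain


def extract_terms(sentences, min_freq=2):
    """Count candidate chains and keep those seen at least `min_freq` times."""
    counts = Counter(
        chain for sentence in sentences for chain in candidate_chains(sentence)
    )
    return {" ".join(c): n for c, n in counts.items() if n >= min_freq}


# Hand-tagged toy input (illustrative data, not from the paper).
sentences = [
    [("The", "O"), ("domain", "N"), ("dictionary", "N"),
     ("stores", "O"), ("terms", "N")],
    [("Each", "O"), ("domain", "N"), ("dictionary", "N"),
     ("contains", "O"), ("new", "A"), ("terms", "N")],
]
terms = extract_terms(sentences)
print(terms)
```

In this toy input, "domain dictionary" occurs twice and passes the frequency filter, while "new term" occurs once and is dropped. A real system would need a proper part-of-speech tagger and lemmatizer in place of the hand-tagged input and the plural-stripping rule.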

Authors and Affiliations

Oleksii Kungurtsev, Svetlana Zinovatnaya, Iana Potochniak, Maxim Kutasevych

  • EP ID EP528253
  • DOI 10.15587/1729-4061.2018.147978

How To Cite

Oleksii Kungurtsev, Svetlana Zinovatnaya, Iana Potochniak, Maxim Kutasevych (2018). Development of information technology of term extraction from documents in natural language. Eastern-European Journal of Enterprise Technologies (Восточно-Европейский журнал передовых технологий), 6(2), 44-51. https://europub.co.uk/articles/-A-528253