Development of information technology of term extraction from documents in natural language
Journal Title: Восточно-Европейский журнал передовых технологий - Year 2018, Vol 6, Issue 2
Abstract
It is shown that domain dictionaries are widely used at various stages of the design and operation of software products. Dictionary development, and term extraction in particular, is very labor-intensive and requires a highly qualified expert. Studies were conducted to identify the most important characteristics of multi-word terms (MWT): the probability that a document contains terms with different numbers of words, the arrangement of nouns within an MWT, and the possible number of nouns in an MWT. The context in which terms are used is analyzed, and the possible boundaries of terms in the text are identified. A procedure for preliminary grouping of documents is proposed, which avoids the “loss” of terms that occur in short documents. The dependence of term extraction errors on the size of the analyzed document is determined.
A mathematical model of term representation is proposed, based on defining the set of word chains grouped around a head-word, which is a noun. Chains are filtered according to their frequency of occurrence in the text, based on a comparison of the normalized representations of MWT.
Mechanisms are developed for filling the domain dictionary with new records and adjusting existing ones while the input document is analyzed. A solution is proposed for adjusting the frequency of occurrence of terms based on the identification of inter-phrase relations. All processes and models are combined into a single information technology for constructing the domain dictionary. The problem of term interpretation is not considered in this paper, since it requires a separate solution. A software product that substantially automates term extraction from text documents has been developed.
Testing of the proposed solutions showed no “lost” terms and, as a result, a reduction of 1.5 hours in the time needed to extract terms from texts of 10,000 words, because the expert no longer has to analyze the original document. The research results can be used at various stages of the design and operation of software products.
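The abstract gives no formulas or code, so the following is only a minimal sketch of the general idea of collecting word chains around a head noun and filtering them by frequency. It is not the authors' model. The noun lexicon `NOUNS`, the boundary word list `STOP`, the crude `normalize` rule, the choice to let chains end on the head noun, and the `min_freq` threshold are all assumptions made for illustration; a real system would use a part-of-speech tagger and a proper lemmatizer.

```python
import re
from collections import Counter

# Hypothetical toy noun lexicon; a real system would use a POS tagger.
NOUNS = {"dictionary", "domain", "term", "extraction", "document"}
# Hypothetical boundary words that are assumed never to belong to a term.
STOP = {"the", "a", "an", "of", "is", "are", "in", "and", "for", "from", "to", "this"}


def normalize(word):
    """Crude normalization (an assumption): lowercase, strip simple plurals."""
    w = word.lower()
    if w in STOP:
        return w
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def candidate_chains(text, max_len=3):
    """Yield normalized word chains of up to max_len words ending on a noun."""
    # Punctuation splits the text into fragments that a chain cannot cross.
    for fragment in re.split(r"[.,;:!?()]", text):
        words = [normalize(w) for w in re.findall(r"[A-Za-z]+", fragment)]
        for i, word in enumerate(words):
            if word not in NOUNS:
                continue
            for n in range(1, max_len + 1):
                start = i - n + 1
                if start < 0:
                    break
                chain = words[start:i + 1]
                # A boundary word ends the chain; longer chains would contain it too.
                if any(c in STOP for c in chain):
                    break
                yield " ".join(chain)


def extract_terms(text, min_freq=2, max_len=3):
    """Keep only chains whose normalized form occurs at least min_freq times."""
    counts = Counter(candidate_chains(text, max_len))
    return {chain: k for chain, k in counts.items() if k >= min_freq}


if __name__ == "__main__":
    sample = (
        "Domain dictionaries are useful. The domain dictionary stores terms. "
        "Term extraction fills the domain dictionary."
    )
    for chain, freq in sorted(extract_terms(sample).items()):
        print(freq, chain)
```

Because the sketch has no part-of-speech information, chains such as "store term" can still be generated; in this toy example they are removed only by the frequency filter.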
Authors and Affiliations
Oleksii Kungurtsev, Svetlana Zinovatnaya, Iana Potochniak, Maxim Kutasevych
Theoretical research into spatial work of a steel-reinforced-concrete statically indeterminate combined structure
The constructed mathematical model and the developed algorithm for spatial calculation of combined steel-reinforced-concrete truss systems make it possible to determine parameters of the stressed-strained state in the...
Determining the regions for efficient use of electrojet low-thrust engines
This work addresses the issues on determining the optimal regions for using propulsion system for spacecraft at low near-Earth orbits. An analysis of spacecraft launches over the past 5 years has been performed. The r...
Influence of grain processing products on the indicators of frozen milk-protein mixtures
This paper reports a study into the influence of manna groats and extruded manna groats on the qualitative and quantitative indicators of milk and protein concentrates over a freezing–defrosting cycle. A sli...
Testing of measurement instrument software on the national level
Features of software for measuring instruments are considered. A comparative analysis of the general requirements in the documents and guidelines of the international and regional organizations of legi...
Substantiation of the expediency to use iodine-enriched soya flour in the production of bread for special dietary consumption
We have studied the possibility of using iodine-enriched soy flour in the process of making bread for people suffering from iodine deficiency, diabetes and celiac disease. The organoleptic, physical-and-chemical, and...