Automating the Shaping of Metadata Extracted from a Company Website with Open Source Tools

Abstract

As part of a market analysis process, the objective was to automate the identification of the activities and skills of a collection of enterprises, namely Belgian and French open source companies. To avoid manual annotation based on visual inspection of website content, a tool chain was developed to collect the content of websites and extract the important terms. Standard software libraries were identified to clean up HTML documents and to perform the part-of-speech tagging used for terminology extraction. This procedure is supplemented by the extraction and recognition of named entities. The terms extracted from the HTML pages of a company website were then merged and filtered, and a circular tag cloud was generated. This presentation facilitates the identification of important terms, which commonly correspond to the activities and technologies supported by the company. Several changes are planned for this prototype, in particular the extension to French-language texts, the association of extracted terms with the vocabulary of a classification scheme, and the automatic generation of dashboards to facilitate monitoring the evolution of the industrial sector.
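
The tool chain described in the abstract (clean the HTML, extract terms per page, then merge and filter across the site) can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: it uses only the Python standard library, and it replaces the part-of-speech tagging step with a simple stopword filter, which is a much cruder technique. The names `TextExtractor`, `html_to_text`, `extract_terms`, `merge_site_terms`, the `STOPWORDS` list and the `min_count` threshold are all hypothetical and chosen for this example only. Named entity recognition and the tag cloud rendering are left out.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical, deliberately tiny stopword list (assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "for", "to",
             "in", "on", "we", "our", "with", "is", "are"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Step 1: clean up an HTML document into plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def extract_terms(text):
    """Step 2 (stand-in for POS-based extraction): words minus stopwords."""
    words = re.findall(r"[a-zA-Z][a-zA-Z+#-]*", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def merge_site_terms(pages, min_count=2):
    """Step 3: merge term counts over all pages and filter out rare terms."""
    totals = Counter()
    for html in pages:
        totals.update(extract_terms(html_to_text(html)))
    return {term: n for term, n in totals.items() if n >= min_count}
```

The resulting term-to-count mapping is the kind of input a tag cloud generator would take, with the count driving the display size of each term. A real pipeline would swap `extract_terms` for a part-of-speech tagger so that multi-word noun phrases can be kept as single terms.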

Authors and Affiliations

Dr Ir VISEUR

  • EP ID EP157554
  • DOI 10.14569/SpecialIssue.2014.040105

How To Cite

Dr Ir VISEUR (2014). Automating the Shaping of Metadata Extracted from a Company Website with Open Source Tools. International Journal of Advanced Computer Science & Applications, 4(1), 30-34. https://europub.co.uk/articles/-A-157554