Modern techniques of web scraping for data scientists

Journal Title: Revista Romana de Interactiune Om-Calculator - Year 2018, Vol 11, Issue 1

Abstract

Since the emergence of the World Wide Web, an outstanding amount of information has become easily available through a web browser. Harvesting this data manually for scientific purposes is not feasible, and automating it has evolved into a distinct field, web scraping. In the early days, data could be collected automatically in a structured format with any programming language able to process a block of text, namely the HTML response to an HTTP request. With the latest evolution of web pages, more complex techniques are needed to achieve this goal. This article identifies problems a data scientist may encounter when trying to harvest web data, describes modern procedures and tools for web scraping, and presents a case study on collecting data from the website of Bucharest's Public Transportation Authority for use in a geo-processing analysis. The paper is addressed to data scientists with little or no prior experience in automatically collecting data from the web, and presents the subject in a way that does not require extensive knowledge of Internet protocols and programming technologies, thereby achieving rapid results for a wide variety of web data sources.
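
The early approach mentioned in the abstract, treating an HTML response as a block of text and pulling structured records out of it, can be illustrated with a minimal sketch. It is not taken from the paper: the class name LinkCollector, the helper extract_links, and the sample markup are hypothetical, and the HTML is a hard-coded string standing in for the body of an HTTP response. Only Python's standard html.parser module is used.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs from anchor tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) records
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return the list of (href, text) pairs found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup standing in for a static page's HTTP response body.
SAMPLE_HTML = """
<ul>
  <li><a href="/lines/1">Line 1</a></li>
  <li><a href="/lines/2">Line <b>2</b></a></li>
</ul>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

A sketch like this only works when the data is present in the HTML the server returns. Pages that build their content in the browser with scripts are the kind of case for which the abstract says more complex techniques are needed.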

Authors and Affiliations

Mihai Gheorghe, Florin Cristian Mihai, Marian Dardala

Related Articles

Modelling the reuse of resources in the medical field through social networks and the semantic web

The development of educational materials has become one of the main concerns of specialists in many fields, including medicine. The expansion of e-Learning applications requires the creation of resources that...

Criminal detection – An application for identifying offenders

This paper aims to help those who want to quickly identify a possible offender, based on the information available at a given moment. The system built uses as input info...

Usability evaluation of a learning scenario for Biology implemented onto an augmented reality platform

The combination between real and virtual in the augmented reality systems requires suitable interaction techniques that need to be tested with users in order to avoid usability problems. Formative evaluation aims at find...

Interaction Techniques for Satellite Image Processing on the Grid

The current paper presents the Human-Computer interaction techniques, which follow from executing the business part of an application on the Grid. It presents the way in which such a software deals with data management,...

Combining Visual and Textual Attention in Neural Models for Enhanced Visual Question Answering

While visual information is essential for humans as it models our environment, language is our main method of communication and reasoning. Moreover, these two human capabilities interact in complex ways, therefore proble...

  • EP ID EP673501

How To Cite

Mihai Gheorghe, Florin Cristian Mihai, Marian Dardala (2018). Modern techniques of web scraping for data scientists. Revista Romana de Interactiune Om-Calculator, 11(1), 63-75. https://europub.co.uk/articles/-A-673501