Modern techniques of web scraping for data scientists

Journal Title: Revista Romana de Interactiune Om-Calculator - Year 2018, Vol 11, Issue 1

Abstract

Since the emergence of the World Wide Web, a vast amount of information has become easily accessible through a web browser. Harvesting this data manually for scientific purposes is not feasible, and automating it has evolved into a distinct field, web scraping. In the beginning, collecting data automatically in a structured format was within reach of any programming language able to process a block of text, namely the HTML response to an HTTP request; with the latest evolution of web pages, more complex techniques are needed to achieve this goal. This article identifies problems a data scientist may encounter when trying to harvest web data, describes modern procedures and tools for web scraping, and presents a case study on collecting data from the website of Bucharest's Public Transportation Authority for use in a geo-processing analysis. The paper is addressed to data scientists with little or no prior experience in automatically collecting data from the web, and presents an approach that does not require extensive knowledge of Internet protocols and programming technologies, thereby achieving rapid results for a wide variety of web data sources.
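The early approach mentioned in the abstract, treating the HTML response as a block of text and pulling structured records out of it, can be illustrated with a minimal sketch. It is not taken from the paper: the sample markup, the stop names, and the names `TableRowParser` and `extract_rows` are hypothetical, and the HTML is hard-coded rather than fetched so that the example runs without network access. It uses only Python's standard `html.parser` module and would not cope with pages whose content is rendered by JavaScript, which is the situation the abstract says requires more complex techniques.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr" and self._row is not None:
            if self._row:       # skip rows with no <td> cells (e.g. headers)
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


# Hypothetical static markup standing in for the body of an HTTP response.
SAMPLE_HTML = """
<table>
  <tr><th>Line</th><th>Stop</th></tr>
  <tr><td>1</td><td>Example Stop A</td></tr>
  <tr><td>2</td><td>Example Stop B</td></tr>
</table>
"""


def extract_rows(html):
    """Return the table's data rows as lists of strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

In a real scraper the string passed to `extract_rows` would come from an HTTP client rather than a constant; the parsing step itself would stay the same as long as the server returns the data directly in the HTML.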

Authors and Affiliations

Mihai Gheorghe, Florin Cristian Mihai, Marian Dardala

How To Cite

Mihai Gheorghe, Florin Cristian Mihai, Marian Dardala (2018). Modern techniques of web scraping for data scientists. Revista Romana de Interactiune Om-Calculator, 11(1), 63-75. https://europub.co.uk/articles/-A-673501