Modern techniques of web scraping for data scientists
Journal Title: Revista Romana de Interactiune Om-Calculator - Year 2018, Vol 11, Issue 1
Abstract
Since the emergence of the World Wide Web, a vast amount of information has become easily available through a web browser. Harvesting this data manually for scientific purposes is not feasible, and automating the task has evolved into a distinct field, Web Scraping. In the beginning, data could be collected automatically in a structured format with any programming language able to process a block of text, namely the HTML response to an HTTP request. With the latest evolution of web pages, more complex techniques are needed to achieve this goal. This article identifies problems a data scientist may encounter when trying to harvest web data, describes modern procedures and tools for web scraping, and presents a case study on collecting data from the website of Bucharest's Public Transportation Authority for use in a geo-processing analysis. The paper is addressed to data scientists with little or no prior experience in automatically collecting data from the web, and it presents an approach that does not require extensive knowledge of Internet protocols and programming technologies, so that results can be achieved rapidly for a wide variety of web data sources.
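To illustrate the early approach the abstract mentions, in which the HTML response is treated as a block of text to be processed, the following is a minimal Python sketch using only the standard library's `html.parser` module. It is not the method or tooling used in the paper. The names `TableCellParser`, `extract_rows` and `SAMPLE_HTML` are hypothetical, and the sample markup is invented for illustration; in practice the string would be the body of an HTTP response.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # row currently being built, or None
        self._in_cell = False   # True while inside a <td> element
        self._cell = []         # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a table cell.
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical static markup standing in for an HTTP response body.
SAMPLE_HTML = """
<table>
  <tr><td>Line 1</td><td>Stop A</td></tr>
  <tr><td>Line 2</td><td>Stop B</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
```

A sketch like this only works when the data is present in the initial HTML response. Pages that build their content in the browser after loading are the case for which the abstract says more complex techniques are needed.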
Authors and Affiliations
Mihai Gheorghe, Florin Cristian Mihai, Marian Dardala
Questionnaire Analysis for Improvement of Student’s Interaction in Tesys e-Learning Platform
This paper presents an extension of a previously conducted study that aims to reveal usability problems of specific functionalities. The current study also uses Tesys e-Learning platform as our study environment, therefo...
Text-to-Speech Synthesys for Romanian Language
This paper aims to challenge the problem of finding accurate and relevant search algorithms in order to obtain the best audio output in terms of intelligibility and naturalness, the usually employed measures to describe...
Controlling the applications running on a windows system by means of android devices
This article presents an application that the authors have developed for the Android platform, which allows a user to remotely control the applications on a computer running the Microsoft Windows operating system. Ther...
2D graphical interaction in elearning
Sketching is often used by people to express ideas. Some concepts that are hard to explain in words can be easily expressed using a figure or drawing. As the pen-based user interfaces became common, many systems that use...
Multimedia Content Consumption in the Context of Digital TV
The switchover from analogue to digital TV is ending in this period. At the same time, the pace of diversification is accelerating for multimedia resources, facilities offered by the receiving devices and communication technologies trans...