Modern techniques of web scraping for data scientists
Journal Title: Revista Romana de Interactiune Om-Calculator - Year 2018, Vol 11, Issue 1
Abstract
Since the emergence of the World Wide Web, a vast amount of information has become easily available through a web browser. Harvesting this data manually for scientific purposes is not feasible, and automating the task has evolved into a distinct field, Web Scraping. In the beginning, data could be collected automatically in a structured format with any programming language able to process a block of text, namely the HTML response to an HTTP request. As web pages have evolved, more complex techniques are needed to achieve this goal. This article identifies problems a data scientist may encounter when trying to harvest web data, describes modern procedures and tools for web scraping, and presents a case study on collecting data from the website of Bucharest's Public Transportation Authority for use in a geo-processing analysis. The paper is addressed to data scientists with little or no prior experience in automatically collecting data from the web. It presents an approach that does not require extensive knowledge of Internet protocols and programming technologies, thereby achieving rapid results for a wide variety of web data sources.
Authors and Affiliations
Mihai Gheorghe, Florin Cristian Mihai, Marian Dardala
Goal-oriented conversational agents – a proposed approach for practical domains
Conversational agents are nowadays desired in many fields, mainly for task automation, but also for entertainment, therapy or other purposes. This report will introduce the main Dialogue Systems categories, along with som...
Kinesthetic Learning – Haptic User Interfaces for Gyroscopic Precession Simulation
Some forces in nature are difficult to comprehend due to their non-intuitive and abstract nature. Forces driving gyroscopic precession are invisible, yet their effect is very important in a variety of applications, from...
The modeling, analysis and classification of conversations in collaborative environments
The classification of conversations in collaborative environments is needed for a better understanding of the subjects discussed. Ontologies represent an efficient and representative method of conceptualising a domain. Start...
SITAC – Innovative Computerized Adaptive Testing System
Computerized adaptive testing is an approach to differential assessment that adapts the questions asked to the candidate’s ability level. Thus, the computer selects and displays the questions, then reco...
Paronym Generation Algorithms for Malapropism Correction
Web pages have lately been used intensively for automatic or semiautomatic extraction of useful information. Because of the open nature of the Web, texts without spelling errors are very rare exceptions. One...